PG Backend Proposal

* PG Backend Proposal
@ 2013-08-01 16:42 Loic Dachary
  2013-08-01 17:14 ` Samuel Just
                   ` (3 more replies)
  0 siblings, 4 replies; 12+ messages in thread
From: Loic Dachary @ 2013-08-01 16:42 UTC
  To: Samuel Just; +Cc: Ceph Development

Hi Sam,

When the acting set changes order, two chunks for the same object may coexist in the same placement group. The key should therefore also contain the chunk number.

That's probably the most sensible comment I have so far. This document is immensely useful (even in its current state) because it shows me your perspective on the implementation. 

I'm puzzled by:

CEPH_OSD_OP_DELETE: The possibility of rolling back a delete requires that we retain the deleted object until all replicas have persisted the deletion event. ErasureCoded backend will therefore need to store objects with the version at which they were created included in the key provided to the filestore. Old versions of an object can be pruned when all replicas have committed up to the log event deleting the object.

because I don't understand why the version would be necessary. I thought that deleting an erasure coded object could be even easier than deleting a replicated object because it cannot be resurrected if enough chunks are lost, and therefore you don't need to wait for an ack from all OSDs in the up set. I'm obviously missing something.

I had failed to understand how important the pg logs are to maintaining the consistency of the PG. For some reason I thought about them only as a lightweight version of the operation logs. Adding a payload to the pg_log_entry ( i.e. APPEND size or attribute ) is a new idea for me and I would never have thought, or dared to think, that the logs could be extended in such a way. Given the recent problems with log writes having a high impact on performance ( I'm referring to what forced you to introduce code that writes only the log entries that have changed instead of the complete logs ), I thought about the pg logs as something immutable.
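
To make the payload idea concrete for myself, here is a minimal sketch in Python. It assumes the payload of an APPEND entry is the object size before the append, so that a rollback is a truncation; LogEntry, ToyShard and rollback_to are hypothetical names for illustration, not the actual pg_log_entry or PGBackend interface.

```python
from dataclasses import dataclass, field


@dataclass
class LogEntry:
    """Hypothetical stand-in for a pg_log_entry carrying a rollback payload."""
    version: int
    op: str          # only "append" is modelled in this sketch
    oid: str
    prior_size: int  # payload: size of the object before the append


@dataclass
class ToyShard:
    objects: dict = field(default_factory=dict)  # oid -> bytearray
    log: list = field(default_factory=list)      # LogEntry items, oldest first

    def append(self, version, oid, data):
        buf = self.objects.setdefault(oid, bytearray())
        # Record the size before modifying the object so the append can be undone.
        self.log.append(LogEntry(version, "append", oid, len(buf)))
        buf.extend(data)

    def rollback_to(self, version):
        """Undo every logged entry newer than `version`, newest first."""
        while self.log and self.log[-1].version > version:
            entry = self.log.pop()
            # Truncate the object back to the size stored in the log entry.
            del self.objects[entry.oid][entry.prior_size:]


shard = ToyShard()
shard.append(1, "foo", b"abcd")
shard.append(2, "foo", b"efgh")
shard.rollback_to(1)  # "foo" is truncated back to its size at version 1
```

The point of the sketch is only that the log entry has to carry enough information to undo the operation locally, without fetching anything from the other OSDs.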

I'm still trying to figure out how PGBackend::perform_write / read / try_rollback would fit in the current backfilling / write / read / scrubbing ... code path. 

https://github.com/athanatos/ceph/blob/ba5c97eda4fe72a25831031a2cffb226fed8d9b7/doc/dev/osd_internals/erasure_coding.rst
https://github.com/athanatos/ceph/blob/ba5c97eda4fe72a25831031a2cffb226fed8d9b7/src/osd/PGBackend.h

Cheers

-- 
Loïc Dachary, Artisan Logiciel Libre
All that is necessary for the triumph of evil is that good people do nothing.

* Re: PG Backend Proposal
  2013-08-01 16:42 PG Backend Proposal Loic Dachary
@ 2013-08-01 17:14 ` Samuel Just
  2013-08-01 17:14 ` Loic Dachary
                   ` (2 subsequent siblings)
  3 siblings, 0 replies; 12+ messages in thread
From: Samuel Just @ 2013-08-01 17:14 UTC
  To: Loic Dachary; +Cc: Ceph Development

DELETE can always be rolled forward, but there may be other operations
in the log that can't be (like an append), so we need to be able to
roll it back (I think).  perform_write, read, and try_rollback probably
don't matter to backfill or scrubbing.  You are correct, we need to
include the chunk number in the object as well!
-Sam

--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

* Re: PG Backend Proposal
  2013-08-01 16:42 PG Backend Proposal Loic Dachary
  2013-08-01 17:14 ` Samuel Just
@ 2013-08-01 17:14 ` Loic Dachary
  2013-08-01 23:54   ` Loic Dachary
  2013-08-05 10:29 ` Loic Dachary
  2013-08-05 14:18 ` Loic Dachary
  3 siblings, 1 reply; 12+ messages in thread
From: Loic Dachary @ 2013-08-01 17:14 UTC
  To: Samuel Just; +Cc: Ceph Development

On 01/08/2013 18:42, Loic Dachary wrote:
> Hi Sam,
> 
> When the acting set changes order two chunks for the same object may co-exist in the same placement group. The key should therefore also contain the chunk number. 
> 
> That's probably the most sensible comment I have so far. This document is immensely useful (even in its current state) because it shows me your perspective on the implementation. 
> 
> I'm puzzled by:

I get it ( thanks to yanzheng ). If the object is deleted and then created again, spurious non-versioned chunks would get in the way.

:-)
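
A small Python sketch of how I picture it, under my own assumptions: the store key is a tuple of object name, creation version and chunk rank, and a deleted chunk is kept until every shard has committed the delete event. chunk_key, prune and the in-memory dict are hypothetical and only stand in for the filestore.

```python
def chunk_key(oid, created_version, chunk_rank):
    """Hypothetical store key: object name, creation version, chunk rank."""
    return (oid, created_version, chunk_rank)


store = {}

# Object "foo" is created at version 10; this shard holds chunk rank 1.
store[chunk_key("foo", 10, 1)] = b"old chunk"

# "foo" is deleted at version 11. The old chunk is retained, not erased,
# because the delete might still have to be rolled back.
pending_prune = [(11, chunk_key("foo", 10, 1))]  # (delete version, key)

# "foo" is created again at version 12: a different key, so no collision
# with the retained chunk of the previous incarnation.
store[chunk_key("foo", 12, 1)] = b"new chunk"


def prune(store, pending, committed_by_all):
    """Drop retained chunks whose delete event every shard has committed.

    Returns the entries that still have to be retained.
    """
    remaining = []
    for delete_version, key in pending:
        if delete_version <= committed_by_all:
            store.pop(key, None)
        else:
            remaining.append((delete_version, key))
    return remaining
```

Calling prune(store, pending_prune, committed_by_all=10) keeps both chunks, while committed_by_all=11 removes only the version 10 chunk and leaves the recreated object untouched.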

> 
> CEPH_OSD_OP_DELETE: The possibility of rolling back a delete requires that we retain the deleted object until all replicas have persisted the deletion event. ErasureCoded backend will therefore need to store objects with the version at which they were created included in the key provided to the filestore. Old versions of an object can be pruned when all replicas have committed up to the log event deleting the object.
> 
> because I don't understand why the version would be necessary. I thought that deleting an erasure coded object could be even easier than erasing a replicated object because it cannot be resurrected if enough chunks are lots, therefore you don't need to wait for ack from all OSDs in the up set. I'm obviously missing something.

-- 
Loïc Dachary, Artisan Logiciel Libre
All that is necessary for the triumph of evil is that good people do nothing.


* Re: PG Backend Proposal
  2013-08-01 17:14 ` Loic Dachary
@ 2013-08-01 23:54   ` Loic Dachary
  2013-08-02  1:34     ` Samuel Just
  0 siblings, 1 reply; 12+ messages in thread
From: Loic Dachary @ 2013-08-01 23:54 UTC
  To: Samuel Just; +Cc: Ceph Development

Hi Sam,

I'm under the impression that
https://github.com/athanatos/ceph/blob/wip-erasure-coding-doc/doc/dev/osd_internals/erasure_coding.rst#distinguished-acting-set-positions
assumes acting[0] stores all chunk[0], acting[1] stores all chunk[1] etc.

The chunk rank does not need to match the OSD position in the acting set. As long as each object chunk is stored with its rank in an attribute, changing the order of the acting set does not require moving the chunks around.

With M=2, K=1 and an acting set of [0,1,2], chunks M0, M1, K0 are written to OSDs [0,1,2] respectively, and each of them has the 'erasure_code_rank' attribute set to its rank.

If the acting set changes to [2,1,0], the read would reorder the chunks based on their 'erasure_code_rank' attribute instead of the rank of the OSD they originate from in the current acting set. It would then be able to decode them with the erasure code library, which requires that the chunks are provided in a specific order.
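
The reordering step could look roughly like this in Python. It is a sketch only: the reply tuples and order_for_decode are hypothetical, and the only thing taken from the proposal is the 'erasure_code_rank' attribute name.

```python
def order_for_decode(replies):
    """Sort chunks by their stored rank rather than by acting-set position.

    `replies` is a list of (osd_id, attrs, payload) tuples in the order the
    OSDs appear in the current acting set.
    """
    ranked = sorted(replies, key=lambda reply: reply[1]["erasure_code_rank"])
    return [payload for _osd, _attrs, payload in ranked]


# Chunks M0, M1, K0 were written when the acting set was [0, 1, 2].
# The acting set is now [2, 1, 0], so replies arrive in that order.
replies = [
    (2, {"erasure_code_rank": 2}, b"K0"),
    (1, {"erasure_code_rank": 1}, b"M1"),
    (0, {"erasure_code_rank": 0}, b"M0"),
]
chunks = order_for_decode(replies)  # back in M0, M1, K0 order
```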

When doing a full write, the chunks are written in the same order as the acting set. This implies that the order of the chunks of the previous version of the object may be different but I don't see a problem with that.

When doing an append, the primary must first retrieve the order in which the chunks are stored by reading their 'erasure_code_rank' attribute, because the order of the acting set is not the same as the order of the chunks. It then maps the new chunks to the OSDs matching their rank and pushes them to those OSDs.

The only downside is that it may make things more complicated to implement optimizations based on the fact that, sometimes, chunks can just be concatenated to recover the content of the object and don't need to be decoded ( when using systematic codes and the M data chunks are available ).
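
To illustrate the concatenation shortcut, here is a toy systematic code in Python with M=2 data chunks and K=1 XOR parity chunk. It is not the erasure code library: encode, read and the XOR parity are assumptions chosen to keep the example short, and padding is not stripped on read.

```python
def _xor(left, right):
    """Byte-wise XOR of two equally sized chunks."""
    return bytes(a ^ b for a, b in zip(left, right))


def encode(data):
    """Toy systematic code: two data chunks and one XOR parity chunk."""
    if len(data) % 2:
        data += b"\0"  # pad to an even length (never stripped in this sketch)
    half = len(data) // 2
    m0, m1 = data[:half], data[half:]
    return {0: m0, 1: m1, 2: _xor(m0, m1)}  # rank -> chunk


def read(chunks):
    """Rebuild the payload from any two of the three chunks (rank -> bytes)."""
    if 0 in chunks and 1 in chunks:
        # Fast path: both data chunks are present, concatenate in rank order.
        return chunks[0] + chunks[1]
    if 0 in chunks:
        # M1 is missing: recover it from M0 and the parity chunk.
        return chunks[0] + _xor(chunks[0], chunks[2])
    # M0 is missing: recover it from M1 and the parity chunk.
    return _xor(chunks[1], chunks[2]) + chunks[1]


chunks = encode(b"abcdefgh")
```

Even the fast path needs the rank of each chunk to concatenate in the right order, which is why it becomes a little more involved when rank and acting set position are decoupled.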

Cheers

-- 
Loïc Dachary, Artisan Logiciel Libre
All that is necessary for the triumph of evil is that good people do nothing.

* Re: PG Backend Proposal
  2013-08-01 23:54   ` Loic Dachary
@ 2013-08-02  1:34     ` Samuel Just
  2013-08-02  3:39       ` Sage Weil
  2013-08-02  7:39       ` Loic Dachary
  0 siblings, 2 replies; 12+ messages in thread
From: Samuel Just @ 2013-08-02  1:34 UTC
  To: Loic Dachary; +Cc: Ceph Development

I think there are some tricky edge cases with the above approach.  You
might end up with two pg replicas in the same acting set which happen,
for reasons of history, to have the same chunk for one or more objects.
That would have to be detected and repaired even though the object
would be missing from neither replica (and might not even be in the pg
log).  The erasure_code_rank would have to be somehow maintained
through recovery (do we remember the original holder of a particular
chunk in case it ever comes back?).

The chunk rank doesn't *need* to match the acting set position, but
there are some good reasons to arrange for that to be the case:
1) Otherwise, we need something else to assign the chunk ranks
2) This way, a new primary can determine which osds hold which
replicas of which chunk rank by looking at past osd maps.

It seems to me that given an OSDMap and an object, we should know
immediately where all chunks should be stored since a future primary
may need to do that without access to the objects themselves.
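
A rough Python sketch of what that buys us, assuming chunk rank equals acting set position.  The list of past acting sets stands in for whatever a primary would derive from past OSD maps, and holders_by_rank is a made-up helper, not existing code.

```python
def holders_by_rank(acting_history):
    """Map each chunk rank to the OSDs that held it, oldest interval first.

    `acting_history` is a list of acting sets, one per past interval.
    Because rank == position, no object has to be read to build the map.
    """
    holders = {}
    for acting in acting_history:
        for rank, osd in enumerate(acting):
            seen = holders.setdefault(rank, [])
            if osd not in seen:
                seen.append(osd)
    return holders


history = [[0, 1, 2], [3, 1, 2], [3, 1, 4]]
holders = holders_by_rank(history)
# rank 0 was held by osd 0 then osd 3, rank 1 only by osd 1,
# rank 2 by osd 2 then osd 4
```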

Importantly, while it may be possible for an acting set transition
like [0,1,2]->[2,1,0] to occur in some pathological case, CRUSH has a
mode which will cause replacement to behave well for erasure codes:

initial: [0,1,2]
0 fails: [3,1,2]
2 fails: [3,1,4]
0 recovers: [0,1,4]
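
The sequence above can be replayed with a few lines of Python.  This is only a sketch of the positional behaviour: the substitute OSDs are given by hand, whereas real CRUSH computes them, and replace_in_place is a hypothetical helper.

```python
def replace_in_place(acting, failed, substitute):
    """Return a new acting set with `failed` swapped for `substitute`.

    Every other OSD keeps its position, and therefore its chunk rank.
    """
    return [substitute if osd == failed else osd for osd in acting]


acting = [0, 1, 2]
steps = [acting]
# (leaving OSD, OSD taking its position); the last pair models osd 0
# recovering and taking its position back from osd 3.
for failed, substitute in [(0, 3), (2, 4), (3, 0)]:
    acting = replace_in_place(acting, failed, substitute)
    steps.append(acting)
```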

We do, however, need to decouple primariness from position in the
acting set so that backfill can work well.
-Sam

* Re: PG Backend Proposal
  2013-08-02  1:34     ` Samuel Just
@ 2013-08-02  3:39       ` Sage Weil
  2013-08-02  7:39       ` Loic Dachary
  1 sibling, 0 replies; 12+ messages in thread
From: Sage Weil @ 2013-08-02  3:39 UTC
  To: Samuel Just; +Cc: Loic Dachary, Ceph Development

On Thu, 1 Aug 2013, Samuel Just wrote:
> I think there are some tricky edge cases with the above approach.  You
> might end up with two pg replicas in the same acting set which happen
> for reasons of history to have the same chunk for one or more objects.
>  That would have to be detected and repaired even though the object
> would be missing from neither replica (and might not even be in the pg
> log).  The erasure_code_rank would have to be somehow maintained
> through recovery (do we remember the original holder of a particular
> chunk in case it ever comes back?).
> 
> The chunk rank doesn't *need* to match the acting set position, but
> there are some good reasons to arrange for that to be the case:
> 1) Otherwise, we need something else to assign the chunk ranks
> 2) This way, a new primary can determine which osds hold which
> replicas of which chunk rank by looking at past osd maps.
> 
> It seems to me that given an OSDMap and an object, we should know
> immediately where all chunks should be stored since a future primary
> may need to do that without access to the objects themselves.
> 
> Importantly, while it may be possible for an acting set transition
> like [0,1,2]->[2,1,0] to occur in some pathological case, CRUSH has a
> mode which will cause replacement to behave well for erasure codes:
> 
> initial: [0,1,2]
> 0 fails: [3,1,2]
> 2 fails: [3,1,4]
> 0 recovers: [0,1,4]
> 
> We do, however, need to decouple primariness from position in the
> acting set so that backfill can work well.

BTW, this reminds me: this might also be a good time to add the ability in
the OSDMap to probabilistically reorder acting sets (and adjust
primariness) for any mapping so that we can cheaply shift read traffic
around (e.g., based on normal statistical skew, or temporary workload
balance).  Currently the only way to do this also involves moving data.

s


* Re: PG Backend Proposal
  2013-08-02  1:34     ` Samuel Just
  2013-08-02  3:39       ` Sage Weil
@ 2013-08-02  7:39       ` Loic Dachary
  2013-08-02 15:10         ` Loic Dachary
  1 sibling, 1 reply; 12+ messages in thread
From: Loic Dachary @ 2013-08-02  7:39 UTC
  To: Samuel Just; +Cc: Ceph Development

Hi Sam,

I think I understand, and I am paraphrasing you to make sure I do. We may save bandwidth because chunks are not moved as much if their position is not tied to the position of the OSD containing them in the acting set. But this is mitigated by the use of the indep crush mode, and it may require handling tricky edge cases. In addition, you think that being able to know which OSD contains which chunk by using only the OSDMap and the (v)hobject_t is going to simplify the design.

For the record:

Back in April Sage suggested that

"- those PGs use the parity ('INDEP') crush mode so that placement is intelligent"

http://permalink.gmane.org/gmane.comp.file-systems.ceph.devel/14579

"The indep placement avoids moving around a shard between ranks, because a mapping of [0,1,2,3,4] will change to [0,6,2,3,4] (or something) if osd.1 fails and the shards on 2,3,4 won't need to be copied around."

http://permalink.gmane.org/gmane.comp.file-systems.ceph.devel/14582

and I assume that's what you refer to when you write "CRUSH has a mode which will cause replacement to behave well for erasure codes:

 initial: [0,1,2]
 0 fails: [3,1,2]
 2 fails: [3,1,4]
 0 recovers: [0,1,4]"
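
To convince myself of the bandwidth argument, a small Python sketch counting the ranks whose holder changes. The positional mapping is the [0,1,2,3,4] to [0,6,2,3,4] example quoted above; the shifted mapping is a hypothetical contrast I made up, in which the survivors move left and the new OSD is appended.

```python
def moved_ranks(before, after):
    """Ranks whose holder changed and whose shard must therefore be copied."""
    return [
        rank
        for rank, (old, new) in enumerate(zip(before, after))
        if old != new
    ]


before = [0, 1, 2, 3, 4]
positional = [0, 6, 2, 3, 4]  # mapping quoted above after osd.1 fails
shifted = [0, 2, 3, 4, 6]     # hypothetical: survivors shift left, 6 appended

copies_positional = moved_ranks(before, positional)  # only rank 1
copies_shifted = moved_ranks(before, shifted)        # ranks 1, 2, 3 and 4
```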

I understand this is implemented here:

https://github.com/ceph/ceph/blob/e7d47827f0333c96ad43d257607fb92ed4176550/src/crush/mapper.c#L523

and will determine the order of the acting set

https://github.com/ceph/ceph/blob/e7d47827f0333c96ad43d257607fb92ed4176550/src/osd/OSDMap.cc#L998

when called by the monitor when creating or updating a PG

https://github.com/ceph/ceph/blob/e7d47827f0333c96ad43d257607fb92ed4176550/src/mon/PGMonitor.cc#L814

Cheers

On 02/08/2013 03:34, Samuel Just wrote:
> I think there are some tricky edge cases with the above approach.  You
> might end up with two pg replicas in the same acting set which happen
> for reasons of history to have the same chunk for one or more objects.
>  That would have to be detected and repaired even though the object
> would be missing from neither replica (and might not even be in the pg
> log).  The erasure_code_rank would have to be somehow maintained
> through recovery (do we remember the original holder of a particular
> chunk in case it ever comes back?).
> 
> The chunk rank doesn't *need* to match the acting set position, but
> there are some good reasons to arrange for that to be the case:
> 1) Otherwise, we need something else to assign the chunk ranks
> 2) This way, a new primary can determine which osds hold which
> replicas of which chunk rank by looking at past osd maps.
> 
> It seems to me that given an OSDMap and an object, we should know
> immediately where all chunks should be stored since a future primary
> may need to do that without access to the objects themselves.
> 
> Importantly, while it may be possible for an acting set transition
> like [0,1,2]->[2,1,0] to occur in some pathological case, CRUSH has a
> mode which will cause replacement to behave well for erasure codes:
> 
> initial: [0,1,2]
> 0 fails: [3,1,2]
> 2 fails: [3,1,4]
> 0 recovers: [0,1,4]
> 
> We do, however, need to decouple primariness from position in the
> acting set so that backfill can work well.
> -Sam
> 
> On Thu, Aug 1, 2013 at 4:54 PM, Loic Dachary <loic@dachary.org> wrote:
>> Hi Sam,
>>
>> I'm under the impression that
>> https://github.com/athanatos/ceph/blob/wip-erasure-coding-doc/doc/dev/osd_internals/erasure_coding.rst#distinguished-acting-set-positions
>> assumes acting[0] stores all chunk[0], acting[1] stores all chunk[1] etc.
>>
>> The chunk rank does not need to match the OSD position in the acting set. As long as each object chunk is stored with its rank in an attribute, changing the order of the acting set does not require moving the chunks around.
>>
>> With M=2, K=1 and an acting set of [0,1,2], chunks M0, M1, K0 are written on [0,1,2] respectively, and each of them has the 'erasure_code_rank' attribute set to its rank.
>>
>> If the acting set changes to [2,1,0], the read would reorder the chunks based on their 'erasure_code_rank' attribute instead of the rank of the OSD they originate from in the current acting set. It can then decode them with the erasure code library, which requires that the chunks are provided in a specific order.
>>
>> When doing a full write, the chunks are written in the same order as the acting set. This implies that the order of the chunks of the previous version of the object may be different but I don't see a problem with that.
>>
>> When doing an append, the primary must first retrieve the order in which the objects are stored by retrieving their 'erasure_code_rank' attribute, because the order of the acting set is not the same as the order of the chunks. It then maps the chunks to the OSDs matching their rank and pushes them to the OSDs.
>>
>> The only downside is that it may make things more complicated to implement optimizations based on the fact that, sometimes, chunks can just be concatenated to recover the content of the object and don't need to be decoded ( when using systematic codes and the M data chunks are available ).
>>
>> Cheers
>>
>> On 01/08/2013 19:14, Loic Dachary wrote:
>>>
>>>
>>> On 01/08/2013 18:42, Loic Dachary wrote:
>>>> Hi Sam,
>>>>
>>>> When the acting set changes order two chunks for the same object may co-exist in the same placement group. The key should therefore also contain the chunk number.
>>>>
>>>> That's probably the most sensible comment I have so far. This document is immensely useful (even in its current state) because it shows me your perspective on the implementation.
>>>>
>>>> I'm puzzled by:
>>>
>>> I get it (thanks to yanzheng). The object is deleted, then created again ... spurious non-versioned chunks would get in the way.
>>>
>>> :-)
>>>
>>>>
>>>> CEPH_OSD_OP_DELETE: The possibility of rolling back a delete requires that we retain the deleted object until all replicas have persisted the deletion event. ErasureCoded backend will therefore need to store objects with the version at which they were created included in the key provided to the filestore. Old versions of an object can be pruned when all replicas have committed up to the log event deleting the object.
>>>>
>>>> because I don't understand why the version would be necessary. I thought that deleting an erasure coded object could be even easier than erasing a replicated object because it cannot be resurrected if enough chunks are lost, therefore you don't need to wait for an ack from all OSDs in the up set. I'm obviously missing something.
>>>>
>>>> I failed to understand how important the pg logs were to maintaining the consistency of the PG. For some reason I thought about them only in terms of being a lightweight version of the operation logs. Adding a payload to the pg_log_entry (i.e. APPEND size or attribute) is a new idea for me and I would never have thought, or dared think, the logs could be extended in such a way. Given the recent problems with log writes having a high impact on performance (I'm referring to what forced you to introduce code to reduce the amount of logs being written to only those that have been changed instead of the complete logs), I thought about the pg logs as something immutable.
>>>>
>>>> I'm still trying to figure out how PGBackend::perform_write / read / try_rollback would fit in the current backfilling / write / read / scrubbing ... code path.
>>>>
>>>> https://github.com/athanatos/ceph/blob/ba5c97eda4fe72a25831031a2cffb226fed8d9b7/doc/dev/osd_internals/erasure_coding.rst
>>>> https://github.com/athanatos/ceph/blob/ba5c97eda4fe72a25831031a2cffb226fed8d9b7/src/osd/PGBackend.h
>>>>
>>>> Cheers
>>>>
>>>
>>
>> --
>> Loïc Dachary, Artisan Logiciel Libre
>> All that is necessary for the triumph of evil is that good people do nothing.
>>
> --
> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> 

-- 
Loïc Dachary, Artisan Logiciel Libre
All that is necessary for the triumph of evil is that good people do nothing.


[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 261 bytes --]

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: PG Backend Proposal
  2013-08-02  7:39       ` Loic Dachary
@ 2013-08-02 15:10         ` Loic Dachary
  2013-08-02 17:11           ` Samuel Just
  0 siblings, 1 reply; 12+ messages in thread
From: Loic Dachary @ 2013-08-02 15:10 UTC (permalink / raw)
  To: Samuel Just; +Cc: Ceph Development

[-- Attachment #1: Type: text/plain, Size: 8914 bytes --]

Hi Sam,

> - coll_t needs to include a chunk_id_t.
https://github.com/athanatos/ceph/blob/2234bdf7fc30738363160d598ae8b4d6f75e1dd1/doc/dev/osd_internals/erasure_coding.rst#distinguished-acting-set-positions

Would that be for a sanity check? Since the rank of the chunk (chunk_id_t) matches the position in the acting set and a history of osdmaps is kept, would this be used when loading the pg from disk to make sure it matches the expected chunk_id_t?

Cheers

On 02/08/2013 09:39, Loic Dachary wrote:
> Hi Sam,
> 
> I think I understand, and I am paraphrasing you to make sure I do. We may save bandwidth because chunks are not moved as much if their position is not tied to the position of the OSD containing them in the acting set. But this is mitigated by the use of the indep crush mode. And it may require handling tricky edge cases. In addition, you think that being able to know which OSD contains which chunk by using only the OSDMap and the (v)hobject_t is going to simplify the design.
> 
> For the record:
> 
> Back in April Sage suggested that
> 
> "- those PGs use the parity ('INDEP') crush mode so that placement is intelligent"
> 
> http://permalink.gmane.org/gmane.comp.file-systems.ceph.devel/14579
> 
> "The indep placement avoids moving around a shard between ranks, because a mapping of [0,1,2,3,4] will change to [0,6,2,3,4] (or something) if osd.1 fails and the shards on 2,3,4 won't need to be copied around."
> 
> http://permalink.gmane.org/gmane.comp.file-systems.ceph.devel/14582
> 
> and I assume that's what you refer to when you write "CRUSH has a mode which will cause replacement to behave well for erasure codes:
> 
>  initial: [0,1,2]
>  0 fails: [3,1,2]
>  2 fails: [3,1,4]
>  0 recovers: [0,1,4]"
> 
> I understand this is implemented here:
> 
> https://github.com/ceph/ceph/blob/e7d47827f0333c96ad43d257607fb92ed4176550/src/crush/mapper.c#L523
> 
> and will determine the order of the acting set
> 
> https://github.com/ceph/ceph/blob/e7d47827f0333c96ad43d257607fb92ed4176550/src/osd/OSDMap.cc#L998
> 
> when called by the monitor when creating or updating a PG
> 
> https://github.com/ceph/ceph/blob/e7d47827f0333c96ad43d257607fb92ed4176550/src/mon/PGMonitor.cc#L814
> 
> Cheers
> 
> On 02/08/2013 03:34, Samuel Just wrote:
>> I think there are some tricky edge cases with the above approach.  You
>> might end up with two pg replicas in the same acting set which happen
>> for reasons of history to have the same chunk for one or more objects.
>>  That would have to be detected and repaired even though the object
>> would be missing from neither replica (and might not even be in the pg
>> log).  The erasure_code_rank would have to be somehow maintained
>> through recovery (do we remember the original holder of a particular
>> chunk in case it ever comes back?).
>>
>> The chunk rank doesn't *need* to match the acting set position, but
>> there are some good reasons to arrange for that to be the case:
>> 1) Otherwise, we need something else to assign the chunk ranks
>> 2) This way, a new primary can determine which osds hold which
>> replicas of which chunk rank by looking at past osd maps.
>>
>> It seems to me that given an OSDMap and an object, we should know
>> immediately where all chunks should be stored since a future primary
>> may need to do that without access to the objects themselves.
>>
>> Importantly, while it may be possible for an acting set transition
>> like [0,1,2]->[2,1,0] to occur in some pathological case, CRUSH has a
>> mode which will cause replacement to behave well for erasure codes:
>>
>> initial: [0,1,2]
>> 0 fails: [3,1,2]
>> 2 fails: [3,1,4]
>> 0 recovers: [0,1,4]
>>
>> We do, however, need to decouple primariness from position in the
>> acting set so that backfill can work well.
>> -Sam
>>
>> On Thu, Aug 1, 2013 at 4:54 PM, Loic Dachary <loic@dachary.org> wrote:
>>> Hi Sam,
>>>
>>> I'm under the impression that
>>> https://github.com/athanatos/ceph/blob/wip-erasure-coding-doc/doc/dev/osd_internals/erasure_coding.rst#distinguished-acting-set-positions
>>> assumes acting[0] stores all chunk[0], acting[1] stores all chunk[1] etc.
>>>
>>> The chunk rank does not need to match the OSD position in the acting set. As long as each object chunk is stored with its rank in an attribute, changing the order of the acting set does not require moving the chunks around.
>>>
>>> With M=2, K=1 and an acting set of [0,1,2], chunks M0, M1, K0 are written on [0,1,2] respectively, and each of them has the 'erasure_code_rank' attribute set to its rank.
>>>
>>> If the acting set changes to [2,1,0], the read would reorder the chunks based on their 'erasure_code_rank' attribute instead of the rank of the OSD they originate from in the current acting set. It can then decode them with the erasure code library, which requires that the chunks are provided in a specific order.
>>>
>>> When doing a full write, the chunks are written in the same order as the acting set. This implies that the order of the chunks of the previous version of the object may be different but I don't see a problem with that.
>>>
>>> When doing an append, the primary must first retrieve the order in which the objects are stored by retrieving their 'erasure_code_rank' attribute, because the order of the acting set is not the same as the order of the chunks. It then maps the chunks to the OSDs matching their rank and pushes them to the OSDs.
>>>
>>> The only downside is that it may make things more complicated to implement optimizations based on the fact that, sometimes, chunks can just be concatenated to recover the content of the object and don't need to be decoded ( when using systematic codes and the M data chunks are available ).
>>>
>>> Cheers
>>>
>>> On 01/08/2013 19:14, Loic Dachary wrote:
>>>>
>>>>
>>>> On 01/08/2013 18:42, Loic Dachary wrote:
>>>>> Hi Sam,
>>>>>
>>>>> When the acting set changes order two chunks for the same object may co-exist in the same placement group. The key should therefore also contain the chunk number.
>>>>>
>>>>> That's probably the most sensible comment I have so far. This document is immensely useful (even in its current state) because it shows me your perspective on the implementation.
>>>>>
>>>>> I'm puzzled by:
>>>>
>>>> I get it (thanks to yanzheng). The object is deleted, then created again ... spurious non-versioned chunks would get in the way.
>>>>
>>>> :-)
>>>>
>>>>>
>>>>> CEPH_OSD_OP_DELETE: The possibility of rolling back a delete requires that we retain the deleted object until all replicas have persisted the deletion event. ErasureCoded backend will therefore need to store objects with the version at which they were created included in the key provided to the filestore. Old versions of an object can be pruned when all replicas have committed up to the log event deleting the object.
>>>>>
>>>>> because I don't understand why the version would be necessary. I thought that deleting an erasure coded object could be even easier than erasing a replicated object because it cannot be resurrected if enough chunks are lost, therefore you don't need to wait for an ack from all OSDs in the up set. I'm obviously missing something.
>>>>>
>>>>> I failed to understand how important the pg logs were to maintaining the consistency of the PG. For some reason I thought about them only in terms of being a lightweight version of the operation logs. Adding a payload to the pg_log_entry (i.e. APPEND size or attribute) is a new idea for me and I would never have thought, or dared think, the logs could be extended in such a way. Given the recent problems with log writes having a high impact on performance (I'm referring to what forced you to introduce code to reduce the amount of logs being written to only those that have been changed instead of the complete logs), I thought about the pg logs as something immutable.
>>>>>
>>>>> I'm still trying to figure out how PGBackend::perform_write / read / try_rollback would fit in the current backfilling / write / read / scrubbing ... code path.
>>>>>
>>>>> https://github.com/athanatos/ceph/blob/ba5c97eda4fe72a25831031a2cffb226fed8d9b7/doc/dev/osd_internals/erasure_coding.rst
>>>>> https://github.com/athanatos/ceph/blob/ba5c97eda4fe72a25831031a2cffb226fed8d9b7/src/osd/PGBackend.h
>>>>>
>>>>> Cheers
>>>>>
>>>>
>>>
>>> --
>>> Loïc Dachary, Artisan Logiciel Libre
>>> All that is necessary for the triumph of evil is that good people do nothing.
>>>
> 

-- 
Loïc Dachary, Artisan Logiciel Libre
All that is necessary for the triumph of evil is that good people do nothing.


[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 261 bytes --]

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: PG Backend Proposal
  2013-08-02 15:10         ` Loic Dachary
@ 2013-08-02 17:11           ` Samuel Just
  2013-08-05 12:36             ` Loic Dachary
  0 siblings, 1 reply; 12+ messages in thread
From: Samuel Just @ 2013-08-02 17:11 UTC (permalink / raw)
  To: Loic Dachary; +Cc: Ceph Development

The reason for the chunk_id_t in the coll_t is to handle a tricky edge case:
[0,1,2]
[3,1,2]
..time passes..
[3,0,2]

This should be exceedingly rare, but a single osd might end up with
copies of two different chunks of the same pg.

When an osd joins an acting set with a preexisting copy of the pg and
can be brought up to date with logs, we must know which chunk each
object in the replica is without having to scan the PG (more to the
point, we need to know that each chunk matches the chunk of the osd
which it replaced).  If a pg replica can store any chunk, we need a
mechanism to ensure that.  It seems simpler to force all objects
within a replica to be the same chunk and furthermore to tie that
chunk to the position in the acting set.
-Sam
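
The edge case above can be replayed with a small sketch. It assumes, as proposed in this thread, that the chunk rank equals the position in the acting set, and that old copies are not cleaned up between intervals. The names osd_id, pg_id, chunk_id, coll_key and collections_after are hypothetical stand-ins for illustration, not the real coll_t or chunk_id_t.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <map>
#include <set>
#include <utility>
#include <vector>

// Hypothetical stand-ins for illustration; not the real Ceph types.
using osd_id = int;
using pg_id = std::uint32_t;
using chunk_id = std::uint8_t;
// Models "coll_t includes a chunk_id_t": one collection per (pg, chunk).
using coll_key = std::pair<pg_id, chunk_id>;

// Replays a history of acting sets and returns, for each OSD, the set of
// collections it would hold if chunk rank == position in the acting set
// and nothing were removed in between (an assumption of this sketch).
std::map<osd_id, std::set<coll_key>> collections_after(
    pg_id pg, const std::vector<std::vector<osd_id>>& history) {
  std::map<osd_id, std::set<coll_key>> on_disk;
  for (const auto& acting : history) {
    assert(acting.size() <= 256);  // chunk_id is 8 bits in this sketch
    for (std::size_t pos = 0; pos < acting.size(); ++pos) {
      on_disk[acting[pos]].insert(coll_key(pg, static_cast<chunk_id>(pos)));
    }
  }
  return on_disk;
}
```

With the history [0,1,2], [3,1,2], [3,0,2], osd.0 ends up with two collections for the same pg, one for chunk 0 and one for chunk 1. Keying the collection by chunk keeps them apart without scanning the objects.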

On Fri, Aug 2, 2013 at 8:10 AM, Loic Dachary <loic@dachary.org> wrote:
> Hi Sam,
>
>> - coll_t needs to include a chunk_id_t.
> https://github.com/athanatos/ceph/blob/2234bdf7fc30738363160d598ae8b4d6f75e1dd1/doc/dev/osd_internals/erasure_coding.rst#distinguished-acting-set-positions
>
> Would that be for a sanity check? Since the rank of the chunk (chunk_id_t) matches the position in the acting set and a history of osdmaps is kept, would this be used when loading the pg from disk to make sure it matches the expected chunk_id_t?
>
> Cheers
>
> On 02/08/2013 09:39, Loic Dachary wrote:
>> Hi Sam,
>>
>> I think I understand, and I am paraphrasing you to make sure I do. We may save bandwidth because chunks are not moved as much if their position is not tied to the position of the OSD containing them in the acting set. But this is mitigated by the use of the indep crush mode. And it may require handling tricky edge cases. In addition, you think that being able to know which OSD contains which chunk by using only the OSDMap and the (v)hobject_t is going to simplify the design.
>>
>> For the record:
>>
>> Back in April Sage suggested that
>>
>> "- those PGs use the parity ('INDEP') crush mode so that placement is intelligent"
>>
>> http://permalink.gmane.org/gmane.comp.file-systems.ceph.devel/14579
>>
>> "The indep placement avoids moving around a shard between ranks, because a mapping of [0,1,2,3,4] will change to [0,6,2,3,4] (or something) if osd.1 fails and the shards on 2,3,4 won't need to be copied around."
>>
>> http://permalink.gmane.org/gmane.comp.file-systems.ceph.devel/14582
>>
>> and I assume that's what you refer to when you write "CRUSH has a mode which will cause replacement to behave well for erasure codes:
>>
>>  initial: [0,1,2]
>>  0 fails: [3,1,2]
>>  2 fails: [3,1,4]
>>  0 recovers: [0,1,4]"
>>
>> I understand this is implemented here:
>>
>> https://github.com/ceph/ceph/blob/e7d47827f0333c96ad43d257607fb92ed4176550/src/crush/mapper.c#L523
>>
>> and will determine the order of the acting set
>>
>> https://github.com/ceph/ceph/blob/e7d47827f0333c96ad43d257607fb92ed4176550/src/osd/OSDMap.cc#L998
>>
>> when called by the monitor when creating or updating a PG
>>
>> https://github.com/ceph/ceph/blob/e7d47827f0333c96ad43d257607fb92ed4176550/src/mon/PGMonitor.cc#L814
>>
>> Cheers
>>
>> On 02/08/2013 03:34, Samuel Just wrote:
>>> I think there are some tricky edge cases with the above approach.  You
>>> might end up with two pg replicas in the same acting set which happen
>>> for reasons of history to have the same chunk for one or more objects.
>>>  That would have to be detected and repaired even though the object
>>> would be missing from neither replica (and might not even be in the pg
>>> log).  The erasure_code_rank would have to be somehow maintained
>>> through recovery (do we remember the original holder of a particular
>>> chunk in case it ever comes back?).
>>>
>>> The chunk rank doesn't *need* to match the acting set position, but
>>> there are some good reasons to arrange for that to be the case:
>>> 1) Otherwise, we need something else to assign the chunk ranks
>>> 2) This way, a new primary can determine which osds hold which
>>> replicas of which chunk rank by looking at past osd maps.
>>>
>>> It seems to me that given an OSDMap and an object, we should know
>>> immediately where all chunks should be stored since a future primary
>>> may need to do that without access to the objects themselves.
>>>
>>> Importantly, while it may be possible for an acting set transition
>>> like [0,1,2]->[2,1,0] to occur in some pathological case, CRUSH has a
>>> mode which will cause replacement to behave well for erasure codes:
>>>
>>> initial: [0,1,2]
>>> 0 fails: [3,1,2]
>>> 2 fails: [3,1,4]
>>> 0 recovers: [0,1,4]
>>>
>>> We do, however, need to decouple primariness from position in the
>>> acting set so that backfill can work well.
>>> -Sam
>>>
>>> On Thu, Aug 1, 2013 at 4:54 PM, Loic Dachary <loic@dachary.org> wrote:
>>>> Hi Sam,
>>>>
>>>> I'm under the impression that
>>>> https://github.com/athanatos/ceph/blob/wip-erasure-coding-doc/doc/dev/osd_internals/erasure_coding.rst#distinguished-acting-set-positions
>>>> assumes acting[0] stores all chunk[0], acting[1] stores all chunk[1] etc.
>>>>
>>>> The chunk rank does not need to match the OSD position in the acting set. As long as each object chunk is stored with its rank in an attribute, changing the order of the acting set does not require moving the chunks around.
>>>>
>>>> With M=2, K=1 and an acting set of [0,1,2], chunks M0, M1, K0 are written on [0,1,2] respectively, and each of them has the 'erasure_code_rank' attribute set to its rank.
>>>>
>>>> If the acting set changes to [2,1,0], the read would reorder the chunks based on their 'erasure_code_rank' attribute instead of the rank of the OSD they originate from in the current acting set. It can then decode them with the erasure code library, which requires that the chunks are provided in a specific order.
>>>>
>>>> When doing a full write, the chunks are written in the same order as the acting set. This implies that the order of the chunks of the previous version of the object may be different but I don't see a problem with that.
>>>>
>>>> When doing an append, the primary must first retrieve the order in which the objects are stored by retrieving their 'erasure_code_rank' attribute, because the order of the acting set is not the same as the order of the chunks. It then maps the chunks to the OSDs matching their rank and pushes them to the OSDs.
>>>>
>>>> The only downside is that it may make things more complicated to implement optimizations based on the fact that, sometimes, chunks can just be concatenated to recover the content of the object and don't need to be decoded ( when using systematic codes and the M data chunks are available ).
>>>>
>>>> Cheers
>>>>
>>>> On 01/08/2013 19:14, Loic Dachary wrote:
>>>>>
>>>>>
>>>>> On 01/08/2013 18:42, Loic Dachary wrote:
>>>>>> Hi Sam,
>>>>>>
>>>>>> When the acting set changes order two chunks for the same object may co-exist in the same placement group. The key should therefore also contain the chunk number.
>>>>>>
>>>>>> That's probably the most sensible comment I have so far. This document is immensely useful (even in its current state) because it shows me your perspective on the implementation.
>>>>>>
>>>>>> I'm puzzled by:
>>>>>
>>>>> I get it (thanks to yanzheng). The object is deleted, then created again ... spurious non-versioned chunks would get in the way.
>>>>>
>>>>> :-)
>>>>>
>>>>>>
>>>>>> CEPH_OSD_OP_DELETE: The possibility of rolling back a delete requires that we retain the deleted object until all replicas have persisted the deletion event. ErasureCoded backend will therefore need to store objects with the version at which they were created included in the key provided to the filestore. Old versions of an object can be pruned when all replicas have committed up to the log event deleting the object.
>>>>>>
>>>>>> because I don't understand why the version would be necessary. I thought that deleting an erasure coded object could be even easier than erasing a replicated object because it cannot be resurrected if enough chunks are lost, therefore you don't need to wait for an ack from all OSDs in the up set. I'm obviously missing something.
>>>>>>
>>>>>> I failed to understand how important the pg logs were to maintaining the consistency of the PG. For some reason I thought about them only in terms of being a lightweight version of the operation logs. Adding a payload to the pg_log_entry (i.e. APPEND size or attribute) is a new idea for me and I would never have thought, or dared think, the logs could be extended in such a way. Given the recent problems with log writes having a high impact on performance (I'm referring to what forced you to introduce code to reduce the amount of logs being written to only those that have been changed instead of the complete logs), I thought about the pg logs as something immutable.
>>>>>>
>>>>>> I'm still trying to figure out how PGBackend::perform_write / read / try_rollback would fit in the current backfilling / write / read / scrubbing ... code path.
>>>>>>
>>>>>> https://github.com/athanatos/ceph/blob/ba5c97eda4fe72a25831031a2cffb226fed8d9b7/doc/dev/osd_internals/erasure_coding.rst
>>>>>> https://github.com/athanatos/ceph/blob/ba5c97eda4fe72a25831031a2cffb226fed8d9b7/src/osd/PGBackend.h
>>>>>>
>>>>>> Cheers
>>>>>>
>>>>>
>>>>
>>>> --
>>>> Loïc Dachary, Artisan Logiciel Libre
>>>> All that is necessary for the triumph of evil is that good people do nothing.
>>>>
>>
>
> --
> Loïc Dachary, Artisan Logiciel Libre
> All that is necessary for the triumph of evil is that good people do nothing.
>
--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: PG Backend Proposal
  2013-08-01 16:42 PG Backend Proposal Loic Dachary
  2013-08-01 17:14 ` Samuel Just
  2013-08-01 17:14 ` Loic Dachary
@ 2013-08-05 10:29 ` Loic Dachary
  2013-08-05 14:18 ` Loic Dachary
  3 siblings, 0 replies; 12+ messages in thread
From: Loic Dachary @ 2013-08-05 10:29 UTC (permalink / raw)
  To: Samuel Just; +Cc: Ceph Development

[-- Attachment #1: Type: text/plain, Size: 1878 bytes --]

Hi Sam,

* snapshots/clones:

I assume we don't want to support snapshots/clones for erasure coded PGs.

If automatic tiering is implemented, snapshotting an object would be possible when a replicated PG is tiered with an erasure coded PG. The erasure coded PG would be used only for demoted objects coming from the replicated PG (if they are not read for a given period of time, for instance).

But what happens with objects that have snapshots? The tiering policy could be to never demote an object with snapshots/clones, and to automatically promote an object when a snapshot/clone is created.
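
To make the suggested policy concrete, here is a minimal sketch. It is only an illustration of the two rules above plus the idle-time demotion; ObjectState, TierAction, decide and the idle threshold are hypothetical names that do not exist in the code base.

```cpp
#include <cassert>

// Hypothetical per-object summary; none of these fields come from Ceph.
struct ObjectState {
  bool has_snapshots;          // the object has snapshots/clones
  bool in_erasure_coded_tier;  // currently stored in the erasure coded PG
  unsigned idle_seconds;       // time since the object was last read
};

enum class TierAction { None, Demote, Promote };

// Policy sketch: never demote an object with snapshots/clones, promote it
// back when a snapshot/clone appears, and demote idle objects otherwise.
TierAction decide(const ObjectState& o, unsigned idle_threshold_seconds) {
  if (o.has_snapshots) {
    return o.in_erasure_coded_tier ? TierAction::Promote : TierAction::None;
  }
  if (!o.in_erasure_coded_tier && o.idle_seconds >= idle_threshold_seconds) {
    return TierAction::Demote;
  }
  return TierAction::None;
}
```

With this shape the erasure coded PG never has to store a snapshot itself, which is the property the policy is after.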

* watchers

The on-disk info about watchers is included in object_info_t and will be persisted on each chunk in the OI_ATTR. Since there will be more chunks (typically from 5 to 15) than replicas (typically from 2 to 3), it will use more disk space, but I'm under the impression that the average number of watchers per object is not big enough for this to be a concern. The in-core structure and the associated logic do not seem dependent on the resilience policy. They could be moved entirely to the PGBackend implementation if they are common to both.

* PGBackend + PGBackendInterface

It would probably help testing if the proposed PGBackend interface 

https://github.com/ceph/ceph/blob/wip-erasure-coding-doc/src/osd/PGBackend.h

was isolated in an abstract PGBackendInterface. ReplicatedPGBackend could inherit from PGBackendInterface and PGBackend (which would contain the common watcher related code, PGRegistry etc.). That would allow unit test code to synthesize behavior by providing a PGBackendInterface to PG.

https://github.com/ceph/ceph/blob/wip-erasure-coding-doc/doc/dev/osd_internals/erasure_coding.rst
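
A rough sketch of what I have in mind for testing. The method names perform_write and perform_read are borrowed from the proposal, but the signatures here are invented and much simpler than what PGBackend.h contains; FakePGBackend and PGUnderTest are hypothetical and only stand in for a test double and for PG.

```cpp
#include <cassert>
#include <map>
#include <string>

// Hypothetical, heavily simplified interface; signatures are assumptions.
class PGBackendInterface {
public:
  virtual ~PGBackendInterface() = default;
  virtual void perform_write(const std::string& oid,
                             const std::string& data) = 0;
  virtual bool perform_read(const std::string& oid,
                            std::string* out) const = 0;
};

// Test double: keeps objects in memory and counts the writes it receives.
class FakePGBackend : public PGBackendInterface {
public:
  void perform_write(const std::string& oid,
                     const std::string& data) override {
    objects_[oid] = data;
    ++writes_;
  }
  bool perform_read(const std::string& oid, std::string* out) const override {
    assert(out != nullptr);
    auto it = objects_.find(oid);
    if (it == objects_.end()) return false;
    *out = it->second;
    return true;
  }
  int writes() const { return writes_; }

private:
  std::map<std::string, std::string> objects_;
  int writes_ = 0;
};

// Stand-in for PG: it only depends on the abstract interface.
class PGUnderTest {
public:
  explicit PGUnderTest(PGBackendInterface& backend) : backend_(backend) {}
  // Append implemented as read-modify-write, purely for illustration.
  void append(const std::string& oid, const std::string& data) {
    std::string current;
    backend_.perform_read(oid, &current);  // a missing object reads as empty
    backend_.perform_write(oid, current + data);
  }

private:
  PGBackendInterface& backend_;
};
```

A unit test would construct a FakePGBackend, hand it to PGUnderTest and check the recorded writes, without any filestore or network involved.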

Cheers

-- 
Loïc Dachary, Artisan Logiciel Libre
All that is necessary for the triumph of evil is that good people do nothing.

[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 261 bytes --]

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: PG Backend Proposal
  2013-08-02 17:11           ` Samuel Just
@ 2013-08-05 12:36             ` Loic Dachary
  0 siblings, 0 replies; 12+ messages in thread
From: Loic Dachary @ 2013-08-05 12:36 UTC (permalink / raw)
  To: Samuel Just; +Cc: Ceph Development

[-- Attachment #1: Type: text/plain, Size: 12149 bytes --]

Hi Sam,

Now I understand the rationale :-) What I still don't understand is why it should be in coll_t. It is the name of the directory

https://github.com/ceph/ceph/blob/962b64a83037ff79855c5261325de0cd1541f582/src/osd/osd_types.h#L394

which is listed when loading the pgs

https://github.com/ceph/ceph/blob/962b64a83037ff79855c5261325de0cd1541f582/src/osd/OSD.cc#L1908

and parsed 

https://github.com/ceph/ceph/blob/962b64a83037ff79855c5261325de0cd1541f582/src/osd/osd_types.cc#L297

into pg_t

https://github.com/ceph/ceph/blob/962b64a83037ff79855c5261325de0cd1541f582/src/osd/osd_types.cc#L170

which contains the pool, the seed (I think for pg splitting purposes), and preferred (it seems to be about localized PGs, but I don't know anything about them and I assume it's not important at this moment).

Do you mean adding chunk_id_t to pg_t instead of coll_t? It would mean that the chunk_id would have to be encoded in the name of the directory in which the PG objects are stored, and therefore that the coll_t would have to be renamed each time the PG changes position in the acting set.

When loading a pg, 

https://github.com/ceph/ceph/blob/962b64a83037ff79855c5261325de0cd1541f582/src/osd/OSD.cc#L1954
https://github.com/ceph/ceph/blob/962b64a83037ff79855c5261325de0cd1541f582/src/osd/PG.cc#L2396

it gets pg_info_t from disk 

https://github.com/ceph/ceph/blob/962b64a83037ff79855c5261325de0cd1541f582/src/osd/PG.cc#L2363

as well as the past intervals

https://github.com/ceph/ceph/blob/962b64a83037ff79855c5261325de0cd1541f582/src/osd/PG.cc#L2371

which includes the acting set

https://github.com/ceph/ceph/blob/962b64a83037ff79855c5261325de0cd1541f582/src/osd/osd_types.h#L1180

for each epoch. Would it make sense to store the chunk_id in the pg_info_t, or even compute it from the position of the OSD in the previous acting set?
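
To illustrate the "compute it" option, a minimal sketch that assumes the chunk rank equals the position in the acting set and that the past acting sets are available oldest first. position_in and last_chunk_id are hypothetical helpers, not existing functions.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Position of `osd` in `acting`, or -1 if it is not a member.
int position_in(const std::vector<int>& acting, int osd) {
  for (std::size_t pos = 0; pos < acting.size(); ++pos) {
    if (acting[pos] == osd) return static_cast<int>(pos);
  }
  return -1;
}

// Walks the past acting sets from newest to oldest and returns the position
// `osd` had the last time it was a member, or -1 if it never was.
int last_chunk_id(const std::vector<std::vector<int>>& past_acting, int osd) {
  for (auto it = past_acting.rbegin(); it != past_acting.rend(); ++it) {
    const int pos = position_in(*it, osd);
    if (pos >= 0) return pos;
  }
  return -1;
}
```

For the history [0,1,2], [3,1,2], [3,0,2] this returns 1 for osd.0, although the copy it kept from the first interval holds chunk 0. The computed value only tells which chunk the OSD should hold now, not which chunk is actually on disk, so something stored with the data still seems necessary.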

Cheers

On 02/08/2013 19:11, Samuel Just wrote:
> The reason for the chunk_id_t in the coll_t is to handle a tricky edge case:
> [0,1,2]
> [3,1,2]
> ..time passes..
> [3,0,2]
> 
> This should be exceedingly rare, but a single osd might end up with
> copies of two different chunks of the same pg.
> 
> When an osd joins an acting set with a preexisting copy of the pg and
> can be brought up to date with logs, we must know which chunk each
> object in the replica is without having to scan the PG (more to the
> point, we need to know that each chunk matches the chunk of the osd
> which it replaced).  If a pg replica can store any chunk, we need a
> mechanism to ensure that.  It seems simpler to force all objects
> within a replica to be the same chunk and furthermore to tie that
> chunk to the position in the acting set.
> -Sam
> 
> On Fri, Aug 2, 2013 at 8:10 AM, Loic Dachary <loic@dachary.org> wrote:
>> Hi Sam,
>>
>>> - coll_t needs to include a chunk_id_t.
>> https://github.com/athanatos/ceph/blob/2234bdf7fc30738363160d598ae8b4d6f75e1dd1/doc/dev/osd_internals/erasure_coding.rst#distinguished-acting-set-positions
>>
>> Would that be for a sanity check? Since the rank of the chunk (chunk_id_t) matches the position in the acting set and a history of osdmaps is kept, would this be used when loading the pg from disk to make sure it matches the expected chunk_id_t?
>>
>> Cheers
>>
>> On 02/08/2013 09:39, Loic Dachary wrote:
>>> Hi Sam,
>>>
>>> I think I understand, and I am paraphrasing you to make sure I do. We may save bandwidth because chunks are not moved as much if their position is not tied to the position of the OSD containing them in the acting set. But this is mitigated by the use of the indep crush mode. And it may require handling tricky edge cases. In addition, you think that being able to know which OSD contains which chunk by using only the OSDMap and the (v)hobject_t is going to simplify the design.
>>>
>>> For the record:
>>>
>>> Back in April Sage suggested that
>>>
>>> "- those PGs use the parity ('INDEP') crush mode so that placement is intelligent"
>>>
>>> http://permalink.gmane.org/gmane.comp.file-systems.ceph.devel/14579
>>>
>>> "The indep placement avoids moving around a shard between ranks, because a mapping of [0,1,2,3,4] will change to [0,6,2,3,4] (or something) if osd.1 fails and the shards on 2,3,4 won't need to be copied around."
>>>
>>> http://permalink.gmane.org/gmane.comp.file-systems.ceph.devel/14582
>>>
>>> and I assume that's what you refer to when you write "CRUSH has a mode which will cause replacement to behave well for erasure codes:
>>>
>>>  initial: [0,1,2]
>>>  0 fails: [3,1,2]
>>>  2 fails: [3,1,4]
>>>  0 recovers: [0,1,4]"
>>>
>>> I understand this is implemented here:
>>>
>>> https://github.com/ceph/ceph/blob/e7d47827f0333c96ad43d257607fb92ed4176550/src/crush/mapper.c#L523
>>>
>>> and will determine the order of the acting set
>>>
>>> https://github.com/ceph/ceph/blob/e7d47827f0333c96ad43d257607fb92ed4176550/src/osd/OSDMap.cc#L998
>>>
>>> when called by the monitor when creating or updating a PG
>>>
>>> https://github.com/ceph/ceph/blob/e7d47827f0333c96ad43d257607fb92ed4176550/src/mon/PGMonitor.cc#L814
>>>
>>> Cheers
>>>
>>> On 02/08/2013 03:34, Samuel Just wrote:
>>>> I think there are some tricky edge cases with the above approach.  You
>>>> might end up with two pg replicas in the same acting set which happen
>>>> for reasons of history to have the same chunk for one or more objects.
>>>>  That would have to be detected and repaired even though the object
>>>> would be missing from neither replica (and might not even be in the pg
>>>> log).  The erasure_code_rank would have to be somehow maintained
>>>> through recovery (do we remember the original holder of a particular
>>>> chunk in case it ever comes back?).
>>>>
>>>> The chunk rank doesn't *need* to match the acting set position, but
>>>> there are some good reasons to arrange for that to be the case:
>>>> 1) Otherwise, we need something else to assign the chunk ranks
>>>> 2) This way, a new primary can determine which osds hold which
>>>> replicas of which chunk rank by looking at past osd maps.
>>>>
>>>> It seems to me that given an OSDMap and an object, we should know
>>>> immediately where all chunks should be stored since a future primary
>>>> may need to do that without access to the objects themselves.
>>>>
>>>> Importantly, while it may be possible for an acting set transition
>>>> like [0,1,2]->[2,1,0] to occur in some pathological case, CRUSH has a
>>>> mode which will cause replacement to behave well for erasure codes:
>>>>
>>>> initial: [0,1,2]
>>>> 0 fails: [3,1,2]
>>>> 2 fails: [3,1,4]
>>>> 0 recovers: [0,1,4]
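
(A minimal sketch of the replacement sequence quoted just above. It is not CRUSH and not Ceph code: it only models the property that a failed OSD is replaced in its own slot while every other position stays put. The helper name `replace_in_place` and the choice of spare OSDs 3 and 4 are assumptions made for illustration.)

```python
def replace_in_place(acting, old_osd, new_osd):
    """Swap old_osd for new_osd without shifting any other position."""
    if old_osd not in acting:
        raise ValueError(f"osd.{old_osd} is not in the acting set")
    return [new_osd if osd == old_osd else osd for osd in acting]


history = [[0, 1, 2]]                                 # initial
history.append(replace_in_place(history[-1], 0, 3))   # 0 fails
history.append(replace_in_place(history[-1], 2, 4))   # 2 fails
history.append(replace_in_place(history[-1], 3, 0))   # 0 recovers

for acting in history:
    print(acting)
```

Because positions 1 and 2 never move while slot 0 changes hands, the chunks stored at those positions would not need to be copied around.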
>>>>
>>>> We do, however, need to decouple primariness from position in the
>>>> acting set so that backfill can work well.
>>>> -Sam
>>>>
>>>> On Thu, Aug 1, 2013 at 4:54 PM, Loic Dachary <loic@dachary.org> wrote:
>>>>> Hi Sam,
>>>>>
>>>>> I'm under the impression that
>>>>> https://github.com/athanatos/ceph/blob/wip-erasure-coding-doc/doc/dev/osd_internals/erasure_coding.rst#distinguished-acting-set-positions
>>>>> assumes acting[0] stores all chunk[0], acting[1] stores all chunk[1] etc.
>>>>>
>>>>> The chunk rank does not need to match the OSD position in the acting set. As long as each object chunk is stored with its rank in an attribute, changing the order of the acting set does not require moving the chunks around.
>>>>>
>>>>> With M=2, K=1 and an acting set of [0,1,2], chunks M0, M1, K0 are written on [0,1,2] respectively, and each of them has the 'erasure_code_rank' attribute set to its rank.
>>>>>
>>>>> If the acting set changes to [2,1,0], the read would reorder the chunks based on their 'erasure_code_rank' attribute instead of the rank of the OSD they originate from in the current acting set. It would then be able to decode them with the erasure code library, which requires that the chunks are provided in a specific order.
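
(A minimal sketch of the reordering described in the quoted paragraph above, assuming each chunk read comes back as a pair of an attribute dict and a payload. Only the 'erasure_code_rank' attribute name comes from the thread; the function name and data layout are hypothetical. The duplicate check corresponds to the edge case of two OSDs holding the same chunk.)

```python
def order_chunks_by_rank(chunks_read):
    """Place each payload at the index given by its stored rank.

    chunks_read is a list of (attrs, payload) pairs in acting-set order.
    """
    ordered = [None] * len(chunks_read)
    for attrs, payload in chunks_read:
        rank = attrs["erasure_code_rank"]
        if ordered[rank] is not None:
            raise ValueError(f"two chunks claim rank {rank}")
        ordered[rank] = payload
    return ordered


# Chunks M0, M1, K0 were written on [0,1,2]; the acting set is now [2,1,0],
# so reading in acting-set order returns K0 first and M0 last.
reads = [
    ({"erasure_code_rank": 2}, b"K0"),
    ({"erasure_code_rank": 1}, b"M1"),
    ({"erasure_code_rank": 0}, b"M0"),
]
print(order_chunks_by_rank(reads))
```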
>>>>>
>>>>> When doing a full write, the chunks are written in the same order as the acting set. This implies that the order of the chunks of the previous version of the object may be different but I don't see a problem with that.
>>>>>
>>>>> When doing an append, the primary must first retrieve the order in which the objects are stored by retrieving their 'erasure_code_rank' attribute, because the order of the acting set is not the same as the order of the chunks. It then maps the chunks to the OSDs matching their rank and pushes them to the OSDs.
>>>>>
>>>>> The only downside is that it may make things more complicated to implement optimizations based on the fact that, sometimes, chunks can just be concatenated to recover the content of the object and don't need to be decoded ( when using systematic codes and the M data chunks are available ).
>>>>>
>>>>> Cheers
>>>>>
>>>>> On 01/08/2013 19:14, Loic Dachary wrote:
>>>>>>
>>>>>>
>>>>>> On 01/08/2013 18:42, Loic Dachary wrote:
>>>>>>> Hi Sam,
>>>>>>>
>>>>>>> When the acting set changes order two chunks for the same object may co-exist in the same placement group. The key should therefore also contain the chunk number.
>>>>>>>
>>>>>>> That's probably the most sensible comment I have so far. This document is immensely useful (even in its current state) because it shows me your perspective on the implementation.
>>>>>>>
>>>>>>> I'm puzzled by:
>>>>>>
>>>>>> I get it ( thanks to yanzheng ). The object is deleted, then created again ... spurious non-versioned chunks would get in the way.
>>>>>>
>>>>>> :-)
>>>>>>
>>>>>>>
>>>>>>> CEPH_OSD_OP_DELETE: The possibility of rolling back a delete requires that we retain the deleted object until all replicas have persisted the deletion event. ErasureCoded backend will therefore need to store objects with the version at which they were created included in the key provided to the filestore. Old versions of an object can be pruned when all replicas have committed up to the log event deleting the object.
>>>>>>>
>>>>>>> because I don't understand why the version would be necessary. I thought that deleting an erasure coded object could be even easier than erasing a replicated object because it cannot be resurrected if enough chunks are lost, therefore you don't need to wait for ack from all OSDs in the up set. I'm obviously missing something.
>>>>>>>
>>>>>>> I failed to understand how important the pg logs were to maintaining the consistency of the PG. For some reason I thought about them only in terms of being a light weight version of the operation logs. Adding a payload to the pg_log_entry ( i.e. APPEND size or attribute ) is a new idea for me and I would have never thought or dared think the logs could be extended in such a way. Given the recent problems with log writes having a high impact on performance ( I'm referring to what forced you to introduce code to reduce the amount of logs being written to only those that have been changed instead of the complete logs ) I thought about the pg logs as something immutable.
>>>>>>>
>>>>>>> I'm still trying to figure out how PGBackend::perform_write / read / try_rollback would fit in the current backfilling / write / read / scrubbing ... code path.
>>>>>>>
>>>>>>> https://github.com/athanatos/ceph/blob/ba5c97eda4fe72a25831031a2cffb226fed8d9b7/doc/dev/osd_internals/erasure_coding.rst
>>>>>>> https://github.com/athanatos/ceph/blob/ba5c97eda4fe72a25831031a2cffb226fed8d9b7/src/osd/PGBackend.h
>>>>>>>
>>>>>>> Cheers
>>>>>>>
>>>>>>
>>>>>
>>>>> --
>>>>> Loïc Dachary, Artisan Logiciel Libre
>>>>> All that is necessary for the triumph of evil is that good people do nothing.
>>>>>
>>>
>>
>> --
>> Loïc Dachary, Artisan Logiciel Libre
>> All that is necessary for the triumph of evil is that good people do nothing.
>>

-- 
Loïc Dachary, Artisan Logiciel Libre
All that is necessary for the triumph of evil is that good people do nothing.



* Re: PG Backend Proposal
  2013-08-01 16:42 PG Backend Proposal Loic Dachary
                   ` (2 preceding siblings ...)
  2013-08-05 10:29 ` Loic Dachary
@ 2013-08-05 14:18 ` Loic Dachary
  3 siblings, 0 replies; 12+ messages in thread
From: Loic Dachary @ 2013-08-05 14:18 UTC (permalink / raw)
  To: Samuel Just; +Cc: Ceph Development

[-- Attachment #1: Type: text/plain, Size: 750 bytes --]

Hi Sam,

In case you don't get notified, I proposed a few changes to your erasure_coding document. Basically, they are links to the tickets you created last Friday.

https://github.com/ceph/ceph/pull/482

I also reworked the coding tasks of the blueprint

http://wiki.ceph.com/01Planning/02Blueprints/Emperor/Erasure_coded_storage_backend_%28step_2%29

to also link to them. 

It would be convenient if 

https://github.com/ceph/ceph/blob/wip-erasure-coding-doc/doc/dev/osd_internals/erasure_coding.rst

was in master instead of a branch to create permalinks that will survive the deletion of this branch. 

Cheers

-- 
Loïc Dachary, Artisan Logiciel Libre
All that is necessary for the triumph of evil is that good people do nothing.



end of thread, other threads:[~2013-08-05 14:18 UTC | newest]

Thread overview: 12+ messages
2013-08-01 16:42 PG Backend Proposal Loic Dachary
2013-08-01 17:14 ` Samuel Just
2013-08-01 17:14 ` Loic Dachary
2013-08-01 23:54   ` Loic Dachary
2013-08-02  1:34     ` Samuel Just
2013-08-02  3:39       ` Sage Weil
2013-08-02  7:39       ` Loic Dachary
2013-08-02 15:10         ` Loic Dachary
2013-08-02 17:11           ` Samuel Just
2013-08-05 12:36             ` Loic Dachary
2013-08-05 10:29 ` Loic Dachary
2013-08-05 14:18 ` Loic Dachary
