From: Loic Dachary <loic@dachary.org>
To: Sage Weil <sage@inktank.com>
Cc: ceph-devel@vger.kernel.org
Subject: Re: erasure coding (sorry)
Date: Mon, 22 Apr 2013 20:09:47 +0200
Message-ID: <51757CEB.4090701@dachary.org>
In-Reply-To: <alpine.DEB.2.00.1304220805100.14056@cobra.newdream.net>

Hi Sage,

On 04/22/2013 05:09 PM, Sage Weil wrote:
> On Mon, 22 Apr 2013, Christopher LILJENSTOLPE wrote:
>> Supposedly, on 2013-Apr-22, at 01.10 PDT(-0700), someone claiming to be Loic Dachary scribed:
>>
>>> Hi Christopher,
>>>
>>> You wrote "A modified client/library could be used to store objects that should be sharded, vs "standard" ceph treatment.  In this model, each shard would be written to a separate PG, and each PG would be stored on exactly one OSD." but there is no way for a client to enforce that two objects are stored in separate PGs.
>>
>> Poorly worded.  The idea is that each shard becomes a separate object, and the encoder/sharder would use CRUSH to identify the OSDs to hold the shards.  However, the OSDs would treat the shard as an n=1 replication and just store it locally.
>>
>> Actually, looking at this again this morning, it is harder than the preferred alternative (i.e. grafting an encode/decode into the (e)OSD).  It was meant to cover the alternative approaches.  I didn't like this one, and it now appears to be both more difficult and non-deterministic in its placement.
>>
>> One question on CRUSH (it's been too long since I read the paper): if x is the same for two objects and n=3 returns R={OSD18,OSD45,OSD97}, and an object that matches x but has n=1 is handed to OSD45, would OSD45 store it, or would it forward it to OSD18 to store?  If it would forward it, this idea is DOA.  Also, if x is held invariant but n changes, does the same R set get returned (truncated to n members)?
> 
> It would go to osd18, the first item in the sequence that CRUSH generates.
>            
> As Loic observes, not having control of placement from above the librados
> level makes this more or less a non-starter.  The only thing that might
> work at that layer is to set up ~5 or more pools, each with a distinct set
> of OSDs, and put each shard/fragment in a different pool.  I don't think  
> that is a particularly good approach.
> 
> If we are going to do parity encoding (and I think we should!), I think we
> should fully integrate it into the OSD.
>            
> The simplest approach:
>            
>  - we create a new PG type for 'parity' or 'erasure' or whatever (type    
>    fields already exist)
>  - those PGs use the parity ('INDEP') crush mode so that placement is
>    intelligent

I assume you do not mean CEPH_FEATURE_INDEP_PG_MAP as used in

https://github.com/ceph/ceph/blob/master/src/osd/OSD.cc#L5237

but CRUSH_RULE_CHOOSE_INDEP as used in

https://github.com/ceph/ceph/blob/master/src/crush/mapper.c#L331

when firstn == 0 because it was set in

https://github.com/ceph/ceph/blob/master/src/crush/mapper.c#L523

I see that it would be simpler to write

   step choose indep 0 type row

and then rely on intelligent placement. Is there a reason why it would not be possible to use firstn instead of indep?
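
For context on the firstn/indep question, here is a toy Python sketch of the property that usually separates the two modes. It is not the algorithm in mapper.c: the draw() hash, the function names, and the flat list of OSDs are all assumptions made for illustration. In the firstn-like variant a single candidate sequence is walked, so when an OSD is lost the later entries shift left. In the indep-like variant each slot retries on its own, so surviving slots normally keep their position, which matters when slot i holds shard i.

```python
import hashlib

def draw(x, stream, attempt, osds):
    """Deterministic pseudo-random pick; a stand-in for CRUSH's hash (assumption)."""
    digest = hashlib.sha256(f"{x}:{stream}:{attempt}".encode()).digest()
    return osds[int.from_bytes(digest[:4], "big") % len(osds)]

def choose_firstn(x, n, osds, down=()):
    """firstn-like toy: walk one sequence and skip unusable items,
    so entries after a lost OSD move up by one position."""
    out = []
    for attempt in range(100):
        if len(out) == n:
            break
        cand = draw(x, 0, attempt, osds)
        if cand not in out and cand not in down:
            out.append(cand)
    return out

def choose_indep(x, n, osds, down=()):
    """indep-like toy: every slot has its own sequence and only the
    slots that are still empty retry in later rounds."""
    out = [None] * n
    for attempt in range(100):
        for slot in range(n):
            if out[slot] is not None:
                continue
            cand = draw(x, slot, attempt, osds)
            if cand not in out and cand not in down:
                out[slot] = cand
        if None not in out:
            break
    return out

if __name__ == "__main__":
    osds = [f"osd{i}" for i in range(10)]
    for choose in (choose_firstn, choose_indep):
        healthy = choose(42, 3, osds)
        degraded = choose(42, 3, osds, down={healthy[0]})
        print(choose.__name__, healthy, "->", degraded)
```

With replicas every position holds the same data, so a shift is harmless. With parity shards a shift would mean every shard after the lost one is on the wrong OSD, which is the usual argument for indep.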

>  - all reads and writes go to the 'primary'               
>  - the primary does the shard encoding and distributes the write pieces to
>    the other replicas

I understand how that would work when a PG receives a CEPH_OSD_OP_WRITEFULL:

https://github.com/ceph/ceph/blob/master/src/osd/ReplicatedPG.cc#L2504

However, it may be inconvenient and expensive to recompute the parity-encoded version if an object is written with a series of CEPH_OSD_OP_WRITE operations. The simplest way would be to decode the existing object, modify it according to what CEPH_OSD_OP_WRITE requires, and encode it again.
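
The decode, modify, encode cycle can be sketched as follows. This is a minimal Python illustration under stated assumptions: a single XOR parity chunk stands in for a real erasure code, K = 4 data chunks is arbitrary, and encode(), decode(), write_full() and write_partial() are hypothetical names, not Ceph functions.

```python
from functools import reduce

K = 4  # number of data chunks (arbitrary choice for the sketch)

def xor(a, b):
    """Byte-wise XOR of two equal-length chunks."""
    return bytes(p ^ q for p, q in zip(a, b))

def encode(obj, k=K):
    """Split obj into k zero-padded chunks and append one XOR parity chunk."""
    size = max(1, -(-len(obj) // k))  # ceiling division
    chunks = [obj[i * size:(i + 1) * size].ljust(size, b"\0") for i in range(k)]
    return chunks + [reduce(xor, chunks)]

def decode(shards, length):
    """Reassemble the object from its data chunks (all assumed present)."""
    return b"".join(shards[:-1])[:length]

def write_full(obj):
    """Analogue of a full-object write: encode once."""
    return encode(obj), len(obj)

def write_partial(shards, length, offset, data):
    """Analogue of a partial write: decode, modify, encode again."""
    obj = bytearray(decode(shards, length))
    end = offset + len(data)
    if end > len(obj):
        obj.extend(b"\0" * (end - len(obj)))  # grow the object if needed
    obj[offset:end] = data
    return encode(bytes(obj)), len(obj)

if __name__ == "__main__":
    shards, length = write_full(b"hello erasure world")
    shards, length = write_partial(shards, length, 6, b"ERASURE")
    print(decode(shards, length))
```

In this sketch every partial write reads and rewrites all K + 1 shards, however small the change, which is the cost being pointed out above.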

>  - same for reads
>            
> There will be a pile of patches to move code around between PG and 
> ReplicatedPG, which will be annoying, but hopefully not too painful.  The 
> class structure and data types were set up with this in mind long ago.
> 
> Several key challenges:
> 
>  - come up with a scheme for internal naming to keep shards distinct
>  - safely rewriting a stripe when there is a partial overwrite.  probably 
>    want to write new stripes to distinct new objects (cloning old data as 
>    needed) and clean up the old ones once enough copies are present.

Do you mean RBD stripes?

>  - recovery logic

If recovery is done from the scheduled scrubber in the ErasureCodedPG, I'm not sure whether OSD.cc must be modified or whether it is truly independent of the PG type:

https://github.com/ceph/ceph/blob/master/src/osd/OSD.cc#L3818

I'll keep looking, thanks a lot for the hints :-)

Cheers

> sage
> 
> 
>>
>> 	Thx
>> 	Christopher
>>
>>
>>
>>>
>>> Am I missing something ?
>>>
>>> On 04/22/2013 09:23 AM, Christopher LILJENSTOLPE wrote:
>>>> Supposedly, on 2013-Apr-18, at 14.31 PDT(-0700), someone claiming to be Plaetinck, Dieter scribed:
>>>>
>>>>> On Thu, 18 Apr 2013 16:09:52 -0500
>>>>> Mark Nelson <mark.nelson@inktank.com> wrote:
>>>>
>>>>>>
>>>>>
>>>>> @Bryan: I did come across cleversafe.  all the articles around it seemed promising,
>>>>> but unfortunately it seems everything related to the cleversafe open source project
>>>>> somehow vanished from the internet.  (e.g. http://www.cleversafe.org/) quite weird...
>>>>>
>>>>> @Sage: interesting. I thought it would be relatively simple if one assumes
>>>>> the restriction of immutable files.  I'm not familiar with those ceph specifics you're mentioning.
>>>>> When building an erasure-code-based system, maybe there are ways to reuse existing ceph
>>>>> code and/or allow some integration with replication based objects, without aiming for full integration or
>>>>> full support of the rados api, based on some tradeoffs.
>>>>>
>>>>> @Josh, that sounds like an interesting approach.  Too bad that page doesn't contain any information yet :)
>>>>
>>>> Greetings - it does now - see what you all think?
>>>>
>>>> 	Christopher
>>>>
>>>>>
>>>>> Dieter
>>>>
>>>>
>>>> --
>>>> Check my PGP key here: https://www.asgaard.org/~cdl/cdl.asc
>>>> Current vCard here: https://www.asgaard.org/~cdl/cdl.vcf
>>>> Check my calendar availability: https://tungle.me/cdl
>>>
>>> -- 
>>> Loïc Dachary, Artisan Logiciel Libre
>>
>>

-- 
Loïc Dachary, Artisan Logiciel Libre



Thread overview: 19+ messages
2013-04-18 20:28 erasure coding (sorry) Plaetinck, Dieter
2013-04-18 20:47 ` Sage Weil
2013-04-18 21:08   ` Josh Durgin
2013-04-18 21:09     ` Mark Nelson
2013-04-18 21:31       ` Plaetinck, Dieter
2013-04-19  0:33         ` Christopher LILJENSTOLPE
2013-04-22  7:23         ` Christopher LILJENSTOLPE
2013-04-22  8:10           ` Loic Dachary
2013-04-22 14:08             ` Christopher LILJENSTOLPE
2013-04-22 15:09               ` Sage Weil
2013-04-22 18:09                 ` Loic Dachary [this message]
2013-04-22 18:31                   ` Sage Weil
2013-04-24  4:35                 ` Christopher LILJENSTOLPE
2013-04-18 21:24     ` Noah Watkins
2013-04-18 21:26       ` Sage Weil
2013-04-19  0:47         ` Christopher LILJENSTOLPE
2013-04-21  2:37           ` Loic Dachary
2013-04-19  0:34       ` Christopher LILJENSTOLPE
2013-04-19  0:29     ` Christopher LILJENSTOLPE
