erasure coding (sorry)

* erasure coding (sorry)
@ 2013-04-18 20:28 Plaetinck, Dieter
  2013-04-18 20:47 ` Sage Weil
  0 siblings, 1 reply; 19+ messages in thread
From: Plaetinck, Dieter @ 2013-04-18 20:28 UTC (permalink / raw)
  To: ceph-devel

Sorry to bring this up again; googling revealed that some people don't like the subject [anymore].

But I'm working on a new cluster of roughly 3 PB for storage of immutable files.
It would be either all cold data or mostly cold: 150 MB average file size, 5 GB maximum (for now).
For this use case, my impression is that erasure coding would make a lot of sense
(though I'm not sure about the computational overhead of storing and loading objects. Outbound traffic would peak at 6 Gbps, but I can reduce it considerably and still keep a large cluster by taking out the small set of hot files.
Inbound traffic would be minimal.)
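
For a rough sense of the capacity at stake, the raw storage needed under replication and under erasure coding can be compared as below. Only the 3 PB figure comes from this message; the 3-copy replication and the 10+4 layout are illustrative assumptions, not something proposed in this thread.

```python
from fractions import Fraction


def raw_capacity_pb(usable_pb, data_chunks, coding_chunks):
    """Raw capacity (PB) needed to hold `usable_pb` with a k+m code.

    Replication with n copies is the special case k=1, m=n-1.
    """
    overhead = Fraction(data_chunks + coding_chunks, data_chunks)
    return float(usable_pb * overhead)


usable = 3  # PB of user data, from the message above

# Assumed layouts, for illustration only.
replicated = raw_capacity_pb(usable, 1, 2)   # three full copies
erasure = raw_capacity_pb(usable, 10, 4)     # 10 data + 4 coding chunks

print(f"3x replication: {replicated} PB raw")
print(f"10+4 erasure code: {erasure} PB raw")
```

The 10+4 layout survives the loss of any four chunks while storing 1.4 bytes per user byte, against 3 bytes per user byte for three copies. The price is the encode/decode CPU cost asked about above.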

I know that the answer a while ago was "no plans to implement erasure coding". Has this changed?
If not, is anyone aware of a similar system that does support it? I found QFS, but that's meant for batch processing, has a single 'namenode', etc.

thanks,
Dieter

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: erasure coding (sorry)
  2013-04-18 20:28 erasure coding (sorry) Plaetinck, Dieter
@ 2013-04-18 20:47 ` Sage Weil
  2013-04-18 21:08   ` Josh Durgin
  0 siblings, 1 reply; 19+ messages in thread
From: Sage Weil @ 2013-04-18 20:47 UTC (permalink / raw)
  To: Plaetinck, Dieter; +Cc: ceph-devel

On Thu, 18 Apr 2013, Plaetinck, Dieter wrote:
> sorry to bring this up again, googling revealed some people don't like the subject [anymore].
> 
> but I'm working on a new +- 3PB cluster for storage of immutable files.
> and it would be either all cold data, or mostly cold. 150MB avg filesize, max size 5GB (for now)
> For this use case, my impression is erasure coding would make a lot of sense
> (though I'm not sure about the computational overhead on storing and loading objects..? outbound traffic would peak at 6 Gbps, but I can make it way less and still keep a large cluster, by taking away the small set of hot files.
> inbound traffic would be minimal)
> 
> I know that the answer a while ago was "no plans to implement erasure coding", has this changed?
> if not, is anyone aware of a similar system that does support it? I found QFS but that's meant for batch processing, has a single 'namenode' etc.

We would love to do it, but it is not a priority at the moment (things 
like multi-site replication are in much higher demand).  That of course 
doesn't prevent someone outside of Inktank from working on it :)

The main caveat is that it will be complicated.  For an initial
implementation, the full breadth of the rados API probably wouldn't be
supported for erasure/parity encoded pools (things like rados classes and
the omap key/value api get tricky when you start talking about parity).
But for many (or even most) use cases, objects are just bytes, and those 
restrictions are just fine.

sage

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: erasure coding (sorry)
  2013-04-18 20:47 ` Sage Weil
@ 2013-04-18 21:08   ` Josh Durgin
  2013-04-18 21:09     ` Mark Nelson
                       ` (2 more replies)
  0 siblings, 3 replies; 19+ messages in thread
From: Josh Durgin @ 2013-04-18 21:08 UTC (permalink / raw)
  To: Sage Weil; +Cc: Plaetinck, Dieter, ceph-devel, cdl, danm

On 04/18/2013 01:47 PM, Sage Weil wrote:
> On Thu, 18 Apr 2013, Plaetinck, Dieter wrote:
>> sorry to bring this up again, googling revealed some people don't like the subject [anymore].
>>
>> but I'm working on a new +- 3PB cluster for storage of immutable files.
>> and it would be either all cold data, or mostly cold. 150MB avg filesize, max size 5GB (for now)
>> For this use case, my impression is erasure coding would make a lot of sense
>> (though I'm not sure about the computational overhead on storing and loading objects..? outbound traffic would peak at 6 Gbps, but I can make it way less and still keep a large cluster, by taking away the small set of hot files.
>> inbound traffic would be minimal)
>>
>> I know that the answer a while ago was "no plans to implement erasure coding", has this changed?
>> if not, is anyone aware of a similar system that does support it? I found QFS but that's meant for batch processing, has a single 'namenode' etc.
>
> We would love to do it, but it is not a priority at the moment (things
> like multi-site replication are in much higher demand).  That of course
> doesn't prevent someone outside of Inktank from working on it :)
>
> The main caveat is that it will be complicated.  For an initial
> implementation, the full breadth of the rados API probably wouldn't be
> supported for erasure/parity encoded pools (things like rados classes and
> the omap key/value api get tricky when you start talking about parity).
> But for many (or even most) use cases, objects are just bytes, and those
> restrictions are just fine.

I talked to some folks interested in doing a more limited form of this
yesterday. They started a blueprint [1]. One of their ideas was to have
erasure coding done by a separate process (or thread perhaps). It would
use erasure coding on an object and then use librados to store the
erasure-encoded pieces in a separate pool, and finally leave a marker in
place of the original object in the first pool.

When the osd detected this marker, it would proxy the request to the
erasure coding thread/process which would service the request on the
second pool for reads, and potentially make writes move the data back to
the first pool in a tiering sort of scenario.

I might have misremembered some details, but I think it's an
interesting way to get many of the benefits of erasure coding with a 
relatively small amount of work compared to a fully native osd solution.

Josh

[1] 
http://wiki.ceph.com/01Planning/02Blueprints/Dumpling/Erasure_encoding_as_a_storage_backend

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: erasure coding (sorry)
  2013-04-18 21:08   ` Josh Durgin
@ 2013-04-18 21:09     ` Mark Nelson
  2013-04-18 21:31       ` Plaetinck, Dieter
  2013-04-18 21:24     ` Noah Watkins
  2013-04-19  0:29     ` Christopher LILJENSTOLPE
  2 siblings, 1 reply; 19+ messages in thread
From: Mark Nelson @ 2013-04-18 21:09 UTC (permalink / raw)
  To: Josh Durgin; +Cc: Sage Weil, Plaetinck, Dieter, ceph-devel, cdl, danm

On 04/18/2013 04:08 PM, Josh Durgin wrote:
> On 04/18/2013 01:47 PM, Sage Weil wrote:
>> On Thu, 18 Apr 2013, Plaetinck, Dieter wrote:
>>> sorry to bring this up again, googling revealed some people don't
>>> like the subject [anymore].
>>>
>>> but I'm working on a new +- 3PB cluster for storage of immutable files.
>>> and it would be either all cold data, or mostly cold. 150MB avg
>>> filesize, max size 5GB (for now)
>>> For this use case, my impression is erasure coding would make a lot
>>> of sense
>>> (though I'm not sure about the computational overhead on storing and
>>> loading objects..? outbound traffic would peak at 6 Gbps, but I can
>>> make it way less and still keep a large cluster, by taking away the
>>> small set of hot files.
>>> inbound traffic would be minimal)
>>>
>>> I know that the answer a while ago was "no plans to implement erasure
>>> coding", has this changed?
>>> if not, is anyone aware of a similar system that does support it? I
>>> found QFS but that's meant for batch processing, has a single
>>> 'namenode' etc.
>>
>> We would love to do it, but it is not a priority at the moment (things
>> like multi-site replication are in much higher demand).  That of course
>> doesn't prevent someone outside of Inktank from working on it :)
>>
>> The main caveat is that it will be complicated.  For an initial
>> implementation, the full breadth of the rados API probably wouldn't be
>> supported for erasure/parity encoded pools (things like rados classes and
>> the omap key/value api get tricky when you start talking about parity).
>> But for many (or even most) use cases, objects are just bytes, and those
>> restrictions are just fine.
>
> I talked to some folks interested in doing a more limited form of this
> yesterday. They started a blueprint [1]. One of their ideas was to have
> erasure coding done by a separate process (or thread perhaps). It would
> use erasure coding on an object and then use librados to store the
> erasure-encoded pieces in a separate pool, and finally leave a marker in
> place of the original object in the first pool.
>
> When the osd detected this marker, it would proxy the request to the
> erasure coding thread/process which would service the request on the
> second pool for reads, and potentially make writes move the data back to
> the first pool in a tiering sort of scenario.
>
> I might have misremembered some details, but I think it's an
> interesting way to get many of the benefits of erasure coding with a
> relatively small amount of work compared to a fully native osd solution.
>
> Josh

Neat. :)

>
> [1]
> http://wiki.ceph.com/01Planning/02Blueprints/Dumpling/Erasure_encoding_as_a_storage_backend
>


^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: erasure coding (sorry)
  2013-04-18 21:08   ` Josh Durgin
  2013-04-18 21:09     ` Mark Nelson
@ 2013-04-18 21:24     ` Noah Watkins
  2013-04-18 21:26       ` Sage Weil
  2013-04-19  0:34       ` Christopher LILJENSTOLPE
  2013-04-19  0:29     ` Christopher LILJENSTOLPE
  2 siblings, 2 replies; 19+ messages in thread
From: Noah Watkins @ 2013-04-18 21:24 UTC (permalink / raw)
  To: Josh Durgin; +Cc: Sage Weil, Plaetinck, Dieter, ceph-devel, cdl, danm


On Apr 18, 2013, at 2:08 PM, Josh Durgin <josh.durgin@inktank.com> wrote:

> I talked to some folks interested in doing a more limited form of this
> yesterday. They started a blueprint [1]. One of their ideas was to have
> erasure coding done by a separate process (or thread perhaps). It would
> use erasure coding on an object and then use librados to store the
> erasure-encoded pieces in a separate pool, and finally leave a marker in
> place of the original object in the first pool.

This sounds at a high-level similar to work out of Microsoft:

  https://www.usenix.org/system/files/conference/atc12/atc12-final181_0.pdf

The basic idea is to replicate first, then erasure code in the background.

- Noah

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: erasure coding (sorry)
  2013-04-18 21:24     ` Noah Watkins
@ 2013-04-18 21:26       ` Sage Weil
  2013-04-19  0:47         ` Christopher LILJENSTOLPE
  2013-04-19  0:34       ` Christopher LILJENSTOLPE
  1 sibling, 1 reply; 19+ messages in thread
From: Sage Weil @ 2013-04-18 21:26 UTC (permalink / raw)
  To: Noah Watkins; +Cc: Josh Durgin, Plaetinck, Dieter, ceph-devel, cdl, danm

On Thu, 18 Apr 2013, Noah Watkins wrote:
> On Apr 18, 2013, at 2:08 PM, Josh Durgin <josh.durgin@inktank.com> wrote:
> 
> > I talked to some folks interested in doing a more limited form of this
> > yesterday. They started a blueprint [1]. One of their ideas was to have
> > erasure coding done by a separate process (or thread perhaps). It would
> > use erasure coding on an object and then use librados to store the
> > erasure-encoded pieces in a separate pool, and finally leave a marker in
> > place of the original object in the first pool.
> 
> This sounds at a high-level similar to work out of Microsoft:
> 
>   https://www.usenix.org/system/files/conference/atc12/atc12-final181_0.pdf
> 
> The basic idea is to replicate first, then erasure code in the background.

FWIW, I think a useful (and generic) concept to add to rados would be a 
redirect symlink sort of thing that says "oh, this object is over there in
that other pool", such that client requests will be transparently 
redirected or proxied.  This will enable generic tiering type operations, 
and probably simplify/enable migration without a lot of additional 
complexity on the client side.
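
A minimal sketch of that redirect idea, assuming plain dictionaries stand in for pools. The `Redirect` class and the `read` helper are hypothetical names for illustration only; nothing here is an existing rados interface.

```python
class Redirect:
    """Stub stored in place of an object: 'the data lives in that other pool'."""

    def __init__(self, pool, name):
        self.pool = pool
        self.name = name


def read(pools, pool, name, max_hops=4):
    """Read an object, transparently following redirect stubs.

    `pools` maps a pool name to a dict of object name -> bytes or Redirect.
    """
    for _ in range(max_hops):
        value = pools[pool][name]
        if not isinstance(value, Redirect):
            return value
        # Follow the stub to wherever the object actually lives.
        pool, name = value.pool, value.name
    raise RuntimeError("too many redirects")


# Toy setup: the payload was migrated to "cold", leaving a stub in "hot".
pools = {"hot": {}, "cold": {}}
pools["cold"]["obj1"] = b"payload"
pools["hot"]["obj1"] = Redirect("cold", "obj1")

print(read(pools, "hot", "obj1"))
```

The caller asks the "hot" pool either way, which is the point: migration or tiering only has to swap an object for a stub, and the hop limit guards against redirect loops.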

sage

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: erasure coding (sorry)
  2013-04-18 21:09     ` Mark Nelson
@ 2013-04-18 21:31       ` Plaetinck, Dieter
  2013-04-19  0:33         ` Christopher LILJENSTOLPE
  2013-04-22  7:23         ` Christopher LILJENSTOLPE
  0 siblings, 2 replies; 19+ messages in thread
From: Plaetinck, Dieter @ 2013-04-18 21:31 UTC (permalink / raw)
  To: Mark Nelson; +Cc: Josh Durgin, Sage Weil, ceph-devel, cdl, danm

On Thu, 18 Apr 2013 16:09:52 -0500
Mark Nelson <mark.nelson@inktank.com> wrote:

> On 04/18/2013 04:08 PM, Josh Durgin wrote:
> > On 04/18/2013 01:47 PM, Sage Weil wrote:
> >> On Thu, 18 Apr 2013, Plaetinck, Dieter wrote:
> >>> sorry to bring this up again, googling revealed some people don't
> >>> like the subject [anymore].
> >>>
> >>> but I'm working on a new +- 3PB cluster for storage of immutable files.
> >>> and it would be either all cold data, or mostly cold. 150MB avg
> >>> filesize, max size 5GB (for now)
> >>> For this use case, my impression is erasure coding would make a lot
> >>> of sense
> >>> (though I'm not sure about the computational overhead on storing and
> >>> loading objects..? outbound traffic would peak at 6 Gbps, but I can
> >>> make it way less and still keep a large cluster, by taking away the
> >>> small set of hot files.
> >>> inbound traffic would be minimal)
> >>>
> >>> I know that the answer a while ago was "no plans to implement erasure
> >>> coding", has this changed?
> >>> if not, is anyone aware of a similar system that does support it? I
> >>> found QFS but that's meant for batch processing, has a single
> >>> 'namenode' etc.
> >>
> >> We would love to do it, but it is not a priority at the moment (things
> >> like multi-site replication are in much higher demand).  That of course
> >> doesn't prevent someone outside of Inktank from working on it :)
> >>
> >> The main caveat is that it will be complicated.  For an initial
> >> implementation, the full breadth of the rados API probably wouldn't be
> >> supported for erasure/parity encoded pools (things like rados classes and
> >> the omap key/value api get tricky when you start talking about parity).
> >> But for many (or even most) use cases, objects are just bytes, and those
> >> restrictions are just fine.
> >
> > I talked to some folks interested in doing a more limited form of this
> > yesterday. They started a blueprint [1]. One of their ideas was to have
> > erasure coding done by a separate process (or thread perhaps). It would
> > use erasure coding on an object and then use librados to store the
> > erasure-encoded pieces in a separate pool, and finally leave a marker in
> > place of the original object in the first pool.
> >
> > When the osd detected this marker, it would proxy the request to the
> > erasure coding thread/process which would service the request on the
> > second pool for reads, and potentially make writes move the data back to
> > the first pool in a tiering sort of scenario.
> >
> > I might have misremembered some details, but I think it's an
> > interesting way to get many of the benefits of erasure coding with a
> > relatively small amount of work compared to a fully native osd solution.
> >
> > Josh
> 
> Neat. :)
> 

@Bryan: I did come across cleversafe.  all the articles around it seemed promising,
but unfortunately it seems everything related to the cleversafe open source project
somehow vanished from the internet.  (e.g. http://www.cleversafe.org/) quite weird...

@Sage: interesting. I thought it would be relatively simple if one assumes
the restriction of immutable files.  I'm not familiar with the ceph specifics you're mentioning.
When building an erasure-code-based system, maybe there are ways to reuse existing ceph
code and/or allow some integration with replication-based objects, without aiming for full integration or
full support of the rados api, based on some tradeoffs.

@Josh, that sounds like an interesting approach.  Too bad that page doesn't contain any information yet :)

Dieter

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: erasure coding (sorry)
  2013-04-18 21:08   ` Josh Durgin
  2013-04-18 21:09     ` Mark Nelson
  2013-04-18 21:24     ` Noah Watkins
@ 2013-04-19  0:29     ` Christopher LILJENSTOLPE
  2 siblings, 0 replies; 19+ messages in thread
From: Christopher LILJENSTOLPE @ 2013-04-19  0:29 UTC (permalink / raw)
  To: Josh Durgin; +Cc: Sage Weil, Plaetinck, Dieter, ceph-devel, danm


Supposedly, on 2013-Apr-18, at 14.08 PDT(-0700), someone claiming to be Josh Durgin scribed:

> On 04/18/2013 01:47 PM, Sage Weil wrote:
>> On Thu, 18 Apr 2013, Plaetinck, Dieter wrote:
>>> sorry to bring this up again, googling revealed some people don't like the subject [anymore].
>>>
>>> but I'm working on a new +- 3PB cluster for storage of immutable files.
>>> and it would be either all cold data, or mostly cold. 150MB avg filesize, max size 5GB (for now)
>>> For this use case, my impression is erasure coding would make a lot of sense
>>> (though I'm not sure about the computational overhead on storing and loading objects..? outbound traffic would peak at 6 Gbps, but I can make it way less and still keep a large cluster, by taking away the small set of hot files.
>>> inbound traffic would be minimal)
>>>
>>> I know that the answer a while ago was "no plans to implement erasure coding", has this changed?
>>> if not, is anyone aware of a similar system that does support it? I found QFS but that's meant for batch processing, has a single 'namenode' etc.
>>
>> We would love to do it, but it is not a priority at the moment (things
>> like multi-site replication are in much higher demand).  That of course
>> doesn't prevent someone outside of Inktank from working on it :)
>>
>> The main caveat is that it will be complicated.  For an initial
>> implementation, the full breadth of the rados API probably wouldn't be
>> supported for erasure/parity encoded pools (things like rados classes and
>> the omap key/value api get tricky when you start talking about parity).
>> But for many (or even most) use cases, objects are just bytes, and those
>> restrictions are just fine.
>
> I talked to some folks interested in doing a more limited form of this
> yesterday. They started a blueprint [1]. One of their ideas was to have
> erasure coding done by a separate process (or thread perhaps). It would
> use erasure coding on an object and then use librados to store the
> erasure-encoded pieces in a separate pool, and finally leave a marker in
> place of the original object in the first pool.
>
> When the osd detected this marker, it would proxy the request to the
> erasure coding thread/process which would service the request on the
> second pool for reads, and potentially make writes move the data back to
> the first pool in a tiering sort of scenario.
>
> I might have misremembered some details, but I think it's an
> interesting way to get many of the benefits of erasure coding with a relatively small amount of work compared to a fully native osd solution.

Greetings,

	I'm one of those individuals :)  Our thinking is evolving on this, and I think we can keep most of the work out of the main machinery of ceph and simply require a modified client that runs the "proxy" function on the "hot" pool OSDs.  I'm even wondering whether it could be prototyped in fuse.  I will be writing this up in the next day or two in the blueprint below.  Josh has the idea basically correct.

>
> Josh

Christopher

>
> [1] http://wiki.ceph.com/01Planning/02Blueprints/Dumpling/Erasure_encoding_as_a_storage_backend


--
李柯睿
Check my PGP key here: https://www.asgaard.org/~cdl/cdl.asc
Current vCard here: https://www.asgaard.org/~cdl/cdl.vcf
Check my calendar availability: https://tungle.me/cdl


^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: erasure coding (sorry)
  2013-04-18 21:31       ` Plaetinck, Dieter
@ 2013-04-19  0:33         ` Christopher LILJENSTOLPE
  2013-04-22  7:23         ` Christopher LILJENSTOLPE
  1 sibling, 0 replies; 19+ messages in thread
From: Christopher LILJENSTOLPE @ 2013-04-19  0:33 UTC (permalink / raw)
  To: Plaetinck, Dieter; +Cc: Mark Nelson, Josh Durgin, Sage Weil, ceph-devel, danm


Supposedly, on 2013-Apr-18, at 14.31 PDT(-0700), someone claiming to be Plaetinck, Dieter scribed:

> On Thu, 18 Apr 2013 16:09:52 -0500
> Mark Nelson <mark.nelson@inktank.com> wrote:
>
>> On 04/18/2013 04:08 PM, Josh Durgin wrote:
>>> On 04/18/2013 01:47 PM, Sage Weil wrote:
>>>> On Thu, 18 Apr 2013, Plaetinck, Dieter wrote:
>>>>> sorry to bring this up again, googling revealed some people don't
>>>>> like the subject [anymore].
>>>>>
>>>>> but I'm working on a new +- 3PB cluster for storage of immutable files.
>>>>> and it would be either all cold data, or mostly cold. 150MB avg
>>>>> filesize, max size 5GB (for now)
>>>>> For this use case, my impression is erasure coding would make a lot
>>>>> of sense
>>>>> (though I'm not sure about the computational overhead on storing and
>>>>> loading objects..? outbound traffic would peak at 6 Gbps, but I can
>>>>> make it way less and still keep a large cluster, by taking away the
>>>>> small set of hot files.
>>>>> inbound traffic would be minimal)
>>>>>
>>>>> I know that the answer a while ago was "no plans to implement erasure
>>>>> coding", has this changed?
>>>>> if not, is anyone aware of a similar system that does support it? I
>>>>> found QFS but that's meant for batch processing, has a single
>>>>> 'namenode' etc.
>>>>
>>>> We would love to do it, but it is not a priority at the moment (things
>>>> like multi-site replication are in much higher demand).  That of course
>>>> doesn't prevent someone outside of Inktank from working on it :)
>>>>
>>>> The main caveat is that it will be complicated.  For an initial
>>>> implementation, the full breadth of the rados API probably wouldn't be
>>>> supported for erasure/parity encoded pools (things like rados classes and
>>>> the omap key/value api get tricky when you start talking about parity).
>>>> But for many (or even most) use cases, objects are just bytes, and those
>>>> restrictions are just fine.
>>>
>>> I talked to some folks interested in doing a more limited form of this
>>> yesterday. They started a blueprint [1]. One of their ideas was to have
>>> erasure coding done by a separate process (or thread perhaps). It would
>>> use erasure coding on an object and then use librados to store the
>>> erasure-encoded pieces in a separate pool, and finally leave a marker in
>>> place of the original object in the first pool.
>>>
>>> When the osd detected this marker, it would proxy the request to the
>>> erasure coding thread/process which would service the request on the
>>> second pool for reads, and potentially make writes move the data back to
>>> the first pool in a tiering sort of scenario.
>>>
>>> I might have misremembered some details, but I think it's an
>>> interesting way to get many of the benefits of erasure coding with a
>>> relatively small amount of work compared to a fully native osd solution.
>>>
>>> Josh
>>
>> Neat. :)
>>
>
> @Bryan: I did come across cleversafe.  all the articles around it seemed promising,
> but unfortunately it seems everything related to the cleversafe open source project
> somehow vanished from the internet.  (e.g. http://www.cleversafe.org/) quite weird...

Yeah - in a previous incarnation I looked at cleversafe to do something similar a few years ago.  It is odd that the cleversafe.org stuff did disappear.  However, tahoe-lafs also does encoding, and their package (zfec) [1] may be leverageable.

>
> @Sage: interesting. I thought it would be relatively simple if one assumes
> the restriction of immutable files.  I'm not familiar with the ceph specifics you're mentioning.
> When building an erasure-code-based system, maybe there are ways to reuse existing ceph
> code and/or allow some integration with replication-based objects, without aiming for full integration or
> full support of the rados api, based on some tradeoffs.

I think this might sit UNDER the rados api.  I would certainly want to leverage CRUSH to place the shards, however (great tool, no reason to re-invent the wheel).
>
> @Josh, that sounds like an interesting approach.  Too bad that page doesn't contain any information yet :)

Give me time :) - openstack has kept me a bit busy…  May also be a factor of "design at keyboard" :)

>
> Dieter

Christopher


[1] https://tahoe-lafs.org/trac/zfec


^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: erasure coding (sorry)
  2013-04-18 21:24     ` Noah Watkins
  2013-04-18 21:26       ` Sage Weil
@ 2013-04-19  0:34       ` Christopher LILJENSTOLPE
  1 sibling, 0 replies; 19+ messages in thread
From: Christopher LILJENSTOLPE @ 2013-04-19  0:34 UTC (permalink / raw)
  To: Noah Watkins; +Cc: Josh Durgin, Sage Weil, Plaetinck, Dieter, ceph-devel, danm


Supposedly, on 2013-Apr-18, at 14.24 PDT(-0700), someone claiming to be Noah Watkins scribed:

> On Apr 18, 2013, at 2:08 PM, Josh Durgin <josh.durgin@inktank.com> wrote:
>
>> I talked to some folks interested in doing a more limited form of this
>> yesterday. They started a blueprint [1]. One of their ideas was to have
>> erasure coding done by a separate process (or thread perhaps). It would
>> use erasure coding on an object and then use librados to store the
>> erasure-encoded pieces in a separate pool, and finally leave a marker in
>> place of the original object in the first pool.
>
> This sounds at a high-level similar to work out of Microsoft:

I've looked at that, and it would be somewhat similar (not completely, but it would borrow some ideas).

	Christopher

>
> https://www.usenix.org/system/files/conference/atc12/atc12-final181_0.pdf
>
> The basic idea is to replicate first, then erasure code in the background.
>
> - Noah



^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: erasure coding (sorry)
  2013-04-18 21:26       ` Sage Weil
@ 2013-04-19  0:47         ` Christopher LILJENSTOLPE
  2013-04-21  2:37           ` Loic Dachary
  0 siblings, 1 reply; 19+ messages in thread
From: Christopher LILJENSTOLPE @ 2013-04-19  0:47 UTC (permalink / raw)
  To: Sage Weil; +Cc: Noah Watkins, Josh Durgin, Plaetinck, Dieter, ceph-devel, danm


Supposedly, on 2013-Apr-18, at 14.26 PDT(-0700), someone claiming to be Sage Weil scribed:

> On Thu, 18 Apr 2013, Noah Watkins wrote:
>> On Apr 18, 2013, at 2:08 PM, Josh Durgin <josh.durgin@inktank.com> wrote:
>>
>>> I talked to some folks interested in doing a more limited form of this
>>> yesterday. They started a blueprint [1]. One of their ideas was to have
>>> erasure coding done by a separate process (or thread perhaps). It would
>>> use erasure coding on an object and then use librados to store the
>>> erasure-encoded pieces in a separate pool, and finally leave a marker in
>>> place of the original object in the first pool.
>>
>> This sounds at a high-level similar to work out of Microsoft:
>>
>> https://www.usenix.org/system/files/conference/atc12/atc12-final181_0.pdf
>>
>> The basic idea is to replicate first, then erasure code in the background.
>
> FWIW, I think a useful (and generic) concept to add to rados would be a
> redirect symlink sort of thing that says "oh, this object is over there in
> that other pool", such that client requests will be transparently
> redirected or proxied.  This will enable generic tiering type operations,
> and probably simplify/enable migration without a lot of additional
> complexity on the client side.

More to come, but I'm starting to think of a union mount of a fuse "re-directing" overlay.  The quick idea:

On the "hot" pool, the OSDs would write to the host FS as usual.  However, that FS is actually a lightweight fuse fs (at least for the prototype) that passes almost everything straight down to the underlying file system.  As the OSD hits a capacity high-water mark (HWM), a watcher (an asynchronous process) starts "evicting" objects from the OSD.  It does that by using a modified ceph client that calls zfec and uses CRUSH to place the resulting shards in the "cool" pool.  Once those are committed, it replaces the object in the "hot" OSD with a special token.  This is repeated until a low-water mark (LWM) is reached.  When the OSD gets a read request for that object and the fuse shim sees the token, it knows to do a modified client fetch from the "cool" pool instead.  It returns the resulting object to the original requester and (potentially) stores the object back in the "hot" OSD (if you want cache-like performance), replacing the token.  If necessary, some other object may in turn get evicted if the HWM is breached again.
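
The watermark loop described above could be sketched roughly as follows. This is a toy model under stated assumptions: `HotStore`, `TOKEN` and the oldest-first eviction order are all hypothetical, a dictionary stands in for the "cool" pool, and the zfec encoding and CRUSH placement steps are reduced to a plain copy.

```python
TOKEN = object()  # stand-in for the "special token" left in the hot store


class HotStore:
    """Toy model of one "hot" OSD with HWM/LWM-driven eviction."""

    def __init__(self, capacity, hwm=0.8, lwm=0.5):
        self.hwm = hwm * capacity   # start evicting above this many bytes
        self.lwm = lwm * capacity   # stop evicting at or below this many bytes
        self.objects = {}           # name -> bytes or TOKEN (insertion ordered)
        self.cool = {}              # stand-in for the "cool" pool

    def used(self):
        """Bytes held locally; tokens are treated as taking no space."""
        return sum(len(v) for v in self.objects.values() if v is not TOKEN)

    def write(self, name, data):
        self.objects[name] = data
        if self.used() > self.hwm:
            self.evict()

    def evict(self):
        """Move objects out, oldest first (an assumption), until the LWM."""
        for name, data in list(self.objects.items()):
            if self.used() <= self.lwm:
                break
            if data is TOKEN:
                continue
            # The described design would erasure-code the object and place
            # the shards via CRUSH here; this sketch just copies it.
            self.cool[name] = data
            self.objects[name] = TOKEN

    def read(self, name):
        data = self.objects[name]
        return self.cool[name] if data is TOKEN else data


store = HotStore(capacity=100)
for key in ("a", "b", "c"):
    store.write(key, key.encode() * 30)  # three 30-byte objects

print(sorted(store.cool))  # names that were evicted to the cool store
```

The optional promotion back into the "hot" OSD on read is left out to keep the sketch short; it would amount to a `write` after the fetch, which may itself trigger another eviction pass.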

We would also need to modify the repair mechanism for the deep scrub in the "cool" pool to account for the repair being a re-constitution of an invalid shard, rather than a copy (as there is only one copy of a given shard).
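
To illustrate why repair becomes reconstruction rather than copy, here is a sketch using a single XOR parity shard. That code is a deliberately simple stand-in chosen for brevity, not what zfec implements, and it can rebuild only one lost shard; `encode` and `repair` are hypothetical helper names.

```python
from functools import reduce


def xor_bytes(a, b):
    """Byte-wise XOR of two equal-length byte strings."""
    return bytes(x ^ y for x, y in zip(a, b))


def encode(data, k):
    """Split data into k equal-size shards (zero-padded) plus one XOR parity shard."""
    shard_len = -(-len(data) // k)  # ceiling division
    padded = data.ljust(shard_len * k, b"\0")
    shards = [padded[i * shard_len:(i + 1) * shard_len] for i in range(k)]
    shards.append(reduce(xor_bytes, shards))
    return shards


def repair(shards):
    """Rebuild the single missing shard (marked None) from the survivors."""
    missing = [i for i, s in enumerate(shards) if s is None]
    if len(missing) != 1:
        raise ValueError("single parity can rebuild exactly one shard")
    survivors = [s for s in shards if s is not None]
    rebuilt = list(shards)
    # There is no second copy to read: the shard is recomputed from the rest.
    rebuilt[missing[0]] = reduce(xor_bytes, survivors)
    return rebuilt


shards = encode(b"immutable object", 4)
damaged = list(shards)
damaged[2] = None  # pretend scrub found this shard invalid
print(repair(damaged) == shards)
```

A scrub in the "cool" pool would therefore need to read the surviving shards of the same object and recompute the bad one, which costs more I/O than copying a replica.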

I'll get a bit more of a write-up today, hopefully, in the wiki.

	Christopher

>
> sage


^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: erasure coding (sorry)
  2013-04-19  0:47         ` Christopher LILJENSTOLPE
@ 2013-04-21  2:37           ` Loic Dachary
  0 siblings, 0 replies; 19+ messages in thread
From: Loic Dachary @ 2013-04-21  2:37 UTC (permalink / raw)
  To: Christopher LILJENSTOLPE; +Cc: ceph-devel


Hi Christopher,

I would like to offer my help on this blueprint. In

http://wiki.ceph.com/01Planning/02Blueprints/Dumpling/Erasure_encoding_as_a_storage_backend

you wrote 

"At this time, Annai is more than willing to help with this, but we don't have the resources (including ceph coders) to materially contribute code. "

and I can work on the coding part.

Cheers

On 04/19/2013 02:47 AM, Christopher LILJENSTOLPE wrote:
> Supposedly, on 2013-Apr-18, at 14.26 PDT(-0700), someone claiming to be Sage Weil scribed:
> 
>> On Thu, 18 Apr 2013, Noah Watkins wrote:
>>> On Apr 18, 2013, at 2:08 PM, Josh Durgin <josh.durgin@inktank.com> wrote:
>>>
>>>> I talked to some folks interested in doing a more limited form of this
>>>> yesterday. They started a blueprint [1]. One of their ideas was to have
>>>> erasure coding done by a separate process (or thread perhaps). It would
>>>> use erasure coding on an object and then use librados to store the
>>>> erasure-encoded pieces in a separate pool, and finally leave a marker in
>>>> place of the original object in the first pool.
>>>
>>> This sounds at a high-level similar to work out of Microsoft:
>>>
>>> https://www.usenix.org/system/files/conference/atc12/atc12-final181_0.pdf
>>>
>>> The basic idea is to replicate first, then erasure code in the background.
>>
>> FWIW, I think a useful (and generic) concept to add to rados would be a
>> redirect symlink sort of thing that says "oh, this object is over there in
>> that other pool", such that client requests will be transparently
>> redirected or proxied.  This will enable generic tiering type operations,
>> and probably simplify/enable migration without a lot of additional
>> complexity on the client side.
> 
> More to come, but I'm starting to think of a union mount of a fuse "re-directing" overlay.  The quick idea.
> 
> On the "hot" pool, the OSD's would write to the host FS as usual.  However, that FS is actually a light-weight fuse (at least for prototype) fs that passes almost everything right down to the file system.  As the OSD hits a capacity HWM, a watcher (asynchronous process), starts "evicting" objects from the OSD.  It does that by using a modified ceph client that calls zfec and uses CRUSH to place the resulting shards in the "cool" pool.  Once those are committed, it replaces the object in the "hot" OSD with a special token. This is repeated until a LWM is reached.  When the OSD gets a read request for that object, when the fuse shim sees the token, it knows to actually do a modified client fetch from the "cool" pool.  It returns the resulting object to the original requester and (potentially) stores the object back in the "hot" OSD (if you want a cache-like performance), replacing the token.  If necessary, some other object may get, in turn, evicted if the HWM is again breached.
> 
> We would also need to modify the repair mechanism for the deep scrub in the "cool" pool to account for the repair being a re-constitution of an invalid shard, rather than a copy (as there is only one copy of a given shard).
> 
> I'll get a bit more of a write-up today, hopefully, in the wiki.
> 
> 	Christopher
> 
>>
>> sage
> 
> 
> --
> 李柯睿
> Check my PGP key here: https://www.asgaard.org/~cdl/cdl.asc
> Current vCard here: https://www.asgaard.org/~cdl/cdl.vcf
> Check my calendar availability: https://tungle.me/cdl

-- 
Loïc Dachary, Artisan Logiciel Libre



* Re: erasure coding (sorry)
  2013-04-18 21:31       ` Plaetinck, Dieter
  2013-04-19  0:33         ` Christopher LILJENSTOLPE
@ 2013-04-22  7:23         ` Christopher LILJENSTOLPE
  2013-04-22  8:10           ` Loic Dachary
  1 sibling, 1 reply; 19+ messages in thread
From: Christopher LILJENSTOLPE @ 2013-04-22  7:23 UTC (permalink / raw)
  To: Plaetinck, Dieter; +Cc: Mark Nelson, Josh Durgin, Sage Weil, ceph-devel, danm

Supposedly, on 2013-Apr-18, at 14.31 PDT(-0700), someone claiming to be Plaetinck, Dieter scribed:

> On Thu, 18 Apr 2013 16:09:52 -0500
> Mark Nelson <mark.nelson@inktank.com> wrote:

>>
>
> @Bryan: I did come across cleversafe.  all the articles around it seemed promising,
> but unfortunately it seems everything related to the cleversafe open source project
> somehow vanished from the internet.  (e.g. http://www.cleversafe.org/) quite weird...
>
> @Sage: interesting. I thought it would be more relatively simple if one assumes
> the restriction of immutable files.  I'm not familiar with those ceph specifics you're mentioning.
> When building an erasure codes-based system, maybe there's ways to reuse existing ceph
> code and/or allow some integration with replication based objects, without aiming for full integration or
> full support of the rados api, based on some tradeoffs.
>
> @Josh, that sounds like an interesting approach.  Too bad that page doesn't contain any information yet :)

Greetings - it does now - see what you all think…

	Christopher

>
> Dieter


--
李柯睿
Check my PGP key here: https://www.asgaard.org/~cdl/cdl.asc
Current vCard here: https://www.asgaard.org/~cdl/cdl.vcf
Check my calendar availability: https://tungle.me/cdl


* Re: erasure coding (sorry)
  2013-04-22  7:23         ` Christopher LILJENSTOLPE
@ 2013-04-22  8:10           ` Loic Dachary
  2013-04-22 14:08             ` Christopher LILJENSTOLPE
  0 siblings, 1 reply; 19+ messages in thread
From: Loic Dachary @ 2013-04-22  8:10 UTC (permalink / raw)
  To: Christopher LILJENSTOLPE; +Cc: ceph-devel

Hi Christopher,

You wrote "A modified client/library could be used to store objects that should be sharded, vs "standard" ceph treatment.  In this model, each shard would be written to a separate PG, and each PG would be stored on exactly one OSD.  " but there is no way for a client to enforce the fact that two objects are stored in separate PGs.

Am I missing something ?

On 04/22/2013 09:23 AM, Christopher LILJENSTOLPE wrote:
> Supposedly, on 2013-Apr-18, at 14.31 PDT(-0700), someone claiming to be Plaetinck, Dieter scribed:
> 
>> On Thu, 18 Apr 2013 16:09:52 -0500
>> Mark Nelson <mark.nelson@inktank.com> wrote:
> 
>>>
>>
>> @Bryan: I did come across cleversafe.  all the articles around it seemed promising,
>> but unfortunately it seems everything related to the cleversafe open source project
>> somehow vanished from the internet.  (e.g. http://www.cleversafe.org/) quite weird...
>>
>> @Sage: interesting. I thought it would be more relatively simple if one assumes
>> the restriction of immutable files.  I'm not familiar with those ceph specifics you're mentioning.
>> When building an erasure codes-based system, maybe there's ways to reuse existing ceph
>> code and/or allow some integration with replication based objects, without aiming for full integration or
>> full support of the rados api, based on some tradeoffs.
>>
>> @Josh, that sounds like an interesting approach.  Too bad that page doesn't contain any information yet :)
> 
> Greetings - it does now - see what you all think…
> 
> 	Christopher
> 
>>
>> Dieter
> 
> 
> --
> 李柯睿
> Check my PGP key here: https://www.asgaard.org/~cdl/cdl.asc
> Current vCard here: https://www.asgaard.org/~cdl/cdl.vcf
> Check my calendar availability: https://tungle.me/cdl

-- 
Loïc Dachary, Artisan Logiciel Libre



* Re: erasure coding (sorry)
  2013-04-22  8:10           ` Loic Dachary
@ 2013-04-22 14:08             ` Christopher LILJENSTOLPE
  2013-04-22 15:09               ` Sage Weil
  0 siblings, 1 reply; 19+ messages in thread
From: Christopher LILJENSTOLPE @ 2013-04-22 14:08 UTC (permalink / raw)
  To: Loic Dachary; +Cc: ceph-devel

Supposedly, on 2013-Apr-22, at 01.10 PDT(-0700), someone claiming to be Loic Dachary scribed:

> Hi Christopher,
>
> You wrote "A modified client/library could be used to store objects that should be sharded, vs "standard" ceph treatment.  In this model, each shard would be written to a separate PG, and each PG would be stored on exactly one OSD.  " but there is no way for a client to enforce the fact that two objects are stored in separate PGs.

Poorly worded.  The idea is that each shard becomes a separate object, and the encoder/sharder would use CRUSH to identify the OSDs to hold the shards.  However, the OSDs would treat the shard as an n=1 replication and just store it locally.

Actually, looking at this again this morning, this is harder than the preferred alternative (i.e. grafting an encode/decode step into the (e)OSD).  It was meant to cover the alternative approaches.  I didn't like this one to begin with, and it now appears to be both more difficult and non-deterministic in its placement.

One question on CRUSH (it's been too long since I read the paper): if x is the same for two objects and n=3 returns R={OSD18,OSD45,OSD97}, then if an object that matches x but has n=1 is handed to OSD45, would OSD45 store it, or would it forward it to OSD18 to store?  If it would forward it, this idea is DOA.  Also, if x is held invariant but n changes, does the same R set get returned (truncated to n members)?

	Thx
	Christopher



>
> Am I missing something ?
>
> On 04/22/2013 09:23 AM, Christopher LILJENSTOLPE wrote:
>> Supposedly, on 2013-Apr-18, at 14.31 PDT(-0700), someone claiming to be Plaetinck, Dieter scribed:
>>
>>> On Thu, 18 Apr 2013 16:09:52 -0500
>>> Mark Nelson <mark.nelson@inktank.com> wrote:
>>
>>>>
>>>
>>> @Bryan: I did come across cleversafe.  all the articles around it seemed promising,
>>> but unfortunately it seems everything related to the cleversafe open source project
>>> somehow vanished from the internet.  (e.g. http://www.cleversafe.org/) quite weird...
>>>
>>> @Sage: interesting. I thought it would be more relatively simple if one assumes
>>> the restriction of immutable files.  I'm not familiar with those ceph specifics you're mentioning.
>>> When building an erasure codes-based system, maybe there's ways to reuse existing ceph
>>> code and/or allow some integration with replication based objects, without aiming for full integration or
>>> full support of the rados api, based on some tradeoffs.
>>>
>>> @Josh, that sounds like an interesting approach.  Too bad that page doesn't contain any information yet :)
>>
>> Greetings - it does now - see what you all think…
>>
>> 	Christopher
>>
>>>
>>> Dieter
>>
>>
>> --
>> 李柯睿
>> Check my PGP key here: https://www.asgaard.org/~cdl/cdl.asc
>> Current vCard here: https://www.asgaard.org/~cdl/cdl.vcf
>> Check my calendar availability: https://tungle.me/cdl
>
> -- 
> Loïc Dachary, Artisan Logiciel Libre


--
李柯睿
Check my PGP key here: https://www.asgaard.org/~cdl/cdl.asc
Current vCard here: https://www.asgaard.org/~cdl/cdl.vcf
Check my calendar availability: https://tungle.me/cdl


* Re: erasure coding (sorry)
  2013-04-22 14:08             ` Christopher LILJENSTOLPE
@ 2013-04-22 15:09               ` Sage Weil
  2013-04-22 18:09                 ` Loic Dachary
  2013-04-24  4:35                 ` Christopher LILJENSTOLPE
  0 siblings, 2 replies; 19+ messages in thread
From: Sage Weil @ 2013-04-22 15:09 UTC (permalink / raw)
  To: Christopher LILJENSTOLPE; +Cc: Loic Dachary, ceph-devel

On Mon, 22 Apr 2013, Christopher LILJENSTOLPE wrote:
> Supposedly, on 2013-Apr-22, at 01.10 PDT(-0700), someone claiming to be Loic Dachary scribed:
> 
> > Hi Christopher,
> >
> > You wrote "A modified client/library could be used to store objects that should be sharded, vs "standard" ceph treatment.  In this model, each shard would be written to a separate PG, and each PG would be stored on exactly one OSD.  " but there is no way for a client to enforce the fact that two objects are stored in separate PGs.
> 
> Poorly worded.  The idea is that each shard becomes a separate object, and the encoder/sharder would use CRUSH to identify the OSDs to hold the shards.  However, the OSDs would treat the shard as an n=1 replication and just store it locally.
> 
> Actually, looking at this again this morning, this is harder than the preferred alternative (i.e. grafting an encode/decode step into the (e)OSD).  It was meant to cover the alternative approaches.  I didn't like this one to begin with, and it now appears to be both more difficult and non-deterministic in its placement.
> 
> One question on CRUSH (it's been too long since I read the paper): if x is the same for two objects and n=3 returns R={OSD18,OSD45,OSD97}, then if an object that matches x but has n=1 is handed to OSD45, would OSD45 store it, or would it forward it to OSD18 to store?  If it would forward it, this idea is DOA.  Also, if x is held invariant but n changes, does the same R set get returned (truncated to n members)?

It would go to osd18, the first item in the sequence that CRUSH generates.
           
As Loic observes, not having control of placement from above the librados
level makes this more or less a non-starter.  The only thing that might
work at that layer is to set up ~5 or more pools, each with a distinct set
of OSDs, and put each shard/fragment in a different pool.  I don't think  
that is a particularly good approach.

If we are going to do parity encoding (and I think we should!), I think we
should fully integrate it into the OSD.
           
The simplest approach:
           
 - we create a new PG type for 'parity' or 'erasure' or whatever (type    
   fields already exist)
 - those PGs use the parity ('INDEP') crush mode so that placement is
   intelligent
 - all reads and writes go to the 'primary'               
 - the primary does the shard encoding and distributes the write pieces to
   the other replicas
 - same for reads
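
The primary-driven flow listed above can be sketched roughly as follows. This is a toy, not Ceph code: it assumes the simplest possible code (k data shards plus one XOR parity shard, RAID4-style), and `primary_write`, `primary_read`, and the `acting` dict standing in for the other OSDs of the PG are all hypothetical names.

```python
from functools import reduce


def xor_bytes(a, b):
    """Byte-wise XOR of two equal-length buffers."""
    return bytes(x ^ y for x, y in zip(a, b))


def encode(obj, k):
    """Split obj into k zero-padded data shards and append one XOR parity shard."""
    shard_len = -(-len(obj) // k) or 1  # ceiling division, at least one byte
    padded = obj.ljust(shard_len * k, b"\0")
    data = [padded[i * shard_len:(i + 1) * shard_len] for i in range(k)]
    return data + [reduce(xor_bytes, data)]


def decode(shards, k, size):
    """Rebuild the object from k+1 shard slots, at most one of which is None."""
    missing = [i for i, s in enumerate(shards) if s is None]
    if len(missing) > 1:
        raise ValueError("single parity can repair only one lost shard")
    if missing:
        shards = list(shards)
        present = [s for s in shards if s is not None]
        # XOR of all surviving shards equals the missing one.
        shards[missing[0]] = reduce(xor_bytes, present)
    return b"".join(shards[:k])[:size]


def primary_write(acting, obj, k):
    """Primary encodes the object and hands one shard to each rank (hypothetical)."""
    for rank, shard in enumerate(encode(obj, k)):
        acting[rank] = shard
    return len(obj)


def primary_read(acting, k, size):
    """Primary gathers the shards, repairing one missing rank if needed."""
    return decode([acting.get(rank) for rank in range(k + 1)], k, size)
```

With k=4 the primary stores five shards; dropping any one entry from `acting` still lets `primary_read` return the original bytes, while dropping two raises `ValueError`. A real implementation would use a proper erasure code and would have to track the object size and shard versions, which this sketch passes around by hand.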
           
There will be a pile of patches to move code around between PG and 
ReplicatedPG, which will be annoying, but hopefully not too painful.  The 
class structure and data types were set up with this in mind long ago.

Several key challenges:

 - come up with a scheme for internal naming to keep shards distinct
 - safely rewriting a stripe when there is a partial overwrite.  probably 
   want to write new stripes to distinct new objects (cloning old data as 
   needed) and clean up the old ones once enough copies are present.
 - recovery logic

sage


> 
> 	Thx
> 	Christopher
> 
> 
> 
> >
> > Am I missing something ?
> >
> > On 04/22/2013 09:23 AM, Christopher LILJENSTOLPE wrote:
> >> Supposedly, on 2013-Apr-18, at 14.31 PDT(-0700), someone claiming to be Plaetinck, Dieter scribed:
> >>
> >>> On Thu, 18 Apr 2013 16:09:52 -0500
> >>> Mark Nelson <mark.nelson@inktank.com> wrote:
> >>
> >>>>
> >>>
> >>> @Bryan: I did come across cleversafe.  all the articles around it seemed promising,
> >>> but unfortunately it seems everything related to the cleversafe open source project
> >>> somehow vanished from the internet.  (e.g. http://www.cleversafe.org/) quite weird...
> >>>
> >>> @Sage: interesting. I thought it would be more relatively simple if one assumes
> >>> the restriction of immutable files.  I'm not familiar with those ceph specifics you're mentioning.
> >>> When building an erasure codes-based system, maybe there's ways to reuse existing ceph
> >>> code and/or allow some integration with replication based objects, without aiming for full integration or
> >>> full support of the rados api, based on some tradeoffs.
> >>>
> >>> @Josh, that sounds like an interesting approach.  Too bad that page doesn't contain any information yet :)
> >>
> >> Greetings - it does now - see what you all think…
> >>
> >> 	Christopher
> >>
> >>>
> >>> Dieter
> >>
> >>
> >> --
> >> 李柯睿
> >> Check my PGP key here: https://www.asgaard.org/~cdl/cdl.asc
> >> Current vCard here: https://www.asgaard.org/~cdl/cdl.vcf
> >> Check my calendar availability: https://tungle.me/cdl
> >
> > -- 
> > Loïc Dachary, Artisan Logiciel Libre
> 
> 
> --
> 李柯睿
> Check my PGP key here: https://www.asgaard.org/~cdl/cdl.asc
> Current vCard here: https://www.asgaard.org/~cdl/cdl.vcf
> Check my calendar availability: https://tungle.me/cdl


* Re: erasure coding (sorry)
  2013-04-22 15:09               ` Sage Weil
@ 2013-04-22 18:09                 ` Loic Dachary
  2013-04-22 18:31                   ` Sage Weil
  2013-04-24  4:35                 ` Christopher LILJENSTOLPE
  1 sibling, 1 reply; 19+ messages in thread
From: Loic Dachary @ 2013-04-22 18:09 UTC (permalink / raw)
  To: Sage Weil; +Cc: ceph-devel

Hi Sage,

On 04/22/2013 05:09 PM, Sage Weil wrote:
> On Mon, 22 Apr 2013, Christopher LILJENSTOLPE wrote:
>> Supposedly, on 2013-Apr-22, at 01.10 PDT(-0700), someone claiming to be Loic Dachary scribed:
>>
>>> Hi Christopher,
>>>
>>> You wrote "A modified client/library could be used to store objects that should be sharded, vs "standard" ceph treatment.  In this model, each shard would be written to a separate PG, and each PG would be stored on exactly one OSD.  " but there is no way for a client to enforce the fact that two objects are stored in separate PGs.
>>
>> Poorly worded.  The idea is that each shard becomes a separate object, and the encoder/sharder would use CRUSH to identify the OSDs to hold the shards.  However, the OSDs would treat the shard as an n=1 replication and just store it locally.
>>
>> Actually, looking at this again this morning, this is harder than the preferred alternative (i.e. grafting an encode/decode step into the (e)OSD).  It was meant to cover the alternative approaches.  I didn't like this one to begin with, and it now appears to be both more difficult and non-deterministic in its placement.
>>
>> One question on CRUSH (it's been too long since I read the paper): if x is the same for two objects and n=3 returns R={OSD18,OSD45,OSD97}, then if an object that matches x but has n=1 is handed to OSD45, would OSD45 store it, or would it forward it to OSD18 to store?  If it would forward it, this idea is DOA.  Also, if x is held invariant but n changes, does the same R set get returned (truncated to n members)?
> 
> It would go to osd18, the first item in the sequence that CRUSH generates.
>            
> As Loic observes, not having control of placement from above the librados
> level makes this more or less a non-starter.  The only thing that might
> work at that layer is to set up ~5 or more pools, each with a distinct set
> of OSDs, and put each shard/fragment in a different pool.  I don't think  
> that is a particularly good approach.
> 
> If we are going to do parity encoding (and I think we should!), I think we
> should fully integrate it into the OSD.
>            
> The simplest approach:
>            
>  - we create a new PG type for 'parity' or 'erasure' or whatever (type    
>    fields already exist)
>  - those PGs use the parity ('INDEP') crush mode so that placement is
>    intelligent

I assume you do not mean CEPH_FEATURE_INDEP_PG_MAP as used in

https://github.com/ceph/ceph/blob/master/src/osd/OSD.cc#L5237

but CRUSH_RULE_CHOOSE_INDEP as used in

https://github.com/ceph/ceph/blob/master/src/crush/mapper.c#L331

when firstn == 0 because it was set in

https://github.com/ceph/ceph/blob/master/src/crush/mapper.c#L523

I see that it would be simpler to write

   step choose indep 0 type row

and then rely on intelligent placement. Is there a reason why it would not be possible to use firstn instead of indep ?

>  - all reads and writes go to the 'primary'               
>  - the primary does the shard encoding and distributes the write pieces to
>    the other replicas

Although I understand how that would work when a PG receives a CEPH_OSD_OP_WRITEFULL

https://github.com/ceph/ceph/blob/master/src/osd/ReplicatedPG.cc#L2504

It may be inconvenient and expensive to recompute the parity encoded version if an object is written with a series of CEPH_OSD_OP_WRITE operations. The simplest way would be to decode the existing object, modify it according to what CEPH_OSD_OP_WRITE requires, and re-encode it.
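
A minimal sketch of that decode/modify/encode cycle, assuming a toy code with k data shards plus one XOR parity shard. `partial_write` and the helpers are hypothetical names, not Ceph functions, and the object size is passed in by hand.

```python
from functools import reduce


def xor_bytes(a, b):
    """Byte-wise XOR of two equal-length buffers."""
    return bytes(x ^ y for x, y in zip(a, b))


def encode(obj, k):
    """Split obj into k zero-padded data shards and append one XOR parity shard."""
    shard_len = -(-len(obj) // k) or 1  # ceiling division, at least one byte
    padded = obj.ljust(shard_len * k, b"\0")
    data = [padded[i * shard_len:(i + 1) * shard_len] for i in range(k)]
    return data + [reduce(xor_bytes, data)]


def decode(shards, k, size):
    """Concatenate the k data shards and trim the padding."""
    return b"".join(shards[:k])[:size]


def partial_write(shards, k, size, offset, payload):
    """Naive handling of a small write: decode, patch, re-encode everything."""
    obj = bytearray(decode(shards, k, size))
    end = offset + len(payload)
    if end > len(obj):
        obj.extend(b"\0" * (end - len(obj)))  # grow the object with zeros
    obj[offset:end] = payload
    return encode(bytes(obj), k), len(obj)
```

For a 16-byte object with k=4, a two-byte write at offset 5 ends up changing only data shard 1 and the parity shard, yet this naive path still reads all four data shards and re-runs the encoder over the whole object. That is the cost being pointed at here.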

>  - same for reads
>            
> There will be a pile of patches to move code around between PG and 
> ReplicatedPG, which will be annoying, but hopefully not too painful.  The 
> class structure and data types were set up with this in mind long ago.
> 
> Several key challenges:
> 
>  - come up with a scheme for internal naming to keep shards distinct
>  - safely rewriting a stripe when there is a partial overwrite.  probably 
>    want to write new stripes to distinct new objects (cloning old data as 
>    needed) and clean up the old ones once enough copies are present.

Do you mean RBD stripes ? 

>  - recovery logic

If recovery is done from the scheduled scrubber in the ErasureCodedPG , I'm not sure if OSD.cc must be modified or is truly independent of the PG type 

https://github.com/ceph/ceph/blob/master/src/osd/OSD.cc#L3818

I'll keep looking, thanks a lot for the hints :-)

Cheers

> sage
> 
> 
>>
>> 	Thx
>> 	Christopher
>>
>>
>>
>>>
>>> Am I missing something ?
>>>
>>> On 04/22/2013 09:23 AM, Christopher LILJENSTOLPE wrote:
>>>> Supposedly, on 2013-Apr-18, at 14.31 PDT(-0700), someone claiming to be Plaetinck, Dieter scribed:
>>>>
>>>>> On Thu, 18 Apr 2013 16:09:52 -0500
>>>>> Mark Nelson <mark.nelson@inktank.com> wrote:
>>>>
>>>>>>
>>>>>
>>>>> @Bryan: I did come across cleversafe.  all the articles around it seemed promising,
>>>>> but unfortunately it seems everything related to the cleversafe open source project
>>>>> somehow vanished from the internet.  (e.g. http://www.cleversafe.org/) quite weird...
>>>>>
>>>>> @Sage: interesting. I thought it would be more relatively simple if one assumes
>>>>> the restriction of immutable files.  I'm not familiar with those ceph specifics you're mentioning.
>>>>> When building an erasure codes-based system, maybe there's ways to reuse existing ceph
>>>>> code and/or allow some integration with replication based objects, without aiming for full integration or
>>>>> full support of the rados api, based on some tradeoffs.
>>>>>
>>>>> @Josh, that sounds like an interesting approach.  Too bad that page doesn't contain any information yet :)
>>>>
>>>> Greetings - it does now - see what you all think…
>>>>
>>>> 	Christopher
>>>>
>>>>>
>>>>> Dieter
>>>>
>>>>
>>>> --
>>>> 李柯睿
>>>> Check my PGP key here: https://www.asgaard.org/~cdl/cdl.asc
>>>> Current vCard here: https://www.asgaard.org/~cdl/cdl.vcf
>>>> Check my calendar availability: https://tungle.me/cdl
>>>
>>> -- 
>>> Loïc Dachary, Artisan Logiciel Libre
>>
>>
>> --
>> 李柯睿
>> Check my PGP key here: https://www.asgaard.org/~cdl/cdl.asc
>> Current vCard here: https://www.asgaard.org/~cdl/cdl.vcf
>> Check my calendar availability: https://tungle.me/cdl

-- 
Loïc Dachary, Artisan Logiciel Libre



* Re: erasure coding (sorry)
  2013-04-22 18:09                 ` Loic Dachary
@ 2013-04-22 18:31                   ` Sage Weil
  0 siblings, 0 replies; 19+ messages in thread
From: Sage Weil @ 2013-04-22 18:31 UTC (permalink / raw)
  To: Loic Dachary; +Cc: ceph-devel

On Mon, 22 Apr 2013, Loic Dachary wrote:
> Hi Sage,
> 
> On 04/22/2013 05:09 PM, Sage Weil wrote:
> > On Mon, 22 Apr 2013, Christopher LILJENSTOLPE wrote:
> >> Supposedly, on 2013-Apr-22, at 01.10 PDT(-0700), someone claiming to be Loic Dachary scribed:
> >>
> >>> Hi Christopher,
> >>>
> >>> You wrote "A modified client/library could be used to store objects that should be sharded, vs "standard" ceph treatment.  In this model, each shard would be written to a separate PG, and each PG would be stored on exactly one OSD.  " but there is no way for a client to enforce the fact that two objects are stored in separate PGs.
> >>
> >> Poorly worded.  The idea is that each shard becomes a separate object, and the encoder/sharder would use CRUSH to identify the OSDs to hold the shards.  However, the OSDs would treat the shard as an n=1 replication and just store it locally.
> >>
> >> Actually, looking at this again this morning, this is harder than the preferred alternative (i.e. grafting an encode/decode step into the (e)OSD).  It was meant to cover the alternative approaches.  I didn't like this one to begin with, and it now appears to be both more difficult and non-deterministic in its placement.
> >>
> >> One question on CRUSH (it's been too long since I read the paper): if x is the same for two objects and n=3 returns R={OSD18,OSD45,OSD97}, then if an object that matches x but has n=1 is handed to OSD45, would OSD45 store it, or would it forward it to OSD18 to store?  If it would forward it, this idea is DOA.  Also, if x is held invariant but n changes, does the same R set get returned (truncated to n members)?
> > 
> > It would go to osd18, the first item in the sequence that CRUSH generates.
> >            
> > As Loic observes, not having control of placement from above the librados
> > level makes this more or less a non-starter.  The only thing that might
> > work at that layer is to set up ~5 or more pools, each with a distinct set
> > of OSDs, and put each shard/fragment in a different pool.  I don't think  
> > that is a particularly good approach.
> > 
> > If we are going to do parity encoding (and I think we should!), I think we
> > should fully integrate it into the OSD.
> >            
> > The simplest approach:
> >            
> >  - we create a new PG type for 'parity' or 'erasure' or whatever (type    
> >    fields already exist)
> >  - those PGs use the parity ('INDEP') crush mode so that placement is
> >    intelligent
> 
> I assume you do not mean CEPH_FEATURE_INDEP_PG_MAP as used in
> 
> https://github.com/ceph/ceph/blob/master/src/osd/OSD.cc#L5237
> 
> but CRUSH_RULE_CHOOSE_INDEP as used in
> 
> https://github.com/ceph/ceph/blob/master/src/crush/mapper.c#L331
> 
> when firstn == 0 because it was set in
> 
> https://github.com/ceph/ceph/blob/master/src/crush/mapper.c#L523

Right.


> I see that it would be simpler to write
> 
>    step choose indep 0 type row
> 
> and then rely on intelligent placement. Is there a reason why it would not be possible to use firstn instead of indep ?

The indep placement avoids moving around a shard between ranks, because a 
mapping of [0,1,2,3,4] will change to [0,6,2,3,4] (or something) if osd.1 
fails and the shards on 2,3,4 won't need to be copied around.
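
A toy illustration of that difference. This is not the real CRUSH algorithm: `pick` is a hypothetical hash-based chooser, firstn is modelled as one shared candidate sequence from which failed OSDs are skipped, and indep is modelled as each rank retrying on its own.

```python
import hashlib


def pick(x, rank, attempt, osds):
    """Hypothetical deterministic pseudo-random choice of one OSD id."""
    digest = hashlib.sha256(f"{x}:{rank}:{attempt}".encode()).digest()
    return osds[int.from_bytes(digest[:8], "big") % len(osds)]


def _check(n, osds, down):
    if n > len(set(osds) - set(down)):
        raise ValueError("not enough healthy OSDs")


def choose_firstn(x, n, osds, down=()):
    """One shared sequence; skipping a failed OSD shifts later entries left."""
    _check(n, osds, down)
    out, attempt = [], 0
    while len(out) < n:
        cand = pick(x, 0, attempt, osds)
        attempt += 1
        if cand not in down and cand not in out:
            out.append(cand)
    return out


def choose_indep(x, n, osds, down=()):
    """Per-rank choice; only the ranks that sit on a failed OSD are re-chosen."""
    _check(n, osds, down)
    out = []
    for rank in range(n):  # pass 1: mapping as if nothing had failed
        attempt = 0
        while True:
            cand = pick(x, rank, attempt, osds)
            attempt += 1
            if cand not in out:
                out.append(cand)
                break
    for rank in range(n):  # pass 2: replace failed ranks in place
        attempt = 1
        while out[rank] in down:
            cand = pick(x, rank, -attempt, osds)  # negative = retry namespace
            attempt += 1
            if cand not in down and cand not in out:
                out[rank] = cand
    return out
```

Calling both functions with and without `down={some_osd}` shows the shape of the two behaviours: with firstn every entry after the failed one moves up a rank, so shards would have to change position, whereas with indep only the failed rank gets a new OSD and the other ranks keep theirs.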

> >  - all reads and writes go to the 'primary'               
> >  - the primary does the shard encoding and distributes the write pieces to
> >    the other replicas
> 
> Although I understand how that would work when a PG receives a CEPH_OSD_OP_WRITEFULL
> 
> https://github.com/ceph/ceph/blob/master/src/osd/ReplicatedPG.cc#L2504
> 
> It may be inconvenient and expensive to recompute the parity encoded version if an object is written with a series of CEPH_OSD_OP_WRITE operations. The simplest way would be to decode the existing object, modify it according to what CEPH_OSD_OP_WRITE requires, and re-encode it.

Yeah.  Making small writes remotely efficient is a huge challenge.

OTOH, if we have some sort of symlink/redirect, and we do writes to 
replicated objects and only write erasure/parity data in full objects, 
then we avoid that complexity.

> 
> >  - same for reads
> >            
> > There will be a pile of patches to move code around between PG and 
> > ReplicatedPG, which will be annoying, but hopefully not too painful.  The 
> > class structure and data types were set up with this in mind long ago.
> > 
> > Several key challenges:
> > 
> >  - come up with a scheme for internal naming to keep shards distinct
> >  - safely rewriting a stripe when there is a partial overwrite.  probably 
> >    want to write new stripes to distinct new objects (cloning old data as 
> >    needed) and clean up the old ones once enough copies are present.
> 
> Do you mean RBD stripes ? 

I'm thinking stripes of a logical object across the object's shards.  
Mostly in terms of something like RAID4; I'm not sure what terminology is 
typically used for erasure coding systems.  I'm assuming that stripes of 
the object would be coded so that partial object reads/updates of 
large objects are vaguely efficient...
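
To make the RAID4 analogy concrete, here is a sketch of one possible layout, assuming fixed-size stripe units laid out round-robin over k data shards. The names `locate` and `extents_for_read` and the parameters `k` and `unit` are hypothetical and not taken from Ceph.

```python
def locate(offset, k, unit):
    """Map a logical byte offset to (data shard index, offset within that shard)."""
    stripe, within = divmod(offset, k * unit)
    shard, byte = divmod(within, unit)
    return shard, stripe * unit + byte


def extents_for_read(offset, length, k, unit):
    """List the (shard, shard_offset, length) extents covering a logical read."""
    extents = []
    end = offset + length
    while offset < end:
        shard, shard_off = locate(offset, k, unit)
        # Stop at the end of the current stripe unit or the end of the read.
        run = min(unit - shard_off % unit, end - offset)
        extents.append((shard, shard_off, run))
        offset += run
    return extents
```

With k=4 and 4096-byte units, a 200-byte read at logical offset 4000 maps to two extents on shards 0 and 1 rather than to all four data shards; a partial update would touch the same extents plus the matching range of the parity shard.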

Ideally, the implementation would have a field indicating what coding 
technique is used (parity, erasure, whatever).  Different optimizations 
are best for different coding strategies, but the minimum useful feature 
in this case wouldn't optimize anything anyway :)

> >  - recovery logic
> 
> If recovery is done from the scheduled scrubber in the ErasureCodedPG , I'm not sure if OSD.cc must be modified or is truly independent of the PG type 
> 
> https://github.com/ceph/ceph/blob/master/src/osd/OSD.cc#L3818
> 
> I'll keep looking, thanks a lot for the hints :-)
> 
> Cheers
> 
> > sage
> > 
> > 
> >>
> >> 	Thx
> >> 	Christopher
> >>
> >>
> >>
> >>>
> >>> Am I missing something ?
> >>>
> >>> On 04/22/2013 09:23 AM, Christopher LILJENSTOLPE wrote:
> >>>> Supposedly, on 2013-Apr-18, at 14.31 PDT(-0700), someone claiming to be Plaetinck, Dieter scribed:
> >>>>
> >>>>> On Thu, 18 Apr 2013 16:09:52 -0500
> >>>>> Mark Nelson <mark.nelson@inktank.com> wrote:
> >>>>
> >>>>>>
> >>>>>
> >>>>> @Bryan: I did come across cleversafe.  all the articles around it seemed promising,
> >>>>> but unfortunately it seems everything related to the cleversafe open source project
> >>>>> somehow vanished from the internet.  (e.g. http://www.cleversafe.org/) quite weird...
> >>>>>
> >>>>> @Sage: interesting. I thought it would be more relatively simple if one assumes
> >>>>> the restriction of immutable files.  I'm not familiar with those ceph specifics you're mentioning.
> >>>>> When building an erasure codes-based system, maybe there's ways to reuse existing ceph
> >>>>> code and/or allow some integration with replication based objects, without aiming for full integration or
> >>>>> full support of the rados api, based on some tradeoffs.
> >>>>>
> >>>>> @Josh, that sounds like an interesting approach.  Too bad that page doesn't contain any information yet :)
> >>>>
> >>>> Greetings - it does now - see what you all think…
> >>>>
> >>>> 	Christopher
> >>>>
> >>>>>
> >>>>> Dieter
> >>>>
> >>>>
> >>>> --
> >>>> 李柯睿
> >>>> Check my PGP key here: https://www.asgaard.org/~cdl/cdl.asc
> >>>> Current vCard here: https://www.asgaard.org/~cdl/cdl.vcf
> >>>> Check my calendar availability: https://tungle.me/cdl
> >>>
> >>> -- 
> >>> Loïc Dachary, Artisan Logiciel Libre
> >>
> >>
> >> --
> >> 李柯睿
> >> Check my PGP key here: https://www.asgaard.org/~cdl/cdl.asc
> >> Current vCard here: https://www.asgaard.org/~cdl/cdl.vcf
> >> Check my calendar availability: https://tungle.me/cdl
> 
> -- 
> Loïc Dachary, Artisan Logiciel Libre
> 
> 


* Re: erasure coding (sorry)
  2013-04-22 15:09               ` Sage Weil
  2013-04-22 18:09                 ` Loic Dachary
@ 2013-04-24  4:35                 ` Christopher LILJENSTOLPE
  1 sibling, 0 replies; 19+ messages in thread
From: Christopher LILJENSTOLPE @ 2013-04-24  4:35 UTC (permalink / raw)
  To: Sage Weil; +Cc: Loic Dachary, ceph-devel

Supposedly, on 2013-Apr-22, at 08.09 PDT(-0700), someone claiming to be Sage Weil scribed:

> On Mon, 22 Apr 2013, Christopher LILJENSTOLPE wrote:
>> Supposedly, on 2013-Apr-22, at 01.10 PDT(-0700), someone claiming to be Loic Dachary scribed:
>>
>>> Hi Christopher,
>>>
>>> You wrote "A modified client/library could be used to store objects that should be sharded, vs "standard" ceph treatment.  In this model, each shard would be written to a separate PG, and each PG would be stored on exactly one OSD.  " but there is no way for a client to enforce the fact that two objects are stored in separate PGs.
>>
>> Poorly worded.  The idea is that each shard becomes a separate object, and the encoder/sharder would use CRUSH to identify the OSDs to hold the shards.  However, the OSDs would treat the shard as an n=1 replication and just store it locally.
>>
>> Actually, looking at this again this morning, this is harder than the preferred alternative (i.e. grafting an encode/decode step into the (e)OSD).  It was meant to cover the alternative approaches.  I didn't like this one to begin with, and it now appears to be both more difficult and non-deterministic in its placement.
>>
>> One question on CRUSH (it's been too long since I read the paper): if x is the same for two objects and n=3 returns R={OSD18,OSD45,OSD97}, then if an object that matches x but has n=1 is handed to OSD45, would OSD45 store it, or would it forward it to OSD18 to store?  If it would forward it, this idea is DOA.  Also, if x is held invariant but n changes, does the same R set get returned (truncated to n members)?
>
> It would go to osd18, the first item in the sequence that CRUSH generates.

That's what I thought - then it is a non-starter
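
For illustration only, here is a minimal sketch of why an n=1 object with the same x lands on the first OSD of the n=3 set. It is NOT CRUSH: it uses plain rendezvous (highest-random-weight) hashing as a stand-in, and the function name place(), the OSD names and the input "pg-1.2a" are all made up for the example. In this toy the n=1 result is a prefix of the n=3 result by construction; it shows the behaviour Sage describes for the first item, and says nothing definitive about what real CRUSH returns when n changes.

```python
import hashlib

def place(x, osds, n):
    """Toy placement (not CRUSH): rank OSDs by a hash of (x, osd), keep the first n."""
    def score(osd):
        digest = hashlib.sha256(f"{x}:{osd}".encode()).digest()
        return int.from_bytes(digest[:8], "big")
    # The ordering depends only on x and the OSD list, never on n.
    return sorted(osds, key=score, reverse=True)[:n]

osds = [f"osd{i}" for i in range(100)]   # hypothetical cluster
r3 = place("pg-1.2a", osds, 3)           # three-way placement for input x
r1 = place("pg-1.2a", osds, 1)           # same x, n=1

# Same x, smaller n: the single target is the first ("primary") entry of r3,
# so a shard handed to the second OSD in r3 would not be stored there.
print(r3, r1)
```

Because every shard keyed on the same x maps to the same first OSD, a client cannot spread shards over the set just by lowering n.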

>
> As Loic observes, not having control of placement from above the librados
> level makes this more or less a non-starter.  The only thing that might
> work at that layer is to set up ~5 or more pools, each with a distinct set
> of OSDs, and put each shard/fragment in a different pool.  I don't think
> that is a particularly good approach.

Pretty much a kludge - I would agree.

>
> If we are going to do parity encoding (and I think we should!), I think we
> should fully integrate it into the OSD.
>
> The simplest approach:
>
> - we create a new PG type for 'parity' or 'erasure' or whatever (type
> fields already exist)
> - those PGs use the parity ('INDEP') crush mode so that placement is
> intelligent
> - all reads and writes go to the 'primary'
> - the primary does the shard encoding and distributes the write pieces to
> the other replicas
> - same for reads

Yup - that's basically what I was trying to outline for the single-tier model.  I called them eOSDs.
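
A rough sketch of the "primary does the shard encoding" step, under loud assumptions: a single XOR parity shard stands in for a real erasure code (so only one lost shard can be repaired), and the names encode(), decode() and xor_bytes() are hypothetical, not anything in the Ceph tree. The primary would split a write into k data shards plus parity and send one piece to each OSD in the PG; on read it would gather the pieces and rebuild any single missing one.

```python
from functools import reduce

def xor_bytes(a, b):
    """Byte-wise XOR of two equal-length byte strings."""
    return bytes(x ^ y for x, y in zip(a, b))

def encode(data, k):
    """Split data into k equal-size shards (zero-padded) plus one XOR parity shard."""
    shard_len = -(-len(data) // k)               # ceiling division
    padded = data.ljust(shard_len * k, b"\0")
    shards = [padded[i * shard_len:(i + 1) * shard_len] for i in range(k)]
    return shards + [reduce(xor_bytes, shards)]  # parity goes last

def decode(shards, k, length):
    """Rebuild the original bytes; at most one entry of shards may be None (lost)."""
    missing = [i for i, s in enumerate(shards) if s is None]
    if len(missing) > 1:
        raise ValueError("single parity can repair only one lost shard")
    if missing:
        present = [s for s in shards if s is not None]
        shards = list(shards)
        shards[missing[0]] = reduce(xor_bytes, present)  # XOR of the rest
    return b"".join(shards[:k])[:length]         # drop parity and padding

data = b"immutable cold object"
pieces = encode(data, 4)        # 4 data shards + 1 parity shard
pieces[2] = None                # pretend one OSD is down
restored = decode(pieces, 4, len(data))
```

A real implementation would use a code that tolerates more than one failure (Reed-Solomon or similar) and would need the original length stored alongside the shards; both are glossed over here.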
>
> There will be a pile of patches to move code around between PG and
> ReplicatedPG, which will be annoying, but hopefully not too painful.  The
> class structure and data types were set up with this in mind long ago.
>
> Several key challenges:
>
> - come up with a scheme for internal naming to keep shards distinct
> - safely rewriting a stripe when there is a partial overwrite.  probably
> want to write new stripes to distinct new objects (cloning old data as
> needed) and clean up the old ones once enough copies are present.
> - recovery logic

I've been giving these some thought - I'll try to get them into the blueprint.  Is the blueprint, as it stands, reasonable to include in the design summit, knowing that it will continue to evolve?
>
> sage
>
Christopher

>
>>
>> 	Thx
>> 	Christopher
>>
>>
>>
>>>
>>> Am I missing something ?
>>>
>>> On 04/22/2013 09:23 AM, Christopher LILJENSTOLPE wrote:
>>>> Supposedly, on 2013-Apr-18, at 14.31 PDT(-0700), someone claiming to be Plaetinck, Dieter scribed:
>>>>
>>>>> On Thu, 18 Apr 2013 16:09:52 -0500
>>>>> Mark Nelson <mark.nelson@inktank.com> wrote:
>>>>
>>>>>>
>>>>>
>>>>> @Bryan: I did come across cleversafe.  all the articles around it seemed promising,
>>>>> but unfortunately it seems everything related to the cleversafe open source project
>>>>> somehow vanished from the internet.  (e.g. http://www.cleversafe.org/) quite weird...
>>>>>
>>>>> @Sage: interesting. I thought it would be relatively simple if one assumes
>>>>> the restriction of immutable files.  I'm not familiar with those ceph specifics you're mentioning.
>>>>> When building an erasure code-based system, maybe there are ways to reuse existing ceph
>>>>> code and/or allow some integration with replication based objects, without aiming for full integration or
>>>>> full support of the rados api, based on some tradeoffs.
>>>>>
>>>>> @Josh, that sounds like an interesting approach.  Too bad that page doesn't contain any information yet :)
>>>>
>>>> Greetings - it does now - see what you all think?
>>>>
>>>> 	Christopher
>>>>
>>>>>
>>>>> Dieter
>>>>
>>>>
>>>> --
>>>> 李柯睿
>>>> Check my PGP key here: https://www.asgaard.org/~cdl/cdl.asc
>>>> Current vCard here: https://www.asgaard.org/~cdl/cdl.vcf
>>>> Check my calendar availability: https://tungle.me/cdl
>>>
>>> --
>>> Loïc Dachary, Artisan Logiciel Libre
>>
>>
>> --
>> 李柯睿
>> Check my PGP key here: https://www.asgaard.org/~cdl/cdl.asc
>> Current vCard here: https://www.asgaard.org/~cdl/cdl.vcf
>> Check my calendar availability: https://tungle.me/cdl


--
李柯睿
Check my PGP key here: https://www.asgaard.org/~cdl/cdl.asc
Current vCard here: https://www.asgaard.org/~cdl/cdl.vcf
Check my calendar availability: https://tungle.me/cdl

[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 495 bytes --]

^ permalink raw reply	[flat|nested] 19+ messages in thread

end of thread, other threads:[~2013-04-24  4:35 UTC | newest]

Thread overview: 19+ messages
2013-04-18 20:28 erasure coding (sorry) Plaetinck, Dieter
2013-04-18 20:47 ` Sage Weil
2013-04-18 21:08   ` Josh Durgin
2013-04-18 21:09     ` Mark Nelson
2013-04-18 21:31       ` Plaetinck, Dieter
2013-04-19  0:33         ` Christopher LILJENSTOLPE
2013-04-22  7:23         ` Christopher LILJENSTOLPE
2013-04-22  8:10           ` Loic Dachary
2013-04-22 14:08             ` Christopher LILJENSTOLPE
2013-04-22 15:09               ` Sage Weil
2013-04-22 18:09                 ` Loic Dachary
2013-04-22 18:31                   ` Sage Weil
2013-04-24  4:35                 ` Christopher LILJENSTOLPE
2013-04-18 21:24     ` Noah Watkins
2013-04-18 21:26       ` Sage Weil
2013-04-19  0:47         ` Christopher LILJENSTOLPE
2013-04-21  2:37           ` Loic Dachary
2013-04-19  0:34       ` Christopher LILJENSTOLPE
2013-04-19  0:29     ` Christopher LILJENSTOLPE
