Started developing a deduplication feature

* Started developing a deduplication feature
@ 2016-04-01 17:25 Marcel Lauhoff
  2016-04-01 21:31 ` Sage Weil
  0 siblings, 1 reply; 8+ messages in thread
From: Marcel Lauhoff @ 2016-04-01 17:25 UTC (permalink / raw)
  To: ceph-devel

Hi Ceph,

deduplication has been discussed on the list a couple of times.
Over the next months I'll be working on a prototype.

In short: Use a content-addressed storage pool backed by a pool
acting as storage and distributed fingerprint index.

Two pools: (1) pool that does the content addressing, (2) storage / index pool.

OSDs in the first pool readdress and chunk/reassemble objects.
They then store the new objects/chunks in a second pool.
The first pool uses a new PG backend ("CAS Backend"),
while the second can use replication or erasure coding.

The CAS backend computes fingerprints for incoming objects and
stores the fingerprint <-> original object name mapping.
It then forwards the data to a storage pool, addressing the objects by
fingerprint (the content defined name).

The storage pool therefore serves as a distributed fingerprint index.
CRUSH selects the responsible OSDs. The OSDs know their objects.

Deduplication happens when two objects/chunks have the same
fingerprint.
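
A minimal sketch of the two-pool idea described above, with plain Python dicts standing in for the pools. The class names `CasPool` and `StoragePool` and the choice of SHA-256 as the fingerprint are assumptions made for illustration; they are not taken from the proposal or from Ceph code.

```python
import hashlib


class StoragePool:
    """Stand-in for the second pool: objects are named by their fingerprint."""

    def __init__(self):
        self.objects = {}  # fingerprint -> bytes

    def put(self, fingerprint, data):
        # Writing the same fingerprint twice stores the data only once.
        is_new = fingerprint not in self.objects
        if is_new:
            self.objects[fingerprint] = data
        return is_new

    def get(self, fingerprint):
        return self.objects[fingerprint]


class CasPool:
    """Stand-in for the first pool: maps original names to fingerprints."""

    def __init__(self, storage):
        self.storage = storage
        self.name_to_fp = {}  # original object name -> fingerprint

    def write(self, name, data):
        # SHA-256 is an assumed fingerprint function, chosen for illustration.
        fp = hashlib.sha256(data).hexdigest()
        self.name_to_fp[name] = fp
        return self.storage.put(fp, data)

    def read(self, name):
        return self.storage.get(self.name_to_fp[name])


storage = StoragePool()
cas = CasPool(storage)
cas.write("vm1/block0", b"A" * 4096)
cas.write("vm2/block7", b"A" * 4096)  # same content under a different name
cas.write("vm2/block8", b"B" * 4096)
print(len(cas.name_to_fp), "names ->", len(storage.objects), "stored objects")
```

Because the storage pool is keyed by fingerprint, looking up an object name there doubles as the "is this content already stored?" check, which is the sense in which the pool acts as a distributed fingerprint index.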

My current milestones:
- Develop CAS backend, fingerprinting, recipes store
- Support limited set of operations (like EC does)
- Support RBD (with/without Cache) and evaluate
- Add Chunking, Garbage Collection, ..

Currently I'm adding a new PG backend into the OSD code base. I'll
push the code to my GitHub clone as soon as it does "something" :)

~irq0

--
Marcel Lauhoff
Mail: lauhoff@uni-mainz.de
XMPP: mlauhoff@jabber.uni-mainz.de


* Re: Started developing a deduplication feature
  2016-04-01 17:25 Started developing a deduplication feature Marcel Lauhoff
@ 2016-04-01 21:31 ` Sage Weil
  2016-04-04 12:38   ` Marcel Lauhoff
  2016-04-28 21:08   ` Allen Samuels
  0 siblings, 2 replies; 8+ messages in thread
From: Sage Weil @ 2016-04-01 21:31 UTC (permalink / raw)
  To: Marcel Lauhoff; +Cc: ceph-devel

Hi Marcel,

On Fri, 1 Apr 2016, Marcel Lauhoff wrote:
> Hi Ceph,
> 
> deduplication has been discussed on the list a couple of times.
> Over the next months I'll be working on a prototype.
> 
> In short: Use a content-addressed storage pool backed by a pool
> acting as storage and distributed fingerprint index.
> 
> Two pools: (1) pool that does the content addressing, (2) storage / 
> index pool.
> 
> OSDs in the first pool readdress and chunk/reassemble objects.
> They then store the new objects/chunks in a second pool.

I think this is the right architecture for dedup in Ceph, and matches the 
ideas we've been kicking around.

> The first pool uses a new PG backend ("CAS Backend"),
> while the second can use replication or erasure coding.
> 
> The CAS backend computes fingerprints for incoming objects and
> stores the fingerprint <-> original object name mapping.
> It then forwards the data to a storage pool, addressing the objects by
> fingerprint (the content defined name).
> 
> The storage pool therefore serves as a distributed fingerprint index.
> CRUSH selects the responsible OSDs. The OSDs know their objects.
> 
> Deduplication happens when two objects/chunks have the same
> fingerprint.

This is a little different, though.

The plan so far has been to match this up with the next stage of tiering.  
We'll add the ability for an object to be a 'redirect' and store a bit of 
metadata indicating where to look next.  That might be as simple as "go 
look in this cold RADOS pool over there," or a URL into another storage 
system (e.g., a tape archive), or.. a complicated mapping of bytes to CAS 
chunks in another rados pool.

The original thought was that this would just be a regular ReplicatedPG, 
not a new pool type.  I haven't thought about what we'd gain by having a 
new pool type.  One thing we get by using the existing pool is that we're 
not forced to do the demotion/dedup immediately--we can just store the 
object normally, and dedup it later when we decide it's cold.
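
To make the "mapping of bytes to CAS chunks" concrete, here is a hedged sketch of what such per-object metadata could look like and how a read would follow it. The `ChunkRef` layout, fixed-size chunking, and the SHA-256 fingerprint are assumptions for illustration only; a dict stands in for the CAS RADOS pool.

```python
import hashlib
from dataclasses import dataclass


@dataclass
class ChunkRef:
    offset: int       # byte offset within the logical object
    length: int       # chunk length in bytes
    fingerprint: str  # name of the chunk object in the CAS pool


def dedup_object(data, chunk_size, cas_pool):
    """Split data into fixed-size chunks, store them by hash, return the map."""
    manifest = []
    for offset in range(0, len(data), chunk_size):
        chunk = data[offset:offset + chunk_size]
        fp = hashlib.sha256(chunk).hexdigest()
        cas_pool.setdefault(fp, chunk)  # store only if not already present
        manifest.append(ChunkRef(offset, len(chunk), fp))
    return manifest


def read_range(manifest, cas_pool, offset, length):
    """Reassemble bytes [offset, offset + length) by following the chunk map."""
    end = offset + length
    out = bytearray()
    for ref in manifest:
        ref_end = ref.offset + ref.length
        if ref_end <= offset or ref.offset >= end:
            continue  # chunk does not overlap the requested range
        chunk = cas_pool[ref.fingerprint]
        start_in_chunk = max(offset, ref.offset) - ref.offset
        end_in_chunk = min(end, ref_end) - ref.offset
        out += chunk[start_in_chunk:end_in_chunk]
    return bytes(out)


cas_pool = {}  # dict stands in for the CAS RADOS pool
data = b"abcd" * 3 + b"xyz"
manifest = dedup_object(data, 4, cas_pool)
print(len(manifest), "chunk refs,", len(cas_pool), "unique chunks")
```

Since the logical object can be written normally first and `dedup_object` run later, this shape fits the "dedup it later when we decide it's cold" option.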

For the CAS pool, the idea would be to use the refcount class, or 
something like it, so that you'd say "write object $hash" and if the 
object already exists it'd increment the ref count.  Similarly, when you 
delete the logical object, you do a refcount 'put' on each chunk, and the 
chunk would only go away when the last ref did too.  (In practice we need 
to be careful to avoid leaked refs in the case of failures; this would 
probably be done by having a 'deduping' and 'deleting' state on the 
logical object and named references.)
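
The get/put semantics above can be sketched as follows. This is an illustrative in-memory model, not the cls_refcount interface: the class `RefCountedChunkStore`, its method names, and the use of the logical object's name as the reference tag are all assumptions. Named references are modelled as a set, so a retried 'get' after a failure does not count twice.

```python
import hashlib


class RefCountedChunkStore:
    """In-memory model of a CAS pool whose chunks carry named references."""

    def __init__(self):
        self.chunks = {}  # fingerprint -> (data, set of reference tags)

    def get_ref(self, fingerprint, tag, data):
        """Create the chunk if it is missing, then add a named reference."""
        if fingerprint not in self.chunks:
            self.chunks[fingerprint] = (data, set())
        self.chunks[fingerprint][1].add(tag)  # idempotent for a repeated tag

    def put_ref(self, fingerprint, tag):
        """Drop a named reference; remove the chunk when no refs remain."""
        entry = self.chunks.get(fingerprint)
        if entry is None:
            return
        entry[1].discard(tag)
        if not entry[1]:
            del self.chunks[fingerprint]


store = RefCountedChunkStore()
chunk = b"shared chunk"
fp = hashlib.sha256(chunk).hexdigest()

store.get_ref(fp, "object-a", chunk)
store.get_ref(fp, "object-b", chunk)
store.get_ref(fp, "object-a", chunk)   # retry after a failure: no extra ref
refs_after_writes = len(store.chunks[fp][1])

store.put_ref(fp, "object-a")          # delete logical object a
still_present = fp in store.chunks
store.put_ref(fp, "object-b")          # last reference goes away
gone = fp not in store.chunks
print(refs_after_writes, still_present, gone)
```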

> My current milestones:
> - Develop CAS backend, fingerprinting, recipes store
> - Support limited set of operations (like EC does)
> - Support RBD (with/without Cache) and evaluate
> - Add Chunking, Garbage Collection, ..
> 
> Currently I'm adding a new PG backend into the OSD code base. I'll
> push the code to my GitHub clone as soon as it does "something" :)

This would be a good thing to discuss during the Ceph Developer Monthly 
call next Wednesday:

	http://tracker.ceph.com/projects/ceph/wiki/Planning
	http://tracker.ceph.com/projects/ceph/wiki/CDM_06-APR-2016

sage


* Re: Started developing a deduplication feature
  2016-04-01 21:31 ` Sage Weil
@ 2016-04-04 12:38   ` Marcel Lauhoff
  2016-04-08 15:01     ` Marcel Lauhoff
  2016-04-28 21:08   ` Allen Samuels
  1 sibling, 1 reply; 8+ messages in thread
From: Marcel Lauhoff @ 2016-04-04 12:38 UTC (permalink / raw)
  To: Sage Weil; +Cc: ceph-devel


Hi Sage,

Sage Weil <sage@newdream.net> writes:
> On Fri, 1 Apr 2016, Marcel Lauhoff wrote:
>> The first pool uses a new PG backend ("CAS Backend"),
>> while the second can use replication or erasure coding.
>>
>> The CAS backend computes fingerprints for incoming objects and
>> stores the fingerprint <-> original object name mapping.
>> It then forwards the data to a storage pool, addressing the objects by
>> fingerprint (the content defined name).
>>
>> The storage pool therefore serves as a distributed fingerprint index.
>> CRUSH selects the responsible OSDs. The OSDs know their objects.
>>
>> Deduplication happens when two objects/chunks have the same
>> fingerprint.
>
> This is a little different, though.
>
> The plan so far has been to match this up with the next stage of tiering.
> We'll add the ability for an object to be a 'redirect' and store a bit of
> metadata indicating where to look next.  That might be as simple as "go
> look in this cold RADOS pool over there," or a URL into another storage
> system (e.g., a tape archive), or.. a complicated mapping of bytes to CAS
> chunks in another rados pool.

"OSD - tiering - object redirects" [42]?
As I understood the design, it is "client driven": Clients accessing a
redirected object get a reply "try again" + metadata from the primary
OSD.

What I'm proposing does not change the client. Plus, all the redirection
and dedup magic happens on the OSDs. Therefore the additional round
trip stays in the Ceph cluster.

On the other hand, the "layered pool" approach adds extra PGs and puts
load on the OSDs that the clients could otherwise carry.

> The original thought was that this would just be a regular ReplicatedPG,
> not a new pool type.  I haven't thought about what we'd gain by having a
> new pool type.  One thing we get by using the existing pool is that we're
> not forced to do the demotion/dedup immediately--we can just store the
> object normally, and dedup it later when we decide it's cold.

Which also means that you could dedup an existing pool after a software upgrade.
Still, the common argument against offline dedup applies: data is first
stored in full, so you can't factor the deduplication ratio into capacity
planning and end up having to buy more storage.

> For the CAS pool, the idea would be to use the refcount class, or
> something like it, so that you'd say "write object $hash" and if the
> object already exists it'd increment the ref count.  Similarly, when you
> delete the logical object, you do a refcount 'put' on each chunk, and the
> chunk would only go away when the last ref did too.  (In practice we need
> to be careful to avoid leaked refs in the case of failures; this would
> probably be done by having a 'deduping' and 'deleting' state on the
> logical object and named references.)

Sounds good. Maybe even with write-once-then-read-only objects like, for
example, Venti?

>> My current milestones:
>> - Develop CAS backend, fingerprinting, recipes store
>> - Support limited set of operations (like EC does)
>> - Support RBD (with/without Cache) and evaluate
>> - Add Chunking, Garbage Collection, ..
>>
>> Currently I'm adding a new PG backend into the OSD code base. I'll
>> push the code to my GitHub clone as soon as it does "something" :)
>
> This would be a good thing to discuss during the Ceph Developer Monthly
> call next Wednesday:
>
> 	http://tracker.ceph.com/projects/ceph/wiki/Planning
> 	http://tracker.ceph.com/projects/ceph/wiki/CDM_06-APR-2016

Added. See you Wednesday


~irq0

[42] http://tracker.ceph.com/projects/ceph/wiki/Osd_-_tiering_-_object_redirects

--
Marcel Lauhoff
Mail: lauhoff@uni-mainz.de
XMPP: mlauhoff@jabber.uni-mainz.de


* Re: Started developing a deduplication feature
  2016-04-04 12:38   ` Marcel Lauhoff
@ 2016-04-08 15:01     ` Marcel Lauhoff
  2016-04-08 15:18       ` Sage Weil
  2016-04-08 21:50       ` Shinobu Kinjo
  0 siblings, 2 replies; 8+ messages in thread
From: Marcel Lauhoff @ 2016-04-08 15:01 UTC (permalink / raw)
  To: ceph-devel

Hi list,
short recap of the dedup topic from the CDM on Wednesday:

The main change from the original mail is not to add a PG backend, but
rather use Object Redirects (Tiering v2).
Another backend would have to implement its own replication for
recipes and increase the OSD code base just for dedup. Redirects are
useful beyond deduplication.

The CAS pool design was refined: An object class should handle the ref
counting and content addressing. The pool should also only
allow access through this object class to prevent collisions with
regular objects and support immutable objects.

There was also the idea of client-side deduplication by using metadata
that clients like RGW store. This would save the additional round trip
that object redirects add.

I'll be working on the CAS pool first, since there
is ongoing refactoring in the ReplicatedPG code base. I'll work out a
more detailed design document for the CAS pool soon.

~irq0
--
Marcel Lauhoff
Mail: lauhoff@uni-mainz.de
XMPP: mlauhoff@jabber.uni-mainz.de


* Re: Started developing a deduplication feature
  2016-04-08 15:01     ` Marcel Lauhoff
@ 2016-04-08 15:18       ` Sage Weil
  2016-04-08 21:50       ` Shinobu Kinjo
  1 sibling, 0 replies; 8+ messages in thread
From: Sage Weil @ 2016-04-08 15:18 UTC (permalink / raw)
  To: Marcel Lauhoff; +Cc: ceph-devel

On Fri, 8 Apr 2016, Marcel Lauhoff wrote:
> Hi list,
> short recap of the dedup topic from the CDM on Wednesday:
> 
> The main change from the original mail is not to add a PG backend, but
> rather use Object Redirects (Tiering v2).

Let's call this 'Tiering v2', since it won't be based on redirects.  
(These were extremely problematic in cache tiering because they prevent us 
from maintaining an ordering; we now proxy ops instead.)

At some point soonish we should recreate/update that old blueprint with the 
new design.

> Another backend would have to implement its own replication for
> recipes and increase the OSD code base just for dedup. Redirects are
> useful beyond deduplication.
> 
> The CAS pool design was refined: An object class should handle the ref
> counting and content addressing.

cls_refcount might be sufficient here; if not it's probably a starting 
point.

> The pool should also only
> allow access through this object class to prevent collisions with
> regular objects and support immutable objects.

We decided we can just do this via the cephx capabilities by granting 
the client(s) access to the appropriate class only.

> There was also the idea of client-side deduplication by using metadata
> that clients like RGW store. This would save the additional round trip
> that object redirects add.

Presumably the (first) user here is radosgw, which is already writing more 
or less immutable chunks and could easily dump them in a CAS pool instead 
of a normal replicated pool.

> I'll be working on the CAS pool first, since there
> is ongoing refactoring in the ReplicatedPG code base. I'll work out a
> more detailed design document for the CAS pool soon.

Sounds great.  Thanks, Marcel!
sage


* Re: Started developing a deduplication feature
  2016-04-08 15:01     ` Marcel Lauhoff
  2016-04-08 15:18       ` Sage Weil
@ 2016-04-08 21:50       ` Shinobu Kinjo
  2016-04-12  9:35         ` Marcel Lauhoff
  1 sibling, 1 reply; 8+ messages in thread
From: Shinobu Kinjo @ 2016-04-08 21:50 UTC (permalink / raw)
  To: Marcel Lauhoff; +Cc: Ceph Development

On Sat, Apr 9, 2016 at 12:01 AM, Marcel Lauhoff <lauhoff@uni-mainz.de> wrote:
>
> Hi list,
> short recap of the dedup topic from the CDM on Wednesday:
>
>
> The main change from the original mail is not to add a PG backend, but
> rather use Object Redirects (Tiering v2).
> Another backend would have to implement its own replication for
> recipes and increase the OSD code base just for dedup. Redirects are
> useful beyond deduplication.
>
>
> The CAS pool design was refined: An object class should handle the ref
> counting and content addressing. The pool should also only
> allow access through this object class to prevent collisions with
> regular objects and support immutable objects.
>
>
> There was also the idea of client-side deduplication by using metadata
> that clients like RGW store. This would save the additional round trip
> that object redirects add.

Sounds good.

>
>
> I'll be working on the CAS pool first, since there
> is ongoing refactoring in the ReplicatedPG code base. I'll work out a
> more detailed design document for the CAS pool soon.

I would be able to test it, if you don't mind.

>
>
> ~irq0
> --
> Marcel Lauhoff
> Mail: lauhoff@uni-mainz.de
> XMPP: mlauhoff@jabber.uni-mainz.de

Cheers,
Shinobu

-- 
Email:
shinobu@linux.com
GitHub:
shinobu-x
Blog:
Life with Distributed Computational System based on OpenSource


* Re: Started developing a deduplication feature
  2016-04-08 21:50       ` Shinobu Kinjo
@ 2016-04-12  9:35         ` Marcel Lauhoff
  0 siblings, 0 replies; 8+ messages in thread
From: Marcel Lauhoff @ 2016-04-12  9:35 UTC (permalink / raw)
  To: skinjo; +Cc: Ceph Development


Shinobu Kinjo <shinobu.kj@gmail.com> writes:

> On Sat, Apr 9, 2016 at 12:01 AM, Marcel Lauhoff <lauhoff@uni-mainz.de> wrote:
>> I'll be working on the CAS pool first, since there
>> is ongoing refactoring in the ReplicatedPG code base. I'll work out a
>> more detailed design document for the CAS pool soon.
>
> I would be able to test it, if you don't mind.

Sounds good. What do you have in mind?


~irq0
--
Marcel Lauhoff
Mail: lauhoff@uni-mainz.de
XMPP: mlauhoff@jabber.uni-mainz.de


* RE: Started developing a deduplication feature
  2016-04-01 21:31 ` Sage Weil
  2016-04-04 12:38   ` Marcel Lauhoff
@ 2016-04-28 21:08   ` Allen Samuels
  1 sibling, 0 replies; 8+ messages in thread
From: Allen Samuels @ 2016-04-28 21:08 UTC (permalink / raw)
  To: Sage Weil, Marcel Lauhoff; +Cc: ceph-devel@vger.kernel.org

> -----Original Message-----
> From: ceph-devel-owner@vger.kernel.org [mailto:ceph-devel-
> owner@vger.kernel.org] On Behalf Of Sage Weil
> Sent: Friday, April 01, 2016 4:31 PM
> To: Marcel Lauhoff <lauhoff@uni-mainz.de>
> Cc: ceph-devel@vger.kernel.org
> Subject: Re: Started developing a deduplication feature
> 
> Hi Marcel,
> 
> On Fri, 1 Apr 2016, Marcel Lauhoff wrote:
> > Hi Ceph,
> >
> > deduplication has been discussed on the list a couple of times.
> > Over the next months I'll be working on a prototype.
> >
> > In short: Use a content-addressed storage pool backed by a pool acting
> > as storage and distributed fingerprint index.
> >
> > Two pools: (1) pool that does the content addressing, (2) storage /
> > index pool.
> >
> > OSDs in the first pool readdress and chunk/reassemble objects.
> > They then store the new objects/chunks in a second pool.
> 
> I think this is the right architecture for dedup in Ceph, and matches the ideas
> we've been kicking around.
> 
> > The first pool uses a new PG backend ("CAS Backend"), while the second
> > can use replication or erasure coding.
> >
> > The CAS backend computes fingerprints for incoming objects and stores
> > the fingerprint <-> original object name mapping.
> > It then forwards the data to a storage pool, addressing the objects by
> > fingerprint (the content defined name).
> >
> > The storage pool therefore serves as a distributed fingerprint index.
> > CRUSH selects the responsible OSDs. The OSDs know their objects.
> >
> > Deduplication happens when two objects/chunks have the same
> > fingerprint.
> 
> This is a little different, though.
> 
> The plan so far has been to match this up with the next stage of tiering.
> We'll add the ability for an object to be a 'redirect' and store a bit of
> metadata indicating where to look next.  That might be as simple as "go look in
> this cold RADOS pool over there," or a URL into another storage system (e.g.,
> a tape archive), or.. a complicated mapping of bytes to CAS chunks in another
> rados pool.
> 
> The original thought was that this would just be a regular ReplicatedPG, not a
> new pool type.  I haven't thought about what we'd gain by having a new pool
> type.  One thing we get by using the existing pool is that we're not forced to
> do the demotion/dedup immediately--we can just store the object normally,
> and dedup it later when we decide it's cold.

To me, using a replicated pool to store the chunks significantly degrades the value of deduplication.
Also, the usage of a standard RADOS object for each chunk will severely degrade performance for small chunk sizes at large data scales.

The advantage of a new pool type is that you can create a metadata structure that's better crafted to this use case and that uses erasure coding to really get the full value out of deduplication.
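
A rough back-of-the-envelope sketch of both points. All numbers here (1 PiB of logical data, the chunk sizes, a 2:1 dedup ratio, 3x replication, a 4+2 erasure-code profile) are assumptions picked for illustration, not measurements, and the chunk count is an upper bound before duplicates are removed.

```python
KiB = 2**10
MiB = 2**20
PiB = 2**50


def chunk_object_count(logical_bytes, chunk_bytes):
    """Upper bound on chunk objects: one object per chunk, before dedup."""
    return -(-logical_bytes // chunk_bytes)  # ceiling division


def ec_overhead(k, m):
    """Raw-to-usable ratio of a k+m erasure code."""
    return (k + m) / k


def raw_bytes(logical_bytes, dedup_ratio, redundancy_overhead):
    """Raw capacity needed after dedup and redundancy are both applied."""
    return logical_bytes / dedup_ratio * redundancy_overhead


for chunk_size in (4 * KiB, 64 * KiB, 4 * MiB):
    count = chunk_object_count(PiB, chunk_size)
    print(f"{chunk_size // KiB:>5} KiB chunks -> {count:,} chunk objects")

logical = 100.0  # TB of logical data, assumed
print("3x replication:", raw_bytes(logical, 2.0, 3.0), "TB raw")
print("EC 4+2:        ", raw_bytes(logical, 2.0, ec_overhead(4, 2)), "TB raw")
```

Under these assumed numbers, 3x replication consumes more raw space than the logical data even after a 2:1 dedup, while the 4+2 erasure code stays below it; and small chunks push the object count into the hundreds of billions, which is where per-object overhead in a standard RADOS object would hurt.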

Lots more work of course :(

> 
> For the CAS pool, the idea would be to use the refcount class, or something
> like it, so that you'd say "write object $hash" and if the object already exists
> it'd increment the ref count.  Similarly, when you delete the logical object,
> you do a refcount 'put' on each chunk, and the chunk would only go away
> when the last ref did too.  (In practice we need to be careful to avoid leaked
> refs in the case of failures; this would probably be done by having a
> 'deduping' and 'deleting' state on the logical object and named references.)
> 
> > My current milestones:
> > - Develop CAS backend, fingerprinting, recipes store
> > - Support limited set of operations (like EC does)
> > - Support RBD (with/without Cache) and evaluate
> > - Add Chunking, Garbage Collection, ..
> >
> > Currently I'm adding a new PG backend into the OSD code base. I'll
> > push the code to my GitHub clone as soon as it does "something" :)
> 
> This would be a good thing to discuss during the Ceph Developer Monthly call
> next Wednesday:
> 
> 	http://tracker.ceph.com/projects/ceph/wiki/Planning
> 	http://tracker.ceph.com/projects/ceph/wiki/CDM_06-APR-2016
> 
> sage

