* Started developing a deduplication feature
From: Marcel Lauhoff @ 2016-04-01 17:25 UTC
To: ceph-devel
Hi Ceph,
deduplication has been discussed on the list a couple of times.
Over the next months I'll be working on a prototype.
In short: Use a content-addressed storage pool backed by a pool
acting as storage and distributed fingerprint index.
Two pools: (1) pool that does the content addressing, (2) storage / index pool.
OSDs in the first pool readdress and chunk/reassemble objects.
They then store the new objects/chunks in a second pool.
The first pool uses a new PG backend ("CAS Backend"),
while the second can use replication or erasure coding.
The CAS backend computes fingerprints for incoming objects and
stores the fingerprint <-> original object name mapping.
It then forwards the data to a storage pool, addressing the objects by
fingerprint (the content defined name).
The storage pool therefore serves as a distributed fingerprint index.
CRUSH selects the responsible OSDs. The OSDs know their objects.
Deduplication happens when two objects/chunks have the same
fingerprint.
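
A minimal sketch of the two-pool idea described above, assuming SHA-256 as the fingerprint function and plain dictionaries standing in for the two pools. The class name `CasFrontend` and its methods are hypothetical and exist only for illustration; none of this is Ceph code.

```python
import hashlib


class CasFrontend:
    """Toy model of the proposal: one mapping per pool (both are assumptions)."""

    def __init__(self):
        self.recipes = {}  # original object name -> fingerprint
        self.storage = {}  # fingerprint -> data, stands in for the storage pool

    def write(self, name: str, data: bytes) -> str:
        # The fingerprint is the content-defined name of the object.
        fp = hashlib.sha256(data).hexdigest()
        # Identical content maps to the same key, so it is stored only once.
        self.storage.setdefault(fp, data)
        self.recipes[name] = fp
        return fp

    def read(self, name: str) -> bytes:
        # Resolve the original name to its fingerprint, then fetch by fingerprint.
        return self.storage[self.recipes[name]]


cas = CasFrontend()
cas.write("vm-image-a", b"same bytes")
cas.write("vm-image-b", b"same bytes")
cas.write("vm-image-c", b"other bytes")
print(len(cas.recipes), "names,", len(cas.storage), "stored objects")
```

Three names end up backed by two stored objects, because the first two writes share a fingerprint. In the real design the second mapping would be spread across OSDs by CRUSH rather than held in one dictionary.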
My current milestones:
- Develop CAS backend, fingerprinting, recipes store
- Support limited set of operations (like EC does)
- Support RBD (with/without Cache) and evaluate
- Add Chunking, Garbage Collection, ..
Currently I'm adding a new PG backend into the OSD code base. I'll
push the code to my GitHub clone as soon as it does "something" :)
~irq0
--
Marcel Lauhoff
Mail: lauhoff@uni-mainz.de
XMPP: mlauhoff@jabber.uni-mainz.de

* Re: Started developing a deduplication feature
From: Sage Weil @ 2016-04-01 21:31 UTC
To: Marcel Lauhoff; +Cc: ceph-devel

Hi Marcel,

On Fri, 1 Apr 2016, Marcel Lauhoff wrote:
> Hi Ceph,
>
> deduplication has been discussed on the list a couple of times.
> Over the next months I'll be working on a prototype.
>
> In short: Use a content-addressed storage pool backed by a pool
> acting as storage and distributed fingerprint index.
>
> Two pools: (1) pool that does the content addressing, (2) storage /
> index pool.
>
> OSDs in the first pool readdress and chunk/reassemble objects.
> They then store the new objects/chunks in a second pool.

I think this is the right architecture for dedup in Ceph, and matches the
ideas we've been kicking around.

> The first pool uses a new PG backend ("CAS Backend"),
> while the second can use replication or erasure coding.
>
> The CAS backend computes fingerprints for incoming objects and
> stores the fingerprint <-> original object name mapping.
> It then forwards the data to a storage pool, addressing the objects by
> fingerprint (the content defined name).
>
> The storage pool therefore serves as a distributed fingerprint index.
> CRUSH selects the responsible OSDs. The OSDs know their objects.
>
> Deduplication happens when two objects/chunks have the same
> fingerprint.

This is a little different, though.

The plan so far has been to match this up with the next stage of tiering.
We'll add the ability for an object to be a 'redirect' and store a bit of
metadata indicating where to look next. That might be as simple as "go
look in this cold RADOS pool over there," or a URL into another storage
system (e.g., a tape archive), or.. a complicated mapping of bytes to CAS
chunks in another RADOS pool.

The original thought was that this would just be a regular ReplicatedPG,
not a new pool type. I haven't thought about what we'd gain by having a
new pool type. One thing we get by using the existing pool is that we're
not forced to do the demotion/dedup immediately--we can just store the
object normally, and dedup it later when we decide it's cold.

For the CAS pool, the idea would be to use the refcount class, or
something like it, so that you'd say "write object $hash" and if the
object already exists it'd increment the ref count. Similarly, when you
delete the logical object, you do a refcount 'put' on each chunk, and the
chunk would only go away when the last ref did too. (In practice we need
to be careful to avoid leaked refs in the case of failures; this would
probably be done by having a 'deduping' and 'deleting' state on the
logical object and named references.)

> My current milestones:
> - Develop CAS backend, fingerprinting, recipes store
> - Support limited set of operations (like EC does)
> - Support RBD (with/without Cache) and evaluate
> - Add Chunking, Garbage Collection, ..
>
> Currently I'm adding a new PG backend into the OSD code base. I'll
> push the code to my GitHub clone as soon as it does "something" :)

This would be a good thing to discuss during the Ceph Developer Monthly
call next Wednesday:

http://tracker.ceph.com/projects/ceph/wiki/Planning
http://tracker.ceph.com/projects/ceph/wiki/CDM_06-APR-2016

sage
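
The reference-counting scheme Sage describes above ("write object $hash" increments a count, a 'put' drops it, named references guard against leaks) can be sketched roughly as follows. This is a toy in-memory model under stated assumptions: `RefcountedCas`, `get`, and `put` are hypothetical names, SHA-256 is an assumed fingerprint, and the code does not reproduce the interface of Ceph's refcount object class.

```python
import hashlib


class RefcountedCas:
    """Toy CAS store in which each chunk tracks a set of named references."""

    def __init__(self):
        self.chunks = {}  # fingerprint -> chunk data
        self.refs = {}    # fingerprint -> set of reference tags (holders)

    def get(self, data: bytes, tag: str) -> str:
        """Store the chunk if it is new, then record `tag` as a holder."""
        fp = hashlib.sha256(data).hexdigest()
        self.chunks.setdefault(fp, data)
        # A set makes a retried 'get' with the same tag a no-op.
        self.refs.setdefault(fp, set()).add(tag)
        return fp

    def put(self, fp: str, tag: str) -> None:
        """Drop `tag`'s reference; delete the chunk when no holders remain."""
        holders = self.refs.get(fp)
        if holders is None:
            return  # already gone, so a retried 'put' is harmless
        holders.discard(tag)
        if not holders:
            del self.refs[fp]
            del self.chunks[fp]


store = RefcountedCas()
fp = store.get(b"chunk", tag="obj-A")
store.get(b"chunk", tag="obj-B")
store.get(b"chunk", tag="obj-A")   # retry after a failure: no double count
store.put(fp, "obj-A")
still_there = fp in store.chunks   # obj-B still holds a reference
store.put(fp, "obj-B")
gone = fp not in store.chunks      # last reference dropped
print(still_there, gone)
```

Using named references instead of a bare counter is what makes retries safe in this sketch: repeating an operation after a failure cannot inflate or deflate the count.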

* Re: Started developing a deduplication feature
From: Marcel Lauhoff @ 2016-04-04 12:38 UTC
To: Sage Weil; +Cc: ceph-devel

Hi Sage,

Sage Weil <sage@newdream.net> writes:
> On Fri, 1 Apr 2016, Marcel Lauhoff wrote:
>> The first pool uses a new PG backend ("CAS Backend"),
>> while the second can use replication or erasure coding.
>>
>> The CAS backend computes fingerprints for incoming objects and
>> stores the fingerprint <-> original object name mapping.
>> It then forwards the data to a storage pool, addressing the objects by
>> fingerprint (the content defined name).
>>
>> The storage pool therefore serves as a distributed fingerprint index.
>> CRUSH selects the responsible OSDs. The OSDs know their objects.
>>
>> Deduplication happens when two objects/chunks have the same
>> fingerprint.
>
> This is a little different, though.
>
> The plan so far has been to match this up with the next stage of tiering.
> We'll add the ability for an object to be a 'redirect' and store a bit of
> metadata indicating where to look next. That might be as simple as "go
> look in this cold RADOS pool over there," or a URL into another storage
> system (e.g., a tape archive), or.. a complicated mapping of bytes to CAS
> chunks in another RADOS pool.

"OSD - tiering - object redirects" [42]?

As I understood the design, it is "client driven": clients accessing a
redirected object get a "try again" reply plus metadata from the primary
OSD.

What I'm proposing does not change the client. Plus, all the redirection
and dedup magic happens on the OSDs, so the additional round trip stays
inside the Ceph cluster.

On the other hand, the "layered pool" approach adds additional PGs and
load to the OSDs for work that the clients could do.

> The original thought was that this would just be a regular ReplicatedPG,
> not a new pool type. I haven't thought about what we'd gain by having a
> new pool type. One thing we get by using the existing pool is that we're
> not forced to do the demotion/dedup immediately--we can just store the
> object normally, and dedup it later when we decide it's cold.

Which also means that you could dedup an existing pool after a software
upgrade. Still, the common counterargument against offline dedup applies:
you can't factor in the deduplication ratio up front and end up having to
buy more storage.

> For the CAS pool, the idea would be to use the refcount class, or
> something like it, so that you'd say "write object $hash" and if the
> object already exists it'd increment the ref count. Similarly, when you
> delete the logical object, you do a refcount 'put' on each chunk, and the
> chunk would only go away when the last ref did too. (In practice we need
> to be careful to avoid leaked refs in the case of failures; this would
> probably be done by having a 'deduping' and 'deleting' state on the
> logical object and named references.)

Sounds good. Maybe even with write-once-then-read-only objects like, for
example, Venti?

>> My current milestones:
>> - Develop CAS backend, fingerprinting, recipes store
>> - Support limited set of operations (like EC does)
>> - Support RBD (with/without Cache) and evaluate
>> - Add Chunking, Garbage Collection, ..
>>
>> Currently I'm adding a new PG backend into the OSD code base. I'll
>> push the code to my GitHub clone as soon as it does "something" :)
>
> This would be a good thing to discuss during the Ceph Developer Monthly
> call next Wednesday:
>
> http://tracker.ceph.com/projects/ceph/wiki/Planning
> http://tracker.ceph.com/projects/ceph/wiki/CDM_06-APR-2016

Added. See you Wednesday

~irq0

[42] http://tracker.ceph.com/projects/ceph/wiki/Osd_-_tiering_-_object_redirects

--
Marcel Lauhoff
Mail: lauhoff@uni-mainz.de
XMPP: mlauhoff@jabber.uni-mainz.de
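
The "mapping of bytes to CAS chunks" that a redirect could carry, as discussed in the message above, might look like the sketch below. It assumes fixed-size chunking, SHA-256 fingerprints, and a dictionary as the CAS pool; `CHUNK_SIZE`, `dedup`, and `read_range` are hypothetical names, and the sketch says nothing about whether the client or the OSD resolves the mapping.

```python
import hashlib
from typing import Dict, List, Tuple

CHUNK_SIZE = 4  # unrealistically small, only to keep the demo readable

# One manifest entry: (offset in logical object, length, chunk fingerprint)
Manifest = List[Tuple[int, int, str]]


def dedup(data: bytes, cas: Dict[str, bytes]) -> Manifest:
    """Split data into fixed-size chunks, store each by fingerprint, return the mapping."""
    manifest = []
    for off in range(0, len(data), CHUNK_SIZE):
        chunk = data[off:off + CHUNK_SIZE]
        fp = hashlib.sha256(chunk).hexdigest()
        cas.setdefault(fp, chunk)  # duplicate chunks are stored once
        manifest.append((off, len(chunk), fp))
    return manifest


def read_range(manifest: Manifest, cas: Dict[str, bytes], offset: int, length: int) -> bytes:
    """Reassemble bytes [offset, offset + length) from the chunks the manifest names."""
    end = offset + length
    out = bytearray()
    for off, size, fp in manifest:
        if off + size <= offset or off >= end:
            continue  # chunk lies entirely outside the requested range
        chunk = cas[fp]
        out += chunk[max(offset, off) - off:min(end, off + size) - off]
    return bytes(out)


cas_pool: Dict[str, bytes] = {}
data = b"abcdabcdxyz"
manifest = dedup(data, cas_pool)
print(len(manifest), "manifest entries,", len(cas_pool), "unique chunks")
print(read_range(manifest, cas_pool, 2, 5) == data[2:7])
```

The logical object keeps only the small manifest; the two identical `abcd` chunks share one entry in the CAS pool.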

* Re: Started developing a deduplication feature
From: Marcel Lauhoff @ 2016-04-08 15:01 UTC
To: ceph-devel

Hi list,

short recap of the dedup topic from the CDM on Wednesday:

The main change from the original mail is not to add a PG backend, but
rather use Object Redirects (Tiering v2).
Another backend would have to implement its own replication for
recipes and increase the OSD code base just for dedup. Redirects are
useful beyond deduplication.

The CAS pool design was refined: An object class should handle the ref
counting and content addressing. The pool should also only
allow access through this object class to prevent collisions with
regular objects and support immutable objects.

There was also the idea of client-side deduplication by using metadata
that clients like RGW store. This would save the additional round trip
that object redirects add.

I'll be working on the CAS pool first, since there
is ongoing refactoring in the ReplicatedPG code base. I'll work out a
more detailed design document for the CAS pool soon.

~irq0

--
Marcel Lauhoff
Mail: lauhoff@uni-mainz.de
XMPP: mlauhoff@jabber.uni-mainz.de

* Re: Started developing a deduplication feature
From: Sage Weil @ 2016-04-08 15:18 UTC
To: Marcel Lauhoff; +Cc: ceph-devel

On Fri, 8 Apr 2016, Marcel Lauhoff wrote:
> Hi list,
> short recap of the dedup topic from the CDM on Wednesday:
>
> The main change from the original mail is not to add a PG backend, but
> rather use Object Redirects (Tiering v2).

Let's call this 'Tiering v2', since it won't be based on redirects.
(These were extremely problematic in cache tiering because they prevent us
from maintaining an ordering; we now proxy ops instead.)

At some point soonish we should recreate/update that old blueprint with
the new design.

> Another backend would have to implement its own replication for
> recipes and increase the OSD code base just for dedup. Redirects are
> useful beyond deduplication.
>
> The CAS pool design was refined: An object class should handle the ref
> counting and content addressing.

cls_refcount might be sufficient here; if not it's probably a starting
point.

> The pool should also only
> allow access through this object class to prevent collisions with
> regular objects and support immutable objects.

We decided we can just do this via the cephx capabilities by granting the
client(s) access to the appropriate class only.

> There was also the idea of client-side deduplication by using metadata
> that clients like RGW store. This would save the additional round trip
> that object redirects add.

Presumably the (first) user here is radosgw, which is already writing more
or less immutable chunks and could easily dump them in a CAS pool instead
of a normal replicated pool.

> I'll be working on the CAS pool first, since there
> is ongoing refactoring in the ReplicatedPG code base. I'll work out a
> more detailed design document for the CAS pool soon.

Sounds great. Thanks, Marcel!

sage

* Re: Started developing a deduplication feature
From: Shinobu Kinjo @ 2016-04-08 21:50 UTC
To: Marcel Lauhoff; +Cc: Ceph Development

On Sat, Apr 9, 2016 at 12:01 AM, Marcel Lauhoff <lauhoff@uni-mainz.de> wrote:
>
> Hi list,
> short recap of the dedup topic from the CDM on Wednesday:
>
> The main change from the original mail is not to add a PG backend, but
> rather use Object Redirects (Tiering v2).
> Another backend would have to implement its own replication for
> recipes and increase the OSD code base just for dedup. Redirects are
> useful beyond deduplication.
>
> The CAS pool design was refined: An object class should handle the ref
> counting and content addressing. The pool should also only
> allow access through this object class to prevent collisions with
> regular objects and support immutable objects.
>
> There was also the idea of client-side deduplication by using metadata
> that clients like RGW store. This would save the additional round trip
> that object redirects add.

Sounds good.

> I'll be working on the CAS pool first, since there
> is ongoing refactoring in the ReplicatedPG code base. I'll work out a
> more detailed design document for the CAS pool soon.

I will be able to test against it, if you don't mind.

> ~irq0
> --
> Marcel Lauhoff
> Mail: lauhoff@uni-mainz.de
> XMPP: mlauhoff@jabber.uni-mainz.de

Cheers,
Shinobu

--
Email: shinobu@linux.com
GitHub: shinobu-x
Blog: Life with Distributed Computational System based on OpenSource

* Re: Started developing a deduplication feature
From: Marcel Lauhoff @ 2016-04-12 9:35 UTC
To: skinjo; +Cc: Ceph Development

Shinobu Kinjo <shinobu.kj@gmail.com> writes:
> On Sat, Apr 9, 2016 at 12:01 AM, Marcel Lauhoff <lauhoff@uni-mainz.de> wrote:
>> I'll be working on the CAS pool first, since there
>> is ongoing refactoring in the ReplicatedPG code base. I'll work out a
>> more detailed design document for the CAS pool soon.
>
> I will be able to test against it, if you don't mind.

Sounds good. What do you have in mind?

~irq0

--
Marcel Lauhoff
Mail: lauhoff@uni-mainz.de
XMPP: mlauhoff@jabber.uni-mainz.de

* RE: Started developing a deduplication feature
From: Allen Samuels @ 2016-04-28 21:08 UTC
To: Sage Weil, Marcel Lauhoff; +Cc: ceph-devel@vger.kernel.org

> -----Original Message-----
> From: ceph-devel-owner@vger.kernel.org
> [mailto:ceph-devel-owner@vger.kernel.org] On Behalf Of Sage Weil
> Sent: Friday, April 01, 2016 4:31 PM
> To: Marcel Lauhoff <lauhoff@uni-mainz.de>
> Cc: ceph-devel@vger.kernel.org
> Subject: Re: Started developing a deduplication feature
>
> Hi Marcel,
>
> On Fri, 1 Apr 2016, Marcel Lauhoff wrote:
> > Hi Ceph,
> >
> > deduplication has been discussed on the list a couple of times.
> > Over the next months I'll be working on a prototype.
> >
> > In short: Use a content-addressed storage pool backed by a pool acting
> > as storage and distributed fingerprint index.
> >
> > Two pools: (1) pool that does the content addressing, (2) storage /
> > index pool.
> >
> > OSDs in the first pool readdress and chunk/reassemble objects.
> > They then store the new objects/chunks in a second pool.
>
> I think this is the right architecture for dedup in Ceph, and matches the
> ideas we've been kicking around.
>
> > The first pool uses a new PG backend ("CAS Backend"), while the second
> > can use replication or erasure coding.
> >
> > The CAS backend computes fingerprints for incoming objects and stores
> > the fingerprint <-> original object name mapping.
> > It then forwards the data to a storage pool, addressing the objects by
> > fingerprint (the content defined name).
> >
> > The storage pool therefore serves as a distributed fingerprint index.
> > CRUSH selects the responsible OSDs. The OSDs know their objects.
> >
> > Deduplication happens when two objects/chunks have the same
> > fingerprint.
>
> This is a little different, though.
>
> The plan so far has been to match this up with the next stage of tiering.
> We'll add the ability for an object to be a 'redirect' and store a bit of
> metadata indicating where to look next. That might be as simple as "go
> look in this cold RADOS pool over there," or a URL into another storage
> system (e.g., a tape archive), or.. a complicated mapping of bytes to CAS
> chunks in another RADOS pool.
>
> The original thought was that this would just be a regular ReplicatedPG,
> not a new pool type. I haven't thought about what we'd gain by having a
> new pool type. One thing we get by using the existing pool is that we're
> not forced to do the demotion/dedup immediately--we can just store the
> object normally, and dedup it later when we decide it's cold.

To me, using a replicated pool to store the chunks significantly degrades
the value of deduplication. Also, the usage of a standard RADOS object for
each chunk will severely degrade performance for small chunk sizes at
large data scales.

The advantage of a new pool type is that you can create a metadata
structure that's better crafted to this use case and that uses erasure
coding to really get the full value out of deduplication.

Lots more work of course :(

> For the CAS pool, the idea would be to use the refcount class, or
> something like it, so that you'd say "write object $hash" and if the
> object already exists it'd increment the ref count. Similarly, when you
> delete the logical object, you do a refcount 'put' on each chunk, and the
> chunk would only go away when the last ref did too. (In practice we need
> to be careful to avoid leaked refs in the case of failures; this would
> probably be done by having a 'deduping' and 'deleting' state on the
> logical object and named references.)
>
> > My current milestones:
> > - Develop CAS backend, fingerprinting, recipes store
> > - Support limited set of operations (like EC does)
> > - Support RBD (with/without Cache) and evaluate
> > - Add Chunking, Garbage Collection, ..
> >
> > Currently I'm adding a new PG backend into the OSD code base. I'll
> > push the code to my GitHub clone as soon as it does "something" :)
>
> This would be a good thing to discuss during the Ceph Developer Monthly
> call next Wednesday:
>
> http://tracker.ceph.com/projects/ceph/wiki/Planning
> http://tracker.ceph.com/projects/ceph/wiki/CDM_06-APR-2016
>
> sage
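
Allen's two concerns above (one RADOS object per chunk at small chunk sizes, and replication eating into the dedup savings) can be made concrete with some back-of-the-envelope arithmetic. Every number below is an illustrative assumption and not a figure from the thread: 1 PiB of logical data, a 2:1 dedup ratio, 3x replication, and a 4+2 erasure-coding profile. The function names are hypothetical.

```python
import math

PIB = 2**50
KIB = 2**10
MIB = 2**20


def unique_chunks(logical_bytes: int, chunk_bytes: int, dedup_ratio: float) -> int:
    """Number of chunk objects left after dedup, one object per unique chunk."""
    return math.ceil(logical_bytes / chunk_bytes / dedup_ratio)


def raw_bytes(logical_bytes: int, dedup_ratio: float, overhead: float) -> float:
    """Raw capacity consumed: deduplicated data times the redundancy overhead."""
    return logical_bytes / dedup_ratio * overhead


# Assumed workload: 1 PiB logical data that deduplicates 2:1.
for chunk in (4 * KIB, 64 * KIB, 4 * MIB):
    print(f"chunk={chunk:>8} B -> {unique_chunks(PIB, chunk, 2.0):,} chunk objects")

replicated = raw_bytes(PIB, 2.0, 3.0)  # assumed 3x replication
erasure = raw_bytes(PIB, 2.0, 1.5)     # assumed EC 4+2, overhead (4+2)/4 = 1.5
print(f"replicated: {replicated / PIB:.2f} PiB raw, EC 4+2: {erasure / PIB:.2f} PiB raw")
```

Under these assumptions a 2:1 dedup ratio on a 3x replicated pool still consumes more raw capacity than the logical data size, while the same ratio on the erasure-coded pool consumes less, and the chunk-object count grows by three orders of magnitude between 4 MiB and 4 KiB chunks.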

Thread overview: 8+ messages
2016-04-01 17:25 Started developing a deduplication feature Marcel Lauhoff
2016-04-01 21:31 ` Sage Weil
2016-04-04 12:38   ` Marcel Lauhoff
2016-04-08 15:01     ` Marcel Lauhoff
2016-04-08 15:18       ` Sage Weil
2016-04-08 21:50       ` Shinobu Kinjo
2016-04-12  9:35         ` Marcel Lauhoff
2016-04-28 21:08   ` Allen Samuels