From: Marcel Lauhoff <lauhoff@uni-mainz.de>
To: Sage Weil <sage@newdream.net>
Cc: ceph-devel@vger.kernel.org
Subject: Re: Started developing a deduplication feature
Date: Mon, 4 Apr 2016 14:38:45 +0200
Message-ID: <87wpodv99m.fsf@uni-mainz.de>
In-Reply-To: <alpine.DEB.2.11.1604011723500.22014@cpach.fuggernut.com>
Hi Sage,
Sage Weil <sage@newdream.net> writes:
> On Fri, 1 Apr 2016, Marcel Lauhoff wrote:
>> The first pool uses a new PG backend ("CAS Backend"),
>> while the second can use replication or erasure coding.
>>
>> The CAS backend computes fingerprints for incoming objects and
>> stores the fingerprint <-> original object name mapping.
>> It then forwards the data to a storage pool, addressing the objects by
>> fingerprint (the content defined name).
>>
>> The storage pool therefore serves as a distributed fingerprint index.
>> CRUSH selects the responsible OSDs. The OSDs know their objects.
>>
>> Deduplication happens when two objects/chunks have the same
>> fingerprint.
>
> This is a little different, though.
>
> The plan so far has been to match this up with the next stage of tiering.
> We'll add the ability for an object to be a 'redirect' and store a bit of
> metadata indicating where to look next. That might be as simple as "go
> look in this cold RADOS pool over there," or a URL into another storage
> system (e.g., a tape archive), or.. a complicated mapping of bytes to CAS
> chunks in another rados pool.
"OSD - tiering - object redirects" [42]?
As I understood the design, it is "client driven": Clients accessing a
redirected object get a reply "try again" + metadata from the primary
OSD.
What I'm proposing does not change the client. Plus, all the redirection
and dedup magic happens on the OSDs. Therefore the additional round
trip stays in the Ceph cluster.
On the other hand, the "layered pool" approach adds extra PGs and puts
load on the OSDs for work that the clients could do themselves.
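
A minimal sketch of the layered write path quoted above, to make the
comparison concrete. Everything here is an assumption for illustration:
the class names (CasPool, StoragePool), the in-memory dicts standing in
for pools, and SHA-256 as the fingerprint function. None of it is Ceph
code or a Ceph API.

```python
import hashlib


class StoragePool:
    """Toy stand-in for the second (replicated or EC) pool.

    Objects are addressed by fingerprint, so the pool doubles as the
    fingerprint index. Hypothetical class, not part of Ceph.
    """

    def __init__(self):
        self.objects = {}  # fingerprint -> data

    def write(self, fingerprint, data):
        # Return True if the data was newly stored, False if an object
        # with this fingerprint already existed (a dedup hit).
        if fingerprint in self.objects:
            return False
        self.objects[fingerprint] = data
        return True

    def read(self, fingerprint):
        return self.objects[fingerprint]


class CasPool:
    """Toy stand-in for the first pool with the proposed CAS backend."""

    def __init__(self, storage):
        self.storage = storage
        self.recipes = {}  # original object name -> fingerprint

    def write(self, name, data):
        # Fingerprint the incoming object (SHA-256 is an assumption).
        fingerprint = hashlib.sha256(data).hexdigest()
        # Remember which content the client-visible name points to.
        self.recipes[name] = fingerprint
        # Forward the data, addressed by its content-defined name.
        return self.storage.write(fingerprint, data)

    def read(self, name):
        # The client only ever sees the original name; the lookup by
        # fingerprint stays behind this interface.
        return self.storage.read(self.recipes[name])
```

Writing the same bytes under two names leaves a single object in the
storage pool and two entries in the name mapping. The client-facing
read() and write() signatures never expose the fingerprint, which is the
"does not change the client" property.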
> The original thought was that this would just be a regular ReplicatedPG,
> not a new pool type. I haven't thought about what we'd gain by having a
> new pool type. One thing we get by using the existing pool is that we're
> not forced to do the demotion/dedup immediately--we can just store the
> object normally, and dedup it later when we decide it's cold.
That also means you could dedup an existing pool after a software upgrade.
Still, the common argument against offline dedup applies: you can't factor
the deduplication ratio into capacity planning, so you end up having to buy
more storage.
> For the CAS pool, the idea would be to use the refcount class, or
> something like it, so that you'd say "write object $hash" and if the
> object already exists it'd increment the ref count. Similarly, when you
> delete the logical object, you do a refcount 'put' on each chunk, and the
> chunk would only go away when the last ref did too. (In practice we need
> to be careful to avoid leaked refs in the case of failures; this would
> probably be done by having a 'deduping' and 'deleting' state on the
> logical object and named references.)
Sounds good. Maybe even with write-once-then-read-only objects like, for
example, Venti?
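
A rough sketch of the reference-counting idea with write-once chunks,
under stated assumptions: the class RefCountedChunkStore, its get/put
methods, and SHA-256 as the chunk hash are made up for illustration and
are not the interface of the refcount class. It also leaves out the
'deduping'/'deleting' states and named references needed to avoid leaked
refs on failure.

```python
import hashlib


class RefCountedChunkStore:
    """Toy in-memory CAS pool with per-chunk reference counts.

    Chunks are write-once: data is stored on the first reference and
    never modified afterwards. Hypothetical class, not Ceph code.
    """

    def __init__(self):
        self.chunks = {}  # fingerprint -> [data, refcount]

    def get(self, data):
        """Take a reference: store the chunk if new, else bump the count."""
        fingerprint = hashlib.sha256(data).hexdigest()
        entry = self.chunks.get(fingerprint)
        if entry is None:
            self.chunks[fingerprint] = [data, 1]
        else:
            entry[1] += 1
        return fingerprint

    def put(self, fingerprint):
        """Drop a reference and return the number of references left.

        The chunk is removed only when the last reference goes away.
        """
        entry = self.chunks[fingerprint]
        entry[1] -= 1
        if entry[1] == 0:
            del self.chunks[fingerprint]
            return 0
        return entry[1]
```

Deleting a logical object would then call put() once per chunk in its
recipe; a chunk shared with another object survives until that object is
deleted too.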
>> My current milestones:
>> - Develop CAS backend, fingerprinting, recipes store
>> - Support limited set of operations (like EC does)
>> - Support RBD (with/without Cache) and evaluate
>> - Add Chunking, Garbage Collection, ..
>>
>> Currently I'm adding a new PG backend into the OSD code base. I'll
>> push the code to my GitHub clone as soon as it does "something" :)
>
> This would be a good thing to discuss during the Ceph Developer Monthly
> call next Wednesday:
>
> http://tracker.ceph.com/projects/ceph/wiki/Planning
> http://tracker.ceph.com/projects/ceph/wiki/CDM_06-APR-2016
Added. See you Wednesday.
~irq0
[42] http://tracker.ceph.com/projects/ceph/wiki/Osd_-_tiering_-_object_redirects
--
Marcel Lauhoff
Mail: lauhoff@uni-mainz.de
XMPP: mlauhoff@jabber.uni-mainz.de
Thread overview: 8+ messages
2016-04-01 17:25 Started developing a deduplication feature Marcel Lauhoff
2016-04-01 21:31 ` Sage Weil
2016-04-04 12:38 ` Marcel Lauhoff [this message]
2016-04-08 15:01 ` Marcel Lauhoff
2016-04-08 15:18 ` Sage Weil
2016-04-08 21:50 ` Shinobu Kinjo
2016-04-12 9:35 ` Marcel Lauhoff
2016-04-28 21:08 ` Allen Samuels