From: Marcel Lauhoff
Subject: Started developing a deduplication feature
Date: Fri, 1 Apr 2016 19:25:57 +0200
Message-ID: <8737r5w89m.fsf@uni-mainz.de>
To: ceph-devel@vger.kernel.org

Hi Ceph,

deduplication has been discussed on the list a couple of times. Over the
next months I'll be working on a prototype.

In short: Use a content-addressed storage pool, backed by a second pool
that acts as both storage and distributed fingerprint index.

Two pools:
(1) pool that does the content addressing,
(2) storage / index pool.

OSDs in the first pool readdress and chunk/reassemble objects. They then
store the new objects/chunks in the second pool.

The first pool uses a new PG backend ("CAS Backend"), while the second
can use replication or erasure coding.

The CAS backend computes fingerprints for incoming objects and stores
the fingerprint <-> original object name mapping. It then forwards the
data to the storage pool, addressing the objects by fingerprint (the
content-defined name).

The storage pool therefore serves as a distributed fingerprint index:
CRUSH selects the OSDs responsible for a given fingerprint, and each OSD
knows which objects it holds. Deduplication happens when two
objects/chunks have the same fingerprint, because both map to the same
name in the storage pool.

My current milestones:
- Develop CAS backend, fingerprinting, recipe store
- Support a limited set of operations (like EC does)
- Support RBD (with/without cache) and evaluate
- Add chunking, garbage collection, ..

Currently I'm adding a new PG backend to the OSD code base. I'll push
the code to my GitHub clone as soon as it does "something" :)

~irq0

-- 
Marcel Lauhoff
Mail: lauhoff@uni-mainz.de
XMPP: mlauhoff@jabber.uni-mainz.de
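
To illustrate the write path described above, here is a rough sketch in
Python. It is not OSD code and not the planned implementation. The class
names (StoragePool, CasBackend), the use of SHA-256 as the fingerprint,
and the in-memory dicts are all assumptions made for this example. It
fingerprints whole objects only; chunking and garbage collection are
left out, as in the early milestones.

```python
import hashlib


class StoragePool:
    """Hypothetical stand-in for the second (storage / index) pool.

    Objects are addressed by fingerprint, so the pool doubles as the
    fingerprint index: a lookup by name answers "do we have this data?".
    """

    def __init__(self):
        self.objects = {}  # fingerprint -> data

    def put(self, fingerprint, data):
        """Store data under its fingerprint.

        Returns True if the data was newly stored, False if an object
        with this fingerprint already existed (deduplicated).
        """
        if fingerprint in self.objects:
            return False
        self.objects[fingerprint] = data
        return True

    def get(self, fingerprint):
        return self.objects[fingerprint]


class CasBackend:
    """Hypothetical stand-in for the first (content-addressing) pool."""

    def __init__(self, storage):
        self.storage = storage
        self.names = {}  # original object name -> fingerprint

    def write(self, name, data):
        # SHA-256 is an assumption; the mail does not name a hash function.
        fingerprint = hashlib.sha256(data).hexdigest()
        self.names[name] = fingerprint
        return self.storage.put(fingerprint, data)

    def read(self, name):
        return self.storage.get(self.names[name])
```

Writing the same bytes under two different object names leaves a single
object in the storage pool, while both names remain readable through the
name-to-fingerprint mapping.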