From: Josh Durgin
Subject: Re: erasure coding (sorry)
Date: Thu, 18 Apr 2013 14:08:02 -0700
Message-ID: <517060B2.80706@inktank.com>
References: <20130418162842.0c61d1e2@dieter-t420s>
Sender: ceph-devel-owner@vger.kernel.org
To: Sage Weil
Cc: "Plaetinck, Dieter", ceph-devel@vger.kernel.org, cdl@asgaard.org, danm@annaisystems.com

On 04/18/2013 01:47 PM, Sage Weil wrote:
> On Thu, 18 Apr 2013, Plaetinck, Dieter wrote:
>> sorry to bring this up again, googling revealed some people don't like the subject [anymore].
>>
>> but I'm working on a new +- 3PB cluster for storage of immutable files.
>> and it would be either all cold data, or mostly cold. 150MB avg filesize, max size 5GB (for now)
>> For this use case, my impression is erasure coding would make a lot of sense
>> (though I'm not sure about the computational overhead on storing and loading objects..? outbound traffic would peak at 6 Gbps, but I can make it way less and still keep a large cluster, by taking away the small set of hot files.
>> inbound traffic would be minimal)
>>
>> I know that the answer a while ago was "no plans to implement erasure coding", has this changed?
>> if not, is anyone aware of a similar system that does support it? I found QFS but that's meant for batch processing, has a single 'namenode' etc.
>
> We would love to do it, but it is not a priority at the moment (things
> like multi-site replication are in much higher demand).
> That of course doesn't prevent someone outside of Inktank from working
> on it :)
>
> The main caveat is that it will be complicated. For an initial
> implementation, the full breadth of the rados API probably wouldn't be
> supported for erasure/parity encoded pools (things like rados classes and
> the omap key/value api get tricky when you start talking about parity).
> But for many (or even most) use cases, objects are just bytes, and those
> restrictions are just fine.

I talked to some folks interested in doing a more limited form of this
yesterday. They started a blueprint [1]. One of their ideas was to have
erasure coding done by a separate process (or thread perhaps). It would
use erasure coding on an object and then use librados to store the
erasure-encoded pieces in a separate pool, and finally leave a marker in
place of the original object in the first pool. When the osd detected
this marker, it would proxy the request to the erasure coding
thread/process, which would service the request on the second pool for
reads, and potentially make writes move the data back to the first pool
in a tiering sort of scenario.

I might have misremembered some details, but I think it's an interesting
way to get many of the benefits of erasure coding with a relatively
small amount of work compared to a fully native osd solution.

Josh

[1] http://wiki.ceph.com/01Planning/02Blueprints/Dumpling/Erasure_encoding_as_a_storage_backend
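
To make the marker-and-proxy idea above a bit more concrete, here is a rough Python sketch. It is an illustration under loud assumptions, not what the blueprint specifies: the two pools are plain dicts rather than librados pools, the names TieredStore, demote, encode and decode are made up for this example, and a single XOR parity chunk stands in for a real erasure code (so it survives the loss of only one piece).

```python
from functools import reduce
from typing import List, Optional

K = 4  # number of data chunks per object (hypothetical choice)


def xor_bytes(a: bytes, b: bytes) -> bytes:
    """Byte-wise XOR of two equal-length byte strings."""
    return bytes(x ^ y for x, y in zip(a, b))


def encode(data: bytes, k: int = K) -> List[bytes]:
    """Split data into k zero-padded chunks plus one XOR parity chunk."""
    chunk_len = -(-len(data) // k) or 1  # ceiling division, at least 1 byte
    padded = data.ljust(chunk_len * k, b"\0")
    chunks = [padded[i * chunk_len:(i + 1) * chunk_len] for i in range(k)]
    return chunks + [reduce(xor_bytes, chunks)]


def decode(pieces: List[Optional[bytes]], size: int) -> bytes:
    """Rebuild the object; at most one piece may be missing (None)."""
    missing = [i for i, p in enumerate(pieces) if p is None]
    if len(missing) > 1:
        raise ValueError("single XOR parity can recover only one lost piece")
    pieces = list(pieces)
    if missing:
        present = [p for p in pieces if p is not None]
        pieces[missing[0]] = reduce(xor_bytes, present)
    # Drop the parity chunk and the zero padding.
    return b"".join(pieces[:-1])[:size]


class TieredStore:
    """Toy stand-in for the two pools (dicts, NOT librados)."""

    def __init__(self) -> None:
        self.first_pool = {}   # name -> ("data", bytes) or ("marker", size)
        self.second_pool = {}  # "name.i" -> encoded piece

    def write(self, name: str, data: bytes) -> None:
        # A write lands in the first pool; stale pieces are dropped,
        # which mimics moving the object back in a tiering scenario.
        for i in range(K + 1):
            self.second_pool.pop(f"{name}.{i}", None)
        self.first_pool[name] = ("data", data)

    def demote(self, name: str) -> None:
        """Encode an object into the second pool and leave a marker."""
        kind, data = self.first_pool[name]
        if kind != "data":
            return  # already demoted
        for i, piece in enumerate(encode(data)):
            self.second_pool[f"{name}.{i}"] = piece
        self.first_pool[name] = ("marker", len(data))

    def read(self, name: str) -> bytes:
        kind, value = self.first_pool[name]
        if kind == "data":
            return value
        # Marker found: proxy the read to the erasure-coded pool.
        pieces = [self.second_pool.get(f"{name}.{i}") for i in range(K + 1)]
        return decode(pieces, value)
```

With this toy store, calling demote() on a cold object replaces it with a marker, and a later read() is transparently served from the encoded pieces, even if one of them has been lost. A real implementation would need a proper erasure code and actual pool I/O.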