From mboxrd@z Thu Jan 1 00:00:00 1970
Subject: Re: raid5 + HDFS
To: Martin Tippmann
Cc: linux-btrfs@vger.kernel.org
From: Jim Dowling
Message-ID: <5612016F.3040206@sics.se>
References: <560F0014.9020905@sics.se> <560FEB23.4040004@sics.se>
Date: Mon, 5 Oct 2015 06:49:51 +0200
MIME-Version: 1.0
Content-Type: text/plain; charset=utf-8; format=flowed
Sender: linux-btrfs-owner@vger.kernel.org

On 2015-10-05 00:00, Martin Tippmann wrote:
> 2015-10-03 16:50 GMT+02:00 Jim Dowling:
>
>> As you point out, HDFS does its own checksumming of blocks, which is
>> needed as blocks are transferred over the network. So, yes, it is
>> double checksumming, if you will.
>>
>> We are keeping the data node as it is. The only change needed will be
>> to identify a block device as an "archive" device or a normal device.
>> We're interested in archive devices for this work.
>> The bigger picture is that Apache HDFS is moving towards striping
>> blocks over different data nodes, losing data locality. We are
>> investigating btrfs/RAID5 for archived data. Its workload would be
>> much lower than standard.
>
> Hi, thanks for the clarification!
>
> [snip]
>
>> So the idea is to erasure code twice and checksum twice. Overall
>> overhead will be about 50%: half of this for RAID5, half for HDFS
>> erasure coding.
>> Hypothesis: for cold-storage data, with normally at most one active
>> job per data node, jobs will read/write data faster, improving
>> performance, particularly over 10GbE.
>
> btrfs RAID5 should do the job - I don't think the checksumming is
> really a problem, as it's CRC32C, which modern Intel CPUs provide an
> instruction for.
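The ~50% figure can be sanity-checked with back-of-the-envelope numbers. The exact RAID width and HDFS Reed-Solomon scheme aren't stated in this thread, so the parameters below are illustrative assumptions only:

```python
# Illustrative parameters (not from this thread): a 5-disk RAID5 stores
# 1 parity block per 4 data blocks on each node, and a hypothetical
# RS(8,2) HDFS erasure-coding policy stores 2 parity cells per 8 data
# cells across the cluster.
raid5_overhead = 1 / 4        # 25% extra raw capacity per data node
hdfs_ec_overhead = 2 / 8      # 25% extra blocks cluster-wide

# The two overheads compound, because HDFS parity blocks are themselves
# stored on RAID5 and so carry RAID5 parity of their own.
total = (1 + raid5_overhead) * (1 + hdfs_ec_overhead) - 1
print(f"combined storage overhead: {total:.1%}")  # in the ~50% ballpark
```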
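Agreed - and for anyone reading along, CRC32C differs from the familiar zlib CRC32 only in its (Castagnoli) polynomial. A slow pure-Python reference sketch of the checksum that the SSE4.2 instruction computes hardware-accelerated (real code would use the hardware op or a table-driven loop):

```python
# Bitwise CRC32C: reflected Castagnoli polynomial 0x82F63B78,
# init 0xFFFFFFFF, final XOR 0xFFFFFFFF. Reference implementation
# only -- O(8n) bit operations, purely for illustration.
def crc32c(data: bytes) -> int:
    crc = 0xFFFFFFFF
    for byte in data:
        crc ^= byte
        for _ in range(8):
            crc = (crc >> 1) ^ (0x82F63B78 if crc & 1 else 0)
    return crc ^ 0xFFFFFFFF

# Standard check value for CRC-32C:
assert crc32c(b"123456789") == 0xE3069283
```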
> If the performance is not as great, you could try doing btrfs on top
> of mdraid RAID5 - mdraid should be more optimized than btrfs at this
> point. If you don't need btrfs snapshots and subvolumes, you could
> implement the HDFS snapshotting using the upcoming XFS reflink
> support, which provides CoW semantics - it should work with HDFS
> blocks if you cp --reflink them for snapshots.
>
> From numbers that got posted here a while ago, mdraid + XFS are at
> the moment quite a bit faster than btrfs - XFS provides metadata
> checksumming (no duplication, though), so you could at least spare
> the double checksumming of data. However, using mdraid has some
> caveats, as it's harder to grow or shrink once configured.
>
> HTH
> Martin

Thanks for the tips Martin. We have a bit more research to do before we
get started.
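For the record, the cp --reflink snapshotting Martin suggests would look roughly like this - the paths and block name below are made up for illustration:

```shell
# Hypothetical HDFS block-file layout. A reflink copy shares extents
# with the source (copy-on-write) instead of duplicating the data,
# which suits snapshotting immutable HDFS block files.
mkdir -p /tmp/hdfs-demo/current /tmp/hdfs-demo/snapshot-1
echo "block contents" > /tmp/hdfs-demo/current/blk_1073741825

# --reflink=auto makes a CoW clone where the filesystem supports it
# (btrfs, XFS with reflink) and falls back to a plain copy elsewhere,
# so this sketch runs anywhere.
cp --reflink=auto /tmp/hdfs-demo/current/blk_1073741825 \
   /tmp/hdfs-demo/snapshot-1/blk_1073741825
```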