Re: raid5 + HDFS - Jim Dowling


From: Jim Dowling <jdowling@sics.se>
To: Martin Tippmann <martin.tippmann@gmail.com>
Cc: linux-btrfs@vger.kernel.org
Subject: Re: raid5 + HDFS
Date: Mon, 5 Oct 2015 06:49:51 +0200
Message-ID: <5612016F.3040206@sics.se>
In-Reply-To: <CABL_Pd8Dag2toHtn2_uSkotZoLPktaC=TOZkLiPNjQDevTRcEw@mail.gmail.com>



On 2015-10-05 00:00, Martin Tippmann wrote:
> 2015-10-03 16:50 GMT+02:00 Jim Dowling <jdowling@sics.se>:
>
>     As you point out, hdfs does its own checksumming of blocks, which
>     is needed as blocks are transferred over the network. So, yes it
>     is double checksumming if you will.
>
>     We are keeping the data node as it is. The only change needed will
>     be to identify a block device as an "archive" device or a normal
>     device. We're interested in archive devices for this work.
>     The bigger picture is that Apache HDFS is moving towards striping
>     blocks over different data nodes, losing data locality. We are
>     investigating btrfs/raid5 for archived data. Its workload would
>     be much lower than standard.
>
>
> Hi, thanks for the clarification!
>
> [snip]
>
>     So the idea is to erasure code twice, checksum twice. Overall
>     overhead will be about 50%, half of this for raid5, half hdfs
>     erasure coding.
>     Hypothesis: For cold storage data, with normally at most one
>     active job per data node, jobs will read/write data faster,
>     improving performance, particularly over 10GbE.
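
The overhead estimate quoted above can be checked with a few lines of
arithmetic. This is only a sketch: the thread does not say how many
disks the raid5 array would have or which erasure-coding parameters
HDFS would use, so the 5-disk array and the 10 data + 2 parity scheme
below are hypothetical values picked because they land near 50%. The
function names are likewise made up for illustration.

```python
def raid5_overhead(disks: int) -> float:
    """Extra raw capacity per unit of stored data for RAID5.

    RAID5 spends one device's worth of space on parity, so n devices
    hold n - 1 devices' worth of data.
    """
    if disks < 3:
        raise ValueError("RAID5 needs at least 3 devices")
    return 1.0 / (disks - 1)


def erasure_overhead(data_units: int, parity_units: int) -> float:
    """Extra capacity for an erasure code with the given data/parity split."""
    if data_units < 1 or parity_units < 0:
        raise ValueError("invalid erasure-coding parameters")
    return parity_units / data_units


def combined_overhead(*layers: float) -> float:
    """Stacked layers multiply: each layer also protects the parity below it."""
    total = 1.0
    for overhead in layers:
        total *= 1.0 + overhead
    return total - 1.0


if __name__ == "__main__":
    raid = raid5_overhead(5)          # hypothetical 5-disk array
    ec = erasure_overhead(10, 2)      # hypothetical 10 data + 2 parity
    print(f"raid5:    {raid:.0%}")
    print(f"erasure:  {ec:.0%}")
    print(f"combined: {combined_overhead(raid, ec):.0%}")
```

Because the layers multiply rather than add, the combined figure is
slightly higher than the sum of the two individual overheads.
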
>
>
> btrfs RAID5 should do the job - I don't think the checksumming is 
> really a problem as it's CRC32C that modern Intel CPUs provide an 
> instruction for.
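
For reference, CRC32C is the CRC-32 variant that uses the Castagnoli
polynomial. The bit-by-bit version below is a minimal sketch to show
what gets computed per block; it is not how btrfs or HDFS implement
it, and real implementations use the hardware instruction or lookup
tables and are far faster. The function name and the chaining
convention (pass the previous result as `crc`) are choices made for
this example.

```python
def crc32c(data: bytes, crc: int = 0) -> int:
    """Bitwise CRC32C (Castagnoli), reflected form.

    Pass the previous return value as `crc` to checksum data in chunks.
    """
    poly = 0x82F63B78  # Castagnoli polynomial, bit-reversed
    crc ^= 0xFFFFFFFF  # initial inversion (also undoes a prior final XOR)
    for byte in data:
        crc ^= byte
        for _ in range(8):
            # Shift out one bit; fold in the polynomial if that bit was set.
            crc = (crc >> 1) ^ (poly if crc & 1 else 0)
    return crc ^ 0xFFFFFFFF  # final inversion


if __name__ == "__main__":
    # Standard check input for CRC algorithms.
    print(hex(crc32c(b"123456789")))
```

Checksumming the same block once in the filesystem and once in HDFS
simply runs this computation twice over the same bytes.
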
>
> If the performance is not as great you could try doing btrfs on top of 
> mdraid RAID5 - mdraid should be more optimized than btrfs at this 
> point. If you don't need btrfs snapshots and subvolumes you could 
> implement the HDFS snapshotting using the upcoming XFS reflink support 
> - that provides CoW semantics - should be working with HDFS blocks if 
> you cp --reflink them for Snapshots.
>
> From numbers that got posted here a while ago, mdraid + XFS is at the 
> moment quite a bit faster than btrfs - XFS provides metadata 
> checksumming (no duplication though) so you could spare at least the 
> double checksumming of data. However using mdraid has some caveats as 
> it's able to grow or shrink once configured.
>
> HTH
> Martin
>
Thanks for the tips Martin. We have a bit more research to do before we 
get started.

