From: Jim Dowling <jdowling@sics.se>
To: Martin Tippmann <martin.tippmann@gmail.com>
Cc: linux-btrfs@vger.kernel.org
Subject: Re: raid5 + HDFS
Date: Mon, 5 Oct 2015 06:49:51 +0200
Message-ID: <5612016F.3040206@sics.se>
In-Reply-To: <CABL_Pd8Dag2toHtn2_uSkotZoLPktaC=TOZkLiPNjQDevTRcEw@mail.gmail.com>
On 2015-10-05 00:00, Martin Tippmann wrote:
> 2015-10-03 16:50 GMT+02:00 Jim Dowling <jdowling@sics.se>:
>
> > As you point out, HDFS does its own checksumming of blocks, which
> > is needed as blocks are transferred over the network. So yes, it
> > is double checksumming, if you will.
> >
> > We are keeping the data node as it is. The only change needed will
> > be to identify a block device as an "archive" device or a normal
> > device. We're interested in archive devices for this work.
> > The bigger picture is that Apache HDFS is moving towards striping
> > blocks over different data nodes, losing data locality. We are
> > investigating btrfs/raid5 for archived data. Its workload would
> > be much lower than standard.
>
>
> Hi, thanks for the clarification!
>
> [snip]
>
> > So the idea is to erasure code twice and checksum twice. Overall
> > overhead will be about 50%: half of this for raid5, half for HDFS
> > erasure coding.
> > Hypothesis: for cold-storage data, with normally at most one active
> > job per data node, jobs will read/write data faster, improving
> > performance, particularly over 10GbE.
>
>
> btrfs RAID5 should do the job - I don't think the checksumming is
> really a problem, as it's CRC32C, which modern Intel CPUs provide an
> instruction for.
>
> If the performance is not that great, you could try running btrfs on top
> of mdraid RAID5 - mdraid should be more optimized than btrfs at this
> point. If you don't need btrfs snapshots and subvolumes, you could
> implement the HDFS snapshotting using the upcoming XFS reflink support,
> which provides CoW semantics - it should work with HDFS blocks if
> you cp --reflink them for snapshots.
>
> From numbers that were posted here a while ago, mdraid + XFS is at the
> moment quite a bit faster than btrfs - XFS provides metadata
> checksumming (no duplication, though), so you could spare at least the
> double checksumming of data. However, using mdraid has some caveats, as
> it's harder to grow or shrink once configured.
>
> HTH
> Martin
>
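The cp --reflink idea quoted above could be sketched roughly as below. This
is only an illustration under assumptions: the directory names are
hypothetical, the helper names (reflink_copy_cmd, snapshot_blocks) are made
up for this example, and it presumes GNU cp on a filesystem that supports
reflinks. It is not how HDFS implements snapshots.

```python
import subprocess
from pathlib import Path
from typing import List


def reflink_copy_cmd(src: Path, dst: Path) -> List[str]:
    """Build a GNU cp command that clones extents instead of copying data."""
    # --reflink=always makes cp fail instead of silently falling back to a
    # full data copy when the filesystem cannot share extents.
    return ["cp", "--reflink=always", "--", str(src), str(dst)]


def snapshot_blocks(block_dir: Path, snapshot_dir: Path,
                    dry_run: bool = False) -> List[List[str]]:
    """Reflink every regular file in block_dir into snapshot_dir.

    Both directories are hypothetical stand-ins for wherever a data node
    keeps its block files. With dry_run=True nothing is executed or created;
    the commands are only returned.
    """
    if not dry_run:
        snapshot_dir.mkdir(parents=True, exist_ok=True)
    commands = []
    for src in sorted(p for p in block_dir.iterdir() if p.is_file()):
        cmd = reflink_copy_cmd(src, snapshot_dir / src.name)
        commands.append(cmd)
        if not dry_run:
            subprocess.run(cmd, check=True)
    return commands
```

Calling snapshot_blocks(..., dry_run=True) first shows which cp commands
would run without touching the disks.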
Thanks for the tips Martin. We have a bit more research to do before we
get started.
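
The "about 50%" overhead figure quoted earlier can be sanity-checked with a
little arithmetic. The sketch below assumes a 5-disk RAID5 (4 data + 1
parity) and a hypothetical 4+1 HDFS erasure code; neither geometry is stated
in this thread, they are simply chosen so that each layer costs 25%.

```python
from fractions import Fraction


def parity_overhead(data_units: int, parity_units: int) -> Fraction:
    """Extra storage of one redundancy layer, as a fraction of the data."""
    return Fraction(parity_units, data_units)


def combined_overhead(*layer_overheads: Fraction) -> Fraction:
    """Overhead of stacked layers: the expansion factors multiply."""
    factor = Fraction(1)
    for overhead in layer_overheads:
        factor *= 1 + overhead
    return factor - 1


# Assumed geometries, not taken from the thread.
raid5 = parity_overhead(data_units=4, parity_units=1)    # 5-disk RAID5
hdfs_ec = parity_overhead(data_units=4, parity_units=1)  # hypothetical 4+1 code

total = combined_overhead(raid5, hdfs_ec)
print(f"raid5: {float(raid5):.0%}, hdfs: {float(hdfs_ec):.0%}, "
      f"combined: {float(total):.2%}")
```

Under these assumptions the combined figure is 56.25% rather than exactly
50%, because the RAID5 parity also covers the HDFS parity blocks stored on
the same node.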