From: Igor Fedotov <ifedotov@mirantis.com>
To: Gregory Farnum <gfarnum@redhat.com>
Cc: Sage Weil <sage@newdream.net>, ceph-devel <ceph-devel@vger.kernel.org>
Subject: Re: Adding Data-At-Rest compression support to Ceph
Date: Fri, 25 Sep 2015 16:16:23 +0300
Message-ID: <56054927.2040008@mirantis.com>
In-Reply-To: <CAJ4mKGaJY31=WbTPFfpWkNho0zSb__dYMbjN6W9jeQTKiGvxyw@mail.gmail.com>
On 24.09.2015 21:10, Gregory Farnum wrote:
> On Thu, Sep 24, 2015 at 8:13 AM, Igor Fedotov <ifedotov@mirantis.com> wrote:
>> On 23.09.2015 21:03, Gregory Farnum wrote:
>> Okay, that's acceptable, but that metadata then gets pretty large.
>> You would need to store an offset, for each chunk in the PG, and for
>> each individual write. (And even then you'd have to read an entire
>> write at a time to make sure you get the data requested, even if they
>> only want a small portion of it.) If you're doing it this way, then
>> realize we've also got a problem with recovery: we can't lose those
>> offsets. Which means they need to be preserved at all costs. So that
>> means for each stripe unit you'd store them on the primary (for easy
>> access) and on the replica (so they have the same lifecycle as the
>> data they're mapping), which means the replicas need to be
>> compression-aware. Which is good, since I think they'd need to be
>> compression-aware for scrubbing and things as well. And then when you
>> lose the primary the next guy who's reconstructing would need to, uh,
>> ask each shard for the uncompressed version of the data?
You are absolutely right about the importance of the metadata and about
replicas needing to be compression-aware. The good news is that this is
very similar to the current EC pool implementation. Each append to an EC
pool updates some specific metadata (hash info) that is propagated to all
replicas, and each replica is able to restore the EC-encoded data when
the primary is lost. IMO such a replica simply becomes the new primary.
And yes, the reconstructing entity collects shards from multiple OSDs.
Moreover, the primary does the same during a regular read. Thus all of
these mechanics already exist for EC pools.
>> If we were going to limit this to EC pools I think we should just do
>> it at the replica in the FileStore or something, transparently to the
>> wire and recovery protocols. While the compression would help on
>> 1GigE networks, on 10GigE I think the CPU costs of compression
>> outweigh any bandwidth efficiencies we'd get...
This is definitely worth considering, but there is one thing to mention
here. In general, from a CPU load perspective, it makes little difference
where compression is performed: at the primary OSD or at a replica node.
Each replica node can be the primary for some other object, so its CPU
would be used for that object's compression anyway.
E.g.
There are three nodes: node1, node2, node3.
Three objects are written to an EC pool.
They have different primaries: node1, node2, node3 respectively,
and all nodes are used to store the resulting EC shards.
Original disposition after EC:
obj1 -> shard1_1, shard1_2, shard1_3. (performed at node1)
obj2 -> shard2_1, shard2_2, shard2_3. (performed at node2)
obj3 -> shard3_1, shard3_2, shard3_3. (performed at node3)
Stored data disposition can be:
node1: shard1_1, shard2_2, shard3_3
node2: shard1_3, shard2_1, shard3_2
node3: shard1_2, shard2_3, shard3_1
Thus each node has to deal with 3 shards no matter where the
compression functionality lives: each node has to compress 3 shards.
If compression is done at the primary, node1 compresses shard1_1, shard1_2, shard1_3.
If compression is done at the replica, node1 compresses shard1_1, shard2_2, shard3_3.
The same applies to the other nodes.
As a result you will have a similar CPU load distribution among nodes
under Ceph cluster load for both compression approaches.
Actually, compression at the primary before EC even has some benefit:
before EC each object amounts to only two shards' worth of data, so
there is less data to compress.
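To make the counting above easy to check, here is a minimal Python
sketch. It only tallies how many shards each node would compress under
the two approaches, using the placement from the example. The helper
names (object_of, load_at_primary, load_at_replica) are made up for
illustration and have nothing to do with actual Ceph code.

```python
from collections import Counter

# Placement from the example: node -> EC shards stored on that node.
placement = {
    "node1": ["shard1_1", "shard2_2", "shard3_3"],
    "node2": ["shard1_3", "shard2_1", "shard3_2"],
    "node3": ["shard1_2", "shard2_3", "shard3_1"],
}

# Primary node of each object, as in the example.
primary = {"obj1": "node1", "obj2": "node2", "obj3": "node3"}


def object_of(shard):
    """Map a shard name such as 'shard2_3' to its object, 'obj2'."""
    return "obj" + shard[len("shard"):].split("_")[0]


def load_at_primary():
    """Shards compressed per node if the primary compresses every shard
    of the objects it is primary for."""
    load = Counter()
    for shards in placement.values():
        for shard in shards:
            load[primary[object_of(shard)]] += 1
    return dict(load)


def load_at_replica():
    """Shards compressed per node if each node compresses what it stores."""
    return {node: len(shards) for node, shards in placement.items()}


print(load_at_primary())
print(load_at_replica())
```

Both functions report three shards per node for this placement. The
sketch counts shards only; it does not model shard sizes or the
compress-before-EC variant.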
>>
>> Well I agree that have compression support for both replicated and EC pools
>> is better.
>> But random access ( and probably other advanced features ) requires much
>> more complex data handling that also brings additional overhead. Actually I
>> suppose EC pools have such limitations due to these reasons. Thus my
>> original idea was to simplify compression implementation from one side and
>> make it in-line with EC usage from another. The latter makes sense since
>> compression and EC have pretty the same reasons for implementation.
> Well, EC pools still support random reads, I think? Or at least
> reading along stripes, which for the purpose of this discussion is
> almost the same.
Yeah, random reads are possible for EC pools. But they aren't the major
issue IMO. It's random writes that cause the headache.
On such a write one has to decompress the existing block, merge it with
the new data, compress it again and then save it to disk, bearing in mind
that the block size has changed.
Or implement a sort of journal where new writes are saved separately, so
that data has to be reconstructed from this journal on read, plus some
garbage collection... AFAIK ZBD, which I mentioned before, works this
way.
See
http://users.ics.forth.gr/~bilas/pdffiles/makatos-snapi10.pdf
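A rough Python sketch of the two options for random writes, just to
make the trade-off concrete. It uses zlib on an in-memory block; the
names overwrite_in_place and JournaledBlock are hypothetical, and this
is neither how ZBD is implemented nor a proposal for the Ceph code.

```python
import zlib


def overwrite_in_place(blob, offset, data):
    """Option 1: decompress the block, merge the new data, recompress.
    The returned blob generally has a different size than the old one."""
    plain = bytearray(zlib.decompress(blob))
    end = offset + len(data)
    if end > len(plain):
        plain.extend(bytes(end - len(plain)))  # zero-fill up to the new end
    plain[offset:end] = data
    return zlib.compress(bytes(plain))


class JournaledBlock:
    """Option 2: keep the compressed base untouched and log new writes."""

    def __init__(self, plain):
        self.base = zlib.compress(plain)
        self.journal = []  # (offset, data) pairs in arrival order

    def write(self, offset, data):
        # Cheap write path: nothing is decompressed or recompressed.
        self.journal.append((offset, bytes(data)))

    def read(self):
        # Reads pay the cost: decompress the base and replay the journal.
        plain = bytearray(zlib.decompress(self.base))
        for offset, data in self.journal:
            end = offset + len(data)
            if end > len(plain):
                plain.extend(bytes(end - len(plain)))
            plain[offset:end] = data
        return bytes(plain)

    def compact(self):
        # Garbage collection: fold the journal back into a new base.
        self.base = zlib.compress(self.read())
        self.journal.clear()
```

With option 1 every small write rewrites the whole block and the
compressed size changes, so the block may need to be relocated. With
option 2 writes are cheap, but reads and the compaction step carry the
extra work.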
> In any case, as Sam said I think judging these proposals well will
> require actually going through the data structure and algorithms
> design work for each one and comparing. Unfortunately I've no time to
> do that, but I'd definitely like to see two real approaches
> well-sketched-out before any work is spent on coding one. -Greg
Got it, will try to prepare some draft...
Thanks,
Igor