Re: Adding Data-At-Rest compression support to Ceph

From: Igor Fedotov <ifedotov@mirantis.com>
To: Gregory Farnum <gfarnum@redhat.com>
Cc: Sage Weil <sage@newdream.net>, ceph-devel <ceph-devel@vger.kernel.org>
Subject: Re: Adding Data-At-Rest compression support to Ceph
Date: Thu, 24 Sep 2015 18:13:34 +0300
Message-ID: <5604131E.2030408@mirantis.com>
In-Reply-To: <CAJ4mKGZLc1AzAbhEKpjSdUd21dXWgVxiLjjETHuP+EwVCA8EoA@mail.gmail.com>

On 23.09.2015 21:03, Gregory Farnum wrote:
> On Wed, Sep 23, 2015 at 6:15 AM, Sage Weil <sage@newdream.net> wrote:
>>>>
>>>> The idea of making the primary responsible for object compression
>>>> really concerns me. It means for instance that a single random access
>>>> will likely require access to multiple objects, and breaks many of the
>>>> optimizations we have right now or in the pipeline (for instance:
>>>> direct client access).
>> Could you please elaborate why multiple objects access is required on single
>> random access?
> It sounds to me like you were planning to take an incoming object
> write, compress it, and then chunk it. If you do that, the symbols
> ("abcdefgh = a", "ijklmnop = b", etc) for the compression are likely
> to reside in the first object and need to be fetched for each read in
> other objects.
Gregory,
by the symbols ("abcdefgh = a", etc.) do you mean a kind of compressor
dictionary?
And is your assumption that such a dictionary is built on the first
write, saved, and reused by any subsequent reads?
I think that's not the case: it's better to compress each write
independently. Then there is no need to access the "dictionary" object
(i.e. the first object with these symbols) on every read operation. A
read uses the compressed block data only.
Yes, this might reduce the overall compression ratio, but I think
that's acceptable.
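
To illustrate what I mean by independent compression, here is a minimal
sketch. It assumes zlib as the compressor and uses a hypothetical helper
compress_write(); neither is a proposal for the actual implementation.
Each write gets its own compressor state, so any block can be
decompressed without reading the first one.

```python
import zlib

def compress_write(data: bytes) -> bytes:
    # Hypothetical helper: every call starts from a fresh compressor,
    # so no dictionary or history is shared between writes.
    return zlib.compress(data)

# Three separate appends to the same logical object.
writes = [b"abcdefgh" * 512, b"ijklmnop" * 512, b"abcdefgh" * 512]
blocks = [compress_write(w) for w in writes]

# Any block can be read back on its own, without the first block.
second = zlib.decompress(blocks[1])

# The price: one stream over all the data could reuse earlier history,
# so the independent blocks may add up to more bytes in total.
independent_size = sum(len(b) for b in blocks)
single_stream_size = len(zlib.compress(b"".join(writes)))
print(independent_size, single_stream_size)
```

Comparing the two printed sizes shows the ratio trade-off mentioned
above for this particular input.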
>> In my opinion we need to access absolutely the same object set as before: in
>> EC pool each appended block is split into multiple shards that go to
>> respective OSDs. In general case one has to retrieve a set of adjacent
>> shards from several OSDs on single read request.
> Usually we just need to get the object info from the primary and then
> read whichever object has the data for the requested region. If the
> region spans a stripe boundary we might need to get two, but often we
> don't...
With the independent block compression mentioned above, the scenario is
the same. The only thing we need in order to find the proper compressed
block is a mapping from original data offsets to the compressed ones. We
can store this as object metadata. Thus the only extra thing we need on
each read is the object metadata.
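
A rough sketch of such a mapping, just to make the idea concrete. The
Extent record, its field names, and extents_for_read() are hypothetical
and are not taken from any existing Ceph code; the sizes are made up.
The map is assumed to be sorted by logical offset and kept in the object
metadata.

```python
from bisect import bisect_right
from dataclasses import dataclass

@dataclass
class Extent:
    logical_off: int  # offset in the original (uncompressed) object
    logical_len: int  # uncompressed length of the block
    stored_off: int   # offset of the compressed block as stored
    stored_len: int   # compressed length of the block

def extents_for_read(extent_map, offset, length):
    """Return the extents overlapping [offset, offset + length)."""
    if length <= 0:
        return []
    starts = [e.logical_off for e in extent_map]
    # Last extent that starts at or before the requested offset.
    i = max(bisect_right(starts, offset) - 1, 0)
    end = offset + length
    hits = []
    while i < len(extent_map) and extent_map[i].logical_off < end:
        e = extent_map[i]
        if e.logical_off + e.logical_len > offset:
            hits.append(e)
        i += 1
    return hits

# Three appended 4 KiB blocks with made-up compressed sizes.
extent_map = [
    Extent(0, 4096, 0, 1000),
    Extent(4096, 4096, 1000, 1500),
    Extent(8192, 4096, 2500, 900),
]

inside = extents_for_read(extent_map, 5000, 100)    # within one block
spanning = extents_for_read(extent_map, 4000, 200)  # crosses a boundary
```

A read inside one block touches a single stored range; a read crossing
a block boundary touches two, which matches the uncompressed case.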
>> In case of compression the
>> only difference is in data range that compressed shard set occupy. I.e. we
>> simply need to translate requested data range to the actually stored one and
>> retrieve that data from OSDs. What's missed?
>>> And apparently only the EC pool will support
>>> compression, which is frustrating for all the replicated pool users
>>> out there...
>> In my opinion  replicated pool users should consider EC pool usage first if
>> they care about space saving. They automatically gain 50% space saving this
>> way. Compression brings even more saving but that's rather the second step
>> on this way.
> EC pools have important limitations that replicated pools don't, like
> not working for object classes or allowing random overwrites. You can
> stick a replicated cache pool in front but that comes with another
> whole can of worms. Anybody with a large enough proportion of active
> data won't find that solution suitable but might still want to reduce
> space required where they can, like with local compression.
Well, I agree that having compression support for both replicated and
EC pools is better.
But random access (and probably other advanced features) requires much
more complex data handling, which also brings additional overhead.
Actually, I suppose EC pools have such limitations for exactly these
reasons.
Thus my original idea was to simplify the compression implementation on
the one hand and to keep it in line with EC usage on the other. The
latter makes sense since compression and EC have pretty much the same
motivation.

And just for the sake of my education, could you please mention or point
me to the existing issues with cache+EC pool usage?
How widely are EC pools used in production at all? Or is that rather an
experimental/secondary option?
>>> Is there some reason we don't just want to apply encryption across an
>>> OSD store? Perhaps doing it on the filesystem level is the wrong way
>>> (for reasons named above) but there are other mechanisms like inline
>>> block device compression that I think are supposed to work pretty
>>> well.
>> If I understand the idea of inline block device compression correctly it has
>> some of drawbacks similar to FS compression approach. Ones to mention:
>> * Less flexibility - per device compression only, no way to have per-pool
>> compression. No control on the compression process.
> What would the use case be here? I can imagine not wanting to slow
> down your cache pools with it or something (although realistically I
> don't think that's a concern unless the sheer CPU usage is a problem
> with frequent writes), but those would be on separate OSDs/volumes
> anyway
Well, I can imagine the need to have compression for some specific
backing pools (e.g. with rarely accessed or highly compressible data)
and to disable it for others, e.g. where the original data is
non-compressible (either already compressed or encrypted).
Potentially we could even have an option to control compression on a
per-object basis and provide hints that let clients enable it for
specific use cases.
Another feature that might be useful is the ability to disable and
re-enable compression during the OSD life cycle, e.g. when an
administrator realizes that it's not appropriate for their use case. I
doubt that's easy to do when compression is performed at the device
level.
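
As a sketch of why this is easier above the device level: if every
stored block carries a flag saying whether it is compressed, then
compression can be switched on and off at any time, and non-compressible
data can be stored raw. The names below (store_block, load_block, the
RAW/ZLIB flags, the min_saving threshold) are hypothetical, and zlib is
only an assumption for illustration.

```python
import os
import zlib

RAW, ZLIB = 0, 1  # hypothetical per-block flags kept in metadata

def store_block(data, compression_enabled=True, min_saving=0.125):
    """Return (flag, payload); keep the data raw unless compressing pays off."""
    if compression_enabled:
        packed = zlib.compress(data)
        # Keep the compressed form only if it saves at least min_saving.
        if len(packed) <= len(data) * (1 - min_saving):
            return ZLIB, packed
    return RAW, data

def load_block(flag, payload):
    # The flag alone tells the reader what to do, regardless of the
    # current pool or OSD setting.
    return zlib.decompress(payload) if flag == ZLIB else payload

text_flag, text_payload = store_block(b"a" * 4096)
rand_flag, rand_payload = store_block(os.urandom(4096))
off_flag, off_payload = store_block(b"a" * 4096, compression_enabled=False)
```

Blocks written while compression was enabled stay readable after it is
disabled, since each block describes itself.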

> Plus block device compression is also able to include all the *other*
> stuff that doesn't fit inside the object proper (xattrs and omap).
Yes, that's a good point, but I suppose nothing prevents us from
compressing the metadata ourselves too.
>> * Potentially higher overhead when operating- There is no way to bypass
>> non-compressible data processing, e.g. shards with Erasure codes.
> My information theory intuition has never been very good, but I don't
> think the coded chunks are any less compressible than the data they're
> coding for, in general...
Yes, my bad. I played with EC a bit: the generated chunks are pretty
regular. I expected something absolutely random, like encrypted data.

> ...I should note that I'm under the impression that transparent
> compression already exists at some level which can be stacked with
> regular filesystems, but I'm not finding it now, so maybe I'm
> misinformed and the tradeoffs are a little different than I thought.
I found some mentions of an RBD device that performs inline compression.
But the rather limited information available on the Net makes me think
that this solution is far from production use.

> But I still don't like the idea of doing it on a primary just for EC
> pools – I think if we were going to take that approach it'd be easier
> to compress somewhere before it reaches the EC/replicated split?
As I mentioned above, the main reasons that pushed me to merge
compression with EC pools are the similar handling issues and the
similar trade-off (space for CPU) they provide.
Moving compression to any other place raises many complications.

Anyway, I will try to summarize the suggested approaches and their pros
and cons.

Thanks,
Igor

PS. Gregory, I highly appreciate your feedback.
