Re: Adding Data-At-Rest compression support to Ceph

From: Igor Fedotov <ifedotov@mirantis.com>
To: Samuel Just <sjust@redhat.com>
Cc: Gregory Farnum <gfarnum@redhat.com>,
	Sage Weil <sage@newdream.net>,
	ceph-devel <ceph-devel@vger.kernel.org>
Subject: Re: Adding Data-At-Rest compression support to Ceph
Date: Thu, 24 Sep 2015 18:34:22 +0300
Message-ID: <560417FE.5000403@mirantis.com>
In-Reply-To: <CAN=+7FXXLVn=8kLJcD0ZC-2nfN8n4HL1V1_XuyjzpsLZOXG2dQ@mail.gmail.com>

Samuel,
I completely agree that we need a blueprint before the
implementation. But I think we should first settle on the approach
(when and how to perform the compression).
I'll summarize the existing suggestions and their pros and cons shortly,
so that we can discuss them more productively.

Regarding performing the compression on the client side: I'm afraid
it's not that easy, given that we have multiple clients with
different usage patterns and random data access.

Thanks,
Igor.

On 23.09.2015 20:31, Samuel Just wrote:
> I think before moving forward with any sort of implementation, the
> design would need to be pretty much completely mapped out --
> particularly how the offset mapping will be handled and stored.  The
> right thing to do would be to produce a blueprint and submit it to the
> list.  I also would vastly prefer to do it on the client side if
> possible.  Certainly, radosgw could do the compression just as easily
> as the osds (except for the load on the radosgw heads, I suppose).
> -Sam
>
> On Wed, Sep 23, 2015 at 8:26 AM, Igor Fedotov <ifedotov@mirantis.com> wrote:
>>
>> On 23.09.2015 17:05, Gregory Farnum wrote:
>>> On Wed, Sep 23, 2015 at 6:15 AM, Sage Weil <sage@newdream.net> wrote:
>>>> On Wed, 23 Sep 2015, Igor Fedotov wrote:
>>>>> Hi Sage,
>>>>> thanks a lot for your feedback.
>>>>>
>>>>> Regarding the issues with offset mapping and stripe size exposure:
>>>>> what about the idea of applying compression in the two-tier
>>>>> (cache + backing storage) model only?
>>>> I'm not sure we win anything by making it a two-tier only thing... simply
>>>> making it a feature of the EC pool means we can also address EC pool
>>>> users like radosgw.
>>>>
>>>>> I doubt the single-tier model is widely used for EC pools, since there
>>>>> is no random write support in that mode. Thus this might be an
>>>>> acceptable limitation.
>>>>> At the same time, it seems that the appends caused by a cached object
>>>>> flush have a fixed block size (8 MB by default), and the object is
>>>>> totally rewritten on the next flush, if any. This makes offset mapping
>>>>> less tricky.
>>>>> Decompression should be applied in either model, though, since shutting
>>>>> down the cache tier and subsequently accessing the compressed data is
>>>>> possibly a valid use case.
>>>> Yeah, we need to handle random reads either way, so I think the offset
>>>> mapping is going to be needed anyway.
>>> The idea of making the primary responsible for object compression
>>> really concerns me. It means for instance that a single random access
>>> will likely require access to multiple objects, and breaks many of the
>>> optimizations we have right now or in the pipeline (for instance:
>>> direct client access).
>> Could you please elaborate on why a single random access requires access
>> to multiple objects?
>> In my opinion we need to access exactly the same object set as before: in
>> an EC pool each appended block is split into multiple shards that go to
>> their respective OSDs. In the general case one has to retrieve a set of
>> adjacent shards from several OSDs for a single read request. With
>> compression the only difference is the data range that the compressed
>> shard set occupies, i.e. we simply need to translate the requested data
>> range to the actually stored one and retrieve that data from the OSDs.
>> What am I missing?
>>> And apparently only the EC pool will support
>>> compression, which is frustrating for all the replicated pool users
>>> out there...
>> In my opinion replicated pool users should consider using an EC pool first
>> if they care about saving space. They automatically gain a 50% space saving
>> this way. Compression brings even more savings, but that's rather the
>> second step along the way.
>>> Is there some reason we don't just want to apply compression across an
>>> OSD store? Perhaps doing it on the filesystem level is the wrong way
>>> (for reasons named above) but there are other mechanisms like inline
>>> block device compression that I think are supposed to work pretty
>>> well.
>> If I understand the idea of inline block device compression correctly, it
>> has some drawbacks similar to those of the FS compression approach. Ones to
>> mention:
>> * Less flexibility - per-device compression only, no way to have per-pool
>> compression. No control over the compression process.
>> * Potentially higher overhead when operating - there is no way to bypass
>> processing of non-compressible data, e.g. shards with erasure codes.
>> * Potentially higher overhead for recovery on OSD death - one needs to
>> decompress data on the working OSDs and compress it on the new OSD. That's
>> not necessary if compression takes place prior to EC, though.
>>
>>> The only thing that doesn't get us that I can see mentioned here
>>> is the over-the-wire compression — and Haomai already has patches for
>>> that, which should be a lot easier to validate and will work at all
>>> levels of the stack!
>>> -Greg
>>
>> --
>> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
>> the body of a message to majordomo@vger.kernel.org
>> More majordomo info at  http://vger.kernel.org/majordomo-info.html
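
The range translation described in the quoted reply above ("translate
the requested data range to the actually stored one") could look roughly
like the sketch below. It is only an illustration under assumptions that
the thread does not settle: every append is cut into fixed-size logical
blocks, each block is compressed on its own, and the start of each
compressed block is recorded. CompressedObject, block_starts, translate
and read are hypothetical names, not Ceph code; zlib merely stands in for
whatever compressor would be chosen, and sharding across OSDs is left out.

```python
import zlib


class CompressedObject:
    """Append-only object whose fixed-size logical blocks are compressed
    independently (hypothetical model, not Ceph code)."""

    def __init__(self, block_size):
        self.block_size = block_size
        self.logical_size = 0
        self.stored = bytearray()   # bytes that would actually be stored
        self.block_starts = [0]     # block_starts[i] = stored offset of block i

    def append(self, data):
        # Assumption: appends are block-aligned, as with fixed-size flushes.
        assert self.logical_size % self.block_size == 0, "partial block"
        for pos in range(0, len(data), self.block_size):
            self.stored += zlib.compress(data[pos:pos + self.block_size])
            self.block_starts.append(len(self.stored))
        self.logical_size += len(data)

    def translate(self, offset, length):
        """Map a logical range to (first_block, last_block,
        stored_start, stored_end)."""
        if offset < 0 or length <= 0 or offset + length > self.logical_size:
            raise ValueError("range outside object")
        first = offset // self.block_size
        last = (offset + length - 1) // self.block_size
        return first, last, self.block_starts[first], self.block_starts[last + 1]

    def read(self, offset, length):
        # Fetch only the compressed blocks that cover the request,
        # decompress them, then trim to the requested logical range.
        first, last, _, _ = self.translate(offset, length)
        plain = b"".join(
            zlib.decompress(
                bytes(self.stored[self.block_starts[i]:self.block_starts[i + 1]])
            )
            for i in range(first, last + 1)
        )
        skip = offset - first * self.block_size
        return plain[skip:skip + length]


obj = CompressedObject(block_size=64)
payload = bytes(range(256)) * 2      # 512 logical bytes, i.e. 8 blocks
obj.append(payload)
first, last, start, end = obj.translate(100, 50)   # covers blocks 1 and 2
assert obj.read(100, 50) == payload[100:150]
```

In this model a random read touches only the stored range
[stored_start, stored_end), so the set of objects accessed is the same as
without compression. The price is the per-block offset table, which is the
piece a blueprint would have to specify.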


Thread overview: 26+ messages
2015-09-22 17:04 Adding Data-At-Rest compression support to Ceph Igor Fedotov
2015-09-22 19:11 ` Sage Weil
2015-09-23 12:47   ` Igor Fedotov
2015-09-23 13:15     ` Sage Weil
2015-09-23 14:05       ` Gregory Farnum
2015-09-23 15:26         ` Igor Fedotov
2015-09-23 17:31           ` Samuel Just
2015-09-24 15:34             ` Igor Fedotov [this message]
2015-09-23 18:03           ` Gregory Farnum
2015-09-24 15:13             ` Igor Fedotov
2015-09-24 15:34               ` Sage Weil
2015-09-24 15:41                 ` HEWLETT, Paul (Paul)
2015-09-24 16:00                   ` Igor Fedotov
2015-09-24 15:56                 ` Igor Fedotov
2015-09-24 16:03                   ` Sage Weil
2015-09-24 16:14                     ` Igor Fedotov
2015-09-24 16:25                     ` Igor Fedotov
2015-09-24 17:36                       ` Robert LeBlanc
2015-09-24 17:53                         ` Samuel Just
2015-09-25 11:59                           ` Igor Fedotov
2015-09-25 14:14                             ` Sage Weil
2015-09-28 16:56                               ` Igor Fedotov
2015-09-24 18:10               ` Gregory Farnum
2015-09-25 13:16                 ` Igor Fedotov
2015-09-23 14:08       ` Igor Fedotov
2015-09-23 14:37         ` Sage Weil
