CEPH filesystem development
From: Igor Fedotov <ifedotov@mirantis.com>
To: Sage Weil <sage@newdream.net>
Cc: ceph-devel@vger.kernel.org
Subject: Re: Adding Data-At-Rest compression support to Ceph
Date: Wed, 23 Sep 2015 17:08:41 +0300
Message-ID: <5602B269.1090806@mirantis.com>
In-Reply-To: <alpine.DEB.2.00.1509230613410.11876@cobra.newdream.net>

Sage,

so you are saying that radosgw tends to use EC pools directly, without 
caching, right?

I agree that we need offset mapping anyway.

And the difference between cache writes and direct writes is mainly in 
block size granularity: 8 MB vs. 4 KB. In the latter case we have higher 
overhead for both offset mapping and compression. But I agree, there is no 
real difference from an implementation point of view.
OK, let's try to handle both use cases.
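
To make the granularity point concrete, here is a minimal sketch of a
per-block offset map. It is not Ceph code: Python's zlib stands in for
whatever compressor is eventually chosen, and build_offset_map and
read_range are hypothetical names used only for illustration. The map holds
one (compressed offset, compressed length) entry per logical block, so an
8 MB object needs 1 entry at 8 MB granularity and 2048 entries at 4 KB.

```python
import zlib

def build_offset_map(data, block_size):
    """Compress data block by block; return (blob, index).

    index[i] = (compressed_offset, compressed_length) for logical block i.
    """
    blob = bytearray()
    index = []
    for start in range(0, len(data), block_size):
        chunk = zlib.compress(data[start:start + block_size])
        index.append((len(blob), len(chunk)))
        blob += chunk
    return bytes(blob), index

def read_range(blob, index, block_size, offset, length):
    """Random read: decompress only the blocks that overlap the range."""
    if length <= 0:
        return b""
    first = offset // block_size
    last = (offset + length - 1) // block_size
    out = bytearray()
    for i in range(first, min(last, len(index) - 1) + 1):
        coff, clen = index[i]
        out += zlib.decompress(blob[coff:coff + clen])
    skip = offset - first * block_size
    return bytes(out[skip:skip + length])

data = bytes(range(256)) * 4096 * 8   # 8 MiB of sample data (assumption)
for block_size in (8 * 1024 * 1024, 4 * 1024):
    blob, index = build_offset_map(data, block_size)
    assert read_range(blob, index, block_size, 12345, 100) == data[12345:12445]
    # block size, number of map entries, total compressed size
    print(block_size, len(index), len(blob))
```

The read path is the same in both cases, which matches the "no real
difference" conclusion. Only the number of map entries and the amount of
data decompressed per read change.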

So what do you think: can we proceed with implementing this feature, or 
do we need more discussion first?

Thanks,
Igor.

On 23.09.2015 16:15, Sage Weil wrote:
> On Wed, 23 Sep 2015, Igor Fedotov wrote:
>> Hi Sage,
>> thanks a lot for your feedback.
>>
>> Regarding the issues with offset mapping and stripe size exposure:
>> what about the idea of applying compression in the two-tier (cache + backing
>> storage) model only?
> I'm not sure we win anything by making it a two-tier only thing... simply
> making it a feature of the EC pool means we can also address EC pool users
> like radosgw.
>
>> I doubt the single-tier model is widely used for EC pools, since there is no
>> random write support in that mode. Thus this might be an acceptable limitation.
>> At the same time, it seems that appends caused by a cached object flush have a
>> fixed block size (8 MB by default), and the object is totally rewritten on the
>> next flush, if any. This makes offset mapping less tricky.
>> Decompression should be supported in either model, though, since shutting down
>> the cache tier and then accessing compressed data is possibly a valid use case.
> Yeah, we need to handle random reads either way, so I think the offset
> mapping is going to be needed anyway.  And I don't think there is any
> real difference from the EC pool's perspective between a direct user
> like radosgw and the cache tier writing objects--in both cases it's
> doing appends and deletes.
>
> sage
>
>
>> Thanks,
>> Igor
>>
>> On 22.09.2015 22:11, Sage Weil wrote:
>>> On Tue, 22 Sep 2015, Igor Fedotov wrote:
>>>> Hi guys,
>>>>
>>>> I can find some talks about adding compression support to Ceph. Let me share
>>>> some thoughts and proposals on that too.
>>>>
>>>> First of all, I'd like to consider several major implementation options
>>>> separately. IMHO this makes sense since they differ in applicability,
>>>> value and implementation specifics. Besides, smaller parts are easier to
>>>> both understand and implement.
>>>>
>>>>     *  Data-At-Rest Compression. This is about compressing the bulk of the
>>>> data kept by the Ceph backing tier. The main reason for that is reducing
>>>> data storage costs. A similar approach was introduced by the Erasure Coding
>>>> pool implementation: cluster capacity increases (i.e. storage cost drops)
>>>> at the expense of additional computation. This is especially effective when
>>>> combined with a high-performance cache tier.
>>>>     *  Intermediate Data Compression. This case is about applying compression
>>>> to intermediate data such as system journals, caches etc. The intention is to
>>>> improve the utilization of expensive storage resources (e.g. solid state
>>>> drives or RAM). At the same time, the idea of applying compression (a feature
>>>> that undoubtedly introduces additional overhead) to crucial heavy-duty
>>>> components looks somewhat contradictory.
>>>>     *  Exchange Data Compression. This one applies to messages transported
>>>> between the client and storage cluster components, as well as to internal
>>>> cluster traffic. The rationale might be the desire to improve cluster
>>>> run-time characteristics, e.g. data bandwidth limited by network or storage
>>>> device throughput. The potential drawback is client overburdening: client
>>>> computation resources might become a bottleneck since clients take on most
>>>> of the compression/decompression work.
>>>>
>>>> Obviously it would be great to have support for all the above cases, e.g.
>>>> object compression takes place at the client and cluster components handle
>>>> that naturally during the object life-cycle. Unfortunately, significant
>>>> complexities arise along this path. Most of them are related to partial
>>>> object access, both reading and writing. It looks like huge development
>>>> (redesign, refactoring and new code) and testing efforts are required. It's
>>>> also hard to estimate the value of such aggregated support at the moment.
>>>> Thus the approach I'm suggesting is to make progress incrementally and
>>>> consider the cases separately. At the moment my proposal is to add
>>>> Data-At-Rest compression to Erasure Coded pools, as the most well-defined
>>>> case from both the implementation and the value points of view.
>>>>
>>>> How we can do that.
>>>>
>>>> The Ceph cluster architecture suggests a two-tier storage model for
>>>> production usage. A cache tier built on high-performance, expensive storage
>>>> devices provides performance. A storage tier with low-cost, slower devices
>>>> provides cost-effectiveness and capacity. The cache tier is supposed to use
>>>> ordinary data replication, while the storage tier can use erasure coding (EC)
>>>> for efficient and reliable data keeping. EC provides lower storage cost at
>>>> the same reliability compared to replication, at the expense of additional
>>>> computation. Thus Ceph already has a trade-off between capacity and
>>>> computation effort. Data-At-Rest compression is about exactly the same
>>>> trade-off. Moreover, one can tie EC and Data-At-Rest compression together to
>>>> achieve even better storage efficiency.
>>>> There are two possible ways of adding Data-At-Rest compression:
>>>>     1. Use data compression built into a file system underneath Ceph.
>>>>     2. Add compression to the Ceph OSD.
>>>>
>>>> At first glance Option 1 looks pretty attractive, but this approach has
>>>> some drawbacks:
>>>>     *  File system lock-in. BTRFS is the only file system supporting
>>>> transparent compression among the ones recommended for Ceph usage. Moreover,
>>>> AFAIK it's still not recommended for production usage, see:
>>>> http://ceph.com/docs/master/rados/configuration/filesystem-recommendations/
>>>>     *  Limited flexibility: one can use only the compression methods and
>>>> policies supported by the FS.
>>>>     *  Data compression depends on volume or mount point properties (and is
>>>> bound to the OSD). Without additional support, Ceph lacks the ability to have
>>>> different compression policies for different pools residing on the same OSD.
>>>>     *  File compression control isn't standardized among file systems. If (or
>>>> when) a new compression-equipped file system appears, Ceph might require
>>>> corresponding changes to handle it properly.
>>>>
>>>> Having compression at the OSD helps to eliminate these drawbacks.
>>>> As mentioned above, the purposes of Data-At-Rest compression are pretty much
>>>> the same as those of Erasure Coding. It looks quite easy to add compression
>>>> support to EC pools. This way one can trade higher CPU load for even more
>>>> storage space.
>>>> Additional pros for combining compression and erasure coding are:
>>>>     *  Both EC and compression have complexities with partial writes. EC
>>>> pools don't have partial write support (data append only), and the solution
>>>> for that is cache tier insertion. Thus we can transparently reuse the same
>>>> approach in the case of compression.
>>>>     *  Compression becomes a pool property, thus Ceph users will have direct
>>>> control over which pools compression is applied to.
>>>>     *  Original write performance isn't impacted by compression in the
>>>> two-tier model: write data goes to the cache uncompressed, so there is no
>>>> corresponding compression latency. Actual compression happens in the
>>>> background when the backing storage is being filled.
>>>>     *  There is an additional benefit in network bandwidth savings when the
>>>> primary OSD performs the compression, as the resulting object shards for
>>>> replication are smaller.
>>>>     *  Data-at-rest compression can also bring an additional performance
>>>> improvement for HDD-based storage. Reducing the amount of data written to
>>>> slow media can provide a net performance improvement even taking the
>>>> compression overhead into account.
>>> I think this approach makes a lot of sense.  The tricky bit will be
>>> storing the additional metadata that maps logical offsets to compressed
>>> offsets.
>>>
>>>> Some implementation notes:
>>>>
>>>> The suggested approach is to perform data compression prior to Erasure
>>>> Coding, to reduce the amount of data passed to the coding step and to avoid
>>>> the need for additional means to disable compression of EC-generated chunks.
>>> At first glance, the compress-before-ec approach sounds attractive: the
>>> complex EC striping stuff doesn't need to change, and we just need to map
>>> logical offsets to compressed offsets before doing the EC read/reconstruct
>>> as we normally would.  The problem is with appends: the EC stripe size
>>> is exposed to the user and they write in those increments.  So if we
>>> compress before we pass it to EC, then we need to have variable stripe
>>> sizes for each write (depending on how well it compressed).  The upshot
>>> here is that if we end up supporting variable EC stripe sizes we *could*
>>> allow librados appends of any size (not just the stripe size as we
>>> currently do).  I'm not sure how important/useful that is...
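
A small sketch of why compress-before-EC leads to variable stripe sizes.
This is only an illustration under assumptions: zlib stands in for the
compressor, K and STRIPE_WIDTH are made-up values, append_compressed is a
hypothetical helper, and padding the stored stripe to a multiple of K is my
own simplification so that it still splits evenly into data chunks. Each
append has the same logical size, but the stored size differs per write, so
every append needs its own extent record.

```python
import os
import zlib

K = 4                        # hypothetical number of EC data chunks
STRIPE_WIDTH = 4096 * K      # hypothetical logical stripe size exposed to users

def append_compressed(extents, payload):
    """Record one append as (logical_off, logical_len, stored_off, stored_len).

    Returns the padded compressed bytes that would be handed to the EC step.
    """
    if extents:
        logical_off = extents[-1][0] + extents[-1][1]
        stored_off = extents[-1][2] + extents[-1][3]
    else:
        logical_off = stored_off = 0
    compressed = zlib.compress(payload)
    # Round up so the stored stripe splits evenly into K data chunks.
    stored_len = -(-len(compressed) // K) * K
    extents.append((logical_off, len(payload), stored_off, stored_len))
    return compressed.ljust(stored_len, b"\0")

extents = []
writes = [
    b"a" * STRIPE_WIDTH,            # highly compressible
    os.urandom(STRIPE_WIDTH),       # effectively incompressible
    b"ab" * (STRIPE_WIDTH // 2),    # compressible
]
for w in writes:
    append_compressed(extents, w)
for e in extents:
    print(e)
```

Logical lengths stay at STRIPE_WIDTH while the stored lengths vary from one
append to the next, which is the per-write stripe size variation described
above.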
>>>
>>> On the other hand, ec-before-compression still means we need to map coded
>>> stripe offsets to compressed offsets.. and you're right that it puts a bit
>>> more data through the EC transform.
>>>
>>> Either way, it will be a reasonably complex change.
>>>
>>>> Data-At-Rest compression should support a plugin architecture to enable
>>>> multiple compression backends.
>>> Haomai has started some simple compression infrastructure to support
>>> compression over the wire; see
>>>
>>> 	https://github.com/ceph/ceph/pull/5116
>>>
>>> We should reuse or extend the plugin interface there to cover both users.
>>>
>>>> The compression engine should mark stored objects with tags indicating
>>>> whether compression took place and which algorithm was used.
>>>> To avoid (or reduce) backing storage CPU overload caused by
>>>> compression/decompression (e.g. this can happen during massive reads), we
>>>> can introduce additional means to detect such situations and temporarily
>>>> disable compression for current write requests. Since there is a way to mark
>>>> objects as compressed/uncompressed, this produces almost no issues for future
>>>> handling.
>>>> Using hardware compression support, e.g. Intel QuickAssist, can be an
>>>> additional help with this issue.
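
The tagging and temporary-bypass idea, together with pluggable backends, can
be sketched as follows. Everything here is an assumption for illustration:
the CODECS registry, the one-byte tag prefix and the encode/decode helpers
are hypothetical, and Python's zlib and bz2 merely stand in for real
compression plugins. In an actual OSD the tag would more plausibly live in
object metadata than in the payload itself.

```python
import bz2
import zlib

# Hypothetical plugin registry: tag byte -> (compress, decompress).
CODECS = {
    b"N": (lambda d: d, lambda d: d),          # stored uncompressed
    b"Z": (zlib.compress, zlib.decompress),
    b"B": (bz2.compress, bz2.decompress),
}

def encode(data, tag, overloaded=False):
    """Return the stored form: a one-byte codec tag followed by the payload."""
    if overloaded:
        tag = b"N"                   # temporarily bypass compression
    payload = CODECS[tag][0](data)
    if len(payload) >= len(data):
        tag, payload = b"N", data    # compression did not help; store as-is
    return tag + payload

def decode(stored):
    """Pick the decompressor from the tag, so mixed objects read uniformly."""
    tag, payload = stored[:1], stored[1:]
    return CODECS[tag][1](payload)

sample = b"ceph " * 1000
for tag in (b"Z", b"B"):
    assert decode(encode(sample, tag)) == sample
# Objects written while compression is disabled remain readable later.
assert decode(encode(sample, b"Z", overloaded=True)) == sample
```

Because the reader dispatches on the tag, objects written while compression
was temporarily disabled need no special handling on later reads.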
>>> Great to see this moving forward!
>>> sage
>>>