From: Igor Fedotov <ifedotov@mirantis.com>
To: Sage Weil <sage@newdream.net>
Cc: ceph-devel@vger.kernel.org
Subject: Re: Adding Data-At-Rest compression support to Ceph
Date: Wed, 23 Sep 2015 17:08:41 +0300
Message-ID: <5602B269.1090806@mirantis.com>
In-Reply-To: <alpine.DEB.2.00.1509230613410.11876@cobra.newdream.net>
Sage,
so you are saying that radosgw tends to use EC pools directly, without
caching, right?
I agree that we need offset mapping anyway.
And the difference between cache writes and direct writes is mainly in
block size granularity: 8 MB vs. 4 KB. In the latter case we have higher
overhead for both offset mapping and compression. But I agree - there is no
real difference from the implementation point of view.
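To put rough numbers on the offset mapping overhead, here is a minimal
sketch. It assumes one map entry per written block and a hypothetical
16-byte entry (8-byte compressed offset plus 8-byte compressed length);
neither is an actual Ceph on-disk format.

```python
# Rough estimate of offset-map size per object (illustrative only).
ENTRY_BYTES = 16  # assumption: 8-byte compressed offset + 8-byte compressed length

MB = 1024 * 1024
KB = 1024


def map_entries(object_size: int, block_size: int) -> int:
    """Number of map entries if every written block gets its own entry."""
    return -(-object_size // block_size)  # ceiling division


def map_overhead(object_size: int, block_size: int) -> int:
    """Bytes of mapping metadata needed for one object."""
    return map_entries(object_size, block_size) * ENTRY_BYTES


# Cache-tier flush: one 8 MB block per object.
print(map_entries(8 * MB, 8 * MB), map_overhead(8 * MB, 8 * MB))
# Direct writes at 4 KB granularity into the same 8 MB object.
print(map_entries(8 * MB, 4 * KB), map_overhead(8 * MB, 4 * KB))
```

Under these assumptions an 8 MB object needs 1 entry when written by a
cache flush and 2048 entries when written in 4 KB blocks.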
OK, let's try to handle both use cases.
So what do you think - can we proceed with implementing this feature, or
do we need more discussion on that?
Thanks,
Igor.
On 23.09.2015 16:15, Sage Weil wrote:
> On Wed, 23 Sep 2015, Igor Fedotov wrote:
>> Hi Sage,
>> thanks a lot for your feedback.
>>
>> Regarding issues with offset mapping and stripe size exposure.
>> What about the idea of applying compression in the two-tier (cache + backing
>> storage) model only?
> I'm not sure we win anything by making it a two-tier only thing... simply
> making it a feature of the EC pool means we can also address EC pool users
> like radosgw.
>
>> I doubt the single-tier one is widely used for EC pools, since there is no
>> random write support in that mode. Thus this might be an acceptable limitation.
>> At the same time it seems that appends caused by a cached object flush have a
>> fixed block size (8 MB by default), and the object is totally rewritten on the
>> next flush, if any. This makes offset mapping less tricky.
>> Decompression should be applied in any model though, as cache tier shutdown and
>> subsequent compressed data access is possibly a valid use case.
> Yeah, we need to handle random reads either way, so I think the offset
> mapping is going to be needed anyway. And I don't think there is any
> real difference from the EC pool's perspective between a direct user
> like radosgw and the cache tier writing objects--in both cases it's
> doing appends and deletes.
>
> sage
>
>
>> Thanks,
>> Igor
>>
>> On 22.09.2015 22:11, Sage Weil wrote:
>>> On Tue, 22 Sep 2015, Igor Fedotov wrote:
>>>> Hi guys,
>>>>
>>>> I can find some talks about adding compression support to Ceph. Let me share
>>>> some thoughts and proposals on that too.
>>>>
>>>> First of all I'd like to consider several major implementation options
>>>> separately. IMHO this makes sense since they have different applicability,
>>>> value and implementation specifics. Besides that, smaller parts are easier
>>>> for both understanding and implementation.
>>>>
>>>> * Data-At-Rest Compression. This is about compressing the basic data volume
>>>>   kept by the Ceph backing tier. The main reason for that is data store cost
>>>>   reduction. One can find a similar approach introduced by the Erasure Coding
>>>>   Pool implementation - cluster capacity increases (i.e. storage cost
>>>>   decreases) at the expense of additional computations. This is especially
>>>>   effective when combined with the high-performance cache tier.
>>>> * Intermediate Data Compression. This case is about applying compression to
>>>>   intermediate data like system journals, caches etc. The intention is to
>>>>   improve expensive storage resource utilization (e.g. solid state drives or
>>>>   RAM). At the same time the idea of applying compression (a feature that
>>>>   undoubtedly introduces additional overhead) to the crucial heavy-duty
>>>>   components probably looks contradictory.
>>>> * Exchange Data Compression. This one is to be applied to messages
>>>>   transported between client and storage cluster components as well as to
>>>>   internal cluster traffic. The rationale for that might be the desire to
>>>>   improve cluster run-time characteristics, e.g. limited data bandwidth
>>>>   caused by network or storage device throughput. The potential drawback is
>>>>   client overburdening - client computation resources might become a
>>>>   bottleneck since they take on most of the compression/decompression tasks.
>>>>
>>>> Obviously it would be great to have support for all the above cases, e.g.
>>>> object compression takes place at the client and cluster components handle
>>>> that naturally during the object life-cycle. Unfortunately significant
>>>> complexities arise along this way. Most of them are related to partial object
>>>> access, both reading and writing. It looks like huge development
>>>> (redesigning, refactoring and new code development) and testing efforts are
>>>> required along this way. It's hard to estimate the value of such aggregated
>>>> support at the current moment too.
>>>> Thus the approach I'm suggesting is to drive the progress incrementally and
>>>> consider the cases separately. At the moment my proposal is to add
>>>> Data-At-Rest compression to Erasure Coded pools as the most definite one
>>>> from both implementation and value points of view.
>>>>
>>>> How we can do that.
>>>>
>>>> Ceph Cluster Architecture suggests a two-tier storage model for production
>>>> usage. The cache tier, built on high-performance expensive storage devices,
>>>> provides performance. The storage tier, with low-cost less-efficient devices,
>>>> provides cost-effectiveness and capacity. The cache tier is supposed to use
>>>> ordinary data replication while the storage one can use erasure coding (EC)
>>>> for effective and reliable data keeping. EC provides lower storage costs
>>>> with the same reliability compared to the data replication approach, at the
>>>> expense of additional computations. Thus Ceph already has a trade-off
>>>> between capacity and computation effort. Actually Data-At-Rest compression
>>>> is exactly about the same. Moreover one can tie EC and Data-At-Rest
>>>> compression together to achieve even better storage effectiveness.
>>>> There are two possible ways of adding Data-At-Rest compression:
>>>> 1. Use data compression built into the file system underneath Ceph.
>>>> 2. Add compression to the Ceph OSD.
>>>>
>>>> At first glance Option 1 looks pretty attractive but there are some
>>>> drawbacks to this approach. Here they are:
>>>> * File system lock-in. BTRFS is the only file system supporting transparent
>>>>   compression among the ones recommended for Ceph usage. Moreover, AFAIK
>>>>   it's still not recommended for production usage, see:
>>>>   http://ceph.com/docs/master/rados/configuration/filesystem-recommendations/
>>>> * Limited flexibility - one can use only the compression methods and
>>>>   policies supported by the FS.
>>>> * Data compression depends on volume or mount point properties (and is
>>>>   bound to the OSD). Without additional support Ceph lacks the ability to
>>>>   have different compression policies for different pools residing on the
>>>>   same OSD.
>>>> * File compression control isn't standardized among file systems. If (or
>>>>   when) a new compression-equipped file system appears, Ceph might require
>>>>   corresponding changes to handle that properly.
>>>>
>>>> Having compression at the OSD helps to eliminate these drawbacks.
>>>> As mentioned above, the purposes of Data-At-Rest compression are pretty much
>>>> the same as for Erasure Coding. It looks quite easy to add compression
>>>> support to EC pools. This way one can have even more storage space in
>>>> exchange for higher CPU load.
>>>> Additional pros for combining compression and erasure coding are:
>>>> * Both EC and compression have complexities in partial writing. EC pools
>>>>   don't have partial write support (data append only) and the solution for
>>>>   that is cache tier insertion. Thus we can transparently reuse the same
>>>>   approach in the case of compression.
>>>> * Compression becomes a pool property, thus Ceph users will have direct
>>>>   control over which pools to apply compression to.
>>>> * Original write performance isn't impacted by the compression in the
>>>>   two-tier model - write data goes to the cache uncompressed and there is
>>>>   no corresponding compression latency. Actual compression happens in the
>>>>   background when backing storage filling takes place.
>>>> * There is an additional benefit in network bandwidth saving when the
>>>>   primary OSD performs the compression, as the resulting object shards for
>>>>   replication are smaller.
>>>> * Data-at-rest compression can also bring an additional performance
>>>>   improvement for HDD-based storage. Reducing the amount of data written to
>>>>   slow media can provide a net performance improvement even taking into
>>>>   account the compression overhead.
>>> I think this approach makes a lot of sense. The tricky bit will be
>>> storing the additional metadata that maps logical offsets to compressed
>>> offsets.
>>>
>>>> Some implementation notes:
>>>>
>>>> The suggested approach is to perform data compression prior to Erasure
>>>> Coding, to reduce the data portion passed to coding and avoid the need to
>>>> introduce additional means to disable compression of EC-generated chunks.
>>> At first glance, the compress-before-ec approach sounds attractive: the
>>> complex EC striping stuff doesn't need to change, and we just need to map
>>> logical offsets to compressed offsets before doing the EC read/reconstruct
>>> as we normally would. The problem is with appends: the EC stripe size
>>> is exposed to the user and they write in those increments. So if we
>>> compress before we pass it to EC, then we need to have variable stripe
>>> sizes for each write (depending on how well it compressed). The upshot
>>> here is that if we end up supporting variable EC stripe sizes we *could*
>>> allow librados appends of any size (not just the stripe size as we
>>> currently do). I'm not sure how important/useful that is...
>>>
>>> On the other hand, ec-before-compression still means we need to map coded
>>> stripe offsets to compressed offsets.. and you're right that it puts a bit
>>> more data through the EC transform.
>>>
>>> Either way, it will be a reasonably complex change.
>>>
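A minimal sketch of the compress-before-EC case discussed above. It assumes
zlib as a stand-in compressor and a hypothetical in-memory extent list
(logical offset, logical length, compressed offset, compressed length); it
ignores EC chunk padding and is not how Ceph stores anything today. Each
append compresses to a different size, so a random read first has to look
up the extents covering the requested logical range.

```python
import bisect
import zlib


class CompressedObject:
    """Append-only object whose appends are compressed independently."""

    def __init__(self):
        self.data = bytearray()  # compressed bytes, as they would be handed to EC
        self.extents = []        # (logical_off, logical_len, comp_off, comp_len)
        self.logical_size = 0

    def append(self, buf: bytes) -> int:
        """Compress one append and record where it landed; returns compressed size."""
        comp = zlib.compress(buf)
        self.extents.append((self.logical_size, len(buf), len(self.data), len(comp)))
        self.data += comp
        self.logical_size += len(buf)
        return len(comp)  # varies with how well buf compressed

    def read(self, off: int, length: int) -> bytes:
        """Random read by logical offset, decompressing only the covering extents."""
        end = min(off + length, self.logical_size)
        if off < 0 or off >= end:
            return b""
        starts = [e[0] for e in self.extents]
        i = bisect.bisect_right(starts, off) - 1  # extent containing `off`
        out = bytearray()
        while i < len(self.extents) and self.extents[i][0] < end:
            l_off, l_len, c_off, c_len = self.extents[i]
            plain = zlib.decompress(bytes(self.data[c_off:c_off + c_len]))
            out += plain[max(off - l_off, 0):min(end - l_off, l_len)]
            i += 1
        return bytes(out)
```

For example, after appending two 4096-byte buffers, `read(4090, 10)` touches
both extents and returns the last 6 bytes of the first plus the first 4
bytes of the second.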
>>>> Data-At-Rest compression should support a plugin architecture to enable
>>>> multiple compression backends.
>>> Haomai has started some simple compression infrastructure to support
>>> compression over the wire; see
>>>
>>> https://github.com/ceph/ceph/pull/5116
>>>
>>> We should reuse or extend the plugin interface there to cover both users.
>>>
>>>> The compression engine should mark stored objects with some tags to indicate
>>>> whether compression took place and what algorithm was used.
>>>> To avoid (or reduce) backing storage CPU overload caused by
>>>> compression/decompression (e.g. this can happen during massive reads) we can
>>>> introduce additional means to detect such situations and temporarily disable
>>>> compression for current write requests. Since there is a way to mark objects
>>>> as compressed/uncompressed, this produces almost no issues for future
>>>> handling.
>>>> Using hardware compression support, e.g. Intel QuickAssist, can be an
>>>> additional helper for this issue.
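A minimal sketch of the tagging idea, assuming a hypothetical one-byte tag
in front of each stored blob and zlib as the only backend. The tag values,
the `cpu_overloaded` flag and the "keep uncompressed if compression does not
shrink the data" rule are illustrative assumptions, not an existing Ceph
interface.

```python
import zlib

TAG_NONE = 0  # blob stored uncompressed
TAG_ZLIB = 1  # blob stored zlib-compressed


def encode(payload: bytes, cpu_overloaded: bool = False) -> bytes:
    """Prefix the blob with a tag; skip compression under load or if it does not help."""
    if not cpu_overloaded:
        comp = zlib.compress(payload)
        if len(comp) < len(payload):
            return bytes([TAG_ZLIB]) + comp
    return bytes([TAG_NONE]) + payload


def decode(blob: bytes) -> bytes:
    """Read the tag and undo whatever encode() did, regardless of current load."""
    tag, body = blob[0], blob[1:]
    if tag == TAG_ZLIB:
        return zlib.decompress(body)
    if tag == TAG_NONE:
        return body
    raise ValueError(f"unknown compression tag {tag}")
```

Because the tag travels with the data, blobs written while compression was
temporarily disabled decode the same way as compressed ones.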
>>> Great to see this moving forward!
>>> sage
>>>