Adding Data-At-Rest compression support to Ceph


* Adding Data-At-Rest compression support to Ceph
@ 2015-09-22 17:04 Igor Fedotov
  2015-09-22 19:11 ` Sage Weil
  0 siblings, 1 reply; 26+ messages in thread
From: Igor Fedotov @ 2015-09-22 17:04 UTC (permalink / raw)
  To: ceph-devel

Hi guys,

I have seen some discussion about adding compression support to Ceph.
Let me share some thoughts and proposals on that too.

First of all I’d like to consider several major implementation options
separately. IMHO this makes sense since they differ in applicability,
value and implementation specifics. Besides that, smaller parts are
easier both to understand and to implement.

   * Data-At-Rest Compression. This is about compressing the bulk of
the data kept by the Ceph backing tier. The main reason for that is
reducing storage costs. A similar trade-off was introduced by the
Erasure Coding pool implementation - cluster capacity increases (i.e.
storage cost goes down) at the expense of additional computation. This
is especially effective when combined with a high-performance cache tier.
   * Intermediate Data Compression. This case is about applying
compression to intermediate data such as system journals, caches etc.
The intention is to improve utilization of expensive storage resources
(e.g. solid state drives or RAM). At the same time, the idea of applying
compression (a feature that undoubtedly introduces additional overhead)
to the crucial heavy-duty components probably looks contradictory.
   * Exchange Data Compression. This one would be applied to messages
transported between clients and storage cluster components as well as
to internal cluster traffic. The rationale might be the desire to
improve cluster run-time characteristics, e.g. when data bandwidth is
limited by network or storage device throughput. The potential drawback
is overburdening the client - client computation resources might become
a bottleneck since they take on most of the compression/decompression
tasks.

Obviously it would be great to have support for all the above cases,
e.g. object compression takes place at the client and cluster components
handle that naturally during the object life-cycle. Unfortunately
significant complexities arise along this way. Most of them are related
to partial object access, both reading and writing. It looks like a huge
development (redesign, refactoring and new code) and testing effort
would be required. It’s also hard to estimate the value of such
aggregated support at the moment.
Thus the approach I’m suggesting is to make progress incrementally and
consider the cases separately. At the moment my proposal is to add
Data-At-Rest compression to Erasure Coded pools, as the most clear-cut
case from both the implementation and value points of view.

How can we do that?

Ceph cluster architecture suggests a two-tier storage model for
production use. A cache tier built on high-performance, expensive
storage devices provides performance. A storage tier with low-cost,
slower devices provides cost-effectiveness and capacity. The cache tier
is supposed to use ordinary data replication while the storage tier can
use erasure coding (EC) for efficient and reliable data keeping. EC
provides lower storage costs with the same reliability compared to data
replication, at the expense of additional computation. Thus Ceph already
has a trade-off between capacity and computation effort. Data-At-Rest
compression is about exactly the same trade-off. Moreover, one can tie
EC and Data-At-Rest compression together to achieve even better storage
efficiency.
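
To make the capacity trade-off above concrete, here is a rough
back-of-the-envelope sketch. The numbers are illustrative assumptions,
not measurements: 3-way replication, an EC profile with k=4 data and
m=2 coding chunks, and a compression ratio of 0.5 (compressed size
divided by original size).

```python
def raw_bytes_replicated(logical_bytes, replicas):
    """Raw capacity consumed when every object is stored `replicas` times."""
    return logical_bytes * replicas


def raw_bytes_ec(logical_bytes, k, m, compression_ratio=1.0):
    """Raw capacity consumed by a k+m erasure-coded pool.

    compression_ratio is compressed size / original size; 1.0 means the
    data is stored uncompressed. Compression is applied before coding.
    """
    return logical_bytes * compression_ratio * (k + m) / k


logical_tb = 100  # illustrative amount of user data, in TB

print(raw_bytes_replicated(logical_tb, 3))   # 300
print(raw_bytes_ec(logical_tb, 4, 2))        # 150.0
print(raw_bytes_ec(logical_tb, 4, 2, 0.5))   # 75.0
```

Under these assumptions the same user data needs 300 TB of raw space
with replication, 150 TB with EC alone, and 75 TB with EC plus
compression. The real ratio depends entirely on the data.
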
There are two possible ways of adding Data-At-Rest compression:
   1. Use data compression built into a file system beneath Ceph.
   2. Add compression to the Ceph OSD.

At first glance option 1 looks pretty attractive, but this approach has
some drawbacks:
   * File system lock-in. BTRFS is the only file system supporting
transparent compression among the ones recommended for Ceph usage.
Moreover, AFAIK it’s still not recommended for production usage, see:
http://ceph.com/docs/master/rados/configuration/filesystem-recommendations/
   * Limited flexibility - one can use only the compression methods and
policies supported by the FS.
   * Data compression depends on volume or mount point properties (and
is bound to the OSD). Without additional support Ceph lacks the ability
to have different compression policies for different pools residing on
the same OSD.
   * File compression control isn’t standardized among file systems.
If (or when) a new compression-equipped file system appears, Ceph might
require corresponding changes to handle it properly.

Having compression at the OSD helps to eliminate these drawbacks.
As mentioned above, the purpose of Data-At-Rest compression is pretty
much the same as that of Erasure Coding. It looks quite easy to add
compression support to EC pools. This way one can get even more storage
space in exchange for higher CPU load.
Additional pros of combining compression and erasure coding are:
   * Both EC and compression have complexities with partial writes. EC
pools don’t support partial writes (data append only) and the solution
for that is inserting a cache tier. Thus we can transparently reuse the
same approach for compression.
   * Compression becomes a pool property, thus Ceph users will have
direct control over which pools compression is applied to.
   * Original write performance isn’t impacted by compression in the
two-tier model - write data goes to the cache uncompressed and there is
no corresponding compression latency. Actual compression happens in the
background when the backing storage is being filled.
   * There is an additional benefit in network bandwidth savings when
the primary OSD performs the compression, as the resulting object shards
for replication are smaller.
   * Data-at-rest compression can also bring an additional performance
improvement for HDD-based storage. Reducing the amount of data written
to slow media can provide a net performance improvement even taking the
compression overhead into account.

Some implementation notes:

The suggested approach is to perform data compression prior to Erasure
Coding, to reduce the amount of data passed to coding and to avoid the
need for additional means to disable compression of EC-generated chunks.
Data-At-Rest compression should support a plugin architecture to enable
multiple compression backends.
The compression engine should mark stored objects with tags that
indicate whether compression took place and which algorithm was used.
To avoid (or reduce) CPU overload on the backing storage caused by
compression/decompression (e.g. this can happen during massive reads),
we can introduce additional means to detect such situations and
temporarily disable compression for current write requests. Since there
is a way to mark objects as compressed/uncompressed, this produces
almost no issues for later handling. Using hardware compression
support, e.g. Intel QuickAssist, can be an additional help with this
issue.
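
A minimal sketch of how the notes above could fit together, assuming a
simple name-to-codec registry as the "plugin" mechanism and a small tag
dictionary kept with each object. The names COMPRESSORS, store, load and
cpu_overloaded are hypothetical and only illustrate the idea; this is
not Ceph code. The codecs come from the Python standard library.

```python
import bz2
import lzma
import zlib

# Hypothetical plugin registry: algorithm name -> (compress, decompress).
COMPRESSORS = {
    "zlib": (zlib.compress, zlib.decompress),
    "bz2": (bz2.compress, bz2.decompress),
    "lzma": (lzma.compress, lzma.decompress),
}


def store(data, algorithm, cpu_overloaded=False):
    """Return (tags, payload) as they would be kept with the object.

    When cpu_overloaded is set, compression is skipped for this write
    and the tags record that the payload is stored as is.
    """
    if cpu_overloaded:
        return {"compressed": False}, data
    compress, _ = COMPRESSORS[algorithm]
    return {"compressed": True, "algorithm": algorithm}, compress(data)


def load(tags, payload):
    """Return the original bytes, using the tags to pick the decoder."""
    if not tags["compressed"]:
        return payload
    _, decompress = COMPRESSORS[tags["algorithm"]]
    return decompress(payload)


data = b"abc" * 1000
tags, payload = store(data, "zlib")
assert load(tags, payload) == data

# Same object written while the node is busy: stored uncompressed,
# yet the read path is unchanged because the tags say so.
tags, payload = store(data, "zlib", cpu_overloaded=True)
assert load(tags, payload) == data
```

Because every object carries its own tags, objects written with
different algorithms, or with compression temporarily disabled, can
coexist in one pool.
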

Any thoughts?

Thanks,
Igor.
--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


* Re: Adding Data-At-Rest compression support to Ceph
  2015-09-22 17:04 Adding Data-At-Rest compression support to Ceph Igor Fedotov
@ 2015-09-22 19:11 ` Sage Weil
  2015-09-23 12:47   ` Igor Fedotov
  0 siblings, 1 reply; 26+ messages in thread
From: Sage Weil @ 2015-09-22 19:11 UTC (permalink / raw)
  To: Igor Fedotov; +Cc: ceph-devel

On Tue, 22 Sep 2015, Igor Fedotov wrote:
> Hi guys,
> 
> I can find some talks about adding compression support to Ceph. Let me share
> some thoughts and proposals on that too.
> 
> First of all I'd like to consider several major implementation options
> separately. IMHO this makes sense since they have different applicability,
> value and implementation specifics. Besides that less parts are easier for
> both understanding and implementation.
> 
>   * Data-At-Rest Compression. This is about compressing basic data volume kept
> by the Ceph backing tier. The main reason for that is data store costs
> reduction. One can find similar approach introduced by Erasure Coding Pool
> implementation - cluster capacity increases (i.e. storage cost reduces) at the
> expense of additional computations. This is especially effective when combined
> with the high-performance cache tier.
>   *  Intermediate Data Compression. This case is about applying compression
> for intermediate data like system journals, caches etc. The intention is to
> improve expensive storage resource  utilization (e.g. solid state drives or
> RAM ). At the same time the idea to apply compression ( feature that
> undoubtedly introduces additional overhead ) to the crucial heavy-duty
> components probably looks contradictory.
>   *  Exchange Data Compression. This one to be applied to messages transported
> between client and storage cluster components as well as internal cluster
> traffic. The rationale for that might be the desire to improve cluster
> run-time characteristics, e.g. limited data bandwidth caused by the network or
> storage devices throughput. The potential drawback is client overburdening -
> client computation resources might become a bottleneck since they take most of
> compression/decompression tasks.
> 
> Obviously it would be great to have support for all the above cases, e.g.
> object compression takes place at the client and cluster components handle
> that naturally during the object life-cycle. Unfortunately significant
> complexities arise on this way. Most of them are related to partial object
> access, both reading and writing. It looks like huge development (
> redesigning, refactoring and new code development ) and testing efforts are
> required on this way. It's hard to estimate the value of such aggregated
> support at the current moment too.
> Thus the approach I'm suggesting is to drive the progress eventually and
> consider cases separately. At the moment my proposal is to add Data-At-Rest
> compression to Erasure Coded pools as the most definite one from both
> implementation and value points of view.
> 
> How we can do that.
> 
> Ceph Cluster Architecture suggests two-tier storage model for production
> usage. Cache tier built on high-performance expensive storage devices provides
> performance. Storage tier with low-cost less-efficient devices provides
> cost-effectiveness and capacity. Cache tier is supposed to use ordinary data
> replication while storage one can use erasure coding (EC) for effective and
> reliable data keeping. EC provides less store costs with the same reliability
> comparing to data replication approach at the expenses of additional
> computations. Thus Ceph already has some trade off between capacity and
> computation efforts. Actually Data-At-Rest compression is exactly about the
> same. Moreover one can tie EC and Data-At-Rest compression together to achieve
> even better storage effectiveness.
> There are two possible ways on adding Data-At-Rest compression:
>   *  Use data compression built into a file system beyond the Ceph.
>   *  Add compression to Ceph OSD.
> 
> At first glance Option 1. looks pretty attractive but there are some drawbacks
> for this approach. Here they are:
>   *  File System lock-in. BTRFS is the only file system supporting transparent
> compression among ones recommended for Ceph usage. Moreover
> AFAIK it's still not recommended for production usage, see:
> http://ceph.com/docs/master/rados/configuration/filesystem-recommendations/
>    *  Limited flexibility - one can use compression methods and policies
> supported by FS only.
>    *  Data compression depends on volume or mount point properties (and is
> bound to OSD). Without additional support Ceph lacks the ability to have
> different compression policies for different pools residing at the same OSD.
>    *  File Compression Control isn't standardized among file systems. If (or
> when) new compression-equipped File System appears Ceph might require
> corresponding changes to handle that properly.
> 
> Having compression at OSD helps to eliminate these drawbacks.
> As mentioned above Data-At-Rest compression purposes are pretty the same as
> for Erasure Coding. It looks quite easy to add compression support to EC
> pools. This way one can have even more storage space for higher CPU load.
> Additional Pros for combining compression and erasure coding are:
>   *  Both EC and compression have complexities in partial writing. EC pools
> don't have partial write support (data append only) and the solution for that
> is cache tier insertion.  Thus we can transparently reuse the same approach in
> case of compression.
>   *  Compression becomes a pool property thus Ceph users will have direct
> control what pools to apply compression with.
>   *  Original write performance isn't impacted by the compression for two-tier
> model - write data goes to the cache uncompressed and there is no
> corresponding compression latency. Actual compression happens in background
> when backing storage filling takes place.
>   *  There is an additional benefit in network bandwidth saving when primary
> OSD performs a compression as resulting object shards for replication are
> less.
>   *  Data-at-rest compression can also bring an additional performance
> improvement for HDD-based storage. Reducing the amount of data written to slow
> media can provide a net performance improvement even taking into account the
> compression overhead.

I think this approach makes a lot of sense.  The tricky bit will be 
storing the additional metadata that maps logical offsets to compressed 
offsets. 
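
To illustrate the kind of metadata meant here, a small sketch under
stated assumptions: each append is compressed independently with zlib,
and the object keeps one extent record per append. The class and field
names are hypothetical, not an existing Ceph structure. A random read
finds the first extent with a binary search, decompresses only the
extents it overlaps, and slices the result.

```python
import bisect
import zlib


class CompressedObject:
    """Append-only object that stores each append as a compressed extent."""

    def __init__(self):
        self.blob = b""            # compressed bytes as stored
        self.logical_starts = []   # logical start offset of each extent
        self.extents = []          # (comp_off, comp_len, logical_len)
        self.logical_size = 0

    def append(self, data):
        comp = zlib.compress(data)
        self.logical_starts.append(self.logical_size)
        self.extents.append((len(self.blob), len(comp), len(data)))
        self.blob += comp
        self.logical_size += len(data)

    def read(self, offset, length):
        end = min(offset + length, self.logical_size)
        if offset >= end:
            return b""
        # Index of the extent that contains `offset`.
        i = bisect.bisect_right(self.logical_starts, offset) - 1
        out = bytearray()
        while i < len(self.extents) and self.logical_starts[i] < end:
            comp_off, comp_len, logical_len = self.extents[i]
            plain = zlib.decompress(self.blob[comp_off:comp_off + comp_len])
            lo = max(offset - self.logical_starts[i], 0)
            hi = min(end - self.logical_starts[i], logical_len)
            out += plain[lo:hi]
            i += 1
        return bytes(out)


obj = CompressedObject()
for part in (b"a" * 4096, b"b" * 4096, b"c" * 100):
    obj.append(part)

# A read that straddles the first two extents.
assert obj.read(4000, 200) == b"a" * 96 + b"b" * 104
```

The map grows by one entry per append, so the append size chosen by the
writer determines both the metadata overhead and how much data must be
decompressed to serve a small read.
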

> Some implementation notes:
> 
> The suggested approach is to perform data compression prior to Erasure Coding
> to reduce data portion passed to coding and avoid the need to introduce
> additional means to disable EC-generated chunks compression.

At first glance, the compress-before-ec approach sounds attractive: the 
complex EC striping stuff doesn't need to change, and we just need to map 
logical offsets to compressed offsets before doing the EC read/reconstruct 
as we normally would.  The problem is with appends: the EC stripe size 
is exposed to the user and they write in those increments.  So if we 
compress before we pass it to EC, then we need to have variable stripe 
sizes for each write (depending on how well it compressed).  The upshot 
here is that if we end up supporting variable EC stripe sizes we *could* 
allow librados appends of any size (not just the stripe size as we 
currently do).  I'm not sure how important/useful that is...
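
A small sketch of why compress-before-EC leads to variable stripe sizes.
The assumptions are illustrative only: k=4 data chunks, a 16 KiB logical
append, zlib as the codec, and zero padding up to a multiple of k. Two
appends of the same logical size produce very different chunk sizes
depending on how well the data compresses.

```python
import os
import zlib

K = 4                      # data chunks per stripe (illustrative)
STRIPE_WIDTH = 4096 * K    # logical size of each append (illustrative)


def compressed_stripe(data):
    """Compress one append and split it into K equally sized data chunks.

    Returns (chunk_size, chunks). The compressed length would have to be
    recorded alongside, since the last chunk may contain padding.
    """
    comp = zlib.compress(data)
    chunk_size = -(-len(comp) // K)              # ceiling division
    padded = comp.ljust(chunk_size * K, b"\0")   # pad to a multiple of K
    chunks = [padded[i * chunk_size:(i + 1) * chunk_size] for i in range(K)]
    return chunk_size, chunks


text_like = (b"ceph " * STRIPE_WIDTH)[:STRIPE_WIDTH]   # highly compressible
random_like = os.urandom(STRIPE_WIDTH)                 # barely compressible

for name, data in (("text-like", text_like), ("random", random_like)):
    chunk_size, chunks = compressed_stripe(data)
    print(name, "logical", len(data), "chunk size", chunk_size)
```

Each write therefore needs its own stripe geometry recorded somewhere,
which is the variable stripe size bookkeeping described above.
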

On the other hand, ec-before-compression still means we need to map coded 
stripe offsets to compressed offsets... and you're right that it puts a bit 
more data through the EC transform.

Either way, it will be a reasonably complex change.

> Data-At-Rest compression should support plugin architecture to enable multiple
> compression backends.

Haomai has started some simple compression infrastructure to support 
compression over the wire; see

	https://github.com/ceph/ceph/pull/5116

We should reuse or extend the plugin interface there to cover both users.

> Compression engine should mark stored objects with some tags to indicate if
> compression took place and what algorithm was used.
> To avoid (reduce) backing storage CPU overload caused by
> compression/decompression ( e.g. this can happen during massive reads ) we can
> introduce additional means to detect such situations and temporary disable
> compression for current write requests. Since there is way to mark objects as
> compressed/uncompressed this produces almost no issues for future handling.
> Hardware compression support usage, e.g. Intel QuickAssist can be an
> additional helper for this issue.

Great to see this moving forward!
sage



* Re: Adding Data-At-Rest compression support to Ceph
  2015-09-22 19:11 ` Sage Weil
@ 2015-09-23 12:47   ` Igor Fedotov
  2015-09-23 13:15     ` Sage Weil
  0 siblings, 1 reply; 26+ messages in thread
From: Igor Fedotov @ 2015-09-23 12:47 UTC (permalink / raw)
  To: ceph-devel

Hi Sage,
thanks a lot for your feedback.

Regarding the issues with offset mapping and stripe size exposure:
What about the idea of applying compression in the two-tier (cache +
backing storage) model only?
I doubt the single-tier model is widely used for EC pools, since there
is no random write support in that mode. Thus this might be an
acceptable limitation.
At the same time, it seems that appends caused by a cached object flush
have a fixed block size (8 MB by default), and the object is totally
rewritten on the next flush, if any. This makes offset mapping less
tricky.
Decompression should be supported in any model, though, since shutting
down the cache tier and then accessing the compressed data is possibly
a valid use case.

Thanks,
Igor




* Re: Adding Data-At-Rest compression support to Ceph
  2015-09-23 12:47   ` Igor Fedotov
@ 2015-09-23 13:15     ` Sage Weil
  2015-09-23 14:05       ` Gregory Farnum
  2015-09-23 14:08       ` Igor Fedotov
  0 siblings, 2 replies; 26+ messages in thread
From: Sage Weil @ 2015-09-23 13:15 UTC (permalink / raw)
  To: Igor Fedotov; +Cc: ceph-devel

On Wed, 23 Sep 2015, Igor Fedotov wrote:
> Hi Sage,
> thanks a lot for your feedback.
> 
> Regarding issues with offset mapping and stripe size exposure.
> What's about the idea to apply compression in two-tier (cache+backing storage)
> model only ?

I'm not sure we win anything by making it a two-tier only thing... simply 
making it a feature of the EC pool means we can also address EC pool users 
like radosgw.

> I doubt single-tier one is widely used for EC pools since there is no random
> write support in such mode. Thus this might be an acceptable limitation.
> At the same time it seems that appends caused by cached object flush have
> fixed block size (8Mb by default). And object is totally rewritten on the next
> flush if any. This makes offset mapping less tricky.
> Decompression should be applied in any model though as cache tier shutdown and
> subsequent compressed data access is possibly  a valid use case.

Yeah, we need to handle random reads either way, so I think the offset 
mapping is going to be needed anyway.  And I don't think there is any 
real difference from the EC pool's perspective between a direct user 
like radosgw and the cache tier writing objects--in both cases it's 
doing appends and deletes.

sage


> 
> Thanks,
> Igor
> 
> On 22.09.2015 22:11, Sage Weil wrote:
> > On Tue, 22 Sep 2015, Igor Fedotov wrote:
> > > Hi guys,
> > > 
> > > I can find some talks about adding compression support to Ceph. Let me
> > > share
> > > some thoughts and proposals on that too.
> > > 
> > > First of all I?d like to consider several major implementation options
> > > separately. IMHO this makes sense since they have different applicability,
> > > value and implementation specifics. Besides that less parts are easier for
> > > both understanding and implementation.
> > > 
> > >    * Data-At-Rest Compression. This is about compressing basic data volume
> > > kept
> > > by the Ceph backing tier. The main reason for that is data store costs
> > > reduction. One can find similar approach introduced by Erasure Coding Pool
> > > implementation - cluster capacity increases (i.e. storage cost reduces) at
> > > the
> > > expense of additional computations. This is especially effective when
> > > combined
> > > with the high-performance cache tier.
> > >    *  Intermediate Data Compression. This case is about applying
> > > compression
> > > for intermediate data like system journals, caches etc. The intention is
> > > to
> > > improve expensive storage resource  utilization (e.g. solid state drives
> > > or
> > > RAM ). At the same time the idea to apply compression ( feature that
> > > undoubtedly introduces additional overhead ) to the crucial heavy-duty
> > > components probably looks contradictory.
> > >    *  Exchange Data ?ompression. This one to be applied to messages
> > > transported
> > > between client and storage cluster components as well as internal cluster
> > > traffic. The rationale for that might be the desire to improve cluster
> > > run-time characteristics, e.g. limited data bandwidth caused by the
> > > network or
> > > storage devices throughput. The potential drawback is client overburdening
> > > -
> > > client computation resources might become a bottleneck since they take
> > > most of
> > > compression/decompression tasks.
> > > 
> > > Obviously it would be great to have support for all the above cases, e.g.
> > > object compression takes place at the client and cluster components handle
> > > that naturally during the object life-cycle. Unfortunately significant
> > > complexities arise on this way. Most of them are related to partial object
> > > access, both reading and writing. It looks like huge development (
> > > redesigning, refactoring and new code development ) and testing efforts
> > > are
> > > required on this way. It?s hard to estimate the value of such aggregated
> > > support at the current moment too.
> > > Thus the approach I?m suggesting is to drive the progress eventually and
> > > consider cases separately. At the moment my proposal is to add
> > > Data-At-Rest
> > > compression to Erasure Coded pools as the most definite one from both
> > > implementation and value points of view.
> > > 
> > > How we can do that.
> > > 
> > > Ceph Cluster Architecture suggests two-tier storage model for production
> > > usage. Cache tier built on high-performance expensive storage devices
> > > provides
> > > performance. Storage tier with low-cost less-efficient devices provides
> > > cost-effectiveness and capacity. Cache tier is supposed to use ordinary
> > > data
> > > replication while storage one can use erasure coding (EC) for effective
> > > and
> > > reliable data keeping. EC provides less store costs with the same
> > > reliability
> > > comparing to data replication approach at the expenses of additional
> > > computations. Thus Ceph already has some trade off between capacity and
> > > computation efforts. Actually Data-At-Rest compression is exactly about
> > > the
> > > same. Moreover one can tie EC and Data-At-Rest compression together to
> > > achieve
> > > even better storage effectiveness.
> > > There are two possible ways on adding Data-At-Rest compression:
> > >    *  Use data compression built into a file system beyond the Ceph.
> > >    *  Add compression to Ceph OSD.
> > > 
> > > At first glance Option 1. looks pretty attractive but there are some
> > > drawbacks
> > > for this approach. Here they are:
> > >    *  File System lock-in. BTRFS is the only file system supporting
> > > transparent
> > > compression among ones recommended for Ceph usage.
> > > Moreover
> > > AFAIK it?s still not recommended for production usage, see:
> > > http://ceph.com/docs/master/rados/configuration/filesystem-recommendations/
> > >     *  Limited flexibility - one can use compression methods and policies
> > > supported by FS only.
> > >     *  Data compression depends on volume or mount point properties (and
> > > is
> > > bound to OSD). Without additional support Ceph lacks the ability to have
> > > different compression policies for different pools residing at the same
> > > OSD.
> > >     *  File Compression Control isn't standardized among file systems. If
> > > (or
> > > when) new compression-equipped File System appears Ceph might require
> > > corresponding changes to handle that properly.
> > > 
> > > Having compression at OSD helps to eliminate these drawbacks.
> > > As mentioned above Data-At-Rest compression purposes are pretty the same
> > > as
> > > for Erasure Coding. It looks quite easy to add compression support to EC
> > > pools. This way one can have even more storage space for higher CPU load.
> > > Additional Pros for combining compression and erasure coding are:
> > >    *  Both EC and compression have complexities in partial writing. EC
> > > pools
> > > don't have partial write support (data append only) and the solution for
> > > that
> > > is cache tier insertion.  Thus we can transparently reuse the same
> > > approach in
> > > case of compression.
> > >    *  Compression becomes a pool property thus Ceph users will have direct
> > > control what pools to apply compression with.
> > >    *  Original write performance isn't impacted by the compression for
> > > two-tier
> > > model - write data goes to the cache uncompressed and there is no
> > > corresponding compression latency. Actual compression happens in
> > > background
> > > when backing storage filling takes place.
> > >    *  There is an additional benefit in network bandwidth saving when
> > > primary
> > > OSD performs a compression as resulting object shards for replication are
> > > less.
> > >    *  Data-at-rest compression can also bring an additional performance
> > > improvement for HDD-based storage. Reducing the amount of data written to
> > > slow
> > > media can provide a net performance improvement even taking into account
> > > the
> > > compression overhead.
> > I think this approach makes a lot of sense.  The tricky bit will be
> > storing the additional metadata that maps logical offsets to compressed
> > offsets.
> > 
> > > Some implementation notes:
> > > 
> > > The suggested approach is to perform data compression prior to Erasure
> > > Coding
> > > to reduce data portion passed to coding and avoid the need to introduce
> > > additional means to disable EC-generated chunks compression.
> > At first glance, the compress-before-ec approach sounds attractive: the
> > complex EC striping stuff doesn't need to change, and we just need to map
> > logical offsets to compressed offsets before doing the EC read/reconstruct
> > as we normally would.  The problem is with appends: the EC stripe size
> > is exposed to the user and they write in those increments.  So if we
> > compress before we pass it to EC, then we need to have variable stripe
> > sizes for each write (depending on how well it compressed).  The upshot
> > here is that if we end up supporting variable EC stripe sizes we *could*
> > allow librados appends of any size (not just the stripe size as we
> > currently do).  I'm not sure how important/useful that is...
> > 
> > On the other hand, ec-before-compression still means we need to map coded
> > stripe offsets to compressed offsets.. and you're right that it puts a bit
> > more data through the EC transform.
> > 
> > Either way, it will be a reasonably complex change.
> > 
> > > Data-At-Rest compression should support plugin architecture to enable
> > > multiple
> > > compression backends.
> > Haomai has started some simple compression infrastructure to support
> > compression over the wire; see
> > 
> > 	https://github.com/ceph/ceph/pull/5116
> > 
> > We should reuse or extend the plugin interface there to cover both users.
> > 
> > > Compression engine should mark stored objects with some tags to indicate
> > > if
> > > compression took place and what algorithm was used.
> > > To avoid (reduce) backing storage CPU overload caused by
> > > compression/decompression ( e.g. this can happen during massive reads ) we
> > > can
> > > introduce additional means to detect such situations and temporarily disable
> > > compression for current write requests. Since there is a way to mark objects
> > > as
> > > compressed/uncompressed this produces almost no issues for future
> > > handling.
> > > Hardware compression support usage, e.g. Intel QuickAssist can be an
> > > additional helper for this issue.
> > Great to see this moving forward!
> > sage
> > 
> 
> --
> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> 
> 

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: Adding Data-At-Rest compression support to Ceph
  2015-09-23 13:15     ` Sage Weil
@ 2015-09-23 14:05       ` Gregory Farnum
  2015-09-23 15:26         ` Igor Fedotov
  2015-09-23 14:08       ` Igor Fedotov
  1 sibling, 1 reply; 26+ messages in thread
From: Gregory Farnum @ 2015-09-23 14:05 UTC (permalink / raw)
  To: Sage Weil; +Cc: Igor Fedotov, ceph-devel

On Wed, Sep 23, 2015 at 6:15 AM, Sage Weil <sage@newdream.net> wrote:
> On Wed, 23 Sep 2015, Igor Fedotov wrote:
>> Hi Sage,
>> thanks a lot for your feedback.
>>
>> Regarding issues with offset mapping and stripe size exposure.
>> What about the idea to apply compression in two-tier (cache+backing storage)
>> model only ?
>
> I'm not sure we win anything by making it a two-tier only thing... simply
> making it a feature of the EC pool means we can also address EC pool users
> like radosgw.
>
>> I doubt single-tier one is widely used for EC pools since there is no random
>> write support in such mode. Thus this might be an acceptable limitation.
>> At the same time it seems that appends caused by cached object flush have
>> fixed block size (8 MB by default). And object is totally rewritten on the next
>> flush if any. This makes offset mapping less tricky.
>> Decompression should be applied in any model though as cache tier shutdown and
>> subsequent compressed data access is possibly  a valid use case.
>
> Yeah, we need to handle random reads either way, so I think the offset
> mapping is going to be needed anyway.

The idea of making the primary responsible for object compression
really concerns me. It means for instance that a single random access
will likely require access to multiple objects, and breaks many of the
optimizations we have right now or in the pipeline (for instance:
direct client access). And apparently only the EC pool will support
compression, which is frustrating for all the replicated pool users
out there...

Is there some reason we don't just want to apply compression across an
OSD store? Perhaps doing it on the filesystem level is the wrong way
(for reasons named above) but there are other mechanisms like inline
block device compression that I think are supposed to work pretty
well. The only thing mentioned here that this doesn't get us, as far
as I can see, is the over-the-wire compression, and Haomai already has
patches for that, which should be a lot easier to validate and will
work at all levels of the stack!
-Greg

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: Adding Data-At-Rest compression support to Ceph
  2015-09-23 13:15     ` Sage Weil
  2015-09-23 14:05       ` Gregory Farnum
@ 2015-09-23 14:08       ` Igor Fedotov
  2015-09-23 14:37         ` Sage Weil
  1 sibling, 1 reply; 26+ messages in thread
From: Igor Fedotov @ 2015-09-23 14:08 UTC (permalink / raw)
  To: Sage Weil; +Cc: ceph-devel

Sage,

so you are saying that radosgw tends to use EC pools directly without 
caching, right?

I agree that we need offset mapping anyway.

And the difference between cache writes and direct writes is mainly in 
block size granularity: 8 MB vs. 4 KB. In the latter case we have higher 
overhead for both offset mapping and compression. But I agree - no real 
difference from an implementation point of view.
OK, let's try to handle both use cases.
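
To put rough numbers on the mapping overhead, here is a minimal sketch in
Python (illustrative only, not Ceph code). It assumes one map entry per
fixed-size logical block, each block compressed on its own; the function
names are made up for this example.

```python
MIB = 1024 * 1024
KIB = 1024


def map_entries(object_size, block_size):
    """Number of (logical block -> compressed offset) entries for one object."""
    return -(-object_size // block_size)  # ceiling division


def build_offset_map(compressed_sizes):
    """Compressed start offset of every block, given each block's compressed size."""
    starts, pos = [], 0
    for size in compressed_sizes:
        starts.append(pos)
        pos += size
    return starts


def locate(logical_offset, block_size, starts):
    """Return (block index, compressed start offset, offset inside the decompressed block)."""
    index = logical_offset // block_size
    return index, starts[index], logical_offset % block_size


if __name__ == "__main__":
    # An 8 MiB object written in one cache flush vs. in 4 KiB direct writes.
    print(map_entries(8 * MIB, 8 * MIB))
    print(map_entries(8 * MIB, 4 * KIB))
```

Under these assumptions a random read only needs the entry for the block
it falls into, but the number of entries per object grows as the block
size shrinks.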

So what do you think - can we proceed with implementing this feature, or 
do we need more discussion on that?

Thanks,
Igor.

On 23.09.2015 16:15, Sage Weil wrote:
> On Wed, 23 Sep 2015, Igor Fedotov wrote:
>> Hi Sage,
>> thanks a lot for your feedback.
>>
>> Regarding issues with offset mapping and stripe size exposure.
>> What about the idea to apply compression in two-tier (cache+backing storage)
>> model only ?
> I'm not sure we win anything by making it a two-tier only thing... simply
> making it a feature of the EC pool means we can also address EC pool users
> like radosgw.
>
>> I doubt single-tier one is widely used for EC pools since there is no random
>> write support in such mode. Thus this might be an acceptable limitation.
>> At the same time it seems that appends caused by cached object flush have
>> fixed block size (8 MB by default). And object is totally rewritten on the next
>> flush if any. This makes offset mapping less tricky.
>> Decompression should be applied in any model though as cache tier shutdown and
>> subsequent compressed data access is possibly  a valid use case.
> Yeah, we need to handle random reads either way, so I think the offset
> mapping is going to be needed anyway.  And I don't think there is any
> real difference from the EC pool's perspective between a direct user
> like radosgw and the cache tier writing objects--in both cases it's
> doing appends and deletes.
>
> sage
>
>


^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: Adding Data-At-Rest compression support to Ceph
  2015-09-23 14:08       ` Igor Fedotov
@ 2015-09-23 14:37         ` Sage Weil
  0 siblings, 0 replies; 26+ messages in thread
From: Sage Weil @ 2015-09-23 14:37 UTC (permalink / raw)
  To: Igor Fedotov; +Cc: ceph-devel

On Wed, 23 Sep 2015, Igor Fedotov wrote:
> Sage,
> 
> so you are saying that radosgw tends to use EC pools directly without caching,
> right?
> 
> I agree that we need offset mapping anyway.
> 
> And the difference between cache writes and direct writes is mainly in block
> size granularity: 8 MB vs. 4 KB. In the latter case we have higher overhead
> for both offset mapping and compression. But I agree - no real difference from
> an implementation point of view.
> OK, let's try to handle both use cases.
> 
> So what do you think - can we proceed with implementing this feature, or do we
> need more discussion on that?

I think we should consider other options before moving forward.

Greg mentions doing this in the fs layer or even devicemapper.  That's 
attractive because it requires no work on our end.

Another option is to do this in the ObjectStore implementation.  It would 
be horribly inefficient to do in all cases, but we could provide a hint 
that all writes to an object will be appends.  This is something that 
NewStore, for example, could probably do without too much trouble.
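
As a rough illustration of the append-only case, here is a minimal sketch
in Python. It is not NewStore or ObjectStore code: the class name, the
extent layout and the use of zlib are all assumptions made for the
example. Each append is compressed on its own and tagged with the
algorithm used, so a read only decompresses the extents it overlaps.

```python
import zlib


class AppendOnlyObject:
    """Hypothetical append-only object that compresses each append independently."""

    def __init__(self):
        # Each extent: (logical_offset, logical_length, algorithm, stored_bytes)
        self.extents = []
        self.size = 0  # logical (uncompressed) size

    def append(self, data):
        packed = zlib.compress(data)
        # Keep the raw bytes when compression does not help, and tag the extent.
        if len(packed) < len(data):
            extent = (self.size, len(data), "zlib", packed)
        else:
            extent = (self.size, len(data), "none", data)
        self.extents.append(extent)
        self.size += len(data)

    def read(self, offset, length):
        end = min(offset + length, self.size)
        out = bytearray()
        for start, ext_len, algorithm, stored in self.extents:
            if start + ext_len <= offset or start >= end:
                continue  # extent does not overlap the requested range
            plain = zlib.decompress(stored) if algorithm == "zlib" else stored
            out += plain[max(offset - start, 0):end - start]
        return bytes(out)


if __name__ == "__main__":
    obj = AppendOnlyObject()
    obj.append(b"a" * 1000)  # compresses well, stored with the "zlib" tag
    obj.append(b"xyz")       # too small to benefit, stored with the "none" tag
    print(obj.read(998, 4))
```

Because nothing is ever overwritten, the extent list doubles as the
logical-to-stored offset map; random overwrites are exactly what this
layout cannot handle cheaply.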

sage


> 
> Thanks,
> Igor.
> 
> On 23.09.2015 16:15, Sage Weil wrote:
> > On Wed, 23 Sep 2015, Igor Fedotov wrote:
> > > Hi Sage,
> > > thanks a lot for your feedback.
> > > 
> > > Regarding issues with offset mapping and stripe size exposure.
> > > What about the idea to apply compression in two-tier (cache+backing
> > > storage)
> > > model only ?
> > I'm not sure we win anything by making it a two-tier only thing... simply
> > making it a feature of the EC pool means we can also address EC pool users
> > like radosgw.
> > 
> > > I doubt single-tier one is widely used for EC pools since there is no
> > > random
> > > write support in such mode. Thus this might be an acceptable limitation.
> > > At the same time it seems that appends caused by cached object flush have
> > > fixed block size (8 MB by default). And object is totally rewritten on the
> > > next
> > > flush if any. This makes offset mapping less tricky.
> > > Decompression should be applied in any model though as cache tier shutdown
> > > and
> > > subsequent compressed data access is possibly  a valid use case.
> > Yeah, we need to handle random reads either way, so I think the offset
> > mapping is going to be needed anyway.  And I don't think there is any
> > real difference from the EC pool's perspective between a direct user
> > like radosgw and the cache tier writing objects--in both cases it's
> > doing appends and deletes.
> > 
> > sage
> > 
> > 
> > > Thanks,
> > > Igor
> > > 
> > > On 22.09.2015 22:11, Sage Weil wrote:
> > > > On Tue, 22 Sep 2015, Igor Fedotov wrote:
> > > > > Hi guys,
> > > > > 
> > > > > I can find some talks about adding compression support to Ceph. Let me
> > > > > share
> > > > > some thoughts and proposals on that too.
> > > > > 
> > > > > First of all I?d like to consider several major implementation options
> > > > > separately. IMHO this makes sense since they have different
> > > > > applicability,
> > > > > value and implementation specifics. Besides that less parts are easier
> > > > > for
> > > > > both understanding and implementation.
> > > > > 
> > > > >     * Data-At-Rest Compression. This is about compressing basic data
> > > > > volume
> > > > > kept
> > > > > by the Ceph backing tier. The main reason for that is data store costs
> > > > > reduction. One can find similar approach introduced by Erasure Coding
> > > > > Pool
> > > > > implementation - cluster capacity increases (i.e. storage cost
> > > > > reduces) at
> > > > > the
> > > > > expense of additional computations. This is especially effective when
> > > > > combined
> > > > > with the high-performance cache tier.
> > > > >     *  Intermediate Data Compression. This case is about applying
> > > > > compression
> > > > > for intermediate data like system journals, caches etc. The intention
> > > > > is
> > > > > to
> > > > > improve expensive storage resource  utilization (e.g. solid state
> > > > > drives
> > > > > or
> > > > > RAM ). At the same time the idea to apply compression ( feature that
> > > > > undoubtedly introduces additional overhead ) to the crucial heavy-duty
> > > > > components probably looks contradictory.
> > > > >     *  Exchange Data ?ompression. This one to be applied to messages
> > > > > transported
> > > > > between client and storage cluster components as well as internal
> > > > > cluster
> > > > > traffic. The rationale for that might be the desire to improve cluster
> > > > > run-time characteristics, e.g. limited data bandwidth caused by the
> > > > > network or
> > > > > storage devices throughput. The potential drawback is client
> > > > > overburdening
> > > > > -
> > > > > client computation resources might become a bottleneck since they take
> > > > > most of
> > > > > compression/decompression tasks.
> > > > > 
> > > > > Obviously it would be great to have support for all the above cases,
> > > > > e.g.
> > > > > object compression takes place at the client and cluster components
> > > > > handle
> > > > > that naturally during the object life-cycle. Unfortunately significant
> > > > > complexities arise on this way. Most of them are related to partial
> > > > > object
> > > > > access, both reading and writing. It looks like huge development (
> > > > > redesigning, refactoring and new code development ) and testing
> > > > > efforts
> > > > > are
> > > > > required on this way. It?s hard to estimate the value of such
> > > > > aggregated
> > > > > support at the current moment too.
> > > > > Thus the approach I?m suggesting is to drive the progress eventually
> > > > > and
> > > > > consider cases separately. At the moment my proposal is to add
> > > > > Data-At-Rest
> > > > > compression to Erasure Coded pools as the most definite one from both
> > > > > implementation and value points of view.
> > > > > 
> > > > > How we can do that.
> > > > > 
> > > > > Ceph Cluster Architecture suggests two-tier storage model for
> > > > > production
> > > > > usage. Cache tier built on high-performance expensive storage devices
> > > > > provides
> > > > > performance. Storage tier with low-cost less-efficient devices
> > > > > provides
> > > > > cost-effectiveness and capacity. Cache tier is supposed to use
> > > > > ordinary
> > > > > data
> > > > > replication while storage one can use erasure coding (EC) for
> > > > > effective
> > > > > and
> > > > > reliable data keeping. EC provides less store costs with the same
> > > > > reliability
> > > > > comparing to data replication approach at the expenses of additional
> > > > > computations. Thus Ceph already has some trade off between capacity
> > > > > and
> > > > > computation efforts. Actually Data-At-Rest compression is exactly
> > > > > about
> > > > > the
> > > > > same. Moreover one can tie EC and Data-At-Rest compression together to
> > > > > achieve
> > > > > even better storage effectiveness.
> > > > > There are two possible ways on adding Data-At-Rest compression:
> > > > >     *  Use data compression built into a file system beyond the Ceph.
> > > > >     *  Add compression to Ceph OSD.
> > > > > 
> > > > > At first glance Option 1. looks pretty attractive but there are some
> > > > > drawbacks
> > > > > for this approach. Here they are:
> > > > >     *  File System lock-in. BTRFS is the only file system supporting
> > > > > transparent
> > > > > compression among ones recommended for Ceph usage.
> > > > > Moreover
> > > > > AFAIK it?s still not recommended for production usage, see:
> > > > > http://ceph.com/docs/master/rados/configuration/filesystem-recommendations/
> > > > >      *  Limited flexibility - one can use compression methods and
> > > > > policies
> > > > > supported by FS only.
> > > > >      *  Data compression depends on volume or mount point properties
> > > > > (and
> > > > > is
> > > > > bound to OSD). Without additional support Ceph lacks the ability to
> > > > > have
> > > > > different compression policies for different pools residing at the
> > > > > same
> > > > > OSD.
> > > > >      *  File Compression Control isn?t standardized among file
> > > > > systems. If
> > > > > (or
> > > > > when) new compression-equipped File System appears Ceph might require
> > > > > corresponding changes to handle that properly.
> > > > > 
> > > > > Having compression at OSD helps to eliminate these drawbacks.
> > > > > As mentioned above Data-At-Rest compression purposes are pretty the
> > > > > same
> > > > > as
> > > > > for Erasure Coding. It looks quite easy to add compression support to
> > > > > EC
> > > > > pools. This way one can have even more storage space for higher CPU
> > > > > load.
> > > > > Additional Pros for combining compression and erasure coding are:
> > > > >     *  Both EC and compression have complexities in partial writing.
> > > > > EC
> > > > > pools
> > > > > don?t have partial write support (data append only) and the solution
> > > > > for
> > > > > that
> > > > > is cache tier insertion.  Thus we can transparently reuse the same
> > > > > approach in
> > > > > case of compression.
> > > > >     *  Compression becomes a pool property thus Ceph users will have
> > > > > direct
> > > > > control what pools to apply compression with.
> > > > >     *  Original write performance isn?t impacted by the compression
> > > > > for
> > > > > two-tier
> > > > > model - write data goes to the cache uncompressed and there is no
> > > > > corresponding compression latency. Actual compression happens in
> > > > > background
> > > > > when backing storage filling takes place.
> > > > >     *  There is an additional benefit in network bandwidth saving when
> > > > > primary
> > > > > OSD performs a compression as resulting object shards for replication
> > > > > are
> > > > > less.
> > > > >     *  Data-at-rest compression can also bring an additional
> > > > > performance
> > > > > improvement for HDD-based storage. Reducing the amount of data written
> > > > > to
> > > > > slow
> > > > > media can provide a net performance improvement even taking into
> > > > > account
> > > > > the
> > > > > compression overhead.
> > > > I think this approach makes a lot of sense.  The tricky bit will be
> > > > storing the additional metadata that maps logical offsets to compressed
> > > > offsets.
> > > > 
> > > > > Some implementation notes:
> > > > > 
> > > > > The suggested approach is to perform data compression prior to Erasure
> > > > > Coding
> > > > > to reduce data portion passed to coding and avoid the need to
> > > > > introduce
> > > > > additional means to disable EC-generated chunks compression.
> > > > At first glance, the compress-before-ec approach sounds attractive: the
> > > > complex EC striping stuff doesn't need to change, and we just need to
> > > > map
> > > > logical offsets to compressed offsets before doing the EC
> > > > read/reconstruct
> > > > as we normally would.  The problem is with appends: the EC stripe size
> > > > is exposed to the user and they write in those increments.  So if we
> > > > compress before we pass it to EC, then we need to have variable stripe
> > > > sizes for each write (depending on how well it compressed).  The upshot
> > > > here is that if we end up support variable EC stripe sizes we *could*
> > > > allow librados appends of any size (not just the stripe size as we
> > > > currently do).  I'm not sure how important/useful that is...
> > > > 
> > > > On the other hand, ec-before-compression still means we need to map
> > > > coded
> > > > stripe offsets to compressed offsets.. and you're right that it puts a
> > > > bit
> > > > more data through the EC transform.
> > > > 
> > > > Either way, it will be a reasonably complex change.
> > > > 
> > > > > Data-At-Rest compression should support plugin architecture to enable
> > > > > multiple
> > > > > compression backends.
> > > > Haomai has started some simple compression infrastructure to support
> > > > compression over the wire; see
> > > > 
> > > > 	https://github.com/ceph/ceph/pull/5116
> > > > 
> > > > We should reuse or extend the plugin interface there to cover both
> > > > users.
> > > > 
> > > > > Compression engine should mark stored objects with some tags to
> > > > > indicate
> > > > > if
> > > > > compression took place and what algorithm was used.
> > > > > To avoid (reduce) backing storage CPU overload caused by
> > > > > compression/decompression ( e.g. this can happen during massive reads
> > > > > ) we
> > > > > can
> > > > > introduce additional means to detect such situations and temporary
> > > > > disable
> > > > > compression for current write requests. Since there is way to mark
> > > > > objects
> > > > > as
> > > > > compressed/uncompressed this produces almost no issues for future
> > > > > handling.
> > > > > Hardware compression support usage, e.g. Intel QuickAssist can be an
> > > > > additional helper for this issue.
> > > > Great to see this moving forward!
> > > > sage
> > > > 

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: Adding Data-At-Rest compression support to Ceph
  2015-09-23 14:05       ` Gregory Farnum
@ 2015-09-23 15:26         ` Igor Fedotov
  2015-09-23 17:31           ` Samuel Just
  2015-09-23 18:03           ` Gregory Farnum
  0 siblings, 2 replies; 26+ messages in thread
From: Igor Fedotov @ 2015-09-23 15:26 UTC (permalink / raw)
  To: Gregory Farnum, Sage Weil; +Cc: ceph-devel



On 23.09.2015 17:05, Gregory Farnum wrote:
> On Wed, Sep 23, 2015 at 6:15 AM, Sage Weil <sage@newdream.net> wrote:
>> On Wed, 23 Sep 2015, Igor Fedotov wrote:
>>> Hi Sage,
>>> thanks a lot for your feedback.
>>>
>>> Regarding issues with offset mapping and stripe size exposure.
>>> What's about the idea to apply compression in two-tier (cache+backing storage)
>>> model only ?
>> I'm not sure we win anything by making it a two-tier only thing... simply
>> making it a feature of the EC pool means we can also address EC pool users
>> like radosgw.
>>
>>> I doubt single-tier one is widely used for EC pools since there is no random
>>> write support in such mode. Thus this might be an acceptable limitation.
>>> At the same time it seems that appends caused by cached object flush have
>>> fixed block size (8Mb by default). And object is totally rewritten on the next
>>> flush if any. This makes offset mapping less tricky.
>>> Decompression should be applied in any model though as cache tier shutdown and
>>> subsequent compressed data access is possibly  a valid use case.
>> Yeah, we need to handle random reads either way, so I think the offset
>> mapping is going to be needed anyway.
> The idea of making the primary responsible for object compression
> really concerns me. It means for instance that a single random access
> will likely require access to multiple objects, and breaks many of the
> optimizations we have right now or in the pipeline (for instance:
> direct client access).
Could you please elaborate on why a single random access would require
access to multiple objects?
In my opinion we need to access exactly the same object set as before:
in an EC pool each appended block is split into multiple shards that go
to their respective OSDs. In the general case one has to retrieve a set
of adjacent shards from several OSDs for a single read request. With
compression the only difference is the data range that the compressed
shard set occupies, i.e. we simply need to translate the requested data
range to the one actually stored and retrieve that data from the OSDs.
What am I missing?
> And apparently only the EC pool will support
> compression, which is frustrating for all the replicated pool users
> out there...
In my opinion replicated pool users should consider EC pools first if
they care about space savings. They automatically gain 50% space savings
that way. Compression brings even more savings, but that's rather the
second step along this path.
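The 50% figure depends on the replica count and the EC profile, neither
of which is stated here. A minimal sketch of the arithmetic, assuming 3x
replication and a k=2, m=1 erasure-code profile (both are illustrative
assumptions, and the function names are made up for this example):

```python
def replicated_overhead(copies: int) -> float:
    """Raw bytes stored per logical byte in a replicated pool."""
    return float(copies)


def ec_overhead(k: int, m: int) -> float:
    """Raw bytes stored per logical byte with k data and m coding chunks."""
    return (k + m) / k


def space_saving(before: float, after: float) -> float:
    """Fraction of raw capacity saved when moving from `before` to `after`."""
    return 1.0 - after / before


# Assumed profiles, for illustration only.
rep3 = replicated_overhead(3)      # 3.0 raw bytes per logical byte
ec21 = ec_overhead(2, 1)           # 1.5 raw bytes per logical byte
print(f"3x replication -> EC 2+1: {space_saving(rep3, ec21):.0%} saved")
print(f"2x replication -> EC 2+1: {space_saving(2.0, ec21):.0%} saved")
```

With 2x replication as the starting point the same EC profile saves
less, so the exact number is tied to the pool configuration.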
> Is there some reason we don't just want to apply encryption across an
> OSD store? Perhaps doing it on the filesystem level is the wrong way
> (for reasons named above) but there are other mechanisms like inline
> block device compression that I think are supposed to work pretty
> well.
If I understand the idea of inline block device compression correctly,
it has some drawbacks similar to the FS compression approach. The ones
worth mentioning:
* Less flexibility - per-device compression only, no way to have
per-pool compression. No control over the compression process.
* Potentially higher overhead during operation - there is no way to
bypass processing of non-compressible data, e.g. shards with erasure
codes.
* Potentially higher overhead for recovery on OSD death - one needs to
decompress data at the working OSDs and compress it at the new OSD.
That's not necessary if compression takes place prior to EC, though.
> The only thing that doesn't get us that I can see mentioned here
> is the over-the-wire compression — and Haomai already has patches for
> that, which should be a lot easier to validate and will work at all
> levels of the stack!
> -Greg


^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: Adding Data-At-Rest compression support to Ceph
  2015-09-23 15:26         ` Igor Fedotov
@ 2015-09-23 17:31           ` Samuel Just
  2015-09-24 15:34             ` Igor Fedotov
  2015-09-23 18:03           ` Gregory Farnum
  1 sibling, 1 reply; 26+ messages in thread
From: Samuel Just @ 2015-09-23 17:31 UTC (permalink / raw)
  To: Igor Fedotov; +Cc: Gregory Farnum, Sage Weil, ceph-devel

I think before moving forward with any sort of implementation, the
design would need to be pretty much completely mapped out --
particularly how the offset mapping will be handled and stored.  The
right thing to do would be to produce a blueprint and submit it to the
list.  I also would vastly prefer to do it on the client side if
possible.  Certainly, radosgw could do the compression just as easily
as the osds (except for the load on the radosgw heads, I suppose).
-Sam


^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: Adding Data-At-Rest compression support to Ceph
  2015-09-23 15:26         ` Igor Fedotov
  2015-09-23 17:31           ` Samuel Just
@ 2015-09-23 18:03           ` Gregory Farnum
  2015-09-24 15:13             ` Igor Fedotov
  1 sibling, 1 reply; 26+ messages in thread
From: Gregory Farnum @ 2015-09-23 18:03 UTC (permalink / raw)
  To: Igor Fedotov; +Cc: Sage Weil, ceph-devel

On Wed, Sep 23, 2015 at 8:26 AM, Igor Fedotov <ifedotov@mirantis.com> wrote:
>
>
> On 23.09.2015 17:05, Gregory Farnum wrote:
>>
>> On Wed, Sep 23, 2015 at 6:15 AM, Sage Weil <sage@newdream.net> wrote:
>>>
>>> On Wed, 23 Sep 2015, Igor Fedotov wrote:
>>>>
>>>> Hi Sage,
>>>> thanks a lot for your feedback.
>>>>
>>>> Regarding issues with offset mapping and stripe size exposure.
>>>> What's about the idea to apply compression in two-tier (cache+backing
>>>> storage)
>>>> model only ?
>>>
>>> I'm not sure we win anything by making it a two-tier only thing... simply
>>> making it a feature of the EC pool means we can also address EC pool
>>> users
>>> like radosgw.
>>>
>>>> I doubt single-tier one is widely used for EC pools since there is no
>>>> random
>>>> write support in such mode. Thus this might be an acceptable limitation.
>>>> At the same time it seems that appends caused by cached object flush
>>>> have
>>>> fixed block size (8Mb by default). And object is totally rewritten on
>>>> the next
>>>> flush if any. This makes offset mapping less tricky.
>>>> Decompression should be applied in any model though as cache tier
>>>> shutdown and
>>>> subsequent compressed data access is possibly  a valid use case.
>>>
>>> Yeah, we need to handle random reads either way, so I think the offset
>>> mapping is going to be needed anyway.
>>
>> The idea of making the primary responsible for object compression
>> really concerns me. It means for instance that a single random access
>> will likely require access to multiple objects, and breaks many of the
>> optimizations we have right now or in the pipeline (for instance:
>> direct client access).
>
> Could you please elaborate why multiple objects access is required on single
> random access?

It sounds to me like you were planning to take an incoming object
write, compress it, and then chunk it. If you do that, the symbols
("abcdefgh = a", "ijklmnop = b", etc) for the compression are likely
to reside in the first object and need to be fetched for each read in
other objects.
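
A small sketch of that dependency, using Python's zlib purely as a
stand-in compressor (the payload and the two-way split are hypothetical;
nothing here reflects actual Ceph code): a stream that is compressed
first and chunked afterwards cannot be decoded from a later chunk alone.

```python
import zlib

# Hypothetical payload standing in for one large object write.
payload = b"".join(b"record %06d: the quick brown fox\n" % i for i in range(20000))

# Compress first, then chunk the compressed stream into two "objects".
stream = zlib.compress(payload)
half = len(stream) // 2
first_obj, second_obj = stream[:half], stream[half:]


def is_self_contained(chunk: bytes) -> bool:
    """True if the chunk is a complete zlib stream that decodes on its own."""
    try:
        zlib.decompress(chunk)
        return True
    except zlib.error:
        return False


# Neither chunk is a complete stream by itself.
print(is_self_contained(first_obj), is_self_contained(second_obj))

# A streaming decoder can make progress on the first chunk alone...
decoder = zlib.decompressobj()
head = decoder.decompress(first_obj)
# ...but the second chunk only decodes after the decoder has seen the first.
tail = decoder.decompress(second_obj) + decoder.flush()
print(head + tail == payload)
```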

> In my opinion we need to access absolutely the same object set as before: in
> EC pool each appended block is split into multiple shards that go to
> respective OSDs. In general case one has to retrieve a set of adjacent
> shards from several OSDs on single read request.

Usually we just need to get the object info from the primary and then
read whichever object has the data for the requested region. If the
region spans a stripe boundary we might need to get two, but often we
don't...
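
As a rough illustration of why a small read often touches a single
shard, here is a sketch that assumes a simple round-robin layout of
fixed-size stripe units over k data shards. The layout, the function
name and the parameters are assumptions for the example, not a
description of the actual EC backend:

```python
def shards_for_range(offset: int, length: int, stripe_unit: int, k: int) -> list:
    """Return the data-shard indices touched by logical [offset, offset + length)."""
    if length <= 0:
        return []
    first_unit = offset // stripe_unit
    last_unit = (offset + length - 1) // stripe_unit
    if last_unit - first_unit + 1 >= k:
        return list(range(k))          # the range wraps around every shard
    return sorted({unit % k for unit in range(first_unit, last_unit + 1)})


# Assumed geometry: 4 KiB stripe units over 4 data shards.
print(shards_for_range(0, 100, 4096, 4))        # small read inside one unit
print(shards_for_range(4000, 200, 4096, 4))     # read crossing a unit boundary
```

Under this assumed layout a read that stays inside one stripe unit
resolves to one shard, and only a read that crosses a unit boundary
needs a second one.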

> In case of compression the
> only difference is in data range that compressed shard set occupy. I.e. we
> simply need to translate requested data range to the actually stored one and
> retrieve that data from OSDs. What's missed?
>>
>> And apparently only the EC pool will support
>> compression, which is frustrating for all the replicated pool users
>> out there...
>
> In my opinion  replicated pool users should consider EC pool usage first if
> they care about space saving. They automatically gain 50% space saving this
> way. Compression brings even more saving but that's rather the second step
> on this way.

EC pools have important limitations that replicated pools don't, like
not working for object classes or allowing random overwrites. You can
stick a replicated cache pool in front but that comes with another
whole can of worms. Anybody with a large enough proportion of active
data won't find that solution suitable but might still want to reduce
space required where they can, like with local compression.

>> Is there some reason we don't just want to apply encryption across an
>> OSD store? Perhaps doing it on the filesystem level is the wrong way
>> (for reasons named above) but there are other mechanisms like inline
>> block device compression that I think are supposed to work pretty
>> well.
>
> If I understand the idea of inline block device compression correctly it has
> some of drawbacks similar to FS compression approach. Ones to mention:
> * Less flexibility - per device compression only, no way to have per-pool
> compression. No control on the compression process.

What would the use case be here? I can imagine not wanting to slow
down your cache pools with it or something (although realistically I
don't think that's a concern unless the sheer CPU usage is a problem
with frequent writes), but those would be on separate OSDs/volumes
anyway.
Plus block device compression is also able to include all the *other*
stuff that doesn't fit inside the object proper (xattrs and omap).

> * Potentially higher overhead when operating- There is no way to bypass
> non-compressible data processing, e.g. shards with Erasure codes.

My information theory intuition has never been very good, but I don't
think the coded chunks are any less compressible than the data they're
coding for, in general...

> * Potentially higher overhead for recovery on OSD death - one needs to
> decompress data at working OSDs and compress it at new OSD. That's not
> necessary if compression takes place prior to EC though.

Hmm, that is an interesting point. I guess I'm just not sure about the
labor and validation tradeoffs involved in obtaining this (it really
seems like the only advantage to me).


...I should note that I'm under the impression that transparent
compression already exists at some level which can be stacked with
regular filesystems, but I'm not finding it now, so maybe I'm
misinformed and the tradeoffs are a little different than I thought.
But I still don't like the idea of doing it on a primary just for EC
pools – I think if we were going to take that approach it'd be easier
to compress somewhere before it reaches the EC/replicated split?
-Greg

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: Adding Data-At-Rest compression support to Ceph
  2015-09-23 18:03           ` Gregory Farnum
@ 2015-09-24 15:13             ` Igor Fedotov
  2015-09-24 15:34               ` Sage Weil
  2015-09-24 18:10               ` Gregory Farnum
  0 siblings, 2 replies; 26+ messages in thread
From: Igor Fedotov @ 2015-09-24 15:13 UTC (permalink / raw)
  To: Gregory Farnum; +Cc: Sage Weil, ceph-devel

On 23.09.2015 21:03, Gregory Farnum wrote:
> On Wed, Sep 23, 2015 at 6:15 AM, Sage Weil <sage@newdream.net> wrote:
>>>>
>>>> The idea of making the primary responsible for object compression
>>>> really concerns me. It means for instance that a single random access
>>>> will likely require access to multiple objects, and breaks many of the
>>>> optimizations we have right now or in the pipeline (for instance:
>>>> direct client access).
>> Could you please elaborate why multiple objects access is required on single
>> random access?
> It sounds to me like you were planning to take an incoming object
> write, compress it, and then chunk it. If you do that, the symbols
> ("abcdefgh = a", "ijklmnop = b", etc) for the compression are likely
> to reside in the first object and need to be fetched for each read in
> other objects.
Gregory,
by the symbols "abcdefgh = a", etc. do you mean a kind of compressor
dictionary?
And your assumption is that such a dictionary is built on the first
write, saved, and reused by any subsequent reads, right?
I think that's not the case - it's better to compress each write
independently. Thus there is no need to access a "dictionary" object
(i.e. the first object with these symbols) on every read operation. A
read uses the compressed block data only.
Yes, this might reduce the overall compression ratio, but I think that's
acceptable.
>> In my opinion we need to access absolutely the same object set as before: in
>> EC pool each appended block is split into multiple shards that go to
>> respective OSDs. In general case one has to retrieve a set of adjacent
>> shards from several OSDs on single read request.
> Usually we just need to get the object info from the primary and then
> read whichever object has the data for the requested region. If the
> region spans a stripe boundary we might need to get two, but often we
> don't...
With the independent block compression mentioned above the scenario is
the same. The only thing we need in order to find the proper compressed
block is a mapping from original data offsets to compressed ones. We can
store this as object metadata. Thus the only extra thing each read needs
is the object metadata.
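A minimal sketch of this idea, assuming zlib as the compressor and a
plain Python list standing in for the per-object metadata (all names
below are hypothetical, not existing Ceph interfaces). Each appended
block is compressed on its own, and the index maps logical offsets to
the stored, compressed extents:

```python
import bisect
import zlib

# Each index entry: (logical_offset, logical_len, stored_offset, stored_len).
# In the scheme discussed here this table would live in object metadata;
# a plain list is used as a stand-in.


def append_block(store: bytearray, index: list, data: bytes) -> None:
    """Compress one appended block independently and record its location."""
    logical_off = index[-1][0] + index[-1][1] if index else 0
    compressed = zlib.compress(data)
    index.append((logical_off, len(data), len(store), len(compressed)))
    store.extend(compressed)


def read_range(store: bytearray, index: list, offset: int, length: int) -> bytes:
    """Return logical bytes [offset, offset + length), decoding only covering blocks."""
    if not index or length <= 0:
        return b""
    end = offset + length
    starts = [entry[0] for entry in index]
    i = max(bisect.bisect_right(starts, offset) - 1, 0)
    out = bytearray()
    while i < len(index) and index[i][0] < end:
        logical_off, logical_len, stored_off, stored_len = index[i]
        block = zlib.decompress(bytes(store[stored_off:stored_off + stored_len]))
        out += block[max(offset - logical_off, 0):min(end - logical_off, logical_len)]
        i += 1
    return bytes(out)


# Three appends of different compressibility, then a read across a block boundary.
store, index = bytearray(), []
for block in (b"A" * 65536, bytes(range(256)) * 256, b"xyz" * 1000):
    append_block(store, index, block)
print(read_range(store, index, 65530, 20))
```

A read consults only the index and the compressed blocks covering the
requested range; no block depends on state from an earlier one.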
>> In case of compression the
>> only difference is in data range that compressed shard set occupy. I.e. we
>> simply need to translate requested data range to the actually stored one and
>> retrieve that data from OSDs. What's missed?
>>> And apparently only the EC pool will support
>>> compression, which is frustrating for all the replicated pool users
>>> out there...
>> In my opinion  replicated pool users should consider EC pool usage first if
>> they care about space saving. They automatically gain 50% space saving this
>> way. Compression brings even more saving but that's rather the second step
>> on this way.
> EC pools have important limitations that replicated pools don't, like
> not working for object classes or allowing random overwrites. You can
> stick a replicated cache pool in front but that comes with another
> whole can of worms. Anybody with a large enough proportion of active
> data won't find that solution suitable but might still want to reduce
> space required where they can, like with local compression.
Well, I agree that having compression support for both replicated and EC
pools is better.
But random access (and probably other advanced features) requires much
more complex data handling, which also brings additional overhead.
Actually I suppose EC pools have such limitations for these very
reasons. Thus my original idea was to simplify the compression
implementation on the one hand and keep it in line with EC usage on the
other. The latter makes sense since compression and EC are implemented
for pretty much the same reasons.

And just for the sake of my education, could you please mention or point
me to the existing issues with cache+EC pool usage?
How widely are EC pools used in production at all? Or is that rather an
experimental/secondary option?
>>> Is there some reason we don't just want to apply encryption across an
>>> OSD store? Perhaps doing it on the filesystem level is the wrong way
>>> (for reasons named above) but there are other mechanisms like inline
>>> block device compression that I think are supposed to work pretty
>>> well.
>> If I understand the idea of inline block device compression correctly it has
>> some of drawbacks similar to FS compression approach. Ones to mention:
>> * Less flexibility - per device compression only, no way to have per-pool
>> compression. No control on the compression process.
> What would the use case be here? I can imagine not wanting to slow
> down your cache pools with it or something (although realistically I
> don't think that's a concern unless the sheer CPU usage is a problem
> with frequent writes), but those would be on separate OSDs/volumes
> anyway
Well, I can imagine the need to enable compression for some specific
backing pools (e.g. with seldom accessed or highly compressible data)
and disable it for others, e.g. where the original data is
non-compressible (either already compressed or encrypted).
Potentially we could even have an option to control compression on a
per-object basis and provide hints so that clients can enable it for
specific use cases.
Another feature that might be useful is the ability to disable/re-enable
compression during the OSD life-cycle, e.g. when an administrator
realizes that it's not appropriate for their use case. I doubt that's
easy to do when compression is performed at the device level.

> Plus block device compression is also able to include all the *other*
> stuff that doesn't fit inside the object proper (xattrs and omap).
Yes, that's a good point, but I suppose nothing prevents us from
compressing metadata ourselves too.
>> * Potentially higher overhead when operating- There is no way to bypass
>> non-compressible data processing, e.g. shards with Erasure codes.
> My information theory intuition has never been very good, but I don't
> think the coded chunks are any less compressible than the data they're
> coding for, in general...
Yes, my bad. I played with EC a bit - the generated chunks are pretty
regular. I had expected something completely random, like encrypted data.

> ...I should note that I'm under the impression that transparent
> compression already exists at some level which can be stacked with
> regular filesystems, but I'm not finding it now, so maybe I'm
> misinformed and the tradeoffs are a little different than I thought.
I found some mentions of an RBD device that performs inline compression.
But the rather limited information available on the Net makes me think
that this solution is far from production use.

> But I still don't like the idea of doing it on a primary just for EC
> pools – I think if we were going to take that approach it'd be easier
> to compress somewhere before it reaches the EC/replicated split?
As I mentioned above, the main reasons that pushed me to combine
compression with EC pools are their similar handling issues and the
similar trade-off they offer (space for CPU).
Moving compression to any other place raises many complications.

Anyway, I will try to summarize the suggested approaches and their pros
and cons.

Thanks,
Igor

PS. Gregory, I highly appreciate your feedback.


^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: Adding Data-At-Rest compression support to Ceph
  2015-09-23 17:31           ` Samuel Just
@ 2015-09-24 15:34             ` Igor Fedotov
  0 siblings, 0 replies; 26+ messages in thread
From: Igor Fedotov @ 2015-09-24 15:34 UTC (permalink / raw)
  To: Samuel Just; +Cc: Gregory Farnum, Sage Weil, ceph-devel

Samuel,
I completely agree about the need to have a blueprint before the
implementation. But I think we should first settle on which approach to
use (when and how to perform the compression).
I'll summarize the existing suggestions and their pros and cons shortly,
so that we can discuss them more productively.

Regarding performing the compression on the client side - I'm afraid
it's not that easy, given that we have multiple clients with different
usage patterns and random data access.

Thanks,
Igor.


^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: Adding Data-At-Rest compression support to Ceph
  2015-09-24 15:13             ` Igor Fedotov
@ 2015-09-24 15:34               ` Sage Weil
  2015-09-24 15:41                 ` HEWLETT, Paul (Paul)
  2015-09-24 15:56                 ` Igor Fedotov
  2015-09-24 18:10               ` Gregory Farnum
  1 sibling, 2 replies; 26+ messages in thread
From: Sage Weil @ 2015-09-24 15:34 UTC (permalink / raw)
  To: Igor Fedotov; +Cc: Gregory Farnum, ceph-devel

On Thu, 24 Sep 2015, Igor Fedotov wrote:
> On 23.09.2015 21:03, Gregory Farnum wrote:
> > On Wed, Sep 23, 2015 at 6:15 AM, Sage Weil <sage@newdream.net> wrote:
> > > > > 
> > > > > The idea of making the primary responsible for object compression
> > > > > really concerns me. It means for instance that a single random access
> > > > > will likely require access to multiple objects, and breaks many of the
> > > > > optimizations we have right now or in the pipeline (for instance:
> > > > > direct client access).
> > > Could you please elaborate why multiple objects access is required on
> > > single
> > > random access?
> > It sounds to me like you were planning to take an incoming object
> > write, compress it, and then chunk it. If you do that, the symbols
> > ("abcdefgh = a", "ijklmnop = b", etc) for the compression are likely
> > to reside in the first object and need to be fetched for each read in
> > other objects.
> Gregory,
> do you mean a kind of compressor dictionary by the symbols "abcdefgh = a", etc.
> here?
> And your assumption is that such a dictionary is made on the first write, saved
> and reused by any subsequent reads, right?
> I think that's not the case - it's better to compress each write
> independently.  Thus there is no need to access a "dictionary" object (i.e. the
> first object with these symbols) on every read operation. The latter uses
> compressed block data only.
> Yes, this might affect the total compression ratio, but I think that's acceptable.

I was also assuming each stripe unit would be independently compressed, 
but I didn't think about the efficiency.  This approach implies that 
you'd want a relatively large stripe size (100s of KB or more).

Hmm, a quick google search suggests the zlib compression window is only 
32KB anyway, which isn't so big.  The more aggressive algorithms probably 
aren't what people would reach for anyway for CPU utilization reasons... I 
guess?
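
To make the trade-off concrete, here is a rough Python sketch (not Ceph 
code; the payload and the unit sizes are made-up assumptions) that 
compresses the same data with zlib as one stream and as independently 
compressed units of two different sizes:

```python
import zlib
from typing import List

def compress_units(data: bytes, unit_size: int) -> List[bytes]:
    """Compress each unit_size slice of data as its own zlib stream."""
    return [zlib.compress(data[off:off + unit_size])
            for off in range(0, len(data), unit_size)]

# Hypothetical payload: repetitive text, 1 MiB in total.
payload = (b"ceph object data " * 64)[:1024] * 1024

whole = zlib.compress(payload)
small_units = compress_units(payload, 4 * 1024)      # 4 KiB units
large_units = compress_units(payload, 256 * 1024)    # 256 KiB units

small_total = sum(len(u) for u in small_units)
large_total = sum(len(u) for u in large_units)
print("raw:", len(payload), "whole:", len(whole),
      "256 KiB units:", large_total, "4 KiB units:", small_total)

# Any unit can be decompressed without touching the others.
assert zlib.decompress(small_units[3]) == payload[3 * 4096:4 * 4096]
```

Every unit pays for its own stream header, checksum and cold start, so 
small units cost ratio; how much depends heavily on the data.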

sage

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: Adding Data-At-Rest compression support to Ceph
  2015-09-24 15:34               ` Sage Weil
@ 2015-09-24 15:41                 ` HEWLETT, Paul (Paul)
  2015-09-24 16:00                   ` Igor Fedotov
  2015-09-24 15:56                 ` Igor Fedotov
  1 sibling, 1 reply; 26+ messages in thread
From: HEWLETT, Paul (Paul) @ 2015-09-24 15:41 UTC (permalink / raw)
  To: ceph-devel

Out of curiosity have you considered the Google compression algos:

http://google-opensource.blogspot.co.uk/2015/09/introducing-brotli-new-compression.html


Paul

On 24/09/2015 16:34, "ceph-devel-owner@vger.kernel.org on behalf of Sage
Weil" <ceph-devel-owner@vger.kernel.org on behalf of sweil@redhat.com>
wrote:

>On Thu, 24 Sep 2015, Igor Fedotov wrote:
>> On 23.09.2015 21:03, Gregory Farnum wrote:
>> > On Wed, Sep 23, 2015 at 6:15 AM, Sage Weil <sage@newdream.net> wrote:
>> > > > > 
>> > > > > The idea of making the primary responsible for object compression
>> > > > > really concerns me. It means for instance that a single random access
>> > > > > will likely require access to multiple objects, and breaks many of the
>> > > > > optimizations we have right now or in the pipeline (for instance:
>> > > > > direct client access).
>> > > Could you please elaborate why access to multiple objects is required
>> > > on a single random access?
>> > It sounds to me like you were planning to take an incoming object
>> > write, compress it, and then chunk it. If you do that, the symbols
>> > ("abcdefgh = a", "ijklmnop = b", etc) for the compression are likely
>> > to reside in the first object and need to be fetched for each read in
>> > other objects.
>> Gregory,
>> do you mean a kind of compressor dictionary by the symbols "abcdefgh = a",
>> etc. here?
>> And your assumption is that such a dictionary is made on the first write,
>> saved and reused by any subsequent reads, right?
>> I think that's not the case - it's better to compress each write
>> independently.  Thus there is no need to access a "dictionary" object
>> (i.e. the first object with these symbols) on every read operation. The
>> latter uses compressed block data only.
>> Yes, this might affect the total compression ratio, but I think that's
>> acceptable.
>
>I was also assuming each stripe unit would be independently compressed,
>but I didn't think about the efficiency.  This approach implies that
>you'd want a relatively large stripe size (100s of KB or more).
>
>Hmm, a quick google search suggests the zlib compression window is only
>32KB anyway, which isn't so big.  The more aggressive algorithms probably
>aren't what people would reach for anyway for CPU utilization reasons... I
>guess?
>
>sage


^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: Adding Data-At-Rest compression support to Ceph
  2015-09-24 15:34               ` Sage Weil
  2015-09-24 15:41                 ` HEWLETT, Paul (Paul)
@ 2015-09-24 15:56                 ` Igor Fedotov
  2015-09-24 16:03                   ` Sage Weil
  1 sibling, 1 reply; 26+ messages in thread
From: Igor Fedotov @ 2015-09-24 15:56 UTC (permalink / raw)
  To: Sage Weil; +Cc: Gregory Farnum, ceph-devel

On 24.09.2015 18:34, Sage Weil wrote:
> I was also assuming each stripe unit would be independently 
> compressed, but I didn't think about the efficiency. This approach 
> implies that you'd want a relatively large stripe size (100s of KB or 
> more). Hmm, a quick google search suggests the zlib compression window 
> is only 32KB anyway, which isn't so big. The more aggressive 
> algorithms probably aren't what people would reach for anyway for CPU 
> utilization reasons... I guess? sage 

There is probably no need for strict alignment with the stripe size. We 
can dynamically use the block sizes that the client provides on write. If 
some client writes in stripes, then we compress that block. If others use 
larger blocks (e.g. the caching agent on flush), we can use that size or 
split the provided block into several smaller chunks (e.g. up to a maximum 
of N*stripe_size) to reduce the overhead on random reads. Even if a client 
uses dynamic block sizes (low-level RADOS use?) we can rely on them 
in some way without a static binding to the stripe size.
Surely this is much easier when only appends are permitted.  The general 
"random writes" case will be more complex.

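Roughly, the splitting could look like the following Python sketch. It is 
only an illustration under assumptions: the function name, the 4 KiB stripe 
and the cap of 4 stripes per chunk are hypothetical, and zlib stands in for 
whatever compressor would be plugged in.

```python
import zlib
from typing import List

def compress_write(block: bytes, stripe_size: int, max_stripes: int) -> List[bytes]:
    """Split one client write into chunks of at most max_stripes * stripe_size
    bytes and compress every chunk independently."""
    limit = stripe_size * max_stripes
    return [zlib.compress(block[off:off + limit])
            for off in range(0, len(block), limit)]

# Hypothetical sizes: 4 KiB stripes, chunks capped at 4 stripes (16 KiB).
STRIPE = 4096
small_write = b"a" * STRIPE              # one stripe: compressed as-is
flush_write = b"b" * (10 * STRIPE)       # e.g. a larger cache-flush block

print(len(compress_write(small_write, STRIPE, 4)), "chunk(s) for the small write")
print(len(compress_write(flush_write, STRIPE, 4)), "chunk(s) for the flush write")
```

A random read then has to decompress at most one capped chunk per touched 
region instead of the whole written block.
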
Thanks,
Igor

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: Adding Data-At-Rest compression support to Ceph
  2015-09-24 15:41                 ` HEWLETT, Paul (Paul)
@ 2015-09-24 16:00                   ` Igor Fedotov
  0 siblings, 0 replies; 26+ messages in thread
From: Igor Fedotov @ 2015-09-24 16:00 UTC (permalink / raw)
  To: HEWLETT, Paul (Paul), ceph-devel

As for me, this is the first time I've heard of it.

But if we introduce pluggable compression back-ends that would be pretty 
easy to try.

Thanks,
Igor.

On 24.09.2015 18:41, HEWLETT, Paul (Paul) wrote:
> Out of curiosity have you considered the Google compression algos:
>
> http://google-opensource.blogspot.co.uk/2015/09/introducing-brotli-new-compression.html
>
>
> Paul
>
> On 24/09/2015 16:34, "ceph-devel-owner@vger.kernel.org on behalf of Sage
> Weil" <ceph-devel-owner@vger.kernel.org on behalf of sweil@redhat.com>
> wrote:
>
>> On Thu, 24 Sep 2015, Igor Fedotov wrote:
>>> On 23.09.2015 21:03, Gregory Farnum wrote:
>>>> On Wed, Sep 23, 2015 at 6:15 AM, Sage Weil <sage@newdream.net> wrote:
>>>>>>> The idea of making the primary responsible for object compression
>>>>>>> really concerns me. It means for instance that a single random access
>>>>>>> will likely require access to multiple objects, and breaks many of the
>>>>>>> optimizations we have right now or in the pipeline (for instance:
>>>>>>> direct client access).
>>>>> Could you please elaborate why access to multiple objects is required
>>>>> on a single random access?
>>>> It sounds to me like you were planning to take an incoming object
>>>> write, compress it, and then chunk it. If you do that, the symbols
>>>> ("abcdefgh = a", "ijklmnop = b", etc) for the compression are likely
>>>> to reside in the first object and need to be fetched for each read in
>>>> other objects.
>>> Gregory,
>>> do you mean a kind of compressor dictionary by the symbols "abcdefgh = a",
>>> etc. here?
>>> And your assumption is that such a dictionary is made on the first write,
>>> saved and reused by any subsequent reads, right?
>>> I think that's not the case - it's better to compress each write
>>> independently.  Thus there is no need to access a "dictionary" object
>>> (i.e. the first object with these symbols) on every read operation. The
>>> latter uses compressed block data only.
>>> Yes, this might affect the total compression ratio, but I think that's
>>> acceptable.
>> I was also assuming each stripe unit would be independently compressed,
>> but I didn't think about the efficiency.  This approach implies that
>> you'd want a relatively large stripe size (100s of KB or more).
>>
>> Hmm, a quick google search suggests the zlib compression window is only
>> 32KB anyway, which isn't so big.  The more aggressive algorithms probably
>> aren't what people would reach for anyway for CPU utilization reasons... I
>> guess?
>>
>> sage


^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: Adding Data-At-Rest compression support to Ceph
  2015-09-24 15:56                 ` Igor Fedotov
@ 2015-09-24 16:03                   ` Sage Weil
  2015-09-24 16:14                     ` Igor Fedotov
  2015-09-24 16:25                     ` Igor Fedotov
  0 siblings, 2 replies; 26+ messages in thread
From: Sage Weil @ 2015-09-24 16:03 UTC (permalink / raw)
  To: Igor Fedotov; +Cc: Gregory Farnum, ceph-devel

On Thu, 24 Sep 2015, Igor Fedotov wrote:

> On 24.09.2015 18:34, Sage Weil wrote:
> > I was also assuming each stripe unit would be independently compressed, but
> > I didn't think about the efficiency. This approach implies that you'd want a
> > relatively large stripe size (100s of KB or more). Hmm, a quick google
> > search suggests the zlib compression window is only 32KB anyway, which isn't
> > so big. The more aggressive algorithms probably aren't what people would
> > reach for anyway for CPU utilization reasons... I guess? sage 
> 
> There is probably no need in strict alignment with the stripe size. We can use
> block sizes that client provides on write dynamically. If some client writes
> in stripes - then we compress that block. If others use larger blocks ( e.g.
> caching agent on flush) - we can use that size or split the provided block
> into several smaller chunks ( e.g. up to max N*stripe_size ) for overhead
> reduction on random read. Even if client uses dynamic block sizes ( low level
> RADOS use?) we can rely on them some way without static bind to stripe size.
> Surely this is much easier when appends are permitted only.  General "random
> writes" case will be more complex.

Dynamic stripe sizes are possible but it's a significant change from the 
way the EC pool currently works.  I would make that a separate project (as 
it's useful in its own right) and not complicate the compression situation.

Or, if it simplifies the compression approach, then I'd make that change 
first.

sage

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: Adding Data-At-Rest compression support to Ceph
  2015-09-24 16:03                   ` Sage Weil
@ 2015-09-24 16:14                     ` Igor Fedotov
  2015-09-24 16:25                     ` Igor Fedotov
  1 sibling, 0 replies; 26+ messages in thread
From: Igor Fedotov @ 2015-09-24 16:14 UTC (permalink / raw)
  To: Sage Weil; +Cc: Gregory Farnum, ceph-devel



On 24.09.2015 19:03, Sage Weil wrote:
> On Thu, 24 Sep 2015, Igor Fedotov wrote:
>
>>
>> There is probably no need in strict alignment with the stripe size. We can use
>> block sizes that client provides on write dynamically. If some client writes
>> in stripes - then we compress that block. If others use larger blocks ( e.g.
>> caching agent on flush) - we can use that size or split the provided block
>> into several smaller chunks ( e.g. up to max N*stripe_size ) for overhead
>> reduction on random read. Even if client uses dynamic block sizes ( low level
>> RADOS use?) we can rely on them some way without static bind to stripe size.
>> Surely this is much easier when appends are permitted only.  General "random
>> writes" case will be more complex.
> Dynamic stripe sizes are possible but it's a significant change from the
> way the EC pool currently works.  I would make that a separate project (as
> it's useful in its own right) and not complicate the compression situation.
>
> Or, if it simplifies the compression approach, then I'd make that change
> first.
My point was rather about the lack of need to depend on the stripe size for 
compression than about the need for dynamic stripes.
As far as I understand, clients can write data using blocks larger than the 
stripe size, e.g. several stripes together. Is that correct?

At least I could see that for cache flush and low-level RADOS access.
So we can compress every written block independently: if it is of stripe 
size, that's OK, compress it as-is. If it's larger, let's compress 
the whole block or split it into smaller ones and compress them independently.

Thus I think there is no explicit need for additional changes in Ceph 
to do compression.

Thanks,
Igor.
> sage


^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: Adding Data-At-Rest compression support to Ceph
  2015-09-24 16:03                   ` Sage Weil
  2015-09-24 16:14                     ` Igor Fedotov
@ 2015-09-24 16:25                     ` Igor Fedotov
  2015-09-24 17:36                       ` Robert LeBlanc
  1 sibling, 1 reply; 26+ messages in thread
From: Igor Fedotov @ 2015-09-24 16:25 UTC (permalink / raw)
  To: Sage Weil; +Cc: Gregory Farnum, ceph-devel

On 24.09.2015 19:03, Sage Weil wrote:
> On Thu, 24 Sep 2015, Igor Fedotov wrote:
>
> Dynamic stripe sizes are possible but it's a significant change from 
> the way the EC pool currently works. I would make that a separate 
> project (as it's useful in its own right) and not complicate the
> compression situation. Or, if it simplifies the compression approach, 
> then I'd make that change first. sage 
Just to clarify a bit, here is what I saw when I played with Ceph. Please 
correct me if I'm wrong.

For low-level RADOS access, client data written to an EC pool has to be 
aligned with the stripe size. The last block can be unaligned, but no 
more appends are permitted in this case.
Data copied from the cache goes in blocks of up to 8 MB. In the general 
case the last block seems to have an unaligned size too.

The EC pool additionally aligns the incoming blocks to the stripe 
boundary internally. This way blocks going to the EC lib are always aligned.
We should probably perform compression prior to this alignment.
Thus some dependency on the stripe size is present in EC pools, but it's not 
that strict.
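
A minimal Python sketch of "compress first, align afterwards". This is not 
how the EC backend is actually written; the helper names and the 4 KiB 
stripe width are assumptions for illustration only.

```python
import zlib

def pad_to_stripe(data: bytes, stripe_width: int) -> bytes:
    """Zero-pad data up to the next multiple of stripe_width."""
    remainder = len(data) % stripe_width
    if remainder == 0:
        return data
    return data + b"\0" * (stripe_width - remainder)

def compress_then_align(block: bytes, stripe_width: int):
    """Compress first, then align; return the padded payload and the real
    compressed length, which must be kept to strip the padding on read."""
    compressed = zlib.compress(block)
    return pad_to_stripe(compressed, stripe_width), len(compressed)

STRIPE_WIDTH = 4096                      # hypothetical stripe width
block = b"0123456789abcdef" * 4096       # 64 KiB of compressible data
padded, real_len = compress_then_align(block, STRIPE_WIDTH)

assert len(padded) % STRIPE_WIDTH == 0
assert zlib.decompress(padded[:real_len]) == block
```

Note that the real compressed length has to be stored somewhere, otherwise 
the padding cannot be told apart from data on read.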

Thanks,
Igor

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: Adding Data-At-Rest compression support to Ceph
  2015-09-24 16:25                     ` Igor Fedotov
@ 2015-09-24 17:36                       ` Robert LeBlanc
  2015-09-24 17:53                         ` Samuel Just
  0 siblings, 1 reply; 26+ messages in thread
From: Robert LeBlanc @ 2015-09-24 17:36 UTC (permalink / raw)
  To: Igor Fedotov; +Cc: Sage Weil, Gregory Farnum, ceph-devel

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA256

I'm probably missing something, but since we are talking about data at
rest, can't we just have the OSD compress the object as it goes to
disk? Instead of
rbd\udata.1ba49c10d9b00c.0000000000006859__head_2AD1002B__11 it would
be rbd\udata.1ba49c10d9b00c.0000000000006859__head_2AD1002B__11.{gz,xz,bz2,lzo,etc}.
Then it seems that you don't have to muck with stripe sizes or
anything. Compressible objects would be less than 4MB, and some
of these algorithms already detect when data is not compressible enough
and just store it.

Something like zlib Z_FULL_FLUSH may help provide some seek points
within an archive to prevent decompressing the whole object for reads?

- ----------------
Robert LeBlanc
PGP Fingerprint 79A2 9CA4 6CC4 45DD A904  C70E E654 3BB2 FA62 B9F1


On Thu, Sep 24, 2015 at 10:25 AM, Igor Fedotov  wrote:
>
>
> On 24.09.2015 19:03, Sage Weil wrote:
>>
>> On Thu, 24 Sep 2015, Igor Fedotov wrote:
>>
>> Dynamic stripe sizes are possible but it's a significant change from the
>> way the EC pool currently works. I would make that a separate project (as
>> it's useful in its own right) and not complicate the compression situation.
>> Or, if it simplifies the compression approach, then I'd make that change
>> first. sage
>
> Just to clarify a bit. What I saw when played with Ceph. Please correct me
> if I'm wrong..
>
> For low-level RADOS access client data written to EC pool has to be aligned
> with stripe size . The last block can be unaligned though but no more
> appends are permitted in this case.
> Data copied from cache goes in blocks up to 8Mb size. In general case  the
> last block seems to have unaligned size too.
>
> EC pool additionally performs alignment of the incoming blocks to stripe
> bound internally. This way blocks going to EC lib are always aligned.
> We should probably perform compression prior to this alignment.
> Thus some dependency on stripe size is present in EC pools but it's not that
> strict.
>
> Thanks,
> Igor
>

-----BEGIN PGP SIGNATURE-----
Version: Mailvelope v1.1.0
Comment: https://www.mailvelope.com

wsFcBAEBCAAQBQJWBDSDCRDmVDuy+mK58QAAmwwP/3q0tbLZA95RVsvSLrXk
ipuhjiGPvAX8o2kTYFtf5tXkMuiJIJIy+WK1uD6zs+CXM/2JR6SJthS3tE9A
meaFW7W5lropbWKRZ8TkpUNQAXDyRrpSEcTDBWciq+EOca5tlP+17KDevVnZ
PWDCNPlZmbHyBy91iJju4TTzaJYoD8mXU/+4xLCicePDPomlpO4oyndDfOmI
JP5uRDmgP0ecsxfcyoYSTCJylfnBsmK0IMyxZoV2Mx+SEcqgtECPCOY7Uc/4
wwXGhu//zO7twyOvtsk4OQGjLX9wpSpVWz+zcR2RYiYfw3YSTSzGvbBC5hpb
pfQya5DbypJra2oz5BZkikvwYPhxPoI0FcdTCYFFxclm0jMwQqh2b141kN8Z
eR7v8ttfnbACumWP74j2KSpHRm/1l65nN4wqzg3ovoesjoJDvb2miz8AX7ag
FXVa54JpIcoIzCkIkqvpCfzhatGU55yQiyt7aFAhJfpmP/cNpxmAete8buTK
6aFMiYWFJe+md/bLOrk5g/cyr9BUq+tHT7Qf+mRmgw9fuECUXMXMzf6vOUk8
0JnYiYVk0j+twZeuDaVPBrXEMKuYuq7NlILuHJDF3meRPM2xekan8ARZoJxL
XAOzvaEFly0TH5DJfItSVOL86qtp+1orULSrVbtvolxzQtv8xiNOzJYBKEnO
ouVI
=d8mm
-----END PGP SIGNATURE-----

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: Adding Data-At-Rest compression support to Ceph
  2015-09-24 17:36                       ` Robert LeBlanc
@ 2015-09-24 17:53                         ` Samuel Just
  2015-09-25 11:59                           ` Igor Fedotov
  0 siblings, 1 reply; 26+ messages in thread
From: Samuel Just @ 2015-09-24 17:53 UTC (permalink / raw)
  To: Robert LeBlanc; +Cc: Igor Fedotov, Sage Weil, Gregory Farnum, ceph-devel

The catch is that currently accessing 4k in the middle of a 4MB object
does not require reading the whole object, so you'd need some kind of
logical offset -> compressed offset mapping.
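
Such a mapping might look like the Python sketch below. It assumes an 
append-only object whose writes are compressed independently with zlib; the 
class and its fields are hypothetical and not an existing Ceph structure.

```python
import bisect
import zlib

class CompressedObject:
    """Append-only object stored as independently compressed blocks."""

    def __init__(self):
        self.logical_starts = []   # logical offset where each block begins
        self.extents = []          # (offset, length) of each block in `stored`
        self.stored = b""          # concatenated compressed blocks
        self.logical_size = 0

    def append(self, data: bytes) -> None:
        blob = zlib.compress(data)
        self.logical_starts.append(self.logical_size)
        self.extents.append((len(self.stored), len(blob)))
        self.stored += blob
        self.logical_size += len(data)

    def read(self, offset: int, length: int) -> bytes:
        end = min(offset + length, self.logical_size)
        out = b""
        # Find the block that contains `offset`.
        idx = bisect.bisect_right(self.logical_starts, offset) - 1
        while offset < end:
            start, size = self.extents[idx]
            block = zlib.decompress(self.stored[start:start + size])
            skip = offset - self.logical_starts[idx]
            piece = block[skip:skip + (end - offset)]
            out += piece
            offset += len(piece)
            idx += 1
        return out

obj = CompressedObject()
for i in range(64):                         # 64 writes of 64 KiB = 4 MiB
    obj.append(bytes([i]) * 65536)

# A 4 KiB read in the middle decompresses only the block(s) it overlaps.
data = obj.read(2 * 1024 * 1024 + 100, 4096)
```

The index costs one entry per write, and a read still has to decompress 
every block it overlaps in full.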
-Sam

On Thu, Sep 24, 2015 at 10:36 AM, Robert LeBlanc <robert@leblancnet.us> wrote:
> -----BEGIN PGP SIGNED MESSAGE-----
> Hash: SHA256
>
> I'm probably missing something, but since we are talking about data at
> rest, can't we just have the OSD compress the object as it goes to
> disk? Instead of
> rbd\udata.1ba49c10d9b00c.0000000000006859__head_2AD1002B__11 it would
> be rbd\udata.1ba49c10d9b00c.0000000000006859__head_2AD1002B__11.{gz,xz,bz2,lzo,etc}.
> Then it seems that you don't have to muck with stripe sizes or
> anything. For compressible objects they would be less than 4MB, some
> of theses algorithms already say if it is not compressible enough,
> just store it.
>
> Something like zlib Z_FULL_FLUSH may help provide some seek points
> within an archive to prevent decompressing the whole object for reads?
>
> - ----------------
> Robert LeBlanc
> PGP Fingerprint 79A2 9CA4 6CC4 45DD A904  C70E E654 3BB2 FA62 B9F1
>
>
> On Thu, Sep 24, 2015 at 10:25 AM, Igor Fedotov  wrote:
>>
>>
>> On 24.09.2015 19:03, Sage Weil wrote:
>>>
>>> On Thu, 24 Sep 2015, Igor Fedotov wrote:
>>>
>>> Dynamic stripe sizes are possible but it's a significant change from the
>>> way the EC pool currently works. I would make that a separate project (as
>>> it's useful in its own right) and not complicate the compression situation.
>>> Or, if it simplifies the compression approach, then I'd make that change
>>> first. sage
>>
>> Just to clarify a bit. What I saw when played with Ceph. Please correct me
>> if I'm wrong..
>>
>> For low-level RADOS access client data written to EC pool has to be aligned
>> with stripe size . The last block can be unaligned though but no more
>> appends are permitted in this case.
>> Data copied from cache goes in blocks up to 8Mb size. In general case  the
>> last block seems to have unaligned size too.
>>
>> EC pool additionally performs alignment of the incoming blocks to stripe
>> bound internally. This way blocks going to EC lib are always aligned.
>> We should probably perform compression prior to this alignment.
>> Thus some dependency on stripe size is present in EC pools but it's not that
>> strict.
>>
>> Thanks,
>> Igor
>>
>
> -----BEGIN PGP SIGNATURE-----
> Version: Mailvelope v1.1.0
> Comment: https://www.mailvelope.com
>
> wsFcBAEBCAAQBQJWBDSDCRDmVDuy+mK58QAAmwwP/3q0tbLZA95RVsvSLrXk
> ipuhjiGPvAX8o2kTYFtf5tXkMuiJIJIy+WK1uD6zs+CXM/2JR6SJthS3tE9A
> meaFW7W5lropbWKRZ8TkpUNQAXDyRrpSEcTDBWciq+EOca5tlP+17KDevVnZ
> PWDCNPlZmbHyBy91iJju4TTzaJYoD8mXU/+4xLCicePDPomlpO4oyndDfOmI
> JP5uRDmgP0ecsxfcyoYSTCJylfnBsmK0IMyxZoV2Mx+SEcqgtECPCOY7Uc/4
> wwXGhu//zO7twyOvtsk4OQGjLX9wpSpVWz+zcR2RYiYfw3YSTSzGvbBC5hpb
> pfQya5DbypJra2oz5BZkikvwYPhxPoI0FcdTCYFFxclm0jMwQqh2b141kN8Z
> eR7v8ttfnbACumWP74j2KSpHRm/1l65nN4wqzg3ovoesjoJDvb2miz8AX7ag
> FXVa54JpIcoIzCkIkqvpCfzhatGU55yQiyt7aFAhJfpmP/cNpxmAete8buTK
> 6aFMiYWFJe+md/bLOrk5g/cyr9BUq+tHT7Qf+mRmgw9fuECUXMXMzf6vOUk8
> 0JnYiYVk0j+twZeuDaVPBrXEMKuYuq7NlILuHJDF3meRPM2xekan8ARZoJxL
> XAOzvaEFly0TH5DJfItSVOL86qtp+1orULSrVbtvolxzQtv8xiNOzJYBKEnO
> ouVI
> =d8mm
> -----END PGP SIGNATURE-----

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: Adding Data-At-Rest compression support to Ceph
  2015-09-24 15:13             ` Igor Fedotov
  2015-09-24 15:34               ` Sage Weil
@ 2015-09-24 18:10               ` Gregory Farnum
  2015-09-25 13:16                 ` Igor Fedotov
  1 sibling, 1 reply; 26+ messages in thread
From: Gregory Farnum @ 2015-09-24 18:10 UTC (permalink / raw)
  To: Igor Fedotov; +Cc: Sage Weil, ceph-devel

On Thu, Sep 24, 2015 at 8:13 AM, Igor Fedotov <ifedotov@mirantis.com> wrote:
> On 23.09.2015 21:03, Gregory Farnum wrote:
>>
>> On Wed, Sep 23, 2015 at 6:15 AM, Sage Weil <sage@newdream.net> wrote:
>>>>>
>>>>>
>>>>> The idea of making the primary responsible for object compression
>>>>> really concerns me. It means for instance that a single random access
>>>>> will likely require access to multiple objects, and breaks many of the
>>>>> optimizations we have right now or in the pipeline (for instance:
>>>>> direct client access).
>>>
>>> Could you please elaborate why multiple objects access is required on
>>> single
>>> random access?
>>
>> It sounds to me like you were planning to take an incoming object
>> write, compress it, and then chunk it. If you do that, the symbols
>> ("abcdefgh = a", "ijklmnop = b", etc) for the compression are likely
>> to reside in the first object and need to be fetched for each read in
>> other objects.
>
> Gregory,
> do you mean a kind of compressor dictionary by the symbols "abcdefgh = a",
> etc. here?
> And your assumption is that such a dictionary is made on the first write,
> saved and reused by any subsequent reads, right?
> I think that's not the case - it's better to compress each write
> independently.  Thus there is no need to access a "dictionary" object (i.e.
> the first object with these symbols) on every read operation. The latter
> uses compressed block data only.
> Yes, this might affect the total compression ratio, but I think that's acceptable.
>>>
>>> In my opinion we need to access exactly the same object set as before: in
>>> an EC pool each appended block is split into multiple shards that go to
>>> the respective OSDs. In the general case one has to retrieve a set of
>>> adjacent shards from several OSDs on a single read request.
>>
>> Usually we just need to get the object info from the primary and then
>> read whichever object has the data for the requested region. If the
>> region spans a stripe boundary we might need to get two, but often we
>> don't...
>
> With the independent block compression mentioned above the scenario is the same.
> The only thing we need to find the proper compressed block is a mapping from
> the original data offsets to the compressed ones. We can store this as object
> metadata. Thus we need only the object metadata on each read.

Okay, that's acceptable, but that metadata then gets pretty large. You
would need to store an offset, for each chunk in the PG, and for each
individual write. (And even then you'd have to read an entire write at
a time to make sure you get the data requested, even if they only want
a small portion of it.)
If you're doing it this way, then realize we've also got a problem
with recovery: we can't lose those offsets. Which means they need to
be preserved at all costs. So that means for each stripe unit you'd
store them on the primary (for easy access) and on the replica (so
they have the same lifecycle as the data they're mapping), which means
the replicas need to be compression-aware. Which is good, since I
think they'd need to be compression-aware for scrubbing and things as
well. And then when you lose the primary the next guy who's
reconstructing would need to, uh, ask each shard for the uncompressed
version of the data?

If we were going to limit this to EC pools I think we should just do
it at the replica in the FileStore or something, transparently to the
wire and recovery protocols. While the compression would help on 1GigE
networks, on 10GigE I think the CPU costs of compression outweigh any
bandwidth efficiencies we'd get...


>>>
>>> In the case of compression the only difference is in the data range that
>>> the compressed shard set occupies. I.e. we simply need to translate the
>>> requested data range to the actually stored one and retrieve that data
>>> from the OSDs. What's missing?
>>>>
>>>> And apparently only the EC pool will support
>>>> compression, which is frustrating for all the replicated pool users
>>>> out there...
>>>
>>> In my opinion replicated pool users should consider EC pool usage first if
>>> they care about space saving. They automatically gain 50% space saving this
>>> way. Compression brings even more saving, but that's rather the second step
>>> on this way.
>>
>> EC pools have important limitations that replicated pools don't, like
>> not working for object classes or allowing random overwrites. You can
>> stick a replicated cache pool in front but that comes with another
>> whole can of worms. Anybody with a large enough proportion of active
>> data won't find that solution suitable but might still want to reduce
>> space required where they can, like with local compression.
>
> Well, I agree that having compression support for both replicated and EC
> pools is better.
> But random access (and probably other advanced features) requires much
> more complex data handling, which also brings additional overhead. Actually I
> suppose EC pools have such limitations for these reasons. Thus my
> original idea was to simplify the compression implementation on the one hand
> and bring it in line with EC usage on the other. The latter makes sense since
> compression and EC have much the same reasons for implementation.

Well, EC pools still support random reads, I think? Or at least
reading along stripes, which for the purpose of this discussion is
almost the same.

>
> And just for the sake of my education, could you please mention or point out
> existing issues in cache+EC pool usage?
> How widely are EC pools used in production at all? Or is that rather an
> experimental/secondary option?

Promotes are expensive. Work is ongoing to make cache pools work
better, but promotes will always be expensive. So they're only
suitable if you have a hot data set which is very small compared to
your total storage needs (and you need the cache tier to be a little
larger than that hot set). I'm not sure what the deployment of EC
pools looks like.

>>>>
>>>> Is there some reason we don't just want to apply encryption across an
>>>> OSD store? Perhaps doing it on the filesystem level is the wrong way
>>>> (for reasons named above) but there are other mechanisms like inline
>>>> block device compression that I think are supposed to work pretty
>>>> well.
>>>
>>> If I understand the idea of inline block device compression correctly, it
>>> has some drawbacks similar to the FS compression approach. Ones to mention:
>>> * Less flexibility - per-device compression only, no way to have per-pool
>>> compression. No control over the compression process.
>>
>> What would the use case be here? I can imagine not wanting to slow
>> down your cache pools with it or something (although realistically I
>> don't think that's a concern unless the sheer CPU usage is a problem
>> with frequent writes), but those would be on separate OSDs/volumes
>> anyway
>
> Well, I can imagine the need to have compression for some specific backing
> pools (e.g. with seldom accessed or highly compressible data) and to disable
> it for others, e.g. where the original data is non-compressible (e.g. either
> already compressed or encrypted).

Good compression algorithms already handle this, IIUC.
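
As a small illustration of that point, a Python sketch using zlib (the 
store_block helper and its raw-storage fallback are hypothetical, not an 
existing interface): incompressible input grows only by a few bytes of 
framing, and a caller can keep the raw bytes whenever compression does not 
help.

```python
import os
import zlib

def store_block(data: bytes):
    """Return (is_compressed, payload), keeping the raw bytes when
    compression does not make the block smaller."""
    blob = zlib.compress(data)
    if len(blob) < len(data):
        return True, blob
    return False, data

random_block = os.urandom(65536)               # stands in for encrypted data
text_block = b"highly compressible " * 3277    # about 64 KiB of repeated text

# Overhead zlib adds when the input cannot be compressed.
overhead = len(zlib.compress(random_block)) - len(random_block)
print("overhead on random data:", overhead, "bytes")
print("random block compressed:", store_block(random_block)[0])
print("text block compressed:", store_block(text_block)[0])
```

The CPU time spent trying to compress such data is still paid, which is 
the part a per-pool switch would avoid.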

> Potentially we can even have some option to control compression on
> per-object basis and provide some hints for clients to enable it for
> specific use cases.

Mmmm, I'm not sure I'm comfortable exposing that to clients. If
they're compression-aware it's probably best to do it using them.

> Another feature that might be useful is the ability to disable/re-enable
> compression during the OSD life-cycle, e.g. when an administrator realizes that
> it's not appropriate for his use case. I doubt that's easy to do when
> compression is performed at the device level.

I confess I've no idea about this one.

In any case, as Sam said I think judging these proposals well will
require actually going through the data structure and algorithms
design work for each one and comparing. Unfortunately I've no time to
do that, but I'd definitely like to see two real approaches
well-sketched-out before any work is spent on coding one.
-Greg

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: Adding Data-At-Rest compression support to Ceph
  2015-09-24 17:53                         ` Samuel Just
@ 2015-09-25 11:59                           ` Igor Fedotov
  2015-09-25 14:14                             ` Sage Weil
  0 siblings, 1 reply; 26+ messages in thread
From: Igor Fedotov @ 2015-09-25 11:59 UTC (permalink / raw)
  To: Samuel Just, Robert LeBlanc; +Cc: Sage Weil, Gregory Farnum, ceph-devel

Another thing to note is that we don't have the whole object ready for 
compression. We just have some new data block written (appended) to the 
object. And we should either compress that block and save the mentioned 
mapping data, or decompress the existing object data and do a full 
compression again.
And IMO introducing seek points is largely similar to what we were 
talking about - it requires a sort of offset mapping as well.
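
For illustration, a Python sketch of zlib seek points via Z_FULL_FLUSH. It 
assumes one raw deflate stream per object with a full flush after every 
appended block; the function names are made up. The list of (logical 
offset, compressed offset) pairs it has to return is exactly the kind of 
mapping mentioned above.

```python
import zlib

def compress_with_seek_points(blocks):
    """Compress blocks into one raw deflate stream, doing a full flush after
    each block. Returns the stream and a list of
    (logical_offset, compressed_offset) seek points."""
    comp = zlib.compressobj(6, zlib.DEFLATED, -15)   # -15: raw deflate, no header
    stream = b""
    seek_points = []
    logical = 0
    for block in blocks:
        seek_points.append((logical, len(stream)))
        stream += comp.compress(block) + comp.flush(zlib.Z_FULL_FLUSH)
        logical += len(block)
    stream += comp.flush()                           # finish the stream
    return stream, seek_points

def read_from_seek_point(stream, compressed_offset, length):
    """Start inflating at a seek point and return up to `length` bytes."""
    decomp = zlib.decompressobj(-15)
    return decomp.decompress(stream[compressed_offset:], length)

blocks = [bytes([65 + i]) * 65536 for i in range(8)]   # 8 blocks of 64 KiB
stream, seek_points = compress_with_seek_points(blocks)

# Jump straight to the sixth block without inflating the first five.
logical, compressed = seek_points[5]
data = read_from_seek_point(stream, compressed, 4096)
```

A full flush resets the compressor's history, so each seek point costs some 
compression ratio, much like compressing the blocks independently.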

Compression at the OSD probably has some pros as well. But it wouldn't 
eliminate the need to "muck with stripe sizes or anything".

On 24.09.2015 20:53, Samuel Just wrote:
> The catch is that currently accessing 4k in the middle of a 4MB object
> does not require reading the whole object, so you'd need some kind of
> logical offset -> compressed offset mapping.
> -Sam
>
> On Thu, Sep 24, 2015 at 10:36 AM, Robert LeBlanc <robert@leblancnet.us> wrote:
>> -----BEGIN PGP SIGNED MESSAGE-----
>> Hash: SHA256
>>
>> I'm probably missing something, but since we are talking about data at
>> rest, can't we just have the OSD compress the object as it goes to
>> disk? Instead of
>> rbd\udata.1ba49c10d9b00c.0000000000006859__head_2AD1002B__11 it would
>> be rbd\udata.1ba49c10d9b00c.0000000000006859__head_2AD1002B__11.{gz,xz,bz2,lzo,etc}.
>> Then it seems that you don't have to muck with stripe sizes or
>> anything. For compressible objects they would be less than 4MB, some
>> of theses algorithms already say if it is not compressible enough,
>> just store it.
>>
>> Something like zlib Z_FULL_FLUSH may help provide some seek points
>> within an archive to prevent decompressing the whole object for reads?
>>
>> - ----------------
>> Robert LeBlanc
>> PGP Fingerprint 79A2 9CA4 6CC4 45DD A904  C70E E654 3BB2 FA62 B9F1
>>
>>
>> On Thu, Sep 24, 2015 at 10:25 AM, Igor Fedotov  wrote:
>>>
>>> On 24.09.2015 19:03, Sage Weil wrote:
>>>> On Thu, 24 Sep 2015, Igor Fedotov wrote:
>>>>
>>>> Dynamic stripe sizes are possible but it's a significant change from the
>>>> way the EC pool currently works. I would make that a separate project (as
>>>> its useful in its own right) and not complicate the compression situation.
>>>> Or, if it simplifies the compression approach, then I'd make that change
>>>> first. sage
>>> Just to clarify a bit. What I saw when played with Ceph. Please correct me
>>> if I'm wrong..
>>>
>>> For low-level RADOS access client data written to EC pool has to be aligned
>>> with stripe size . The last block can be unaligned though but no more
>>> appends are permitted in this case.
>>> Data copied from cache goes in blocks up to 8Mb size. In general case  the
>>> last block seems to have unaligned size too.
>>>
>>> EC pool additionally performs alignment of the incoming blocks to stripe
>>> bound internally. This way blocks going to EC lib are always aligned.
>>> We should probably perform compression prior to this alignment.
>>> Thus some dependency on stripe size is present in EC pools but it's not that
>>> strict.
>>>
>>> Thanks,
>>> Igor
>>>
>>> --
>>> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
>>> the body of a message to majordomo@vger.kernel.org
>>> More majordomo info at  http://vger.kernel.org/majordomo-info.html
>> -----BEGIN PGP SIGNATURE-----
>> Version: Mailvelope v1.1.0
>> Comment: https://www.mailvelope.com
>>
>> wsFcBAEBCAAQBQJWBDSDCRDmVDuy+mK58QAAmwwP/3q0tbLZA95RVsvSLrXk
>> ipuhjiGPvAX8o2kTYFtf5tXkMuiJIJIy+WK1uD6zs+CXM/2JR6SJthS3tE9A
>> meaFW7W5lropbWKRZ8TkpUNQAXDyRrpSEcTDBWciq+EOca5tlP+17KDevVnZ
>> PWDCNPlZmbHyBy91iJju4TTzaJYoD8mXU/+4xLCicePDPomlpO4oyndDfOmI
>> JP5uRDmgP0ecsxfcyoYSTCJylfnBsmK0IMyxZoV2Mx+SEcqgtECPCOY7Uc/4
>> wwXGhu//zO7twyOvtsk4OQGjLX9wpSpVWz+zcR2RYiYfw3YSTSzGvbBC5hpb
>> pfQya5DbypJra2oz5BZkikvwYPhxPoI0FcdTCYFFxclm0jMwQqh2b141kN8Z
>> eR7v8ttfnbACumWP74j2KSpHRm/1l65nN4wqzg3ovoesjoJDvb2miz8AX7ag
>> FXVa54JpIcoIzCkIkqvpCfzhatGU55yQiyt7aFAhJfpmP/cNpxmAete8buTK
>> 6aFMiYWFJe+md/bLOrk5g/cyr9BUq+tHT7Qf+mRmgw9fuECUXMXMzf6vOUk8
>> 0JnYiYVk0j+twZeuDaVPBrXEMKuYuq7NlILuHJDF3meRPM2xekan8ARZoJxL
>> XAOzvaEFly0TH5DJfItSVOL86qtp+1orULSrVbtvolxzQtv8xiNOzJYBKEnO
>> ouVI
>> =d8mm
>> -----END PGP SIGNATURE-----



* Re: Adding Data-At-Rest compression support to Ceph
  2015-09-24 18:10               ` Gregory Farnum
@ 2015-09-25 13:16                 ` Igor Fedotov
  0 siblings, 0 replies; 26+ messages in thread
From: Igor Fedotov @ 2015-09-25 13:16 UTC (permalink / raw)
  To: Gregory Farnum; +Cc: Sage Weil, ceph-devel



On 24.09.2015 21:10, Gregory Farnum wrote:
> On Thu, Sep 24, 2015 at 8:13 AM, Igor Fedotov <ifedotov@mirantis.com> wrote:
>> On 23.09.2015 21:03, Gregory Farnum wrote:
>> Okay, that's acceptable, but that metadata then gets pretty large. 
>> You would need to store an offset, for each chunk in the PG, and for 
>> each individual write. (And even then you'd have to read an entire 
>> write at a time to make sure you get the data requested, even if they 
>> only want a small portion of it.) If you're doing it this way, then 
>> realize we've also got a problem with recovery: we can't lose those 
>> offsets. Which means they need to be preserved at all costs. So that 
>> means for each stripe unit you'd store them on the primary (for easy 
>> access) and on the replica (so they have the same lifecycle as the 
>> data they're mapping), which means the replicas need to be 
>> compression-aware. Which is good, since I think they'd need to be 
>> compression-aware for scrubbing and things as well. And then when you 
>> lose the primary the next guy who's reconstructing would need to, uh, 
>> ask each shard for the uncompressed version of the data? 
You are absolutely right about the importance of the metadata and about
replicas needing to be compression-aware. The good news is that this is
very similar to the current EC pool implementation. Each append to an EC
pool updates some specific metadata (hash info) that is propagated to
all replicas, and each replica is able to restore the EC-encoded data
when the primary is lost. IMO such a replica simply becomes the new
primary.

And yes, the reconstructing entity collects shards from multiple OSDs.
Moreover, the primary does the same during a regular read. So all of
these mechanics already exist for EC pools.

>> If we were going to limit this to EC pools I think we should just do 
>> it at the replica in the FileStore or something, transparently to the 
>> wire and recovery protocols. While the compression would help on 
>> 1GigE networks, on 10GigE I think the CPU costs of compression 
>> outweigh any bandwidth efficiencies we'd get... 
This is definitely worth considering, but there is one thing to mention
here. In general, from a CPU load perspective there is not much
difference where compression is performed: at the primary OSD or at the
replica node. Each replica node can be a primary for some other object,
thus its CPU can be utilized for that compression.
E.g.
There are three nodes: node1, node2, node3.
There are three objects written to an EC pool.
They have different primaries: node1, node2 and node3 respectively,
and all nodes are used to store the resulting EC shards.

Original disposition after EC:
obj1 -> shard1_1, shard1_2, shard1_3. (performed at node1)
obj2 -> shard2_1, shard2_2, shard2_3. (performed at node2)
obj3 -> shard3_1, shard3_2, shard3_3. (performed at node3)

Stored data disposition can be:
node1: shard1_1, shard2_2, shard3_3
node2: shard1_3, shard2_1, shard3_2
node3: shard1_2, shard2_3, shard3_1

Thus each node has to deal with 3 shards no matter where the compression
functionality lives: each node has to compress 3 shards.
If compression is done at the primary, node1 compresses shard1_1,
shard1_2, shard1_3.
If compression is done at the replica, node1 compresses shard1_1,
shard2_2, shard3_3.
The same applies to the other nodes.

As a result you will have a similar CPU load distribution among nodes
under Ceph cluster load for both compression approaches.
Actually, compression at the primary before EC even has some benefit:
each object has only two shards prior to EC, thus you need to compress
less data.
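
The counting above can be checked with a few lines of Python. This is
only a sketch of the three-node example: the placement table is copied
from it, the helper names are made up, and k = 2 data shards per object
is an assumption inferred from "two shards prior to EC".

```python
from collections import Counter

# Shard -> node that stores it, as in the example above.
placement = {
    "shard1_1": "node1", "shard2_2": "node1", "shard3_3": "node1",
    "shard1_3": "node2", "shard2_1": "node2", "shard3_2": "node2",
    "shard1_2": "node3", "shard2_3": "node3", "shard3_1": "node3",
}
primary = {"obj1": "node1", "obj2": "node2", "obj3": "node3"}


def obj_of(shard):
    # "shard2_3" -> "obj2"
    return "obj" + shard[len("shard")]


def load_at_replica():
    """Shards compressed per node when each node compresses what it stores."""
    return Counter(placement.values())


def load_at_primary():
    """Shards compressed per node when the primary compresses after EC."""
    return Counter(primary[obj_of(s)] for s in placement)


def load_at_primary_before_ec(k=2):
    """Shard-sized units compressed per node when compressing before EC.

    k = 2 is an assumption: only the data part of each object is compressed.
    """
    return Counter({node: k for node in primary.values()})
```

Both post-EC variants give 3 shards per node, while compressing before
EC gives 2 shard-sized units per node under the k = 2 assumption.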

>>
>> Well I agree that have compression support for both replicated and EC pools
>> is better.
>> But random access ( and probably other advanced features ) requires much
>> more complex data handling that also brings additional overhead. Actually I
>> suppose EC pools have such limitations due to these reasons. Thus my
>> original idea was to simplify compression implementation from one side and
>> make it  in-line with EC usage from another. The latter makes sense since
>> compression and EC  have pretty the same reasons for implementation.
> Well, EC pools still support random reads, I think? Or at least
> reading along stripes, which for the purpose of this discussion is
> almost the same.
Yeah, random reads are possible for EC pools, but they aren't the major
issue IMO. It's random writes that cause the headache.
On such a write one has to decompress the existing block, merge it with
the new data, compress it again and then save it to disk, taking into
account that the block size has changed.
Or implement a sort of journal where new writes are saved separately;
the data then has to be reconstructed from this journal on read, and
some garbage collection is needed as well... AFAIK ZBD, which I
mentioned before, works this way.
See
http://users.ics.forth.gr/~bilas/pdffiles/makatos-snapi10.pdf
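
A rough sketch of the first variant (decompress, merge, recompress),
assuming each block is compressed independently with zlib. The function
name is hypothetical; the journal-based variant is not shown.

```python
import zlib


def overwrite_compressed_block(blob, offset, new_data):
    """Read-modify-write on one independently compressed block."""
    plain = bytearray(zlib.decompress(blob))           # 1. decompress
    if offset + len(new_data) > len(plain):
        raise ValueError("write extends past the block")
    plain[offset:offset + len(new_data)] = new_data    # 2. merge
    return zlib.compress(bytes(plain))                 # 3. recompress
```

The returned blob generally has a different length than the old one, so
the caller also has to relocate it or update whatever offset mapping it
keeps.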

> In any case, as Sam said I think judging these proposals well will 
> require actually going through the data structure and algorithms 
> design work for each one and comparing. Unfortunately I've no time to 
> do that, but I'd definitely like to see two real approaches 
> well-sketched-out before any work is spent on coding one. -Greg 
Got it, will try to prepare some draft...

Thanks,
Igor


* Re: Adding Data-At-Rest compression support to Ceph
  2015-09-25 11:59                           ` Igor Fedotov
@ 2015-09-25 14:14                             ` Sage Weil
  2015-09-28 16:56                               ` Igor Fedotov
  0 siblings, 1 reply; 26+ messages in thread
From: Sage Weil @ 2015-09-25 14:14 UTC (permalink / raw)
  To: Igor Fedotov; +Cc: Samuel Just, Robert LeBlanc, Gregory Farnum, ceph-devel

On Fri, 25 Sep 2015, Igor Fedotov wrote:
> Another thing to note is that we don't have the whole object ready for
> compression. We just have some new data block written(appended) to the object.
> And we should either compress that block and save mentioned mapping data or
> decompress the existing object data and do full compression again.
> And IMO introducing seek points is largely similar to what we were talking
> about - it requires a sort of offset mapping as well.
> 
> Probably compression at OSD has some Pros as well. But it wouldn't eliminate
> the need to "muck with stripe sizes or anything".

I think the best option here is going to be to compress the "stripe unit".  
I.e., if you have a stripe_size of 64K, and are doing k=4 m=2, then the 
stripe unit is 16K (64/4).  Then each shard has an independent unit it can 
compress/decompress and we don't break the ability to read a small extent 
by talking to only a single shard.
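
As a sketch of the arithmetic, assuming data is striped round-robin
across the k data shards (the function names and the layout are
assumptions for illustration, not the actual ECBackend code):

```python
def locate(offset, stripe_size=64 * 1024, k=4):
    """Map a logical offset to (data_shard, stripe_no, offset_in_unit)."""
    stripe_unit = stripe_size // k                  # 64K / 4 = 16K
    stripe_no, in_stripe = divmod(offset, stripe_size)
    shard, in_unit = divmod(in_stripe, stripe_unit)
    return shard, stripe_no, in_unit


def shards_for_read(offset, length, stripe_size=64 * 1024, k=4):
    """Data shards that must be contacted to serve a read."""
    stripe_unit = stripe_size // k
    first = offset // stripe_unit
    last = (offset + length - 1) // stripe_unit
    return sorted({unit % k for unit in range(first, last + 1)})
```

With these defaults a 4K read at offset 20000 falls entirely inside one
16K stripe unit, so only one shard needs to decompress anything.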

*Maybe* the shard could compress contiguous stripe units if multiple 
stripes are written together..

In any case, though, there will be some metadata it has to track with the 
object, because the stripe units are no longer fixed size, and there will 
be object_size/stripe_size of them.  I forget if we are already storing a 
CRC for each stripe unit or if it is for the entire shard... if it's the 
former then this won't be a huge change, I think.
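
For a sense of scale, a sketch of such per-shard metadata, assuming zlib
and a simple table of start offsets (one entry per stripe unit, plus an
end marker). The names are hypothetical.

```python
from itertools import accumulate
import zlib


def build_unit_index(shard_units):
    """Compress each stripe unit; return (stored bytes, start offsets)."""
    blobs = [zlib.compress(unit) for unit in shard_units]
    # starts[i] is where unit i begins; starts[-1] is the total stored size.
    starts = [0] + list(accumulate(len(b) for b in blobs))
    return b"".join(blobs), starts


def read_unit(stored, starts, i):
    """Decompress stripe unit i without touching its neighbours."""
    return zlib.decompress(stored[starts[i]:starts[i + 1]])
```

For a 4MB object with a 64K stripe_size this is 4MB / 64K = 64 entries
per shard.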

sage





* Re: Adding Data-At-Rest compression support to Ceph
  2015-09-25 14:14                             ` Sage Weil
@ 2015-09-28 16:56                               ` Igor Fedotov
  0 siblings, 0 replies; 26+ messages in thread
From: Igor Fedotov @ 2015-09-28 16:56 UTC (permalink / raw)
  To: Sage Weil; +Cc: Samuel Just, Robert LeBlanc, Gregory Farnum, ceph-devel


On 25.09.2015 17:14, Sage Weil wrote:
> On Fri, 25 Sep 2015, Igor Fedotov wrote:
>> Another thing to note is that we don't have the whole object ready for
>> compression. We just have some new data block written(appended) to the object.
>> And we should either compress that block and save mentioned mapping data or
>> decompress the existing object data and do full compression again.
>> And IMO introducing seek points is largely similar to what we were talking
>> about - it requires a sort of offset mapping as well.
>>
>> Probably compression at OSD has some Pros as well. But it wouldn't eliminate
>> the need to "muck with stripe sizes or anything".
> I think the best option here is going to be to compress the "stripe unit".
> I.e., if you have a stripe_size of 64K, and are doing k=4 m=2, then the
> stripe unit is 16K (64/4).  Then each shard has an independent unit it can
> compress/decompress and we don't break the ability to read a small extent
> by talking to only a single shard.
Sage, are you considering compression applied after erasure coding here?
Please note that one needs to compress an additional 50% of data this
way (for k=4, m=2), since the generated 'm' chunks need to be processed
as well.
And you lose the ability to perform recovery when an OSD goes down
without decompressing (and probably recompressing) the remaining shards.

In contrast, doing compression before EC produces a reduced data set for
EC (saving some CPU cycles) and suits a recovery procedure that doesn't
involve an additional decompression/compression pair.
But I suppose the 'stripe unit' approach above wouldn't work in this
case: the compression entity would have to produce blocks of exactly
"stripe unit" size. Only that way can the compressed data be kept
within a single shard. But that's hard to achieve...

So, as usual, we should choose which drawbacks (benefits) are less
(more) important here:
the ability to read a small extent from a single shard + an increased
data set for compression, vs. the ability to skip full decompression on
recovery + a reduced data set for EC.
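
The 50% figure follows from the k=4, m=2 profile Sage used above. A tiny
sketch of the arithmetic (hypothetical helper, ignoring stripe padding):

```python
def bytes_to_compress(object_size, k, m, after_ec):
    """Bytes fed to the compressor for one object.

    Before EC only the original data is compressed; after EC all k + m
    chunks are, i.e. object_size * (k + m) / k.
    """
    if after_ec:
        return object_size * (k + m) // k
    return object_size
```

For a 4MB object, compressing after EC means 6MB of compressor input
instead of 4MB.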



>
> *Maybe* the shard could compress contiguous stripe units if multiple
> stripes are written together..
>
> In any case, though, there will some metadata it has to track with the
> object, because the stripe units are no longer fixed size, and there will
> be object_size/stripe_size of them.  I forget if we are already storing a
> CRC for each stripe unit or if it is for the entire shard... if it's the
> former then this won't be a huge change, I think.
>
> sage
>
>
>



Thread overview: 26+ messages
2015-09-22 17:04 Adding Data-At-Rest compression support to Ceph Igor Fedotov
2015-09-22 19:11 ` Sage Weil
2015-09-23 12:47   ` Igor Fedotov
2015-09-23 13:15     ` Sage Weil
2015-09-23 14:05       ` Gregory Farnum
2015-09-23 15:26         ` Igor Fedotov
2015-09-23 17:31           ` Samuel Just
2015-09-24 15:34             ` Igor Fedotov
2015-09-23 18:03           ` Gregory Farnum
2015-09-24 15:13             ` Igor Fedotov
2015-09-24 15:34               ` Sage Weil
2015-09-24 15:41                 ` HEWLETT, Paul (Paul)
2015-09-24 16:00                   ` Igor Fedotov
2015-09-24 15:56                 ` Igor Fedotov
2015-09-24 16:03                   ` Sage Weil
2015-09-24 16:14                     ` Igor Fedotov
2015-09-24 16:25                     ` Igor Fedotov
2015-09-24 17:36                       ` Robert LeBlanc
2015-09-24 17:53                         ` Samuel Just
2015-09-25 11:59                           ` Igor Fedotov
2015-09-25 14:14                             ` Sage Weil
2015-09-28 16:56                               ` Igor Fedotov
2015-09-24 18:10               ` Gregory Farnum
2015-09-25 13:16                 ` Igor Fedotov
2015-09-23 14:08       ` Igor Fedotov
2015-09-23 14:37         ` Sage Weil
