Inline dedup/compression


* Inline dedup/compression
@ 2015-06-25 22:01 James (Fei) Liu-SSI
  2015-06-25 23:00 ` Benoît Canet
                   ` (2 more replies)
  0 siblings, 3 replies; 28+ messages in thread
From: James (Fei) Liu-SSI @ 2015-06-25 22:01 UTC (permalink / raw)
  To: ceph-devel

Hi Cephers,
    It is not easy to ask when Ceph is going to support inline dedup/compression across OSDs in RADOS, because it is neither an easy task nor an easy question to answer. Ceph provides replication and EC for performance and failure recovery, but we lose storage efficiency and pay the cost associated with that. The two goals somewhat contradict each other, so I am curious what other Cephers think about this question.
   Are there any plans to do anything about inline dedupe/compression beyond the features provided by the local node itself, such as Btrfs?

  Regards,
  James


* Re: Inline dedup/compression
  2015-06-25 22:01 James (Fei) Liu-SSI
@ 2015-06-25 23:00 ` Benoît Canet
  2015-06-26  3:08 ` Haomai Wang
  2015-06-30  6:50 ` Dałek, Piotr
  2 siblings, 0 replies; 28+ messages in thread
From: Benoît Canet @ 2015-06-25 23:00 UTC (permalink / raw)
  To: James Liu-SSI, ceph-devel


On Fri, 26 Jun 2015, at 00:01, James (Fei) Liu-SSI wrote:
> Hi Cephers,
>     It is not easy to ask when Ceph is going to support inline
>     dedup/compression across OSDs in RADOS.

Disclaimer: I am not a Cepher.

This would require either a distributed key-value store fast enough to
serve one query per written object, or some kind of distributed in-RAM LRU
hash table that optimistically identifies the hottest objects (the ones
that are identical and get written most often).

I am more used to block devices, so I don't have a feel for how many
distinct objects per second get written to disk in a Ceph setup.

I don't see how it would cope with partial writes.

If you need speed comparable to a block device setup, you can forget
anything B-tree based for the key-value store, because lookups are O(log n).

So you are left with solutions like SILT, which will eat some RAM and beg
you for an SSD in exchange for O(1+epsilon) lookups. (Probably good for
Samsung ;)
See https://www.cs.cmu.edu/~dga/papers/silt-sosp2011.pdf

In my experience with block device setups, deduplication is very demanding
on the key-value store. Maybe with only complete writes of big objects this
problem would disappear.
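
To make the "in-RAM LRU hash table" idea above concrete, here is a minimal single-process sketch. It is only an illustration: the class name `FingerprintLRU`, the use of SHA-256 as the fingerprint, and the tiny capacity are assumptions, and nothing here is distributed or taken from Ceph.

```python
import hashlib
from collections import OrderedDict


class FingerprintLRU:
    """Bounded in-RAM map from content fingerprint to a stored object's name."""

    def __init__(self, capacity):
        self.capacity = capacity
        self._entries = OrderedDict()  # fingerprint -> canonical object name

    def lookup_or_add(self, name, data):
        """Return the name of an identical cached object, or None if `data` is new here."""
        fp = hashlib.sha256(data).digest()
        if fp in self._entries:
            self._entries.move_to_end(fp)  # mark as most recently used
            return self._entries[fp]
        self._entries[fp] = name
        if len(self._entries) > self.capacity:
            self._entries.popitem(last=False)  # evict the least recently used
        return None


cache = FingerprintLRU(capacity=2)
first = cache.lookup_or_add("obj-a", b"same bytes")   # miss: content not seen yet
second = cache.lookup_or_add("obj-b", b"same bytes")  # hit: duplicate of obj-a
```

Because the table is bounded, a duplicate whose fingerprint has been evicted is simply missed; that is the "optimistic" part, trading dedup ratio for a fixed memory budget.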

Best regards

Benoît


--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


* Re: Inline dedup/compression
  2015-06-25 22:01 James (Fei) Liu-SSI
  2015-06-25 23:00 ` Benoît Canet
@ 2015-06-26  3:08 ` Haomai Wang
  2015-06-26 18:03   ` James (Fei) Liu-SSI
  2015-06-30  6:50 ` Dałek, Piotr
  2 siblings, 1 reply; 28+ messages in thread
From: Haomai Wang @ 2015-06-26  3:08 UTC (permalink / raw)
  To: James (Fei) Liu-SSI; +Cc: ceph-devel

On Fri, Jun 26, 2015 at 6:01 AM, James (Fei) Liu-SSI
<james.liu@ssi.samsung.com> wrote:
> Hi Cephers,
>     It is not easy to ask when Ceph is going to support inline dedup/compression across OSDs in RADOS, because it is neither an easy task nor an easy question to answer. Ceph provides replication and EC for performance and failure recovery, but we lose storage efficiency and pay the cost associated with that. The two goals somewhat contradict each other, so I am curious what other Cephers think about this question.
>    Are there any plans to do anything about inline dedupe/compression beyond the features provided by the local node itself, such as Btrfs?

Compression is easier to implement in RADOS than dedup. The most
important question for compression is where we compress: in the
client, the PG, or the objectstore. Then we need to decide how large
the compression unit is. Of course, both compression and dedup would
like a key-value-like storage API, but I don't think it is difficult
to use the existing objectstore API.
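
To illustrate why the size of the compression unit matters, here is a rough sketch that compresses an object in independent fixed-size units, so a partial read only decompresses the units it overlaps. The 64 KiB unit, the function names, and the use of zlib are assumptions made for the example, not an existing objectstore interface.

```python
import zlib

UNIT = 64 * 1024  # hypothetical compression unit size in bytes


def compress_units(data, unit=UNIT):
    """Compress each fixed-size unit of `data` independently."""
    return [zlib.compress(data[off:off + unit]) for off in range(0, len(data), unit)]


def read_range(units, offset, length, unit=UNIT):
    """Read `length` bytes at `offset`, decompressing only the units that overlap."""
    if length <= 0:
        return b""
    first = offset // unit
    last = (offset + length - 1) // unit
    plain = b"".join(zlib.decompress(u) for u in units[first:last + 1])
    start = offset - first * unit
    return plain[start:start + length]


payload = bytes(range(256)) * 1024        # 256 KiB of sample data
units = compress_units(payload)
piece = read_range(units, 70_000, 100)    # touches only the second unit
```

A smaller unit makes partial reads and overwrites cheaper but usually lowers the compression ratio, which is the trade-off behind choosing the unit size.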

Dedup is more feasible within a local OSD than across the whole pool
or cluster; if we want dedup at the pool level, we need to do it from
the client.




-- 
Best Regards,

Wheat


* RE: Inline dedup/compression
  2015-06-26  3:08 ` Haomai Wang
@ 2015-06-26 18:03   ` James (Fei) Liu-SSI
  2015-06-26 18:21     ` Handzik, Joe
                       ` (2 more replies)
  0 siblings, 3 replies; 28+ messages in thread
From: James (Fei) Liu-SSI @ 2015-06-26 18:03 UTC (permalink / raw)
  To: Haomai Wang; +Cc: ceph-devel

Hi Haomai,
  Thanks for your response, as always. I agree that compression is the comparatively easier task, but it is still very challenging to implement no matter where we do it. The client side (RBD, RADOSGW, or CephFS) or the PG should be a somewhat better place to implement it in terms of efficiency and cost reduction, because the data is compressed before it is replicated to other OSDs. There are two reasons:
1. It keeps the data consistent among the OSDs in one PG.
2. It saves computing resources.

IMHO, compression should happen before replication comes into play at the pool level. However, we could also have a second level of compression in the local objectstore. The unit size of compression really depends on the workload and on the layer in which we implement it.

As for inline deduplication, the complexity increases dramatically once we take replication and erasure coding into consideration.

However, before we talk about implementation, it would be great to understand the pros and cons of implementing inline dedupe/compression. We all understand the benefits. The side effects are a performance hit and the need for more computing resources. It would be great if we could look at the problem from 30,000 feet and see the whole picture for Ceph. Please correct me if I am wrong.

By the way, software-defined storage startups such as Hedvig and Springpath both provide inline dedupe/compression. It is not an apples-to-apples comparison, but it is a good reference. Datacenters need cost-effective solutions.

Regards,
James


* RE: Inline dedup/compression
  2015-06-26 18:03   ` James (Fei) Liu-SSI
@ 2015-06-26 18:21     ` Handzik, Joe
  2015-06-27  3:54     ` Haomai Wang
  2015-06-29 11:01     ` Gregory Farnum
  2 siblings, 0 replies; 28+ messages in thread
From: Handzik, Joe @ 2015-06-26 18:21 UTC (permalink / raw)
  To: James (Fei) Liu-SSI, Haomai Wang; +Cc: ceph-devel

Might be interesting to implement compression for a subset of replicas (for example, in a 3x replicated ruleset, leave a primary version uncompressed but compress the other two). So, in a perfectly healthy cluster there could be object operations that avoid a penalty for compression. 

Would someone attempt to write a compression engine from scratch, or reuse something like snappy? https://en.wikipedia.org/wiki/Snappy_(software)
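
As a rough illustration of compressing only a subset of replicas, the sketch below keeps the primary copy uncompressed and compresses the others. The function names are hypothetical, and zlib from the Python standard library stands in for a codec such as snappy.

```python
import zlib


def place_replicas(data, replica_count=3):
    """Return (compressed_flag, payload) per replica; replica 0 is the primary."""
    replicas = [(False, data)]                 # primary stays uncompressed
    packed = zlib.compress(data)               # zlib stands in for a codec such as snappy
    replicas += [(True, packed)] * (replica_count - 1)
    return replicas


def read_replica(replica):
    """Return the original bytes from any replica."""
    compressed, payload = replica
    return zlib.decompress(payload) if compressed else payload


copies = place_replicas(b"abc" * 1000)
stored_bytes = sum(len(payload) for _, payload in copies)
```

Reads served by the primary pay no decompression cost; only reads that fall back to another replica do.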

I don't know that there are any obvious "cons" to implementing it, just side effects of using it (like you pointed out). 

Joe


* Re: Inline dedup/compression
  2015-06-26 18:03   ` James (Fei) Liu-SSI
  2015-06-26 18:21     ` Handzik, Joe
@ 2015-06-27  3:54     ` Haomai Wang
  2015-06-29 20:55       ` James (Fei) Liu-SSI
  2015-06-29 11:01     ` Gregory Farnum
  2 siblings, 1 reply; 28+ messages in thread
From: Haomai Wang @ 2015-06-27  3:54 UTC (permalink / raw)
  To: James (Fei) Liu-SSI; +Cc: ceph-devel

On Sat, Jun 27, 2015 at 2:03 AM, James (Fei) Liu-SSI
<james.liu@ssi.samsung.com> wrote:
> Hi Haomai,
>   Thanks for your response, as always. I agree that compression is the comparatively easier task, but it is still very challenging to implement no matter where we do it. The client side (RBD, RADOSGW, or CephFS) or the PG should be a somewhat better place to implement it in terms of efficiency and cost reduction, because the data is compressed before it is replicated to other OSDs. There are two reasons:
> 1. It keeps the data consistent among the OSDs in one PG.
> 2. It saves computing resources.
>
> IMHO, compression should happen before replication comes into play at the pool level. However, we could also have a second level of compression in the local objectstore. The unit size of compression really depends on the workload and on the layer in which we implement it.
>
> As for inline deduplication, the complexity increases dramatically once we take replication and erasure coding into consideration.
>
> However, before we talk about implementation, it would be great to understand the pros and cons of implementing inline dedupe/compression. We all understand the benefits. The side effects are a performance hit and the need for more computing resources. It would be great if we could look at the problem from 30,000 feet and see the whole picture for Ceph. Please correct me if I am wrong.

Actually, we may have some tricks to reduce the performance hit of
compression. As Joe mentioned, we can compress only the replica
(non-primary) PG data to avoid the hit on the primary, but that may
complicate recovery and PG remapping. Another implementation detail:
if we compress data in the messenger, the OSD thread and PG thread do
not need to access the data for a normal client op, so maybe
compression can run in parallel with PG processing. The journal thread
would then receive the compressed data at the end.

The effectiveness of compression is also a concern: compressing in
RADOS may not give the best compression result. If we compress in
libcephfs, librbd and radosgw and keep RADOS unaware of compression,
it may be simpler, and we get file/block/object-level compression.
Wouldn't that be better?

About dedup, my current idea is to set up a memory pool on the OSD
side to store checksums. The client then computes a checksum of the
object data and maps the object to a PG by that checksum instead of by
object name, so an object always lands on the OSD that is also
responsible for its dedup storage. This could also be distributed at
the pool level.
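
The dedup idea above can be sketched roughly as follows. This is a toy model: `PG_COUNT`, `pg_for_content` and `DedupPG` are hypothetical names, SHA-256 is an assumed checksum, and the plain modulo stands in for real placement, which in Ceph goes through CRUSH.

```python
import hashlib

PG_COUNT = 128  # hypothetical number of placement groups in the pool


def pg_for_content(data, pg_count=PG_COUNT):
    """Map an object to a PG from a hash of its data rather than its name."""
    digest = hashlib.sha256(data).digest()
    return int.from_bytes(digest[:8], "big") % pg_count


class DedupPG:
    """Per-PG checksum store: fingerprint -> reference count."""

    def __init__(self):
        self.refs = {}

    def write(self, data):
        """Record a write; return True if the data was already stored in this PG."""
        fp = hashlib.sha256(data).hexdigest()
        duplicate = fp in self.refs
        self.refs[fp] = self.refs.get(fp, 0) + 1
        return duplicate


pgs = [DedupPG() for _ in range(PG_COUNT)]
results = [pgs[pg_for_content(blob)].write(blob)
           for blob in (b"block-1", b"block-2", b"block-1")]
```

Identical data always hashes to the same PG, so each PG only needs a checksum table for its own share of the pool and no global index is required.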





-- 
Best Regards,

Wheat


* Re: Inline dedup/compression
  2015-06-26 18:03   ` James (Fei) Liu-SSI
  2015-06-26 18:21     ` Handzik, Joe
  2015-06-27  3:54     ` Haomai Wang
@ 2015-06-29 11:01     ` Gregory Farnum
  2015-06-29 18:42       ` James (Fei) Liu-SSI
  2 siblings, 1 reply; 28+ messages in thread
From: Gregory Farnum @ 2015-06-29 11:01 UTC (permalink / raw)
  To: James (Fei) Liu-SSI; +Cc: Haomai Wang, ceph-devel

We discuss this periodically but not in any great depth. Compression
and dedupe are both best performed at a single point with some sort of
global knowledge, which is very antithetical to Ceph's design.
Blue-sky discussions for dedupe generally center around trying out
some kind of CAS system with redirects from named objects that are
just indexes of CAS-addressed objects which store the actual object
data, but that introduces the redirect latency and I'm not
super-confident that we'd get much saving outside of scenarios like
RBD where we already capture most of the benefit with the gold master
"parent" images. :/

Compression we discuss less often. You can do it:
a) on the level of an OSD, in which case why would we bother
implementing it ourselves instead of just stacking on top of some
other compression system?
b) on the next level of clients (RBD, RGW, CephFS) ahead of the RADOS
object transition. But that means you generally can't do stuff like
reading portions of an object to satisfy partial reads, and ideas like
striping strategies stop making much sense.
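
For reference, the CAS-with-redirects idea for dedupe mentioned above can be sketched as a toy in-memory store. The class name `CasStore`, the 4-byte chunk size, and SHA-256 addressing are assumptions for illustration only.

```python
import hashlib


class CasStore:
    """Toy content-addressed store with named manifests that redirect to chunks."""

    def __init__(self, chunk_size=4):
        self.chunk_size = chunk_size
        self.chunks = {}     # content address -> chunk bytes
        self.manifests = {}  # object name -> ordered list of content addresses

    def put(self, name, data):
        addrs = []
        for off in range(0, len(data), self.chunk_size):
            chunk = data[off:off + self.chunk_size]
            addr = hashlib.sha256(chunk).hexdigest()
            self.chunks.setdefault(addr, chunk)  # identical chunks are stored once
            addrs.append(addr)
        self.manifests[name] = addrs

    def get(self, name):
        # Two steps: read the manifest, then follow each redirect to its chunk.
        return b"".join(self.chunks[a] for a in self.manifests[name])


store = CasStore()
store.put("image-a", b"AAAABBBBCCCC")
store.put("image-b", b"AAAABBBBDDDD")
unique_chunks = len(store.chunks)
```

The extra lookup in `get` is the redirect latency: every read of a named object costs one manifest read plus the chunk reads it points to.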
-Greg


* RE: Inline dedup/compression
  2015-06-29 11:01     ` Gregory Farnum
@ 2015-06-29 18:42       ` James (Fei) Liu-SSI
  0 siblings, 0 replies; 28+ messages in thread
From: James (Fei) Liu-SSI @ 2015-06-29 18:42 UTC (permalink / raw)
  To: Gregory Farnum; +Cc: Haomai Wang, ceph-devel

Hi Greg,
  Thanks for your reply. For sure, the clone feature implements some of the function of dedupe at the image level. It makes a lot of sense for VMs and saves a huge amount of space for certain workloads such as VDI. In addition, data reduction itself isn't hard, but doing it reliably at large scale without sacrificing performance turns out to be really, really hard. It normally requires you to deal with the following issues:
1. Random IO.
2. Dedupe/compression slows writes.
3. Massive virtualization, with an index tree built for mapping purposes.
4. Compression makes data updates very expensive.

The question for us is whether we should do better dedupe inside images/PGs/OSDs while resolving the issues above. Here are some possible solutions:
1. Flash (SSD) helps a lot with random IO.
2. Compression/dedupe takes more CPU cycles but less time to write. We need to weigh the pros and cons accurately.
3. Compression is best suited to data that is largely inactive. We can choose the compression algorithm based on the workload on the client side.
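
One simple way to weigh the pros and cons per write is to keep the compressed form only when it actually saves space. The sketch below is an assumption-laden illustration: the 12.5% threshold, the function names, and the choice of zlib are made up for the example.

```python
import os
import zlib

MIN_SAVING = 0.125  # hypothetical threshold: keep compression only if it saves >= 12.5%


def encode_for_write(data, level=1):
    """Return (is_compressed, payload), falling back to raw data when compression does not pay."""
    packed = zlib.compress(data, level)
    if len(packed) <= len(data) * (1 - MIN_SAVING):
        return True, packed
    return False, data


def decode_after_read(is_compressed, payload):
    """Undo encode_for_write."""
    return zlib.decompress(payload) if is_compressed else payload


text_like = b"ceph osd pool " * 500
random_like = os.urandom(4096)          # incompressible, like already-compressed data
text_result = encode_for_write(text_like)
random_result = encode_for_write(random_like)
```

With this rule, data that is already compressed or encrypted is stored as-is, so the CPU cost of decompression is only paid where the space saving was real.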

These are just my two cents. IMHO, the enterprise world probably cares more about dedupe/compression features than the cloud world does, but everybody wants to save space and cost as data grows.

Regards,
James



* Re: Inline dedup/compression
       [not found] <1534307780.99.1435609644861.JavaMail.root@thunderbeast.private.linuxbox.com>
@ 2015-06-29 20:32 ` Matt W. Benjamin
  0 siblings, 0 replies; 28+ messages in thread
From: Matt W. Benjamin @ 2015-06-29 20:32 UTC (permalink / raw)
  To: Haomai Wang; +Cc: ceph-devel, James (Fei) Liu-SSI

Hi,

The issues Greg raises steered us away from stream compression, but I'm glad you're experimenting with it.

We were/are interested in (block-oriented, generalized) dedup.  For us, it was clear that the different needs of users and changing capabilities of Ceph lead to different strategies for different data sets (at least).

In our variant of the system, where EC is client side, I don't think there's a conflict with dedup.  We situated it at the volume (kind of like pool) level, where it's abstracted from placement (we've only implemented some simulations to date).

Matt

----- "Haomai Wang" <haomaiwang@gmail.com> wrote:

> On Sat, Jun 27, 2015 at 2:03 AM, James (Fei) Liu-SSI
> <james.liu@ssi.samsung.com> wrote:
> > Hi Haomai,
> >   Thanks for your response as always. I agree compression is
> comparable easier task but still very challenge in terms of
> implementation no matter where we should implement . Client side like
> RBD, or RDBGW or CephFS, or PG should be a little bit better place to
> implementation in terms of efficiency and cost reduction before the
> data were duplicated to other OSDs. It has  two reasons :
> > 1. Keep the data consistency among OSDs in one PG
> > 2. Saving the computing resources
> >
> > IMHO , The compression should be accomplished before the replication
> come into play in pool level. However, we can also have second level
> of compression in the local objectstore.  In term of unit size of
> compression , It really depends workload and in which layer we should
> implement.
> >
> > About inline deduplication, it will dramatically increase the
> complexities if we bring in the replication and Erasure Coding for
> consideration.
> >
> > However, Before we talk about implementation, It would be great if
> we can understand the pros and cons to implement inline
> dedupe/compression. We all understand the benefits of
> dedupe/compression. However, the side effect is performance hurt and
> need more computing resources. It would be great if we can understand
> the problems from 30,000 feet high for the whole picture about the
> Ceph. Please correct me if I were wrong.
> 
> Actually we may have some tricks to reduce performance hurt like
> compression. As Joe mentioned, we can compress slave pg data to avoid
> performance hurt, but it may increase the complexity of recovery and
> pg remap things. Another in-detail implement way if we begin to
> compress data from messenger, osd thread and pg thread won't access
> data for normal client op, so maybe we can make it parallel with pg
> process. Journal thread will get the compressed data at last.
> 
> The effect of compression also is a concern, we do compression in
> rados may not get the best compression result. If we can do
> compression in libcephfs, librbd and radosgw and make rados unknown
> to
> compression, it maybe simpler and we can get file/block/object level
> compression. it should be better?
> 
> About dedup, my current idea is we could setup a memory pool at osd
> side for checksum store usage. Then we calculate object data and map
> to PG instead of object name at client side, so a object could always
> in a osd where it's also responsible for dedup storage. It also could
> be distributed at pool level.
> 
> 
> >
> > By the way, Both of software defined storage solution startups like
> Hdevig and Springpath provide inline dedupe/compression.  It is not
> apple to apple comparison. But it is good reference. The datacenters
> need cost effective solution.
> >
> > Regards,
> > James

-- 
Matt Benjamin
CohortFS, LLC.
315 West Huron Street, Suite 140A
Ann Arbor, Michigan 48103

http://cohortfs.com

tel.  734-761-4689 
fax.  734-769-8938 
cel.  734-216-5309 

^ permalink raw reply	[flat|nested] 28+ messages in thread

* RE: Inline dedup/compression
  2015-06-27  3:54     ` Haomai Wang
@ 2015-06-29 20:55       ` James (Fei) Liu-SSI
  2015-06-30  6:03         ` Haomai Wang
  2015-06-30  6:19         ` Chaitanya Huilgol
  0 siblings, 2 replies; 28+ messages in thread
From: James (Fei) Liu-SSI @ 2015-06-29 20:55 UTC (permalink / raw)
  To: Haomai Wang; +Cc: ceph-devel

Hi Haomai,
  Thanks for moving the idea forward. Regarding compression: if we compress at the client level, it is not global, and the compression applies only to the local client. Am I right? I think both solutions have pros and cons, and we can go into more detail on each.
  By the way, I really like your idea for dedupe on the OSD side. Let me think more about it.

 Regards,
 James


^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: Inline dedup/compression
  2015-06-29 20:55       ` James (Fei) Liu-SSI
@ 2015-06-30  6:03         ` Haomai Wang
  2015-06-30  6:20           ` Blair Bethwaite
  2015-06-30  6:19         ` Chaitanya Huilgol
  1 sibling, 1 reply; 28+ messages in thread
From: Haomai Wang @ 2015-06-30  6:03 UTC (permalink / raw)
  To: James (Fei) Liu-SSI; +Cc: ceph-devel

On Tue, Jun 30, 2015 at 4:55 AM, James (Fei) Liu-SSI
<james.liu@ssi.samsung.com> wrote:
> Hi Haomai,
>   Thanks for moving the idea forward. Regarding to the compression.  However,  if we do compression on the client level, it is not global. And the compression was only applied to the local client, am I right?  I think there is pros and cons in two solutions and we can get into details more for each solution.

Yes, I have thought a lot about compression in Ceph myself. First, we
could easily rely on the objectstore backend to implement compression,
such as filestore on zfs/btrfs or keyvaluestore on leveldb/rocksdb.
The advantage is that we can have it now. The downside is that we may
give up too much, especially performance, for the benefit of
compression.

So we are thinking about moving compression to the osd/pg layer. Maybe
we can get compressed data from the messenger module (a NIC or IB card
may offer a compression feature) and then carry and process the
compressed data directly. The problem is that the pg/osd layer then
becomes aware of compression, and we need to manage the compression
state (this is important). Maybe we could create a pool with a
compression feature and specify the compress unit (8-64KB) and the
compression algorithm. The downside is that the pool level may be too
coarse, and it actually increases the complexity of the io path. For
example, if the compress unit is 8k, all objects in this pool need
8k-aligned io; otherwise, we need to read before write. A pool is a
high-level concept, and it is difficult for users to pick one that
accurately matches the client workload.
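
The alignment point can be illustrated with a small sketch. This is not
Ceph code; the function names and the 8k default are assumptions made
only for illustration. It classifies each compress unit touched by a
write as fully or partially covered. A partially covered unit is the
case that would force a read before write. The sketch ignores the case
where a partially covered unit does not exist yet.

```python
def split_write(offset: int, length: int, unit: int = 8 * 1024):
    """Classify each compress unit touched by a write.

    Returns a list of (unit_index, kind) where kind is "full" when the
    write covers the whole unit and "partial" otherwise.
    """
    if length <= 0:
        return []
    end = offset + length
    first = offset // unit
    last = (end - 1) // unit
    plan = []
    for idx in range(first, last + 1):
        unit_start = idx * unit
        unit_end = unit_start + unit
        covered = offset <= unit_start and end >= unit_end
        plan.append((idx, "full" if covered else "partial"))
    return plan


def needs_read_before_write(offset: int, length: int, unit: int = 8 * 1024) -> bool:
    """True if any touched unit is only partially overwritten."""
    return any(kind == "partial" for _, kind in split_write(offset, length, unit))


if __name__ == "__main__":
    # An aligned 16k write touches two whole units.
    print(split_write(0, 16 * 1024))
    # A 4k-offset write straddles two units and covers neither fully.
    print(split_write(4096, 8192))
```

With an 8k unit, a client issuing 4k random writes would hit the
partial case on every write, which is why the unit size has to match
the workload.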

If we implement compression on the client side, such as in librbd, we
get volume-level compression. That should be friendlier to users. We
can create a volume with a 64kb compress unit for sequential
workloads, a 4kb/8kb unit for a volume that trades off performance,
and no compression for a performance volume. The same applies at the
cephfs directory level and the radosgw bucket level. Librbd may
directly split an object into compress stripes, and cephfs can
compress a whole file, which may give a better compression ratio than
compressing data in the osd without knowledge of its structure.
Another benefit is that we get the benefit of compression as early as
possible, which may offset part of the performance cost of
compression. The downside is that we need to implement more code in
the client libraries.
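
A rough sketch of per-volume compress units, assuming zlib as the
algorithm and a hypothetical policy table (neither is an existing
librbd option). Each stripe is compressed independently so that it can
be read back without touching its neighbours; a unit of 0 stands for
an uncompressed performance volume.

```python
import zlib

# Hypothetical per-volume policy: compress unit in bytes, 0 = no compression.
VOLUME_UNITS = {
    "sequential": 64 * 1024,
    "balanced": 8 * 1024,
    "performance": 0,
}


def compress_stripes(data: bytes, unit: int):
    """Split data into unit-sized stripes and compress each one on its own."""
    if unit == 0:
        return [data]
    return [zlib.compress(data[i:i + unit]) for i in range(0, len(data), unit)]


def decompress_stripes(stripes, unit: int) -> bytes:
    """Inverse of compress_stripes for the same unit."""
    if unit == 0:
        return b"".join(stripes)
    return b"".join(zlib.decompress(s) for s in stripes)


def stored_size(data: bytes, unit: int) -> int:
    """Total bytes that would be stored for this data and unit."""
    return sum(len(s) for s in compress_stripes(data, unit))


if __name__ == "__main__":
    payload = b"ceph " * 50_000  # highly compressible sample data
    for name, unit in VOLUME_UNITS.items():
        print(name, unit, stored_size(payload, unit))
```

Smaller units cost compression ratio because every stripe carries its
own stream overhead and cannot reference data in other stripes, but
they keep random reads and writes cheap.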

>   I really like your idea for dedupe in OSD side   by the way. Let me think more about it.
>
>  Regards,
>  James
>



-- 
Best Regards,

Wheat

^ permalink raw reply	[flat|nested] 28+ messages in thread

* RE: Inline dedup/compression
  2015-06-29 20:55       ` James (Fei) Liu-SSI
  2015-06-30  6:03         ` Haomai Wang
@ 2015-06-30  6:19         ` Chaitanya Huilgol
  2015-06-30 15:31           ` Allen Samuels
  1 sibling, 1 reply; 28+ messages in thread
From: Chaitanya Huilgol @ 2015-06-30  6:19 UTC (permalink / raw)
  To: James (Fei) Liu-SSI, Haomai Wang; +Cc: ceph-devel

Below is an alternative idea, at a very high level, for dedup in Ceph without the need for a centralized hash index:

- Dedupe is set as a pool property
Write:
- Write arrives at the primary OSD/pg
- Data is segmented (rabin/static) and secure hash computed
- A manifest is created with the offset/len/hash for all the segments
- OSD/pg sends rados write with a special name <__known__prefix><secure hash> for all segments
- PG receiving dedup write will:
        1. check for object presence and create object if not present
        2. If the object is already present, then a reference count is incremented (check and increment need to be atomic)
- Response is received by original primary PG for all segments
- Primary PG writes the manifest to local and replicas or EC members
- Response sent to client

Read:
- Read received at primary PG
- Reads manifest object
- Sends reads for each segment object <__known__prefix><secure hash>
- Coalesces all the responses to build the required data
- Responds to client


Pros:
No need for a centralized hash index, so it is in line with Ceph's no-bottleneck philosophy

Cons:
Some PGs may get overloaded due to frequently occurring segment patterns
Latency and increased traffic on the network
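
The write and read paths above can be sketched roughly as follows. This
is a toy, single-process model, not RADOS code: SegmentStore stands in
for the PGs that receive the segment objects, PREFIX is a placeholder
for <__known__prefix>, static 4k segmentation replaces rabin, and
SHA-256 is assumed as the secure hash.

```python
import hashlib

PREFIX = "__dedup__"  # placeholder for <__known__prefix>


class SegmentStore:
    """Toy stand-in for the PGs that hold segment objects by name."""

    def __init__(self):
        self.objects = {}   # object name -> segment bytes
        self.refcount = {}  # object name -> number of references

    def put(self, name: str, data: bytes) -> None:
        # In a real PG the presence check and increment must be atomic.
        if name in self.objects:
            self.refcount[name] += 1
        else:
            self.objects[name] = data
            self.refcount[name] = 1

    def get(self, name: str) -> bytes:
        return self.objects[name]


def dedup_write(store: SegmentStore, data: bytes, seg_size: int = 4096):
    """Segment the data, store each segment by hash, return the manifest."""
    manifest = []
    for off in range(0, len(data), seg_size):
        seg = data[off:off + seg_size]
        digest = hashlib.sha256(seg).hexdigest()
        store.put(PREFIX + digest, seg)
        manifest.append((off, len(seg), digest))
    return manifest


def dedup_read(store: SegmentStore, manifest) -> bytes:
    """Fetch every segment named in the manifest and coalesce them."""
    return b"".join(store.get(PREFIX + digest) for _, _, digest in manifest)


if __name__ == "__main__":
    store = SegmentStore()
    data = b"A" * 8192 + b"B" * 4096
    manifest = dedup_write(store, data)
    print(len(manifest), "segments,", len(store.objects), "stored objects")
    print(dedup_read(store, manifest) == data)
```

Because the segment name is derived from the hash, placement of the
segment follows from the content itself, which is what removes the
need for a central index.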

Regards,
Chaitanya


^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: Inline dedup/compression
  2015-06-30  6:03         ` Haomai Wang
@ 2015-06-30  6:20           ` Blair Bethwaite
  2015-06-30 14:38             ` Alexandre DERUMIER
  0 siblings, 1 reply; 28+ messages in thread
From: Blair Bethwaite @ 2015-06-30  6:20 UTC (permalink / raw)
  To: Haomai Wang; +Cc: James (Fei) Liu-SSI, ceph-devel

On 30 June 2015 at 16:03, Haomai Wang <haomaiwang@gmail.com> wrote:
> On Tue, Jun 30, 2015 at 4:55 AM, James (Fei) Liu-SSI
> <james.liu@ssi.samsung.com> wrote:
>> Hi Haomai,
>>   Thanks for moving the idea forward. Regarding to the compression.  However,  if we do compression on the client level, it is not global. And the compression was only applied to the local client, am I right?  I think there is pros and cons in two solutions and we can get into details more for each solution.
>
> Yes, I think a lot myself about compression with Ceph. At firstly, we
> could easily use objectstore backend to implement compress like
> filestore with zfs/btrfs and keyvaluestore with leveldb/rocksdb etc.
> The advantages are we can enjoy it now. The cons are we may lose too
> much for benefit of compression especially for performance.

If you were going to compress at the OSD I imagine the main
performance concern would be about adding to write latency? That might
be mitigated by only compressing the actual datastore and not the
journal?

I like the idea of having a compress option implemented in e.g. librbd
and rgw, both of these cases involve scale-out clients and so concerns
of performance overhead can be largely brushed aside (e.g., most
OpenStack hypervisors seem to have plenty of free CPU).

-- 
Cheers,
~Blairo

^ permalink raw reply	[flat|nested] 28+ messages in thread

* RE: Inline dedup/compression
  2015-06-25 22:01 James (Fei) Liu-SSI
  2015-06-25 23:00 ` Benoît Canet
  2015-06-26  3:08 ` Haomai Wang
@ 2015-06-30  6:50 ` Dałek, Piotr
  2 siblings, 0 replies; 28+ messages in thread
From: Dałek, Piotr @ 2015-06-30  6:50 UTC (permalink / raw)
  To: ceph-devel

> -----Original Message-----
> From: ceph-devel-owner@vger.kernel.org [mailto:ceph-devel-
> owner@vger.kernel.org] On Behalf Of James (Fei) Liu-SSI
> Sent: Friday, June 26, 2015 12:01 AM
> 
> Hi Cephers,
>     It is not easy to ask when Ceph is going to support inline
> dedup/compression across OSDs in RADOS because it is not easy task and
> answered. Ceph is providing replication and EC for performance and failure
> recovery. But we also lose the efficiency  of storage store and cost associate
> with it. It is kind of contradicted with each other. But I am curious how other
> Cephers think about this question.
>    Any plan for Cephers to do anything regarding to inline
> dedupe/compression except the features brought by local node itself like
> BRTFS?

Actually, I was considering some kind of simple compression (think RLE or some fast variant of Huffman/LZW), but for data payloads in the messenger. That would help in inter-datacenter configurations (where fast, low-latency links are prohibitively expensive or not an option at all, so any extra MB/s counts and latency is already high) and with slow internal networks (1 Gbit Ethernet and so on), but I never had time to do proper research on the matter.
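
As a rough illustration of the idea, here is a minimal RLE sketch. The
one-byte flag framing and the function names are assumptions made for
the example, not part of the Ceph messenger protocol. The payload is
sent compressed only when that actually makes it smaller.

```python
def rle_encode(data: bytes) -> bytes:
    """Encode data as (count, byte) pairs, with runs capped at 255."""
    out = bytearray()
    i = 0
    while i < len(data):
        run = 1
        while i + run < len(data) and run < 255 and data[i + run] == data[i]:
            run += 1
        out += bytes((run, data[i]))
        i += run
    return bytes(out)


def rle_decode(blob: bytes) -> bytes:
    """Expand (count, byte) pairs back into the original data."""
    out = bytearray()
    for k in range(0, len(blob), 2):
        out += bytes([blob[k + 1]]) * blob[k]
    return bytes(out)


def pack_payload(data: bytes) -> bytes:
    """Prefix a flag byte: 0x01 = RLE-compressed, 0x00 = sent as-is."""
    encoded = rle_encode(data)
    if len(encoded) < len(data):
        return b"\x01" + encoded
    return b"\x00" + data


def unpack_payload(frame: bytes) -> bytes:
    """Reverse pack_payload based on the flag byte."""
    body = frame[1:]
    return rle_decode(body) if frame[:1] == b"\x01" else body


if __name__ == "__main__":
    zeros = bytes(1000)
    print(len(pack_payload(zeros)), "bytes on the wire for 1000 zero bytes")
    print(unpack_payload(pack_payload(b"abc")) == b"abc")
```

RLE only pays off on payloads with long runs, such as zero-filled
blocks, so the fallback to the raw payload matters for everything
else.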


With best regards / Pozdrawiam
Piotr Dałek

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: Inline dedup/compression
  2015-06-30  6:20           ` Blair Bethwaite
@ 2015-06-30 14:38             ` Alexandre DERUMIER
  0 siblings, 0 replies; 28+ messages in thread
From: Alexandre DERUMIER @ 2015-06-30 14:38 UTC (permalink / raw)
  To: Blair Bethwaite; +Cc: Haomai Wang, James (Fei) Liu-SSI, ceph-devel

Hi,

>>I like the idea of having a compress option implemented in e.g. librbd
>>and rgw, both of these cases involve scale-out clients and so concerns
>>of performance overhead can be largely brushed aside (e.g., most
>>OpenStack hypervisors seem to have plenty of free CPU).

Keep in mind that QEMU uses only one thread per disk, so I'm pretty sure that compression on the librbd side will hurt performance a lot for a single VM disk.
(Of course, it will scale with a lot of VMs.)




^ permalink raw reply	[flat|nested] 28+ messages in thread

* RE: Inline dedup/compression
  2015-06-30  6:19         ` Chaitanya Huilgol
@ 2015-06-30 15:31           ` Allen Samuels
  2015-06-30 15:50             ` Chaitanya Huilgol
  0 siblings, 1 reply; 28+ messages in thread
From: Allen Samuels @ 2015-06-30 15:31 UTC (permalink / raw)
  To: Chaitanya Huilgol, James (Fei) Liu-SSI, Haomai Wang; +Cc: ceph-devel

This covers the read and write; what about the delete? One of the major issues with dedupe, whether global or local, is handling the inherent ref-counting associated with sharing pieces of storage.

Allen Samuels
Software Architect, Emerging Storage Solutions 

2880 Junction Avenue, Milpitas, CA 95134
T: +1 408 801 7030| M: +1 408 780 6416
allen.samuels@SanDisk.com

-----Original Message-----
From: ceph-devel-owner@vger.kernel.org [mailto:ceph-devel-owner@vger.kernel.org] On Behalf Of Chaitanya Huilgol
Sent: Monday, June 29, 2015 11:20 PM
To: James (Fei) Liu-SSI; Haomai Wang
Cc: ceph-devel
Subject: RE: Inline dedup/compression

Below is an alternative idea, at a very high level, for dedup with Ceph without the need for a centralized hash index:

- Dedupe is set as a pool property
Write:
- Write arrives at the primary OSD/pg
- Data is segmented (rabin/static) and secure hash computed
- A manifest is created with the offset/len/hash for all the segments
- OSD/pg sends rados write with a special name <__known__prefix><secure hash> for all segments
- PG receiving dedup write will:
        1. check for object presence and create object if not present
        2. If object is already present, then a reference count is incremented (the check and increment need to be atomic)
- Response is received by original primary PG for all segments
- Primary PG writes the manifest to local and replicas or EC members
- Response sent to client

Read:
- Read received at primary PG
- Reads manifest object
- sends reads for each segment object <__known__prefix><secure hash>
- coalesces all the responses to build the required data
- Responds to client
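
The write and read steps above can be sketched in a few lines. This is only a toy, single-process model under stated assumptions, not Ceph code: `Cluster`, `PREFIX` and `SEG_SIZE` are hypothetical names, segmentation is static with a tiny segment size, SHA-256 stands in for the secure hash, and two dicts stand in for the PGs. Overwrite and delete are left out.

```python
import hashlib

PREFIX = "__dedup__"  # hypothetical stand-in for the reserved name prefix
SEG_SIZE = 4          # tiny static segment size, for illustration only


class Cluster:
    """Toy model of a pool: segment objects plus per-object manifests."""

    def __init__(self):
        self.segments = {}   # segment name -> [data, refcount]
        self.manifests = {}  # object name -> list of (offset, len, segment name)

    def write_segment(self, data):
        # The segment name is derived from the content, so identical
        # segments always map to the same object.
        name = PREFIX + hashlib.sha256(data).hexdigest()
        entry = self.segments.get(name)
        if entry is None:
            self.segments[name] = [data, 1]  # not present: create
        else:
            entry[1] += 1                    # present: only bump the refcount
        return name

    def write(self, obj_name, data):
        manifest = []
        for off in range(0, len(data), SEG_SIZE):
            seg = data[off:off + SEG_SIZE]
            manifest.append((off, len(seg), self.write_segment(seg)))
        # The manifest is written only after all segment writes returned.
        self.manifests[obj_name] = manifest

    def read(self, obj_name):
        # Fetch every segment listed in the manifest and coalesce them.
        return b"".join(self.segments[name][0]
                        for _, _, name in self.manifests[obj_name])
```

In a real cluster the check-and-increment in `write_segment` would have to be atomic on the PG that owns the segment object, as noted in the write steps.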


Pros:
No need for a centralized hash index, so it is in line with Ceph's no-bottleneck philosophy

Cons:
- Some PGs may get overloaded due to frequently occurring segment patterns
- Latency and increased traffic on the network

Regards,
Chaitanya

-----Original Message-----
From: ceph-devel-owner@vger.kernel.org [mailto:ceph-devel-owner@vger.kernel.org] On Behalf Of James (Fei) Liu-SSI
Sent: Tuesday, June 30, 2015 2:25 AM
To: Haomai Wang
Cc: ceph-devel
Subject: RE: Inline dedup/compression

Hi Haomai,
  Thanks for moving the idea forward. Regarding compression: if we compress at the client level, it is not global, and the compression is applied only to the local client, am I right? I think there are pros and cons to both solutions, and we can go into more detail on each.
  By the way, I really like your idea for dedupe on the OSD side. Let me think more about it.

 Regards,
 James

-----Original Message-----
From: Haomai Wang [mailto:haomaiwang@gmail.com]
Sent: Friday, June 26, 2015 8:55 PM
To: James (Fei) Liu-SSI
Cc: ceph-devel
Subject: Re: Inline dedup/compression

On Sat, Jun 27, 2015 at 2:03 AM, James (Fei) Liu-SSI <james.liu@ssi.samsung.com> wrote:
> Hi Haomai,
>   Thanks for your response, as always. I agree that compression is a comparatively easier task, but it is still very challenging to implement, no matter where we implement it. The client side (RBD, RADOSGW or CephFS) or the PG should be a somewhat better place to implement it, in terms of efficiency and cost reduction, because the data is compressed before it is replicated to other OSDs. There are two reasons:
> 1. It keeps the data consistent among the OSDs in one PG.
> 2. It saves computing resources.
>
> IMHO, compression should be done before replication comes into play at the pool level. However, we can also have a second level of compression in the local objectstore. The unit size of compression really depends on the workload and on the layer in which we implement it.
>
> As for inline deduplication, the complexity increases dramatically once we take replication and erasure coding into consideration.
>
> However, before we talk about implementation, it would be great to understand the pros and cons of implementing inline dedupe/compression. We all understand the benefits of dedupe/compression. The side effects are a performance hit and a need for more computing resources. It would be great if we could look at the problem from 30,000 feet, for the whole picture of Ceph. Please correct me if I am wrong.

Actually, we may have some tricks to reduce the performance hit of compression. As Joe mentioned, we can compress only the slave PG data to avoid the performance hit, but that may increase the complexity of recovery and PG remapping. Another, more detailed option: if we begin compressing data at the messenger, the OSD thread and PG thread don't access the data for a normal client op, so maybe we can run compression in parallel with PG processing. The journal thread would then receive the compressed data at the end.

The effectiveness of compression is also a concern: compressing in RADOS may not give the best compression result. If we compress in libcephfs, librbd and radosgw and keep RADOS unaware of compression, it may be simpler, and we get file/block/object-level compression. Wouldn't that be better?

About dedup, my current idea is that we could set up a memory pool on the OSD side to store checksums. Then, on the client side, we hash the object data and map that, instead of the object name, to a PG, so an object always lands on the OSD that is also responsible for its dedup storage. It could also be distributed at the pool level.
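
As a rough illustration of the content-based placement idea, here is a minimal sketch. It is an assumption-laden toy: `pg_for_content` is a hypothetical helper, SHA-256 and a plain modulo stand in for whatever hash and PG mapping Ceph would actually use, and CRUSH is not modelled at all.

```python
import hashlib


def pg_for_content(data: bytes, pg_num: int) -> int:
    """Pick a PG from the object's content instead of its name.

    Identical data always yields the same PG, so duplicates meet on the
    same OSD, which can then dedup them locally.
    """
    digest = hashlib.sha256(data).digest()
    # Use the first 8 bytes of the digest as an integer placement seed.
    return int.from_bytes(digest[:8], "big") % pg_num
```

Two clients writing the same bytes under different object names would compute the same PG, which is what lets a per-OSD checksum store catch the duplicate.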


>
> By the way, software-defined storage startups such as Hedvig and Springpath both provide inline dedupe/compression. It is not an apples-to-apples comparison, but it is a good reference. Datacenters need cost-effective solutions.
>
> Regards,
> James
>
>
>
> -----Original Message-----
> From: Haomai Wang [mailto:haomaiwang@gmail.com]
> Sent: Thursday, June 25, 2015 8:08 PM
> To: James (Fei) Liu-SSI
> Cc: ceph-devel
> Subject: Re: Inline dedup/compression
>
> On Fri, Jun 26, 2015 at 6:01 AM, James (Fei) Liu-SSI <james.liu@ssi.samsung.com> wrote:
>> Hi Cephers,
>>     It is not easy to ask when Ceph is going to support inline dedup/compression across OSDs in RADOS, because it is neither an easy task nor an easy question to answer. Ceph provides replication and EC for performance and failure recovery, but with them we also lose storage efficiency and pay the associated cost. The two goals rather contradict each other. I am curious what other Cephers think about this question.
>>    Is there any plan for Ceph to do anything about inline dedupe/compression, beyond the features provided by the local node itself, such as Btrfs?
>
> Compression is easier to implement in RADOS than dedup. The most important question for compression is where we begin to compress: client, PG or objectstore. Then we need to decide how large the compression unit is. Of course, compression and dedup would both prefer a key/value-like storage API, but I think it's not difficult to use the existing objectstore API.
>
> Dedup is more feasible to implement within a local OSD than across the whole pool or cluster; if we want dedup at the pool level, we need to do it from the client.
>
>>
>>   Regards,
>>   James
>
>
>
> --
> Best Regards,
>
> Wheat



--
Best Regards,

Wheat

________________________________

PLEASE NOTE: The information contained in this electronic mail message is intended only for the use of the designated recipient(s) named above. If the reader of this message is not the intended recipient, you are hereby notified that you have received this message in error and that any review, dissemination, distribution, or copying of this message is strictly prohibited. If you have received this communication in error, please notify the sender by telephone or e-mail (as shown above) immediately and destroy any and all copies of this message in your possession (whether hard copies or electronically stored copies).

^ permalink raw reply	[flat|nested] 28+ messages in thread

* RE: Inline dedup/compression
  2015-06-30 15:31           ` Allen Samuels
@ 2015-06-30 15:50             ` Chaitanya Huilgol
  2015-06-30 22:29               ` James (Fei) Liu-SSI
  0 siblings, 1 reply; 28+ messages in thread
From: Chaitanya Huilgol @ 2015-06-30 15:50 UTC (permalink / raw)
  To: Allen Samuels, James (Fei) Liu-SSI, Haomai Wang; +Cc: ceph-devel


- Reference count has to be maintained as an attribute of the object
- As mentioned in the write workflow, duplicate segment writes increment the reference count
- Object Delete would result in delete on constituent segments listed in the object segment manifest
- Segment object delete will decrement reference count and remove the segment when there are no more references present 
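
The delete path in the four points above could look roughly like this. It is a sketch under assumptions, not an implementation: `SegmentStore`, `put`, `release` and `delete_object` are hypothetical names, the reference count lives in a dict rather than in an object attribute, and nothing here is atomic or crash-safe.

```python
class SegmentStore:
    """Toy segment store with one reference count per segment object."""

    def __init__(self):
        self.data = {}  # segment name -> payload
        self.refs = {}  # segment name -> reference count

    def put(self, name, payload):
        # A duplicate segment write only increments the reference count.
        if name in self.refs:
            self.refs[name] += 1
        else:
            self.data[name] = payload
            self.refs[name] = 1

    def release(self, name):
        # A segment delete decrements the count and removes the segment
        # once no references are left.
        self.refs[name] -= 1
        if self.refs[name] == 0:
            del self.refs[name]
            del self.data[name]


def delete_object(manifests, store, obj_name):
    """Delete an object by releasing every segment in its manifest."""
    for seg_name in manifests.pop(obj_name):
        store.release(seg_name)
```

A segment shared by two objects survives the first delete and disappears with the second, which is the behaviour the reference count is meant to guarantee.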

Regards,
Chaitanya


^ permalink raw reply	[flat|nested] 28+ messages in thread

* RE: Inline dedup/compression
  2015-06-30 15:50             ` Chaitanya Huilgol
@ 2015-06-30 22:29               ` James (Fei) Liu-SSI
  2015-07-01 13:46                 ` Ning Yao
  2015-07-02 10:50                 ` Chaitanya Huilgol
  0 siblings, 2 replies; 28+ messages in thread
From: James (Fei) Liu-SSI @ 2015-06-30 22:29 UTC (permalink / raw)
  To: Chaitanya Huilgol, Allen Samuels, Haomai Wang; +Cc: ceph-devel

Hi Chaitanya,
   Very interesting thoughts. I am not sure whether I understood all of them. Here are several questions about the solution you proposed; they may be a little detailed.

    Regards,
    James

- Dedupe is set as a pool property
Write:
- Write arrives at the primary OSD/pg
[James] Does OSD/PG mean the PG backend here?

- Data is segmented (rabin/static) and secure hash computed
[James] In which OSD component are you going to do the data segmentation and hash computation?

- A manifest is created with the offset/len/hash for all the segments
[James] Is the manifest going to be part of the object's xattrs? Where are you going to save the manifest?

- OSD/pg sends rados write with a special name <__known__prefix><secure hash> for all segments
[James] What do you mean by a rados write? Where are all the segments with a secure hash signature written to?

- PG receiving dedup write will:
        1. check for object presence and create object if not present
        2. If object is already present, then a reference count is incremented (the check and increment need to be atomic)
[James] That makes sense. But I was wondering whether the unit of dedupe is the segment or the object. If it is object-based, it totally makes sense. But then why do we need segments with a manifest?

- Response is received by original primary PG for all segments
[James] What response?

- Primary PG writes the manifest to local and replicas or EC members
[James] What about the dedupe data if the data is not present in the replicas?
 
- Response sent to client

Read:
- Read received at primary PG
[James] Can the read only fetch data from the primary PG?
- Reads manifest object

- sends reads for each segment object <__known__prefix><secure hash>
- coalesces all the responses to build the required data
- Responds to client


Pros:
No need for a centralized hash index, so it is in line with Ceph's no-bottleneck philosophy

Cons:
- Some PGs may get overloaded due to frequently occurring segment patterns
- Latency and increased traffic on the network
   



^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: Inline dedup/compression
  2015-06-30 22:29               ` James (Fei) Liu-SSI
@ 2015-07-01 13:46                 ` Ning Yao
  2015-07-02 10:50                 ` Chaitanya Huilgol
  1 sibling, 0 replies; 28+ messages in thread
From: Ning Yao @ 2015-07-01 13:46 UTC (permalink / raw)
  To: James (Fei) Liu-SSI
  Cc: Chaitanya Huilgol, Allen Samuels, Haomai Wang, ceph-devel

For compression, I prefer to implement it in the EC pool. It is much
easier because objects in an EC pool are already striped, and this is
what we have already finished (it is now in testing). Also, only append
writes are allowed in EC, which makes it convenient to implement.
Moreover, as Haomai mentioned, compression in a replicated pool induces
a large read/write penalty and fragments the data. The purpose of
compression is to save storage and bandwidth, so the pros and cons
depend on the service you provide. In the VDI case, which involves lots
of small reads and writes, compressing in the replicated pool is not a
smart decision, so we apply compression only in EC. I think a kv-store,
like rocksdb or leveldb, is much more suitable for a replicated pool if
compression needs to be done there.
The implementation looks like this:
Write:
      1) The EC pool does copy_from() on the object from the hot pool
      2) Compress the data stripe by stripe and calculate hash_info for
the object, as well as compress_info (which maintains (off, len) pairs
mapping the compressed object to the content of the original object;
some other info is also needed, such as the compression algorithm)
      3) Encode compress_info into a bufferlist and treat it as a
setattr transaction
Read:
may be a proxy read or a promotion:
      Promotion:
          copy the whole compressed object with its compress_info to
ReplicatedPG and submit a promotion_write transaction (just like a
normal write transaction, but decompression is needed before writing to
FileStore)
      Proxy read:
          may return compress_info and the content read to the
replicated pool, which decompresses it, selects the required data and
sends it back to the client.
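
To make the role of compress_info concrete, here is a small sketch of per-stripe compression with an (off, len) index. Everything in it is assumed for illustration: `STRIPE`, `compress_by_stripe` and `read_range` are hypothetical names, zlib stands in for the compression algorithm, and the index is a plain Python list rather than an encoded object attribute.

```python
import zlib

STRIPE = 8  # hypothetical stripe width in bytes, tiny for illustration


def compress_by_stripe(data):
    """Compress each stripe on its own and record where it ends up.

    Returns the compressed blob and compress_info, a list with one
    (off, len) pair per stripe giving its position inside the blob.
    """
    blob = bytearray()
    compress_info = []
    for off in range(0, len(data), STRIPE):
        packed = zlib.compress(data[off:off + STRIPE])
        compress_info.append((len(blob), len(packed)))
        blob += packed
    return bytes(blob), compress_info


def read_range(blob, compress_info, off, length):
    """Serve a read of the original data from the compressed blob."""
    if length <= 0:
        return b""
    first = off // STRIPE
    last = (off + length - 1) // STRIPE
    # Decompress only the stripes that overlap the requested range.
    plain = b"".join(zlib.decompress(blob[o:o + n])
                     for o, n in compress_info[first:last + 1])
    start = off - first * STRIPE
    return plain[start:start + length]
```

Because each stripe is compressed independently, a proxy read only has to decompress the stripes that overlap the requested range, then select the required bytes.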


Glad to hear a new idea for pool-based dedup from Chaitanya. It
seems that the thought is to maintain the large number of manifest
objects just like current rados objects and distribute them among all
OSDs. Am I right? It lengthens the request path but reduces the
complexity. If we introduce a centralized in-memory index server
cluster, it may become too complex and hard to maintain and scale. In
the end, it would become a second MDS, and with that much complexity it
could not be done well.

^ permalink raw reply	[flat|nested] 28+ messages in thread

* RE: Inline dedup/compression
  2015-06-30 22:29               ` James (Fei) Liu-SSI
  2015-07-01 13:46                 ` Ning Yao
@ 2015-07-02 10:50                 ` Chaitanya Huilgol
  2015-07-03  5:13                   ` Allen Samuels
  1 sibling, 1 reply; 28+ messages in thread
From: Chaitanya Huilgol @ 2015-07-02 10:50 UTC (permalink / raw)
  To: James (Fei) Liu-SSI, Allen Samuels, Haomai Wang; +Cc: ceph-devel

Hi James et al.,

Here is an example for clarity:
1. Client writes object object.abcd
2. Based on the crush rules, say OSD.a is the primary OSD which receives the write
3. OSD.a performs segmenting/fingerprinting, which can be static or dynamic, and generates a list of segments; object.abcd is now represented by a manifest object with the list of segment hashes and lengths
 [Header] 
 [Seg1_sha, len]
 [Seg2_sha, len]
 ...
 [Seg3_sha, len]
4. OSD.a writes each segment as a new object in the cluster with object name  <reserved_dedupe_perfix><sha>
5. The dedupe object write is treated differently from regular object writes, If the object is present then an object reference count is incremented and the object is not overwritten - this forms the basis of the dedupe logic. Multiple objects with one or more same constituent segments start sharing the segment objects.
6. Once all the segments are successfully written the object 'object.abcd' is now just a stub object with the segment manifest as described above and is goes through a regular object write sequence 
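
The six steps above can be sketched roughly as follows. This is an in-memory toy in Python, not OSD code: the Cluster class, the prefix string and the tiny fixed segment size are all hypothetical, and overwriting an existing object is deliberately not handled.

```python
import hashlib

DEDUPE_PREFIX = "__dedupe__"  # hypothetical reserved prefix
SEG_SIZE = 4                  # tiny static segment size, for demonstration


class Cluster:
    """Toy stand-in for the cluster: a flat name -> value map."""

    def __init__(self):
        self.objects = {}  # object name -> bytes (segment) or list (manifest)
        self.refs = {}     # segment object name -> reference count

    def write_segment(self, data):
        # Step 4/5: segment objects are named by their content hash.
        sha = hashlib.sha256(data).hexdigest()
        name = DEDUPE_PREFIX + sha
        if name in self.objects:
            self.refs[name] += 1       # already present: only bump the count
        else:
            self.objects[name] = data  # first writer stores the data
            self.refs[name] = 1
        return sha

    def write_object(self, name, data):
        # Step 3: static segmentation and fingerprinting.
        manifest = []
        for off in range(0, len(data), SEG_SIZE):
            seg = data[off:off + SEG_SIZE]
            manifest.append((self.write_segment(seg), len(seg)))
        # Step 6: the object itself becomes a stub holding the manifest.
        self.objects[name] = manifest
        return manifest
```

In a real cluster the check-and-increment in write_segment would have to be atomic on the PG that owns the segment object; here a single-threaded dict hides that problem.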

Partial writes on objects will be complicated:
- Partially affected segments will have to be read, and the segmentation logic has to be run from the first to the last affected segment boundary
- New segments will be written
- Old overwritten segments have to be deleted
- The merged manifest of the object is written

All this will need the protection of the PG lock. An additional journaling mechanism will also be needed to recover from cases where the osd goes down before writing all the segments.
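
As a rough illustration of the first step of a partial write, the helper below finds which manifest entries a write touches. It is only a sketch in Python; the manifest layout (a list of (sha, len) pairs) and the function name are assumptions, and locking and journaling are left out entirely.

```python
def affected_segments(manifest, off, length):
    """Return (first, last) indices of the manifest entries that overlap
    the write range [off, off + length), or (None, None) if the write
    lies entirely past the current end of the object (a pure append)."""
    end = off + length
    first = last = None
    pos = 0  # byte offset of the current segment within the object
    for i, (_sha, seg_len) in enumerate(manifest):
        if pos < end and pos + seg_len > off:
            if first is None:
                first = i
            last = i
        pos += seg_len
    return first, last
```

The segments from first to last would then be read, merged with the new data and re-segmented, after which the old segment references are dropped and the merged manifest is written.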

Since this is quite a lot of processing, a better use case for this dedupe mechanism would be in the data tiering model with object redirects.
The manifest object fits quite well into the object redirects scheme of things. The idea is that, when an object is moved out of the base tier, you have an option to create a dedupe stub object and write individual segments into the cold backend tier with a rados plugin.

Remaining responses inline.

Regards,
Chaitanya

-----Original Message-----
From: James (Fei) Liu-SSI [mailto:james.liu@ssi.samsung.com] 
Sent: Wednesday, July 01, 2015 4:00 AM
To: Chaitanya Huilgol; Allen Samuels; Haomai Wang
Cc: ceph-devel
Subject: RE: Inline dedup/compression

Hi Chaitanya,
   Very interesting thoughts. I am not sure whether I get all of them or not. Here are several questions about the solution you provided; they might be a little bit detailed.

    Regards,
    James

- Dedupe is set as a pool property
Write:
- Write arrives at the primary OSD/pg
[James] Does the OSD/PG mean PG Backend over here? 
[Chaitanya] I mean the Primary OSD and the PG which get selected by the crush - not the specific OSD component

- Data is segmented (rabin/static) and secure hash computed
[James] In which component of the OSD are you going to do the data segmentation and hash computation?
[Chaitanya] If partial writes are not supported then this could be done before acquiring the PG lock; otherwise we need the protection of the PG lock. Probably in the do_request() path?

- A manifest is created with the offset/len/hash for all the segments
[James] Is the manifest going to be part of the xattr of the object? Where are you going to save the manifest?
[Chaitanya] The manifest is a stub object with the constituent segments list

- OSD/pg sends rados write with a special name <__known__prefix><secure hash> for all segments
[James] What do you mean by Rados Write? Where do all the segments with a secure hash signature get written to?
[Chaitanya] All segments are unique objects with the above-mentioned naming scheme; they get written back into the cluster as regular client rados object writes

- PG receiving dedup write will:
        1. check for object presence and create object if not present
        2. If the object is already present, then a reference count is incremented (check and increment need to be atomic)
[James] It makes sense. But I was wondering whether the unit for dedupe is the segment or the object? If it is object based, it totally makes sense. However, why do we need segments with a manifest?

- Response is received by original primary PG for all segments
[James] What response?
[Chaitanya] Write response indicating the status of the segment object write

- Primary PG writes the manifest to local and replicas or EC members
[James] How about the dedupe data if the data is not present in replicas?
[Chaitanya] I am sorry, I did not get your question. The manifest object gets written on the primary and the replicas, or encoded and written to the EC members; it is afforded the protection policy set for the pool. The same is the case with the individual constituent segments.

- Response sent to client

Read:
- Read received at primary PG
[James]  The read can only fetch data from Primary PG?
- Reads manifest object

- sends reads for each segment object <__known__prefix><secure hash>
- coalesces all the response to build the required data
- Responds to client
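
The read path above could be sketched like this. It is a Python toy under the same assumptions as the rest of the proposal: a manifest is taken to be a list of (sha, len) pairs, the store is a plain dict, and the prefix string is hypothetical.

```python
import hashlib

PREFIX = "__dedupe__"  # hypothetical reserved prefix for segment objects


def read_object(store, name, off=0, length=None):
    """Read the manifest, fetch only the segments that overlap the
    requested range, and coalesce them into one buffer."""
    manifest = store[name]  # list of (sha, seg_len)
    total = sum(seg_len for _sha, seg_len in manifest)
    end = total if length is None else min(total, off + length)
    out = bytearray()
    pos = 0
    for sha, seg_len in manifest:
        seg_start, seg_end = pos, pos + seg_len
        pos = seg_end
        if seg_end <= off or seg_start >= end:
            continue  # segment lies outside the requested range
        seg = store[PREFIX + sha]  # one read per needed segment object
        out += seg[max(off, seg_start) - seg_start:min(end, seg_end) - seg_start]
    return bytes(out)
```

Each lookup of a segment object stands in for a network read to another PG, which is where the extra latency listed under the cons comes from.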

Pros:
No need for a centralized hash index, so in line with Ceph's no-bottleneck philosophy

Cons:
Some PGs may get overloaded due to frequently occurring segment patterns
Latency and increased traffic on the network

-----Original Message-----
From: Chaitanya Huilgol [mailto:Chaitanya.Huilgol@sandisk.com]
Sent: Tuesday, June 30, 2015 8:50 AM
To: Allen Samuels; James (Fei) Liu-SSI; Haomai Wang
Cc: ceph-devel
Subject: RE: Inline dedup/compression

- Reference count has to be maintained as an attribute of the object
- As mentioned in the write workflow, duplicate segment writes increment the reference count
- Object Delete would result in delete on constituent segments listed in the object segment manifest
- Segment object delete will decrement reference count and remove the segment when there are no more references present 
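
A small sketch of the delete flow just described, again as a Python toy. The dict-based objects/refs maps, the manifest layout and the prefix are assumptions for illustration; in the proposal the reference count would live as an attribute on the segment object itself.

```python
def delete_object(objects, refs, name, prefix="__dedupe__"):
    """Delete a stub object and release its segment references.

    objects: name -> bytes (segment) or list of (sha, len) (manifest)
    refs:    segment object name -> reference count
    """
    manifest = objects.pop(name)
    for sha, _seg_len in manifest:
        seg = prefix + sha
        refs[seg] -= 1
        if refs[seg] == 0:
            # Last reference gone: the segment data can be removed.
            del refs[seg]
            del objects[seg]
```

A segment that appears twice in the same manifest is decremented twice, mirroring the two increments it received on write.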

Regards,
Chaitanya

-----Original Message-----
From: Allen Samuels
Sent: Tuesday, June 30, 2015 9:02 PM
To: Chaitanya Huilgol; James (Fei) Liu-SSI; Haomai Wang
Cc: ceph-devel
Subject: RE: Inline dedup/compression

This covers the read and write, what about the delete? One of the major issues with Dedupe, whether global or local is to address the inherent ref-counting associated with sharing of pieces of storage.

Allen Samuels
Software Architect, Emerging Storage Solutions 

2880 Junction Avenue, Milpitas, CA 95134
T: +1 408 801 7030| M: +1 408 780 6416
allen.samuels@SanDisk.com

-----Original Message-----
From: ceph-devel-owner@vger.kernel.org [mailto:ceph-devel-owner@vger.kernel.org] On Behalf Of Chaitanya Huilgol
Sent: Monday, June 29, 2015 11:20 PM
To: James (Fei) Liu-SSI; Haomai Wang
Cc: ceph-devel
Subject: RE: Inline dedup/compression

Below is an alternative idea, at a very high level, for dedup with ceph without the need for a centralized hash index:

- Dedupe is set as a pool property
Write:
- Write arrives at the primary OSD/pg
- Data is segmented (rabin/static) and secure hash computed
- A manifest is created with the offset/len/hash for all the segments
- OSD/pg sends rados write with a special name <__known__prefix><secure hash> for all segments
- PG receiving dedup write will:
        1. check for object presence and create object if not present
        2. If object is already present, then an reference count is incremented (check and increment needs to be atomic)
- Response is received by original primary PG for all segments
- Primary PG writes the manifest to local and replicas or EC members
- Response sent to client

Read:
- Read received at primary PG
- Reads manifest object
- sends reads for each segment object <__known__prefix><secure hash>
- coalesces all the response to build the required data
- Responds to client

Pros:
No need for a centralized hash index, so in line with Ceph's no-bottleneck philosophy

Cons:
Some PGs may get overloaded due to frequently occurring segment patterns
Latency and increased traffic on the network

Regards,
Chaitanya

-----Original Message-----
From: ceph-devel-owner@vger.kernel.org [mailto:ceph-devel-owner@vger.kernel.org] On Behalf Of James (Fei) Liu-SSI
Sent: Tuesday, June 30, 2015 2:25 AM
To: Haomai Wang
Cc: ceph-devel
Subject: RE: Inline dedup/compression

Hi Haomai,
  Thanks for moving the idea forward. Regarding the compression: if we do compression at the client level, it is not global, and the compression is only applied on the local client, am I right? I think there are pros and cons to the two solutions and we can get into more detail for each.
  I really like your idea for dedupe on the OSD side, by the way. Let me think more about it.

 Regards,
 James

-----Original Message-----
From: Haomai Wang [mailto:haomaiwang@gmail.com]
Sent: Friday, June 26, 2015 8:55 PM
To: James (Fei) Liu-SSI
Cc: ceph-devel
Subject: Re: Inline dedup/compression

On Sat, Jun 27, 2015 at 2:03 AM, James (Fei) Liu-SSI <james.liu@ssi.samsung.com> wrote:
> Hi Haomai,
>   Thanks for your response as always. I agree compression is comparable easier task but still very challenge in terms of implementation no matter where we should implement . Client side like RBD, or RDBGW or CephFS, or PG should be a little bit better place to implementation in terms of efficiency and cost reduction before the data were duplicated to other OSDs. It has  two reasons :
> 1. Keep the data consistency among OSDs in one PG
> 2. Saving the computing resources
>
> IMHO , The compression should be accomplished before the replication come into play in pool level. However, we can also have second level of compression in the local objectstore.  In term of unit size of compression , It really depends workload and in which layer we should implement.
>
> About inline deduplication, it will dramatically increase the complexities if we bring in the replication and Erasure Coding for consideration.
>
> However, Before we talk about implementation, It would be great if we can understand the pros and cons to implement inline dedupe/compression. We all understand the benefits of dedupe/compression. However, the side effect is performance hurt and need more computing resources. It would be great if we can understand the problems from 30,000 feet high for the whole picture about the Ceph. Please correct me if I were wrong.

Actually we may have some tricks to reduce the performance hit of things like compression. As Joe mentioned, we can compress slave pg data to avoid the performance hit, but it may increase the complexity of recovery and pg remapping. Another implementation detail: if we begin compressing data in the messenger, the osd thread and pg thread won't access the data for a normal client op, so maybe we can run compression in parallel with pg processing. The journal thread will get the compressed data at the end.

The effectiveness of compression is also a concern: doing compression in rados may not give the best compression result. If we can do compression in libcephfs, librbd and radosgw and keep rados unaware of compression, it may be simpler and we can get file/block/object level compression. Wouldn't that be better?

About dedup, my current idea is that we could set up a memory pool on the osd side to store checksums. Then, on the client side, we hash the object data and map that hash to a PG instead of mapping the object name, so an object always lands on the osd that is also responsible for its dedup storage. It could also be distributed at the pool level.
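
To make the difference between name-based and content-based placement concrete, here is a simplified sketch in Python. It is not Ceph's actual placement code (which hashes the object name and then goes through CRUSH); SHA-256 modulo pg_num and both function names are stand-ins chosen for illustration.

```python
import hashlib


def pg_for_name(name, pg_num):
    """Usual style of placement: the PG is derived from the object name."""
    digest = hashlib.sha256(name.encode()).digest()
    return int.from_bytes(digest[:8], "big") % pg_num


def pg_for_content(data, pg_num):
    """Content-based placement: the PG is derived from a hash of the data,
    so identical data always lands on the same PG whatever its name."""
    digest = hashlib.sha256(data).digest()
    return int.from_bytes(digest[:8], "big") % pg_num
```

With content-based placement, two objects with different names but identical data meet on the same PG, so the duplicate can be detected locally by the osd holding that PG's checksum store.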

>
> By the way, Both of software defined storage solution startups like Hdevig and Springpath provide inline dedupe/compression.  It is not apple to apple comparison. But it is good reference. The datacenters need cost effective solution.
>
> Regards,
> James
>
>
>
> -----Original Message-----
> From: Haomai Wang [mailto:haomaiwang@gmail.com]
> Sent: Thursday, June 25, 2015 8:08 PM
> To: James (Fei) Liu-SSI
> Cc: ceph-devel
> Subject: Re: Inline dedup/compression
>
> On Fri, Jun 26, 2015 at 6:01 AM, James (Fei) Liu-SSI <james.liu@ssi.samsung.com> wrote:
>> Hi Cephers,
>>     It is not easy to ask when Ceph is going to support inline dedup/compression across OSDs in RADOS because it is not easy task and answered. Ceph is providing replication and EC for performance and failure recovery. But we also lose the efficiency  of storage store and cost associate with it. It is kind of contradicted with each other. But I am curious how other Cephers think about this question.
>>    Any plan for Cephers to do anything regarding to inline dedupe/compression except the features brought by local node itself like BRTFS?
>
> Compression is easier to implement in rados than dedup. The most important thing about compression is where we begin to compress, client, pg or objectstore. Then we need to decide how much the compress unit is. Of course, compress and dedup both like to use keyvalue-alike storage api to use, but I think it's not difficult to use existing objectstore api.
>
> Dedup is more possible to implement in local osd instead of the whole pool or cluster, and if we want to do dedup for the pool level, we need to do dedup from client.
>
>>
>>   Regards,
>>   James
>> --
>> To unsubscribe from this list: send the line "unsubscribe ceph-devel"
>> in the body of a message to majordomo@vger.kernel.org More majordomo 
>> info at  http://vger.kernel.org/majordomo-info.html
>
>
>
> --
> Best Regards,
>
> Wheat

--
Best Regards,

Wheat


^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: Inline dedup/compression
       [not found] <1840766443.51.1435851210328.JavaMail.root@thunderbeast.private.linuxbox.com>
@ 2015-07-02 15:34 ` Matt W. Benjamin
  2015-07-02 16:20   ` Chaitanya Huilgol
  0 siblings, 1 reply; 28+ messages in thread
From: Matt W. Benjamin @ 2015-07-02 15:34 UTC (permalink / raw)
  To: Chaitanya Huilgol
  Cc: ceph-devel, James (Fei) Liu-SSI, Allen Samuels, Haomai Wang

Hi Chaitanya,

Have you ruled out variants using fixed chunksize?  (Arguments for/against
fingerprinting elided.)

Matt

----- "Chaitanya Huilgol" <Chaitanya.Huilgol@sandisk.com> wrote:

> Hi James et.al ,
> 
> Here is an example for clarity, 
> 1. Client Writes object  object.abcd
> 2. Based on the crush rules, say  OSD.a is the primary OSD which
> receives the write
> 3. OSD.a  performs segmenting/fingerprinting which can be static or
> dynamic and generates a list of segments, the object.abcd is now
> represented by a manifest object with the list of segment hash and
> len
>  [Header] 
>  [Seg1_sha, len]
>  [Seg2_sha, len]
>  ...
>  [Seg3_sha, len]
> 4. OSD.a writes each segment as a new object in the cluster with
> object name  <reserved_dedupe_perfix><sha>
> 5. The dedupe object write is treated differently from regular object
> writes, If the object is present then an object reference count is
> incremented and the object is not overwritten - this forms the basis
> of the dedupe logic. Multiple objects with one or more same
> constituent segments start sharing the segment objects.
> 6. Once all the segments are successfully written the object
> 'object.abcd' is now just a stub object with the segment manifest as
> described above and is goes through a regular object write sequence 
> 
> Partial writes on objects will be complicated,
> - Partially affected segments will have to be read and segmentation
> logic has to be run from first to last affected segment boundaries
> -  New segments will be written  
> - Old overwritten segments have to be deleted
> - Write merged manifest of the object 
> 
> All this will need protection of the PG lock, Also additional
> journaling mechanism will be needed to  recover from cases where the
> osd goes down before writing all the segments. 
> 
> Since this is quite a lot of processing, a better use case for this
> dedupe mechanism would be in the data tiering model with object
> redirects.
> The manifest object fits quiet well into object redirects scheme of
> things, the idea is that, when an object is moved out of the base
> tier, you have an option to create a dedupe stub object and write
> individual segments into the cold backend tier with a rados plugin. 
> 
> Remaining responses inline.
> 
> Regards,
> Chaitanya
> 
> -----Original Message-----
> From: James (Fei) Liu-SSI [mailto:james.liu@ssi.samsung.com] 
> Sent: Wednesday, July 01, 2015 4:00 AM
> To: Chaitanya Huilgol; Allen Samuels; Haomai Wang
> Cc: ceph-devel
> Subject: RE: Inline dedup/compression
> 
> Hi Chaitanya,
>    Very interesting thoughts. I am not sure whether I get all of them
> or now. Here are several questions for the solution you provided,
> Might be a little bit detailed.
> 
>     Regards,
>     James
> 
> - Dedupe is set as a pool property
> Write:
> - Write arrives at the primary OSD/pg
> [James] Does the OSD/PG mean PG Backend over here? 
> [Chaitanya] I mean the Primary OSD and the PG which get selected by
> the crush - not the specific OSD component
> 
> - Data is segmented (rabin/static) and secure hash computed [James]
> Which component in OSD are you going to do the data segment and hash
> computation?
> [Chaitanya] If partial writes are not supported then this could be
> down before acquiring the PG lock, else we need the protection of the
> PG lock.  Probably in the do_request() path?
> 
> - A manifest is created with the offset/len/hash for all the segments
> [James] The manifest is going to be part of xattr of object? Where are
> you going to save manifest?
> [Chaitanya] The manifest is a stub object with the constituent
> segments list 
> 
> - OSD/pg sends rados write with a special name
> <__known__prefix><secure hash> for all segments [James] What's your
> meaning of Rados Wirte?  Where do the all segments with secure hash
> signature write to?
> [Chaitanya] All segments are unique objects with the above mentioned
> naming scheme, they get written back into the cluster as a regular
> client rados object write
> 
> - PG receiving dedup write will:
>         1. check for object presence and create object if not present
>         2. If object is already present, then an reference count is
> incremented (check and increment needs to be atomic) [James] It makes
> sense. But I was wondering the unit for dedupe is segment or object?
> If object base, it totally make sense. However, why we need to have
> segment with manifest?
> 
> - Response is received by original primary PG for all segments [James]
> What response?
> [Chaitanya] Write response indicating the status of the segment object
> write
> 
> - Primary PG writes the manifest to local and replicas or EC members
> [James] How about the dedupe data if the data is not present in
> replicas?
> [Chaitanya] I am sorry, I did not get your question, the manifest
> object gets written in the primary and the replicas or encoded and
> written to the EC members, it is afforded the protection policy set
> for the pool. Same is the case with the individual constituent
> segments.  
>  
> - Response sent to client
> 
> Read:
> - Read received at primary PG
> [James]  The read can only fetch data from Primary PG?
> - Reads manifest object
> 
> - sends reads for each segment object <__know_prefix><secure hash>
> - coalesces all the response to build the required data
> - Responds to client
> 
> 
> Pros:
> No need of centralized hash index so inline with ceph no bottleneck
> philosophy
> 
> Cons:
> Some PGs may get overloaded due to frequently occurring segment
> patterns Latency and increased traffic on the network
>    
> 
> 
> -----Original Message-----
> From: Chaitanya Huilgol [mailto:Chaitanya.Huilgol@sandisk.com]
> Sent: Tuesday, June 30, 2015 8:50 AM
> To: Allen Samuels; James (Fei) Liu-SSI; Haomai Wang
> Cc: ceph-devel
> Subject: RE: Inline dedup/compression
> 
> 
> - Reference count has to be maintained as an attribute of the object
> - As mentioned in the write workflow, duplicate segment writes
> increment the reference count
> - Object Delete would result in delete on constituent segments listed
> in the object segment manifest
> - Segment object delete will decrement reference count and remove the
> segment when there are no more references present 
> 
> Regards,
> Chaitanya
> 
> -----Original Message-----
> From: Allen Samuels
> Sent: Tuesday, June 30, 2015 9:02 PM
> To: Chaitanya Huilgol; James (Fei) Liu-SSI; Haomai Wang
> Cc: ceph-devel
> Subject: RE: Inline dedup/compression
> 
> This covers the read and write, what about the delete? One of the
> major issues with Dedupe, whether global or local is to address the
> inherent ref-counting associated with sharing of pieces of storage.
> 
> Allen Samuels
> Software Architect, Emerging Storage Solutions 
> 
> 2880 Junction Avenue, Milpitas, CA 95134
> T: +1 408 801 7030| M: +1 408 780 6416
> allen.samuels@SanDisk.com
> 
> -----Original Message-----
> From: ceph-devel-owner@vger.kernel.org
> [mailto:ceph-devel-owner@vger.kernel.org] On Behalf Of Chaitanya
> Huilgol
> Sent: Monday, June 29, 2015 11:20 PM
> To: James (Fei) Liu-SSI; Haomai Wang
> Cc: ceph-devel
> Subject: RE: Inline dedup/compression
> 
> Below is an alternative idea at a very high level around dedup with
> ceph without a need of centralized hash index,
> 
> - Dedupe is set as a pool property
> Write:
> - Write arrives at the primary OSD/pg
> - Data is segmented (rabin/static) and secure hash computed
> - A manifest is created with the offset/len/hash for all the segments
> - OSD/pg sends rados write with a special name
> <__known__prefix><secure hash> for all segments
> - PG receiving dedup write will:
>         1. check for object presence and create object if not present
>         2. If object is already present, then an reference count is
> incremented (check and increment needs to be atomic)
> - Response is received by original primary PG for all segments
> - Primary PG writes the manifest to local and replicas or EC members
> - Response sent to client
> 
> Read:
> - Read received at primary PG
> - Reads manifest object
> - sends reads for each segment object <__know_prefix><secure hash>
> - coalesces all the response to build the required data
> - Responds to client
> 
> 
> Pros:
> No need of centralized hash index so inline with ceph no bottleneck
> philosophy
> 
> Cons:
> Some PGs may get overloaded due to frequently occurring segment
> patterns Latency and increased traffic on the network
> 
> Regards,
> Chaitanya
> 
> -----Original Message-----
> From: ceph-devel-owner@vger.kernel.org
> [mailto:ceph-devel-owner@vger.kernel.org] On Behalf Of James (Fei)
> Liu-SSI
> Sent: Tuesday, June 30, 2015 2:25 AM
> To: Haomai Wang
> Cc: ceph-devel
> Subject: RE: Inline dedup/compression
> 
> Hi Haomai,
>   Thanks for moving the idea forward. Regarding to the compression. 
> However,  if we do compression on the client level, it is not global.
> And the compression was only applied to the local client, am I right? 
> I think there is pros and cons in two solutions and we can get into
> details more for each solution.
>   I really like your idea for dedupe in OSD side   by the way. Let me
> think more about it.
> 
>  Regards,
>  James
> 
> -----Original Message-----
> From: Haomai Wang [mailto:haomaiwang@gmail.com]
> Sent: Friday, June 26, 2015 8:55 PM
> To: James (Fei) Liu-SSI
> Cc: ceph-devel
> Subject: Re: Inline dedup/compression
> 
> On Sat, Jun 27, 2015 at 2:03 AM, James (Fei) Liu-SSI
> <james.liu@ssi.samsung.com> wrote:
> > Hi Haomai,
> >   Thanks for your response as always. I agree compression is
> comparable easier task but still very challenge in terms of
> implementation no matter where we should implement . Client side like
> RBD, or RDBGW or CephFS, or PG should be a little bit better place to
> implementation in terms of efficiency and cost reduction before the
> data were duplicated to other OSDs. It has  two reasons :
> > 1. Keep the data consistency among OSDs in one PG 2. Saving the 
> > computing resources
> >
> > IMHO , The compression should be accomplished before the replication
> come into play in pool level. However, we can also have second level
> of compression in the local objectstore.  In term of unit size of
> compression , It really depends workload and in which layer we should
> implement.
> >
> > About inline deduplication, it will dramatically increase the
> complexities if we bring in the replication and Erasure Coding for
> consideration.
> >
> > However, Before we talk about implementation, It would be great if
> we can understand the pros and cons to implement inline
> dedupe/compression. We all understand the benefits of
> dedupe/compression. However, the side effect is performance hurt and
> need more computing resources. It would be great if we can understand
> the problems from 30,000 feet high for the whole picture about the
> Ceph. Please correct me if I were wrong.
> 
> Actually we may have some tricks to reduce performance hurt like
> compression. As Joe mentioned, we can compress slave pg data to avoid
> performance hurt, but it may increase the complexity of recovery and
> pg remap things. Another in-detail implement way if we begin to
> compress data from messenger, osd thread and pg thread won't access
> data for normal client op, so maybe we can make it parallel with pg
> process. Journal thread will get the compressed data at last.
> 
> The effect of compression also is a concern, we do compression in
> rados may not get the best compression result. If we can do
> compression in libcephfs, librbd and radosgw and make rados unknown to
> compression, it maybe simpler and we can get file/block/object level
> compression. it should be better?
> 
> About dedup, my current idea is we could setup a memory pool at osd
> side for checksum store usage. Then we calculate object data and map
> to PG instead of object name at client side, so a object could always
> in a osd where it's also responsible for dedup storage. It also could
> be distributed at pool level.
> 
> 
> >
> > By the way, Both of software defined storage solution startups like
> Hdevig and Springpath provide inline dedupe/compression.  It is not
> apple to apple comparison. But it is good reference. The datacenters
> need cost effective solution.
> >
> > Regards,
> > James
> >
> >
> >
> > -----Original Message-----
> > From: Haomai Wang [mailto:haomaiwang@gmail.com]
> > Sent: Thursday, June 25, 2015 8:08 PM
> > To: James (Fei) Liu-SSI
> > Cc: ceph-devel
> > Subject: Re: Inline dedup/compression
> >
> > On Fri, Jun 26, 2015 at 6:01 AM, James (Fei) Liu-SSI
> <james.liu@ssi.samsung.com> wrote:
> >> Hi Cephers,
> >>     It is not easy to ask when Ceph is going to support inline
> dedup/compression across OSDs in RADOS because it is not easy task and
> answered. Ceph is providing replication and EC for performance and
> failure recovery. But we also lose the efficiency  of storage store
> and cost associate with it. It is kind of contradicted with each
> other. But I am curious how other Cephers think about this question.
> >>    Any plan for Cephers to do anything regarding to inline
> dedupe/compression except the features brought by local node itself
> like BRTFS?
> >
> > Compression is easier to implement in rados than dedup. The most
> important thing about compression is where we begin to compress,
> client, pg or objectstore. Then we need to decide how much the
> compress unit is. Of course, compress and dedup both like to use
> keyvalue-alike storage api to use, but I think it's not difficult to
> use existing objectstore api.
> >
> > Dedup is more possible to implement in local osd instead of the
> whole pool or cluster, and if we want to do dedup for the pool level,
> we need to do dedup from client.
> >
> >>
> >>   Regards,
> >>   James
> >> --
> >> To unsubscribe from this list: send the line "unsubscribe
> ceph-devel"
> >> in the body of a message to majordomo@vger.kernel.org More
> majordomo 
> >> info at  http://vger.kernel.org/majordomo-info.html
> >
> >
> >
> > --
> > Best Regards,
> >
> > Wheat
> 
> 
> 
> --
> Best Regards,
> 
> Wheat
> 

-- 
Matt Benjamin
CohortFS, LLC.
315 West Huron Street, Suite 140A
Ann Arbor, Michigan 48103

http://cohortfs.com

tel.  734-761-4689 
fax.  734-769-8938 
cel.  734-216-5309 

^ permalink raw reply	[flat|nested] 28+ messages in thread

* RE: Inline dedup/compression
  2015-07-02 15:34 ` Matt W. Benjamin
@ 2015-07-02 16:20   ` Chaitanya Huilgol
  0 siblings, 0 replies; 28+ messages in thread
From: Chaitanya Huilgol @ 2015-07-02 16:20 UTC (permalink / raw)
  To: Matt W. Benjamin
  Cc: ceph-devel, James (Fei) Liu-SSI, Allen Samuels, Haomai Wang

Hi Matt,

The static chunking mentioned in step 3 is fixed chunk size. Apart from the process of chunking itself, everything else in the workflow should remain the same in both cases.
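
For comparison, the two chunking styles can be sketched as below. This is an illustrative Python sketch only: the dynamic chunker uses a simple rolling byte sum as a stand-in for a real Rabin fingerprint, and the window, mask and size limits are arbitrary assumed values.

```python
def static_chunks(data, size):
    """Fixed chunk size: boundaries depend only on the byte offset."""
    return [data[i:i + size] for i in range(0, len(data), size)]


def dynamic_chunks(data, window=16, mask=0x3F, min_size=32, max_size=1024):
    """Content-defined chunking: cut where a rolling value over the last
    `window` bytes matches a bit pattern, within min/max size limits."""
    chunks, start, rolling = [], 0, 0
    for i, b in enumerate(data):
        rolling += b
        if i >= window:
            rolling -= data[i - window]  # slide the window forward
        size = i - start + 1
        if (size >= min_size and (rolling & mask) == mask) or size >= max_size:
            chunks.append(data[start:i + 1])
            start = i + 1
    if start < len(data):
        chunks.append(data[start:])  # trailing partial chunk
    return chunks
```

Either function just produces the list of segments; hashing each one, writing the segment objects and building the manifest would proceed identically afterwards.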

Regards,
Chaitanya

-----Original Message-----
From: Matt W. Benjamin [mailto:matt@cohortfs.com]
Sent: Thursday, July 02, 2015 9:05 PM
To: Chaitanya Huilgol
Cc: ceph-devel; James (Fei) Liu-SSI; Allen Samuels; Haomai Wang
Subject: Re: Inline dedup/compression

Hi Chaitanya,

Have you ruled out variants using fixed chunksize?  (Arguments for/against fingerprinting elided.)

Matt

----- "Chaitanya Huilgol" <Chaitanya.Huilgol@sandisk.com> wrote:

> Hi James et.al ,
>
> Here is an example for clarity,
> 1. Client Writes object  object.abcd
> 2. Based on the crush rules, say  OSD.a is the primary OSD which
> receives the write
> 3. OSD.a  performs segmenting/fingerprinting which can be static or
> dynamic and generates a list of segments, the object.abcd is now
> represented by a manifest object with the list of segment hash and
> len
>  [Header]
>  [Seg1_sha, len]
>  [Seg2_sha, len]
>  ...
>  [Seg3_sha, len]
> 4. OSD.a writes each segment as a new object in the cluster with
> object name  <reserved_dedupe_perfix><sha>
> 5. The dedupe object write is treated differently from regular object
> writes, If the object is present then an object reference count is
> incremented and the object is not overwritten - this forms the basis
> of the dedupe logic. Multiple objects with one or more same
> constituent segments start sharing the segment objects.
> 6. Once all the segments are successfully written the object
> 'object.abcd' is now just a stub object with the segment manifest as
> described above and is goes through a regular object write sequence
>
> Partial writes on objects will be complicated,
> - Partially affected segments will have to be read and segmentation
> logic has to be run from first to last affected segment boundaries
> -  New segments will be written
> - Old overwritten segments have to be deleted
> - Write merged manifest of the object
>
> All this will need protection of the PG lock, Also additional
> journaling mechanism will be needed to  recover from cases where the
> osd goes down before writing all the segments.
>
> Since this is quite a lot of processing, a better use case for this
> dedupe mechanism would be in the data tiering model with object
> redirects.
> The manifest object fits quiet well into object redirects scheme of
> things, the idea is that, when an object is moved out of the base
> tier, you have an option to create a dedupe stub object and write
> individual segments into the cold backend tier with a rados plugin.
>
> Remaining responses inline.
>
> Regards,
> Chaitanya
>
> -----Original Message-----
> From: James (Fei) Liu-SSI [mailto:james.liu@ssi.samsung.com]
> Sent: Wednesday, July 01, 2015 4:00 AM
> To: Chaitanya Huilgol; Allen Samuels; Haomai Wang
> Cc: ceph-devel
> Subject: RE: Inline dedup/compression
>
> Hi Chaitanya,
>    Very interesting thoughts. I am not sure whether I get all of them
> or now. Here are several questions for the solution you provided,
> Might be a little bit detailed.
>
>     Regards,
>     James
>
> - Dedupe is set as a pool property
> Write:
> - Write arrives at the primary OSD/pg
> [James] Does the OSD/PG mean PG Backend over here?
> [Chaitanya] I mean the Primary OSD and the PG which get selected by
> the crush - not the specific OSD component
>
> - Data is segmented (rabin/static) and secure hash computed [James]
> Which component in OSD are you going to do the data segment and hash
> computation?
> [Chaitanya] If partial writes are not supported then this could be
> down before acquiring the PG lock, else we need the protection of the
> PG lock.  Probably in the do_request() path?
>
> - A manifest is created with the offset/len/hash for all the segments
> [James] The manifest is going to be part of xattr of object? Where are
> you going to save manifest?
> [Chaitanya] The manifest is a stub object with the constituent
> segments list
>
> - OSD/pg sends rados write with a special name
> <__known__prefix><secure hash> for all segments [James] What's your
> meaning of Rados Wirte?  Where do the all segments with secure hash
> signature write to?
> [Chaitanya] All segments are unique objects with the above mentioned
> naming scheme, they get written back into the cluster as a regular
> client rados object write
>
> - PG receiving dedup write will:
>         1. check for object presence and create object if not present
>         2. If object is already present, then an reference count is
> incremented (check and increment needs to be atomic) [James] It makes
> sense. But I was wondering the unit for dedupe is segment or object?
> If object base, it totally make sense. However, why we need to have
> segment with manifest?
>
> - Response is received by original primary PG for all segments [James]
> What response?
> [Chaitanya] Write response indicating the status of the segment object
> write
>
> - Primary PG writes the manifest to local and replicas or EC members
> [James] How about the dedupe data if the data is not present in
> replicas?
> [Chaitanya] I am sorry, I did not get your question, the manifest
> object gets written in the primary and the replicas or encoded and
> written to the EC members, it is afforded the protection policy set
> for the pool. Same is the case with the individual constituent
> segments.
>
> - Response sent to client
>
> Read:
> - Read received at primary PG
> [James]  The read can only fetch data from Primary PG?
> - Reads manifest object
>
> - sends reads for each segment object <__know_prefix><secure hash>
> - coalesces all the response to build the required data
> - Responds to client
>
>
> Pros:
> No need of centralized hash index so inline with ceph no bottleneck
> philosophy
>
> Cons:
> Some PGs may get overloaded due to frequently occurring segment
> patterns Latency and increased traffic on the network
>
>
>
> -----Original Message-----
> From: Chaitanya Huilgol [mailto:Chaitanya.Huilgol@sandisk.com]
> Sent: Tuesday, June 30, 2015 8:50 AM
> To: Allen Samuels; James (Fei) Liu-SSI; Haomai Wang
> Cc: ceph-devel
> Subject: RE: Inline dedup/compression
>
>
> - Reference count has to be maintained as an attribute of the object
> - As mentioned in the write workflow, duplicate segment writes
> increment the reference count
> - Object Delete would result in delete on constituent segments listed
> in the object segment manifest
> - Segment object delete will decrement reference count and remove the
> segment when there are no more references present
>
> Regards,
> Chaitanya
>
> -----Original Message-----
> From: Allen Samuels
> Sent: Tuesday, June 30, 2015 9:02 PM
> To: Chaitanya Huilgol; James (Fei) Liu-SSI; Haomai Wang
> Cc: ceph-devel
> Subject: RE: Inline dedup/compression
>
> This covers the read and write, what about the delete? One of the
> major issues with Dedupe, whether global or local is to address the
> inherent ref-counting associated with sharing of pieces of storage.
>
> Allen Samuels
> Software Architect, Emerging Storage Solutions
>
> 2880 Junction Avenue, Milpitas, CA 95134
> T: +1 408 801 7030| M: +1 408 780 6416 allen.samuels@SanDisk.com
>
> -----Original Message-----
> From: ceph-devel-owner@vger.kernel.org
> [mailto:ceph-devel-owner@vger.kernel.org] On Behalf Of Chaitanya
> Huilgol
> Sent: Monday, June 29, 2015 11:20 PM
> To: James (Fei) Liu-SSI; Haomai Wang
> Cc: ceph-devel
> Subject: RE: Inline dedup/compression
>
> Below is an alternative idea at a very high level around dedup with
> ceph without a need of centralized hash index,
>
> - Dedupe is set as a pool property
> Write:
> - Write arrives at the primary OSD/pg
> - Data is segmented (rabin/static) and secure hash computed
> - A manifest is created with the offset/len/hash for all the segments
> - OSD/pg sends rados write with a special name
> <__known__prefix><secure hash> for all segments
> - PG receiving dedup write will:
>         1. check for object presence and create object if not present
>         2. If object is already present, then an reference count is
> incremented (check and increment needs to be atomic)
> - Response is received by original primary PG for all segments
> - Primary PG writes the manifest to local and replicas or EC members
> - Response sent to client
>
> Read:
> - Read received at primary PG
> - Reads manifest object
> - sends reads for each segment object <__know_prefix><secure hash>
> - coalesces all the response to build the required data
> - Responds to client
>
>
> Pros:
> No need of centralized hash index so inline with ceph no bottleneck
> philosophy
>
> Cons:
> Some PGs may get overloaded due to frequently occurring segment
> patterns Latency and increased traffic on the network
>
> Regards,
> Chaitanya
>
> -----Original Message-----
> From: ceph-devel-owner@vger.kernel.org
> [mailto:ceph-devel-owner@vger.kernel.org] On Behalf Of James (Fei)
> Liu-SSI
> Sent: Tuesday, June 30, 2015 2:25 AM
> To: Haomai Wang
> Cc: ceph-devel
> Subject: RE: Inline dedup/compression
>
> Hi Haomai,
>   Thanks for moving the idea forward. Regarding to the compression.
> However,  if we do compression on the client level, it is not global.
> And the compression was only applied to the local client, am I right?
> I think there is pros and cons in two solutions and we can get into
> details more for each solution.
>   I really like your idea for dedupe in OSD side   by the way. Let me
> think more about it.
>
>  Regards,
>  James
>
> -----Original Message-----
> From: Haomai Wang [mailto:haomaiwang@gmail.com]
> Sent: Friday, June 26, 2015 8:55 PM
> To: James (Fei) Liu-SSI
> Cc: ceph-devel
> Subject: Re: Inline dedup/compression
>
> On Sat, Jun 27, 2015 at 2:03 AM, James (Fei) Liu-SSI
> <james.liu@ssi.samsung.com> wrote:
> > Hi Haomai,
> >   Thanks for your response as always. I agree compression is
> comparable easier task but still very challenge in terms of
> implementation no matter where we should implement . Client side like
> RBD, or RDBGW or CephFS, or PG should be a little bit better place to
> implementation in terms of efficiency and cost reduction before the
> data were duplicated to other OSDs. It has  two reasons :
> > 1. Keep the data consistency among OSDs in one PG 2. Saving the
> > computing resources
> >
> > IMHO , The compression should be accomplished before the replication
> come into play in pool level. However, we can also have second level
> of compression in the local objectstore.  In term of unit size of
> compression , It really depends workload and in which layer we should
> implement.
> >
> > About inline deduplication, it will dramatically increase the
> complexities if we bring in the replication and Erasure Coding for
> consideration.
> >
> > However, Before we talk about implementation, It would be great if
> we can understand the pros and cons to implement inline
> dedupe/compression. We all understand the benefits of
> dedupe/compression. However, the side effect is performance hurt and
> need more computing resources. It would be great if we can understand
> the problems from 30,000 feet high for the whole picture about the
> Ceph. Please correct me if I were wrong.
>
> Actually we may have some tricks to reduce performance hurt like
> compression. As Joe mentioned, we can compress slave pg data to avoid
> performance hurt, but it may increase the complexity of recovery and
> pg remap things. Another in-detail implement way if we begin to
> compress data from messenger, osd thread and pg thread won't access
> data for normal client op, so maybe we can make it parallel with pg
> process. Journal thread will get the compressed data at last.
>
> The effect of compression also is a concern, we do compression in
> rados may not get the best compression result. If we can do
> compression in libcephfs, librbd and radosgw and make rados unknown to
> compression, it maybe simpler and we can get file/block/object level
> compression. it should be better?
>
> About dedup, my current idea is we could setup a memory pool at osd
> side for checksum store usage. Then we calculate object data and map
> to PG instead of object name at client side, so a object could always
> in a osd where it's also responsible for dedup storage. It also could
> be distributed at pool level.
>
>
> >
> > By the way, Both of software defined storage solution startups like
> Hdevig and Springpath provide inline dedupe/compression.  It is not
> apple to apple comparison. But it is good reference. The datacenters
> need cost effective solution.
> >
> > Regards,
> > James
> >
> >
> >
> > -----Original Message-----
> > From: Haomai Wang [mailto:haomaiwang@gmail.com]
> > Sent: Thursday, June 25, 2015 8:08 PM
> > To: James (Fei) Liu-SSI
> > Cc: ceph-devel
> > Subject: Re: Inline dedup/compression
> >
> > On Fri, Jun 26, 2015 at 6:01 AM, James (Fei) Liu-SSI
> <james.liu@ssi.samsung.com> wrote:
> >> Hi Cephers,
> >>     It is not easy to ask when Ceph is going to support inline
> dedup/compression across OSDs in RADOS because it is not easy task and
> answered. Ceph is providing replication and EC for performance and
> failure recovery. But we also lose the efficiency  of storage store
> and cost associate with it. It is kind of contradicted with each
> other. But I am curious how other Cephers think about this question.
> >>    Any plan for Cephers to do anything regarding to inline
> dedupe/compression except the features brought by local node itself
> like BRTFS?
> >
> > Compression is easier to implement in rados than dedup. The most
> important thing about compression is where we begin to compress,
> client, pg or objectstore. Then we need to decide how much the
> compress unit is. Of course, compress and dedup both like to use
> keyvalue-alike storage api to use, but I think it's not difficult to
> use existing objectstore api.
> >
> > Dedup is more possible to implement in local osd instead of the
> whole pool or cluster, and if we want to do dedup for the pool level,
> we need to do dedup from client.
> >
> >>
> >>   Regards,
> >>   James
> >> --
> >> To unsubscribe from this list: send the line "unsubscribe
> ceph-devel"
> >> in the body of a message to majordomo@vger.kernel.org More
> majordomo
> >> info at  http://vger.kernel.org/majordomo-info.html
> >
> >
> >
> > --
> > Best Regards,
> >
> > Wheat
>
>
>
> --
> Best Regards,
>
> Wheat
>
> ________________________________
>
> PLEASE NOTE: The information contained in this electronic mail message
> is intended only for the use of the designated recipient(s) named
> above. If the reader of this message is not the intended recipient,
> you are hereby notified that you have received this message in error
> and that any review, dissemination, distribution, or copying of this
> message is strictly prohibited. If you have received this
> communication in error, please notify the sender by telephone or
> e-mail (as shown above) immediately and destroy any and all copies of
> this message in your possession (whether hard copies or electronically
> stored copies).
>

--
Matt Benjamin
CohortFS, LLC.
315 West Huron Street, Suite 140A
Ann Arbor, Michigan 48103

http://cohortfs.com

tel.  734-761-4689
fax.  734-769-8938
cel.  734-216-5309



^ permalink raw reply	[flat|nested] 28+ messages in thread

* RE: Inline dedup/compression
  2015-07-02 10:50                 ` Chaitanya Huilgol
@ 2015-07-03  5:13                   ` Allen Samuels
  2015-08-21  2:51                     ` Haomai Wang
  0 siblings, 1 reply; 28+ messages in thread
From: Allen Samuels @ 2015-07-03  5:13 UTC (permalink / raw)
  To: Chaitanya Huilgol, James (Fei) Liu-SSI, Haomai Wang; +Cc: ceph-devel

For non-overwriting, relatively large objects, this scheme works fine. Unfortunately, the real use case for deduplication is block storage with virtualized infrastructure (eliminating duplicate operating system files, applications, etc.), and for this to provide good deduplication you'll need a block size that's equal to or smaller than the cluster size of the file system mounted on the block device. That means your storage is now dominated by small chunks (probably 8K-ish) rather than the relatively large 4M stripes that are used today (this will also kill EC, since small objects are replicated rather than erasure coded). This will have a massive impact on backend storage I/O, as the basic data/metadata ratio is completely skewed (both for static storage and for dynamic I/O count).
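
To put rough numbers on the data/metadata ratio, here is a back-of-the-envelope sketch. It assumes a 32-byte hash (e.g. SHA-256) and a 4-byte length per manifest entry; both are assumptions, and the per-object metadata that the backend keeps for every segment object is not counted at all.

```python
OBJECT_SIZE = 4 * 1024 * 1024   # one 4M stripe, as used today
HASH_BYTES = 32                 # assumption: SHA-256 digest
LEN_BYTES = 4                   # assumption: 32-bit segment length

def manifest_overhead(chunk_size: int):
    """Return (segment count, manifest bytes) for one 4M object."""
    chunks = OBJECT_SIZE // chunk_size
    return chunks, chunks * (HASH_BYTES + LEN_BYTES)

for chunk_size in (4 * 1024 * 1024, 64 * 1024, 8 * 1024):
    chunks, manifest_bytes = manifest_overhead(chunk_size)
    print(f"{chunk_size:>8} B chunks: {chunks:>4} segment objects, "
          f"{manifest_bytes:>6} B of manifest")
```

At 8K chunks a single 4M write turns into 512 segment objects plus a manifest, so the I/O count grows by more than two orders of magnitude before replication is even considered.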

Allen Samuels
Software Architect, Emerging Storage Solutions 

2880 Junction Avenue, Milpitas, CA 95134
T: +1 408 801 7030| M: +1 408 780 6416
allen.samuels@SanDisk.com

-----Original Message-----
From: Chaitanya Huilgol 
Sent: Thursday, July 02, 2015 3:50 AM
To: James (Fei) Liu-SSI; Allen Samuels; Haomai Wang
Cc: ceph-devel
Subject: RE: Inline dedup/compression

Hi James et al.,

Here is an example for clarity:
1. Client writes object object.abcd.
2. Based on the CRUSH rules, say OSD.a is the primary OSD which receives the write.
3. OSD.a performs segmenting/fingerprinting, which can be static or dynamic, and generates a list of segments. The object.abcd is now represented by a manifest object with the list of segment hashes and lengths:
   [Header]  [Seg1_sha, len]  [Seg2_sha, len]  ...  [Seg3_sha, len]
4. OSD.a writes each segment as a new object in the cluster with object name <reserved_dedupe_prefix><sha>.
5. The dedupe object write is treated differently from regular object writes: if the object is present, then an object reference count is incremented and the object is not overwritten. This forms the basis of the dedupe logic. Multiple objects with one or more identical constituent segments start sharing the segment objects.
6. Once all the segments are successfully written, the object 'object.abcd' is now just a stub object with the segment manifest as described above, and it goes through a regular object write sequence.
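
The six steps above can be modelled in a few lines. This is only an in-memory sketch: the SegmentStore class, the "__dedupe__" prefix, SHA-256 and the 4 KiB static segment size are all hypothetical stand-ins, and a real OSD would have to make the check-and-increment atomic.

```python
import hashlib

DEDUPE_PREFIX = "__dedupe__"   # hypothetical reserved prefix

class SegmentStore:
    """In-memory stand-in for the cluster: name -> [data, refcount]."""
    def __init__(self):
        self.objects = {}

    def dedupe_write(self, data: bytes) -> str:
        """Create the segment object, or bump its refcount if present."""
        name = DEDUPE_PREFIX + hashlib.sha256(data).hexdigest()
        entry = self.objects.get(name)
        if entry is None:
            self.objects[name] = [data, 1]
        else:
            entry[1] += 1          # existing data is not overwritten
        return name

def write_object(store, manifests, name, data, seg_size=4096):
    """Segment the data, write every segment, then store the manifest."""
    manifest = []
    for off in range(0, len(data), seg_size):
        seg = data[off:off + seg_size]
        manifest.append((store.dedupe_write(seg), len(seg)))
    manifests[name] = manifest     # the stub object

def read_object(store, manifests, name):
    """Read the manifest, fetch each segment, and coalesce the result."""
    return b"".join(store.objects[seg][0] for seg, _ in manifests[name])

store, manifests = SegmentStore(), {}
body = b"x" * 8192 + b"y" * 4096
write_object(store, manifests, "object.abcd", body)
write_object(store, manifests, "object.efgh", b"x" * 4096)
x_name = DEDUPE_PREFIX + hashlib.sha256(b"x" * 4096).hexdigest()
print(len(store.objects), "segment objects; x refcount =",
      store.objects[x_name][1])
```

Four segments are written in total, but only two segment objects exist afterwards; the shared one carries a reference count of three.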

Partial writes on objects will be complicated:
- Partially affected segments will have to be read and segmentation logic has to be run from first to last affected segment boundaries
-  New segments will be written
- Old overwritten segments have to be deleted
- Write merged manifest of the object 
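
A sketch of the partial-write sequence above, assuming static segmentation and a write that falls inside the existing extent of the object. The helper names and the tiny 4-byte segment size are illustrative only, and the locking and journaling are left out entirely.

```python
import hashlib

SEG = 4  # tiny fixed segment size, for illustration only

def seg_name(chunk: bytes) -> str:
    return hashlib.sha256(chunk).hexdigest()

def add_segment(segments, refs, chunk):
    """Write a segment (or take a reference on an existing one)."""
    name = seg_name(chunk)
    segments.setdefault(name, chunk)
    refs[name] = refs.get(name, 0) + 1
    return name

def drop_segment(segments, refs, name):
    """Release one reference; remove the segment when none are left."""
    refs[name] -= 1
    if refs[name] == 0:
        del refs[name]
        del segments[name]

def partial_write(segments, refs, manifest, offset, new_data):
    first = offset // SEG
    last = (offset + len(new_data) - 1) // SEG
    old_names = manifest[first:last + 1]
    # Read the affected segments and apply the overwrite in memory.
    buf = bytearray(b"".join(segments[n] for n in old_names))
    start = offset - first * SEG
    buf[start:start + len(new_data)] = new_data
    # Re-segment and write the new segments before releasing the old
    # ones, so an unchanged segment never drops to zero references.
    new_names = [add_segment(segments, refs, bytes(buf[i:i + SEG]))
                 for i in range(0, len(buf), SEG)]
    for name in old_names:
        drop_segment(segments, refs, name)
    # Merged manifest: untouched head + new segments + untouched tail.
    return manifest[:first] + new_names + manifest[last + 1:]

segments, refs = {}, {}
original = b"aaaabbbbcccc"
manifest = [add_segment(segments, refs, original[i:i + SEG])
            for i in range(0, len(original), SEG)]
manifest = partial_write(segments, refs, manifest, 2, b"XXXX")
result = b"".join(segments[n] for n in manifest)
print(result)
```

With content-defined chunking the re-segmentation would additionally have to run until the new boundaries line up with the old ones again, which is part of why this path is expensive.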

All of this will need the protection of the PG lock. An additional journaling mechanism will also be needed to recover from cases where the OSD goes down before writing all the segments.

Since this is quite a lot of processing, a better use case for this dedupe mechanism would be in the data tiering model with object redirects.
The manifest object fits quite well into the object-redirect scheme of things: the idea is that, when an object is moved out of the base tier, you have the option to create a dedupe stub object and write the individual segments into the cold backend tier with a rados plugin.

Remaining responses inline.

Regards,
Chaitanya

-----Original Message-----
From: James (Fei) Liu-SSI [mailto:james.liu@ssi.samsung.com]
Sent: Wednesday, July 01, 2015 4:00 AM
To: Chaitanya Huilgol; Allen Samuels; Haomai Wang
Cc: ceph-devel
Subject: RE: Inline dedup/compression

Hi Chaitanya,
   Very interesting thoughts. I am not sure whether I get all of them or not. Here are several questions about the solution you provided; they might be a little bit detailed.

    Regards,
    James

- Dedupe is set as a pool property
Write:
- Write arrives at the primary OSD/pg
[James] Does the OSD/PG mean the PG backend here?
[Chaitanya] I mean the primary OSD and the PG which get selected by CRUSH, not a specific OSD component.

- Data is segmented (rabin/static) and secure hash computed
[James] In which component of the OSD are you going to do the data segmentation and hash computation?
[Chaitanya] If partial writes are not supported, then this could be done before acquiring the PG lock; otherwise we need the protection of the PG lock. Probably in the do_request() path?

- A manifest is created with the offset/len/hash for all the segments
[James] Is the manifest going to be part of the xattrs of the object? Where are you going to save the manifest?
[Chaitanya] The manifest is a stub object with the list of constituent segments.

- OSD/pg sends rados write with a special name <__known__prefix><secure hash> for all segments
[James] What do you mean by "rados write"? Where are all the segments with secure hash signatures written to?
[Chaitanya] All segments are unique objects with the above-mentioned naming scheme; they get written back into the cluster as regular client rados object writes.

- PG receiving dedup write will:
        1. check for object presence and create object if not present
        2. If object is already present, then a reference count is incremented (check and increment needs to be atomic)
[James] It makes sense. But I was wondering whether the unit for dedupe is the segment or the object. If it is object based, it totally makes sense. However, why do we need to have segments with a manifest?

- Response is received by original primary PG for all segments
[James] What response?
[Chaitanya] Write response indicating the status of the segment object write

- Primary PG writes the manifest to local and replicas or EC members
[James] What about the dedupe data if the data is not present in the replicas?
[Chaitanya] I am sorry, I did not get your question. The manifest object gets written to the primary and the replicas, or encoded and written to the EC members; it is afforded the protection policy set for the pool. The same is the case with the individual constituent segments.

- Response sent to client

Read:
- Read received at primary PG
[James]  The read can only fetch data from Primary PG?
- Reads manifest object

- sends reads for each segment object <__known__prefix><secure hash>
- coalesces all the response to build the required data
- Responds to client

Pros:
- No need for a centralized hash index, so in line with Ceph's no-bottleneck philosophy

Cons:
- Some PGs may get overloaded due to frequently occurring segment patterns
- Latency and increased traffic on the network

-----Original Message-----
From: Chaitanya Huilgol [mailto:Chaitanya.Huilgol@sandisk.com]
Sent: Tuesday, June 30, 2015 8:50 AM
To: Allen Samuels; James (Fei) Liu-SSI; Haomai Wang
Cc: ceph-devel
Subject: RE: Inline dedup/compression

- Reference count has to be maintained as an attribute of the object
- As mentioned in the write workflow, duplicate segment writes increment the reference count
- Object Delete would result in delete on constituent segments listed in the object segment manifest
- Segment object delete will decrement reference count and remove the segment when there are no more references present 
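
The delete path described in these four points could be sketched as follows. The class and function names are hypothetical, and the reference count lives in a plain dict here rather than in an object attribute.

```python
class RefCountedSegments:
    """Tracks one reference count per segment object."""
    def __init__(self):
        self.refs = {}   # segment name -> reference count

    def add_ref(self, seg):
        self.refs[seg] = self.refs.get(seg, 0) + 1

    def drop_ref(self, seg) -> bool:
        """Decrement; return True if the segment was actually removed."""
        self.refs[seg] -= 1
        if self.refs[seg] == 0:
            del self.refs[seg]
            return True
        return False

def delete_object(manifests, segments, name):
    """Delete an object: issue a delete on every segment in its manifest."""
    removed = []
    for seg in manifests.pop(name):
        if segments.drop_ref(seg):
            removed.append(seg)
    return removed

segments = RefCountedSegments()
manifests = {"a": ["s1", "s2"], "b": ["s2", "s3"]}
for manifest in manifests.values():
    for seg in manifest:
        segments.add_ref(seg)

removed_first = delete_object(manifests, segments, "a")
removed_second = delete_object(manifests, segments, "b")
print(removed_first, removed_second)
```

Deleting "a" only frees s1, because s2 is still referenced by "b"; the shared segment goes away with the last object that uses it.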

Regards,
Chaitanya

-----Original Message-----
From: Allen Samuels
Sent: Tuesday, June 30, 2015 9:02 PM
To: Chaitanya Huilgol; James (Fei) Liu-SSI; Haomai Wang
Cc: ceph-devel
Subject: RE: Inline dedup/compression

This covers read and write; what about delete? One of the major issues with dedupe, whether global or local, is addressing the inherent ref-counting associated with sharing pieces of storage.

Allen Samuels
Software Architect, Emerging Storage Solutions 

2880 Junction Avenue, Milpitas, CA 95134
T: +1 408 801 7030| M: +1 408 780 6416
allen.samuels@SanDisk.com

-----Original Message-----
From: ceph-devel-owner@vger.kernel.org [mailto:ceph-devel-owner@vger.kernel.org] On Behalf Of Chaitanya Huilgol
Sent: Monday, June 29, 2015 11:20 PM
To: James (Fei) Liu-SSI; Haomai Wang
Cc: ceph-devel
Subject: RE: Inline dedup/compression

Below is an alternative idea, at a very high level, for dedup with Ceph without the need for a centralized hash index:

- Dedupe is set as a pool property
Write:
- Write arrives at the primary OSD/pg
- Data is segmented (rabin/static) and secure hash computed
- A manifest is created with the offset/len/hash for all the segments
- OSD/pg sends rados write with a special name <__known__prefix><secure hash> for all segments
- PG receiving dedup write will:
        1. check for object presence and create object if not present
        2. If object is already present, then a reference count is incremented (check and increment needs to be atomic)
- Response is received by original primary PG for all segments
- Primary PG writes the manifest to local and replicas or EC members
- Response sent to client

Read:
- Read received at primary PG
- Reads manifest object
- sends reads for each segment object <__known__prefix><secure hash>
- coalesces all the response to build the required data
- Responds to client

Pros:
- No need for a centralized hash index, so in line with Ceph's no-bottleneck philosophy

Cons:
- Some PGs may get overloaded due to frequently occurring segment patterns
- Latency and increased traffic on the network

Regards,
Chaitanya

-----Original Message-----
From: ceph-devel-owner@vger.kernel.org [mailto:ceph-devel-owner@vger.kernel.org] On Behalf Of James (Fei) Liu-SSI
Sent: Tuesday, June 30, 2015 2:25 AM
To: Haomai Wang
Cc: ceph-devel
Subject: RE: Inline dedup/compression

Hi Haomai,
  Thanks for moving the idea forward. Regarding compression: if we do compression at the client level, it is not global, and the compression is only applied to the local client, am I right? I think there are pros and cons to the two solutions, and we can get into more detail for each one.
  I really like your idea for dedupe on the OSD side, by the way. Let me think more about it.

 Regards,
 James

-----Original Message-----
From: Haomai Wang [mailto:haomaiwang@gmail.com]
Sent: Friday, June 26, 2015 8:55 PM
To: James (Fei) Liu-SSI
Cc: ceph-devel
Subject: Re: Inline dedup/compression

On Sat, Jun 27, 2015 at 2:03 AM, James (Fei) Liu-SSI <james.liu@ssi.samsung.com> wrote:
> Hi Haomai,
>   Thanks for your response as always. I agree compression is comparable easier task but still very challenge in terms of implementation no matter where we should implement . Client side like RBD, or RDBGW or CephFS, or PG should be a little bit better place to implementation in terms of efficiency and cost reduction before the data were duplicated to other OSDs. It has  two reasons :
> 1. Keep the data consistency among OSDs in one PG 2. Saving the 
> computing resources
>
> IMHO , The compression should be accomplished before the replication come into play in pool level. However, we can also have second level of compression in the local objectstore.  In term of unit size of compression , It really depends workload and in which layer we should implement.
>
> About inline deduplication, it will dramatically increase the complexities if we bring in the replication and Erasure Coding for consideration.
>
> However, Before we talk about implementation, It would be great if we can understand the pros and cons to implement inline dedupe/compression. We all understand the benefits of dedupe/compression. However, the side effect is performance hurt and need more computing resources. It would be great if we can understand the problems from 30,000 feet high for the whole picture about the Ceph. Please correct me if I were wrong.

Actually, we may have some tricks to reduce the performance hit of things like compression. As Joe mentioned, we can compress slave pg data to avoid the performance hit, but it may increase the complexity of recovery and pg remapping. Another, more detailed implementation approach: if we begin to compress data from the messenger, the osd thread and pg thread won't access the data for a normal client op, so maybe we can run compression in parallel with pg processing. The journal thread will get the compressed data at the end.

The effectiveness of compression is also a concern: if we do compression in rados, we may not get the best compression result. If we can do compression in libcephfs, librbd and radosgw and keep rados unaware of compression, it may be simpler, and we can get file/block/object-level compression. Wouldn't that be better?

About dedup, my current idea is that we could set up a memory pool on the osd side to store checksums. Then, on the client side, we calculate a hash of the object data and map that to a PG instead of the object name, so an object would always land on an osd that is also responsible for its dedup storage. It could also be distributed at the pool level.
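
A simplified sketch of content-based placement as described above. This is not Ceph's actual placement function; the use of SHA-256, the plain modulo and the function name pg_for_content are assumptions made only to show the idea that identical data always maps to the same PG.

```python
import hashlib

def pg_for_content(data: bytes, pg_num: int) -> int:
    """Map object *content* (not its name) to a PG id in [0, pg_num)."""
    digest = hashlib.sha256(data).digest()
    return int.from_bytes(digest[:4], "big") % pg_num

PG_NUM = 128
pg_a = pg_for_content(b"identical block", PG_NUM)
pg_b = pg_for_content(b"identical block", PG_NUM)   # same data, other client
print(pg_a, pg_b)
```

Because both writers land on the same PG, the checksum lookup and the duplicate detection can stay local to the OSD that owns it.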

>
> By the way, Both of software defined storage solution startups like Hdevig and Springpath provide inline dedupe/compression.  It is not apple to apple comparison. But it is good reference. The datacenters need cost effective solution.
>
> Regards,
> James
>
>
>
> -----Original Message-----
> From: Haomai Wang [mailto:haomaiwang@gmail.com]
> Sent: Thursday, June 25, 2015 8:08 PM
> To: James (Fei) Liu-SSI
> Cc: ceph-devel
> Subject: Re: Inline dedup/compression
>
> On Fri, Jun 26, 2015 at 6:01 AM, James (Fei) Liu-SSI <james.liu@ssi.samsung.com> wrote:
>> Hi Cephers,
>>     It is not easy to ask when Ceph is going to support inline dedup/compression across OSDs in RADOS because it is not easy task and answered. Ceph is providing replication and EC for performance and failure recovery. But we also lose the efficiency  of storage store and cost associate with it. It is kind of contradicted with each other. But I am curious how other Cephers think about this question.
>>    Any plan for Cephers to do anything regarding to inline dedupe/compression except the features brought by local node itself like BRTFS?
>
> Compression is easier to implement in rados than dedup. The most important thing about compression is where we begin to compress, client, pg or objectstore. Then we need to decide how much the compress unit is. Of course, compress and dedup both like to use keyvalue-alike storage api to use, but I think it's not difficult to use existing objectstore api.
>
> Dedup is more possible to implement in local osd instead of the whole pool or cluster, and if we want to do dedup for the pool level, we need to do dedup from client.
>
>>
>>   Regards,
>>   James
>> --
>> To unsubscribe from this list: send the line "unsubscribe ceph-devel"
>> in the body of a message to majordomo@vger.kernel.org More majordomo 
>> info at  http://vger.kernel.org/majordomo-info.html
>
>
>
> --
> Best Regards,
>
> Wheat

--
Best Regards,

Wheat



^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: Inline dedup/compression
  2015-07-03  5:13                   ` Allen Samuels
@ 2015-08-21  2:51                     ` Haomai Wang
  2015-08-21  3:01                       ` Haomai Wang
  0 siblings, 1 reply; 28+ messages in thread
From: Haomai Wang @ 2015-08-21  2:51 UTC (permalink / raw)
  To: Allen Samuels; +Cc: Chaitanya Huilgol, James (Fei) Liu-SSI, ceph-devel

I found a blog post (http://mysqlserverteam.com/innodb-transparent-pageio-compression/)
about MySQL InnoDB transparent compression. It's surprising that InnoDB
does it at a low level (just like filestore in Ceph) and relies on the
filesystem's file hole feature. I'm very doubtful about the performance
after storing lots of *small* hole files on the fs. If it is reliable, it
would be easy for filestore/newstore to implement a similar feature.
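
To make the hole-based approach concrete, here is a sketch of the space accounting only: a page is compressed, the compressed bytes are rounded up to whole filesystem blocks, and the rest of the page slot would be released as a hole. The 16 KiB page size, 4 KiB block size and zlib are assumptions, and no real hole punching is performed.

```python
import os
import zlib

PAGE_SIZE = 16 * 1024   # assumption: 16 KiB page
FS_BLOCK = 4 * 1024     # assumption: 4 KiB filesystem block

def blocks_after_compression(page: bytes) -> int:
    """Filesystem blocks still occupied once the unused tail is a hole."""
    compressed = zlib.compress(page)
    used = -(-len(compressed) // FS_BLOCK)      # ceiling division
    # If compression does not help, the page is kept as-is.
    return min(used, PAGE_SIZE // FS_BLOCK)

compressible = (b"row-data" * 2048)[:PAGE_SIZE]
incompressible = os.urandom(PAGE_SIZE)
small = blocks_after_compression(compressible)
large = blocks_after_compression(incompressible)
print(small, "block(s) for repetitive data,", large, "for random data")
```

Every page that compresses leaves its own small hole in the file, which is exactly the pattern whose filesystem performance is in question here.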

On Fri, Jul 3, 2015 at 1:13 PM, Allen Samuels <Allen.Samuels@sandisk.com> wrote:
> For non-overwriting relatively large objects, this scheme works fine. Unfortunately the real use-case for deduplication is block storage with virtualized infrastructure (eliminating duplicate operating system files and applications, etc.) and in order for this to provide good deduplication, you'll need a block size that's equal or smaller than the cluster-size of the file system mounted on the block device. Meaning that your storage is now dominated by small chunks (probably 8K-ish) rather than the relatively large 4M stripes that is used today (this will also kill EC since small objects are replicated rather than ECed). This will have a massive impact on backend storage I/O as the basic data/metadata ratio is complete skewed (both for static storage and dynamic I/O count).
>
>
> Allen Samuels
> Software Architect, Emerging Storage Solutions
>
> 2880 Junction Avenue, Milpitas, CA 95134
> T: +1 408 801 7030| M: +1 408 780 6416
> allen.samuels@SanDisk.com
>
>
> -----Original Message-----
> From: Chaitanya Huilgol
> Sent: Thursday, July 02, 2015 3:50 AM
> To: James (Fei) Liu-SSI; Allen Samuels; Haomai Wang
> Cc: ceph-devel
> Subject: RE: Inline dedup/compression
>
> Hi James et.al ,
>
> Here is an example for clarity,
> 1. Client Writes object  object.abcd
> 2. Based on the crush rules, say  OSD.a is the primary OSD which receives the write
> 3. OSD.a  performs segmenting/fingerprinting which can be static or dynamic and generates a list of segments, the object.abcd is now represented by a manifest object with the list of segment hash and len
>    [Header]  [Seg1_sha, len]  [Seg2_sha, len]  ...  [Seg3_sha, len]
> 4. OSD.a writes each segment as a new object in the cluster with object name  <reserved_dedupe_perfix><sha>
> 5. The dedupe object write is treated differently from regular object writes, If the object is present then an object reference count is incremented and the object is not overwritten - this forms the basis of the dedupe logic. Multiple objects with one or more same constituent segments start sharing the segment objects.
> 6. Once all the segments are successfully written the object 'object.abcd' is now just a stub object with the segment manifest as described above and is goes through a regular object write sequence
>
> Partial writes on objects will be complicated,
> - Partially affected segments will have to be read and segmentation logic has to be run from first to last affected segment boundaries
> -  New segments will be written
> - Old overwritten segments have to be deleted
> - Write merged manifest of the object
>
> All this will need protection of the PG lock, Also additional journaling mechanism will be needed to  recover from cases where the osd goes down before writing all the segments.
>
> Since this is quite a lot of processing, a better use case for this dedupe mechanism would be in the data tiering model with object redirects.
> The manifest object fits quiet well into object redirects scheme of things, the idea is that, when an object is moved out of the base tier, you have an option to create a dedupe stub object and write individual segments into the cold backend tier with a rados plugin.
>
> Remaining responses inline.
>
> Regards,
> Chaitanya
>
> -----Original Message-----
> From: James (Fei) Liu-SSI [mailto:james.liu@ssi.samsung.com]
> Sent: Wednesday, July 01, 2015 4:00 AM
> To: Chaitanya Huilgol; Allen Samuels; Haomai Wang
> Cc: ceph-devel
> Subject: RE: Inline dedup/compression
>
> Hi Chaitanya,
>    Very interesting thoughts. I am not sure whether I get all of them or now. Here are several questions for the solution you provided, Might be a little bit detailed.
>
>     Regards,
>     James
>
> - Dedupe is set as a pool property
> Write:
> - Write arrives at the primary OSD/pg
> [James] Does the OSD/PG mean PG Backend over here?
> [Chaitanya] I mean the Primary OSD and the PG which get selected by the crush - not the specific OSD component
>
> - Data is segmented (rabin/static) and secure hash computed
> [James] Which component in OSD are you going to do the data segment and hash computation?
> [Chaitanya] If partial writes are not supported then this could be done before acquiring the PG lock, else we need the protection of the PG lock.  Probably in the do_request() path?
>
> - A manifest is created with the offset/len/hash for all the segments
> [James] The manifest is going to be part of xattr of object? Where are you going to save manifest?
> [Chaitanya] The manifest is a stub object with the constituent segments list
>
> - OSD/pg sends rados write with a special name <__known__prefix><secure hash> for all segments
> [James] What's your meaning of Rados Write?  Where do the all segments with secure hash signature write to?
> [Chaitanya] All segments are unique objects with the above mentioned naming scheme, they get written back into the cluster as a regular client rados object write
>
> - PG receiving dedup write will:
>         1. check for object presence and create object if not present
>         2. If object is already present, then an reference count is incremented (check and increment needs to be atomic)
> [James] It makes sense. But I was wondering the unit for dedupe is segment or object? If object base, it totally make sense. However, why we need to have segment with manifest?
>
> - Response is received by original primary PG for all segments
> [James] What response?
> [Chaitanya] Write response indicating the status of the segment object write
>
> - Primary PG writes the manifest to local and replicas or EC members
> [James] How about the dedupe data if the data is not present in replicas?
> [Chaitanya] I am sorry, I did not get your question, the manifest object gets written in the primary and the replicas or encoded and written to the EC members, it is afforded the protection policy set for the pool. Same is the case with the individual constituent segments.
>
> - Response sent to client
>
> Read:
> - Read received at primary PG
> [James]  The read can only fetch data from Primary PG?
> - Reads manifest object
>
> - sends reads for each segment object <__know_prefix><secure hash>
> - coalesces all the response to build the required data
> - Responds to client
>
>
> Pros:
> No need of centralized hash index so inline with ceph no bottleneck philosophy
>
> Cons:
> Some PGs may get overloaded due to frequently occurring segment patterns
> Latency and increased traffic on the network
>
>
>
> -----Original Message-----
> From: Chaitanya Huilgol [mailto:Chaitanya.Huilgol@sandisk.com]
> Sent: Tuesday, June 30, 2015 8:50 AM
> To: Allen Samuels; James (Fei) Liu-SSI; Haomai Wang
> Cc: ceph-devel
> Subject: RE: Inline dedup/compression
>
>
> - Reference count has to be maintained as an attribute of the object
> - As mentioned in the write workflow, duplicate segment writes increment the reference count
> - Object Delete would result in delete on constituent segments listed in the object segment manifest
> - Segment object delete will decrement reference count and remove the segment when there are no more references present
>
> Regards,
> Chaitanya
>
> -----Original Message-----
> From: Allen Samuels
> Sent: Tuesday, June 30, 2015 9:02 PM
> To: Chaitanya Huilgol; James (Fei) Liu-SSI; Haomai Wang
> Cc: ceph-devel
> Subject: RE: Inline dedup/compression
>
> This covers the read and write, what about the delete? One of the major issues with Dedupe, whether global or local is to address the inherent ref-counting associated with sharing of pieces of storage.
>
> Allen Samuels
> Software Architect, Emerging Storage Solutions
>
> 2880 Junction Avenue, Milpitas, CA 95134
> T: +1 408 801 7030| M: +1 408 780 6416
> allen.samuels@SanDisk.com
>
> -----Original Message-----
> From: ceph-devel-owner@vger.kernel.org [mailto:ceph-devel-owner@vger.kernel.org] On Behalf Of Chaitanya Huilgol
> Sent: Monday, June 29, 2015 11:20 PM
> To: James (Fei) Liu-SSI; Haomai Wang
> Cc: ceph-devel
> Subject: RE: Inline dedup/compression
>
> Below is an alternative idea at a very high level around dedup with ceph without a need of centralized hash index,
>
> - Dedupe is set as a pool property
> Write:
> - Write arrives at the primary OSD/pg
> - Data is segmented (rabin/static) and secure hash computed
> - A manifest is created with the offset/len/hash for all the segments
> - OSD/pg sends rados write with a special name <__known__prefix><secure hash> for all segments
> - PG receiving dedup write will:
>         1. check for object presence and create object if not present
>         2. If object is already present, then an reference count is incremented (check and increment needs to be atomic)
> - Response is received by original primary PG for all segments
> - Primary PG writes the manifest to local and replicas or EC members
> - Response sent to client
>
> Read:
> - Read received at primary PG
> - Reads manifest object
> - sends reads for each segment object <__know_prefix><secure hash>
> - coalesces all the response to build the required data
> - Responds to client
>
>
> Pros:
> No need of centralized hash index so inline with ceph no bottleneck philosophy
>
> Cons:
> Some PGs may get overloaded due to frequently occurring segment patterns
> Latency and increased traffic on the network
>
> Regards,
> Chaitanya
>
> -----Original Message-----
> From: ceph-devel-owner@vger.kernel.org [mailto:ceph-devel-owner@vger.kernel.org] On Behalf Of James (Fei) Liu-SSI
> Sent: Tuesday, June 30, 2015 2:25 AM
> To: Haomai Wang
> Cc: ceph-devel
> Subject: RE: Inline dedup/compression
>
> Hi Haomai,
>   Thanks for moving the idea forward. Regarding to the compression.  However,  if we do compression on the client level, it is not global. And the compression was only applied to the local client, am I right?  I think there is pros and cons in two solutions and we can get into details more for each solution.
>   I really like your idea for dedupe in OSD side   by the way. Let me think more about it.
>
>  Regards,
>  James
>
> -----Original Message-----
> From: Haomai Wang [mailto:haomaiwang@gmail.com]
> Sent: Friday, June 26, 2015 8:55 PM
> To: James (Fei) Liu-SSI
> Cc: ceph-devel
> Subject: Re: Inline dedup/compression
>
> On Sat, Jun 27, 2015 at 2:03 AM, James (Fei) Liu-SSI <james.liu@ssi.samsung.com> wrote:
>> Hi Haomai,
>>   Thanks for your response as always. I agree compression is comparable easier task but still very challenge in terms of implementation no matter where we should implement . Client side like RBD, or RDBGW or CephFS, or PG should be a little bit better place to implementation in terms of efficiency and cost reduction before the data were duplicated to other OSDs. It has  two reasons :
>> 1. Keep the data consistency among OSDs in one PG
>> 2. Saving the computing resources
>>
>> IMHO , The compression should be accomplished before the replication come into play in pool level. However, we can also have second level of compression in the local objectstore.  In term of unit size of compression , It really depends workload and in which layer we should implement.
>>
>> About inline deduplication, it will dramatically increase the complexities if we bring in the replication and Erasure Coding for consideration.
>>
>> However, Before we talk about implementation, It would be great if we can understand the pros and cons to implement inline dedupe/compression. We all understand the benefits of dedupe/compression. However, the side effect is performance hurt and need more computing resources. It would be great if we can understand the problems from 30,000 feet high for the whole picture about the Ceph. Please correct me if I were wrong.
>
> Actually we may have some tricks to reduce performance hurt like compression. As Joe mentioned, we can compress slave pg data to avoid performance hurt, but it may increase the complexity of recovery and pg remap things. Another in-detail implement way if we begin to compress data from messenger, osd thread and pg thread won't access data for normal client op, so maybe we can make it parallel with pg process. Journal thread will get the compressed data at last.
>
> The effect of compression also is a concern, we do compression in rados may not get the best compression result. If we can do compression in libcephfs, librbd and radosgw and make rados unknown to compression, it maybe simpler and we can get file/block/object level compression. it should be better?
>
> About dedup, my current idea is we could setup a memory pool at osd side for checksum store usage. Then we calculate object data and map to PG instead of object name at client side, so a object could always in a osd where it's also responsible for dedup storage. It also could be distributed at pool level.
>
>
>>
>> By the way, Both of software defined storage solution startups like Hdevig and Springpath provide inline dedupe/compression.  It is not apple to apple comparison. But it is good reference. The datacenters need cost effective solution.
>>
>> Regards,
>> James
>>
>>
>>
>> -----Original Message-----
>> From: Haomai Wang [mailto:haomaiwang@gmail.com]
>> Sent: Thursday, June 25, 2015 8:08 PM
>> To: James (Fei) Liu-SSI
>> Cc: ceph-devel
>> Subject: Re: Inline dedup/compression
>>
>> On Fri, Jun 26, 2015 at 6:01 AM, James (Fei) Liu-SSI <james.liu@ssi.samsung.com> wrote:
>>> Hi Cephers,
>>>     It is not easy to ask when Ceph is going to support inline dedup/compression across OSDs in RADOS because it is not easy task and answered. Ceph is providing replication and EC for performance and failure recovery. But we also lose the efficiency  of storage store and cost associate with it. It is kind of contradicted with each other. But I am curious how other Cephers think about this question.
>>>    Any plan for Cephers to do anything regarding to inline dedupe/compression except the features brought by local node itself like BRTFS?
>>
>> Compression is easier to implement in rados than dedup. The most important thing about compression is where we begin to compress, client, pg or objectstore. Then we need to decide how much the compress unit is. Of course, compress and dedup both like to use keyvalue-alike storage api to use, but I think it's not difficult to use existing objectstore api.
>>
>> Dedup is more possible to implement in local osd instead of the whole pool or cluster, and if we want to do dedup for the pool level, we need to do dedup from client.
>>
>>>
>>>   Regards,
>>>   James
>>> --
>>> To unsubscribe from this list: send the line "unsubscribe ceph-devel"
>>> in the body of a message to majordomo@vger.kernel.org More majordomo
>>> info at  http://vger.kernel.org/majordomo-info.html
>>
>>
>>
>> --
>> Best Regards,
>>
>> Wheat
>
>
>
> --
> Best Regards,
>
> Wheat



-- 
Best Regards,

Wheat
--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: Inline dedup/compression
  2015-08-21  2:51                     ` Haomai Wang
@ 2015-08-21  3:01                       ` Haomai Wang
  2015-08-21  3:37                         ` Allen Samuels
  0 siblings, 1 reply; 28+ messages in thread
From: Haomai Wang @ 2015-08-21  3:01 UTC (permalink / raw)
  To: Allen Samuels; +Cc: Chaitanya Huilgol, James (Fei) Liu-SSI, ceph-devel

Sorry, the link should be this blog post:
http://mysqlserverteam.com/innodb-transparent-page-compression/

On Fri, Aug 21, 2015 at 10:51 AM, Haomai Wang <haomaiwang@gmail.com> wrote:
> I found a blog post (http://mysqlserverteam.com/innodb-transparent-pageio-compression/)
> about MySQL InnoDB transparent compression. Surprisingly, InnoDB does
> this at a low level (just like FileStore in Ceph) and relies on the
> filesystem's file hole feature. I'm very skeptical about the performance
> after storing lots of files with *small* holes on the fs. If it is
> reliable, it would be easy for filestore/newstore to implement a similar
> feature.
>
> On Fri, Jul 3, 2015 at 1:13 PM, Allen Samuels <Allen.Samuels@sandisk.com> wrote:
>> For non-overwriting relatively large objects, this scheme works fine. Unfortunately the real use-case for deduplication is block storage with virtualized infrastructure (eliminating duplicate operating system files and applications, etc.) and in order for this to provide good deduplication, you'll need a block size that's equal or smaller than the cluster-size of the file system mounted on the block device. Meaning that your storage is now dominated by small chunks (probably 8K-ish) rather than the relatively large 4M stripes that is used today (this will also kill EC since small objects are replicated rather than ECed). This will have a massive impact on backend storage I/O as the basic data/metadata ratio is complete skewed (both for static storage and dynamic I/O count).
>>
>>
>> Allen Samuels
>> Software Architect, Emerging Storage Solutions
>>
>> 2880 Junction Avenue, Milpitas, CA 95134
>> T: +1 408 801 7030| M: +1 408 780 6416
>> allen.samuels@SanDisk.com
>>
>>
>> -----Original Message-----
>> From: Chaitanya Huilgol
>> Sent: Thursday, July 02, 2015 3:50 AM
>> To: James (Fei) Liu-SSI; Allen Samuels; Haomai Wang
>> Cc: ceph-devel
>> Subject: RE: Inline dedup/compression
>>
>> Hi James et.al ,
>>
>> Here is an example for clarity,
>> 1. Client Writes object  object.abcd
>> 2. Based on the crush rules, say  OSD.a is the primary OSD which receives the write
>> 3. OSD.a  performs segmenting/fingerprinting which can be static or dynamic and generates a list of segments, the object.abcd is now represented by a manifest object with the list of segment hash and len
>>    [Header]  [Seg1_sha, len]  [Seg2_sha, len]  ...  [Seg3_sha, len]
>> 4. OSD.a writes each segment as a new object in the cluster with object name  <reserved_dedupe_perfix><sha>
>> 5. The dedupe object write is treated differently from regular object writes, If the object is present then an object reference count is incremented and the object is not overwritten - this forms the basis of the dedupe logic. Multiple objects with one or more same constituent segments start sharing the segment objects.
>> 6. Once all the segments are successfully written the object 'object.abcd' is now just a stub object with the segment manifest as described above and is goes through a regular object write sequence
>>
>> Partial writes on objects will be complicated,
>> - Partially affected segments will have to be read and segmentation logic has to be run from first to last affected segment boundaries
>> -  New segments will be written
>> - Old overwritten segments have to be deleted
>> - Write merged manifest of the object
>>
>> All this will need protection of the PG lock, Also additional journaling mechanism will be needed to  recover from cases where the osd goes down before writing all the segments.
>>
>> Since this is quite a lot of processing, a better use case for this dedupe mechanism would be in the data tiering model with object redirects.
>> The manifest object fits quiet well into object redirects scheme of things, the idea is that, when an object is moved out of the base tier, you have an option to create a dedupe stub object and write individual segments into the cold backend tier with a rados plugin.
>>
>> Remaining responses inline.
>>
>> Regards,
>> Chaitanya
>>
>> -----Original Message-----
>> From: James (Fei) Liu-SSI [mailto:james.liu@ssi.samsung.com]
>> Sent: Wednesday, July 01, 2015 4:00 AM
>> To: Chaitanya Huilgol; Allen Samuels; Haomai Wang
>> Cc: ceph-devel
>> Subject: RE: Inline dedup/compression
>>
>> Hi Chaitanya,
>>    Very interesting thoughts. I am not sure whether I get all of them or now. Here are several questions for the solution you provided, Might be a little bit detailed.
>>
>>     Regards,
>>     James
>>
>> - Dedupe is set as a pool property
>> Write:
>> - Write arrives at the primary OSD/pg
>> [James] Does the OSD/PG mean PG Backend over here?
>> [Chaitanya] I mean the Primary OSD and the PG which get selected by the crush - not the specific OSD component
>>
>> - Data is segmented (rabin/static) and secure hash computed
>> [James] Which component in OSD are you going to do the data segment and hash computation?
>> [Chaitanya] If partial writes are not supported then this could be done before acquiring the PG lock, else we need the protection of the PG lock.  Probably in the do_request() path?
>>
>> - A manifest is created with the offset/len/hash for all the segments
>> [James] The manifest is going to be part of xattr of object? Where are you going to save manifest?
>> [Chaitanya] The manifest is a stub object with the constituent segments list
>>
>> - OSD/pg sends rados write with a special name <__known__prefix><secure hash> for all segments
>> [James] What's your meaning of Rados Write?  Where do the all segments with secure hash signature write to?
>> [Chaitanya] All segments are unique objects with the above mentioned naming scheme, they get written back into the cluster as a regular client rados object write
>>
>> - PG receiving dedup write will:
>>         1. check for object presence and create object if not present
>>         2. If object is already present, then an reference count is incremented (check and increment needs to be atomic)
>> [James] It makes sense. But I was wondering the unit for dedupe is segment or object? If object base, it totally make sense. However, why we need to have segment with manifest?
>>
>> - Response is received by original primary PG for all segments
>> [James] What response?
>> [Chaitanya] Write response indicating the status of the segment object write
>>
>> - Primary PG writes the manifest to local and replicas or EC members
>> [James] How about the dedupe data if the data is not present in replicas?
>> [Chaitanya] I am sorry, I did not get your question, the manifest object gets written in the primary and the replicas or encoded and written to the EC members, it is afforded the protection policy set for the pool. Same is the case with the individual constituent segments.
>>
>> - Response sent to client
>>
>> Read:
>> - Read received at primary PG
>> [James]  The read can only fetch data from Primary PG?
>> - Reads manifest object
>>
>> - sends reads for each segment object <__know_prefix><secure hash>
>> - coalesces all the response to build the required data
>> - Responds to client
>>
>>
>> Pros:
>> No need of centralized hash index so inline with ceph no bottleneck philosophy
>>
>> Cons:
>> Some PGs may get overloaded due to frequently occurring segment patterns
>> Latency and increased traffic on the network
>>
>>
>>
>> -----Original Message-----
>> From: Chaitanya Huilgol [mailto:Chaitanya.Huilgol@sandisk.com]
>> Sent: Tuesday, June 30, 2015 8:50 AM
>> To: Allen Samuels; James (Fei) Liu-SSI; Haomai Wang
>> Cc: ceph-devel
>> Subject: RE: Inline dedup/compression
>>
>>
>> - Reference count has to be maintained as an attribute of the object
>> - As mentioned in the write workflow, duplicate segment writes increment the reference count
>> - Object Delete would result in delete on constituent segments listed in the object segment manifest
>> - Segment object delete will decrement reference count and remove the segment when there are no more references present
>>
>> Regards,
>> Chaitanya
>>
>> -----Original Message-----
>> From: Allen Samuels
>> Sent: Tuesday, June 30, 2015 9:02 PM
>> To: Chaitanya Huilgol; James (Fei) Liu-SSI; Haomai Wang
>> Cc: ceph-devel
>> Subject: RE: Inline dedup/compression
>>
>> This covers the read and write, what about the delete? One of the major issues with Dedupe, whether global or local is to address the inherent ref-counting associated with sharing of pieces of storage.
>>
>> Allen Samuels
>> Software Architect, Emerging Storage Solutions
>>
>> 2880 Junction Avenue, Milpitas, CA 95134
>> T: +1 408 801 7030| M: +1 408 780 6416
>> allen.samuels@SanDisk.com
>>
>> -----Original Message-----
>> From: ceph-devel-owner@vger.kernel.org [mailto:ceph-devel-owner@vger.kernel.org] On Behalf Of Chaitanya Huilgol
>> Sent: Monday, June 29, 2015 11:20 PM
>> To: James (Fei) Liu-SSI; Haomai Wang
>> Cc: ceph-devel
>> Subject: RE: Inline dedup/compression
>>
>> Below is an alternative idea at a very high level around dedup with ceph without a need of centralized hash index,
>>
>> - Dedupe is set as a pool property
>> Write:
>> - Write arrives at the primary OSD/pg
>> - Data is segmented (rabin/static) and secure hash computed
>> - A manifest is created with the offset/len/hash for all the segments
>> - OSD/pg sends rados write with a special name <__known__prefix><secure hash> for all segments
>> - PG receiving dedup write will:
>>         1. check for object presence and create object if not present
>>         2. If object is already present, then an reference count is incremented (check and increment needs to be atomic)
>> - Response is received by original primary PG for all segments
>> - Primary PG writes the manifest to local and replicas or EC members
>> - Response sent to client
>>
>> Read:
>> - Read received at primary PG
>> - Reads manifest object
>> - sends reads for each segment object <__know_prefix><secure hash>
>> - coalesces all the response to build the required data
>> - Responds to client
>>
>>
>> Pros:
>> No need of centralized hash index so inline with ceph no bottleneck philosophy
>>
>> Cons:
>> Some PGs may get overloaded due to frequently occurring segment patterns
>> Latency and increased traffic on the network
>>
>> Regards,
>> Chaitanya
>>
>> -----Original Message-----
>> From: ceph-devel-owner@vger.kernel.org [mailto:ceph-devel-owner@vger.kernel.org] On Behalf Of James (Fei) Liu-SSI
>> Sent: Tuesday, June 30, 2015 2:25 AM
>> To: Haomai Wang
>> Cc: ceph-devel
>> Subject: RE: Inline dedup/compression
>>
>> Hi Haomai,
>>   Thanks for moving the idea forward. Regarding to the compression.  However,  if we do compression on the client level, it is not global. And the compression was only applied to the local client, am I right?  I think there is pros and cons in two solutions and we can get into details more for each solution.
>>   I really like your idea for dedupe in OSD side   by the way. Let me think more about it.
>>
>>  Regards,
>>  James
>>
>> -----Original Message-----
>> From: Haomai Wang [mailto:haomaiwang@gmail.com]
>> Sent: Friday, June 26, 2015 8:55 PM
>> To: James (Fei) Liu-SSI
>> Cc: ceph-devel
>> Subject: Re: Inline dedup/compression
>>
>> On Sat, Jun 27, 2015 at 2:03 AM, James (Fei) Liu-SSI <james.liu@ssi.samsung.com> wrote:
>>> Hi Haomai,
>>>   Thanks for your response as always. I agree compression is comparable easier task but still very challenge in terms of implementation no matter where we should implement . Client side like RBD, or RDBGW or CephFS, or PG should be a little bit better place to implementation in terms of efficiency and cost reduction before the data were duplicated to other OSDs. It has  two reasons :
>>> 1. Keep the data consistency among OSDs in one PG
>>> 2. Saving the computing resources
>>>
>>> IMHO , The compression should be accomplished before the replication come into play in pool level. However, we can also have second level of compression in the local objectstore.  In term of unit size of compression , It really depends workload and in which layer we should implement.
>>>
>>> About inline deduplication, it will dramatically increase the complexities if we bring in the replication and Erasure Coding for consideration.
>>>
>>> However, Before we talk about implementation, It would be great if we can understand the pros and cons to implement inline dedupe/compression. We all understand the benefits of dedupe/compression. However, the side effect is performance hurt and need more computing resources. It would be great if we can understand the problems from 30,000 feet high for the whole picture about the Ceph. Please correct me if I were wrong.
>>
>> Actually we may have some tricks to reduce performance hurt like compression. As Joe mentioned, we can compress slave pg data to avoid performance hurt, but it may increase the complexity of recovery and pg remap things. Another in-detail implement way if we begin to compress data from messenger, osd thread and pg thread won't access data for normal client op, so maybe we can make it parallel with pg process. Journal thread will get the compressed data at last.
>>
>> The effect of compression also is a concern, we do compression in rados may not get the best compression result. If we can do compression in libcephfs, librbd and radosgw and make rados unknown to compression, it maybe simpler and we can get file/block/object level compression. it should be better?
>>
>> About dedup, my current idea is we could setup a memory pool at osd side for checksum store usage. Then we calculate object data and map to PG instead of object name at client side, so a object could always in a osd where it's also responsible for dedup storage. It also could be distributed at pool level.
>>
>>
>>>
>>> By the way, Both of software defined storage solution startups like Hdevig and Springpath provide inline dedupe/compression.  It is not apple to apple comparison. But it is good reference. The datacenters need cost effective solution.
>>>
>>> Regards,
>>> James
>>>
>>>
>>>
>>> -----Original Message-----
>>> From: Haomai Wang [mailto:haomaiwang@gmail.com]
>>> Sent: Thursday, June 25, 2015 8:08 PM
>>> To: James (Fei) Liu-SSI
>>> Cc: ceph-devel
>>> Subject: Re: Inline dedup/compression
>>>
>>> On Fri, Jun 26, 2015 at 6:01 AM, James (Fei) Liu-SSI <james.liu@ssi.samsung.com> wrote:
>>>> Hi Cephers,
>>>>     It is not easy to ask when Ceph is going to support inline dedup/compression across OSDs in RADOS because it is not easy task and answered. Ceph is providing replication and EC for performance and failure recovery. But we also lose the efficiency  of storage store and cost associate with it. It is kind of contradicted with each other. But I am curious how other Cephers think about this question.
>>>>    Any plan for Cephers to do anything regarding to inline dedupe/compression except the features brought by local node itself like BRTFS?
>>>
>>> Compression is easier to implement in rados than dedup. The most important thing about compression is where we begin to compress, client, pg or objectstore. Then we need to decide how much the compress unit is. Of course, compress and dedup both like to use keyvalue-alike storage api to use, but I think it's not difficult to use existing objectstore api.
>>>
>>> Dedup is more possible to implement in local osd instead of the whole pool or cluster, and if we want to do dedup for the pool level, we need to do dedup from client.
>>>
>>>>
>>>>   Regards,
>>>>   James
>>>> --
>>>> To unsubscribe from this list: send the line "unsubscribe ceph-devel"
>>>> in the body of a message to majordomo@vger.kernel.org More majordomo
>>>> info at  http://vger.kernel.org/majordomo-info.html
>>>
>>>
>>>
>>> --
>>> Best Regards,
>>>
>>> Wheat
>>
>>
>>
>> --
>> Best Regards,
>>
>> Wheat
>
>
>
> --
> Best Regards,
>
> Wheat



-- 
Best Regards,

Wheat

^ permalink raw reply	[flat|nested] 28+ messages in thread

* RE: Inline dedup/compression
  2015-08-21  3:01                       ` Haomai Wang
@ 2015-08-21  3:37                         ` Allen Samuels
  2015-08-21  4:43                           ` Chaitanya Huilgol
  0 siblings, 1 reply; 28+ messages in thread
From: Allen Samuels @ 2015-08-21  3:37 UTC (permalink / raw)
  To: Haomai Wang; +Cc: Chaitanya Huilgol, James (Fei) Liu-SSI, ceph-devel

XFS shouldn't have any trouble with the "holes" scheme. I don't know BTRFS as well, but I doubt it's significantly different.

If we assume that the logical address space of a file is broken up into fixed sized chunks on fixed size boundaries (presumably a power of 2) then the implementation is quite straightforward.

Picking the chunk size will be a key issue for performance. Unfortunately, there are competing desires.

For best space utilization, you'll want the chunk size to be large, because on average you'll lose 1/2 of a file system sector/block for each chunk of compressed data.
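
A rough sketch of that arithmetic follows. The 4 KiB file system block and the fixed 2:1 compression ratio are assumptions picked for illustration; only the "half a block lost per chunk on average" rule comes from the paragraph above.

```python
FS_BLOCK = 4096  # assumed file system block size in bytes

def expected_waste_fraction(chunk_size, ratio=2.0, fs_block=FS_BLOCK):
    """Average fraction of allocated space lost to block rounding.

    Assumes every chunk compresses to chunk_size / ratio bytes and that,
    on average, half a file system block is wasted per compressed chunk.
    """
    compressed = chunk_size / ratio
    waste = fs_block / 2
    return waste / (compressed + waste)

for kib in (8, 64, 512, 4096):
    frac = expected_waste_fraction(kib * 1024)
    print(f"chunk {kib:5d} KiB -> about {frac:.1%} of allocated space is slack")
```

Under these assumptions the slack shrinks quickly as the chunk grows, which is the pull toward large chunks.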

For best R/W performance, you'll want the chunk size to be small, because logically the file I/O size is equal to a chunk, i.e., on a write you might have to read the corresponding chunk, decompress it, insert the new data and recompress it. This gets super duper ugly on FileStore because you can't afford to crash during the re-write update and risk a partially updated chunk (this will give you garbage when you decompress it). This means that you'll have to log the entire chunk even if you're only re-writing a small portion of it. Hence the desire to make the chunksize small. I'm not as familiar with NewStore, but I don't think it's fundamentally much better. Basically any form of sub-chunk write-operation stinks in performance. Sub-chunk read operations aren't too bad unless the chunk size is ridiculously large. 
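
The read-modify-write cycle described above can be sketched roughly as follows. This is not FileStore or NewStore code: the `store` dict, the 64 KiB chunk size and the function names are all hypothetical, and zlib is used only because it ships with Python.

```python
import zlib

CHUNK = 64 * 1024  # assumed logical chunk size in bytes

def read_range(store, chunk_no, offset, length):
    """Read part of a chunk; the whole chunk must be decompressed first."""
    blob = store.get(chunk_no)
    plain = zlib.decompress(blob) if blob is not None else bytes(CHUNK)
    return plain[offset:offset + length]

def write_sub_chunk(store, chunk_no, offset, data):
    """Overwrite part of one compressed chunk (read, patch, recompress)."""
    if offset < 0 or offset + len(data) > CHUNK:
        raise ValueError("write must stay inside one chunk")
    blob = store.get(chunk_no)
    # Read and decompress the whole chunk, or start from zeros if absent.
    plain = bytearray(zlib.decompress(blob)) if blob is not None else bytearray(CHUNK)
    plain[offset:offset + len(data)] = data
    new_blob = zlib.compress(bytes(plain))
    # A real backend would have to journal new_blob in full before
    # replacing the old one: a torn compressed chunk cannot be decompressed.
    store[chunk_no] = new_blob
    return len(new_blob)
```

Even a two-byte write touches the whole chunk here, which is why small chunks help write performance.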

For best compression ratios, you'll want the chunk size to be at least equal to the history size if not 2 or 3 times larger (64K history size when using zlib, snappy is 32K or 64K for the latest version)

The partial-block write problem doesn't exist for RGW objects, and its objects are probably already compressed. That means you'll want to be able to convey the compression parameters to RADOS so that the backend knows what to do.

I would add a per-file attribute that encodes the compression parameters:  compression algorithm (zlib, snappy, ...) and chunksize. That would also provide backward compatibility and allow per-object compression diversity.
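
One possible encoding of such an attribute, as a sketch only. The algorithm ids, the 5-byte layout and the function names are invented for illustration and are not an existing Ceph format; a missing attribute is read as "not compressed", which is where the backward compatibility would come from.

```python
import struct

ALGORITHMS = {0: "none", 1: "zlib", 2: "snappy"}  # hypothetical ids
_IDS = {name: algo_id for algo_id, name in ALGORITHMS.items()}
_FMT = "<BI"  # 1-byte algorithm id, 4-byte little-endian chunk size

def encode_compression_attr(algorithm, chunk_size):
    """Pack the per-object compression parameters into 5 bytes."""
    return struct.pack(_FMT, _IDS[algorithm], chunk_size)

def decode_compression_attr(raw):
    """Unpack the attribute; objects without one are uncompressed."""
    if raw is None:
        return ("none", 0)
    algo_id, chunk_size = struct.unpack(_FMT, raw)
    return (ALGORITHMS[algo_id], chunk_size)
```

Because the parameters travel with each object, two objects in the same pool could use different algorithms or chunk sizes.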

Then you'd want to add verbiage to the individual access schemes to allow/disallow compression. For file systems you'd want that on a per-directory basis or perhaps even better a set of regular expressions.
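
A sketch of what a regular-expression policy for file systems might look like. The rule table, the paths and the `compression_for` helper are hypothetical; nothing like this exists in CephFS as far as this thread is concerned.

```python
import re

# Hypothetical policy table, first match wins. None means "do not compress".
RULES = [
    (re.compile(r".*\.(jpg|png|gz|zip|mp4)$"), None),   # already compressed
    (re.compile(r"^/var/log/"), ("zlib", 64 * 1024)),
    (re.compile(r"^/home/"), ("snappy", 32 * 1024)),
]

def compression_for(path, rules=RULES):
    """Return (algorithm, chunk_size) for a path, or None for no compression."""
    for pattern, params in rules:
        if pattern.match(path):
            return params
    return None
```

The result of the lookup would then be stored with the object, so the backend never has to evaluate the rules itself.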


Allen Samuels
Software Architect, Systems and Software Solutions 

2880 Junction Avenue, San Jose, CA 95134
T: +1 408 801 7030| M: +1 408 780 6416
allen.samuels@SanDisk.com


-----Original Message-----
From: Haomai Wang [mailto:haomaiwang@gmail.com] 
Sent: Thursday, August 20, 2015 8:01 PM
To: Allen Samuels
Cc: Chaitanya Huilgol; James (Fei) Liu-SSI; ceph-devel
Subject: Re: Inline dedup/compression

sorry, should be this
blog(http://mysqlserverteam.com/innodb-transparent-page-compression/)

On Fri, Aug 21, 2015 at 10:51 AM, Haomai Wang <haomaiwang@gmail.com> wrote:
> I found a blog post
> (http://mysqlserverteam.com/innodb-transparent-pageio-compression/)
> about MySQL InnoDB transparent compression. It's surprising that
> InnoDB does it at a low level (just like FileStore in Ceph) and relies
> on the filesystem's file hole feature. I'm very skeptical about the
> performance after storing lots of *small* hole files on the fs. If it
> is reliable, it would be easy for FileStore/NewStore to implement a
> similar feature.
>
> On Fri, Jul 3, 2015 at 1:13 PM, Allen Samuels <Allen.Samuels@sandisk.com> wrote:
>> For non-overwriting relatively large objects, this scheme works fine. Unfortunately the real use-case for deduplication is block storage with virtualized infrastructure (eliminating duplicate operating system files and applications, etc.) and in order for this to provide good deduplication, you'll need a block size that's equal to or smaller than the cluster-size of the file system mounted on the block device. Meaning that your storage is now dominated by small chunks (probably 8K-ish) rather than the relatively large 4M stripes that are used today (this will also kill EC since small objects are replicated rather than ECed). This will have a massive impact on backend storage I/O as the basic data/metadata ratio is completely skewed (both for static storage and dynamic I/O count).
>>
>>
>> Allen Samuels
>> Software Architect, Emerging Storage Solutions
>>
>> 2880 Junction Avenue, Milpitas, CA 95134
>> T: +1 408 801 7030| M: +1 408 780 6416 allen.samuels@SanDisk.com
>>
>>
>> -----Original Message-----
>> From: Chaitanya Huilgol
>> Sent: Thursday, July 02, 2015 3:50 AM
>> To: James (Fei) Liu-SSI; Allen Samuels; Haomai Wang
>> Cc: ceph-devel
>> Subject: RE: Inline dedup/compression
>>
>> Hi James et al.,
>>
>> Here is an example for clarity,
>> 1. Client writes object object.abcd
>> 2. Based on the CRUSH rules, say OSD.a is the primary OSD which receives the write
>> 3. OSD.a performs segmenting/fingerprinting, which can be static or dynamic, and generates a list of segments. The object.abcd is now represented by a manifest object with the list of segment hashes and lengths:
>>      [Header]
>>      [Seg1_sha, len]
>>      [Seg2_sha, len]
>>      ...
>>      [Seg3_sha, len]
>> 4. OSD.a writes each segment as a new object in the cluster with object name <reserved_dedupe_prefix><sha>
>> 5. The dedupe object write is treated differently from regular object writes: if the object is present, then an object reference count is incremented and the object is not overwritten. This forms the basis of the dedupe logic. Multiple objects with one or more identical constituent segments start sharing the segment objects.
>> 6. Once all the segments are successfully written, the object 'object.abcd' is now just a stub object with the segment manifest as described above, and it goes through a regular object write sequence
>>
>> Partial writes on objects will be complicated,
>> - Partially affected segments will have to be read and segmentation logic has to be run from first to last affected segment boundaries
>> -  New segments will be written
>> - Old overwritten segments have to be deleted
>> - Write merged manifest of the object
>>
>> All this will need the protection of the PG lock. An additional journaling mechanism will also be needed to recover from cases where the OSD goes down before writing all the segments.
>>
>> Since this is quite a lot of processing, a better use case for this dedupe mechanism would be in the data tiering model with object redirects.
>> The manifest object fits quite well into the object redirects scheme of things. The idea is that when an object is moved out of the base tier, you have an option to create a dedupe stub object and write individual segments into the cold backend tier with a rados plugin.
>>
>> Remaining responses inline.
>>
>> Regards,
>> Chaitanya
>>
>> -----Original Message-----
>> From: James (Fei) Liu-SSI [mailto:james.liu@ssi.samsung.com]
>> Sent: Wednesday, July 01, 2015 4:00 AM
>> To: Chaitanya Huilgol; Allen Samuels; Haomai Wang
>> Cc: ceph-devel
>> Subject: RE: Inline dedup/compression
>>
>> Hi Chaitanya,
>>    Very interesting thoughts. I am not sure whether I get all of them or not. Here are several questions about the solution you provided; they might be a little bit detailed.
>>
>>     Regards,
>>     James
>>
>> - Dedupe is set as a pool property
>> Write:
>> - Write arrives at the primary OSD/pg
>> [James] Does the OSD/PG mean PG Backend over here?
>> [Chaitanya] I mean the Primary OSD and the PG which get selected by the crush - not the specific OSD component
>>
>> - Data is segmented (rabin/static) and secure hash computed
>> [James] Which component in the OSD is going to do the data segmentation and hash computation?
>> [Chaitanya] If partial writes are not supported then this could be done before acquiring the PG lock, else we need the protection of the PG lock.  Probably in the do_request() path?
>>
>> - A manifest is created with the offset/len/hash for all the segments
>> [James] Is the manifest going to be part of the xattr of the object? Where are you going to save the manifest?
>> [Chaitanya] The manifest is a stub object with the constituent segments list
>>
>> - OSD/pg sends rados write with a special name <__known__prefix><secure hash> for all segments
>> [James] What do you mean by RADOS write? Where are all the segments with secure hash signatures written to?
>> [Chaitanya] All segments are unique objects with the above mentioned naming scheme, they get written back into the cluster as a regular client rados object write
>>
>> - PG receiving dedup write will:
>>         1. check for object presence and create object if not present
>>         2. If object is already present, then a reference count is incremented (check and increment needs to be atomic)
>> [James] It makes sense. But I was wondering whether the unit for dedupe is the segment or the object? If object based, it totally makes sense. However, why do we need segments with a manifest?
>>
>> - Response is received by original primary PG for all segments
>> [James] What response?
>> [Chaitanya] Write response indicating the status of the segment object write
>>
>> - Primary PG writes the manifest to local and replicas or EC members
>> [James] How about the dedupe data if the data is not present in replicas?
>> [Chaitanya] I am sorry, I did not get your question. The manifest object gets written in the primary and the replicas, or encoded and written to the EC members; it is afforded the protection policy set for the pool. The same is the case with the individual constituent segments.
>>
>> - Response sent to client
>>
>> Read:
>> - Read received at primary PG
>> [James]  The read can only fetch data from Primary PG?
>> - Reads manifest object
>>
>> - sends reads for each segment object <__known__prefix><secure hash>
>> - coalesces all the response to build the required data
>> - Responds to client
>>
>>
>> Pros:
>> - No need for a centralized hash index, so in line with the Ceph no-bottleneck philosophy
>>
>> Cons:
>> - Some PGs may get overloaded due to frequently occurring segment patterns
>> - Latency and increased traffic on the network
>>
>>
>>
>> -----Original Message-----
>> From: Chaitanya Huilgol [mailto:Chaitanya.Huilgol@sandisk.com]
>> Sent: Tuesday, June 30, 2015 8:50 AM
>> To: Allen Samuels; James (Fei) Liu-SSI; Haomai Wang
>> Cc: ceph-devel
>> Subject: RE: Inline dedup/compression
>>
>>
>> - Reference count has to be maintained as an attribute of the object
>> - As mentioned in the write workflow, duplicate segment writes increment the reference count
>> - Object Delete would result in delete on constituent segments listed in the object segment manifest
>> - Segment object delete will decrement reference count and remove the segment when there are no more references present
>>
>> Regards,
>> Chaitanya
>>
>> -----Original Message-----
>> From: Allen Samuels
>> Sent: Tuesday, June 30, 2015 9:02 PM
>> To: Chaitanya Huilgol; James (Fei) Liu-SSI; Haomai Wang
>> Cc: ceph-devel
>> Subject: RE: Inline dedup/compression
>>
>> This covers the read and write, what about the delete? One of the major issues with Dedupe, whether global or local is to address the inherent ref-counting associated with sharing of pieces of storage.
>>
>> Allen Samuels
>> Software Architect, Emerging Storage Solutions
>>
>> 2880 Junction Avenue, Milpitas, CA 95134
>> T: +1 408 801 7030| M: +1 408 780 6416
>> allen.samuels@SanDisk.com
>>
>> -----Original Message-----
>> From: ceph-devel-owner@vger.kernel.org [mailto:ceph-devel-owner@vger.kernel.org] On Behalf Of Chaitanya Huilgol
>> Sent: Monday, June 29, 2015 11:20 PM
>> To: James (Fei) Liu-SSI; Haomai Wang
>> Cc: ceph-devel
>> Subject: RE: Inline dedup/compression
>>
>> Below is an alternative idea at a very high level around dedup with ceph without a need of centralized hash index,
>>
>> - Dedupe is set as a pool property
>> Write:
>> - Write arrives at the primary OSD/pg
>> - Data is segmented (rabin/static) and secure hash computed
>> - A manifest is created with the offset/len/hash for all the segments
>> - OSD/pg sends rados write with a special name <__known__prefix><secure hash> for all segments
>> - PG receiving dedup write will:
>>         1. check for object presence and create object if not present
>>         2. If object is already present, then a reference count is incremented (check and increment needs to be atomic)
>> - Response is received by original primary PG for all segments
>> - Primary PG writes the manifest to local and replicas or EC members
>> - Response sent to client
>>
>> Read:
>> - Read received at primary PG
>> - Reads manifest object
>> - sends reads for each segment object <__known__prefix><secure hash>
>> - coalesces all the response to build the required data
>> - Responds to client
>>
>>
>> Pros:
>> - No need for a centralized hash index, so in line with the Ceph no-bottleneck philosophy
>>
>> Cons:
>> - Some PGs may get overloaded due to frequently occurring segment patterns
>> - Latency and increased traffic on the network
>>
>> Regards,
>> Chaitanya
>>
>> -----Original Message-----
>> From: ceph-devel-owner@vger.kernel.org [mailto:ceph-devel-owner@vger.kernel.org] On Behalf Of James (Fei) Liu-SSI
>> Sent: Tuesday, June 30, 2015 2:25 AM
>> To: Haomai Wang
>> Cc: ceph-devel
>> Subject: RE: Inline dedup/compression
>>
>> Hi Haomai,
>>   Thanks for moving the idea forward. Regarding compression: if we do compression at the client level, it is not global, and the compression is only applied at the local client, am I right?  I think there are pros and cons to both solutions and we can get into more details for each.
>>   I really like your idea for dedupe on the OSD side, by the way. Let me think more about it.
>>
>>  Regards,
>>  James
>>
>> -----Original Message-----
>> From: Haomai Wang [mailto:haomaiwang@gmail.com]
>> Sent: Friday, June 26, 2015 8:55 PM
>> To: James (Fei) Liu-SSI
>> Cc: ceph-devel
>> Subject: Re: Inline dedup/compression
>>
>> On Sat, Jun 27, 2015 at 2:03 AM, James (Fei) Liu-SSI <james.liu@ssi.samsung.com> wrote:
>>> Hi Haomai,
>>>   Thanks for your response as always. I agree compression is a comparatively easier task, but it is still very challenging to implement no matter where we implement it. The client side (RBD, RADOSGW or CephFS) or the PG should be a slightly better place to implement it in terms of efficiency and cost reduction, before the data is duplicated to other OSDs. There are two reasons:
>>> 1. Keeping the data consistent among OSDs in one PG
>>> 2. Saving computing resources
>>>
>>> IMHO, the compression should be accomplished before replication comes into play at the pool level. However, we can also have a second level of compression in the local objectstore. The unit size of compression really depends on the workload and on the layer in which we implement it.
>>>
>>> About inline deduplication, it will dramatically increase the complexity if we bring replication and Erasure Coding into consideration.
>>>
>>> However, before we talk about implementation, it would be great if we could understand the pros and cons of implementing inline dedupe/compression. We all understand the benefits of dedupe/compression. However, the side effects are a performance hit and the need for more computing resources. It would be great if we could understand the problems from 30,000 feet, for the whole picture of Ceph. Please correct me if I am wrong.
>>
>> Actually we may have some tricks to reduce the performance hit of compression. As Joe mentioned, we can compress slave PG data to avoid the performance hit, but it may increase the complexity of recovery and PG remapping. Another implementation detail: if we compress data starting from the messenger, the OSD thread and PG thread won't access the data for a normal client op, so maybe we can do it in parallel with PG processing. The journal thread will get the compressed data at the end.
>>
>> The effectiveness of compression is also a concern: if we do compression in RADOS we may not get the best compression result. If we can do compression in libcephfs, librbd and radosgw and keep RADOS unaware of compression, it may be simpler and we can get file/block/object level compression. Wouldn't that be better?
>>
>> About dedup, my current idea is that we could set up a memory pool on the OSD side to store checksums. Then, at the client side, we calculate the checksum of the object data and map it to a PG instead of using the object name, so an object always lands on an OSD that is also responsible for its dedup storage. It could also be distributed at the pool level.
>>
>>
>>>
>>> By the way, software-defined storage startups like Hedvig and Springpath both provide inline dedupe/compression. It is not an apples-to-apples comparison, but it is a good reference. Datacenters need cost-effective solutions.
>>>
>>> Regards,
>>> James
>>>
>>>
>>>
>>> -----Original Message-----
>>> From: Haomai Wang [mailto:haomaiwang@gmail.com]
>>> Sent: Thursday, June 25, 2015 8:08 PM
>>> To: James (Fei) Liu-SSI
>>> Cc: ceph-devel
>>> Subject: Re: Inline dedup/compression
>>>
>>> On Fri, Jun 26, 2015 at 6:01 AM, James (Fei) Liu-SSI <james.liu@ssi.samsung.com> wrote:
>>>> Hi Cephers,
>>>>     It is not easy to ask when Ceph is going to support inline dedup/compression across OSDs in RADOS, because it is neither an easy task nor an easy question to answer. Ceph provides replication and EC for performance and failure recovery, but we also lose storage efficiency and pay the associated cost. These goals somewhat contradict each other. I am curious what other Cephers think about this question.
>>>>    Is there any plan for Ceph to do anything about inline dedupe/compression, other than the features provided by the local node itself, such as BTRFS?
>>>
>>> Compression is easier to implement in RADOS than dedup. The most important question about compression is where we compress: client, PG, or objectstore. Then we need to decide how large the compression unit is. Of course, compression and dedup would both like a key-value-like storage API, but I think it's not difficult to use the existing objectstore API.
>>>
>>> Dedup is more feasible to implement within a local OSD than across the whole pool or cluster; if we want dedup at the pool level, we need to do it from the client.
>>>
>>>>
>>>>   Regards,
>>>>   James
>>>> --
>>>> To unsubscribe from this list: send the line "unsubscribe ceph-devel"
>>>> in the body of a message to majordomo@vger.kernel.org More majordomo
>>>> info at  http://vger.kernel.org/majordomo-info.html
>>>
>>>
>>>
>>> --
>>> Best Regards,
>>>
>>> Wheat
>>
>>
>>
>> --
>> Best Regards,
>>
>> Wheat
>
>
>
> --
> Best Regards,
>
> Wheat



-- 
Best Regards,

Wheat

^ permalink raw reply	[flat|nested] 28+ messages in thread

* RE: Inline dedup/compression
  2015-08-21  3:37                         ` Allen Samuels
@ 2015-08-21  4:43                           ` Chaitanya Huilgol
  2015-08-21  4:44                             ` Allen Samuels
  0 siblings, 1 reply; 28+ messages in thread
From: Chaitanya Huilgol @ 2015-08-21  4:43 UTC (permalink / raw)
  To: Allen Samuels, Haomai Wang; +Cc: James (Fei) Liu-SSI, ceph-devel

Hi,

The original idea of dedupe was to make it cluster wide. If we go with a filestore or keyvalue-store based dedupe/compression, then isn't it localized to the OSD? W.r.t. the Ceph architecture of object distribution, won't the probability of objects with same/similar data landing on the same OSD be pretty low?
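
A toy model of that probability, as a sketch only. `osd_for` hashes the object name straight onto an OSD and is a stand-in for real CRUSH placement; `segment_name` and its `dedupe_` prefix are hypothetical stand-ins for the <reserved_dedupe_prefix><sha> naming quoted below.

```python
import hashlib

def osd_for(name, num_osds):
    """Toy placement: hash the object name onto an OSD (not real CRUSH)."""
    digest = hashlib.sha256(name.encode()).digest()
    return int.from_bytes(digest[:8], "big") % num_osds

def colocated_fraction(num_osds, pairs=20000):
    """Fraction of object pairs with unrelated names that share an OSD."""
    same = 0
    for i in range(pairs):
        if osd_for(f"obj-a-{i}", num_osds) == osd_for(f"obj-b-{i}", num_osds):
            same += 1
    return same / pairs

def segment_name(data):
    """Content-addressed name: identical data always gets the same name."""
    return "dedupe_" + hashlib.sha256(data).hexdigest()

print(f"10 OSDs: {colocated_fraction(10):.3f} of pairs land on the same OSD")
```

With name-based placement, two objects holding the same data meet on one OSD only about 1 time in N for N OSDs, so OSD-local dedupe would miss most duplicates. Naming segments by their content hash sends identical data to the same place regardless of the original object name.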

Regards,
Chaitanya

-----Original Message-----
From: Allen Samuels 
Sent: Friday, August 21, 2015 9:07 AM
To: Haomai Wang
Cc: Chaitanya Huilgol; James (Fei) Liu-SSI; ceph-devel
Subject: RE: Inline dedup/compression

XFS shouldn't have any trouble with the "holes" scheme. I don't know BTRFS as well, but I doubt it's significantly different.

If we assume that the logical address space of a file is broken up into fixed sized chunks on fixed size boundaries (presumably a power of 2) then the implementation is quite straightforward.

Picking the chunk size will be a key issue for performance. Unfortunately, there are competing desires.

For best space utilization, you'll want the chunk size to be large, because on average you'll lose 1/2 of a file system sector/block for each chunk of compressed data.

For best R/W performance, you'll want the chunk size to be small, because logically the file I/O size is equal to a chunk, i.e., on a write you might have to read the corresponding chunk, decompress it, insert the new data and recompress it. This gets super duper ugly on FileStore because you can't afford to crash during the re-write update and risk a partially updated chunk (this will give you garbage when you decompress it). This means that you'll have to log the entire chunk even if you're only re-writing a small portion of it. Hence the desire to make the chunksize small. I'm not as familiar with NewStore, but I don't think it's fundamentally much better. Basically any form of sub-chunk write-operation stinks in performance. Sub-chunk read operations aren't too bad unless the chunk size is ridiculously large. 

For best compression ratios, you'll want the chunk size to be at least equal to the history size if not 2 or 3 times larger (64K history size when using zlib, snappy is 32K or 64K for the latest version)

The partial-block write problem doesn't exist for RGW objects, and its objects are probably already compressed. That means you'll want to be able to convey the compression parameters to RADOS so that the backend knows what to do.

I would add a per-file attribute that encodes the compression parameters:  compression algorithm (zlib, snappy, ...) and chunksize. That would also provide backward compatibility and allow per-object compression diversity.

Then you'd want to add verbiage to the individual access schemes to allow/disallow compression. For file systems you'd want that on a per-directory basis or perhaps even better a set of regular expressions.


Allen Samuels
Software Architect, Systems and Software Solutions 

2880 Junction Avenue, San Jose, CA 95134
T: +1 408 801 7030| M: +1 408 780 6416
allen.samuels@SanDisk.com


-----Original Message-----
From: Haomai Wang [mailto:haomaiwang@gmail.com]
Sent: Thursday, August 20, 2015 8:01 PM
To: Allen Samuels
Cc: Chaitanya Huilgol; James (Fei) Liu-SSI; ceph-devel
Subject: Re: Inline dedup/compression

sorry, should be this
blog(http://mysqlserverteam.com/innodb-transparent-page-compression/)

On Fri, Aug 21, 2015 at 10:51 AM, Haomai Wang <haomaiwang@gmail.com> wrote:
> I found a blog post
> (http://mysqlserverteam.com/innodb-transparent-pageio-compression/)
> about MySQL InnoDB transparent compression. It's surprising that
> InnoDB does it at a low level (just like FileStore in Ceph) and relies
> on the filesystem's file hole feature. I'm very skeptical about the
> performance after storing lots of *small* hole files on the fs. If it
> is reliable, it would be easy for FileStore/NewStore to implement a
> similar feature.
>
> On Fri, Jul 3, 2015 at 1:13 PM, Allen Samuels <Allen.Samuels@sandisk.com> wrote:
>> For non-overwriting relatively large objects, this scheme works fine. Unfortunately the real use-case for deduplication is block storage with virtualized infrastructure (eliminating duplicate operating system files and applications, etc.) and in order for this to provide good deduplication, you'll need a block size that's equal to or smaller than the cluster-size of the file system mounted on the block device. Meaning that your storage is now dominated by small chunks (probably 8K-ish) rather than the relatively large 4M stripes that are used today (this will also kill EC since small objects are replicated rather than ECed). This will have a massive impact on backend storage I/O as the basic data/metadata ratio is completely skewed (both for static storage and dynamic I/O count).
>>
>>
>> Allen Samuels
>> Software Architect, Emerging Storage Solutions
>>
>> 2880 Junction Avenue, Milpitas, CA 95134
>> T: +1 408 801 7030| M: +1 408 780 6416 allen.samuels@SanDisk.com
>>
>>
>> -----Original Message-----
>> From: Chaitanya Huilgol
>> Sent: Thursday, July 02, 2015 3:50 AM
>> To: James (Fei) Liu-SSI; Allen Samuels; Haomai Wang
>> Cc: ceph-devel
>> Subject: RE: Inline dedup/compression
>>
>> Hi James et al.,
>>
>> Here is an example for clarity,
>> 1. Client writes object object.abcd
>> 2. Based on the CRUSH rules, say OSD.a is the primary OSD which receives the write
>> 3. OSD.a performs segmenting/fingerprinting, which can be static or dynamic, and generates a list of segments. The object.abcd is now represented by a manifest object with the list of segment hashes and lengths:
>>      [Header]
>>      [Seg1_sha, len]
>>      [Seg2_sha, len]
>>      ...
>>      [Seg3_sha, len]
>> 4. OSD.a writes each segment as a new object in the cluster with object name <reserved_dedupe_prefix><sha>
>> 5. The dedupe object write is treated differently from regular object writes: if the object is present, then an object reference count is incremented and the object is not overwritten. This forms the basis of the dedupe logic. Multiple objects with one or more identical constituent segments start sharing the segment objects.
>> 6. Once all the segments are successfully written, the object 'object.abcd' is now just a stub object with the segment manifest as described above, and it goes through a regular object write sequence
>>
>> Partial writes on objects will be complicated,
>> - Partially affected segments will have to be read and segmentation 
>> logic has to be run from first to last affected segment boundaries
>> -  New segments will be written
>> - Old overwritten segments have to be deleted
>> - Write merged manifest of the object
>>
>> All this will need the protection of the PG lock. An additional journaling mechanism will also be needed to recover from cases where the OSD goes down before writing all the segments.
>>
>> Since this is quite a lot of processing, a better use case for this dedupe mechanism would be in the data tiering model with object redirects.
>> The manifest object fits quite well into the object redirects scheme of things. The idea is that when an object is moved out of the base tier, you have an option to create a dedupe stub object and write individual segments into the cold backend tier with a rados plugin.
>>
>> Remaining responses inline.
>>
>> Regards,
>> Chaitanya
>>
>> -----Original Message-----
>> From: James (Fei) Liu-SSI [mailto:james.liu@ssi.samsung.com]
>> Sent: Wednesday, July 01, 2015 4:00 AM
>> To: Chaitanya Huilgol; Allen Samuels; Haomai Wang
>> Cc: ceph-devel
>> Subject: RE: Inline dedup/compression
>>
>> Hi Chaitanya,
>>    Very interesting thoughts. I am not sure whether I get all of them or not. Here are several questions about the solution you provided; they might be a little bit detailed.
>>
>>     Regards,
>>     James
>>
>> - Dedupe is set as a pool property
>> Write:
>> - Write arrives at the primary OSD/pg
>> [James] Does the OSD/PG mean PG Backend over here?
>> [Chaitanya] I mean the Primary OSD and the PG which get selected by the crush - not the specific OSD component
>>
>> - Data is segmented (rabin/static) and secure hash computed
>> [James] Which component in the OSD is going to do the data segmentation and hash computation?
>> [Chaitanya] If partial writes are not supported then this could be done before acquiring the PG lock, else we need the protection of the PG lock.  Probably in the do_request() path?
>>
>> - A manifest is created with the offset/len/hash for all the segments
>> [James] Is the manifest going to be part of the xattr of the object? Where are you going to save the manifest?
>> [Chaitanya] The manifest is a stub object with the constituent segments list
>>
>> - OSD/pg sends rados write with a special name <__known__prefix><secure hash> for all segments
>> [James] What do you mean by RADOS write? Where are all the segments with secure hash signatures written to?
>> [Chaitanya] All segments are unique objects with the above mentioned naming scheme, they get written back into the cluster as a regular client rados object write
>>
>> - PG receiving dedup write will:
>>         1. check for object presence and create object if not present
>>         2. If object is already present, then a reference count is incremented (check and increment needs to be atomic)
>> [James] It makes sense. But I was wondering whether the unit for dedupe is the segment or the object? If object based, it totally makes sense. However, why do we need segments with a manifest?
>>
>> - Response is received by original primary PG for all segments
>> [James] What response?
>> [Chaitanya] Write response indicating the status of the segment object write
>>
>> - Primary PG writes the manifest to local and replicas or EC members
>> [James] How about the dedupe data if the data is not present in replicas?
>> [Chaitanya] I am sorry, I did not get your question. The manifest object gets written in the primary and the replicas, or encoded and written to the EC members; it is afforded the protection policy set for the pool. The same is the case with the individual constituent segments.
>>
>> - Response sent to client
>>
>> Read:
>> - Read received at primary PG
>> [James]  The read can only fetch data from Primary PG?
>> - Reads manifest object
>>
>> - sends reads for each segment object <__known__prefix><secure hash>
>> - coalesces all the response to build the required data
>> - Responds to client
>>
>>
>> Pros:
>> - No need for a centralized hash index, so in line with the Ceph no-bottleneck philosophy
>>
>> Cons:
>> - Some PGs may get overloaded due to frequently occurring segment patterns
>> - Latency and increased traffic on the network
>>
>>
>>
>> -----Original Message-----
>> From: Chaitanya Huilgol [mailto:Chaitanya.Huilgol@sandisk.com]
>> Sent: Tuesday, June 30, 2015 8:50 AM
>> To: Allen Samuels; James (Fei) Liu-SSI; Haomai Wang
>> Cc: ceph-devel
>> Subject: RE: Inline dedup/compression
>>
>>
>> - Reference count has to be maintained as an attribute of the object
>> - As mentioned in the write workflow, duplicate segment writes 
>> increment the reference count
>> - Object Delete would result in delete on constituent segments listed 
>> in the object segment manifest
>> - Segment object delete will decrement reference count and remove the 
>> segment when there are no more references present
>>
>> Regards,
>> Chaitanya
>>
>> -----Original Message-----
>> From: Allen Samuels
>> Sent: Tuesday, June 30, 2015 9:02 PM
>> To: Chaitanya Huilgol; James (Fei) Liu-SSI; Haomai Wang
>> Cc: ceph-devel
>> Subject: RE: Inline dedup/compression
>>
>> This covers the read and write, what about the delete? One of the major issues with Dedupe, whether global or local is to address the inherent ref-counting associated with sharing of pieces of storage.
>>
>> Allen Samuels
>> Software Architect, Emerging Storage Solutions
>>
>> 2880 Junction Avenue, Milpitas, CA 95134
>> T: +1 408 801 7030| M: +1 408 780 6416 allen.samuels@SanDisk.com
>>
>> -----Original Message-----
>> From: ceph-devel-owner@vger.kernel.org 
>> [mailto:ceph-devel-owner@vger.kernel.org] On Behalf Of Chaitanya 
>> Huilgol
>> Sent: Monday, June 29, 2015 11:20 PM
>> To: James (Fei) Liu-SSI; Haomai Wang
>> Cc: ceph-devel
>> Subject: RE: Inline dedup/compression
>>
>> Below is an alternative idea, at a very high level, for dedup with 
>> Ceph without the need for a centralized hash index:
>>
>> - Dedupe is set as a pool property
>> Write:
>> - Write arrives at the primary OSD/pg
>> - Data is segmented (rabin/static) and secure hash computed
>> - A manifest is created with the offset/len/hash for all the segments
>> - OSD/pg sends rados write with a special name 
>> <__known__prefix><secure hash> for all segments
>> - PG receiving dedup write will:
>>         1. check for object presence and create object if not present
>>         2. If object is already present, then a reference count is 
>> incremented (check and increment need to be atomic)
>> - Response is received by original primary PG for all segments
>> - Primary PG writes the manifest to local and replicas or EC members
>> - Response sent to client
>>
>> Read:
>> - Read received at primary PG
>> - Reads manifest object
>> - Sends reads for each segment object <__known__prefix><secure hash>
>> - Coalesces all the responses to build the required data
>> - Responds to client
>>
>>
>> Pros:
>> No need for a centralized hash index, so it is in line with Ceph's
>> no-bottleneck philosophy
>>
>> Cons:
>> Some PGs may get overloaded due to frequently occurring segment patterns
>> Latency and increased traffic on the network
>>
>> Regards,
>> Chaitanya
>>
>> -----Original Message-----
>> From: ceph-devel-owner@vger.kernel.org 
>> [mailto:ceph-devel-owner@vger.kernel.org] On Behalf Of James (Fei) 
>> Liu-SSI
>> Sent: Tuesday, June 30, 2015 2:25 AM
>> To: Haomai Wang
>> Cc: ceph-devel
>> Subject: RE: Inline dedup/compression
>>
>> Hi Haomai,
>>   Thanks for moving the idea forward. Regarding compression: if we do compression at the client level, it is not global, and the compression is only applied by the local client. Am I right? I think there are pros and cons to both solutions, and we can get into more detail on each.
>>   I really like your idea for dedupe on the OSD side, by the way. Let me think more about it.
>>
>>  Regards,
>>  James
>>
>> -----Original Message-----
>> From: Haomai Wang [mailto:haomaiwang@gmail.com]
>> Sent: Friday, June 26, 2015 8:55 PM
>> To: James (Fei) Liu-SSI
>> Cc: ceph-devel
>> Subject: Re: Inline dedup/compression
>>
>> On Sat, Jun 27, 2015 at 2:03 AM, James (Fei) Liu-SSI <james.liu@ssi.samsung.com> wrote:
>>> Hi Haomai,
>>>   Thanks for your response, as always. I agree compression is a comparatively easier task, but it is still very challenging to implement no matter where we implement it. The client side (RBD, RADOSGW, or CephFS) or the PG should be a somewhat better place to implement it in terms of efficiency and cost reduction, because the data is compressed before it is replicated to other OSDs. There are two reasons:
>>> 1. Keeping the data consistent among OSDs in one PG
>>> 2. Saving computing resources
>>>
>>> IMHO, the compression should be done before replication comes into play at the pool level. However, we can also have a second level of compression in the local objectstore. The unit size of compression really depends on the workload and on the layer in which we implement it.
>>>
>>> As for inline deduplication, the complexity increases dramatically once we take replication and erasure coding into consideration.
>>>
>>> However, before we talk about implementation, it would be great to understand the pros and cons of implementing inline dedupe/compression. We all understand the benefits of dedupe/compression. The side effects are a performance hit and the need for more computing resources. It would be great if we could understand the problems from 30,000 feet, for the whole picture of Ceph. Please correct me if I am wrong.
>>
>> Actually, we may have some tricks to reduce the performance hit of compression. As Joe mentioned, we can compress only the slave PG data to avoid the performance hit, but that may increase the complexity of recovery and PG remapping. Another implementation detail: if we begin compressing data in the messenger, the OSD thread and PG thread won't access the data for a normal client op, so maybe we can run compression in parallel with PG processing. The journal thread will get the compressed data at the end.
>>
>> The effectiveness of compression is also a concern: if we compress in RADOS, we may not get the best compression result. If we compress in libcephfs, librbd and radosgw and keep RADOS unaware of compression, it may be simpler, and we get file/block/object-level compression. Wouldn't that be better?
>>
>> About dedup, my current idea is that we could set up a memory pool on the OSD side to store checksums. Then, on the client side, we hash the object data and map it to a PG using that hash instead of the object name, so an object always lands on the OSD that is also responsible for its dedup storage. It could also be distributed at the pool level.
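
As a minimal sketch of the content-based placement idea above: if the PG is chosen from a hash of the data rather than the object name, identical data always maps to the same PG regardless of its name. The functions pg_by_name and pg_by_content are hypothetical, and a plain modulo stands in for the real placement logic.

```python
import hashlib


def pg_by_name(name: str, pg_num: int) -> int:
    """Name-based placement: identical data under different names may scatter."""
    digest = hashlib.sha256(name.encode()).digest()
    return int.from_bytes(digest[:8], "big") % pg_num


def pg_by_content(data: bytes, pg_num: int) -> int:
    """Content-based placement: identical data always maps to the same PG."""
    digest = hashlib.sha256(data).digest()
    return int.from_bytes(digest[:8], "big") % pg_num


payload = b"identical block contents"
print(pg_by_name("vm-a/block-7", 128), pg_by_name("vm-b/block-7", 128))
print(pg_by_content(payload, 128))
```

The trade-off is that the object name no longer locates the data by itself, so some name-to-hash mapping still has to be kept somewhere.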
>>
>>
>>>
>>> By the way, software-defined storage startups like Hedvig and Springpath both provide inline dedupe/compression. It is not an apples-to-apples comparison, but it is a good reference. Datacenters need cost-effective solutions.
>>>
>>> Regards,
>>> James
>>>
>>>
>>>
>>> -----Original Message-----
>>> From: Haomai Wang [mailto:haomaiwang@gmail.com]
>>> Sent: Thursday, June 25, 2015 8:08 PM
>>> To: James (Fei) Liu-SSI
>>> Cc: ceph-devel
>>> Subject: Re: Inline dedup/compression
>>>
>>> On Fri, Jun 26, 2015 at 6:01 AM, James (Fei) Liu-SSI <james.liu@ssi.samsung.com> wrote:
>>>> Hi Cephers,
>>>>     It is not easy to ask when Ceph is going to support inline dedup/compression across OSDs in RADOS, because it is not an easy task and not an easy question to answer. Ceph provides replication and EC for performance and failure recovery, but we also lose storage efficiency and pay the cost associated with it. The two goals somewhat contradict each other. I am curious what other Cephers think about this question.
>>>>    Are there any plans for inline dedupe/compression in Ceph, other than the features provided by the local node itself, like Btrfs?
>>>
>>> Compression is easier to implement in RADOS than dedup. The most important question about compression is where we compress: client, PG or objectstore. Then we need to decide how large the compression unit is. Of course, both compression and dedup would like a key-value-like storage API, but I think it is not difficult to use the existing objectstore API.
>>>
>>> Dedup is more feasible to implement within a local OSD than across the whole pool or cluster; if we want dedup at the pool level, we need to do it from the client.
>>>
>>>>
>>>>   Regards,
>>>>   James
>>>> --
>>>> To unsubscribe from this list: send the line "unsubscribe ceph-devel"
>>>> in the body of a message to majordomo@vger.kernel.org More 
>>>> majordomo info at  http://vger.kernel.org/majordomo-info.html
>>>
>>>
>>>
>>> --
>>> Best Regards,
>>>
>>> Wheat
>>
>>
>>
>> --
>> Best Regards,
>>
>> Wheat
>
>
>
> --
> Best Regards,
>
> Wheat



--
Best Regards,

Wheat


* RE: Inline dedup/compression
  2015-08-21  4:43                           ` Chaitanya Huilgol
@ 2015-08-21  4:44                             ` Allen Samuels
  0 siblings, 0 replies; 28+ messages in thread
From: Allen Samuels @ 2015-08-21  4:44 UTC (permalink / raw)
  To: Chaitanya Huilgol, Haomai Wang; +Cc: James (Fei) Liu-SSI, ceph-devel

I was referring strictly to compression. Dedupe is a whole 'nother issue.

I agree that dedupe on a per-OSD basis isn't interesting. It needs to be done at the pool level (or higher). 


Allen Samuels
Software Architect, Systems and Software Solutions 

2880 Junction Avenue, San Jose, CA 95134
T: +1 408 801 7030| M: +1 408 780 6416
allen.samuels@SanDisk.com


-----Original Message-----
From: Chaitanya Huilgol 
Sent: Thursday, August 20, 2015 9:43 PM
To: Allen Samuels; Haomai Wang
Cc: James (Fei) Liu-SSI; ceph-devel
Subject: RE: Inline dedup/compression

Hi,

The original idea of dedupe was to make it cluster-wide. If we go with a FileStore or keyvalue-store based dedupe/compression, then isn't it localized to the OSD? Given the Ceph architecture of object distribution, won't the probability of objects with the same/similar data landing on the same OSD be pretty low?
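
To put a rough number on that probability, here is a small simulation. It assumes uniform name-hash placement onto equally weighted OSDs and ignores replicas; primary_osd is a hypothetical stand-in for CRUSH, and the object names are made up.

```python
import hashlib


def primary_osd(object_name: str, num_osds: int) -> int:
    """Toy placement: hash the object name to pick an OSD (stand-in for CRUSH)."""
    digest = hashlib.sha256(object_name.encode()).digest()
    return int.from_bytes(digest[:8], "big") % num_osds


def colocated_fraction(num_pairs: int, num_osds: int) -> float:
    """Fraction of same-content object pairs whose names map to the same OSD."""
    hits = 0
    for i in range(num_pairs):
        a = primary_osd(f"vm-a/block-{i}", num_osds)
        b = primary_osd(f"vm-b/block-{i}", num_osds)
        hits += (a == b)
    return hits / num_pairs


fraction = colocated_fraction(20000, 10)
print(f"{fraction:.3f}")  # roughly 1/num_osds under uniform placement
```

Under these assumptions the chance that two copies of the same data share an OSD is about 1/N for N OSDs, so per-OSD dedupe finds fewer duplicates as the cluster grows.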

Regards,
Chaitanya

-----Original Message-----
From: Allen Samuels
Sent: Friday, August 21, 2015 9:07 AM
To: Haomai Wang
Cc: Chaitanya Huilgol; James (Fei) Liu-SSI; ceph-devel
Subject: RE: Inline dedup/compression

XFS shouldn't have any trouble with the "holes" scheme. I don't know BTRFS as well, but I doubt it's significantly different.

If we assume that the logical address space of a file is broken up into fixed sized chunks on fixed size boundaries (presumably a power of 2) then the implementation is quite straightforward.

Picking the chunk size will be a key issue for performance. Unfortunately, there are competing desires.

For best space utilization, you'll want the chunk size to be large, because on average you'll lose 1/2 of a file system sector/block for each chunk of compressed data.
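
A back-of-the-envelope version of that argument, assuming a 4096-byte file system block and a uniform compression ratio (both numbers are illustrative assumptions, and padding_overhead is a hypothetical helper):

```python
def padding_overhead(chunk_size: int, ratio: float, block_size: int = 4096) -> float:
    """Expected wasted space per chunk (half a block) relative to the compressed payload."""
    compressed = chunk_size / ratio
    return (block_size / 2) / compressed


for chunk in (8 * 1024, 64 * 1024, 1024 * 1024):
    print(chunk, f"{padding_overhead(chunk, ratio=2.0):.1%}")
```

With 2:1 compression, an 8K chunk loses on average half of its compressed size to block rounding, while the loss for a 1M chunk is well under one percent.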

For best R/W performance, you'll want the chunk size to be small, because logically the file I/O size is equal to a chunk, i.e., on a write you might have to read the corresponding chunk, decompress it, insert the new data and recompress it. This gets super duper ugly on FileStore because you can't afford to crash during the re-write update and risk a partially updated chunk (this will give you garbage when you decompress it). This means that you'll have to log the entire chunk even if you're only re-writing a small portion of it. Hence the desire to make the chunksize small. I'm not as familiar with NewStore, but I don't think it's fundamentally much better. Basically any form of sub-chunk write-operation stinks in performance. Sub-chunk read operations aren't too bad unless the chunk size is ridiculously large. 
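
The read-decompress-patch-recompress cycle can be sketched as below. This is a toy in-memory model using Python's zlib; ChunkedStore is hypothetical and does not address the crash-safety and journaling problem at all. It only shows that a small write costs a whole chunk.

```python
import zlib


class ChunkedStore:
    """Toy object that stores fixed-size logical chunks, each compressed independently."""

    def __init__(self, chunk_size: int):
        self.chunk_size = chunk_size
        self.chunks = {}               # chunk index -> compressed bytes
        self.bytes_recompressed = 0    # how much data the writes had to recompress

    def _load(self, index: int) -> bytearray:
        blob = self.chunks.get(index)
        if blob is None:
            return bytearray(self.chunk_size)      # unwritten chunk reads as zeros
        return bytearray(zlib.decompress(blob))

    def write(self, offset: int, data: bytes) -> None:
        pos = 0
        while pos < len(data):
            index, start = divmod(offset + pos, self.chunk_size)
            n = min(self.chunk_size - start, len(data) - pos)
            chunk = self._load(index)                          # read + decompress
            chunk[start:start + n] = data[pos:pos + n]         # patch in the new bytes
            self.chunks[index] = zlib.compress(bytes(chunk))   # recompress whole chunk
            self.bytes_recompressed += self.chunk_size
            pos += n

    def read(self, offset: int, length: int) -> bytes:
        out = bytearray()
        while length > 0:
            index, start = divmod(offset, self.chunk_size)
            n = min(self.chunk_size - start, length)
            out += self._load(index)[start:start + n]
            offset += n
            length -= n
        return bytes(out)


store = ChunkedStore(chunk_size=64 * 1024)
store.write(100, b"hello")
print(store.read(100, 5), store.bytes_recompressed)
```

A 5-byte write here recompresses a full 64K chunk, which is the write amplification that pushes toward small chunks.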

For best compression ratios, you'll want the chunk size to be at least equal to the history size, if not 2 or 3 times larger (64K history size when using zlib; snappy is 32K or 64K for the latest version).
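
The effect of chunk size on ratio is easy to see with independently compressed chunks. The sketch below uses Python's zlib at default settings on synthetic data; the data set is an assumption chosen so that the redundancy only shows up across more than 16 KiB.

```python
import random
import zlib


def compressed_total(data: bytes, chunk_size: int) -> int:
    """Total size when each chunk is compressed independently of the others."""
    return sum(
        len(zlib.compress(data[i:i + chunk_size]))
        for i in range(0, len(data), chunk_size)
    )


# Synthetic data (an assumption): a 16 KiB random pattern repeated 16 times,
# so the repetition is invisible to any compressor that sees 16 KiB or less.
rng = random.Random(0)
pattern = bytes(rng.getrandbits(8) for _ in range(16 * 1024))
data = pattern * 16

for chunk in (4 * 1024, 16 * 1024, 64 * 1024):
    print(chunk, compressed_total(data, chunk))
```

Small chunks cut the compressor off from the repeated pattern, so only the 64 KiB case compresses well here.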

The partial-block write problem doesn't exist for RGW objects, and its objects are probably already compressed, meaning that you'll want to be able to convey the compression parameters to RADOS so that the backend knows what to do.

I would add a per-file attribute that encodes the compression parameters:  compression algorithm (zlib, snappy, ...) and chunksize. That would also provide backward compatibility and allow per-object compression diversity.
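
One possible shape for such an attribute, as a sketch only. The attribute name "user.compression", the JSON encoding and the CompressionAttr type are all hypothetical; a real implementation would pick its own on-disk encoding.

```python
import json
from dataclasses import dataclass, asdict


@dataclass(frozen=True)
class CompressionAttr:
    algorithm: str    # e.g. "zlib", "snappy", or "none"
    chunk_size: int   # logical chunk size in bytes

    def encode(self) -> bytes:
        return json.dumps(asdict(self), sort_keys=True).encode()

    @staticmethod
    def decode(raw):
        # Objects written before the attribute existed carry no value:
        # treat them as uncompressed for backward compatibility.
        if raw is None:
            return CompressionAttr("none", 0)
        return CompressionAttr(**json.loads(raw))


xattrs = {"user.compression": CompressionAttr("zlib", 65536).encode()}
print(CompressionAttr.decode(xattrs.get("user.compression")))
print(CompressionAttr.decode(None))
```

Because the parameters travel with each object, old uncompressed objects and objects using different algorithms or chunk sizes can coexist in one pool.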

Then you'd want to add verbiage to the individual access schemes to allow/disallow compression. For file systems you'd want that on a per-directory basis or perhaps even better a set of regular expressions.


Allen Samuels
Software Architect, Systems and Software Solutions 

2880 Junction Avenue, San Jose, CA 95134
T: +1 408 801 7030| M: +1 408 780 6416
allen.samuels@SanDisk.com


-----Original Message-----
From: Haomai Wang [mailto:haomaiwang@gmail.com]
Sent: Thursday, August 20, 2015 8:01 PM
To: Allen Samuels
Cc: Chaitanya Huilgol; James (Fei) Liu-SSI; ceph-devel
Subject: Re: Inline dedup/compression

Sorry, it should be this blog post:
http://mysqlserverteam.com/innodb-transparent-page-compression/

On Fri, Aug 21, 2015 at 10:51 AM, Haomai Wang <haomaiwang@gmail.com> wrote:
> I found a blog post
> (http://mysqlserverteam.com/innodb-transparent-pageio-compression/)
> about MySQL InnoDB transparent compression. Surprisingly, InnoDB does
> it at a low level (just like FileStore in Ceph) and relies on the
> filesystem's file-hole feature. I am very suspicious about the
> performance after storing lots of files with *small* holes on the fs.
> If it is reliable, it would be easy for FileStore/NewStore to
> implement a similar feature.
>
> On Fri, Jul 3, 2015 at 1:13 PM, Allen Samuels <Allen.Samuels@sandisk.com> wrote:
>> For non-overwriting, relatively large objects, this scheme works fine. Unfortunately, the real use case for deduplication is block storage with virtualized infrastructure (eliminating duplicate operating system files, applications, etc.), and for this to provide good deduplication you'll need a block size that's equal to or smaller than the cluster size of the file system mounted on the block device. That means your storage is now dominated by small chunks (probably 8K-ish) rather than the relatively large 4M stripes that are used today (this will also kill EC, since small objects are replicated rather than ECed). This will have a massive impact on backend storage I/O, as the basic data/metadata ratio is completely skewed (both for static storage and dynamic I/O count).
>>
>>
>> Allen Samuels
>> Software Architect, Emerging Storage Solutions
>>
>> 2880 Junction Avenue, Milpitas, CA 95134
>> T: +1 408 801 7030| M: +1 408 780 6416 allen.samuels@SanDisk.com
>>
>>
>> -----Original Message-----
>> From: Chaitanya Huilgol
>> Sent: Thursday, July 02, 2015 3:50 AM
>> To: James (Fei) Liu-SSI; Allen Samuels; Haomai Wang
>> Cc: ceph-devel
>> Subject: RE: Inline dedup/compression
>>
>> Hi James et al.,
>>
>> Here is an example for clarity:
>> 1. Client writes object object.abcd
>> 2. Based on the crush rules, say OSD.a is the primary OSD which receives the write
>> 3. OSD.a performs segmenting/fingerprinting, which can be static or dynamic, and generates a list of segments. The object.abcd is now represented by a manifest object with the list of segment hashes and lengths:
>>  [Header]
>>  [Seg1_sha, len]
>>  [Seg2_sha, len]
>>  ...
>>  [Seg3_sha, len]
>> 4. OSD.a writes each segment as a new object in the cluster with object name <reserved_dedupe_prefix><sha>
>> 5. The dedupe object write is treated differently from regular object writes: if the object is present, then an object reference count is incremented and the object is not overwritten. This forms the basis of the dedupe logic. Multiple objects with one or more identical constituent segments start sharing the segment objects.
>> 6. Once all the segments are successfully written, the object 
>> 'object.abcd' is now just a stub object with the segment manifest as 
>> described above, and it goes through a regular object write sequence
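
The six steps above can be sketched with a toy in-memory cluster. Everything here is an illustrative assumption: Cluster and its methods are hypothetical, segmentation is static with a tiny segment size, SHA-256 stands in for the secure hash, and overwrites of existing objects are not handled.

```python
import hashlib

SEGMENT_SIZE = 4        # tiny static segment size, for illustration only
PREFIX = "__dedupe__"   # hypothetical stand-in for <reserved_dedupe_prefix>


class Cluster:
    """Toy cluster: a flat name -> object map standing in for RADOS."""

    def __init__(self):
        self.objects = {}    # name -> bytes (segment data) or list (manifest)
        self.refcount = {}   # segment object name -> reference count

    def write_segment(self, data: bytes) -> str:
        name = PREFIX + hashlib.sha256(data).hexdigest()
        if name in self.objects:
            self.refcount[name] += 1      # duplicate: bump the count, no overwrite
        else:
            self.objects[name] = data
            self.refcount[name] = 1
        return name

    def write_object(self, name: str, data: bytes) -> None:
        manifest = []
        for off in range(0, len(data), SEGMENT_SIZE):
            seg = data[off:off + SEGMENT_SIZE]
            manifest.append((self.write_segment(seg), len(seg)))
        self.objects[name] = manifest     # stub object holding only the manifest

    def read_object(self, name: str) -> bytes:
        # Read the manifest, fetch each segment object, and coalesce the data.
        return b"".join(self.objects[seg] for seg, _ in self.objects[name])


cluster = Cluster()
cluster.write_object("object.abcd", b"AAAABBBBAAAA")
cluster.write_object("object.efgh", b"AAAACCCC")
print(cluster.read_object("object.abcd"))
print(sorted(cluster.refcount.values()))
```

The segment "AAAA" is stored once and referenced three times across the two objects, which is where the space saving comes from.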
>>
>> Partial writes on objects will be complicated:
>> - Partially affected segments will have to be read, and the segmentation 
>> logic has to be run from the first to the last affected segment boundary
>> - New segments will be written
>> - Old overwritten segments have to be deleted
>> - The merged manifest of the object is written
>>
>> All this will need the protection of the PG lock. An additional journaling mechanism will also be needed to recover from cases where the OSD goes down before writing all the segments.
>>
>> Since this is quite a lot of processing, a better use case for this dedupe mechanism would be in the data tiering model with object redirects.
>> The manifest object fits quite well into the object redirect scheme: when an object is moved out of the base tier, you have the option to create a dedupe stub object and write the individual segments into the cold backend tier with a rados plugin.
>>
>> Remaining responses inline.
>>
>> Regards,
>> Chaitanya
>>
>> -----Original Message-----
>> From: James (Fei) Liu-SSI [mailto:james.liu@ssi.samsung.com]
>> Sent: Wednesday, July 01, 2015 4:00 AM
>> To: Chaitanya Huilgol; Allen Samuels; Haomai Wang
>> Cc: ceph-devel
>> Subject: RE: Inline dedup/compression
>>
>> Hi Chaitanya,
>>    Very interesting thoughts. I am not sure whether I got all of them or not. Here are several questions about the solution you proposed; they might be a little bit detailed.
>>
>>     Regards,
>>     James
>>
>> - Dedupe is set as a pool property
>> Write:
>> - Write arrives at the primary OSD/pg
>> [James] Does the OSD/PG mean the PG backend here?
>> [Chaitanya] I mean the primary OSD and the PG which get selected by 
>> CRUSH, not a specific OSD component
>>
>> - Data is segmented (rabin/static) and secure hash computed
>> [James] In which OSD component are you going to do the data segmentation and hash computation?
>> [Chaitanya] If partial writes are not supported, then this could be done before acquiring the PG lock; otherwise we need the protection of the PG lock. Probably in the do_request() path?
>>
>> - A manifest is created with the offset/len/hash for all the segments
>> [James] Is the manifest going to be part of the object's xattrs? Where are you going to save the manifest?
>> [Chaitanya] The manifest is a stub object with the constituent 
>> segments list
>>
>> - OSD/pg sends rados write with a special name <__known__prefix><secure hash> for all segments
>> [James] What do you mean by rados write? Where do all the segments with a secure hash signature get written to?
>> [Chaitanya] All segments are unique objects with the above mentioned 
>> naming scheme; they get written back into the cluster as a regular 
>> client rados object write
>>
>> - PG receiving dedup write will:
>>         1. check for object presence and create object if not present
>>         2. If object is already present, then a reference count is incremented (check and increment need to be atomic)
>> [James] It makes sense. But I was wondering whether the unit for dedupe is the segment or the object. If it is object based, it totally makes sense. However, why do we need segments with a manifest?
>>
>> - Response is received by original primary PG for all segments
>> [James] What response?
>> [Chaitanya] Write response indicating the status of the segment 
>> object write
>>
>> - Primary PG writes the manifest to local and replicas or EC members
>> [James] How about the dedupe data if the data is not present in replicas?
>> [Chaitanya] I am sorry, I did not get your question. The manifest object gets written to the primary and the replicas, or encoded and written to the EC members; it is afforded the protection policy set for the pool. The same is the case with the individual constituent segments.
>>
>> - Response sent to client
>>
>> Read:
>> - Read received at primary PG
>> [James] Can the read only fetch data from the primary PG?
>> - Reads manifest object
>>
>> - Sends reads for each segment object <__known__prefix><secure hash>
>> - Coalesces all the responses to build the required data
>> - Responds to client
>>
>>
>> Pros:
>> No need for a centralized hash index, so it is in line with Ceph's
>> no-bottleneck philosophy
>>
>> Cons:
>> Some PGs may get overloaded due to frequently occurring segment patterns
>> Latency and increased traffic on the network
>>
>>
>>
>
>
>
> --
> Best Regards,
>
> Wheat



--
Best Regards,

Wheat


end of thread, other threads:[~2015-08-21  5:15 UTC | newest]

Thread overview: 28+ messages
     [not found] <1534307780.99.1435609644861.JavaMail.root@thunderbeast.private.linuxbox.com>
2015-06-29 20:32 ` Inline dedup/compression Matt W. Benjamin
     [not found] <1840766443.51.1435851210328.JavaMail.root@thunderbeast.private.linuxbox.com>
2015-07-02 15:34 ` Matt W. Benjamin
2015-07-02 16:20   ` Chaitanya Huilgol
2015-06-25 22:01 James (Fei) Liu-SSI
2015-06-25 23:00 ` Benoît Canet
2015-06-26  3:08 ` Haomai Wang
2015-06-26 18:03   ` James (Fei) Liu-SSI
2015-06-26 18:21     ` Handzik, Joe
2015-06-27  3:54     ` Haomai Wang
2015-06-29 20:55       ` James (Fei) Liu-SSI
2015-06-30  6:03         ` Haomai Wang
2015-06-30  6:20           ` Blair Bethwaite
2015-06-30 14:38             ` Alexandre DERUMIER
2015-06-30  6:19         ` Chaitanya Huilgol
2015-06-30 15:31           ` Allen Samuels
2015-06-30 15:50             ` Chaitanya Huilgol
2015-06-30 22:29               ` James (Fei) Liu-SSI
2015-07-01 13:46                 ` Ning Yao
2015-07-02 10:50                 ` Chaitanya Huilgol
2015-07-03  5:13                   ` Allen Samuels
2015-08-21  2:51                     ` Haomai Wang
2015-08-21  3:01                       ` Haomai Wang
2015-08-21  3:37                         ` Allen Samuels
2015-08-21  4:43                           ` Chaitanya Huilgol
2015-08-21  4:44                             ` Allen Samuels
2015-06-29 11:01     ` Gregory Farnum
2015-06-29 18:42       ` James (Fei) Liu-SSI
2015-06-30  6:50 ` Dałek, Piotr
