From: Igor Fedotov
Subject: Compression implementation options
Date: Mon, 28 Sep 2015 18:41:55 +0300
Message-ID: <56095FC3.5070206@mirantis.com>
To: ceph-devel

Hi folks,

Here is a brief summary of potential compression implementation options.
I think we should choose the desired approach before starting work on
the compression feature.

Comments, additions and fixes are welcome.

Compression At Client - compression/decompression is performed at the
client level (preferably in Rados) before sending data to, or after
receiving data from, Ceph.
Pros:
    * The Ceph cluster isn’t loaded with additional computational burden.
    * All Ceph cluster components and data transfers benefit from
      reduced data volume.
    * Compression is transparent to Ceph cluster components.
Cons:
    * Weak clients can lack CPU resources to handle their traffic.
    * Any read/write access requires at least two sequential requests
      to the Ceph cluster to get data: the first to retrieve the
      “original to compressed” offset mapping for the desired data
      block, the second to get the compressed data block.
    * Random write access handling is tricky (see notes below). In this
      case even more requests to the cluster may be needed per single
      user request.

Compression At Replicated Pool - compression is performed by the primary
Ceph entities at the Replicated Pool level, prior to data replication.
Pros:
    * Clients benefit from cluster CPU resources.
    * Compression of a specific data block is performed at a single
      point only, so total CPU utilization for the Ceph cluster is
      lower.
    * Underlying Ceph components and data transfers benefit from
      reduced data volume.
Cons:
    * Clients that use EC pools directly lack compression unless it’s
      implemented there too.
    * In a two-tier model, data compression at the cache tier may be
      inappropriate for performance reasons. Compression at the cache
      tier also prevents cache removal when/if needed.
    * Random write access handling is tricky (see notes below).

Compression At Erasure Coded Pool - compression is performed by the
primary Ceph entities at the EC Pool level, prior to Erasure Coding.
Pros:
    * Clients benefit from cluster CPU resources.
    * Erasure Coding “inflates” the processed data block (by up to
      ~50%), so compressing before that step reduces CPU utilization.
    * Natural combination with EC. Compression and EC have similar
      purposes: saving storage space at the cost of CPU usage. One can
      reuse EC infrastructure and design solutions.
    * No need for random write access support: EC pools don’t provide
      that on their own, so we can reuse the same approach to resolve
      the issue when needed. Implementation becomes much easier.
    * Underlying Ceph components and data transfers benefit from
      reduced data volume.
Cons:
    * Limited applicability - clients that don’t use EC pools lack
      compression.

Compression At Ceph Filestore Entity - compression is performed by the
Ceph File Store component prior to saving object data to the underlying
file system.
Pros:
    * Clients benefit from cluster CPU resources.
Cons:
    * Random write access is tricky (see notes below).
    * From the cluster perspective, compression is performed either on
      each replicated block or on a block “inflated” by erasure coding.
      Total Ceph cluster CPU utilization for compression therefore
      becomes considerably higher (a threefold increase for replicated
      pools and a ~50% increase for EC pools).
    * No benefit from reduced data transfers over the network.
    * When an EC pool is used, the recovery procedure triggered by an
      OSD going down requires decompressing and recompressing the
      complete data set. This might considerably increase CPU
      utilization during recovery.

Compression Externally at File System - compression is performed at the
File Store node by means of the underlying file system.
Pros:
    * Compression is (mostly) transparent to Ceph.
    * Clients benefit from cluster CPU resources.
Cons:
    * File system “lock-in”. For now only BTRFS can be used, and its
      production readiness is questionable.
    * Limited flexibility - compression is a partition/mount point
      property. It is hard to get finer granularity (per pool or per
      object). No way to disable compression.
    * From the cluster perspective, compression is performed either on
      each replicated block or on a block “inflated” by erasure coding.
      Total Ceph cluster CPU utilization for compression therefore
      becomes considerably higher (a threefold increase for replicated
      pools and a ~50% increase for EC pools).
    * No benefit from reduced data transfers over the network.
    * When an EC pool is used, the recovery procedure triggered by an
      OSD going down requires decompressing and recompressing the
      complete data set. This might considerably increase CPU
      utilization during recovery.
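
To put rough numbers on the CPU argument above, here is a minimal
sketch of how many bytes pass through the compressor for one client
write, depending on where compression runs. The function name, the
placement labels, the 3-replica default and the k=2, m=1 EC profile are
assumptions chosen only to match the "three times" and "~50%" figures;
it also assumes parity chunks cost the same to compress as data chunks.

```python
MIB = 1024 * 1024


def bytes_compressed(logical_bytes: int, placement: str,
                     replicas: int = 3, ec_k: int = 2, ec_m: int = 1) -> int:
    """Bytes fed to the compressor for one write of `logical_bytes`.

    Placement labels are hypothetical names for the options in this mail:
      "client" / "primary" - compress once, before replication or EC
      "store_replicated"   - compress at every replica's store
      "store_ec"           - compress every EC chunk (data + parity)
    """
    if placement in ("client", "primary"):
        return logical_bytes
    if placement == "store_replicated":
        # Every replica compresses its own full copy.
        return logical_bytes * replicas
    if placement == "store_ec":
        # EC turns k data chunks into k + m chunks before they reach the store.
        return logical_bytes * (ec_k + ec_m) // ec_k
    raise ValueError(f"unknown placement: {placement}")


for placement in ("primary", "store_replicated", "store_ec"):
    mib = bytes_compressed(4 * MIB, placement) / MIB
    print(f"{placement}: {mib:.1f} MiB compressed per 4 MiB write")
```

With these assumed defaults, a 4 MiB write costs 4 MiB of compression
work at the primary, 12 MiB at the stores of a replicated pool, and
6 MiB at the stores of an EC pool.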

Compression Externally at Block Device - compression is performed at the
File Store node by means of an underlying block device that supports
inline data compression.
Pros:
    * Compression is transparent to Ceph.
    * Clients benefit from cluster CPU resources.
Cons:
    * A production-quality solution seems to be absent.
    * Limited flexibility - compression is a partition/mount point
      property. It is hard to get finer granularity (per pool or per
      object). No way to disable compression.
    * From the cluster perspective, compression is performed either on
      each replicated block or on a block “inflated” by erasure coding.
      Total Ceph cluster CPU utilization for compression therefore
      becomes considerably higher (a threefold increase for replicated
      pools and a ~50% increase for EC pools).
    * No benefit from reduced data transfers over the network.
    * When an EC pool is used, the recovery procedure triggered by an
      OSD going down requires decompressing and recompressing the
      complete data set. This might considerably increase CPU
      utilization during recovery.

Notes:

Probably the most troublesome issue introduced by compression is random
write access handling. A brief overview is as follows:

The compressing entity processes the original data blocks of a specific
object and eventually saves a set of new compressed blocks to the
storage. Since different blocks can have different compression ratios,
the new blocks are variable in size. When a new write request arrives
from the client for a data range that overlaps existing data, the
resulting compressed block has to be saved somehow. Again, because of
differing compression ratios, the new block may not fit into the space
allocated for the previous one. Moreover, if the new write request isn’t
aligned with the original one, the previous block may be only partially
invalidated.
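
The overwrite problem described above can be illustrated with a small
sketch. It is not Ceph code: the 4096-byte logical block size, the use
of zlib, and the helper names are all assumptions made for
illustration. It shows that independently compressed blocks end up with
different sizes, that an overwrite with less compressible data does not
fit into the old slot, and that an unaligned write only partially covers
the blocks it touches.

```python
import os
import zlib
from typing import List, Tuple

BLOCK = 4096  # hypothetical logical block size


def compress_blocks(data: bytes, block: int = BLOCK) -> List[bytes]:
    """Compress each fixed-size logical block independently."""
    return [zlib.compress(data[i:i + block]) for i in range(0, len(data), block)]


def fits_in_place(old: bytes, new: bytes) -> bool:
    """True if the new compressed block fits in the old block's space."""
    return len(new) <= len(old)


def touched_blocks(offset: int, length: int,
                   block: int = BLOCK) -> List[Tuple[int, bool]]:
    """Return (block index, fully covered?) for each block a write touches."""
    first = offset // block
    last = (offset + length - 1) // block
    result = []
    for i in range(first, last + 1):
        start, end = i * block, (i + 1) * block
        fully_covered = offset <= start and offset + length >= end
        result.append((i, fully_covered))
    return result


# A two-block object: zeros, then repeated text. Both compress well,
# but not necessarily to the same size.
obj = bytes(BLOCK) + (b"ceph " * BLOCK)[:BLOCK]
stored = compress_blocks(obj)
print("compressed sizes:", [len(c) for c in stored])

# Overwrite block 0 with random (incompressible) data.
new_block = zlib.compress(os.urandom(BLOCK))
print("new block fits in old slot:", fits_in_place(stored[0], new_block))

# An unaligned 2000-byte write at offset 3000 partially hits blocks 0 and 1.
print("touched:", touched_blocks(3000, 2000))
```

A partially covered block has to be read, decompressed, patched and
recompressed before it can be stored again, which is where the extra
requests per user write come from.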

Thus the flat, sequential model of keeping object data doesn’t work any
more. Instead one needs to introduce some trickier scheme to store,
access and overwrite object content. More details on both the issue and
a potential implementation approach can be found here (sections I & II):
http://users.ics.forth.gr/~bilas/pdffiles/makatos-snapi10.pdf

Thanks,
Igor.