From: Igor Fedotov
Subject: Re: Adding Data-At-Rest compression support to Ceph
Date: Wed, 23 Sep 2015 18:26:04 +0300
Message-ID: <5602C48C.4010009@mirantis.com>
References: <56018A05.6090100@mirantis.com> <56029F66.3070503@mirantis.com>
Sender: ceph-devel-owner@vger.kernel.org
To: Gregory Farnum, Sage Weil
Cc: ceph-devel

On 23.09.2015 17:05, Gregory Farnum wrote:
> On Wed, Sep 23, 2015 at 6:15 AM, Sage Weil wrote:
>> On Wed, 23 Sep 2015, Igor Fedotov wrote:
>>> Hi Sage,
>>> thanks a lot for your feedback.
>>>
>>> Regarding issues with offset mapping and stripe size exposure.
>>> What's about the idea to apply compression in two-tier (cache+backing storage)
>>> model only ?
>> I'm not sure we win anything by making it a two-tier only thing... simply
>> making it a feature of the EC pool means we can also address EC pool users
>> like radosgw.
>>
>>> I doubt single-tier one is widely used for EC pools since there is no random
>>> write support in such mode. Thus this might be an acceptable limitation.
>>> At the same time it seems that appends caused by cached object flush have
>>> fixed block size (8Mb by default). And object is totally rewritten on the next
>>> flush if any. This makes offset mapping less tricky.
>>> Decompression should be applied in any model though as cache tier shutdown and
>>> subsequent compressed data access is possibly a valid use case.
>> Yeah, we need to handle random reads either way, so I think the offset
>> mapping is going to be needed anyway.
> The idea of making the primary responsible for object compression
> really concerns me. It means for instance that a single random access
> will likely require access to multiple objects, and breaks many of the
> optimizations we have right now or in the pipeline (for instance:
> direct client access).

Could you please elaborate on why a single random access would require
access to multiple objects?
In my opinion we need to access exactly the same object set as before:
in an EC pool each appended block is split into multiple shards that go
to their respective OSDs. In the general case, a single read request has
to retrieve a set of adjacent shards from several OSDs. With
compression, the only difference is the data range that the compressed
shard set occupies. That is, we simply need to translate the requested
data range to the range actually stored and retrieve that data from the
OSDs. What am I missing?

> And apparently only the EC pool will support
> compression, which is frustrating for all the replicated pool users
> out there...

In my opinion, replicated pool users should consider switching to an EC
pool first if they care about saving space. They automatically gain 50%
space saving this way. Compression brings even more savings, but that is
the second step along this path.

> Is there some reason we don't just want to apply encryption across an
> OSD store? Perhaps doing it on the filesystem level is the wrong way
> (for reasons named above) but there are other mechanisms like inline
> block device compression that I think are supposed to work pretty
> well.

If I understand the idea of inline block device compression correctly,
it has some drawbacks similar to those of the FS compression approach.
The ones worth mentioning:
* Less flexibility: compression is per device only, with no way to have
  per-pool compression and no control over the compression process.
* Potentially higher overhead during operation: there is no way to skip
  processing of non-compressible data, e.g. shards with erasure codes.
* Potentially higher overhead for recovery on OSD death: one needs to
  decompress data on the working OSDs and compress it again on the new
  OSD. That is not necessary if compression takes place prior to EC,
  though.

> The only thing that doesn't get us that I can see mentioned here
> is the over-the-wire compression -- and Haomai already has patches for
> that, which should be a lot easier to validate and will work at all
> levels of the stack!
> -Greg
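
To make the range translation discussed above more concrete (mapping a
requested logical range onto the compressed range actually stored), here
is a minimal Python sketch. It is not Ceph code and not a proposed
implementation: the class CompressedObject, the 4 KB BLOCK_SIZE, the
per-block index layout and the use of zlib are all assumptions chosen
for illustration. It only assumes what the thread states, namely that
data arrives as appends of fixed-size blocks, and it ignores EC sharding
entirely.

```python
import zlib
from dataclasses import dataclass, field

# Hypothetical logical block size, kept small for the example.
# The thread mentions fixed-size appends on cache flush (8 MB by default).
BLOCK_SIZE = 4096


@dataclass
class CompressedObject:
    """Append-only object whose logical blocks are compressed independently."""
    stored: bytearray = field(default_factory=bytearray)
    # index[i] = (stored_offset, stored_length) of logical block i
    index: list = field(default_factory=list)
    logical_size: int = 0

    def append(self, data: bytes) -> None:
        # Assumption: every earlier append ended on a block boundary.
        if self.logical_size % BLOCK_SIZE != 0:
            raise ValueError("previous append was not block-aligned")
        for pos in range(0, len(data), BLOCK_SIZE):
            chunk = zlib.compress(data[pos:pos + BLOCK_SIZE])
            self.index.append((len(self.stored), len(chunk)))
            self.stored += chunk
        self.logical_size += len(data)

    def _block_span(self, offset: int, length: int):
        # Indices of the first and last logical blocks touched by the range.
        if offset < 0 or length <= 0 or offset + length > self.logical_size:
            raise ValueError("range outside object")
        return offset // BLOCK_SIZE, (offset + length - 1) // BLOCK_SIZE

    def stored_range(self, offset: int, length: int):
        """Translate a logical range to the (offset, length) stored on disk."""
        first, last = self._block_span(offset, length)
        start = self.index[first][0]
        end = self.index[last][0] + self.index[last][1]
        return start, end - start

    def read(self, offset: int, length: int) -> bytes:
        """Fetch only the covering compressed blocks, decompress, then slice."""
        first, last = self._block_span(offset, length)
        plain = b"".join(
            zlib.decompress(bytes(self.stored[o:o + n]))
            for o, n in self.index[first:last + 1]
        )
        skip = offset - first * BLOCK_SIZE
        return plain[skip:skip + length]
```

In this sketch a random read touches the same object as before; only
the byte range fetched changes, at the cost of keeping one index entry
per appended block.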