From: Igor Fedotov
Subject: Re: Adding compression support for bluestore.
Date: Wed, 24 Feb 2016 21:18:52 +0300
Message-ID: <56CDF40C.9060405@mirantis.com>
References: <56C1FCF3.4030505@mirantis.com> <56C3BAA3.3070804@mirantis.com>
To: Sage Weil, Allen Samuels
Cc: ceph-devel

Allen, Sage,

thanks a lot for the interesting input.
May I ask for some clarification and highlight some caveats though?

1) Allen, are you suggesting that the logical block layout becomes permanent
after the initial write?
Please see the example below for what I mean (logical offset/size pairs are
given only for the sake of simplicity).
Imagine a client has performed multiple writes that created the following map:
<0, 100>
<100, 50>
<150, 70>
<230, 70>
and an overwrite request <120, 70> comes in.
The question is whether the resulting mapping stays the same or is updated
as below:
<0, 100>
<100, 20>  // updated extent
<120, 70>  // new extent
<190, 30>  // updated extent
<230, 70>

2) In fact the "application units" that write requests deliver to BlueStore
are partially (or even completely) distorted by Ceph internals (caching
infra, striping, EC). Thus there is a chance we are dealing with an already
distorted picture and the suggested modification brings little or no benefit.

3) Sage - could you please elaborate on the per-extent checksum use case -
how are we planning to use that?

Thanks,
Igor.
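To make the question concrete, here is a minimal sketch (editor's illustration only, not Ceph code; names are invented) of the "updated layout" variant, where an overwrite punches a hole through the existing extents and splits any it partially covers. With the stated <120, 70> request, the only self-consistent result splits the third extent at offset 190:

```python
# Illustrative sketch: extents are (logical_offset, length) pairs.
# An overwrite of [off, off+length) trims overlapping extents down to
# their head/tail remainders and inserts one new extent for the write.

def apply_overwrite(extents, off, length):
    """Return a new extent list where [off, off+length) is one new extent."""
    end = off + length
    result = []
    for e_off, e_len in extents:
        e_end = e_off + e_len
        if e_end <= off or e_off >= end:
            result.append((e_off, e_len))        # no overlap: keep as-is
            continue
        if e_off < off:
            result.append((e_off, off - e_off))  # head remainder survives
        if e_end > end:
            result.append((end, e_end - end))    # tail remainder survives
    result.append((off, length))                 # the newly written extent
    return sorted(result)

layout = [(0, 100), (100, 50), (150, 70), (230, 70)]
print(apply_overwrite(layout, 120, 70))
# -> [(0, 100), (100, 20), (120, 70), (190, 30), (230, 70)]
```

The alternative ("permanent layout") would instead merge the new data back into the existing <100, 50> and <150, 70> extents without changing their boundaries.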
On 22.02.2016 15:25, Sage Weil wrote:
> On Fri, 19 Feb 2016, Allen Samuels wrote:
>> This is a good start to an architecture for performing compression.
>>
>> I am concerned that it's a bit too simple at the expense of potentially
>> significant performance. In particular, I believe it's often inefficient
>> to force compression to be performed in block sizes and alignments that
>> may not match the application's usage.
>>
>> I think that extent mapping should be enhanced to include the full
>> tuple: <logical offset, logical size, physical offset, physical size,
>> compression algo>
> I agree.
>
>> With the full tuple, you can compress data in the natural units of the
>> application (which is most likely the size of the write operation that
>> you received) and on its natural alignment (which will eliminate a lot
>> of expensive-and-hard-to-handle partial overwrites) rather than the
>> proposal of a fixed size compression block on fixed boundaries.
>>
>> Using the application's natural block size for performing compression
>> may allow you a greater choice of compression algorithms. For example,
>> if you're doing 1MB object writes, then you might want to be using
>> bzip-ish algorithms that have large compression windows rather than the
>> 32K-limited zlib algorithm or the 64K-limited snappy. You wouldn't
>> want to do that if all compression was limited to a fixed 64K window.
>>
>> With this extra information a number of interesting algorithm choices
>> become available. For example, in the partial-overwrite case you can
>> just delay recovering the partially overwritten data by having an extent
>> that overlaps a previous extent.
> Yep.
>
>> One objection to the increased extent tuple is the amount of
>> space/memory it would consume. This need not be the case; the existing
>> BlueStore architecture stores the extent map in a serialized format
>> different from the in-memory format.
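The tuple-plus-compact-encoding idea can be sketched as follows (editor's illustration under stated assumptions; field names, flag values, and the record layout are all invented, not BlueStore's actual encoding). The serializer drops fields that are implied: the logical offset when the extent is contiguous with its predecessor, and the physical size/algorithm when the extent is uncompressed:

```python
# Sketch of the "full tuple" extent with a serialized form that omits
# implied fields. Only the in-memory Extent carries everything.

from dataclasses import dataclass
from typing import Optional

CONTIG = 1 << 0  # logical_offset == prev logical_offset + prev logical_size
UNCOMP = 1 << 1  # logical_size == physical_size and no compression algo

@dataclass
class Extent:
    logical_offset: int
    logical_size: int
    physical_offset: int
    physical_size: int
    algo: Optional[str]  # None means uncompressed

def serialize(extents):
    """Encode each extent as (flags, fields), omitting implied fields."""
    out = []
    expect = 0
    for e in extents:
        flags = 0
        rec = []
        if e.logical_offset == expect:
            flags |= CONTIG              # offset implied: don't store it
        else:
            rec.append(e.logical_offset)
        rec += [e.logical_size, e.physical_offset]
        if e.algo is None and e.logical_size == e.physical_size:
            flags |= UNCOMP              # physical size and algo implied
        else:
            rec += [e.physical_size, e.algo]
        out.append((flags, tuple(rec)))
        expect = e.logical_offset + e.logical_size
    return out
```

In the common contiguous, uncompressed case each record shrinks to two integers plus a flags byte, which is roughly what the current two-field extent map costs.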
It would be relatively simple to
>> create multiple serialization formats that optimize for the typical
>> cases of when the logical space is contiguous (i.e., logical offset is
>> previous logical offset + logical size) and when there's no compression
>> (logical size == physical size). Only the deserialized in-memory format
>> of the extent table has the fully populated tuples. In fact this is a
>> desirable optimization for the current bluestore regardless of whether
>> this compression proposal is adopted or not.
> Yeah.
>
> The other bit we should probably think about here is how to store
> checksums. In the compressed extent case, a simple approach would be to
> just add the checksum (either compressed, uncompressed, or both) to the
> extent tuple, since the extent will generally need to be read in its
> entirety anyway. For uncompressed extents, that's not the case, and
> having an independent map of checksums over smaller block sizes makes
> sense, but that doesn't play well with the variable alignment/extent size
> approach. It kind of sucks to have multiple formats here, but if we can
> hide it behind the in-memory representation and/or interface (so that,
> e.g., each extent has a checksum block size and a vector of checksums) we
> can optimize the encoding however we like without affecting other code.
>
> sage
>
>>
>> Allen Samuels
>> Software Architect, Fellow, Systems and Software Solutions
>>
>> 2880 Junction Avenue, San Jose, CA 95134
>> T: +1 408 801 7030 | M: +1 408 780 6416
>> allen.samuels@SanDisk.com
>>
>>
>> -----Original Message-----
>> From: ceph-devel-owner@vger.kernel.org [mailto:ceph-devel-owner@vger.kernel.org] On Behalf Of Igor Fedotov
>> Sent: Tuesday, February 16, 2016 4:11 PM
>> To: Haomai Wang
>> Cc: ceph-devel
>> Subject: Re: Adding compression support for bluestore.
>>
>> Hi Haomai,
>> Thanks for your comments.
>> Please find my response inline.
>>
>> On 2/16/2016 5:06 AM, Haomai Wang wrote:
>>> On Tue, Feb 16, 2016 at 12:29 AM, Igor Fedotov wrote:
>>>> Hi guys,
>>>> Here is my preliminary overview of how one can add compression support
>>>> allowing random reads/writes for bluestore.
>>>>
>>>> Preface:
>>>> Bluestore keeps object content using a set of dispersed extents
>>>> aligned to 64K (a configurable param). It also permits gaps in object
>>>> content, i.e. it prevents storage space allocation for object data
>>>> regions unaffected by user writes.
>>>> The following sort of mapping is used for tracking stored object
>>>> content disposition (the actual current implementation may differ, but
>>>> the representation below seems to be sufficient for our purposes):
>>>> Extent Map
>>>> {
>>>> < logical offset 0 -> extent 0 'physical' offset, extent 0 size >
>>>> ...
>>>> < logical offset N -> extent N 'physical' offset, extent N size >
>>>> }
>>>>
>>>> Compression support approach:
>>>> The aim is to provide generic compression support allowing random
>>>> object read/write.
>>>> To do that, a compression engine is to be placed (logically - the
>>>> actual implementation may be discussed later) on top of bluestore to
>>>> "intercept" read-write requests and modify them as needed.
>>>> The major idea is to split object content into fixed-size logical
>>>> blocks (MAX_BLOCK_SIZE, e.g. 1Mb). Blocks are compressed
>>>> independently. Due to compression each block can potentially occupy
>>>> less store space compared to its original size. Each block is
>>>> addressed using the original data offset (AKA 'logical offset' above).
>>>> After compression is applied, each block is written using the existing
>>>> bluestore infra. In fact a single original write request may affect
>>>> multiple blocks, thus it transforms into multiple sub-write requests.
>>>> Block logical offset, compressed block data and compressed data length
>>>> are the parameters for the injected sub-write requests.
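The block-split step described above can be sketched like this (editor's illustration, not the proposed implementation; the function name and return shape are invented). One client write is carved into per-block sub-writes, each addressed by its block-aligned logical offset:

```python
# Sketch: split one write into MAX_BLOCK_SIZE-aligned sub-writes.

MAX_BLOCK_SIZE = 1 << 20  # 1 Mb, as suggested in the proposal

def split_write(offset, data, block_size=MAX_BLOCK_SIZE):
    """Yield (block_logical_offset, offset_in_block, chunk) sub-writes."""
    pos = 0
    while pos < len(data):
        abs_off = offset + pos
        block_off = (abs_off // block_size) * block_size  # align down
        in_block = abs_off - block_off
        take = min(block_size - in_block, len(data) - pos)
        yield block_off, in_block, data[pos:pos + take]
        pos += take
```

For example, a 1Mb write at 512Kb offset yields two sub-writes: 512Kb into the tail of block 0 and 512Kb into the head of the block at 1Mb.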
>>>> As a result the stored object content:
>>>> a) Has gaps
>>>> b) Uses less space if compression was beneficial enough.
>>>>
>>>> Overwrite request handling is pretty simple. Write request data is
>>>> split into fully and partially overlapping blocks. Fully
>>>> overlapping blocks are compressed and written to the store (given the
>>>> extended write functionality described below). For partially
>>>> overlapping blocks (no more than 2 of them
>>>> - head and tail in the general case) we need to retrieve the already
>>>> stored blocks, decompress them, merge the existing and received data
>>>> into a block, compress it and save it to the store using the new size.
>>>> The tricky thing for any written block is that it can be both longer
>>>> and shorter than the previously stored one. However it always has an
>>>> upper limit
>>>> (MAX_BLOCK_SIZE), since we can omit compression and use the original
>>>> block if the compression ratio is poor. Thus the corresponding
>>>> bluestore extent for this block is limited too and the existing
>>>> bluestore mapping doesn't suffer: offsets are permanent and are equal
>>>> to the ones originally provided by the caller.
>>>> The only extension required for the bluestore interface is to provide
>>>> an ability to remove existing extents (specified by logical offset,
>>>> size). In other words we need a write request semantics extension
>>>> (rather, by introducing an additional extended write method).
>>>> Currently an overwriting request can only either increase allocated
>>>> space or leave it unaffected. And it can have an arbitrary offset/size
>>>> parameter pair. The extended one should be able to squeeze store space
>>>> (e.g. by removing existing extents for a block and allocating a
>>>> reduced set of new ones) as well. And the extended write should be
>>>> applied to a specific block only, i.e. the logical offset is to be
>>>> aligned with the block start offset and the size limited to
>>>> MAX_BLOCK_SIZE.
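The head/tail merge path above can be sketched as a read-modify-write with a compression fallback (editor's illustration; zlib merely stands in for whatever algorithm a block uses, and the function name is invented). The fallback to raw bytes is what guarantees the MAX_BLOCK_SIZE upper bound on the stored block:

```python
# Sketch: merge new data into a partially overwritten stored block,
# recompress, and store raw if compression doesn't pay off.

import zlib

def recompress_block(stored_blob, algo, new_data, offset_in_block):
    """Return (new_algo, new_blob) for the merged block."""
    raw = zlib.decompress(stored_blob) if algo == "zlib" else stored_blob
    merged = (raw[:offset_in_block] + new_data
              + raw[offset_in_block + len(new_data):])
    comp = zlib.compress(merged)
    if len(comp) < len(merged):     # ratio worthwhile: store compressed
        return "zlib", comp
    return "none", merged           # poor ratio: store raw, bounded size
```

Note the stored blob can legitimately come back either longer or shorter than before, which is exactly why the extended write method that can shrink allocated space is needed.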
It seems this is pretty simple to
>>>> add - most of the functionality for extent append/removal is already
>>>> present.
>>>>
>>>> To provide reading and (over)writing, the compression engine needs to
>>>> track an additional block mapping:
>>>> Block Map
>>>> {
>>>> < logical offset 0 -> compression method, compressed block 0 size >
>>>> ...
>>>> < logical offset N -> compression method, compressed block N size >
>>>> }
>>>> Please note that despite the similarity with the original bluestore
>>>> extent map, the difference is in record granularity: 1Mb vs 64Kb. Thus
>>>> each block mapping record might have multiple corresponding extent
>>>> mapping records.
>>>>
>>>> Below is a sample of how the mappings transform for a pair of
>>>> overwrites.
>>>> 1) Original mapping (3 Mb were written before, compress ratio 2 for
>>>> each block)
>>>> Block Map
>>>> {
>>>> 0 -> zlib, 512Kb
>>>> 1Mb -> zlib, 512Kb
>>>> 2Mb -> zlib, 512Kb
>>>> }
>>>> Extent Map
>>>> {
>>>> 0 -> 0, 512Kb
>>>> 1Mb -> 512Kb, 512Kb
>>>> 2Mb -> 1Mb, 512Kb
>>>> }
>>>> 1.5Mb allocated ( [0, 1.5Mb] range )
>>>>
>>>> 2) Resulting mapping (after overwriting 1Mb of data at 512Kb offset,
>>>> compress ratio 1 for both affected blocks)
>>>> Block Map
>>>> {
>>>> 0 -> none, 1Mb
>>>> 1Mb -> none, 1Mb
>>>> 2Mb -> zlib, 512Kb
>>>> }
>>>> Extent Map
>>>> {
>>>> 0 -> 1.5Mb, 1Mb
>>>> 1Mb -> 2.5Mb, 1Mb
>>>> 2Mb -> 1Mb, 512Kb
>>>> }
>>>> 2.5Mb allocated ( [1Mb, 3.5Mb] range )
>>>>
>>>> 3) Resulting mapping (after (over)writing 3Mb of data at 1Mb offset,
>>>> compress ratio 4 for all affected blocks)
>>>> Block Map
>>>> {
>>>> 0 -> none, 1Mb
>>>> 1Mb -> zlib, 256Kb
>>>> 2Mb -> zlib, 256Kb
>>>> 3Mb -> zlib, 256Kb
>>>> }
>>>> Extent Map
>>>> {
>>>> 0 -> 1.5Mb, 1Mb
>>>> 1Mb -> 0Mb, 256Kb
>>>> 2Mb -> 0.25Mb, 256Kb
>>>> 3Mb -> 0.5Mb, 256Kb
>>>> }
>>>> 1.75Mb allocated ( [0Mb, 0.75Mb] and [1.5Mb, 2.5Mb] ranges )
>>>>
>>> Thanks, Igor!
>>>
>>> Maybe I'm missing something - is the compression inline, not offline?
>> That's about inline compression.
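The final state of the worked example above can be checked with a few lines of arithmetic (editor's illustration; the dict shapes are invented, only the numbers come from the example). The allocated total is the sum of extent lengths, independent of where the extents landed:

```python
# The final mapping from the example: 1Mb uncompressed block 0 plus
# three 256Kb compressed blocks at 1Mb, 2Mb, 3Mb.

MB, KB = 2**20, 2**10

block_map = {            # logical block offset -> (algo, stored size)
    0 * MB: ("none", 1 * MB),
    1 * MB: ("zlib", 256 * KB),
    2 * MB: ("zlib", 256 * KB),
    3 * MB: ("zlib", 256 * KB),
}
extent_map = {           # logical offset -> (physical offset, length)
    0 * MB: (1536 * KB, 1 * MB),
    1 * MB: (0, 256 * KB),
    2 * MB: (256 * KB, 256 * KB),
    3 * MB: (512 * KB, 256 * KB),
}
allocated = sum(length for _, length in extent_map.values())
print(allocated / MB)    # -> 1.75
```

The per-block stored sizes in the block map match the extent lengths here only because each 1Mb block happens to map to a single extent; in general one block record may span several extent records.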
>>> If so, I guess we need to provide more flexible controls to the
>>> upper layer, like an explicit compression flag or compression unit.
>> Yes, I agree. We need a sort of control for compression - on a
>> per-object or per-pool basis...
>> But in the overview above I was more concerned with the algorithmic
>> aspect, i.e. how to implement random read/write handling for compressed
>> objects.
>> Compression management from the user side can be considered a bit later.
>>
>>>> Any comments/suggestions are highly appreciated.
>>>>
>>>> Kind regards,
>>>> Igor.
>> Thanks,
>> Igor
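The per-pool control idea could look roughly like this (editor's sketch only; the policy table, modes, and function are entirely hypothetical, since the thread explicitly defers this design). A per-pool entry selects an algorithm, and an upper-layer hint can still veto compression per write:

```python
# Hypothetical per-pool compression policy lookup.

POOL_POLICY = {
    "rbd":     {"mode": "aggressive", "algo": "zlib"},
    "scratch": {"mode": "none"},
}

def choose_algo(pool, hint_incompressible=False):
    """Return the algorithm name, or None to store the block raw."""
    policy = POOL_POLICY.get(pool, {"mode": "none"})
    if policy["mode"] == "none" or hint_incompressible:
        return None
    return policy["algo"]
```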
--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html