Re: Adding compression support for bluestore.

From: Igor Fedotov <ifedotov@mirantis.com>
To: Haomai Wang <haomaiwang@gmail.com>
Cc: ceph-devel <ceph-devel@vger.kernel.org>
Subject: Re: Adding compression support for bluestore.
Date: Wed, 17 Feb 2016 03:11:15 +0300
Message-ID: <56C3BAA3.3070804@mirantis.com>
In-Reply-To: <CACJqLyb=X1i7tsYeKOEJRdJEEMBGvgW817eY5Bo9YBXDszUDmw@mail.gmail.com>

Hi Haomai,
Thanks for your comments.
Please find my response inline.

On 2/16/2016 5:06 AM, Haomai Wang wrote:
> On Tue, Feb 16, 2016 at 12:29 AM, Igor Fedotov <ifedotov@mirantis.com> wrote:
>> Hi guys,
>> Here is my preliminary overview of how one can add compression support that
>> allows random reads/writes for bluestore.
>>
>> Preface:
>> Bluestore keeps object content as a set of dispersed extents aligned by
>> 64K (configurable param). It also permits gaps in object content, i.e. it
>> does not allocate storage space for object data regions unaffected by user
>> writes.
>> A mapping like the following is used for tracking where stored object
>> content is placed (the actual current implementation may differ, but the
>> representation below seems to be sufficient for our purposes):
>> Extent Map
>> {
>> < logical offset 0 -> extent 0 'physical' offset, extent 0 size >
>> ...
>> < logical offset N -> extent N 'physical' offset, extent N size >
>> }
>>
>>
>> Compression support approach:
>> The aim is to provide generic compression support allowing random object
>> read/write.
>> To do that, a compression engine is to be placed (logically - actual
>> implementation may be discussed later) on top of bluestore to "intercept"
>> read-write requests and modify them as needed.
>> The major idea is to split object content into fixed-size logical blocks
>> (MAX_BLOCK_SIZE, e.g. 1Mb). Blocks are compressed independently. Due to
>> compression each block can potentially occupy less store space than its
>> original size. Each block is addressed using its original data offset
>> (AKA 'logical offset' above). After compression is applied each block is
>> written using the existing bluestore infra. In fact a single original write
>> request may affect multiple blocks, thus it transforms into multiple
>> sub-write requests. Block logical offset, compressed block data and
>> compressed data length are the parameters for the injected sub-write
>> requests.
>> As a result stored object content:
>> a) Has gaps
>> b) Uses less space if compression was beneficial enough.
>>
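To illustrate the splitting step described above, here is a minimal sketch in
Python. It is not bluestore code: the function name split_write and the tuple
layout are made up for illustration, and MAX_BLOCK_SIZE is set to the 1Mb
example value. It only shows how one write request maps onto per-block
sub-writes addressed by logical block offset.

```python
MAX_BLOCK_SIZE = 1 << 20  # 1Mb, the example block size from the text


def split_write(offset, length, block_size=MAX_BLOCK_SIZE):
    """Yield one (block_offset, start, end, covers_whole_block) per touched block.

    block_offset is the logical offset of the block; start/end are byte
    positions inside that block. Hypothetical helper, for illustration only.
    """
    if length <= 0:
        return
    end = offset + length
    block_off = (offset // block_size) * block_size  # align down to block start
    while block_off < end:
        lo = max(offset, block_off) - block_off
        hi = min(end, block_off + block_size) - block_off
        yield block_off, lo, hi, (lo == 0 and hi == block_size)
        block_off += block_size


# A 1Mb write at offset 512Kb touches two blocks, neither of them completely.
pieces = list(split_write(512 * 1024, 1 << 20))
```

Each yielded piece would become one sub-write; the last flag tells whether the
piece replaces the whole block or only part of it.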
>> Overwrite request handling is pretty simple. Write request data is split
>> into fully and partially overlapping blocks. Fully overlapping blocks are
>> compressed and written to the store (given the extended write functionality
>> described below). For partially overlapping blocks (no more than 2 of them
>> - head and tail in the general case) we need to retrieve the already stored
>> blocks, decompress them, merge the existing and received data into a block,
>> compress it and save it to the store using the new size.
>> The tricky thing for any written block is that it can be either longer or
>> shorter than the previously stored one. However it always has an upper limit
>> (MAX_BLOCK_SIZE) since we can omit compression and use the original block if
>> the compression ratio is poor. Thus the corresponding bluestore extent for
>> this block is limited too and the existing bluestore mapping doesn't suffer:
>> offsets are permanent and are equal to the original ones provided by the
>> caller.
>> The only extension required for the bluestore interface is to provide an
>> ability to remove existing extents (specified by logical offset, size). In
>> other words we need a write request semantics extension (most likely by
>> introducing an additional extended write method). Currently an overwriting
>> request can only increase allocated space or leave it unaffected, and it
>> can have an arbitrary offset,size parameters pair. The extended one should
>> also be able to squeeze store space (e.g. by removing existing extents for a
>> block and allocating a reduced set of new ones). And the extended write
>> should be applied to a specific block only, i.e. the logical offset is to be
>> aligned with the block start offset and the size limited to MAX_BLOCK_SIZE.
>> It seems this is pretty simple to add - most of the functionality for
>> extent append/removal is already present.
>>
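The read-merge-recompress step for a partially overwritten block, together
with the "omit compression if the ratio is poor" fallback, could look roughly
like the Python sketch below. The names encode_block, decode_block and
overwrite_partial are hypothetical, zlib is used only as an example method,
and zero-filling of never-written bytes is an assumption of the sketch, not
something stated above.

```python
import zlib

MAX_BLOCK_SIZE = 1 << 20  # 1Mb, the example block size from the text


def encode_block(raw):
    """Return (method, payload); keep the raw bytes when zlib does not help."""
    packed = zlib.compress(raw)
    if len(packed) < len(raw):
        return "zlib", packed
    return "none", raw  # stored size never exceeds the original block size


def decode_block(method, payload):
    return zlib.decompress(payload) if method == "zlib" else payload


def overwrite_partial(stored, start, data):
    """Merge `data` at in-block position `start` into an existing block.

    `stored` is a (method, payload) pair, or None if the block is a gap.
    """
    end = start + len(data)
    assert end <= MAX_BLOCK_SIZE, "a sub-write must stay inside one block"
    old = decode_block(*stored) if stored else b""
    merged = bytearray(old.ljust(end, b"\0"))  # assumption: gaps read as zeros
    merged[start:end] = data
    return encode_block(bytes(merged))


stored = encode_block(b"a" * 1000)
method, payload = overwrite_partial(stored, 500, b"b" * 1000)
```

Because encode_block falls back to the raw bytes, the stored payload is never
longer than the uncompressed block, which is what keeps the per-block extent
size bounded by MAX_BLOCK_SIZE.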
>> To support reading and (over)writing, the compression engine needs to track
>> an additional block mapping:
>> Block Map
>> {
>> < logical offset 0 -> compression method, compressed block 0 size >
>> ...
>> < logical offset N -> compression method, compressed block N size >
>> }
>> Please note that despite the similarity with the original bluestore extent
>> map the difference is in record granularity: 1Mb vs 64Kb. Thus each block
>> mapping record might have multiple corresponding extent mapping records.
>>
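As a rough illustration of how a random read could use the block map, here is
a Python sketch. It simplifies things: the map stores the compressed payload
directly instead of a size plus bluestore extents, read_range is a
hypothetical name, and returning zeros for gaps is an assumption.

```python
import zlib

MAX_BLOCK_SIZE = 1 << 20  # 1Mb, the example block size from the text


def read_range(block_map, offset, length, block_size=MAX_BLOCK_SIZE):
    """Read `length` bytes at logical `offset` from {block_offset: (method, payload)}."""
    out = bytearray()
    pos, end = offset, offset + length
    while pos < end:
        block_off = (pos // block_size) * block_size
        take = min(end, block_off + block_size) - pos  # bytes wanted from this block
        entry = block_map.get(block_off)
        if entry is None:
            raw = b""  # gap: nothing was ever written to this block
        else:
            method, payload = entry
            raw = zlib.decompress(payload) if method == "zlib" else payload
        lo = pos - block_off
        out += raw[lo:lo + take].ljust(take, b"\0")  # assumption: gaps read as zeros
        pos += take
    return bytes(out)


# Tiny 8-byte blocks keep the example readable; block 8 is a gap.
demo_map = {0: ("zlib", zlib.compress(b"abcdefgh")), 16: ("none", b"QRSTUVWX")}
result = read_range(demo_map, 6, 12, block_size=8)
```

Only the blocks that intersect the requested range are decompressed, which is
the point of compressing blocks independently.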
>> Below is a sample of how the mappings are transformed by a pair of
>> overwrites.
>> 0) Original mapping ( 3 Mb were written before, compression ratio 2 for each
>> block)
>> Block Map
>> {
>>   0 -> zlib, 512Kb
>>   1Mb -> zlib, 512Kb
>>   2Mb -> zlib, 512Kb
>> }
>> Extent Map
>> {
>>   0 -> 0, 512Kb
>>   1Mb -> 512Kb, 512Kb
>>   2Mb -> 1Mb, 512Kb
>> }
>> 1.5Mb allocated ( [0, 1.5 Mb] range )
>>
>> 1) Resulting mapping ( after overwriting 1Mb of data at 512 Kb offset,
>> compression ratio 1 for both affected blocks)
>> Block Map
>> {
>>   0 -> none, 1Mb
>>   1Mb -> none, 1Mb
>>   2Mb -> zlib, 512Kb
>> }
>> Extent Map
>> {
>>   0 -> 1.5Mb, 1Mb
>>   1Mb -> 2.5Mb, 1Mb
>>   2Mb -> 1Mb, 512Kb
>> }
>> 2.5Mb allocated ( [1Mb, 3.5 Mb] range )
>>
>> 2) Resulting mapping ( after (over)writing 3Mb of data at 1Mb offset,
>> compression ratio 4 for all affected blocks)
>> Block Map
>> {
>>   0 -> none, 1Mb
>>   1Mb -> zlib, 256Kb
>>   2Mb -> zlib, 256Kb
>>   3Mb -> zlib, 256Kb
>> }
>> Extent Map
>> {
>>   0 -> 1.5Mb, 1Mb
>>   1Mb -> 0Mb, 256Kb
>>   2Mb -> 0.25Mb, 256Kb
>>   3Mb -> 0.5Mb, 256Kb
>> }
>> 1.75Mb allocated ( [0Mb, 0.75Mb] and [1.5Mb, 2.5Mb] ranges )
>>
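The extent maps in the sample can be reproduced with a toy model. The Python
sketch below rests on assumptions that are not part of the proposal: one
extent per block, a first-fit allocator, and new extents being allocated
before the replaced ones are released. FirstFit and overwrite are hypothetical
names.

```python
KB, MB = 1 << 10, 1 << 20


class FirstFit:
    """Toy allocator: returns the lowest free 'physical' range that fits."""

    def __init__(self):
        self.used = []  # sorted list of (start, size)

    def alloc(self, size):
        pos = 0
        for start, length in self.used:
            if start - pos >= size:  # the hole before this extent is big enough
                break
            pos = max(pos, start + length)
        self.used.append((pos, size))
        self.used.sort()
        return pos

    def free(self, start, size):
        self.used.remove((start, size))


def overwrite(extent_map, allocator, new_sizes):
    """new_sizes: {logical block offset: stored size after (re)compression}."""
    old = [extent_map[off] for off in new_sizes if off in extent_map]
    for off, size in sorted(new_sizes.items()):
        extent_map[off] = (allocator.alloc(size), size)
    for start, size in old:  # release the replaced extents afterwards
        allocator.free(start, size)


alloc, extents = FirstFit(), {}
# Original state: 3Mb written, compression ratio 2.
overwrite(extents, alloc, {0: 512 * KB, 1 * MB: 512 * KB, 2 * MB: 512 * KB})
# First overwrite: blocks 0 and 1Mb are stored uncompressed.
overwrite(extents, alloc, {0: 1 * MB, 1 * MB: 1 * MB})
# Second overwrite: blocks 1Mb, 2Mb and 3Mb are stored with ratio 4.
overwrite(extents, alloc, {1 * MB: 256 * KB, 2 * MB: 256 * KB, 3 * MB: 256 * KB})
total = sum(size for _, size in alloc.used)
```

Under these assumptions the final extent map and the 1.75Mb total match the
last mapping in the sample; a different allocation policy would give different
'physical' offsets but the same block map.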
> Thanks, Igor!
>
> Maybe I'm missing something: is it compressed inline, not offline?
Yes, this is about inline compression.
> If so, I guess we need to provide more flexible controls to the upper
> layers, like an explicit compression flag or compression unit.
Yes, I agree. We need some sort of control for compression - on a per-object
or per-pool basis...
But in the overview above I was more concerned with the algorithmic aspect,
i.e. how to implement random read/write handling for compressed objects.
Compression management from the user side can be considered a bit later.

>> Any comments/suggestions are highly appreciated.
>>
>> Kind regards,
>> Igor.
>
Thanks,
Igor
