Re: Adding compression/checksum support for bluestore.

From: Igor Fedotov <ifedotov@mirantis.com>
To: Vikas Sinha-SSI <v.sinha@ssi.samsung.com>,
	Allen Samuels <Allen.Samuels@sandisk.com>,
	Sage Weil <sage@newdream.net>
Cc: ceph-devel <ceph-devel@vger.kernel.org>
Subject: Re: Adding compression/checksum support for bluestore.
Date: Thu, 31 Mar 2016 19:31:32 +0300
Message-ID: <56FD50E4.4040300@mirantis.com>
In-Reply-To: <EC1BE61F68296D4A8CE45AD3DEDB99E21AEF7D1C@SSIEXCH-MB3.ssi.samsung.com>


On 30.03.2016 23:41, Vikas Sinha-SSI wrote:
>
>> -----Original Message-----
>> From: ceph-devel-owner@vger.kernel.org [mailto:ceph-devel-
>> owner@vger.kernel.org] On Behalf Of Allen Samuels
>> Sent: Wednesday, March 30, 2016 12:47 PM
>> To: Sage Weil; Igor Fedotov
>> Cc: ceph-devel
>> Subject: Adding compression/checksum support for bluestore.
>>
>> [snip]
>>
>> Time to talk about checksums.
>>
>> First let's divide the world into checksums for data and checksums for
>> metadata -- and defer the discussion about checksums for metadata
>> (important, but one at a time...)
>>
>> I believe it's a requirement that, when checksums are enabled, 100% of
>> data reads must be validated against their corresponding checksum. This
>> leads you to conclude that you must store a checksum for each
>> independently readable piece of data.
>>
>> When compression is disabled, it's relatively straightforward -- there's a
>> checksum for each 4K readable block of data. Presumably this is a simple
>> vector stored in the pextent structure with one entry for each 4K block of
>> data.
>>
>> Things get more complicated when compression is enabled. At a minimum,
>> you'll need a checksum for each blob of compressed data (I'm using blob
>> here as unit of data put into the compressor, but what I really mean is the
>> minimum amount of *decompressible* data). As I've pointed out before,
>> many of the compression algorithms do their own checksum validation. For
>> algorithms that don't do their own checksum, we'll want one checksum to
>> protect the block -- however, there's no reason that we can't implement this
>> as one checksum for each 4K physical blob; the runtime cost is nearly
>> equivalent and it will considerably simplify the code.
>>
>> Thus I think we really end up with a single, simple design. The pextent
>> structure contains a vector of checksums. Either that vector is empty
>> (checksum disabled) OR there is a checksum for each 4K block of data (note
>> this is NOT min_allocation size, it's minimum_read_size [if that's even a
>> parameter or does the code assume 4K readable blocks? [or worse, 512-byte
>> readable blocks?? -- if so, we'll need to cripple this]]).
>>
>> When compressing with a compression algorithm that does checksumming we
>> can automatically suppress checksum generation. There should also be an
>> administrative switch for this.
>>
>> This allows the checksumming to be pretty much independent of compression
>> -- which is nice :)
>>
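As a rough illustration of the checksum vector described above, here is a minimal sketch in Python. The 4K block size comes from the quoted text; `zlib.crc32` is only a stand-in for whatever checksum is eventually chosen, and the names `checksum_blocks` and `verify_block` are hypothetical, not BlueStore code.

```python
import zlib
from typing import List

BLOCK_SIZE = 4096  # 4K readable block, as assumed in the quoted proposal


def checksum_blocks(data: bytes, enabled: bool = True) -> List[int]:
    """Return one CRC32 per 4K block, or an empty vector if checksums are off."""
    if not enabled:
        return []
    return [zlib.crc32(data[off:off + BLOCK_SIZE])
            for off in range(0, len(data), BLOCK_SIZE)]


def verify_block(data: bytes, csums: List[int], block_index: int) -> bool:
    """Validate a single block read against its stored checksum."""
    if not csums:  # empty vector means checksums are disabled
        return True
    off = block_index * BLOCK_SIZE
    return zlib.crc32(data[off:off + BLOCK_SIZE]) == csums[block_index]
```

Because the vector is indexed by physical 4K block, the same check would apply whether or not the blocks hold compressed data, which is the independence the quoted text is after.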
>> This got me thinking, we have another issue to discuss and resolve.
>>
>> The scenario is when compression is enabled. Assume that we've taken a big
>> blob of data and compressed it into a smaller blob. We then call the allocator
>> for that blob. What do we do if the allocator can't find a CONTIGUOUS block
>> of storage of that size??? In the non-compressed case, it's relatively easy to
>> simply break it up into smaller chunks -- but that doesn't work well with
>> compression.
>>
>> This isn't that unlikely a case; worse, it could happen with shockingly high
>> amounts of free space (>>75%) with some pathological access patterns.
>>
>> There are really only two choices. You either fracture the logical data and
>> recompress OR you modify the pextent data structure to handle this case.
>> The latter isn't terribly difficult to do: you just make the size/address values
>> into a vector of pairs. The former scheme could be quite expensive CPU-wise
>> as you may end up fracturing and recompressing multiple times (granted, in a
>> pathological case). The latter case adds space to each onode for a rare case.
>> The space is recoverable with an optimized serialize/deserializer (in essence
>> you could burn a flag to indicate when a vector of physical chunks/sizes is
>> needed instead of the usual scalar pair).
>>
>> IMO, we should pursue the latter scenario as it avoids the variable latency
>> problem. I see the code/testing complexity of either choice as about the
>> same.
>>
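The "burn a flag" serialization idea above might look roughly like the following Python sketch. The field widths, the `FLAG_MULTI` bit, and the function names are all assumptions made for illustration; nothing here reflects an actual on-disk format.

```python
import struct
from typing import List, Tuple

Extent = Tuple[int, int]  # (physical offset, length)

FLAG_MULTI = 0x01  # hypothetical flag: a counted vector of extents follows


def encode_extents(extents: List[Extent]) -> bytes:
    """Encode the common single-extent case as a scalar pair; use a vector otherwise."""
    if len(extents) == 1:
        off, length = extents[0]
        return struct.pack("<BQI", 0, off, length)
    out = struct.pack("<BI", FLAG_MULTI, len(extents))
    for off, length in extents:
        out += struct.pack("<QI", off, length)
    return out


def decode_extents(buf: bytes) -> List[Extent]:
    """Inverse of encode_extents."""
    flags = buf[0]
    if not flags & FLAG_MULTI:
        _, off, length = struct.unpack_from("<BQI", buf, 0)
        return [(off, length)]
    _, count = struct.unpack_from("<BI", buf, 0)
    pos = struct.calcsize("<BI")
    extents = []
    for _ in range(count):
        off, length = struct.unpack_from("<QI", buf, pos)
        extents.append((off, length))
        pos += struct.calcsize("<QI")
    return extents
```

In this sketch the contiguous case pays only the one flag byte over a bare offset/length pair, and the count field is written only in the rare fragmented case.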
> If I understand correctly, then there would still be a cost associated with writing discontiguously
> to disk. In cases such as this where the resources for compression are not easily available, I wonder
> if it is reasonable to simply not do compression for that write. The cost of not compressing would be
> a missed space optimization, but the cost of compressing in any and all cases could be significant to latency.
That seems to be a reasonable and simple solution.
There is still a technical question of how to distinguish the "no space"
and "no contiguous space" allocation failure cases. That needs to be
addressed by the allocator...
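
To make the distinction concrete, here is a minimal Python sketch of an allocator that reports the two failure cases separately, and of a write path that skips compression when only fragmented space is left. The first-fit policy, the `AllocResult` enum, and `choose_write_mode` are hypothetical and are not the BlueStore allocator interface.

```python
from enum import Enum
from typing import List, Optional, Tuple

Extent = Tuple[int, int]  # (offset, length) of a free region


class AllocResult(Enum):
    OK = "ok"
    NO_SPACE = "no space"
    NO_CONTIGUOUS_SPACE = "no contiguous space"


def free_total(free: List[Extent]) -> int:
    return sum(length for _, length in free)


def allocate_contiguous(free: List[Extent],
                        want: int) -> Tuple[AllocResult, Optional[int]]:
    """First-fit allocation of one contiguous run of `want` bytes."""
    for i, (off, length) in enumerate(free):
        if length >= want:
            if length == want:
                free.pop(i)
            else:
                free[i] = (off + want, length - want)
            return AllocResult.OK, off
    # No single region fits; report whether the total would have sufficed.
    if free_total(free) >= want:
        return AllocResult.NO_CONTIGUOUS_SPACE, None
    return AllocResult.NO_SPACE, None


def choose_write_mode(free: List[Extent], compressed_len: int,
                      raw_len: int) -> str:
    """Pick a write mode: compress only if the blob fits contiguously."""
    result, _ = allocate_contiguous(free, compressed_len)
    if result is AllocResult.OK:
        return "compressed"
    # Uncompressed data can be split across fragments, if enough space remains.
    if result is AllocResult.NO_CONTIGUOUS_SPACE and free_total(free) >= raw_len:
        return "uncompressed"
    return "fail"
```

With three free 4K fragments, an 8K compressed blob gets `NO_CONTIGUOUS_SPACE` and the write falls back to uncompressed only if the raw data still fits in the total free space.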
