Re: Adding compression support for bluestore.

CEPH filesystem development
 help / color / mirror / Atom feed

From: Igor Fedotov <ifedotov@mirantis.com>
To: Sage Weil <sage@newdream.net>
Cc: Allen Samuels <Allen.Samuels@sandisk.com>,
	ceph-devel <ceph-devel@vger.kernel.org>
Subject: Re: Adding compression support for bluestore.
Date: Mon, 21 Mar 2016 21:01:14 +0300	[thread overview]
Message-ID: <56F036EA.5060401@mirantis.com> (raw)
In-Reply-To: <alpine.DEB.2.11.1603211141560.23875@cpach.fuggernut.com>



On 21.03.2016 18:50, Sage Weil wrote:
> On Mon, 21 Mar 2016, Igor Fedotov wrote:
>> On 19.03.2016 6:14, Allen Samuels wrote:
>>> If we're going to both allow compression and delayed overwrite we simply
>>> have to handle the case where new data actually overlaps with previous data
>>> -- recursively. If I understand the current code, it handles exactly one
>>> layer of overlay which is always stored in KV store. We need to generalize
>>> this data structure. I'm going to outline a proposal, which If I get wrong,
>>> I beg forgiveness -- I'm not as familiar with this code as I would like,
>>> especially the ref-counted shared extent stuff. But I'm going to blindly
>>> dive in and assume that Sage will correct me when I go off the tracks -- and
>>> therefore end up learning how all of this stuff REALLY works.
>>>
>>> I propose that the current bluestore_extent_t and bluestore_overlay_t  be
>>> essentially unified into a single structure with a typemark to distinguish
>>> between being in KV store or in raw block storage. Here's an example: (for
>>> this discussion, BLOCK_SIZE is 4K and is the minimum physical I/O size).
>>>
>>> Struct bluestore_extent_t {
>>>      Uint64_t logical_size;			// size of data before any
>>> compression. MUST BE AN INTEGER MULTIPLE of BLOCK_SIZE (and != 0)
>>>      Uint64_t physical_size;                              // size of data on
>>> physical media (yes, this is unneeded when location == KV, the
>>> serialize/deserialize could compress this out --  but this is an unneeded
>>> optimization
>>>      Uint64_t location:1;                                    // values (in
>>> ENUM form) are "KV" and "BLOCK"
>>>      Uint64_t compression_alg:4;                  // compression algorithm...
>>>      Uint64_t otherflags:xx;                             // round it out.
>>>      Uint64_t media_address;                        // forms Key when
>>> location == KV block address when location == BLOCK
>>>      Vector<uint32_t> checksums;              // Media checksums. See
>>> commentary on this below.
>>> };
>>>
>>> This allows any amount of compressed or uncompressed data to be identified
>>> in either a KV key or a block store.
>>>
>> As promised please find a competing proposal for extent map structure. It can
>> be used for handling unaligned overlapping writes of both
>> compressed/uncompressed data. It seems it's applicable for any compression
>> policy but my primary intention was to allow overwrites that use totally
>> different extents without the touch to the existing(overwritten) ones. I.e.
>> that's what Sage explained this way some time ago:
>>
>> "b) we could just leave the overwritten extents alone and structure the
>> block_map so that they are occluded.  This will 'leak' space for some
>> write patterns, but that might be okay given that we can come back later
>> and clean it up, or refine our strategy to be smarter."
>>
>> Nevertheless the corresponding infrastructure seems to be applicable for
>> different use cases too.
>>
>> At first let's consider simple raw data overwrite case. No compression,
>> checksums, flags at this point for the sake of simplicity.
>> Block map entry to be defined as follows:
>> OFFS:  < EXT_OFFS, EXT_LEN, X_OFFS, X_LEN>
>> where
>> EXT_OFFS, EXT_LEN - allocated extent offset and size, AKA physical address and
>> size.
>> X_OFFS - relative offset within the block where valid (not overwritten) data
>> starts. Full data offset = OFFS + X_OFFS
>> X_LEN - valid data size.
>> Invariant: Block length == X_OFFS + X_LEN
>>
>> Let's consider sample block map transform:
>> --------------------------------------------------------
>> ****** Step 0 (two incoming writes of 50 Kb at offset 0 and 100K):
>> ->Write(0,50)
>> ->Write(100, 50)
>>
>> Resulting block map ( OFFS: {EXT_OFFS, EXT_LEN, X_OFFS, X_LEN}  ):
>> 0:      {EO1, 50, 0, 50}
>> 100: {EO2, 50, 0, 50}
>>
>> Where EO1, EO2 - physical addresses for allocated extents.
>> Two new entries have been inserted.
>>
>> ****** Step 1 ( overwrite that partially overlaps both existing blocks ):
>> ->Write(25,100)
>>
>> Resulting block map ( OFFS: {EXT_OFFS, EXT_LEN, X_OFFS, X_LEN}  ):
>> 0:      {EO1, 50, 0, 25}
>> 25:    {EO3, 100, 0, 100}
>> 125: {EO2, 50, 25, 25}
>>
>> As one can see new entry at offset 25 has appeared and previous entries have
>> been altered (including the map key (100->125) for the last entry). No
>> physical extents reallocation took place though - just a new one at E03 has
>> been allocated.
>> Please note that client accessible data for block EO2 are actually stored at
>> EO2 + X_OFF(=25) and have 25K only despite the fact that extent has 50K total.
>> The same for block EO1 - valid data length = 25K only.
>>
>>
>> ****** Step 2 ( overwrite that partially overlaps existing blocks once again):
>> ->Write(70, 65)
>>
>> Resulting block map ( OFFS: {EXT_OFFS, EXT_LEN, X_OFFS, X_LEN}  ):
>> 0:      {EO1, 50, 0, 25}
>> 25:    {EO3, 100, 0, 45}
>> 70:    {EO4, 65, 0, 65}
>> 135: {EO2, 50, 35, 15}
>>
>> Yet another new entry. Overlapped block entries at 25 & 125 were altered.
>>
>> ****** Step 3 ( overwrite that partially overlaps one block and totally
>> overwrite the last one):
>> ->Write(100, 60)
>>
>> Resulting block map ( OFFS: {EXT_OFFS, EXT_LEN, X_OFFS, X_LEN}  ):
>> 0:      {EO1, 50, 0, 25}
>> 25:    {EO3, 100, 0, 45}
>> 70:    {EO4, 65, 0, 35}
>> 100: {EO5, 60, 0, 60}
>> -140: {EO2, 50, 50, 0}  -> to be removed as it's totally overwritten ( see
>> X_LEN = 0 )
>>
>> Entry for EO4 have been altered and entry EO2 to be removed. The latter can be
>> done both immediately on map alteration and by some background cleanup
>> procedure.
>>
>> ****** Step 4 ( overwrite that totally overlap the first block):
>> ->Write(0, 25)
>>
>> Resulting block map ( OFFS: {EXT_OFFS, EXT_LEN, X_OFFS, X_LEN}  ):
>> 0:      {EO6, 25, 0, 25}
>> - 0:      {EO1, 50, 25, 0}  -> to be removed
>> 25:    {EO3, 100, 0, 45}
>> 70:    {EO4, 65, 0, 35}
>> 100: {EO5, 60, 0, 60}
>>
>> Entry for EO1 has been overwritten and to be removed.
>> --------------------------------------------------------------------------------------
>>
>> Extending this block map for compression is trivial - we need to introduce
>> compression algorithm flag to the map. And vary EXT_LEN (and actual physical
>> allocation) depending on the actual compression ratio.
>> E.g. with ratio=3 (60K reduced to 20K) the record from the last step turn into
>> :
>> 100: {EO5, 20, 0, 60}
>>
>> Other compression aspects handled by the corresponding policies ( e.g. when
>> perform the compression ( immediately, lazily or totally in background ) or
>> how to merge neighboring compressed blocks ) probably don't impact the
>> structure of the map entry - they just shuffle the entries.
> This is much simpler!  There is one case we need to address that I don't
> see above, though.  Consider,
>
> - write 0~1048576, and compress it
> - write 16384~4096
Good catch! I really missed this case.
> When we split the large extent into two pieces, the resulting extent map,
> as per above, would be something like
>
> 0:      {EO1, 1048576, 0, 4096, zlib}
> 4096:   {E02, 16384, 0, 4096, uncompressed}
> 16384:	{E01, 1048576, 20480, 1028096, zlib}
>
> ...which is fine, except that it's the *same* compressed extent, which
> means the code that decides that the physical extent is no longer
> referenced and can be released needs to ensure that no other extents in
> the map reference it.  I think that's an O(n) pass across the map when
> releasing.
>
> Also, if we add in checksums, then we'd be duplicating them in the two
> instances that reference the raw extent.
>
> I wonder if it makes sense to break this into two structures.. one that
> lists the raw extents, and another that maps them into the logical space.
> So that there is one record for {E01, 1048576, zlib, checksums}, and then
> the block map is more like
>
> 0:      {E0, 0, 4096}
> 4096:   {E1, 0, 4096}
> 16384:	{E0, 20480, 1028096}
>
> and
>
> 0: E01, 1048576, 0, 4096, zlib, checksums
> 1: E02, 16384, 0, 4096, uncompressed, checksums
Sounds good!
> ?
>
> sage

Thanks,
Igor

next prev parent reply	other threads:[~2016-03-21 18:01 UTC|newest]

Thread overview: 55+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2016-02-15 16:29 Adding compression support for bluestore Igor Fedotov
2016-02-16  2:06 ` Haomai Wang
2016-02-17  0:11   ` Igor Fedotov
2016-02-19 23:13     ` Allen Samuels
2016-02-22 12:25       ` Sage Weil
2016-02-24 18:18         ` Igor Fedotov
2016-02-24 18:43           ` Allen Samuels
2016-02-26 17:41             ` Igor Fedotov
2016-03-15 17:12               ` Sage Weil
2016-03-16  1:06                 ` Allen Samuels
2016-03-16 18:34                 ` Igor Fedotov
2016-03-16 19:02                   ` Allen Samuels
2016-03-16 19:15                     ` Sage Weil
2016-03-16 19:20                       ` Allen Samuels
2016-03-16 19:29                         ` Sage Weil
2016-03-16 19:36                           ` Allen Samuels
2016-03-17 14:55                     ` Igor Fedotov
2016-03-17 15:28                       ` Allen Samuels
2016-03-18 13:00                         ` Igor Fedotov
2016-03-16 19:27                   ` Sage Weil
2016-03-16 19:41                     ` Allen Samuels
     [not found]                       ` <CA+z5DsxA9_LLozFrDOtnVRc7FcvN7S8OF12zswQZ4q4ysK_0BA@mail.gmail.com>
2016-03-16 22:56                         ` Blair Bethwaite
2016-03-17  3:21                           ` Allen Samuels
2016-03-17 10:01                             ` Willem Jan Withagen
2016-03-17 17:29                               ` Howard Chu
2016-03-17 15:21                             ` Igor Fedotov
2016-03-17 15:18                     ` Igor Fedotov
2016-03-17 15:33                       ` Sage Weil
2016-03-17 18:53                         ` Allen Samuels
2016-03-18 14:58                           ` Igor Fedotov
2016-03-18 15:53                         ` Igor Fedotov
2016-03-18 17:17                           ` Vikas Sinha-SSI
2016-03-19  3:14                             ` Allen Samuels
2016-03-21 14:19                             ` Igor Fedotov
2016-03-19  3:14                           ` Allen Samuels
2016-03-21 14:07                             ` Igor Fedotov
2016-03-21 15:14                               ` Allen Samuels
2016-03-21 16:35                                 ` Igor Fedotov
2016-03-21 17:14                                   ` Allen Samuels
2016-03-21 18:31                                     ` Igor Fedotov
2016-03-21 21:14                                       ` Allen Samuels
2016-03-21 15:32                             ` Igor Fedotov
2016-03-21 15:50                               ` Sage Weil
2016-03-21 18:01                                 ` Igor Fedotov [this message]
2016-03-24 12:45                                 ` Igor Fedotov
2016-03-24 22:29                                   ` Allen Samuels
2016-03-29 20:19                                   ` Sage Weil
2016-03-29 20:45                                     ` Allen Samuels
2016-03-30 12:32                                       ` Igor Fedotov
2016-03-30 12:28                                     ` Igor Fedotov
2016-03-30 12:47                                       ` Sage Weil
2016-03-31 21:56                                   ` Sage Weil
2016-04-01 18:54                                     ` Allen Samuels
2016-04-04 12:31                                     ` Igor Fedotov
2016-04-04 12:38                                     ` Igor Fedotov

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=56F036EA.5060401@mirantis.com \
    --to=ifedotov@mirantis.com \
    --cc=Allen.Samuels@sandisk.com \
    --cc=ceph-devel@vger.kernel.org \
    --cc=sage@newdream.net \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

Be sure your reply has a Subject: header at the top and a blank line before the message body.

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox