From: Igor Fedotov <ifedotov@mirantis.com>
To: Sage Weil <sage@newdream.net>
Cc: Allen Samuels <Allen.Samuels@sandisk.com>,
ceph-devel <ceph-devel@vger.kernel.org>
Subject: Re: Adding compression support for bluestore.
Date: Mon, 21 Mar 2016 21:01:14 +0300 [thread overview]
Message-ID: <56F036EA.5060401@mirantis.com> (raw)
In-Reply-To: <alpine.DEB.2.11.1603211141560.23875@cpach.fuggernut.com>
On 21.03.2016 18:50, Sage Weil wrote:
> On Mon, 21 Mar 2016, Igor Fedotov wrote:
>> On 19.03.2016 6:14, Allen Samuels wrote:
>>> If we're going to both allow compression and delayed overwrite we simply
>>> have to handle the case where new data actually overlaps with previous data
>>> -- recursively. If I understand the current code, it handles exactly one
>>> layer of overlay which is always stored in KV store. We need to generalize
>>> this data structure. I'm going to outline a proposal, which If I get wrong,
>>> I beg forgiveness -- I'm not as familiar with this code as I would like,
>>> especially the ref-counted shared extent stuff. But I'm going to blindly
>>> dive in and assume that Sage will correct me when I go off the tracks -- and
>>> therefore end up learning how all of this stuff REALLY works.
>>>
>>> I propose that the current bluestore_extent_t and bluestore_overlay_t be
>>> essentially unified into a single structure with a typemark to distinguish
>>> between being in KV store or in raw block storage. Here's an example: (for
>>> this discussion, BLOCK_SIZE is 4K and is the minimum physical I/O size).
>>>
>>> Struct bluestore_extent_t {
>>> Uint64_t logical_size; // size of data before any
>>> compression. MUST BE AN INTEGER MULTIPLE of BLOCK_SIZE (and != 0)
>>> Uint64_t physical_size; // size of data on
>>> physical media (yes, this is unneeded when location == KV, the
>>> serialize/deserialize could compress this out -- but this is an unneeded
>>> optimization
>>> Uint64_t location:1; // values (in
>>> ENUM form) are "KV" and "BLOCK"
>>> Uint64_t compression_alg:4; // compression algorithm...
>>> Uint64_t otherflags:xx; // round it out.
>>> Uint64_t media_address; // forms Key when
>>> location == KV block address when location == BLOCK
>>> Vector<uint32_t> checksums; // Media checksums. See
>>> commentary on this below.
>>> };
>>>
>>> This allows any amount of compressed or uncompressed data to be identified
>>> in either a KV key or a block store.
>>>
>> As promised please find a competing proposal for extent map structure. It can
>> be used for handling unaligned overlapping writes of both
>> compressed/uncompressed data. It seems it's applicable for any compression
>> policy but my primary intention was to allow overwrites that use totally
>> different extents without the touch to the existing(overwritten) ones. I.e.
>> that's what Sage explained this way some time ago:
>>
>> "b) we could just leave the overwritten extents alone and structure the
>> block_map so that they are occluded. This will 'leak' space for some
>> write patterns, but that might be okay given that we can come back later
>> and clean it up, or refine our strategy to be smarter."
>>
>> Nevertheless the corresponding infrastructure seems to be applicable for
>> different use cases too.
>>
>> At first let's consider simple raw data overwrite case. No compression,
>> checksums, flags at this point for the sake of simplicity.
>> Block map entry to be defined as follows:
>> OFFS: < EXT_OFFS, EXT_LEN, X_OFFS, X_LEN>
>> where
>> EXT_OFFS, EXT_LEN - allocated extent offset and size, AKA physical address and
>> size.
>> X_OFFS - relative offset within the block where valid (not overwritten) data
>> starts. Full data offset = OFFS + X_OFFS
>> X_LEN - valid data size.
>> Invariant: Block length == X_OFFS + X_LEN
>>
>> Let's consider sample block map transform:
>> --------------------------------------------------------
>> ****** Step 0 (two incoming writes of 50 Kb at offset 0 and 100K):
>> ->Write(0,50)
>> ->Write(100, 50)
>>
>> Resulting block map ( OFFS: {EXT_OFFS, EXT_LEN, X_OFFS, X_LEN} ):
>> 0: {EO1, 50, 0, 50}
>> 100: {EO2, 50, 0, 50}
>>
>> Where EO1, EO2 - physical addresses for allocated extents.
>> Two new entries have been inserted.
>>
>> ****** Step 1 ( overwrite that partially overlaps both existing blocks ):
>> ->Write(25,100)
>>
>> Resulting block map ( OFFS: {EXT_OFFS, EXT_LEN, X_OFFS, X_LEN} ):
>> 0: {EO1, 50, 0, 25}
>> 25: {EO3, 100, 0, 100}
>> 125: {EO2, 50, 25, 25}
>>
>> As one can see new entry at offset 25 has appeared and previous entries have
>> been altered (including the map key (100->125) for the last entry). No
>> physical extents reallocation took place though - just a new one at E03 has
>> been allocated.
>> Please note that client accessible data for block EO2 are actually stored at
>> EO2 + X_OFF(=25) and have 25K only despite the fact that extent has 50K total.
>> The same for block EO1 - valid data length = 25K only.
>>
>>
>> ****** Step 2 ( overwrite that partially overlaps existing blocks once again):
>> ->Write(70, 65)
>>
>> Resulting block map ( OFFS: {EXT_OFFS, EXT_LEN, X_OFFS, X_LEN} ):
>> 0: {EO1, 50, 0, 25}
>> 25: {EO3, 100, 0, 45}
>> 70: {EO4, 65, 0, 65}
>> 135: {EO2, 50, 35, 15}
>>
>> Yet another new entry. Overlapped block entries at 25 & 125 were altered.
>>
>> ****** Step 3 ( overwrite that partially overlaps one block and totally
>> overwrite the last one):
>> ->Write(100, 60)
>>
>> Resulting block map ( OFFS: {EXT_OFFS, EXT_LEN, X_OFFS, X_LEN} ):
>> 0: {EO1, 50, 0, 25}
>> 25: {EO3, 100, 0, 45}
>> 70: {EO4, 65, 0, 35}
>> 100: {EO5, 60, 0, 60}
>> -140: {EO2, 50, 50, 0} -> to be removed as it's totally overwritten ( see
>> X_LEN = 0 )
>>
>> Entry for EO4 have been altered and entry EO2 to be removed. The latter can be
>> done both immediately on map alteration and by some background cleanup
>> procedure.
>>
>> ****** Step 4 ( overwrite that totally overlap the first block):
>> ->Write(0, 25)
>>
>> Resulting block map ( OFFS: {EXT_OFFS, EXT_LEN, X_OFFS, X_LEN} ):
>> 0: {EO6, 25, 0, 25}
>> - 0: {EO1, 50, 25, 0} -> to be removed
>> 25: {EO3, 100, 0, 45}
>> 70: {EO4, 65, 0, 35}
>> 100: {EO5, 60, 0, 60}
>>
>> Entry for EO1 has been overwritten and to be removed.
>> --------------------------------------------------------------------------------------
>>
>> Extending this block map for compression is trivial - we need to introduce
>> compression algorithm flag to the map. And vary EXT_LEN (and actual physical
>> allocation) depending on the actual compression ratio.
>> E.g. with ratio=3 (60K reduced to 20K) the record from the last step turn into
>> :
>> 100: {EO5, 20, 0, 60}
>>
>> Other compression aspects handled by the corresponding policies ( e.g. when
>> perform the compression ( immediately, lazily or totally in background ) or
>> how to merge neighboring compressed blocks ) probably don't impact the
>> structure of the map entry - they just shuffle the entries.
> This is much simpler! There is one case we need to address that I don't
> see above, though. Consider,
>
> - write 0~1048576, and compress it
> - write 16384~4096
Good catch! I really missed this case.
> When we split the large extent into two pieces, the resulting extent map,
> as per above, would be something like
>
> 0: {EO1, 1048576, 0, 4096, zlib}
> 4096: {E02, 16384, 0, 4096, uncompressed}
> 16384: {E01, 1048576, 20480, 1028096, zlib}
>
> ...which is fine, except that it's the *same* compressed extent, which
> means the code that decides that the physical extent is no longer
> referenced and can be released needs to ensure that no other extents in
> the map reference it. I think that's an O(n) pass across the map when
> releasing.
>
> Also, if we add in checksums, then we'd be duplicating them in the two
> instances that reference the raw extent.
>
> I wonder if it makes sense to break this into two structures.. one that
> lists the raw extents, and another that maps them into the logical space.
> So that there is one record for {E01, 1048576, zlib, checksums}, and then
> the block map is more like
>
> 0: {E0, 0, 4096}
> 4096: {E1, 0, 4096}
> 16384: {E0, 20480, 1028096}
>
> and
>
> 0: E01, 1048576, 0, 4096, zlib, checksums
> 1: E02, 16384, 0, 4096, uncompressed, checksums
Sounds good!
> ?
>
> sage
Thanks,
Igor
next prev parent reply other threads:[~2016-03-21 18:01 UTC|newest]
Thread overview: 55+ messages / expand[flat|nested] mbox.gz Atom feed top
2016-02-15 16:29 Adding compression support for bluestore Igor Fedotov
2016-02-16 2:06 ` Haomai Wang
2016-02-17 0:11 ` Igor Fedotov
2016-02-19 23:13 ` Allen Samuels
2016-02-22 12:25 ` Sage Weil
2016-02-24 18:18 ` Igor Fedotov
2016-02-24 18:43 ` Allen Samuels
2016-02-26 17:41 ` Igor Fedotov
2016-03-15 17:12 ` Sage Weil
2016-03-16 1:06 ` Allen Samuels
2016-03-16 18:34 ` Igor Fedotov
2016-03-16 19:02 ` Allen Samuels
2016-03-16 19:15 ` Sage Weil
2016-03-16 19:20 ` Allen Samuels
2016-03-16 19:29 ` Sage Weil
2016-03-16 19:36 ` Allen Samuels
2016-03-17 14:55 ` Igor Fedotov
2016-03-17 15:28 ` Allen Samuels
2016-03-18 13:00 ` Igor Fedotov
2016-03-16 19:27 ` Sage Weil
2016-03-16 19:41 ` Allen Samuels
[not found] ` <CA+z5DsxA9_LLozFrDOtnVRc7FcvN7S8OF12zswQZ4q4ysK_0BA@mail.gmail.com>
2016-03-16 22:56 ` Blair Bethwaite
2016-03-17 3:21 ` Allen Samuels
2016-03-17 10:01 ` Willem Jan Withagen
2016-03-17 17:29 ` Howard Chu
2016-03-17 15:21 ` Igor Fedotov
2016-03-17 15:18 ` Igor Fedotov
2016-03-17 15:33 ` Sage Weil
2016-03-17 18:53 ` Allen Samuels
2016-03-18 14:58 ` Igor Fedotov
2016-03-18 15:53 ` Igor Fedotov
2016-03-18 17:17 ` Vikas Sinha-SSI
2016-03-19 3:14 ` Allen Samuels
2016-03-21 14:19 ` Igor Fedotov
2016-03-19 3:14 ` Allen Samuels
2016-03-21 14:07 ` Igor Fedotov
2016-03-21 15:14 ` Allen Samuels
2016-03-21 16:35 ` Igor Fedotov
2016-03-21 17:14 ` Allen Samuels
2016-03-21 18:31 ` Igor Fedotov
2016-03-21 21:14 ` Allen Samuels
2016-03-21 15:32 ` Igor Fedotov
2016-03-21 15:50 ` Sage Weil
2016-03-21 18:01 ` Igor Fedotov [this message]
2016-03-24 12:45 ` Igor Fedotov
2016-03-24 22:29 ` Allen Samuels
2016-03-29 20:19 ` Sage Weil
2016-03-29 20:45 ` Allen Samuels
2016-03-30 12:32 ` Igor Fedotov
2016-03-30 12:28 ` Igor Fedotov
2016-03-30 12:47 ` Sage Weil
2016-03-31 21:56 ` Sage Weil
2016-04-01 18:54 ` Allen Samuels
2016-04-04 12:31 ` Igor Fedotov
2016-04-04 12:38 ` Igor Fedotov
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=56F036EA.5060401@mirantis.com \
--to=ifedotov@mirantis.com \
--cc=Allen.Samuels@sandisk.com \
--cc=ceph-devel@vger.kernel.org \
--cc=sage@newdream.net \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox