From: Igor Fedotov <ifedotov@mirantis.com>
To: Allen Samuels <Allen.Samuels@sandisk.com>, Sage Weil <sage@newdream.net>
Cc: ceph-devel <ceph-devel@vger.kernel.org>
Subject: Re: Adding compression support for bluestore.
Date: Wed, 30 Mar 2016 15:32:51 +0300 [thread overview]
Message-ID: <56FBC773.90006@mirantis.com> (raw)
In-Reply-To: <CY1PR0201MB18978DB59BEB7C5DF68E346EE8870@CY1PR0201MB1897.namprd02.prod.outlook.com>
On 29.03.2016 23:45, Allen Samuels wrote:
>> -----Original Message-----
>> From: Sage Weil [mailto:sage@newdream.net]
>> Sent: Tuesday, March 29, 2016 1:20 PM
>> To: Igor Fedotov <ifedotov@mirantis.com>
>> Cc: Allen Samuels <Allen.Samuels@sandisk.com>; ceph-devel <ceph-
>> devel@vger.kernel.org>
>> Subject: Re: Adding compression support for bluestore.
>>
>> On Thu, 24 Mar 2016, Igor Fedotov wrote:
>>> Sage, Allen et. al.
>>>
>>> Please find some follow-up on our discussion below.
>>>
>>> Your past and future comments are highly appreciated.
>>>
>>> WRITE/COMPRESSION POLICY and INTERNAL BLUESTORE STRUCTURES
>> OVERVIEW.
>>> Used terminology:
>>> Extent - basic allocation unit. Variable in size, maximum size is
>>> limited by lblock length (see below), alignment: min_alloc_unit param
>>> (configurable, expected range: 4-64 Kb .
>>> Logical Block (lblock) - standalone traceable data unit. Min size unspecified.
>>> Alignment unspecified. Max size limited by max_logical_unit param
>>> (configurable, expected range: 128-512 Kb)
>>>
>>> Compression to be applied on per-extent basis.
>>> Multiple lblocks can refer specific region within a single extent.
>> This (and the what's below) sound right to me. My main concern is around
>> naming. I don't much like "extent" vs "lblock" (which is which?). Maybe
>> extent and extent_ref?
>>
>> Also, I don't think we need the size limits you mention above. When
>> compression is enabled, we'll limit the size of the disk extents by policy, but
>> the structures themselves needn't enforce that. Similarly, I don't think the
>> lblocks (extent refs? logical extents?) need a max size either.
>>
>> Anyway, right now we have bluestore_extent_t. I'd suggest maybe
>>
>> bluestore_pextent_t and bluestore_lextent_t or
>> bluestore_extent_t and bluestore_extent_ref_t
>>
>> ?
> I prefer the lextent and pextent variant.
+1
>
> Can't we move all of these into a namespace, i.e., bluestore::lextent_t, bluestore::pextent_t, bluestore::onode_t, bluestore::bdev_label_t, etc.. That way the code within Bluestore itself doesn't have to keep redundantly repeating itself with super-long type names...
+1
but I'd suggest to have a standalone activity for that code refactor.
>>> POTENTIAL COMPRESSION APPLICATION POLICIES
>>>
>>> 1) Read/Merge/Write at initial commit phase. (RMW) General approach:
>>> New write request triggers partially overlapped lblock(s)
>>> reading/decompression followed by their merge into a set of new
>>> lblocks. Then compression is (optionally) applied. Resulting lblocks
>>> overwrite existing ones.
>>> For non-overlapping/fully overlapped lblocks read/merge steps are
>>> simply bypassed.
>>> - Read, merge and final compression take place prior to write commit
>>> ack that can impact write operation latency.
>>>
>>> 2) Deferred RMW for partial overlaps. (DRMW) General approach:
>>> Non-overlapping/fully overlapped lblocks handled similar to simple RMW.
>>> For partially overlapped lblocks one should use Write-Ahead Log to
>>> defer RMW procedure until write commit ack return.
>>> - Write operation latency can still be high in some cases (
>>> non-overlapped/fully overlapped writes).
>>> - WAL can grow significantly.
>>>
>>> 3) Writing new lblocks over new extents. (LBlock Bedding?) General
>>> approach:
>>> Write request creates new lblock(s) that use freshly allocated extents.
>>> Overlapped regions within existing lblocks are occluded.
>>> Previously existing extents are preserved for some time (or while
>>> being used) depending on the cleanup policy.
>>> Compression to be performed before write commit ack return.
>>> - Write operation latency is still affected by the compression.
>>> - Store space usage is usually higher.
>>>
>>> 4) Background compression (BCOMP)
>>> General approach:
>>> Write request to be handled using any of the above policies (or their
>>> combination) with no compression applied. Stored extents are
>>> compressed by some background process independently from the client
>> write flow.
>>> Merging new uncompressed lblock with already compressed one can be
>>> tricky here.
>>> + Write operation latency isn't affected by the compression.
>>> - Double disk write occurs
>>>
>>> To provide better user experience above-mentioned policies can be used
>>> together depending on the write pattern.
>>>
>>> INTERNAL DATA STRUCTURES TO TRACK OBJECT CONTENT.
>>>
>>> To track object content we need to introduce following 2 collections:
>>>
>>> 1) LBlock map:
>>> That's a logical offset mapping to a region within an extent:
>>> LOFFS -> {
>>> EXTENT_REF - reference to an underlying extent, e.g. pointer for
>>> in-memory representation or extent ID for "on-disk" one
>>> X_OFFS, X_LEN, - region descriptor within an extent: relative offset and
>>> region length
>>> LFLAGS - some associated flags for the lblock. Any usage???
>>> }
>>>
>>> 2) Extent collection:
>>> Each entry describes an allocation unit within storage space.
>>> Compression to be applied on per-extent basis thus extent's logical
>>> volume can be greater than it's physical size.
>>>
>>> {
>>> P_OFFS - physical block address
>>> SIZE - actual stored data length
>>> EFLAGS - flags associated with the extent
>>> COMPRESSION_ALG - An applied compression algorithm id if any
>>> CHECKSUM(s) - Pre-/Post compression checksums. Use cases TBD.
>>> REFCOUNT - Number of references to this entry
>>> }
>> Yep (modulo naming).
>>
>>> The possible container for this collection can be a mapping: id ->
>>> extent. It looks like such mapping is required during on-disk to
>>> in-memory representation transform as smart pointer seems to be enough
>> for in-memory use.
>>
>> Given the structures are small I'm not sure smart pointers are worth it..
>> Maybe just a simple vector (or maybe flat_map) for the extents? Lookup will
>> be fast.
>>
> Smart pointers don't work well in the code. The deallocation of the pextent is more than just freeing the memory when the lextent reference count goes to zero -- It also includes the updating of a transaction to mutate the KV store to match the deallocation. Thus the destructor needs a reference to the KeyValueDB::transaction, which isn't really clean and easy to arrange (you'll have to hide it in the object, or some other ugly hack). From a coding perspective, I think you'll just have manually managed reference counts with explicit deallocation calls that pass in the right parameters.
Agree.
next prev parent reply other threads:[~2016-03-30 12:32 UTC|newest]
Thread overview: 55+ messages / expand[flat|nested] mbox.gz Atom feed top
2016-02-15 16:29 Adding compression support for bluestore Igor Fedotov
2016-02-16 2:06 ` Haomai Wang
2016-02-17 0:11 ` Igor Fedotov
2016-02-19 23:13 ` Allen Samuels
2016-02-22 12:25 ` Sage Weil
2016-02-24 18:18 ` Igor Fedotov
2016-02-24 18:43 ` Allen Samuels
2016-02-26 17:41 ` Igor Fedotov
2016-03-15 17:12 ` Sage Weil
2016-03-16 1:06 ` Allen Samuels
2016-03-16 18:34 ` Igor Fedotov
2016-03-16 19:02 ` Allen Samuels
2016-03-16 19:15 ` Sage Weil
2016-03-16 19:20 ` Allen Samuels
2016-03-16 19:29 ` Sage Weil
2016-03-16 19:36 ` Allen Samuels
2016-03-17 14:55 ` Igor Fedotov
2016-03-17 15:28 ` Allen Samuels
2016-03-18 13:00 ` Igor Fedotov
2016-03-16 19:27 ` Sage Weil
2016-03-16 19:41 ` Allen Samuels
[not found] ` <CA+z5DsxA9_LLozFrDOtnVRc7FcvN7S8OF12zswQZ4q4ysK_0BA@mail.gmail.com>
2016-03-16 22:56 ` Blair Bethwaite
2016-03-17 3:21 ` Allen Samuels
2016-03-17 10:01 ` Willem Jan Withagen
2016-03-17 17:29 ` Howard Chu
2016-03-17 15:21 ` Igor Fedotov
2016-03-17 15:18 ` Igor Fedotov
2016-03-17 15:33 ` Sage Weil
2016-03-17 18:53 ` Allen Samuels
2016-03-18 14:58 ` Igor Fedotov
2016-03-18 15:53 ` Igor Fedotov
2016-03-18 17:17 ` Vikas Sinha-SSI
2016-03-19 3:14 ` Allen Samuels
2016-03-21 14:19 ` Igor Fedotov
2016-03-19 3:14 ` Allen Samuels
2016-03-21 14:07 ` Igor Fedotov
2016-03-21 15:14 ` Allen Samuels
2016-03-21 16:35 ` Igor Fedotov
2016-03-21 17:14 ` Allen Samuels
2016-03-21 18:31 ` Igor Fedotov
2016-03-21 21:14 ` Allen Samuels
2016-03-21 15:32 ` Igor Fedotov
2016-03-21 15:50 ` Sage Weil
2016-03-21 18:01 ` Igor Fedotov
2016-03-24 12:45 ` Igor Fedotov
2016-03-24 22:29 ` Allen Samuels
2016-03-29 20:19 ` Sage Weil
2016-03-29 20:45 ` Allen Samuels
2016-03-30 12:32 ` Igor Fedotov [this message]
2016-03-30 12:28 ` Igor Fedotov
2016-03-30 12:47 ` Sage Weil
2016-03-31 21:56 ` Sage Weil
2016-04-01 18:54 ` Allen Samuels
2016-04-04 12:31 ` Igor Fedotov
2016-04-04 12:38 ` Igor Fedotov
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=56FBC773.90006@mirantis.com \
--to=ifedotov@mirantis.com \
--cc=Allen.Samuels@sandisk.com \
--cc=ceph-devel@vger.kernel.org \
--cc=sage@newdream.net \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox