From: Igor Fedotov <ifedotov@mirantis.com>
To: Allen Samuels <Allen.Samuels@sandisk.com>, Sage Weil <sage@newdream.net>
Cc: ceph-devel <ceph-devel@vger.kernel.org>
Subject: Re: Adding compression support for bluestore.
Date: Wed, 30 Mar 2016 15:32:51 +0300 [thread overview]
Message-ID: <56FBC773.90006@mirantis.com> (raw)
In-Reply-To: <CY1PR0201MB18978DB59BEB7C5DF68E346EE8870@CY1PR0201MB1897.namprd02.prod.outlook.com>
On 29.03.2016 23:45, Allen Samuels wrote:
>> -----Original Message-----
>> From: Sage Weil [mailto:sage@newdream.net]
>> Sent: Tuesday, March 29, 2016 1:20 PM
>> To: Igor Fedotov <ifedotov@mirantis.com>
>> Cc: Allen Samuels <Allen.Samuels@sandisk.com>; ceph-devel <ceph-
>> devel@vger.kernel.org>
>> Subject: Re: Adding compression support for bluestore.
>>
>> On Thu, 24 Mar 2016, Igor Fedotov wrote:
>>> Sage, Allen et. al.
>>>
>>> Please find some follow-up on our discussion below.
>>>
>>> Your past and future comments are highly appreciated.
>>>
>>> WRITE/COMPRESSION POLICY and INTERNAL BLUESTORE STRUCTURES
>> OVERVIEW.
>>> Used terminology:
>>> Extent - basic allocation unit. Variable in size, maximum size is
>>> limited by lblock length (see below), alignment: min_alloc_unit param
>>> (configurable, expected range: 4-64 Kb .
>>> Logical Block (lblock) - standalone traceable data unit. Min size unspecified.
>>> Alignment unspecified. Max size limited by max_logical_unit param
>>> (configurable, expected range: 128-512 Kb)
>>>
>>> Compression to be applied on per-extent basis.
>>> Multiple lblocks can refer specific region within a single extent.
>> This (and the what's below) sound right to me. My main concern is around
>> naming. I don't much like "extent" vs "lblock" (which is which?). Maybe
>> extent and extent_ref?
>>
>> Also, I don't think we need the size limits you mention above. When
>> compression is enabled, we'll limit the size of the disk extents by policy, but
>> the structures themselves needn't enforce that. Similarly, I don't think the
>> lblocks (extent refs? logical extents?) need a max size either.
>>
>> Anyway, right now we have bluestore_extent_t. I'd suggest maybe
>>
>> bluestore_pextent_t and bluestore_lextent_t or
>> bluestore_extent_t and bluestore_extent_ref_t
>>
>> ?
> I prefer the lextent and pextent variant.
+1
>
> Can't we move all of these into a namespace, i.e., bluestore::lextent_t, bluestore::pextent_t, bluestore::onode_t, bluestore::bdev_label_t, etc.. That way the code within Bluestore itself doesn't have to keep redundantly repeating itself with super-long type names...
+1
but I'd suggest to have a standalone activity for that code refactor.
>>> POTENTIAL COMPRESSION APPLICATION POLICIES
>>>
>>> 1) Read/Merge/Write at initial commit phase. (RMW) General approach:
>>> New write request triggers partially overlapped lblock(s)
>>> reading/decompression followed by their merge into a set of new
>>> lblocks. Then compression is (optionally) applied. Resulting lblocks
>>> overwrite existing ones.
>>> For non-overlapping/fully overlapped lblocks read/merge steps are
>>> simply bypassed.
>>> - Read, merge and final compression take place prior to write commit
>>> ack that can impact write operation latency.
>>>
>>> 2) Deferred RMW for partial overlaps. (DRMW) General approach:
>>> Non-overlapping/fully overlapped lblocks handled similar to simple RMW.
>>> For partially overlapped lblocks one should use Write-Ahead Log to
>>> defer RMW procedure until write commit ack return.
>>> - Write operation latency can still be high in some cases (
>>> non-overlapped/fully overlapped writes).
>>> - WAL can grow significantly.
>>>
>>> 3) Writing new lblocks over new extents. (LBlock Bedding?) General
>>> approach:
>>> Write request creates new lblock(s) that use freshly allocated extents.
>>> Overlapped regions within existing lblocks are occluded.
>>> Previously existing extents are preserved for some time (or while
>>> being used) depending on the cleanup policy.
>>> Compression to be performed before write commit ack return.
>>> - Write operation latency is still affected by the compression.
>>> - Store space usage is usually higher.
>>>
>>> 4) Background compression (BCOMP)
>>> General approach:
>>> Write request to be handled using any of the above policies (or their
>>> combination) with no compression applied. Stored extents are
>>> compressed by some background process independently from the client
>> write flow.
>>> Merging new uncompressed lblock with already compressed one can be
>>> tricky here.
>>> + Write operation latency isn't affected by the compression.
>>> - Double disk write occurs
>>>
>>> To provide better user experience above-mentioned policies can be used
>>> together depending on the write pattern.
>>>
>>> INTERNAL DATA STRUCTURES TO TRACK OBJECT CONTENT.
>>>
>>> To track object content we need to introduce following 2 collections:
>>>
>>> 1) LBlock map:
>>> That's a logical offset mapping to a region within an extent:
>>> LOFFS -> {
>>> EXTENT_REF - reference to an underlying extent, e.g. pointer for
>>> in-memory representation or extent ID for "on-disk" one
>>> X_OFFS, X_LEN, - region descriptor within an extent: relative offset and
>>> region length
>>> LFLAGS - some associated flags for the lblock. Any usage???
>>> }
>>>
>>> 2) Extent collection:
>>> Each entry describes an allocation unit within storage space.
>>> Compression to be applied on per-extent basis thus extent's logical
>>> volume can be greater than it's physical size.
>>>
>>> {
>>> P_OFFS - physical block address
>>> SIZE - actual stored data length
>>> EFLAGS - flags associated with the extent
>>> COMPRESSION_ALG - An applied compression algorithm id if any
>>> CHECKSUM(s) - Pre-/Post compression checksums. Use cases TBD.
>>> REFCOUNT - Number of references to this entry
>>> }
>> Yep (modulo naming).
>>
>>> The possible container for this collection can be a mapping: id ->
>>> extent. It looks like such mapping is required during on-disk to
>>> in-memory representation transform as smart pointer seems to be enough
>> for in-memory use.
>>
>> Given the structures are small I'm not sure smart pointers are worth it..
>> Maybe just a simple vector (or maybe flat_map) for the extents? Lookup will
>> be fast.
>>
> Smart pointers don't work well in the code. The deallocation of the pextent is more than just freeing the memory when the lextent reference count goes to zero -- It also includes the updating of a transaction to mutate the KV store to match the deallocation. Thus the destructor needs a reference to the KeyValueDB::transaction, which isn't really clean and easy to arrange (you'll have to hide it in the object, or some other ugly hack). From a coding perspective, I think you'll just have manually managed reference counts with explicit deallocation calls that pass in the right parameters.
Agree.
next prev parent reply other threads:[~2016-03-30 12:32 UTC|newest]
Thread overview: 55+ messages / expand[flat|nested] mbox.gz Atom feed top
2016-02-15 16:29 Adding compression support for bluestore Igor Fedotov
2016-02-16 2:06 ` Haomai Wang
2016-02-17 0:11 ` Igor Fedotov
2016-02-19 23:13 ` Allen Samuels
2016-02-22 12:25 ` Sage Weil
2016-02-24 18:18 ` Igor Fedotov
2016-02-24 18:43 ` Allen Samuels
2016-02-26 17:41 ` Igor Fedotov
2016-03-15 17:12 ` Sage Weil
2016-03-16 1:06 ` Allen Samuels
2016-03-16 18:34 ` Igor Fedotov
2016-03-16 19:02 ` Allen Samuels
2016-03-16 19:15 ` Sage Weil
2016-03-16 19:20 ` Allen Samuels
2016-03-16 19:29 ` Sage Weil
2016-03-16 19:36 ` Allen Samuels
2016-03-17 14:55 ` Igor Fedotov
2016-03-17 15:28 ` Allen Samuels
2016-03-18 13:00 ` Igor Fedotov
2016-03-16 19:27 ` Sage Weil
2016-03-16 19:41 ` Allen Samuels
[not found] ` <CA+z5DsxA9_LLozFrDOtnVRc7FcvN7S8OF12zswQZ4q4ysK_0BA@mail.gmail.com>
2016-03-16 22:56 ` Blair Bethwaite
2016-03-17 3:21 ` Allen Samuels
2016-03-17 10:01 ` Willem Jan Withagen
2016-03-17 17:29 ` Howard Chu
2016-03-17 15:21 ` Igor Fedotov
2016-03-17 15:18 ` Igor Fedotov
2016-03-17 15:33 ` Sage Weil
2016-03-17 18:53 ` Allen Samuels
2016-03-18 14:58 ` Igor Fedotov
2016-03-18 15:53 ` Igor Fedotov
2016-03-18 17:17 ` Vikas Sinha-SSI
2016-03-19 3:14 ` Allen Samuels
2016-03-21 14:19 ` Igor Fedotov
2016-03-19 3:14 ` Allen Samuels
2016-03-21 14:07 ` Igor Fedotov
2016-03-21 15:14 ` Allen Samuels
2016-03-21 16:35 ` Igor Fedotov
2016-03-21 17:14 ` Allen Samuels
2016-03-21 18:31 ` Igor Fedotov
2016-03-21 21:14 ` Allen Samuels
2016-03-21 15:32 ` Igor Fedotov
2016-03-21 15:50 ` Sage Weil
2016-03-21 18:01 ` Igor Fedotov
2016-03-24 12:45 ` Igor Fedotov
2016-03-24 22:29 ` Allen Samuels
2016-03-29 20:19 ` Sage Weil
2016-03-29 20:45 ` Allen Samuels
2016-03-30 12:32 ` Igor Fedotov [this message]
2016-03-30 12:28 ` Igor Fedotov
2016-03-30 12:47 ` Sage Weil
2016-03-31 21:56 ` Sage Weil
2016-04-01 18:54 ` Allen Samuels
2016-04-04 12:31 ` Igor Fedotov
2016-04-04 12:38 ` Igor Fedotov
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=56FBC773.90006@mirantis.com \
--to=ifedotov@mirantis.com \
--cc=Allen.Samuels@sandisk.com \
--cc=ceph-devel@vger.kernel.org \
--cc=sage@newdream.net \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.