Re: [PATCH] BTT: Use dram freelist and remove bflog to otpimize perf

NVDIMM Device and Persistent Memory development

From: dennis.wu <dennis.wu@intel.com>
To: Dan Williams <dan.j.williams@intel.com>, <nvdimm@lists.linux.dev>
Cc: <vishal.l.verma@intel.com>, <dave.jiang@intel.com>
Subject: Re: [PATCH] BTT: Use dram freelist and remove bflog to otpimize perf
Date: Tue, 19 Jul 2022 14:01:08 +0800
Message-ID: <810ab3e8-6755-da02-b6ed-ac480708067f@intel.com>
In-Reply-To: <62cd01462c460_5c814294e@dwillia2-xfh.notmuch>

Hi Dan,

Thank you!

We are currently working with a customer to evaluate ClickHouse and
RocketMQ with this optimization. The preliminary performance data
shows an improvement. We plan to do some pathfinding work in Q3.

Regarding compatibility: the limitation is that a namespace cannot
switch back from the new algorithm to the old one. I think a new BTT
layout version is a good idea. I will look into how to make it happen.

Thank you very much!

Dennis Wu

On 7/12/22 13:06, Dan Williams wrote:
> dennis.wu wrote:
>> Dependency:
>> [PATCH] nvdimm: Add NVDIMM_NO_DEEPFLUSH flag to control btt
>> data deepflush
>> https://lore.kernel.org/nvdimm/20220629135801.192821-1-dennis.wu@intel.com/T/#u
>>
>> Reason:
>> In BTT, each write writes the sector data, updates a 4-byte btt_map
>> entry, and updates a 16-byte bflog entry (two 8-byte atomic writes).
>> This metadata write overhead is large. We can optimize the algorithm
>> by not using the bflog, so that each write only updates the sector
>> data and then the 4-byte btt_map entry.
>>
>> How:
>> 1. Scan the btt_map to generate the ABA mapping bitmap; if an
>> internal ABA is used, its bit is set.
>> 2. Generate the in-memory freelist according to the ABA bitmap. The
>> freelist is an array that records all the free ABAs, like:
>> | 340 | 422 | 578 |...
>> which means ABAs 340, 422, and 578 are free. The last nfree (nlane)
>> records in the array are used for each lane at the beginning.
>> 3. Get the free ABA of a lane and write the data to that ABA. If the
>> premap btt_map entry is in the initialization state (e_flag=0,
>> z_flag=0), get a free ABA from the free ABA array for the lane. If
>> the premap btt_map entry is not in the initialization state, the ABA
>> in the btt_map entry is treated as the free ABA of the lane. Once
>> the number of free ABAs equals nfree, the arena is fully written and
>> we can free the whole freelist (not implemented yet).
>> 4. In the code, "version_major == 2" is the new algorithm and
>> the logic in the else branch is the old algorithm.
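
Steps 1 and 2 above could look roughly like the following user-space
sketch. It is not the patch's code. It assumes a 32-bit map entry with
the Z flag in bit 31 and the E flag in bit 30, and it assumes that an
entry in the initialization state does not occupy any ABA under the new
scheme. The names build_freelist and struct aba_freelist are made up
for illustration.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

/* Flag bits of a 32-bit map entry (assumed layout: Z = bit 31, E = bit 30). */
#define MAP_Z_FLAG   (1u << 31)
#define MAP_E_FLAG   (1u << 30)
#define MAP_LBA_MASK (~(MAP_Z_FLAG | MAP_E_FLAG))

/* Hypothetical container for the in-memory freelist. */
struct aba_freelist {
	uint32_t *aba;	/* free ABAs in ascending order */
	size_t count;	/* number of valid entries in aba[] */
};

/*
 * Scan the map once, mark every ABA referenced by a non-initial entry
 * in a bitmap, then collect the unmarked ABAs into an array.
 * Returns 0 on success, -1 on allocation failure or a corrupt entry.
 */
static int build_freelist(const uint32_t *map, size_t external_nlba,
			  size_t internal_nlba, struct aba_freelist *out)
{
	size_t words = (internal_nlba + 63) / 64;
	uint64_t *used;
	size_t i, n = 0;

	assert(map && out && internal_nlba > 0);
	used = calloc(words, sizeof(*used));
	if (!used)
		return -1;

	for (i = 0; i < external_nlba; i++) {
		uint32_t ent = map[i];
		uint32_t aba = ent & MAP_LBA_MASK;

		if ((ent & (MAP_Z_FLAG | MAP_E_FLAG)) == 0)
			continue;	/* initialization state: no ABA in use */
		if (aba >= internal_nlba) {
			free(used);
			return -1;	/* entry points outside the arena */
		}
		used[aba / 64] |= (uint64_t)1 << (aba % 64);
	}

	/* Worst case every ABA is free, so size the array for all of them. */
	out->aba = malloc(internal_nlba * sizeof(*out->aba));
	if (!out->aba) {
		free(used);
		return -1;
	}
	for (i = 0; i < internal_nlba; i++)
		if (!(used[i / 64] & ((uint64_t)1 << (i % 64))))
			out->aba[n++] = (uint32_t)i;
	out->count = n;
	free(used);
	return 0;
}
```

With a 4-entry map whose entries 1 and 3 point at ABAs 5 and 1, and an
arena of 6 internal ABAs, the resulting freelist is 0, 2, 3, 4. The
caller owns out->aba and frees it when the arena is fully written.
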
>>
>> Result:
>> 1. Write performance improves by ~50%, and latency drops to 60%
>> of the original algorithm's.
> How does this improvement affect a real-world workload vs a
> microbenchmark?
>
>> 2. During initialization, scanning the btt_map and generating the
>> freelist takes time and makes namespace enable slower. With 4K
>> sectors and a 1TB namespace, the enable time is less than 4s. This
>> only happens once, during initialization.
>> 3. The freelist takes 4 bytes of memory per sector. But once
>> the arena is fully written, the freelist can be freed. In the
>> storage case, the disk is normally fully written in use, so
>> there is then no memory space overhead.
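
To put the "4 bytes per sector" figure in concrete terms, here is a
small sketch of the worst-case freelist size, i.e. when no sector has
been written yet. It assumes 1TB means 2^40 bytes and ignores the
nfree spare blocks and arena metadata. The function name is made up
for illustration.

```c
#include <assert.h>
#include <stdint.h>

/* One 32-bit ABA per freelist slot, as stated in the patch description. */
#define FREELIST_ENTRY_BYTES 4u

/* Worst-case DRAM used by the freelist: one entry per sector. */
static uint64_t freelist_worst_case_bytes(uint64_t namespace_bytes,
					  uint32_t sector_size)
{
	assert(sector_size > 0);
	return (namespace_bytes / sector_size) * FREELIST_ENTRY_BYTES;
}
```

For a 2^40-byte namespace with 4096-byte sectors this gives 2^28
entries, i.e. 1 GiB of DRAM until the arena fills up.
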
>>
>> Compatibility:
>> 1. The new algorithm keeps the bflog layout and only ignores its
>> logic, which means the bflog is not updated under the new algorithm.
>> 2. If a namespace was created with the old algorithm and layout, you
>> can switch to the new algorithm seamlessly without any specific
>> operation.
>> 3. Since the bflog is not updated once you move to the new
>> algorithm, after you write data with the new algorithm you
>> can't switch back to the old algorithm.
> Before digging deeper into the implementation, this needs a better
> compatibility story. It is not acceptable to break the on-media format
> like this.  Consider someone bisecting a kernel problem over this
> change, or someone reverting to an older kernel after encountering a
> regression. As far as I can see this would need to be a BTT3 layout and
> require explicit opt-in to move to the new format.
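
One possible shape for such a version gate, as a sketch only. It
assumes the existing layouts keep major versions 1 and 2 and the new
format gets a distinct major version. The value 3 is taken from the
"BTT3" wording above and is hypothetical, as are the enum and function
names.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical major version for the freelist-based layout. */
#define BTT_MAJOR_DRAM_FREELIST 3u

enum btt_write_path {
	BTT_PATH_BFLOG,		/* existing algorithm, bflog kept up to date */
	BTT_PATH_DRAM_FREELIST,	/* new algorithm, bflog not updated */
	BTT_PATH_REJECT,	/* unknown or unsupported layout */
};

/*
 * Pick the write path from the on-media version only. An arena with an
 * existing layout is never silently moved to the new algorithm, so an
 * older kernel can still use it after a bisect or a revert.
 */
static enum btt_write_path select_write_path(uint16_t version_major,
					     bool kernel_supports_freelist)
{
	if (version_major == 1 || version_major == 2)
		return BTT_PATH_BFLOG;
	if (version_major == BTT_MAJOR_DRAM_FREELIST)
		return kernel_supports_freelist ? BTT_PATH_DRAM_FREELIST
						: BTT_PATH_REJECT;
	return BTT_PATH_REJECT;
}
```

In this sketch the opt-in happens when the namespace is created with
the new version, not at write time.
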
