NVDIMM Device and Persistent Memory development
From: dennis.wu <dennis.wu@intel.com>
To: "nvdimm@lists.linux.dev" <nvdimm@lists.linux.dev>
Cc: "Verma, Vishal L" <vishal.l.verma@intel.com>,
	"Williams, Dan J" <dan.j.williams@intel.com>,
	"Jiang, Dave" <dave.jiang@intel.com>
Subject: Re: [PATCH] BTT: Use dram freelist and remove bflog to optimize perf
Date: Mon, 11 Jul 2022 10:31:46 +0800	[thread overview]
Message-ID: <60ee6aef-3dd7-a2ce-1a04-c8f19e9efd59@intel.com> (raw)
In-Reply-To: <20220630134244.685331-1-dennis.wu@intel.com>

Vishal, Dan and Dave,

Can you help review the patch and give some comments?

BR,
Dennis Wu

On 6/30/22 21:42, Wu, Dennis wrote:
> Dependency:
> [PATCH] nvdimm: Add NVDIMM_NO_DEEPFLUSH flag to control btt
> data deepflush
> https://lore.kernel.org/nvdimm/20220629135801.192821-1-dennis.wu@intel.com/T/#u
>
> Reason:
> In BTT, each write updates the sector data, a 4-byte btt_map
> entry, and 16 bytes of bflog (two 8-byte atomic writes). This
> metadata write overhead is significant; we can optimize the
> algorithm to drop the bflog, so that each write only updates the
> sector data and a 4-byte btt_map entry.
>
> How:
> 1. Scan the btt_map to generate the ABA mapping bitmap; if an
> internal ABA is in use, its bit is set.
> 2. Generate the in-memory freelist from the ABA bitmap. The
> freelist is an array that records all the free ABAs, e.g.:
> | 340 | 422 | 578 |...
> meaning ABAs 340, 422 and 578 are free. The last nfree (nlane)
> records in the array are assigned to the lanes at the beginning.
> 3. Get a free ABA for a lane and write data to it. If the premap
> btt_map entry is in the initialization state (e_flag=0, z_flag=0),
> get a free ABA from the free ABA array for the lane. If the premap
> btt_map entry is not in the initialization state, the ABA in the
> btt_map entry is treated as the lane's free ABA. Once the number
> of free ABAs equals nfree, the arena is fully written and the
> whole freelist can be freed (not implemented yet).
> 4. In the code, "version_major == 2" selects the new algorithm;
> the else branch keeps the old algorithm.
>
> Result:
> 1. Write performance improves by ~50%, and latency drops to about
> 60% of the original algorithm's.
> 2. During initialization, scanning the btt_map and generating the
> freelist takes time, so enabling the namespace takes longer. With
> 4K sectors on a 1TB namespace, the enable time is less than 4s,
> and this happens only once during initialization.
> 3. The freelist takes 4 bytes of memory per sector, but once the
> arena is fully written the freelist can be freed. In the storage
> case the disk is usually fully written eventually, so there is no
> lasting memory overhead.
>
> Compatibility:
> 1. The new algorithm keeps the bflog layout and only ignores its
> logic, i.e. the bflog is never updated by the new algorithm.
> 2. A namespace created with the old algorithm and layout can be
> switched to the new algorithm seamlessly, without any specific
> operation.
> 3. Since the bflog is not updated by the new algorithm, once data
> has been written with the new algorithm you cannot switch back to
> the old algorithm.
>
> Signed-off-by: dennis.wu <dennis.wu@intel.com>
> ---
>   drivers/nvdimm/btt.c | 231 ++++++++++++++++++++++++++++++++++---------
>   drivers/nvdimm/btt.h |  15 +++
>   2 files changed, 199 insertions(+), 47 deletions(-)
>
> diff --git a/drivers/nvdimm/btt.c b/drivers/nvdimm/btt.c
> index c71ba7a1edd0..1d75e5f4d88e 100644
> --- a/drivers/nvdimm/btt.c
> +++ b/drivers/nvdimm/btt.c
> @@ -70,10 +70,6 @@ static int btt_info_write(struct arena_info *arena, struct btt_sb *super)
>   	dev_WARN_ONCE(to_dev(arena), !IS_ALIGNED(arena->info2off, 512),
>   		"arena->info2off: %#llx is unaligned\n", arena->info2off);
>   
> -	/*
> -	 * btt_sb is critial information and need proper write
> -	 * nvdimm_flush will be called (deepflush)
> -	 */
>   	ret = arena_write_bytes(arena, arena->info2off, super,
>   			sizeof(struct btt_sb), 0);
>   	if (ret)
> @@ -194,6 +190,8 @@ static int btt_map_read(struct arena_info *arena, u32 lba, u32 *mapping,
>   		break;
>   	case 3:
>   		*mapping = postmap;
> +		z_flag = 1;
> +		e_flag = 1;
>   		break;
>   	default:
>   		return -EIO;
> @@ -507,6 +505,30 @@ static u64 to_namespace_offset(struct arena_info *arena, u64 lba)
>   	return arena->dataoff + ((u64)lba * arena->internal_lbasize);
>   }
>   
> +static int arena_clear_error(struct arena_info *arena, u32 lba)
> +{
> +	int ret = 0;
> +
> +	void *zero_page = page_address(ZERO_PAGE(0));
> +	u64 nsoff = to_namespace_offset(arena, lba);
> +	unsigned long len = arena->sector_size;
> +
> +	mutex_lock(&arena->err_lock);
> +	while (len) {
> +		unsigned long chunk = min(len, PAGE_SIZE);
> +
> +		ret = arena_write_bytes(arena, nsoff, zero_page,
> +			chunk, 0);
> +		if (ret)
> +			break;
> +		len -= chunk;
> +		nsoff += chunk;
> +	}
> +	mutex_unlock(&arena->err_lock);
> +
> +	return ret;
> +}
> +
>   static int arena_clear_freelist_error(struct arena_info *arena, u32 lane)
>   {
>   	int ret = 0;
> @@ -536,6 +558,82 @@ static int arena_clear_freelist_error(struct arena_info *arena, u32 lane)
>   	return ret;
>   }
>   
> +/*
> + * get_aba_in_a_lane - get a free block out of the freelist.
> + * @arena: arena handler
> + * @lane:	the block (postmap) will be put back to free array list
> + */
> +static inline void get_lane_aba(struct arena_info *arena,
> +		u32 lane, u32 *entry)
> +{
> +	uint32_t free_num;
> +
> +	spin_lock(&(arena->list_lock.lock));
> +	free_num = arena->freezone_array.free_num;
> +	arena->lane_free[lane] = arena->freezone_array.free_array[free_num - 1];
> +	arena->freezone_array.free_num = free_num - 1;
> +	spin_unlock(&(arena->list_lock.lock));
> +
> +	*entry = arena->lane_free[lane];
> +}
> +
> +static int btt_freezone_init(struct arena_info *arena)
> +{
> +	int ret = 0, trim, err;
> +	u32 i;
> +	u32 mapping;
> +	u8 *aba_map_byte, *aba_map;
> +	u32 *free_array;
> +	u32 free_num = 0;
> +	u32 aba_map_size = (arena->internal_nlba>>3) + 1;
> +
> +	aba_map = vzalloc(aba_map_size);
> +	if (!aba_map)
> +		return -ENOMEM;
> +
> +	/*
> +	 * prepare the aba_map, each aba will be in a bit, occupied bit=1, free bit=0
> +	 * the scan will take times, but it is only once execution during initialization.
> +	 */
> +	for (i = 0; i < arena->external_nlba; i++) {
> +		ret = btt_map_read(arena, i, &mapping, &trim, &err, 0);
> +		if (ret || (trim == 0 && err == 0))
> +			continue;
> +		if (mapping < arena->internal_nlba) {
> +			aba_map_byte = aba_map + (mapping>>3);
> +			*aba_map_byte |= (u8)(1<<(mapping % 8));
> +		}
> +	}
> +
> +	/*
> +	 * Scan the aba_bitmap , use the static array, that will take 1% memory.
> +	 */
> +	free_array = vmalloc(arena->internal_nlba*sizeof(u32));
> +	if (!free_array) {
> +		vfree(aba_map);
> +		return -ENOMEM;
> +	}
> +
> +	for (i = 0; i < arena->internal_nlba; i++) {
> +		aba_map_byte = aba_map + (i>>3);
> +		if (((*aba_map_byte) & (1<<(i%8))) == 0) {
> +			free_array[free_num] = i;
> +			free_num++;
> +		}
> +	}
> +	spin_lock_init(&(arena->list_lock.lock));
> +
> +	for (i = 0; i < arena->nfree; i++) {
> +		arena->lane_free[i] = free_array[free_num - 1];
> +		free_num--;
> +	}
> +	arena->freezone_array.free_array = free_array;
> +	arena->freezone_array.free_num = free_num;
> +
> +	vfree(aba_map);
> +	return ret;
> +}
> +
>   static int btt_freelist_init(struct arena_info *arena)
>   {
>   	int new, ret;
> @@ -597,8 +695,7 @@ static int btt_freelist_init(struct arena_info *arena)
>   			 * to complete the map write. So fix up the map.
>   			 */
>   			ret = btt_map_write(arena, le32_to_cpu(log_new.lba),
> -					le32_to_cpu(log_new.new_map), 0, 0,
> -					NVDIMM_NO_DEEPFLUSH);
> +					le32_to_cpu(log_new.new_map), 0, 0, NVDIMM_NO_DEEPFLUSH);
>   			if (ret)
>   				return ret;
>   		}
> @@ -813,7 +910,12 @@ static void free_arenas(struct btt *btt)
>   		list_del(&arena->list);
>   		kfree(arena->rtt);
>   		kfree(arena->map_locks);
> -		kfree(arena->freelist);
> +		if (arena->version_major == 2) {
> +			if (arena->freezone_array.free_array)
> +				vfree(arena->freezone_array.free_array);
> +		} else {
> +			kfree(arena->freelist);
> +		}
>   		debugfs_remove_recursive(arena->debugfs_dir);
>   		kfree(arena);
>   	}
> @@ -892,14 +994,18 @@ static int discover_arenas(struct btt *btt)
>   		arena->external_lba_start = cur_nlba;
>   		parse_arena_meta(arena, super, cur_off);
>   
> -		ret = log_set_indices(arena);
> -		if (ret) {
> -			dev_err(to_dev(arena),
> -				"Unable to deduce log/padding indices\n");
> -			goto out;
> -		}
> +		if (arena->version_major == 2) {
> +			ret = btt_freezone_init(arena);
> +		} else {
> +			ret = log_set_indices(arena);
> +			if (ret) {
> +				dev_err(to_dev(arena),
> +					"Unable to deduce log/padding indices\n");
> +				goto out;
> +			}
>   
> -		ret = btt_freelist_init(arena);
> +			ret = btt_freelist_init(arena);
> +		}
>   		if (ret)
>   			goto out;
>   
> @@ -984,9 +1090,11 @@ static int btt_arena_write_layout(struct arena_info *arena)
>   	if (ret)
>   		return ret;
>   
> -	ret = btt_log_init(arena);
> -	if (ret)
> -		return ret;
> +	if (arena->version_major != 2) {
> +		ret = btt_log_init(arena);
> +		if (ret)
> +			return ret;
> +	}
>   
>   	super = kzalloc(sizeof(struct btt_sb), GFP_NOIO);
>   	if (!super)
> @@ -1039,7 +1147,10 @@ static int btt_meta_init(struct btt *btt)
>   		if (ret)
>   			goto unlock;
>   
> -		ret = btt_freelist_init(arena);
> +		if (arena->version_major == 2)
> +			ret = btt_freezone_init(arena);
> +		else
> +			ret = btt_freelist_init(arena);
>   		if (ret)
>   			goto unlock;
>   
> @@ -1233,12 +1344,14 @@ static int btt_read_pg(struct btt *btt, struct bio_integrity_payload *bip,
>   			u32 new_map;
>   			int new_t, new_e;
>   
> -			if (t_flag) {
> +			/* t_flag = 1, e_flag = 0 or t_flag=0, e_flag=0 */
> +			if ((t_flag && e_flag == 0) || (t_flag == 0 && e_flag == 0)) {
>   				zero_fill_data(page, off, cur_len);
>   				goto out_lane;
>   			}
>   
> -			if (e_flag) {
> +			/* t_flag = 0, e_flag = 1*/
> +			if (e_flag && t_flag == 0) {
>   				ret = -EIO;
>   				goto out_lane;
>   			}
> @@ -1326,6 +1439,7 @@ static int btt_write_pg(struct btt *btt, struct bio_integrity_payload *bip,
>   	while (len) {
>   		u32 cur_len;
>   		int e_flag;
> +		int z_flag;
>   
>    retry:
>   		lane = nd_region_acquire_lane(btt->nd_region);
> @@ -1340,29 +1454,41 @@ static int btt_write_pg(struct btt *btt, struct bio_integrity_payload *bip,
>   			goto out_lane;
>   		}
>   
> -		if (btt_is_badblock(btt, arena, arena->freelist[lane].block))
> -			arena->freelist[lane].has_err = 1;
> +		if (arena->version_major == 2) {
> +			new_postmap = arena->lane_free[lane];
> +			if (btt_is_badblock(btt, arena, new_postmap)
> +				|| mutex_is_locked(&arena->err_lock)) {
> +				nd_region_release_lane(btt->nd_region, lane);
> +				ret = arena_clear_error(arena, new_postmap);
> +				if (ret)
> +					return ret;
> +				/* OK to acquire a different lane/free block */
> +				goto retry;
> +			}
> +		} else {
> +			if (btt_is_badblock(btt, arena, arena->freelist[lane].block))
> +				arena->freelist[lane].has_err = 1;
>   
> -		if (mutex_is_locked(&arena->err_lock)
> -				|| arena->freelist[lane].has_err) {
> -			nd_region_release_lane(btt->nd_region, lane);
> +			if (mutex_is_locked(&arena->err_lock)
> +					|| arena->freelist[lane].has_err) {
> +				nd_region_release_lane(btt->nd_region, lane);
>   
> -			ret = arena_clear_freelist_error(arena, lane);
> -			if (ret)
> -				return ret;
> +				ret = arena_clear_freelist_error(arena, lane);
> +				if (ret)
> +					return ret;
>   
> -			/* OK to acquire a different lane/free block */
> -			goto retry;
> -		}
> +				/* OK to acquire a different lane/free block */
> +				goto retry;
> +			}
>   
> -		new_postmap = arena->freelist[lane].block;
> +			new_postmap = arena->freelist[lane].block;
> +		}
>   
>   		/* Wait if the new block is being read from */
>   		for (i = 0; i < arena->nfree; i++)
>   			while (arena->rtt[i] == (RTT_VALID | new_postmap))
>   				cpu_relax();
>   
> -
>   		if (new_postmap >= arena->internal_nlba) {
>   			ret = -EIO;
>   			goto out_lane;
> @@ -1380,7 +1506,7 @@ static int btt_write_pg(struct btt *btt, struct bio_integrity_payload *bip,
>   		}
>   
>   		lock_map(arena, premap);
> -		ret = btt_map_read(arena, premap, &old_postmap, NULL, &e_flag,
> +		ret = btt_map_read(arena, premap, &old_postmap, &z_flag, &e_flag,
>   				NVDIMM_IO_ATOMIC);
>   		if (ret)
>   			goto out_map;
> @@ -1388,17 +1514,25 @@ static int btt_write_pg(struct btt *btt, struct bio_integrity_payload *bip,
>   			ret = -EIO;
>   			goto out_map;
>   		}
> -		if (e_flag)
> -			set_e_flag(old_postmap);
> -
> -		log.lba = cpu_to_le32(premap);
> -		log.old_map = cpu_to_le32(old_postmap);
> -		log.new_map = cpu_to_le32(new_postmap);
> -		log.seq = cpu_to_le32(arena->freelist[lane].seq);
> -		sub = arena->freelist[lane].sub;
> -		ret = btt_flog_write(arena, lane, sub, &log);
> -		if (ret)
> -			goto out_map;
> +
> +		if (arena->version_major == 2) {
> +			if (z_flag == 0 && e_flag == 0) /* initialization state (00)*/
> +				get_lane_aba(arena, lane, &old_postmap);
> +			else
> +				arena->lane_free[lane] = old_postmap;
> +		} else {
> +			if (e_flag && z_flag != 1) /* Error State (10) */
> +				set_e_flag(old_postmap);
> +
> +			log.lba = cpu_to_le32(premap);
> +			log.old_map = cpu_to_le32(old_postmap);
> +			log.new_map = cpu_to_le32(new_postmap);
> +			log.seq = cpu_to_le32(arena->freelist[lane].seq);
> +			sub = arena->freelist[lane].sub;
> +			ret = btt_flog_write(arena, lane, sub, &log);
> +			if (ret)
> +				goto out_map;
> +		}
>   
>   		ret = btt_map_write(arena, premap, new_postmap, 0, 0,
>   			NVDIMM_IO_ATOMIC|NVDIMM_NO_DEEPFLUSH);
> @@ -1408,8 +1542,11 @@ static int btt_write_pg(struct btt *btt, struct bio_integrity_payload *bip,
>   		unlock_map(arena, premap);
>   		nd_region_release_lane(btt->nd_region, lane);
>   
> -		if (e_flag) {
> -			ret = arena_clear_freelist_error(arena, lane);
> +		if (e_flag && z_flag != 1) {
> +			if (arena->version_major == 2)
> +				ret = arena_clear_error(arena, old_postmap);
> +			else
> +				ret = arena_clear_freelist_error(arena, lane);
>   			if (ret)
>   				return ret;
>   		}
> diff --git a/drivers/nvdimm/btt.h b/drivers/nvdimm/btt.h
> index 0c76c0333f6e..996af269f854 100644
> --- a/drivers/nvdimm/btt.h
> +++ b/drivers/nvdimm/btt.h
> @@ -8,6 +8,7 @@
>   #define _LINUX_BTT_H
>   
>   #include <linux/types.h>
> +#include "nd.h"
>   
>   #define BTT_SIG_LEN 16
>   #define BTT_SIG "BTT_ARENA_INFO\0"
> @@ -185,6 +186,20 @@ struct arena_info {
>   	u64 info2off;
>   	/* Pointers to other in-memory structures for this arena */
>   	struct free_entry *freelist;
> +
> +	/*divide the whole arena into #lanes zone. */
> +	struct zone_free {
> +		u32 free_num;
> +		u32 *free_array;
> +	} freezone_array;
> +	struct aligned_lock list_lock;
> +
> +	/*
> +	 * each lane, keep at least one free ABA
> +	 * if in the lane, no ABA, get one from freelist
> +	 */
> +	u32 lane_free[BTT_DEFAULT_NFREE];
> +
>   	u32 *rtt;
>   	struct aligned_lock *map_locks;
>   	struct nd_btt *nd_btt;
