Date: Tue, 19 Jul 2016 14:42:31 -0400
From: Waiman Long
To: Tejun Heo
Cc: Alexander Viro, Jan Kara, Jeff Layton, "J. Bruce Fields",
 Christoph Lameter, Ingo Molnar, Peter Zijlstra, Andi Kleen,
 Dave Chinner, Boqun Feng, Scott J Norton, Douglas Hatch
Subject: Re: [PATCH v3 1/4] lib/dlock-list: Distributed and lock-protected lists
Message-ID: <578E7497.30602@hpe.com>
References: <1468604383-40362-1-git-send-email-Waiman.Long@hpe.com>
 <1468604383-40362-2-git-send-email-Waiman.Long@hpe.com>
 <20160718233803.GN3078@mtj.duckdns.org>
In-Reply-To: <20160718233803.GN3078@mtj.duckdns.org>

On 07/18/2016 07:38 PM, Tejun Heo wrote:
> Hello, Waiman.
>
> On Fri, Jul 15, 2016 at 01:39:40PM -0400, Waiman Long wrote:
>> Suggested-by: Tejun Heo
> Not sure I should be on suggested-by given that this wasn't my idea at
> all.

I put the tag there because of your suggestion to restructure the data
structure, which did make the patch better. I can remove the tag if you
think it is not appropriate.

>> +/*
>> + * include/linux/dlock-list.h
>> + *
>> + * A distributed (per-cpu) set of lists each of which is protected by its
>> + * own spinlock, but acts like a single consolidated list to the callers.
>> + *
>> + * The dlock_list_head_percpu structure contains the spinlock; the
>> + * dlock_list_node structures only contain a pointer to the spinlock in
>> + * dlock_list_head_percpu.
>> + */
> The more I think about it, the more bothered I am by the dlock_list
> name.  For the most part, this isn't different from other percpu data
> structures in the kernel.  Sure, it might benefit from doing Nth cpu,
> but so do other percpu data structures, and it's not just "distributed
> lock" list either.  The list itself is percpu, not just the locking.
> Can we please go back to percpu_list?  Christoph, what do you think?

As I said before, I don't mind reverting the name back to percpu_list.
I am just waiting for a final agreement.

>> +struct dlock_list_node {
>> +        struct list_head list;
>> +        spinlock_t *lockptr;
>> +};
> Wouldn't it be better to point to dlock_list_percpu?

I could. However, the only thing that matters is the spinlock that
protects the list entry.

>> +#define DLOCK_LIST_HEAD_PERCPU_INIT(name)                \
>> +        {                                                \
>> +                .list.prev = &name.list,                 \
>> +                .list.next = &name.list,                 \
> Use LIST_HEAD_INIT()?  Also, why do we even need the initializers if
> the data structure can only be dynamically allocated?  In fact, do
> the definitions even need to be exposed in the header?

I put it there for completeness' sake. You are right that the macro
isn't currently used. I will remove it in the next iteration of the
patch.

>> +                .list.lock = __SPIN_LOCK_UNLOCKED(name), \
>> +        }
>> +
>> +/*
>> + * dlock list iteration state
>> + */
>> +struct dlock_list_iter {
>> +        int                     cpu;
>                                  ^
> I'm not sure lining up with spaces here is common in the kernel.

OK, I will remove the extra spaces to conform to the usual kernel
style.
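BTW, going back to the "acts like a single consolidated list" comment
above, the usage model for a caller looks roughly like the sketch below.
The "foo" structure and functions are made up purely for illustration;
they are not part of the patch series:

#include <linux/dlock-list.h>
#include <linux/kernel.h>
#include <linux/slab.h>

struct foo {
        struct dlock_list_node node;
        int data;
};

/* alloc_dlock_list_head(&foo_list) must be called once at init time */
static struct dlock_list_head foo_list;

static int foo_add(int data)
{
        struct foo *f = kmalloc(sizeof(*f), GFP_KERNEL);

        if (!f)
                return -ENOMEM;
        f->data = data;
        init_dlock_list_node(&f->node);
        dlock_list_add(&f->node, &foo_list); /* lands on this CPU's sublist */
        return 0;
}

static void foo_walk_all(void)
{
        DEFINE_DLOCK_LIST_ITER(iter);

        /* Walks all the per-cpu sublists as if they were one list. */
        while (dlock_list_next(&foo_list, &iter)) {
                struct foo *f = container_of(iter.curr, struct foo, node);

                /* The current sublist's spinlock (iter.lock) is held here. */
                pr_info("foo data = %d\n", f->data);
        }
}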
>> +        spinlock_t              *lock;
>> +        struct list_head        *head; /* List head of current per-cpu list */
>> +        struct dlock_list_node  *curr;
>> +        struct dlock_list_node  *next;
>> +};
>> +
>> +#define DLOCK_LIST_ITER_INIT()                           \
>                            ^
> Don't we usually omit () in these cases?

Good point. Will remove the unneeded ().

>> +        {                                                \
>> +                .cpu = -1,                               \
>> +        }
>> +
>> +#define DEFINE_DLOCK_LIST_ITER(s)                        \
>> +        struct dlock_list_iter s = DLOCK_LIST_ITER_INIT()
>> +
>> +static inline void init_dlock_list_iter(struct dlock_list_iter *iter)
>> +{
>> +        *iter = (struct dlock_list_iter)DLOCK_LIST_ITER_INIT();
>> +}
>> +
>> +#define DLOCK_LIST_NODE_INIT(name)                       \
>> +        {                                                \
>> +                .list.prev = &name.list,                 \
>> +                .list.next = &name.list,                 \
>                   ^
> LIST_HEAD_INIT()?

Will make the change.

>> +        }
>> +
>> +static inline void init_dlock_list_node(struct dlock_list_node *node)
>> +{
>> +        INIT_LIST_HEAD(&node->list);
>> +        node->lockptr = NULL;
> Why not use DLOCK_LIST_NODE_INIT()?

Yes, I can make the change.

>> +}
>> +
>> +/*
>> + * Check if all the dlock lists are empty
>> + *
>> + * This can be a pretty expensive function call. If this function is required
>> + * in a performance critical path, we may have to maintain a global count
>> + * of the list entries in the global dlock_list_head structure instead.
>> + */
> /** function comment please.

Sure.

>> +static inline bool dlock_list_empty(struct dlock_list_head *dlist)
>> +{
>> +        int cpu;
>> +
>> +        for_each_possible_cpu(cpu)
>> +                if (!list_empty(&per_cpu_ptr(dlist->head, cpu)->list))
>> +                        return false;
>> +        return true;
>> +}
>> +
>> +/*
>> + * Allocation and freeing of dlock list
>> + */
>> +extern int alloc_dlock_list_head(struct dlock_list_head *dlist);
>           ^
> ditto with alignment

Will remove the extra space.

>> +extern void free_dlock_list_head(struct dlock_list_head *dlist);
>> +
>> +/*
>> + * The dlock list iteration functions which return true if iteration has
>> + * to be continued.
>> + */
>> +extern bool dlock_list_next(struct dlock_list_head *dlist,
>> +                            struct dlock_list_iter *iter);
>> +extern bool dlock_list_next_safe(struct dlock_list_head *dlist,
>> +                                 struct dlock_list_iter *iter);
> Why not return dlock_list_node * for the current node?  That'd be more
> conventional and would allow dlock_list_iter to be opaque.

Yes, I can make it return dlock_list_node *. However, to make
dlock_list_iter opaque, I would have to dynamically allocate the
structure. That would add extra memory allocation and free calls, as
well as handling of the error case of running out of memory. I don't
think that is worth doing at this point.

>> diff --git a/lib/dlock-list.c b/lib/dlock-list.c
>> new file mode 100644
>> index 0000000..af4a9f3
>> --- /dev/null
>> +++ b/lib/dlock-list.c
> ...
>> +int alloc_dlock_list_head(struct dlock_list_head *dlist)
>> +{
>> +        struct dlock_list_head dlist_tmp;
>> +        int cpu;
>> +
>> +        dlist_tmp.head = alloc_percpu(struct dlock_list_head_percpu);
>> +        if (!dlist_tmp.head)
>> +                return -ENOMEM;
>> +
>> +        for_each_possible_cpu(cpu) {
>> +                struct dlock_list_head_percpu *head;
>> +
>> +                head = per_cpu_ptr(dlist_tmp.head, cpu);
>> +                INIT_LIST_HEAD(&head->list);
>> +                head->lock = __SPIN_LOCK_UNLOCKED(&head->lock);
>> +                lockdep_set_class(&head->lock, &dlock_list_key);
>> +        }
>> +
>> +        dlist->head = dlist_tmp.head;
> Just use dlist->head directly or use a local __percpu head pointer?

I just don't want to expose the structure to the world until it is
fully initialized. If you think I am being over-cautious, I can use
dlist->head as suggested.
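Actually, the local __percpu pointer variant would keep the structure
private until it is fully initialized while also dropping the on-stack
temporary, so it serves both purposes. Roughly this (a sketch derived
from the code quoted above, assuming the existing dlock_list_key lock
class):

int alloc_dlock_list_head(struct dlock_list_head *dlist)
{
        struct dlock_list_head_percpu __percpu *pcpu_head;
        int cpu;

        pcpu_head = alloc_percpu(struct dlock_list_head_percpu);
        if (!pcpu_head)
                return -ENOMEM;

        for_each_possible_cpu(cpu) {
                struct dlock_list_head_percpu *head;

                head = per_cpu_ptr(pcpu_head, cpu);
                INIT_LIST_HEAD(&head->list);
                head->lock = __SPIN_LOCK_UNLOCKED(&head->lock);
                lockdep_set_class(&head->lock, &dlock_list_key);
        }

        /* Publish only after every per-cpu head is initialized. */
        dlist->head = pcpu_head;
        return 0;
}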
>> +        return 0;
>> +}
>> +EXPORT_SYMBOL(alloc_dlock_list_head);
> Does this actually need to be exported?  If so, it might be a better
> idea to start with EXPORT_SYMBOL_GPL().

For the current use case, we probably don't need to export the symbols.
Other use cases may require that. I will change it to use the
EXPORT_SYMBOL_GPL() version instead.

>> +void dlock_list_add(struct dlock_list_node *node,
>> +                    struct dlock_list_head *dlist)
>> +{
>> +        struct dlock_list_head_percpu *head;
>                                        ^
>
> This probably requires __percpu annotation.  Have you run it through
> sparse and checked for address space warnings?

You are right. I probably missed the __percpu annotation. I haven't
run it through sparse. I will do that and fix any warnings found.

>> +
>> +        /*
>> +         * Disable preemption to make sure that the CPU won't get changed.
>> +         */
>> +        head = get_cpu_ptr(dlist->head);
>> +        spin_lock(&head->lock);
>> +        node->lockptr = &head->lock;
>> +        list_add(&node->list, &head->list);
>> +        spin_unlock(&head->lock);
>> +        put_cpu_ptr(dlist->head);
>> +}
>> +EXPORT_SYMBOL(dlock_list_add);
>> +
>> +/**
>> + * dlock_list_del - Delete a node from a dlock list
>> + * @node : The node to be deleted
>> + *
>> + * We need to check the lock pointer again after taking the lock to guard
>> + * against concurrent deletion of the same node. If the lock pointer changes
>> + * (becomes NULL or to a different one), we assume that the deletion was done
>> + * elsewhere. A warning will be printed if this happens as it is likely to be
>> + * a bug.
>> + */
>> +void dlock_list_del(struct dlock_list_node *node)
>> +{
>> +        spinlock_t *lock = READ_ONCE(node->lockptr);
>> +
>> +        if (unlikely(!lock)) {
>> +                WARN_ONCE(1,
>> +                        "dlock_list_del: node 0x%lx has no associated lock\n",
>> +                        (unsigned long)node);
> Maybe "if (WARN_ONCE(!lock...)"?  WARN_ONCE implies unlikely.

OK, will do that.

>> +                return;
>> +        }
>> +
>> +        spin_lock(lock);
>> +        if (likely(lock == node->lockptr)) {
>> +                list_del_init(&node->list);
>> +                node->lockptr = NULL;
>> +        } else {
>> +                /*
>> +                 * This path should never be executed.
>> +                 */
>> +                WARN_ON_ONCE(1);
>> +        }
> This still kinda bothers me because this pretty much requires the
> users to have strong synchronization around the operations and makes
> it unusable in situations where opportunistic behaviors are
> acceptable.  It negates the usefulness quite a bit.

I understand your concern. I will make it retry with the new lock.

>> +        spin_unlock(lock);
>> +}
>> +EXPORT_SYMBOL(dlock_list_del);
>> +
>> +/*
>> + * Helper function to find the first entry of the next per-cpu list.
>> + * It works somewhat like for_each_possible_cpu(cpu).
>> + *
>> + * Return: true if an entry is found, false if all the lists are exhausted
>> + */
>> +static inline bool dlock_list_next_cpu(struct dlock_list_head *dlist,
>                ^
> Just let the compiler figure it out.

Sure. Will remove the inline tag.
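Concretely, the retry version of dlock_list_del() I have in mind looks
something like the sketch below. Whether a lockptr that turns NULL
should still warn, or be treated as a benign lost race, is still to be
decided:

void dlock_list_del(struct dlock_list_node *node)
{
        spinlock_t *lock;

        for (;;) {
                lock = READ_ONCE(node->lockptr);
                if (WARN_ONCE(!lock,
                    "dlock_list_del: node 0x%lx has no associated lock\n",
                    (unsigned long)node))
                        return;

                spin_lock(lock);
                if (likely(lock == node->lockptr)) {
                        list_del_init(&node->list);
                        node->lockptr = NULL;
                        spin_unlock(lock);
                        return;
                }
                /*
                 * The node was moved to another per-cpu list while we
                 * were spinning; unlock and retry with the current
                 * value of lockptr.
                 */
                spin_unlock(lock);
        }
}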
>> +                                       struct dlock_list_iter *iter)
>> +{
>> +        if (iter->lock)
>> +                spin_unlock(iter->lock);
>> +next_cpu:
>> +        /*
>> +         * for_each_possible_cpu(cpu)
>> +         */
>> +        iter->cpu = cpumask_next(iter->cpu, cpu_possible_mask);
>> +        if (iter->cpu >= nr_cpu_ids)
>> +                return false;   /* All the per-cpu lists iterated */
>> +
>> +        iter->head = &per_cpu_ptr(dlist->head, iter->cpu)->list;
>> +        if (list_empty(iter->head))
>> +                goto next_cpu;
>> +
>> +        iter->lock = &per_cpu_ptr(dlist->head, iter->cpu)->lock;
>> +        spin_lock(iter->lock);
>> +        /*
>> +         * There is a slight chance that the list may become empty just
>> +         * before the lock is acquired. So an additional check is needed
>> +         * to make sure that iter->curr points to a valid entry.
>> +         */
>> +        if (list_empty(iter->head)) {
>> +                spin_unlock(iter->lock);
>> +                goto next_cpu;
>> +        }
>> +        iter->curr = list_entry(iter->head->next,
>> +                                struct dlock_list_node, list);
>> +        return true;
>> +}
> ...
>> +/**
>> + * dlock_list_next_safe - Removal-safe iterator of dlock list
>> + * @dlist: Pointer to the dlock_list_head structure
>> + * @iter : Pointer to the dlock list iterator structure
>> + * Return: true if the next entry is found, false if all the entries
>> + *         have been iterated
>> + *
>> + * The iterator has to be properly initialized before calling this function.
>> + * This iteration function is safe with respect to list entry removal.
>> + * However, it cannot correctly iterate newly added entries right after the
>> + * current one.
>> + */
> This still looks wrong to me.  If you want to provide the two variants
> of iterations, can't you just implement one next function and build
> the two types of iterations on top of it?

I have been thinking about making dlock_list_next_cpu() the real
external function and having two inline functions that implement
dlock_list_next() and dlock_list_next_safe() on top of it. That may
strike a better balance between performance and code abstraction. I
will do so if you have no objection.

Cheers,
Longman
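P.S. For concreteness, the layering would look roughly like the sketch
below. __dlock_list_next() and its "safe" flag are made-up names for
illustration, and the final form may change if we also switch to
returning dlock_list_node *:

/* lib/dlock-list.c: the one real (exported) iteration engine. */
extern bool dlock_list_next_cpu(struct dlock_list_head *dlist,
                                struct dlock_list_iter *iter);

/* include/linux/dlock-list.h */
static __always_inline bool __dlock_list_next(struct dlock_list_head *dlist,
                                              struct dlock_list_iter *iter,
                                              bool safe)
{
        if (iter->lock) {
                /*
                 * Step within the current per-cpu list: to the entry
                 * saved on the previous iteration (safe variant) or to
                 * the successor of the current entry.
                 */
                struct list_head *n = safe ? &iter->next->list
                                           : iter->curr->list.next;

                if (n != iter->head) {
                        iter->curr = list_entry(n, struct dlock_list_node,
                                                list);
                        goto found;
                }
        }
        /* First call, or the current per-cpu list is exhausted. */
        if (!dlock_list_next_cpu(dlist, iter))
                return false;
found:
        if (safe)
                iter->next = list_entry(iter->curr->list.next,
                                        struct dlock_list_node, list);
        return true;
}

static inline bool dlock_list_next(struct dlock_list_head *dlist,
                                   struct dlock_list_iter *iter)
{
        return __dlock_list_next(dlist, iter, false);
}

static inline bool dlock_list_next_safe(struct dlock_list_head *dlist,
                                        struct dlock_list_iter *iter)
{
        return __dlock_list_next(dlist, iter, true);
}

Since the "safe" flag is compile-time constant at every call site, the
branch should fold away and the non-safe variant would not pay for the
iter->next bookkeeping.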