Date: Tue, 7 Jun 2016 01:40:58 +0100
From: Al Viro
To: Linus Torvalds
Cc: Dave Hansen, "Chen, Tim C", Ingo Molnar, Davidlohr Bueso,
	"Peter Zijlstra (Intel)", Jason Low, Michel Lespinasse,
	"Paul E. McKenney", Waiman Long, LKML
Subject: Re: performance delta after VFS i_mutex=>i_rwsem conversion
Message-ID: <20160607004058.GH14480@ZenIV.linux.org.uk>
References: <5755D671.9070908@intel.com> <20160606211522.GF14480@ZenIV.linux.org.uk> <20160606220753.GG14480@ZenIV.linux.org.uk>

On Mon, Jun 06, 2016 at 04:50:59PM -0700, Linus Torvalds wrote:
>
> On Mon, 6 Jun 2016, Al Viro wrote:
> >
> > True in general, but here we really do a lot under that ->d_lock - all
> > list traversals are under it.  So I suspect that contention on the
> > nested lock is not an issue in that particular load.  It's certainly a
> > separate commit, so we'll see how much it gives on its own, but I
> > doubt that it'll be anywhere near enough.
>
> Hmm. Maybe.
>
> But at least we can try to minimize everything that happens under the
> dentry->d_lock spinlock.
>
> So how about this patch? It's entirely untested, but it rewrites that
> readdir() function to try to do the minimum possible under the d_lock
> spinlock.
>
> I say "rewrite", because it really is totally different.
> It's not just that the nested "next" locking is gone, it also treats
> the cursor very differently and tries to avoid doing any unnecessary
> cursor list operations.

Similar to what I've got here, except that mine has a couple of helper
functions usable in dcache_dir_lseek() as well:

	next_positive(parent, child, n) - returns the nth positive child
	after the given one, or NULL if there are fewer than n such
	children.  NULL as the second argument => search from the
	beginning.

	move_cursor(cursor, child) - moves the cursor immediately past
	child, *or* to the very end if child is NULL.

The third commit in the series will be the lockless replacement for
next_positive().  move_cursor() is easy - it became simply

	struct dentry *parent = cursor->d_parent;
	unsigned n, *seq = &parent->d_inode->i_dir_seq;
	spin_lock(&parent->d_lock);
	for (;;) {
		n = *seq;
		if (!(n & 1) && cmpxchg(seq, n, n + 1) == n)
			break;
		cpu_relax();
	}
	__list_del(cursor->d_child.prev, cursor->d_child.next);
	if (child)
		list_add(&cursor->d_child, &child->d_child);
	else
		list_add_tail(&cursor->d_child, &parent->d_subdirs);
	smp_store_release(seq, n + 2);
	spin_unlock(&parent->d_lock);

with

	static struct dentry *next_positive(struct dentry *parent,
					    struct dentry *child, int count)
	{
		struct list_head *start, *p;
		unsigned *seq = &parent->d_inode->i_dir_seq, n;
		start = child ? &child->d_child : &parent->d_subdirs;
		do {
			int i = count;
			p = start;	/* restart the scan on retry */
			n = smp_load_acquire(seq) & ~1;
			rcu_read_lock();
			do {
				p = p->next;
				if (p == &parent->d_subdirs) {
					child = NULL;
					break;
				}
				child = list_entry(p, struct dentry, d_child);
			} while (!simple_positive(child) || --i);
			rcu_read_unlock();
		} while (unlikely(smp_load_acquire(seq) != n));
		return child;
	}

as an initial attempt at a lockless next_positive(); the barriers are
probably wrong, though...