All of lore.kernel.org
 help / color / mirror / Atom feed
From: Al Viro <viro@ZenIV.linux.org.uk>
To: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Christoph Hellwig <hch@lst.de>,
	linux-fsdevel <linux-fsdevel@vger.kernel.org>
Subject: Re: [PATCH] VFS: Cut down inode->i_op->xyz accesses in path walking
Date: Tue, 26 Jul 2011 04:05:31 +0100	[thread overview]
Message-ID: <20110726030531.GE22133@ZenIV.linux.org.uk> (raw)
In-Reply-To: <alpine.LFD.2.02.1107251852220.13796@i5.linux-foundation.org>

On Mon, Jul 25, 2011 at 07:04:11PM -0700, Linus Torvalds wrote:
> 
> 
> One of the biggest remaining unnecessary costs in path walking is the
> pointer chasing in inode operations.  We already avoided the
> dentry->d_op derferences with the DCACHE_OP_xyz flags, this just starts
> doing the same thing for the i_op->xyz cases and new IOP_xyz flags in 
> the new inode->i_opflag field.
> 
> To make sure the i_opflag field is consistent with i_op, we introduce a 
> simple wrapper function to do assignments to i_op. That makes the patch 
> big, but it was almost entirely mechanical and very straightforward.
> 
> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
> ---
> 
> Ok, so this patch is pretty big, but the bulk if it is an entirely 
> mechanical replacement of
> 
> 	inode->i_op = xyz;
> 
> with
> 
> 	set_inode_ops(inode, xyz);
> 
> using a very stupid sed-script. It was followed by some manual editing 
> (a couple of fixups for my sed-script being stupid, and a couple of 
> whitespace cleanups for pre-existing whitespace problems that 'git diff' 
> just highlights).
> 
> This removes the 'inode->i_op->xyz' chain from the critical pathname walk, 
> although the *single* hottest instruction in link_path_walk on my Westmere 
> system remains the test of 'inode->i_opflag' for IOP_FOLLOW_LINK:
> 
>    14.70%: testb  $0x2,(%rax)
> 
> because it turns out that that instruction is what brings in the inode 
> into the L1 cache. But hey, it's certainly better than getting that i_op 
> and then following the pointer there and checking it for NULL.
> 
> And it's better to the tune of roughly 0.1s for my empty "make -j" test 
> (which is actually dominated by the user-space costs in 'make', but has a 
> largish pathname component to it too). That's out of about 4.3s, so it is 
> about 2%. Not huge, but it seems quite measurable.
> 
> Of course, those kinds of costs will depend very much on cache sizes and 
> microarchitectural details, and the "empty kernel make -j" may not be the 
> most relevant benchmark around, but it's more or a real load than some.
> 
> Hmm? Too painful?

We could actually cheat a bit.  *All* inodes that ever reach inode_permission()
have at some point been pointed to by ->d_inode of some dentry.  It's
inconvenient as hell (and inviting abuse) to pass such dentry instead
of inode; moreover, there would be nasty questions of ->d_inode stability
on RCU path.

However, we can greatly reduce that amount of changes if we do the
following:
	* one bit in your new field set when inode is put in ->d_inode.
Checked at inode_permission() time, BUG_ON() if not set.
	* other bit(s) set by checking ->i_op contents at the same time,
if bit hadn't been already set (->i_op can't change afterwards); checked
by inode_permission() and whatever else might want to (e.g. checks for
non-NULL ->lookup(); for those we also know that inode went through
some ->d_inode at some point).
	* we need to play with setting these flags only in two places -
__d_instantiate() and d_obtain_alias().

IOW, add

/* called with inode->i_lock held */
static inline void set_inode_flags(struct inode *inode)
{
	if (!(inode->i_opflag & Iop_valid)) {
		struct inode_operations *op = inode->i_op;
		int flags = Iop_valid;
		if (op->permission)
			flags |= Iop_permission;
		if (op->lookup)
			flags |= Iop_lookup;
		inode->i_opflag = flags;
	}
}

and turn __d_instantiate() into
static void __d_instantiate(struct dentry *dentry, struct inode *inode)
{
        spin_lock(&dentry->d_lock);
        if (inode) {
                if (unlikely(IS_AUTOMOUNT(inode)))
                        dentry->d_flags |= DCACHE_NEED_AUTOMOUNT;
		set_inode_flags(inode);
                list_add(&dentry->d_alias, &inode->i_dentry);
        }
        dentry->d_inode = inode;
        dentry_rcuwalk_barrier(dentry);
        spin_unlock(&dentry->d_lock);
        fsnotify_d_instantiate(dentry, inode);
}

with set_inode_flags(inode); also added right after
        tmp->d_inode = inode;
in d_obtain_alias()

Voila - inode_permission() would do
	int flags;
	...
	flags = inode->i_opflag;
	BUG_ON(!(flags & Iop_valid));
	if (unlikely(flags & Iop_permission))
		res = inode->i_op->permission(...);
	else
		res = generic_permission(...)
and we are all set.  Individual fs code is not affected at all...  Sure, it's
a bit of cheating, but it avoids a lot of churn *and* a nasty class of bugs
in the future - somebody assigning ->i_op directly.  Dunno...

  parent reply	other threads:[~2011-07-26  3:05 UTC|newest]

Thread overview: 37+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2011-07-22 17:37 VFS pathname walking cleanups (i_op and ACL access) Linus Torvalds
2011-07-22 17:37 ` [PATCH 1/2] VFS: Cut down inode->i_op->xyz accesses in path walking Linus Torvalds
2011-07-22 17:47   ` Christoph Hellwig
2011-07-22 23:40   ` Al Viro
2011-07-22 23:54     ` Linus Torvalds
2011-07-23  3:55   ` [PATCH] " Linus Torvalds
2011-07-23 13:35     ` Christoph Hellwig
2011-07-23 14:46       ` Al Viro
2011-07-23 14:51         ` Christoph Hellwig
2011-07-23 15:45         ` Linus Torvalds
     [not found]     ` <alpine.LFD.2.02.1107251852220.13796@i5.linux-foundation.org>
2011-07-26  3:05       ` Al Viro [this message]
2011-07-26  3:23         ` Linus Torvalds
2011-07-26 18:41           ` Al Viro
2011-07-26 18:45             ` Linus Torvalds
2011-08-07  6:06           ` Linus Torvalds
2011-08-07  6:51             ` Al Viro
2011-08-07 23:43               ` Linus Torvalds
2011-07-22 17:40 ` VFS pathname walking cleanups (i_op and ACL access) Christoph Hellwig
2011-07-22 17:45 ` [PATCH 2/2] vfs: move ACL cache lookup into generic code Linus Torvalds
2011-07-22 17:50   ` Christoph Hellwig
2011-07-22 17:54     ` Linus Torvalds
2011-07-23  2:34   ` [PATCH] " Linus Torvalds
2011-07-23  3:29     ` Al Viro
2011-07-23  3:42       ` Linus Torvalds
2011-07-23  4:31         ` Al Viro
2011-07-23  6:06           ` Al Viro
2011-07-25  8:15             ` Aneesh Kumar K.V
2011-07-25  8:16           ` Aneesh Kumar K.V
2011-07-23  7:47       ` Al Viro
2011-07-23 14:50         ` Christoph Hellwig
2011-07-23 15:32           ` Al Viro
2011-07-23 17:02             ` Al Viro
2011-07-23 17:31               ` Linus Torvalds
2011-07-23 18:20                 ` Al Viro
2011-07-23 18:29                   ` Linus Torvalds
2011-07-23 21:53                     ` Al Viro
2011-07-23 22:38                       ` Al Viro

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20110726030531.GE22133@ZenIV.linux.org.uk \
    --to=viro@zeniv.linux.org.uk \
    --cc=hch@lst.de \
    --cc=linux-fsdevel@vger.kernel.org \
    --cc=torvalds@linux-foundation.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.