From mboxrd@z Thu Jan  1 00:00:00 1970
From: Nick Piggin <npiggin@gmail.com>
Subject: Re: Status of inode scaling work
Date: Thu, 27 Jan 2011 16:11:24 +1100
Message-ID: <AANLkTimyU47GB5Y-eM4rToFa23HXLFeeXRmTokw2cNwJ@mail.gmail.com>
References: <AANLkTinpZ2Qf1W2qocGhMNki_7b9G1xt=83b5QvwJXxb@mail.gmail.com>
	<20110125071011.GD28803@dastard>
	<AANLkTinkr78scr+22crPknHWARU0wGa7OkZLopP4VHAx@mail.gmail.com>
	<20110127045736.GJ21311@dastard>
Mime-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: QUOTED-PRINTABLE
Cc: Dave Chinner <dchinner@redhat.com>,
	Al Viro <viro@zeniv.linux.org.uk>,
	Christoph Hellwig <hch@infradead.org>,
	Linus Torvalds <torvalds@linux-foundation.org>,
	linux-fsdevel <linux-fsdevel@vger.kernel.org>
To: Dave Chinner <david@fromorbit.com>
Return-path: <linux-fsdevel-owner@vger.kernel.org>
Received: from mail-ww0-f44.google.com ([74.125.82.44]:42008 "EHLO
	mail-ww0-f44.google.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org
	with ESMTP id S1750703Ab1A0FL0 convert rfc822-to-8bit (ORCPT
	<rfc822;linux-fsdevel@vger.kernel.org>);
	Thu, 27 Jan 2011 00:11:26 -0500
Received: by wwa36 with SMTP id 36so1781490wwa.1
        for <linux-fsdevel@vger.kernel.org>; Wed, 26 Jan 2011 21:11:24 -0800 (PST)
In-Reply-To: <20110127045736.GJ21311@dastard>
Sender: linux-fsdevel-owner@vger.kernel.org
List-ID: <linux-fsdevel.vger.kernel.org>

On Thu, Jan 27, 2011 at 3:57 PM, Dave Chinner <david@fromorbit.com> wro=
te:
> On Thu, Jan 27, 2011 at 11:40:57AM +1100, Nick Piggin wrote:
>> On Tue, Jan 25, 2011 at 6:10 PM, Dave Chinner <david@fromorbit.com> =
wrote:
>> > On Fri, Jan 14, 2011 at 08:02:57PM +1100, Nick Piggin wrote:
>> >> I'd like to know about the status and plans of the inode scaling =
work.
>> >>
>> >> Aiming for 2.6.39 or 40 might be an idea.
>> >
>> > It was ready for 2.6.38....
>>
>> Can it go into linux-next? Unless Al can put it in his for-next bran=
ch?
>
> It'll go into linux-next whenever it is sufficiently reviewed and
> tested to get pulled into an upstream branch.
>
>> >> I'd like to be able to see what patches are proposed to finish br=
eaking
>> >> inode_lock into its constituent sub classes, and we can go from t=
here.
>> >
>> > Already done. I pointed you at the tree a while back:
>> >
>> > The following changes since commit c723fdab8aa728dc2bf0da6a0de8bb9=
c3f588d84:
>> >
>> > =C2=A0Merge git://git.kernel.org/pub/scm/linux/kernel/git/sfrench/=
cifs-2.6 (2011-01-25 14:23:54 +1000)
>> >
>> > are available in the git repository at:
>> >
>> > =C2=A0git://git.kernel.org/pub/scm/linux/kernel/git/dgc/xfsdev.git=
 inode-scale
>> >
>> > Dave Chinner (9):
>> > =C2=A0 =C2=A0 =C2=A0fs: protect inode->i_state with inode->i_lock
>>
>> The places where you take i_lock to initialize i_state in newly
>> allocated inodes should not be required, because they can't be found
>> until they get added to lists, and adding to the list must have the =
right
>> barriers or lock synchronisation to make sure traversals don't see
>> uninitialised inodes.
>>
>> eg.
>> =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0=
 =C2=A0 =C2=A0 =C2=A0spin_lock(&inode->i_lock);
>> =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0=
 =C2=A0 =C2=A0 =C2=A0inode->i_state =3D I_NEW;
>> =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0=
 =C2=A0 =C2=A0 =C2=A0hlist_add_head(&inode->i_hash, head);
>> =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0=
 =C2=A0 =C2=A0 =C2=A0spin_unlock(&inode->i_lock);
>> =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0=
 =C2=A0 =C2=A0 =C2=A0inode_sb_list_add(inode);
>> =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0=
 =C2=A0 =C2=A0 =C2=A0spin_unlock(&inode_lock);
>>
>> That lightens new inode paths a little bit.
>
> Not sure what you are try to say here - the above code fragment is
> exactly how the code ends up after the patch....

I meant that the i_lock can be removed from that section.

>
> As it is, the reason it was done this way was suggested by Al:
>
> http://marc.info/?l=3Dlinux-fsdevel&m=3D128832928023866&w=3D2
>
> We use ->i_lock when checking inode_unhashed(), so we need to hold
> the ->i_lock when adding or removing the inode from the hash list.

Ok right. I thought you were avoiding that approach in favour of
checking i_state bits and using the data structure locks (as opposed
to i_lock).


> This is also important for when hash lookups get converted to RCU
> traversals, because then the only thing protecting a newly added
> inode from a concurrent traversal is the ->i_lock...

Well rcu_assign_pointer has a memory barrier in it. So it would be
possible for rcu lookups to do

spin_lock(&inode->i_lock);
if (inode->i_state & I_HASHED) {
  /* do stuff */
}

Even without the spin lock in the inode init code.


>> "Also remove the unlock_new_inode() memory barrier optimisation
>> required to avoid taking the inode_lock when clearing I_NEW.
>> Simplify the code by simply taking the inode->i_lock around the
>> state change and wakeup. Because the wakeup is no longer tricky,
>> remove the wake_up_inode() function and open code the wakeup where
>> necessary."
>>
>> Can you do that bit separately?
>
> We'd still need to modify that code to protect the i_state
> manipulations with ->i_lock in this patch, so it doesn't really make
> sense to me to undo the optimisation in one patch, then remove the
> rest of it in another patch...

I'm not sure that I understand. The unlock_new_inode() path, for
example, could retain the optimisation, couldn't it?


>> > =C2=A0 =C2=A0 =C2=A0fs: factor inode disposal
>>
>> - =C2=A0 =C2=A0 =C2=A0 spin_unlock(&inode->i_lock);
>> -
>> - =C2=A0 =C2=A0 =C2=A0 /*
>> - =C2=A0 =C2=A0 =C2=A0 =C2=A0* Move the inode off the IO lists and L=
RU once I_FREEING is
>> - =C2=A0 =C2=A0 =C2=A0 =C2=A0* set so that it won't get moved back o=
n there if it is dirty.
>> - =C2=A0 =C2=A0 =C2=A0 =C2=A0*/
>> =C2=A0 =C2=A0 =C2=A0 =C2=A0 inode_lru_list_del(inode);
>> - =C2=A0 =C2=A0 =C2=A0 list_del_init(&inode->i_wb_list);
>> -
>> - =C2=A0 =C2=A0 =C2=A0 __inode_sb_list_del(inode);
>> + =C2=A0 =C2=A0 =C2=A0 spin_unlock(&inode->i_lock);
>> =C2=A0 =C2=A0 =C2=A0 =C2=A0 spin_unlock(&inode_lock);
>> +
>>
>> Why do you change i_lock to cover inode_lru_list_del in this
>> patch?
>
> So that all LRU manipultions in the function are done consi=D1=95tent=
ly
> i.e. with the ->i_lock held.

Why do you need that?


>> > =C2=A0 =C2=A0 =C2=A0fs: Lock the inode LRU list separately
>>
>> BTW. have you found any issues with nesting locks inside i_lock? We
>> do that extensively in dcache now, but it can be trivially made to u=
se
>> another lock (eg. i_dcache_lock).
>
> No new ones.

But did you see actual problems that you thought might appear?


>> > =C2=A0 =C2=A0 =C2=A0fs: remove inode_lock from iput_final and prun=
e_icache
>> > =C2=A0 =C2=A0 =C2=A0fs: move i_sb_list out from under inode_lock
>>
>> I don't know if we should be including ../internal.h, although you
>> could argue that quota and notify is core code.
>
> I don't think there is much to argue about there.

So, what is the use of internal.h? And why not just put it in fs.h (lik=
e
other things which are only to be used by core code)?

I'm not saying one way or the other is "correct", but a single policy
should be applied.


>> > =C2=A0 =C2=A0 =C2=A0fs: move i_wb_list out from under inode_lock
>> > =C2=A0 =C2=A0 =C2=A0fs: rename inode_lock to inode_hash_lock
>> > =C2=A0 =C2=A0 =C2=A0fs: Clean up documentation references to inode=
_lock
>>
>> Can these be merged with the patches as they change the locking?
>> Some places also call it "inode lock" (eg. find_inode())
>
> Yeah, probably should.

Thanks.


>> > =C2=A0 =C2=A0 =C2=A0fs: pull inode->i_lock up out of writeback_sin=
gle_inode
>>
>> I think the series looks pretty good overall. Thanks.
>
> Don't thank me, thank Christoph, Al, Andrew, Eric and others for
> reviewing it over and over to beat it into this shape.
--
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel=
" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html