From mboxrd@z Thu Jan 1 00:00:00 1970 From: Nick Piggin Subject: Re: Status of inode scaling work Date: Thu, 27 Jan 2011 16:11:24 +1100 Message-ID: References: <20110125071011.GD28803@dastard> <20110127045736.GJ21311@dastard> Mime-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: QUOTED-PRINTABLE Cc: Dave Chinner , Al Viro , Christoph Hellwig , Linus Torvalds , linux-fsdevel To: Dave Chinner Return-path: Received: from mail-ww0-f44.google.com ([74.125.82.44]:42008 "EHLO mail-ww0-f44.google.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1750703Ab1A0FL0 convert rfc822-to-8bit (ORCPT ); Thu, 27 Jan 2011 00:11:26 -0500 Received: by wwa36 with SMTP id 36so1781490wwa.1 for ; Wed, 26 Jan 2011 21:11:24 -0800 (PST) In-Reply-To: <20110127045736.GJ21311@dastard> Sender: linux-fsdevel-owner@vger.kernel.org List-ID: On Thu, Jan 27, 2011 at 3:57 PM, Dave Chinner wro= te: > On Thu, Jan 27, 2011 at 11:40:57AM +1100, Nick Piggin wrote: >> On Tue, Jan 25, 2011 at 6:10 PM, Dave Chinner = wrote: >> > On Fri, Jan 14, 2011 at 08:02:57PM +1100, Nick Piggin wrote: >> >> I'd like to know about the status and plans of the inode scaling = work. >> >> >> >> Aiming for 2.6.39 or 40 might be an idea. >> > >> > It was ready for 2.6.38.... >> >> Can it go into linux-next? Unless Al can put it in his for-next bran= ch? > > It'll go into linux-next whenever it is sufficiently reviewed and > tested to get pulled into an upstream branch. > >> >> I'd like to be able to see what patches are proposed to finish br= eaking >> >> inode_lock into its constituent sub classes, and we can go from t= here. >> > >> > Already done. I pointed you at the tree a while back: >> > >> > The following changes since commit c723fdab8aa728dc2bf0da6a0de8bb9= c3f588d84: >> > >> > =C2=A0Merge git://git.kernel.org/pub/scm/linux/kernel/git/sfrench/= cifs-2.6 (2011-01-25 14:23:54 +1000) >> > >> > are available in the git repository at: >> > >> > =C2=A0git://git.kernel.org/pub/scm/linux/kernel/git/dgc/xfsdev.git= inode-scale >> > >> > Dave Chinner (9): >> > =C2=A0 =C2=A0 =C2=A0fs: protect inode->i_state with inode->i_lock >> >> The places where you take i_lock to initialize i_state in newly >> allocated inodes should not be required, because they can't be found >> until they get added to lists, and adding to the list must have the = right >> barriers or lock synchronisation to make sure traversals don't see >> uninitialised inodes. >> >> eg. >> =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0= =C2=A0 =C2=A0 =C2=A0spin_lock(&inode->i_lock); >> =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0= =C2=A0 =C2=A0 =C2=A0inode->i_state =3D I_NEW; >> =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0= =C2=A0 =C2=A0 =C2=A0hlist_add_head(&inode->i_hash, head); >> =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0= =C2=A0 =C2=A0 =C2=A0spin_unlock(&inode->i_lock); >> =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0= =C2=A0 =C2=A0 =C2=A0inode_sb_list_add(inode); >> =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0= =C2=A0 =C2=A0 =C2=A0spin_unlock(&inode_lock); >> >> That lightens new inode paths a little bit. > > Not sure what you are try to say here - the above code fragment is > exactly how the code ends up after the patch.... I meant that the i_lock can be removed from that section. > > As it is, the reason it was done this way was suggested by Al: > > http://marc.info/?l=3Dlinux-fsdevel&m=3D128832928023866&w=3D2 > > We use ->i_lock when checking inode_unhashed(), so we need to hold > the ->i_lock when adding or removing the inode from the hash list. Ok right. I thought you were avoiding that approach in favour of checking i_state bits and using the data structure locks (as opposed to i_lock). > This is also important for when hash lookups get converted to RCU > traversals, because then the only thing protecting a newly added > inode from a concurrent traversal is the ->i_lock... Well rcu_assign_pointer has a memory barrier in it. So it would be possible for rcu lookups to do spin_lock(&inode->i_lock); if (inode->i_state & I_HASHED) { /* do stuff */ } Even without the spin lock in the inode init code. >> "Also remove the unlock_new_inode() memory barrier optimisation >> required to avoid taking the inode_lock when clearing I_NEW. >> Simplify the code by simply taking the inode->i_lock around the >> state change and wakeup. Because the wakeup is no longer tricky, >> remove the wake_up_inode() function and open code the wakeup where >> necessary." >> >> Can you do that bit separately? > > We'd still need to modify that code to protect the i_state > manipulations with ->i_lock in this patch, so it doesn't really make > sense to me to undo the optimisation in one patch, then remove the > rest of it in another patch... I'm not sure that I understand. The unlock_new_inode() path, for example, could retain the optimisation, couldn't it? >> > =C2=A0 =C2=A0 =C2=A0fs: factor inode disposal >> >> - =C2=A0 =C2=A0 =C2=A0 spin_unlock(&inode->i_lock); >> - >> - =C2=A0 =C2=A0 =C2=A0 /* >> - =C2=A0 =C2=A0 =C2=A0 =C2=A0* Move the inode off the IO lists and L= RU once I_FREEING is >> - =C2=A0 =C2=A0 =C2=A0 =C2=A0* set so that it won't get moved back o= n there if it is dirty. >> - =C2=A0 =C2=A0 =C2=A0 =C2=A0*/ >> =C2=A0 =C2=A0 =C2=A0 =C2=A0 inode_lru_list_del(inode); >> - =C2=A0 =C2=A0 =C2=A0 list_del_init(&inode->i_wb_list); >> - >> - =C2=A0 =C2=A0 =C2=A0 __inode_sb_list_del(inode); >> + =C2=A0 =C2=A0 =C2=A0 spin_unlock(&inode->i_lock); >> =C2=A0 =C2=A0 =C2=A0 =C2=A0 spin_unlock(&inode_lock); >> + >> >> Why do you change i_lock to cover inode_lru_list_del in this >> patch? > > So that all LRU manipultions in the function are done consi=D1=95tent= ly > i.e. with the ->i_lock held. Why do you need that? >> > =C2=A0 =C2=A0 =C2=A0fs: Lock the inode LRU list separately >> >> BTW. have you found any issues with nesting locks inside i_lock? We >> do that extensively in dcache now, but it can be trivially made to u= se >> another lock (eg. i_dcache_lock). > > No new ones. But did you see actual problems that you thought might appear? >> > =C2=A0 =C2=A0 =C2=A0fs: remove inode_lock from iput_final and prun= e_icache >> > =C2=A0 =C2=A0 =C2=A0fs: move i_sb_list out from under inode_lock >> >> I don't know if we should be including ../internal.h, although you >> could argue that quota and notify is core code. > > I don't think there is much to argue about there. So, what is the use of internal.h? And why not just put it in fs.h (lik= e other things which are only to be used by core code)? I'm not saying one way or the other is "correct", but a single policy should be applied. >> > =C2=A0 =C2=A0 =C2=A0fs: move i_wb_list out from under inode_lock >> > =C2=A0 =C2=A0 =C2=A0fs: rename inode_lock to inode_hash_lock >> > =C2=A0 =C2=A0 =C2=A0fs: Clean up documentation references to inode= _lock >> >> Can these be merged with the patches as they change the locking? >> Some places also call it "inode lock" (eg. find_inode()) > > Yeah, probably should. Thanks. >> > =C2=A0 =C2=A0 =C2=A0fs: pull inode->i_lock up out of writeback_sin= gle_inode >> >> I think the series looks pretty good overall. Thanks. > > Don't thank me, thank Christoph, Al, Andrew, Eric and others for > reviewing it over and over to beat it into this shape. -- To unsubscribe from this list: send the line "unsubscribe linux-fsdevel= " in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html