From: Josef Bacik <jbacik@fb.com>
Date: Mon, 16 Mar 2015 11:34:41 -0400
To: Jan Kara
Cc: Dave Chinner, linux-kernel@vger.kernel.org
Subject: Re: [PATCH 8/9] inode: convert per-sb inode list to a list_lru
Message-ID: <5506F811.5000004@fb.com>
In-Reply-To: <20150316122704.GJ4934@quack.suse.cz>

On 03/16/2015 08:27 AM, Jan Kara wrote:
> Hello,
>
> On Tue 10-03-15 15:45:23, Josef Bacik wrote:
>> From: Dave Chinner
>>
>> The per-superblock inode list and lock is a bottleneck for systems
>> that cycle inodes in and out of cache concurrently. The global lock
>> is a limiting factor.
>>
>> Most of the additions to the sb inode list occur on the CPU that
>> allocated the inode, and most of the removals occur during evict()
>> calls as a result of memory reclaim. Both of these events are local
>> to the node that the inode belongs to, so it maps to the per-node
>> lists that the list_lru uses.
>>
>> There are several places where the inode list is walked. These can
>> be converted easily to use list_lru_walk() to do their work on each
>> inode on the list.
>
> Err, after looking at the patch I'm not so sure list_lru is such a great
> fit for the inode sb list. We don't use any of its LRU features, so
> arguably that just makes things less readable, and there's some memcg code
> which I hope isn't activated for the inode list - but how do I tell? We
> just use it as general per-NUMA-node linked lists, so if we go down the
> path of per-NUMA-node linked lists I'd prefer to have an abstraction just
> for that. Also there are some lifetime issues I'll comment on in the code.
>
> And do we really want per-NUMA-node lists? Why not per-CPU? Is it the
> memory cost we are concerned about? Or is it the cost of iterating all
> CPUs when wanting to traverse all inodes? The changelog doesn't tell. The
> traversal is used on unmount (inode eviction, fsnotify mark eviction),
> quota on/off (adding / removing dquot pointers to / from inodes),
> drop_caches, and sync (iterating all block devices). Sync is IMO the only
> one where CPU overhead might be a concern, but we could create a better
> method for iterating all block devices - after all, creating / removing
> block devices could happily use just ordinary lists. So CPU overhead
> shouldn't be a huge problem. Memory overhead is (list_head + spinlock)
> per sb per CPU. That doesn't sound too terrible either.
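(As an aside, a "list per batch of CPUs" as discussed above might look
roughly like the userspace sketch below. All of the structure and function
names here are invented for illustration, using pthread mutexes in place
of spinlocks - this is not the kernel API, just the shape of the idea:
additions and removals only take the lock for the caller's CPU batch,
while full traversals walk every sub-list in turn.)

```c
#include <assert.h>
#include <pthread.h>
#include <stddef.h>

#define CPUS_PER_LIST 8          /* "batch" size; tuned at boot in the proposal */
#define NR_LISTS      4          /* nr_cpus / CPUS_PER_LIST, rounded up */

struct list_head { struct list_head *next, *prev; };

static void list_init(struct list_head *h) { h->next = h->prev = h; }

static void list_add(struct list_head *e, struct list_head *head)
{
	e->next = head->next;
	e->prev = head;
	head->next->prev = e;
	head->next = e;
}

static void list_del(struct list_head *e)
{
	e->prev->next = e->next;
	e->next->prev = e->prev;
}

/* One (lock, list) pair per batch of CPUs instead of one global pair. */
struct batched_list {
	struct {
		pthread_mutex_t lock;
		struct list_head head;
	} sub[NR_LISTS];
};

struct inode {                   /* toy stand-in for struct inode */
	struct list_head i_sb_list;
	int i_sub;               /* remember which sub-list this inode is on */
};

static void batched_list_init(struct batched_list *bl)
{
	for (int i = 0; i < NR_LISTS; i++) {
		pthread_mutex_init(&bl->sub[i].lock, NULL);
		list_init(&bl->sub[i].head);
	}
}

/* Insertion only contends with CPUs in the same batch. */
static void batched_list_add(struct batched_list *bl, struct inode *in, int cpu)
{
	int idx = (cpu / CPUS_PER_LIST) % NR_LISTS;

	in->i_sub = idx;
	pthread_mutex_lock(&bl->sub[idx].lock);
	list_add(&in->i_sb_list, &bl->sub[idx].head);
	pthread_mutex_unlock(&bl->sub[idx].lock);
}

static void batched_list_del(struct batched_list *bl, struct inode *in)
{
	int idx = in->i_sub;

	pthread_mutex_lock(&bl->sub[idx].lock);
	list_del(&in->i_sb_list);
	pthread_mutex_unlock(&bl->sub[idx].lock);
}

/* Full traversals (umount, sync, ...) walk each sub-list in turn. */
static int batched_list_count(struct batched_list *bl)
{
	int n = 0;

	for (int i = 0; i < NR_LISTS; i++) {
		pthread_mutex_lock(&bl->sub[i].lock);
		for (struct list_head *p = bl->sub[i].head.next;
		     p != &bl->sub[i].head; p = p->next)
			n++;
		pthread_mutex_unlock(&bl->sub[i].lock);
	}
	return n;
}
```

The trade-off Jan describes falls out directly: memory cost is one
(lock, list_head) pair per sub-list per sb, and traversal cost grows with
the number of sub-lists, so the batch size knob interpolates between the
global-list and per-CPU extremes.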
Yeah, sorry, I had it in my head that the list was already per-CPU, but
you're right, it's per-NUMA-node.

> When speaking about this with Andi Kleen, he was suggesting that maybe we
> want not per-CPU lists but a list per batch of CPUs (say 8; the actual
> number would be tuned during boot based on the number of actual CPUs).
> That may be an option as well.
>
> I wouldn't like to block this series just for this patch, since all the
> others are fine (modulo some minor comments). So could you repost the
> series without this patch so that it can get queued for the coming merge
> window? It should be easy to rebase patch 9/9 to not need this patch.
>

Yeah, I can do that.

>> diff --git a/fs/block_dev.c b/fs/block_dev.c
>> index 2eb2436..d23ce6f 100644
>> --- a/fs/block_dev.c
>> +++ b/fs/block_dev.c
>> @@ -1749,38 +1749,56 @@ int __invalidate_device(struct block_device *bdev, bool kill_dirty)
>>  }
>>  EXPORT_SYMBOL(__invalidate_device);
>>
>> -void iterate_bdevs(void (*func)(struct block_device *, void *), void *arg)
>> -{
>> -	struct inode *inode, *old_inode = NULL;
>> +struct bdev_iter {
>> +	void (*func)(struct block_device *, void *);
>> +	void *arg;
>> +	struct inode *toput_inode;
>> +};
>>
>> -	spin_lock(&blockdev_superblock->s_inode_list_lock);
>> -	list_for_each_entry(inode, &blockdev_superblock->s_inodes, i_sb_list) {
>> -		struct address_space *mapping = inode->i_mapping;
>> +static enum lru_status
>> +bdev_iter_cb(struct list_head *item, struct list_lru_one *lru,
>> +	     spinlock_t *lock, void *cb_arg)
>> +{
>> +	struct bdev_iter *iter = cb_arg;
>> +	struct inode *inode = container_of(item, struct inode, i_sb_list);
>>
>> -		spin_lock(&inode->i_lock);
>> -		if (inode->i_state & (I_FREEING|I_WILL_FREE|I_NEW) ||
>> -		    mapping->nrpages == 0) {
>> -			spin_unlock(&inode->i_lock);
>> -			continue;
>> -		}
>> -		__iget(inode);
>> +	spin_lock(&inode->i_lock);
>> +	if (inode->i_state & (I_FREEING|I_WILL_FREE|I_NEW) ||
>> +	    inode->i_mapping->nrpages == 0) {
>> +		spin_unlock(&inode->i_lock);
>> -		spin_unlock(&blockdev_superblock->s_inode_list_lock);
>> -		/*
>> -		 * We hold a reference to 'inode' so it couldn't have been
>> -		 * removed from s_inodes list while we dropped the
>> -		 * s_inode_list_lock. We cannot iput the inode now as we can
>> -		 * be holding the last reference and we cannot iput it under
>> -		 * s_inode_list_lock. So we keep the reference and iput it
>> -		 * later.
>> -		 */
>> -		iput(old_inode);
>> -		old_inode = inode;
>> +		return LRU_SKIP;
>> +	}
>> +	__iget(inode);
>> +	spin_unlock(&inode->i_lock);
>> +	spin_unlock(lock);
>>
>> -		func(I_BDEV(inode), arg);
>> +	iput(iter->toput_inode);
>> +	iter->toput_inode = inode;
>>
>> -		spin_lock(&blockdev_superblock->s_inode_list_lock);
>> -	}
>> -	spin_unlock(&blockdev_superblock->s_inode_list_lock);
>> -	iput(old_inode);
>> +	iter->func(I_BDEV(inode), iter->arg);
>> +
>> +	/*
>> +	 * Even though we have dropped the lock here, we can return LRU_SKIP as
>> +	 * we have a reference to the current inode and so its next pointer is
>> +	 * guaranteed to be valid even though we dropped the list lock.
>> +	 */
>> +	spin_lock(lock);
>> +	return LRU_SKIP;
>> +}
>
> So I actually think doing the iteration like this is buggy with the
> list_lru code. When we pin an inode by grabbing i_count, we are only sure
> the inode won't go away under us. However, there's nothing which protects
> against list modifications. So this method is only safe when combined with
> list_for_each[_entry](). Specifically, it does *not* work when combined
> with list_for_each_safe(), which __list_lru_walk_one() uses, because the
> next pointer that is cached may become invalid once we drop the lock. And
> I'd really hate to try to make these two work together - that would seem
> really fragile, since subtle changes to how lru_list iteration works could
> break inode iteration.
>
> This is true for all the iterations over the inodes that are playing with
> i_count. It's not just about this particular case of blockdev iteration.
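(Jan's point about the cached next pointer is easy to see with a toy
userspace walk. The sketch below is invented for illustration - a
singly linked list and a plain function in place of the kernel's
list_for_each_entry_safe() - but it shows the same shape of bug: a
_safe-style iterator caches the next pointer before running the body, so
if the body "drops the lock" and a concurrent deletion removes exactly
that cached node, the walk still steps into the freed entry.)

```c
#include <assert.h>
#include <stddef.h>

struct node {
	struct node *next;
	int val;
	int dead;   /* set when "freed"; real code would poison or free it */
};

/*
 * Simulates a _safe iteration: `next` is cached before the body runs.
 * The body then "drops the lock", and if `victim` is exactly the cached
 * next node, a simulated concurrent thread unlinks and frees it - yet
 * the iterator's cached pointer still leads the walk into it.
 * Returns 1 if the walk visited a freed node (a use-after-free).
 */
static int walk_safe(struct node *head, struct node *victim)
{
	int visited_dead = 0;
	struct node *n, *next;

	for (n = head; n; n = next) {
		next = n->next;              /* cached before the lock drop */

		/* "lock dropped here": concurrent removal of `victim` */
		if (victim && next == victim) {
			n->next = victim->next;  /* unlink it ... */
			victim->dead = 1;        /* ... and "free" it */
		}

		if (n->dead)
			visited_dead = 1;        /* stepped into freed memory */
	}
	return visited_dead;
}
```

With plain list_for_each_entry() there is no cached next pointer to go
stale, which is why pinning the current inode with __iget() is enough
there but not under __list_lru_walk_one().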
>
>> diff --git a/fs/notify/inode_mark.c b/fs/notify/inode_mark.c
>> index a4e1a8f..a0cdc66 100644
>> --- a/fs/notify/inode_mark.c
>> +++ b/fs/notify/inode_mark.c
>> @@ -161,87 +161,60 @@ int fsnotify_add_inode_mark(struct fsnotify_mark *mark,
>>  	return ret;
>>  }
>>
>> -/**
>> - * fsnotify_unmount_inodes - an sb is unmounting. handle any watched inodes.
>> - * @sb: superblock being unmounted.
>> - *
>> - * Called during unmount with no locks held, so needs to be safe against
>> - * concurrent modifiers. We temporarily drop sb->s_inode_list_lock and CAN block.
>> - */
>> -void fsnotify_unmount_inodes(struct super_block *sb)
>> -{
>> -	struct inode *inode, *next_i, *need_iput = NULL;
>> -
>> -	spin_lock(&sb->s_inode_list_lock);
>> -	list_for_each_entry_safe(inode, next_i, &sb->s_inodes, i_sb_list) {
>> -		struct inode *need_iput_tmp;
>> -
>> -		/*
>> -		 * We cannot __iget() an inode in state I_FREEING,
>> -		 * I_WILL_FREE, or I_NEW which is fine because by that point
>> -		 * the inode cannot have any associated watches.
>> -		 */
>> -		spin_lock(&inode->i_lock);
>> -		if (inode->i_state & (I_FREEING|I_WILL_FREE|I_NEW)) {
>> -			spin_unlock(&inode->i_lock);
>> -			continue;
>> -		}
>> +static enum lru_status
>> +fsnotify_unmount_inode(struct list_head *item, struct list_lru_one *lru,
>> +		       spinlock_t *lock, void *cb_arg)
>> +{
>> +	struct inode **toput_inode = cb_arg;
>> +	struct inode *inode = container_of(item, struct inode, i_sb_list);
>> +
>> +	/* New or being-freed inodes cannot have any associated watches. */
>> +	spin_lock(&inode->i_lock);
>> +	if (inode->i_state & (I_FREEING|I_WILL_FREE|I_NEW)) {
>> +		spin_unlock(&inode->i_lock);
>> +		return LRU_SKIP;
>> +	}
>>
>> -		/*
>> -		 * If i_count is zero, the inode cannot have any watches and
>> -		 * doing an __iget/iput with MS_ACTIVE clear would actually
>> -		 * evict all inodes with zero i_count from icache which is
>> -		 * unnecessarily violent and may in fact be illegal to do.
>> -		 */
>> -		if (!atomic_read(&inode->i_count)) {
>> -			spin_unlock(&inode->i_lock);
>> -			continue;
>> -		}
>> -
>> -		need_iput_tmp = need_iput;
>> -		need_iput = NULL;
>> -
>> -		/* In case fsnotify_inode_delete() drops a reference. */
>> -		if (inode != need_iput_tmp)
>> -			__iget(inode);
>> -		else
>> -			need_iput_tmp = NULL;
>> +	/* If i_count is zero, the inode cannot have any watches. */
>> +	if (!atomic_read(&inode->i_count)) {
>>  		spin_unlock(&inode->i_lock);
>> +		return LRU_SKIP;
>> +	}
>>
>> -		/* In case the dropping of a reference would nuke next_i. */
>> -		while (&next_i->i_sb_list != &sb->s_inodes) {
>> -			spin_lock(&next_i->i_lock);
>> -			if (!(next_i->i_state & (I_FREEING | I_WILL_FREE)) &&
>> -			    atomic_read(&next_i->i_count)) {
>> -				__iget(next_i);
>> -				need_iput = next_i;
>> -				spin_unlock(&next_i->i_lock);
>> -				break;
>> -			}
>> -			spin_unlock(&next_i->i_lock);
>> -			next_i = list_entry(next_i->i_sb_list.next,
>> -					    struct inode, i_sb_list);
>> -		}
>> +	__iget(inode);
>> +	spin_unlock(&inode->i_lock);
>> +	spin_unlock(lock);
>>
>> -		/*
>> -		 * We can safely drop s_inode_list_lock here because either
>> -		 * we actually hold references on both inode and next_i or
>> -		 * end of list. Also no new inodes will be added since the
>> -		 * umount has begun.
>> -		 */
>> -		spin_unlock(&sb->s_inode_list_lock);
>> +	iput(*toput_inode);
>> +	*toput_inode = inode;
>>
>> -		if (need_iput_tmp)
>> -			iput(need_iput_tmp);
>> +	/* for each watch, send FS_UNMOUNT and then remove it */
>> +	fsnotify(inode, FS_UNMOUNT, inode, FSNOTIFY_EVENT_INODE, NULL, 0);
>> +	fsnotify_inode_delete(inode);
>>
>> -		/* for each watch, send FS_UNMOUNT and then remove it */
>> -		fsnotify(inode, FS_UNMOUNT, inode, FSNOTIFY_EVENT_INODE, NULL, 0);
>> +	/*
>> +	 * Even though we have dropped the lock here, we can return LRU_SKIP as
>> +	 * we have a reference to the current inode and so its next pointer is
>> +	 * guaranteed to be valid even though we dropped the list lock.
>> +	 */
>> +	spin_lock(lock);
>> +	return LRU_SKIP;
>> +}
>
> So in the original code of fsnotify_unmount_inodes() you can see the
> hoops you have to jump through when having to iterate the inode list with
> list_for_each_entry_safe(). Deleting that code without deeper thought
> wasn't a good idea ;).
>
> As a side note, it should be possible to convert this code to use just
> list_for_each() and avoid all that crap, but it will require me to delve
> into fsnotify magic again to verify it's correct. I guess I'll leave that
> until after your patches are queued so that I don't make life harder for
> you.
>

Well, these two cases are just bugs: we are supposed to return LRU_RETRY
every time we drop the lru list lock, so that we loop back to the
beginning. But I agree with your other points. I'll just drop this one,
find or create a batched per-CPU list to use instead, and do that
separately so we can still get the meat of this patchset in. Thanks,

Josef
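(For reference, the LRU_RETRY convention Josef mentions amounts to: a
callback that had to drop the list lock must not let the walker trust its
cached next pointer, and instead makes the walk restart from the head. The
toy userspace sketch below - all names invented, a flag standing in for
the real code's dispose-list bookkeeping that keeps restarts from
reprocessing entries - shows the shape of that pattern.)

```c
#include <assert.h>
#include <stddef.h>

enum walk_status { WALK_SKIP, WALK_RETRY };

struct tnode {
	struct tnode *next;
	int visited;
	int drops_lock;   /* this entry's callback "drops the lock" */
};

/*
 * Walk the list, invoking cb on each unprocessed entry.  WALK_SKIP means
 * the cached next pointer is still trustworthy; WALK_RETRY means the lock
 * was dropped mid-callback, so the walk restarts from the head rather
 * than follow a possibly stale pointer.  Returns the number of restarts.
 */
static int walk(struct tnode *head, enum walk_status (*cb)(struct tnode *))
{
	int restarts = 0;
	struct tnode *n;

restart:
	for (n = head; n; n = n->next) {
		if (n->visited)
			continue;       /* already handled before a restart */
		n->visited = 1;
		if (cb(n) == WALK_RETRY) {
			restarts++;
			goto restart;   /* never trust the stale next */
		}
	}
	return restarts;
}

/* A real callback would __iget(), drop the lock, work, then relock. */
static enum walk_status unmount_cb(struct tnode *n)
{
	return n->drops_lock ? WALK_RETRY : WALK_SKIP;
}
```

The cost is re-scanning already-processed entries after each restart,
which is why per-CPU or batched lists (keeping each restart scan short)
pair naturally with this pattern.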