Date: Tue, 3 Dec 2013 04:28:30 +0000
From: Al Viro
To: Linus Torvalds
Cc: Simon Kirby, Ingo Molnar, Peter Zijlstra, Waiman Long, Ian Applegate,
	Christoph Lameter, Pekka Enberg, LKML, Chris Mason
Subject: Re: Found it! (was Re: [3.10] Oopses in kmem_cache_allocate() via prepare_creds())
Message-ID: <20131203042830.GI10323@ZenIV.linux.org.uk>

On Mon, Dec 02, 2013 at 06:58:57PM -0800, Linus Torvalds wrote:
> In other words, it's unsafe to protect reference counts inside objects
> with anything but spinlocks and/or atomic refcounts. Or you have to
> have the lock *outside* the object you're protecting (which is often
> what you want for other reasons anyway, notably lookup).
>
> So having a non-atomic refcount protected inside a sleeping lock is a
> bug, and that's really the bug that ended up biting us for pipes.
>
> Now, the question is: do we have other such cases? How do we document
> this? Do we try to make mutexes and other locks safe to use for things
> like this?

Umm...  AFAICS, in VFS proper we have

	files_struct - atomic_dec_and_test

	fs_struct - spinlock + int

	file - atomic_long_dec_and_test (with delays after that, including
RCU).

	super_block - global spinlock + int (s_count); the mutex in there
(->s_umount) can be taken by anybody who holds an active ref *or* has
bumped ->s_count while holding sb_lock.
Exactly to prevent that kind of unpleasantness.  Freeing is RCU-delayed.

	vfsmount - percpu counter + flag + global seqlock, with quite a bit
of contortions for the sake of avoiding cross-CPU stores on the fast path;
discussed back in October, concluded to be safe.  Freeing RCU-delayed.

	dentry - lockref, with RCU-delayed actual freeing.

	file_system_type, nls_table, linux_binfmt - module refcount of
"owner"; search structures protected by global spinlocks or rwlocks, and
an exiting module is responsible for unregistering first.

	inode - atomic_dec_and_lock, with actual freeing RCU-delayed (and
eviction code waiting for pagecache references to be gone, with the rest
being the responsibility of the fs method called before we free the
sucker).

	block_device - part of a bdevfs inode.

These should be safe, but damnit, we really need the life cycle documented
for all objects - the above is only a part of it.  Note that for e.g.
superblocks we have additional rules: ->s_active can't be incremented for
any reason once it drops to zero, it can't be incremented until the
superblock has been marked "born", and it crosses over to zero only with
->s_umount held.  There are 6 stages in the life cycle of struct
super_block, and we have had interesting bugs due to messing the
transitions up.

The trouble is that any attempt to write those down tends to stray into a
massive grep session, with the usual results - some other crap gets found
(e.g. in some odd driver) and needs to be dealt with ;-/  Sigh...
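[Editorial note] The safe pattern Linus describes - an atomic refcount
embedded in the object, with the last put doing the freeing, so that no
code path (such as a mutex unlock slowpath) can touch the object after the
count hits zero - can be sketched in userspace C11. This is a minimal
illustration, not kernel code; all names (`obj_new`, `obj_get`, `obj_put`)
are invented for the example.

```c
#include <stdatomic.h>
#include <stdlib.h>

/* Illustrative userspace analogue of an atomic refcount inside an
 * object.  The count itself is the only synchronization: there is no
 * sleeping lock in the object whose unlock path could scribble on
 * freed memory after another thread drops the last reference. */
struct obj {
	atomic_int refcount;
	/* ... payload ... */
};

static struct obj *obj_new(void)
{
	struct obj *o = malloc(sizeof(*o));
	if (o)
		atomic_init(&o->refcount, 1);	/* caller holds one ref */
	return o;
}

static void obj_get(struct obj *o)
{
	/* relaxed is enough for a plain increment of a held reference */
	atomic_fetch_add_explicit(&o->refcount, 1, memory_order_relaxed);
}

/* Drop a reference; returns 1 if this was the last one and the object
 * was freed.  acq_rel orders all prior accesses before the free. */
static int obj_put(struct obj *o)
{
	if (atomic_fetch_sub_explicit(&o->refcount, 1,
				      memory_order_acq_rel) == 1) {
		free(o);
		return 1;
	}
	return 0;
}
```

The buggy pipe pattern, by contrast, kept a plain `int` count protected by
a mutex inside the object: the thread releasing the mutex can still be
touching the mutex's own fields after a second thread has acquired it,
dropped the count to zero, and freed the object.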
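[Editorial note] Several entries in the list above (inode in particular)
rely on the dec-and-lock idiom: decrement lock-free while the count stays
above one, but take a lock before the decrement that might hit zero, so
"count reached zero" is only ever observed with the lock held and nothing
can resurrect the object underneath the teardown. Below is a hedged
userspace sketch of those semantics using C11 atomics and a pthread mutex;
the kernel's atomic_dec_and_lock() works against a spinlock and is
implemented differently.

```c
#include <stdatomic.h>
#include <stdbool.h>
#include <pthread.h>

/* Userspace sketch of atomic_dec_and_lock() semantics.  Returns true if
 * the count dropped to zero, in which case the lock is still HELD and
 * the caller must tear the object down and then unlock.  Not kernel
 * code; a mutex stands in for the kernel's spinlock. */
static bool dec_and_lock(atomic_int *cnt, pthread_mutex_t *lock)
{
	int c = atomic_load(cnt);

	/* fast path: while more than one reference remains, a plain
	 * CAS decrement needs no lock at all */
	while (c > 1) {
		if (atomic_compare_exchange_weak(cnt, &c, c - 1))
			return false;
		/* CAS failure reloaded c; retry */
	}

	/* slow path: we may be dropping the last reference, so take the
	 * lock first - anyone who could hand out new references must do
	 * so under this same lock, and must refuse once it sees zero */
	pthread_mutex_lock(lock);
	if (atomic_fetch_sub(cnt, 1) == 1)
		return true;	/* zero reached with lock held */
	pthread_mutex_unlock(lock);
	return false;
}
```

The same lock-first discipline is what the superblock rule above encodes:
->s_active may never be bumped back up once it has reached zero, and the
zero crossing only happens with ->s_umount held.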