From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1753119Ab3LBRG0 (ORCPT ); Mon, 2 Dec 2013 12:06:26 -0500 Received: from zeniv.linux.org.uk ([195.92.253.2]:47842 "EHLO ZenIV.linux.org.uk" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1752632Ab3LBRGZ (ORCPT ); Mon, 2 Dec 2013 12:06:25 -0500 Date: Mon, 2 Dec 2013 17:06:21 +0000 From: Al Viro To: Linus Torvalds Cc: Simon Kirby , Ian Applegate , Christoph Lameter , Pekka Enberg , LKML , Chris Mason Subject: Re: Found it! (was Re: [3.10] Oopses in kmem_cache_allocate() via prepare_creds()) Message-ID: <20131202170621.GG10323@ZenIV.linux.org.uk> References: MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: User-Agent: Mutt/1.5.21 (2010-09-15) Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On Mon, Dec 02, 2013 at 08:00:38AM -0800, Linus Torvalds wrote: > On Sat, Nov 30, 2013 at 1:08 PM, Linus Torvalds > wrote: > > > > I still don't see what could be wrong with the pipe_inode_info thing, > > but the fact that it's been so consistent in your traces does make me > > suspect it really is *that* particular slab. > > I think I finally found it. > > I've spent waaayy too much time looking at and thinking about that > code without seeing anything wrong, but this morning I woke up and > thought to myself "What if.." > > And looking at the code again, I went "BINGO". > > All our reference counting etc seems right, but we have one very > subtle bug: on the freeing path, we have a pattern like this: > > spin_lock(&inode->i_lock); > if (!--pipe->files) { > inode->i_pipe = NULL; > kill = 1; > } > spin_unlock(&inode->i_lock); > __pipe_unlock(pipe); > if (kill) > free_pipe_info(pipe); > > which on the face of it is trying to be very careful in not accessing > the pipe-info after it is released by having that "kill" flag, and > doing the release last. > > And it's complete garbage. > > Why? > > Because the thread that decrements "pipe->files" *without* releasing > it, will very much access it after it has been dropped: that > "__pipe_unlock(pipe)" happens *after* we've decremented the pipe > reference count and dropped the inode lock. So another CPU can come in > and free the structure concurrently with that __pipe_unlock(pipe). > > This happens in two places, and we don't actually need or want the > pipe lock for the pipe->files accesses (since pipe->files is protected > by inode->i_lock, not the pipe lock), so the solution is to just do > the __pipe_unlock() before the whole dance about the pipe->files > reference count. > > Patch appended. And no wonder nobody has ever seen it, because the > race is unlikely as hell to ever happen. Simon, I assume it will be > another few months before we can say "yeah, that fixed it", but I > really think this is it. It explains all the symptoms, including > "DEBUG_PAGEALLOC didn't catch it" (because the access happens just as > it is released, and DEBUG_PAGEALLOC takes too long to actually free > unmap the page etc). Nice catch. I'd probably add /* we are holding inode->i_pipe->files */ static void drop_pipe(struct inode *inode, struct pipe_inode_info *pipe) { int kill = 0; spin_lock(&inode->i_lock); WARN_ON(pipe != inode->i_pipe); if (!--pipe->files) { inode->i_pipe = NULL; kill = 1; } spin_unlock(&inode->i_lock); if (kill) free_pipe_info(pipe); } and use it in both places...