All of lore.kernel.org
 help / color / mirror / Atom feed
From: Ingo Molnar <mingo@kernel.org>
To: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Simon Kirby <sim@hostway.ca>, Ian Applegate <ia@cloudflare.com>,
	Al Viro <viro@zeniv.linux.org.uk>,
	Christoph Lameter <cl@gentwo.org>,
	Pekka Enberg <penberg@kernel.org>,
	LKML <linux-kernel@vger.kernel.org>,
	Chris Mason <chris.mason@fusionio.com>
Subject: Re: Found it! (was Re: [3.10] Oopses in kmem_cache_allocate() via prepare_creds())
Date: Mon, 2 Dec 2013 17:27:55 +0100	[thread overview]
Message-ID: <20131202162755.GB27781@gmail.com> (raw)
In-Reply-To: <CA+55aFyez7uZ6LeXrBRCqJzgJ=w4Xv+CV6QSYp5NkOJ2R9Xang@mail.gmail.com>


* Linus Torvalds <torvalds@linux-foundation.org> wrote:

> On Sat, Nov 30, 2013 at 1:08 PM, Linus Torvalds
> <torvalds@linux-foundation.org> wrote:
> >
> > I still don't see what could be wrong with the pipe_inode_info thing,
> > but the fact that it's been so consistent in your traces does make me
> > suspect it really is *that* particular slab.
> 
> I think I finally found it.
> 
> I've spent waaayy too much time looking at and thinking about that
> code without seeing anything wrong, but this morning I woke up and
> thought to myself "What if.."
> 
> And looking at the code again, I went "BINGO".
> 
> All our reference counting etc seems right, but we have one very
> subtle bug: on the freeing path, we have a pattern like this:
> 
>         spin_lock(&inode->i_lock);
>         if (!--pipe->files) {
>                 inode->i_pipe = NULL;
>                 kill = 1;
>         }
>         spin_unlock(&inode->i_lock);
>         __pipe_unlock(pipe);
>         if (kill)
>                 free_pipe_info(pipe);
> 
> which on the face of it is trying to be very careful in not accessing
> the pipe-info after it is released by having that "kill" flag, and
> doing the release last.
> 
> And it's complete garbage.
> 
> Why?
> 
> Because the thread that decrements "pipe->files" *without* releasing 
> it, will very much access it after it has been dropped: that 
> "__pipe_unlock(pipe)" happens *after* we've decremented the pipe 
> reference count and dropped the inode lock. So another CPU can come 
> in and free the structure concurrently with that 
> __pipe_unlock(pipe).
> 
> This happens in two places, and we don't actually need or want the 
> pipe lock for the pipe->files accesses (since pipe->files is 
> protected by inode->i_lock, not the pipe lock), so the solution is 
> to just do the __pipe_unlock() before the whole dance about the 
> pipe->files reference count.
> 
> Patch appended. And no wonder nobody has ever seen it, because the 
> race is unlikely as hell to ever happen. Simon, I assume it will be 
> another few months before we can say "yeah, that fixed it", but I 
> really think this is it. It explains all the symptoms, including 
> "DEBUG_PAGEALLOC didn't catch it" (because the access happens just 
> as it is released, and DEBUG_PAGEALLOC takes too long to actually 
> free unmap the page etc).
> 
>                      Linus
>  fs/pipe.c | 6 ++++--
>  1 file changed, 4 insertions(+), 2 deletions(-)
> 
> diff --git a/fs/pipe.c b/fs/pipe.c
> index d2c45e14e6d8..18f1a4b2dbbc 100644
> --- a/fs/pipe.c
> +++ b/fs/pipe.c
> @@ -743,13 +743,14 @@ pipe_release(struct inode *inode, struct file *file)
>  		kill_fasync(&pipe->fasync_readers, SIGIO, POLL_IN);
>  		kill_fasync(&pipe->fasync_writers, SIGIO, POLL_OUT);
>  	}
> +	__pipe_unlock(pipe);
> +
>  	spin_lock(&inode->i_lock);
>  	if (!--pipe->files) {
>  		inode->i_pipe = NULL;
>  		kill = 1;
>  	}
>  	spin_unlock(&inode->i_lock);
> -	__pipe_unlock(pipe);

I'm wondering why pipe-mutex was introduced. It was done fairly 
recently, with no justification given:

  From 72b0d9aacb89f3759931ec440e1b535671145bb4 Mon Sep 17 00:00:00 2001
  From: Al Viro <viro@zeniv.linux.org.uk>
  Date: Thu, 21 Mar 2013 02:32:24 -0400
  Subject: [PATCH] pipe: don't use ->i_mutex

  now it can be done - put mutex into pipe_inode_info, use it instead
  of ->i_mutex

  Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
  ---
   fs/ocfs2/file.c           | 6 ++----
   fs/pipe.c                 | 5 +++--
   include/linux/pipe_fs_i.h | 2 ++
   3 files changed, 7 insertions(+), 6 deletions(-)

It's not like there should be many (any?) VFS operations where a pipe 
is used via i_mutex and pipe->mutex in parallel, which would improve 
scalability - so I don't see the scalability advantage. (But I might 
be missing something)

Barring such kind of workload the extra mutex just adds extra 
micro-costs because now two locks have to be taken on 
creation/destruction, plus it adds extra complexity and races.

So unless I'm missing something obvious, another good fix would be to 
just revert pipe->mutex and rely on i_mutex as before?

Thanks,

	Ingo

  reply	other threads:[~2013-12-02 16:28 UTC|newest]

Thread overview: 18+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2013-12-02 16:00 Found it! (was Re: [3.10] Oopses in kmem_cache_allocate() via prepare_creds()) Linus Torvalds
2013-12-02 16:27 ` Ingo Molnar [this message]
2013-12-02 16:46   ` Al Viro
2013-12-02 17:05     ` Ingo Molnar
2013-12-02 17:06 ` Al Viro
2013-12-03  2:58 ` Linus Torvalds
2013-12-03  4:28   ` Al Viro
2013-12-05  8:12     ` gfs2 deadlock (was Re: Found it) Al Viro
2013-12-05 10:19       ` [Cluster-devel] " Steven Whitehouse
2013-12-05 10:19         ` Steven Whitehouse
2013-12-03  8:52   ` [PATCH] mutexes: Add CONFIG_DEBUG_MUTEX_FASTPATH=y debug variant to debug SMP races Ingo Molnar
2013-12-03 18:10     ` Linus Torvalds
2013-12-04  9:19       ` Simon Kirby
2013-12-04 21:14         ` Linus Torvalds
2013-12-05  8:06           ` Simon Kirby
2013-12-05  6:57     ` Simon Kirby
2013-12-11 15:03     ` Waiman Long
  -- strict thread matches above, loose matches on Subject: below --
2025-10-07  4:03 Found it! (was Re: [3.10] Oopses in kmem_cache_allocate() via prepare_creds()) Steven Paul Jobs

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20131202162755.GB27781@gmail.com \
    --to=mingo@kernel.org \
    --cc=chris.mason@fusionio.com \
    --cc=cl@gentwo.org \
    --cc=ia@cloudflare.com \
    --cc=linux-kernel@vger.kernel.org \
    --cc=penberg@kernel.org \
    --cc=sim@hostway.ca \
    --cc=torvalds@linux-foundation.org \
    --cc=viro@zeniv.linux.org.uk \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.