From mboxrd@z Thu Jan 1 00:00:00 1970
Return-Path:
Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand
    id S1751994Ab3LJU2x (ORCPT ); Tue, 10 Dec 2013 15:28:53 -0500
Received: from mx1.redhat.com ([209.132.183.28]:13949 "EHLO mx1.redhat.com"
    rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP
    id S1750989Ab3LJU2w (ORCPT ); Tue, 10 Dec 2013 15:28:52 -0500
Date: Tue, 10 Dec 2013 15:27:52 -0500
From: Dave Jones
To: Linus Torvalds
Cc: Thomas Gleixner , Darren Hart , Andrea Arcangeli ,
    Linux Kernel Mailing List , Peter Zijlstra , Mel Gorman ,
    Oleg Nesterov
Subject: Re: process 'stuck' at exit.
Message-ID: <20131210202752.GA27373@redhat.com>
Mail-Followup-To: Dave Jones , Linus Torvalds , Thomas Gleixner ,
    Darren Hart , Andrea Arcangeli , Linux Kernel Mailing List ,
    Peter Zijlstra , Mel Gorman , Oleg Nesterov
References: <20131210154724.GA30020@redhat.com>
MIME-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Content-Disposition: inline
In-Reply-To:
User-Agent: Mutt/1.5.21 (2010-09-15)
Sender: linux-kernel-owner@vger.kernel.org
List-ID:
X-Mailing-List: linux-kernel@vger.kernel.org

On Tue, Dec 10, 2013 at 11:55:06AM -0800, Linus Torvalds wrote:
> On Tue, Dec 10, 2013 at 11:18 AM, Thomas Gleixner wrote:
> >
> > So this is pretty unlikely. The retry requires:
> >
> >   get_futex_value_locked() == EFAULT;
> >
> > Now we drop the hash bucket locks and do:
> >
> >   get_user();
> >
> > And if that get_user() faults again, we bail out.
>
> I think you need to look closer.
>
> We have at least also that "futex_proxy_trylock_atomic() returns
> -EAGAIN" case. Which triggers at some exit condition. Another thread
> in the same group, perhaps never completing the exit because it's
> waiting for this one? I dunno, I didn't look any closer (but this does
> make me think "Hey, we should add Oleg to the Cc too", since
> PF_EXITING is involved).. So maybe there is some situation where that
> EAGAIN will keep happening, forever..
>
> Now, I'm *not* saying that that is it. It's quite possible/likely some
> other loop, but I do have to say that it sure isn't _obvious_. And
> that whole EAGAIN return case is quite deep and special, so ...
>
> Linus
>
> PS: Oleg - the whole thread is on lkml. Ping me if you need more context.

btw, I've left the machine in that state, and will for as long as necessary
in case someone has any ideas for further tracing experiments.

    Dave