From mboxrd@z Thu Jan  1 00:00:00 1970
Return-Path:
Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand
	id S1752831AbbJGMi6 (ORCPT );
	Wed, 7 Oct 2015 08:38:58 -0400
Received: from bombadil.infradead.org ([198.137.202.9]:46795 "EHLO
	bombadil.infradead.org" rhost-flags-OK-OK-OK-OK) by vger.kernel.org
	with ESMTP id S1751302AbbJGMi5 (ORCPT );
	Wed, 7 Oct 2015 08:38:57 -0400
Date: Wed, 7 Oct 2015 14:38:52 +0200
From: Peter Zijlstra
To: Oleg Nesterov
Cc: heiko.carstens@de.ibm.com, linux-kernel@vger.kernel.org,
	Tejun Heo, Ingo Molnar, Rik van Riel
Subject: Re: [RFC][PATCH] sched: Start stopper early
Message-ID: <20151007123852.GH17308@twins.programming.kicks-ass.net>
References: <20151007084110.GX2881@worktop.programming.kicks-ass.net>
	<20151007123046.GA21460@redhat.com>
MIME-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Content-Disposition: inline
In-Reply-To: <20151007123046.GA21460@redhat.com>
User-Agent: Mutt/1.5.21 (2012-12-30)
Sender: linux-kernel-owner@vger.kernel.org
List-ID:
X-Mailing-List: linux-kernel@vger.kernel.org

On Wed, Oct 07, 2015 at 02:30:46PM +0200, Oleg Nesterov wrote:
> On 10/07, Peter Zijlstra wrote:
> >
> > So Heiko reported some 'interesting' fail where stop_two_cpus() got
> > stuck in multi_cpu_stop() with one cpu waiting for another that never
> > happens.
> >
> > It _looks_ like the 'other' cpu isn't running and the current best
> > theory is that we race on cpu-up and get the stop_two_cpus() call in
> > before the stopper task is running.
> >
> > This _is_ possible because we set 'online && active'
>
> Argh. Can't really comment this change right now, but this reminds me
> that stop_two_cpus() path should not rely on cpu_active() at all. I mean
> we should not use this check to avoid the deadlock, migrate_swap_stop()
> can check it itself. And cpu_stop_park()->cpu_stop_signal_done() should
> be replaced by BUG_ON().
>
> Probably slightly off-topic, but what do you finally think about the old
> "[PATCH v2 6/6] stop_machine: kill stop_cpus_lock and lg_double_lock/unlock()"
> we discussed in http://marc.info/?t=143750670300014 ?
>
> I won't really insist if you still dislike it, but it seems we both
> agree that "lg_lock stop_cpus_lock" must die in any case, and after that
> we can the cleanups mentioned above.

Yes, I was looking at that, this issue reminded me we still had that
issue open.

> And, Peter, I see a lot of interesting emails from you, but currently
> can't even read them. I hope very much I will read them later and perhaps
> even reply ;)

Sure, take your time.