From mboxrd@z Thu Jan  1 00:00:00 1970
Return-Path:
Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand
	id S1752160AbbJGIlT (ORCPT ); Wed, 7 Oct 2015 04:41:19 -0400
Received: from bombadil.infradead.org ([198.137.202.9]:38998 "EHLO
	bombadil.infradead.org" rhost-flags-OK-OK-OK-OK) by vger.kernel.org
	with ESMTP id S1751684AbbJGIlP (ORCPT );
	Wed, 7 Oct 2015 04:41:15 -0400
Date: Wed, 7 Oct 2015 10:41:10 +0200
From: Peter Zijlstra
To: heiko.carstens@de.ibm.com
Cc: linux-kernel@vger.kernel.org, Tejun Heo, Oleg Nesterov,
	Ingo Molnar, Rik van Riel
Subject: [RFC][PATCH] sched: Start stopper early
Message-ID: <20151007084110.GX2881@worktop.programming.kicks-ass.net>
MIME-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Content-Disposition: inline
User-Agent: Mutt/1.5.22.1 (2013-10-16)
Sender: linux-kernel-owner@vger.kernel.org
List-ID:
X-Mailing-List: linux-kernel@vger.kernel.org

Hi,

So Heiko reported some 'interesting' fail where stop_two_cpus() got stuck
in multi_cpu_stop() with one cpu waiting for another that never shows up.

It _looks_ like the 'other' cpu isn't running, and the current best theory
is that we race on cpu-up and get the stop_two_cpus() call in before the
stopper task is running.

This _is_ possible because we set 'online && active' _before_ we do the
smpboot_unpark thing, because of ONLINE notifier order.

The below test patch manually starts the stopper task early.

It boots and hotplugs a cpu on my test box, so it's not insta broken.

---
 kernel/sched/core.c   | 7 ++++++-
 kernel/stop_machine.c | 5 +++++
 2 files changed, 11 insertions(+), 1 deletion(-)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 1764a0f..9a56ef7 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -5542,14 +5542,19 @@ static void set_cpu_rq_start_time(void)
 	rq->age_stamp = sched_clock_cpu(cpu);
 }
 
+extern void cpu_stopper_unpark(unsigned int cpu);
+
 static int sched_cpu_active(struct notifier_block *nfb,
 				      unsigned long action, void *hcpu)
 {
+	int cpu = (long)hcpu;
+
 	switch (action & ~CPU_TASKS_FROZEN) {
 	case CPU_STARTING:
 		set_cpu_rq_start_time();
 		return NOTIFY_OK;
 	case CPU_ONLINE:
+		cpu_stopper_unpark(cpu);
 		/*
 		 * At this point a starting CPU has marked itself as online via
 		 * set_cpu_online(). But it might not yet have marked itself
@@ -5558,7 +5563,7 @@ static int sched_cpu_active(struct notifier_block *nfb,
 		 * Thus, fall-through and help the starting CPU along.
 		 */
 	case CPU_DOWN_FAILED:
-		set_cpu_active((long)hcpu, true);
+		set_cpu_active(cpu, true);
 		return NOTIFY_OK;
 	default:
 		return NOTIFY_DONE;
diff --git a/kernel/stop_machine.c b/kernel/stop_machine.c
index 12484e5..c674371 100644
--- a/kernel/stop_machine.c
+++ b/kernel/stop_machine.c
@@ -496,6 +496,11 @@ static struct smp_hotplug_thread cpu_stop_threads = {
 	.selfparking = true,
 };
 
+void cpu_stopper_unpark(unsigned int cpu)
+{
+	kthread_unpark(per_cpu(cpu_stopper.thread, cpu));
+}
+
 static int __init cpu_stop_init(void)
 {
 	unsigned int cpu;
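
For illustration only, the ordering argument above can be written down as a
tiny userspace model. This is a sketch, not kernel code: struct toy_cpu,
enum toy_step and toy_has_race_window() are all made-up names, and the model
assumes the only thing that matters is whether a cpu can be seen as active
while its stopper is still parked.

```c
#include <assert.h>
#include <stdbool.h>

/* Toy model of one cpu coming up; every name here is hypothetical. */
struct toy_cpu {
	bool active;		/* stands in for set_cpu_active(cpu, true) */
	bool stopper_running;	/* stands in for the stopper being unparked */
};

/* The two bring-up steps whose relative order is in question. */
enum toy_step { TOY_SET_ACTIVE, TOY_UNPARK_STOPPER };

static void toy_apply(struct toy_cpu *c, enum toy_step s)
{
	if (s == TOY_SET_ACTIVE)
		c->active = true;
	else
		c->stopper_running = true;
}

/*
 * Apply the steps in the given order and report whether there is any
 * point at which the cpu is active but its stopper is not yet running.
 * In this model that is the window in which stopper work queued for the
 * cpu would have nobody to run it.
 */
static bool toy_has_race_window(const enum toy_step *order, int n)
{
	struct toy_cpu c = { false, false };
	int i;

	assert(n >= 0);
	for (i = 0; i < n; i++) {
		toy_apply(&c, order[i]);
		if (c.active && !c.stopper_running)
			return true;
	}
	return false;
}
```

Feeding it { TOY_SET_ACTIVE, TOY_UNPARK_STOPPER } models the existing order
and reports a window; { TOY_UNPARK_STOPPER, TOY_SET_ACTIVE } models the order
the patch aims for and reports none. It says nothing about whether this is
the actual cause of the reported hang, which is still only a theory.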