From mboxrd@z Thu Jan 1 00:00:00 1970
Return-Path:
Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand
	id S1754156AbbCaCEl (ORCPT ); Mon, 30 Mar 2015 22:04:41 -0400
Received: from mail-wg0-f43.google.com ([74.125.82.43]:35538 "EHLO
	mail-wg0-f43.google.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org
	with ESMTP id S1753915AbbCaCEi (ORCPT );
	Mon, 30 Mar 2015 22:04:38 -0400
Message-ID: <1427767475.3147.32.camel@gmail.com>
Subject: Re: [PATCH] watchdog: nohz: don't run watchdog on nohz_full cores
From: Mike Galbraith
To: Don Zickus
Cc: cmetcalf@ezchip.com, Andrew Morton, Andrew Jones, chai wen,
	Ingo Molnar, Ulrich Obergfell, Fabian Frederick, Aaron Tomlin,
	Ben Zhang, Christoph Lameter, Frederic Weisbecker,
	Gilad Ben-Yossef, Steven Rostedt, open list
Date: Tue, 31 Mar 2015 04:04:35 +0200
In-Reply-To: <20150330191245.GO162412@redhat.com>
References: <1427741465-15747-1-git-send-email-cmetcalf@ezchip.com>
	<20150330191245.GO162412@redhat.com>
Content-Type: text/plain; charset="UTF-8"
X-Mailer: Evolution 3.12.11
Mime-Version: 1.0
Content-Transfer-Encoding: 7bit
Sender: linux-kernel-owner@vger.kernel.org
List-ID:
X-Mailing-List: linux-kernel@vger.kernel.org

On Mon, 2015-03-30 at 15:12 -0400, Don Zickus wrote:
> On Mon, Mar 30, 2015 at 02:51:05PM -0400, cmetcalf@ezchip.com wrote:
> > From: Chris Metcalf
> >
> > Running watchdog can be a helpful debugging feature on regular
> > cores, but it's incompatible with nohz_full, since it forces
> > regular scheduling events.  Accordingly, just exit out immediately
> > from any nohz_full core.
> >
> > An alternate approach would be to add a flags field or function to
> > smp_hotplug_thread to control on which cores the percpu threads
> > are created, but it wasn't clear that much mechanism was useful.
> Hi Chris,
>
> It seems like the correct solution would be to hook into the idle_loop
> somehow.  If the cpu is idle, then it seems unlikely that a lockup could
> occur.
>
> My fear with this approach is a lockup would occur on the nohz cpu and it
> would go undetected because that cpu is disabled.  Further no printk is
> thrown out to even indicate a cpu is disabled making it more difficult to
> debug.

Hm, I don't see why this is needed, for debugging/testing you turn it
on, when you set up for critical operation, you turn it off.

A bigger deal is the clocksource watchdog methinks.  Measurement
inspired me to make it dead yesterday.

	-Mike