Date: Wed, 3 Aug 2011 15:11:30 -0400
From: Don Zickus
To: ZAK Magnus
Cc: linux-kernel@vger.kernel.org, Ingo Molnar, Mandeep Singh Baines
Subject: Re: [PATCH v3 2/2] Make hard lockup detection use timestamps
Message-ID: <20110803191130.GC1972@redhat.com>
References: <20110722195340.GF3765@redhat.com> <20110725124451.GA2866@redhat.com> <20110729205538.GD14343@redhat.com> <20110801125234.GE14343@redhat.com> <20110801192407.GE2581@redhat.com>

On Mon, Aug 01, 2011 at 01:11:27PM -0700, ZAK Magnus wrote:
> On Mon, Aug 1, 2011 at 12:24 PM, Don Zickus wrote:
> > One idea I thought of to workaround this is to save the timestamp and the
> > watchdog bool and restore after the stack dump. It's a cheap hack and I
> > am not to sure about the locking as it might race with
> > touch_nmi_watchdog(). But it gives you an idea what I was thinking.
> Yes, I see. Is the hackiness of it okay?

Hi,

I don't think it is too bad. Most of the stuff is per_cpu and is intended
to be per_cpu. There might be a random case where another cpu is trying to
zero out the watchdog_nmi_touch or watchdog_touch_ts variables. I was
trying to fix the cross-cpu case for watchdog_nmi_touch to eliminate that
problem but Ingo wanted me to implement some panic ratelimit first (which
I lost track of doing).
And being in the NMI context and staying per_cpu should make that case
safe, I believe, despite the hackiness of it. watchdog_touch_ts is only
written from another cpu in the touch_all_softlockup_watchdogs() case,
which currently only happens when the scheduler is spewing stats. This
should happen rarely.

This leaves the problem of softlockups being preempted in the interrupt
context and touched by another interrupt handler. I don't know how to
solve this reliably, but I think it should be ok most of the time. The
only downside, I would think, is a premature softlockup.

I can't think of a better way to work around the problem and still move
forward with your idea of warning on future stalls. Then again, I have
been busy here and haven't put enough thought into it.

Cheers,
Don
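
To make the save/restore idea from the quoted text concrete, here is a
minimal userspace sketch. It is not the kernel code and not a proposed
patch: struct wd_state and the wd_* functions are hypothetical names that
only model the per-cpu watchdog_touch_ts and watchdog_nmi_touch variables.
It assumes the stack dump touches the watchdog as a side effect, and it
ignores the locking and cross-cpu questions discussed above.

```c
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

/* Hypothetical stand-in for one cpu's watchdog state. */
struct wd_state {
    uint64_t touch_ts;  /* models the per-cpu watchdog_touch_ts */
    bool nmi_touch;     /* models the per-cpu watchdog_nmi_touch */
};

/* Models a watchdog touch: the timestamp is zeroed and the flag is set. */
void wd_touch(struct wd_state *wd)
{
    wd->touch_ts = 0;
    wd->nmi_touch = true;
}

/* Models a stack dump whose printing touches the watchdog (assumption). */
void wd_dump_stack(struct wd_state *wd)
{
    printf("stall detected, dumping stack\n");
    wd_touch(wd);
}

/*
 * Save the state, dump the stack, then restore the saved values so that a
 * later check still sees the original stall instead of a fresh touch.
 */
void wd_warn_preserving_state(struct wd_state *wd)
{
    struct wd_state saved = *wd;

    wd_dump_stack(wd);
    *wd = saved;
}
```

Calling wd_warn_preserving_state() on a state leaves touch_ts and
nmi_touch as they were before the dump, whereas calling wd_dump_stack()
directly leaves them reset. In the real watchdog the restore could still
race with a touch from another cpu, which is the hackiness in question.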