From: dzickus@redhat.com (Don Zickus)
To: linux-arm-kernel@lists.infradead.org
Subject: [PATCH] hardlockup: detect hard lockups without NMIs using secondary cpus
Date: Thu, 10 Jan 2013 13:17:51 -0500
Message-ID: <20130110181751.GR88797@redhat.com>
In-Reply-To: <CAMbhsRT7q+DSKOdPMtUqtPZJrB_z-ixmv09TkT2ZweUJGXjkYg@mail.gmail.com>

On Thu, Jan 10, 2013 at 09:27:28AM -0800, Colin Cross wrote:
> On Thu, Jan 10, 2013 at 6:02 AM, Don Zickus <dzickus@redhat.com> wrote:
> > On Wed, Jan 09, 2013 at 05:57:39PM -0800, Colin Cross wrote:
> >> Emulate NMIs on systems where they are not available by using timer
> >> interrupts on other cpus.  Each cpu will use its softlockup hrtimer
> >> to check that the next cpu is processing hrtimer interrupts by
> >> verifying that a counter is increasing.
> >>
> >> This patch is useful on systems where the hardlockup detector is not
> >> available due to a lack of NMIs, for example most ARM SoCs.
> >
> > I have seen other cpus, like Sparc I think, create a 'virtual NMI' by
> > reserving an IRQ line as 'special' (can not be masked).  Not sure if that
> > is something worth looking at here (or even possible).
> >
> >> Without this patch any cpu stuck with interrupts disabled can
> >> cause a hardware watchdog reset with no debugging information,
> >> but with this patch the kernel can detect the lockup and panic,
> >> which can result in useful debugging info.
> >
> > <SNIP>
> >> +#ifdef CONFIG_HARDLOCKUP_DETECTOR_OTHER_CPU
> >> +static int is_hardlockup_other_cpu(int cpu)
> >> +{
> >> +     unsigned long hrint = per_cpu(hrtimer_interrupts, cpu);
> >> +
> >> +     if (per_cpu(hrtimer_interrupts_saved, cpu) == hrint)
> >> +             return 1;
> >> +
> >> +     per_cpu(hrtimer_interrupts_saved, cpu) = hrint;
> >> +     return 0;
> >
> > Will this race with the other cpu you are checking?  For example if cpuA
> > just updated its hrtimer_interrupts_saved and cpuB goes to check cpuA's
> > hrtimer_interrupts_saved, it seems possible that cpuB could falsely assume
> > cpuA is stuck?
> 
> cpuA doesn't update its own hrtimer_interrupts_saved, cpuB does.
> However, there may be a similar race condition during hotplug if cpuB
> updates hrtimer_interrupts_saved for cpuA, then goes offline, then
> cpuC may try to check cpuA and see that hrtimer_interrupts_saved ==
> hrtimer_interrupts.  I think this can be solved by setting
> watchdog_nmi_touch for the next cpu when a cpu goes online or offline.

Ah, that is where my misunderstanding was.  I overlooked the fact that it
was only updated by the other cpu.  Sorry about that.

I'll re-review it with that in mind.

Cheers,
Don
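
The ownership rule and the hotplug race discussed above can be sketched in
userspace C. This is only an illustrative model, not the patch itself: the
per-cpu variables are replaced by plain arrays, and NR_CPUS, timer_tick(),
next_online_cpu(), watchdog_check(), cpu_set_online() and reset_all() are
hypothetical helpers invented for the sketch. watchdog_touch[] stands in
for watchdog_nmi_touch, on the assumption that a set flag makes the checker
skip exactly one check. Only is_hardlockup_other_cpu() mirrors the quoted
code.

```c
#include <assert.h>
#include <stdbool.h>

#define NR_CPUS 3 /* hypothetical small system for this sketch */

/* Plain arrays standing in for the per-cpu variables in the patch. */
static unsigned long hrtimer_interrupts[NR_CPUS];
static unsigned long hrtimer_interrupts_saved[NR_CPUS];
static bool watchdog_touch[NR_CPUS]; /* stand-in for watchdog_nmi_touch */
static bool cpu_online[NR_CPUS];

/* Put the model in a known state: all cpus online, all counters zero. */
static void reset_all(void)
{
	for (int c = 0; c < NR_CPUS; c++) {
		hrtimer_interrupts[c] = 0;
		hrtimer_interrupts_saved[c] = 0;
		watchdog_touch[c] = false;
		cpu_online[c] = true;
	}
}

/* A cpu's own timer tick only bumps its own counter, never the saved copy. */
static void timer_tick(int cpu)
{
	assert(cpu >= 0 && cpu < NR_CPUS);
	hrtimer_interrupts[cpu]++;
}

/* Same logic as the quoted function; run by the checking cpu, not by 'cpu'. */
static bool is_hardlockup_other_cpu(int cpu)
{
	unsigned long hrint = hrtimer_interrupts[cpu];

	if (hrtimer_interrupts_saved[cpu] == hrint)
		return true;

	hrtimer_interrupts_saved[cpu] = hrint;
	return false;
}

/* Next online cpu after 'cpu', wrapping around; 'cpu' itself if none. */
static int next_online_cpu(int cpu)
{
	for (int i = 1; i <= NR_CPUS; i++) {
		int c = (cpu + i) % NR_CPUS;

		if (cpu_online[c])
			return c;
	}
	return cpu;
}

/* One watchdog pass on 'checker'; true means it reports a hard lockup. */
static bool watchdog_check(int checker)
{
	int target = next_online_cpu(checker);

	if (target == checker)
		return false; /* nobody else to check */

	if (watchdog_touch[target]) {
		/* Assumed semantics: a touch skips exactly one check. */
		watchdog_touch[target] = false;
		return false;
	}
	return is_hardlockup_other_cpu(target);
}

/* Hotplug event; optionally touch the next cpu as suggested in the thread. */
static void cpu_set_online(int cpu, bool online, bool touch_next)
{
	cpu_online[cpu] = online;
	if (touch_next) {
		int next = next_online_cpu(cpu);

		if (next != cpu)
			watchdog_touch[next] = true;
	}
}
```

In this model cpu 1 plays cpuB, cpu 2 plays cpuA and cpu 0 plays cpuC. If
cpu 1 checks cpu 2 and then goes offline without the touch, an immediate
watchdog_check(0) compares cpu 2's counter against the value cpu 1 just
saved and reports a lockup that is not there. With touch_next set, that
first check is skipped, and a later check still reports a lockup if cpu 2
really stops ticking.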
