From: Prarit Bhargava <prarit@redhat.com>
To: rui wang <ruiv.wang@gmail.com>
Cc: linux-kernel@vger.kernel.org,
	Thomas Gleixner <tglx@linutronix.de>,
	Ingo Molnar <mingo@redhat.com>, "H. Peter Anvin" <hpa@zytor.com>,
	x86@kernel.org, Michel Lespinasse <walken@google.com>,
	Andi Kleen <ak@linux.intel.com>,
	Seiji Aguchi <seiji.aguchi@hds.com>,
	Yang Zhang <yang.z.zhang@intel.com>,
	Paul Gortmaker <paul.gortmaker@windriver.com>,
	janet.morgan@intel.com, tony.luck@intel.com
Subject: Re: [PATCH] x86, Fix do_IRQ interrupt warning for cpu hotplug retriggered irqs
Date: Mon, 23 Dec 2013 10:29:23 -0500
Message-ID: <52B856D3.4030802@redhat.com>
In-Reply-To: <CANVTcTZ-ZkTvR0+=eFyNQ7E8R2UYo1qdA-RQ+9nzK2w=qCPkPQ@mail.gmail.com>



On 12/23/2013 04:41 AM, rui wang wrote:
> On 12/2/13, Prarit Bhargava <prarit@redhat.com> wrote:
>> Bugzilla: https://bugzilla.kernel.org/show_bug.cgi?id=64831
>>
>> When downing a cpu it is possible that there are unhandled irqs left in
>> the APIC IRR register.  fixup_irqs() goes through the IRR and retriggers
>> the IRQs left in the APIC IRR.  After this, the vector for the irq is set
>> to -1.  There is a possibility here, however, that the CPU does handle an
>> irq in the IRR and then calls the vector.
>>
> 
> The patch does not seem to root-cause the problem. It seems to hide
> the real problem.
> 
> It is not possible that a device-triggered irq can arrive to this cpu
> again after fixup_irqs() fills its vector_irq[vector] to -1, because
> we've done the following:
> 
> 1. We disabled interrupt on this cpu in stop_machine().
> 2. We called irq_set_affinity() to exclude this cpu as a target for the irq.
> 3. We checked APIC_IRR and re-triggered any pending irqs to other cpus.

... and we set the IRQ handler to -1 for the down'd cpu.

Rui, I think you're right up to here but I think this has nothing to do with IPI
or locking.

I assumed that the issue I was trying to fix was long standing and well-known
within the kernel given some of the comments I had read here-and-there about
people seeing the do_IRQ errors on LKML.  There have long been reports of the
do_IRQ warning output during cpu down.

Here's what the issue is after step 3 above...

4.  The APIC_IRR is still *set* in the down'd cpu with IRQs disabled.
5.  We continue executing the stop_machine "down" portion of the code, then
continue executing in normal context the "die" code (ie, __cpu_die()).

Disabling IRQs only applies within the stop_machine "down" context.  Once we
leave that context, the IRQ still pending in the IRR gets delivered.  While the
kernel is spinning in cpu_die(), the down'd cpu attempts to execute the handler
for the IRQ in the IRR ... and can't find one because we've set it to -1.  So we
see the warning.
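
To make the sequence in steps 1-5 concrete, here is a minimal user-space
sketch of it.  This is a toy model and not the kernel's code: struct toy_cpu,
toy_cpu_init(), toy_fixup_irqs() and toy_deliver_pending() are hypothetical
names made up for illustration, and the only thing modelled is that the
vector-to-irq entry is reset to -1 while the pending bit in the IRR stays set.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdio.h>

#define NR_VECTORS    256
#define VECTOR_UNUSED (-1)

/* Toy model of one cpu's interrupt state (hypothetical, not kernel code). */
struct toy_cpu {
	int  vector_irq[NR_VECTORS]; /* vector -> irq, -1 when unassigned */
	bool irr[NR_VECTORS];        /* pending bits, stand-in for APIC_IRR */
	bool irqs_enabled;
};

void toy_cpu_init(struct toy_cpu *c)
{
	assert(c != NULL);
	for (int v = 0; v < NR_VECTORS; v++) {
		c->vector_irq[v] = VECTOR_UNUSED;
		c->irr[v] = false;
	}
	c->irqs_enabled = true;
}

/*
 * Models step 3 plus the "set to -1" step: count the pending irqs that
 * would be retriggered on another cpu, then drop the local mapping.
 * The pending bit is deliberately left set: that is the point of the model.
 */
int toy_fixup_irqs(struct toy_cpu *c)
{
	int retriggered = 0;

	for (int v = 0; v < NR_VECTORS; v++) {
		if (c->vector_irq[v] == VECTOR_UNUSED)
			continue;
		if (c->irr[v])
			retriggered++;
		c->vector_irq[v] = VECTOR_UNUSED;
	}
	return retriggered;
}

/*
 * Models delivery once irqs are enabled again: every pending vector is
 * taken, and a vector with no mapping produces the warning.  Returns the
 * number of warnings printed.
 */
int toy_deliver_pending(struct toy_cpu *c, int cpu)
{
	int warnings = 0;

	if (!c->irqs_enabled)
		return 0;
	for (int v = 0; v < NR_VECTORS; v++) {
		if (!c->irr[v])
			continue;
		c->irr[v] = false;
		if (c->vector_irq[v] == VECTOR_UNUSED) {
			printf("do_IRQ: %d.%d No irq handler for vector (irq %d)\n",
			       cpu, v, VECTOR_UNUSED);
			warnings++;
		}
	}
	return warnings;
}
```

In this model, mapping any irq to a vector, marking that vector pending with
irqs_enabled cleared, calling toy_fixup_irqs(), and then setting irqs_enabled
before calling toy_deliver_pending() yields exactly one warning.  With the
pending bit clear at fixup time it yields none, which would match the warning
showing up only when the retrigger path was taken.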

A few additional debug points:

1.  I put a printk in fixup_irqs(), at the point where we call irq_retrigger on
another cpu, that dumps the down'd CPU and IRQ #.  I see that printk
*EVERY TIME* I see the do_IRQ warning.

2.  The do_IRQ warning *always* appears before I see the offline message ...

[  148.656016] Broke affinity for irq 634
[  148.660493] Broke affinity for irq 698
[  148.665739] kvm: disabling virtualization on CPU58
[  148.666732] PRARIT: 58.208 IRR entry ... irq_retrigger call.

at this point we've left the stop_machine() code and we're now continuing to
execute ... then we hit the cpu_die() ... which spins.

[  148.671106] do_IRQ: 58.208 No irq handler for vector (irq -1)
[  148.677544] smpboot: CPU 58 is now offline

I think I have root caused this to the IRR being set in the down'd cpu.  It is
admittedly a rare occurrence in the kernel.  I usually have to run about 1000
up/down cycles before hitting it; however, on my current test system it seems to
hit much more frequently, almost 1 in 64 times.
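
For anyone who wants to try reproducing this, a rough sketch of an up/down
loop follows.  It is an assumption about how such a test could be driven, not
the test actually used above: it writes to the standard sysfs knob
/sys/devices/system/cpu/cpuN/online, and the helper names cpu_online_path(),
set_cpu_online() and hotplug_cycles() are hypothetical.  It needs root and a
cpu that can be hotplugged.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Build the sysfs path of a cpu's "online" file; returns snprintf()'s result. */
int cpu_online_path(char *buf, size_t len, int cpu)
{
	assert(buf != NULL);
	return snprintf(buf, len, "/sys/devices/system/cpu/cpu%d/online", cpu);
}

/* Write "0" or "1" to the cpu's online file.  Returns 0 on success, -1 on error. */
int set_cpu_online(int cpu, int online)
{
	char path[64];
	FILE *f;
	int n, rc;

	n = cpu_online_path(path, sizeof(path), cpu);
	if (n < 0 || n >= (int)sizeof(path))
		return -1;
	f = fopen(path, "w");
	if (!f)
		return -1;
	rc = (fputs(online ? "1" : "0", f) == EOF) ? -1 : 0;
	/* The write may only fail when the stream is flushed on close. */
	if (fclose(f) == EOF)
		rc = -1;
	return rc;
}

/* Take the cpu down and back up `cycles` times; returns completed cycles. */
int hotplug_cycles(int cpu, int cycles)
{
	int done = 0;

	for (int i = 0; i < cycles; i++) {
		if (set_cpu_online(cpu, 0) != 0)
			break;
		if (set_cpu_online(cpu, 1) != 0)
			break;
		done++;
	}
	return done;
}
```

Calling hotplug_cycles() with a cpu number and a count on the order of 1000
from a small main(), then checking dmesg for the do_IRQ line, would
approximate the runs described above.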

P.

