All of lore.kernel.org
 help / color / mirror / Atom feed
From: Vivek Goyal <vgoyal@in.ibm.com>
To: Martin Wilck <martin.wilck@fujitsu-siemens.com>
Cc: Haren Myneni <hbabu@us.ibm.com>,
	kexec@lists.infradead.org, linux-kernel@vger.kernel.org
Subject: Re: PATCH/RFC: [kdump] fix APIC shutdown sequence
Date: Wed, 8 Aug 2007 17:08:34 +0530	[thread overview]
Message-ID: <20070808113834.GA26145@in.ibm.com> (raw)
In-Reply-To: <46B73955.2080007@fujitsu-siemens.com>

On Mon, Aug 06, 2007 at 05:08:05PM +0200, Martin Wilck wrote:
> PATCH/RFC: [kdump] fix APIC shutdown sequence
> 
> This patch fixes a problem that we have encountered
> with kdump under high I/O load on some machines.
> The machines showing the errors have an Intel ICH7
> chip set with a 6702PXH  PCI Express-to-PCI Bridge
> (8086:032c) containing an IO-APIC.
> 
> The bug symptom is that certain controllers connected
> to the 6702PXH bridge wouldn't receive any IRQs in the
> kdump kernel. In the error case (which is about 20% of
> all cases) the IRR bit of the IO-APIC pin for that
> controller is always set after the start of the kdump
> kernel, indicating an IRQ in progress. We haven't found
> a way to recover from this situation when it has once
> occured, except for a system reset.
> 
> The error is caused by IRQs arriving while the APIC
> subsystem is deactivated in machine_crash_shutdown().
> 
> Apparently, the IO-APIC gets stuck if it sends an IRQ
> message to a Local APIC and never receives an EOI for that
> message. This can have several possible reasons:
> 

Got this oops while testing your patch when I did
"echo c > /proc/sysrq-trigger"

[root@llm37 ~]# SysRq : Trigger a crashdump
Unable to handle kernel NULL pointer dereference at 0000000000000000 RIP:
 [<0000000000000000>]
PGD 22229b067 PUD 0
Oops: 0000 [1] SMP
CPU 0
Modules linked in:
Pid: 2947, comm: klogd Not tainted 2.6.23-rc1-apic-issue #2
RIP: 0010:[<0000000000000000>]  [<0000000000000000>]
RSP: 0018:ffffffff807daf50  EFLAGS: 00010046
RAX: 0000000000000000 RBX: ffffffff8073f480 RCX: ffffffff807ddea0
RDX: ffffffff807717c0 RSI: ffffffff8073f480 RDI: 0000000000000000
RBP: 0000000000000000 R08: 0000000000000000 R09: 0000000000000000
R10: ffffffff807ddd28 R11: 0000000000000150 R12: 0000000000000000
R13: ffffffff8073f4d0 R14: 000000000000000c R15: 0000000000000000
FS:  00002b204ea686f0(0000) GS:ffffffff8073c000(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
CR2: 0000000000000000 CR3: 0000000223e9e000 CR4: 00000000000006e0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
Process klogd (pid: 2947, threadinfo ffff81021b7b8000, task ffff8102251a27b0)
Stack:  ffffffff8025afd0 0000000000000046 ffffffff807dddf8 0000000000000000
 0000000000000030 0000000000000000 ffffffff8020e1ae ffffffff807d15c8
 0000000000000000 ffffffff807dde20 0000000000000000 ffffffff807ddee8
Call Trace:
 <IRQ>  [<ffffffff8025afd0>] handle_edge_irq+0x5c/0x127
 [<ffffffff8020e1ae>] do_IRQ+0xf1/0x15f
 [<ffffffff8020c191>] ret_from_intr+0x0/0xa
 <EOI>  <NMI>  [<ffffffff80351ae3>] __delay+0x6/0x10
 [<ffffffff8021e328>] crash_nmi_callback+0x4b/0x77
 [<ffffffff80557075>] notifier_call_chain+0x29/0x4c
 [<ffffffff80249a11>] notify_die+0x2d/0x34
 [<ffffffff805555a3>] default_do_nmi+0x55/0x197
 [<ffffffff80555f0e>] do_nmi+0x2e/0x44
 [<ffffffff805553af>] nmi+0x7f/0x90
 [<ffffffff8022e00a>] find_busiest_group+0x20e/0x6b3
 <<EOE>>  [<ffffffff80552d13>] __sched_text_start+0x1cb/0x5ba
 [<ffffffff80282294>] do_sync_write+0xc9/0x10c
 [<ffffffff80235c32>] do_syslog+0x11d/0x3a9
 [<ffffffff80246a03>] autoremove_wake_function+0x0/0x2e
 [<ffffffff802731c6>] free_pages_and_swap_cache+0x73/0x8f
 [<ffffffff802bbeac>] kmsg_read+0x3a/0x44
 [<ffffffff802b5245>] proc_reg_read+0x7e/0x99
 [<ffffffff80282b08>] vfs_read+0xaa/0x132
 [<ffffffff80282ea4>] sys_read+0x45/0x6e
 [<ffffffff8020bc7e>] system_call+0x7e/0x83


Code:  Bad RIP value.
RIP  [<0000000000000000>]
 RSP <ffffffff807daf50>
CR2: 0000000000000000

Thanks
Vivek

_______________________________________________
kexec mailing list
kexec@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/kexec

WARNING: multiple messages have this Message-ID (diff)
From: Vivek Goyal <vgoyal@in.ibm.com>
To: Martin Wilck <martin.wilck@fujitsu-siemens.com>
Cc: Haren Myneni <hbabu@us.ibm.com>,
	kexec@lists.infradead.org, linux-kernel@vger.kernel.org
Subject: Re: PATCH/RFC: [kdump] fix APIC shutdown sequence
Date: Wed, 8 Aug 2007 17:08:34 +0530	[thread overview]
Message-ID: <20070808113834.GA26145@in.ibm.com> (raw)
In-Reply-To: <46B73955.2080007@fujitsu-siemens.com>

On Mon, Aug 06, 2007 at 05:08:05PM +0200, Martin Wilck wrote:
> PATCH/RFC: [kdump] fix APIC shutdown sequence
> 
> This patch fixes a problem that we have encountered
> with kdump under high I/O load on some machines.
> The machines showing the errors have an Intel ICH7
> chip set with a 6702PXH  PCI Express-to-PCI Bridge
> (8086:032c) containing an IO-APIC.
> 
> The bug symptom is that certain controllers connected
> to the 6702PXH bridge wouldn't receive any IRQs in the
> kdump kernel. In the error case (which is about 20% of
> all cases) the IRR bit of the IO-APIC pin for that
> controller is always set after the start of the kdump
> kernel, indicating an IRQ in progress. We haven't found
> a way to recover from this situation when it has once
> occured, except for a system reset.
> 
> The error is caused by IRQs arriving while the APIC
> subsystem is deactivated in machine_crash_shutdown().
> 
> Apparently, the IO-APIC gets stuck if it sends an IRQ
> message to a Local APIC and never receives an EOI for that
> message. This can have several possible reasons:
> 

Got this oops while testing your patch when I did
"echo c > /proc/sysrq-trigger"

[root@llm37 ~]# SysRq : Trigger a crashdump
Unable to handle kernel NULL pointer dereference at 0000000000000000 RIP:
 [<0000000000000000>]
PGD 22229b067 PUD 0
Oops: 0000 [1] SMP
CPU 0
Modules linked in:
Pid: 2947, comm: klogd Not tainted 2.6.23-rc1-apic-issue #2
RIP: 0010:[<0000000000000000>]  [<0000000000000000>]
RSP: 0018:ffffffff807daf50  EFLAGS: 00010046
RAX: 0000000000000000 RBX: ffffffff8073f480 RCX: ffffffff807ddea0
RDX: ffffffff807717c0 RSI: ffffffff8073f480 RDI: 0000000000000000
RBP: 0000000000000000 R08: 0000000000000000 R09: 0000000000000000
R10: ffffffff807ddd28 R11: 0000000000000150 R12: 0000000000000000
R13: ffffffff8073f4d0 R14: 000000000000000c R15: 0000000000000000
FS:  00002b204ea686f0(0000) GS:ffffffff8073c000(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
CR2: 0000000000000000 CR3: 0000000223e9e000 CR4: 00000000000006e0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
Process klogd (pid: 2947, threadinfo ffff81021b7b8000, task ffff8102251a27b0)
Stack:  ffffffff8025afd0 0000000000000046 ffffffff807dddf8 0000000000000000
 0000000000000030 0000000000000000 ffffffff8020e1ae ffffffff807d15c8
 0000000000000000 ffffffff807dde20 0000000000000000 ffffffff807ddee8
Call Trace:
 <IRQ>  [<ffffffff8025afd0>] handle_edge_irq+0x5c/0x127
 [<ffffffff8020e1ae>] do_IRQ+0xf1/0x15f
 [<ffffffff8020c191>] ret_from_intr+0x0/0xa
 <EOI>  <NMI>  [<ffffffff80351ae3>] __delay+0x6/0x10
 [<ffffffff8021e328>] crash_nmi_callback+0x4b/0x77
 [<ffffffff80557075>] notifier_call_chain+0x29/0x4c
 [<ffffffff80249a11>] notify_die+0x2d/0x34
 [<ffffffff805555a3>] default_do_nmi+0x55/0x197
 [<ffffffff80555f0e>] do_nmi+0x2e/0x44
 [<ffffffff805553af>] nmi+0x7f/0x90
 [<ffffffff8022e00a>] find_busiest_group+0x20e/0x6b3
 <<EOE>>  [<ffffffff80552d13>] __sched_text_start+0x1cb/0x5ba
 [<ffffffff80282294>] do_sync_write+0xc9/0x10c
 [<ffffffff80235c32>] do_syslog+0x11d/0x3a9
 [<ffffffff80246a03>] autoremove_wake_function+0x0/0x2e
 [<ffffffff802731c6>] free_pages_and_swap_cache+0x73/0x8f
 [<ffffffff802bbeac>] kmsg_read+0x3a/0x44
 [<ffffffff802b5245>] proc_reg_read+0x7e/0x99
 [<ffffffff80282b08>] vfs_read+0xaa/0x132
 [<ffffffff80282ea4>] sys_read+0x45/0x6e
 [<ffffffff8020bc7e>] system_call+0x7e/0x83


Code:  Bad RIP value.
RIP  [<0000000000000000>]
 RSP <ffffffff807daf50>
CR2: 0000000000000000

Thanks
Vivek

  parent reply	other threads:[~2007-08-08 11:38 UTC|newest]

Thread overview: 48+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2007-08-06 15:08 PATCH/RFC: [kdump] fix APIC shutdown sequence Martin Wilck
2007-08-06 15:08 ` Martin Wilck
2007-08-07 14:29 ` Vivek Goyal
2007-08-07 14:29   ` Vivek Goyal
2007-08-07 17:41   ` Martin Wilck
2007-08-07 17:41     ` Martin Wilck
2007-08-08  1:04     ` Eric W. Biederman
2007-08-08  1:04       ` Eric W. Biederman
2007-08-08  9:03       ` Martin Wilck
2007-08-08  9:03         ` Martin Wilck
2007-08-08  9:33         ` Vivek Goyal
2007-08-08  9:33           ` Vivek Goyal
2007-08-08 12:04           ` Martin Wilck
2007-08-08 12:04             ` Martin Wilck
2007-08-08 15:21         ` Eric W. Biederman
2007-08-08 15:21           ` Eric W. Biederman
2007-08-08 17:35           ` Martin Wilck
2007-08-08 17:35             ` Martin Wilck
2007-08-08 17:56             ` Eric W. Biederman
2007-08-08 17:56               ` Eric W. Biederman
2007-08-08 18:22               ` Martin Wilck
2007-08-08 18:22                 ` Martin Wilck
2007-08-08 18:38               ` Martin Wilck
2007-08-08 18:38                 ` Martin Wilck
2007-08-08 10:36     ` Vivek Goyal
2007-08-08 10:36       ` Vivek Goyal
2007-08-08 14:06       ` Chip Coldwell
2007-08-08 14:06         ` Chip Coldwell
2007-08-08 14:42         ` Vivek Goyal
2007-08-08 14:42           ` Vivek Goyal
2007-08-08 18:15           ` Martin Wilck
2007-08-08 18:15             ` Martin Wilck
2007-08-09 10:11             ` Vivek Goyal
2007-08-09 10:11               ` Vivek Goyal
2007-08-09 17:35               ` Martin Wilck
2007-08-09 17:35                 ` Martin Wilck
2007-08-07 19:44   ` Chip Coldwell
2007-08-07 19:44     ` Chip Coldwell
2007-08-08  0:29 ` Andrew Morton
2007-08-08  0:29   ` Andrew Morton
2007-08-08  8:32   ` Martin Wilck
2007-08-08  8:32     ` Martin Wilck
2007-08-08 11:38 ` Vivek Goyal [this message]
2007-08-08 11:38   ` Vivek Goyal
2007-08-08 18:07   ` Martin Wilck
2007-08-08 18:07     ` Martin Wilck
2007-08-08 21:25     ` Eric W. Biederman
2007-08-08 21:25       ` Eric W. Biederman

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20070808113834.GA26145@in.ibm.com \
    --to=vgoyal@in.ibm.com \
    --cc=hbabu@us.ibm.com \
    --cc=kexec@lists.infradead.org \
    --cc=linux-kernel@vger.kernel.org \
    --cc=martin.wilck@fujitsu-siemens.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.