linuxppc-dev.lists.ozlabs.org archive mirror
From: Benjamin Herrenschmidt <benh@au1.ibm.com>
To: Gavin Shan <shangw@linux.vnet.ibm.com>
Cc: linuxppc-dev@ozlabs.org, Anton Blanchard <anton@samba.org>
Subject: Re: [PATCH v5 00/21] EEH reorganization
Date: Tue, 17 Apr 2012 11:57:51 +1000
Message-ID: <1334627871.25353.26.camel@pasglop>
In-Reply-To: <20120417113738.0f091da4@kryten>

On Tue, 2012-04-17 at 11:37 +1000, Anton Blanchard wrote:
> 
> No. I replaced that backtrace in eeh_dn_check_failure with a WARN_ON()
> because the backtrace doesn't give us enough info. I'm submitting a
> patch for that today.
> 
> Bottom line is mstmread has been causing an EEH error since at least
> 3.0, but in 3.4 we now oops instead of recovering. The signs all point
> to the EEH rework in 3.4.

More precisely, the original oops reported by Anton decodes as follows:

>Oops: Kernel access of bad area, sig: 11 [#1]

This typically indicates a bad memory access.

>SMP NR_CPUS=1024 NUMA pSeries
>Modules linked in:
>NIP: c000000000055af8 LR: c000000000033204 CTR: 0000000000000000
>REGS: c000001f42fb7990 TRAP: 0300   Tainted: G        W     (3.4.0-rc2-00065-gf549e08-dirty)

TRAP: 0300 means that this is the result of a data access interrupt, i.e. a
load or store to a bad address.
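
As a rough aid for reading the TRAP field, the sketch below maps a few of the
common 64-bit PowerPC exception vectors to a short description. The table is
partial and purely illustrative; the names TRAP_NAMES and describe_trap are
made up for this example and are not kernel interfaces.

```python
# Partial, illustrative table of 64-bit PowerPC exception vectors as they
# appear in the "TRAP:" field of an oops. Not an exhaustive list.
TRAP_NAMES = {
    0x200: "machine check",
    0x300: "data storage interrupt (load/store to a bad address)",
    0x380: "data segment interrupt (segment miss on a data access)",
    0x400: "instruction storage interrupt (fetch from a bad address)",
    0x480: "instruction segment interrupt",
    0x500: "external interrupt",
    0x600: "alignment interrupt",
    0x700: "program check (illegal instruction, trap, BUG)",
    0x900: "decrementer",
    0xC00: "system call",
}


def describe_trap(field):
    """Translate the hex string from a 'TRAP: 0300' field into a description."""
    vector = int(field, 16)
    return TRAP_NAMES.get(vector, "unknown vector 0x%x" % vector)


if __name__ == "__main__":
    print(describe_trap("0300"))
```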

>MSR: 8000000000009032 <SF,EE,ME,IR,DR,RI>  CR: 24008084  XER: 00000000
>SOFTE: 1
>CFAR: 00000000000049b8
>DAR: 0000000000000070, DSISR: 40000000

Here the DAR tells us what address was accessed. 0x70 is a strong indication
that this was an access to a NULL pointer (at offset 0x70 from that pointer).

It -might- be something else (such as a NULL passed to a list head or such)
but the idea that there's a NULL floating around is a good hint.
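
The same reasoning can be written down as a small heuristic: a DAR that falls
within the first page is most likely a NULL base pointer plus a structure
offset, and the DSISR bits say whether the access was a load or a store. The
sketch below assumes a 4K threshold for "near NULL" (an arbitrary choice for
illustration) and borrows the DSISR bit names from the kernel headers; the
function classify_fault is hypothetical.

```python
# Assumption: any address inside the first 4K is treated as "NULL + offset".
NEAR_NULL_LIMIT = 4096

# DSISR bits of interest (names follow the kernel headers, values illustrative
# of the classic hash-MMU layout).
DSISR_NOHPTE = 0x40000000     # no translation found for the address
DSISR_PROTFAULT = 0x08000000  # access refused by page protection
DSISR_ISSTORE = 0x02000000    # faulting access was a store


def classify_fault(dar, dsisr):
    """Return (access kind, fault reason, guess about the pointer)."""
    kind = "store" if dsisr & DSISR_ISSTORE else "load"
    if dsisr & DSISR_NOHPTE:
        reason = "no translation"
    elif dsisr & DSISR_PROTFAULT:
        reason = "protection fault"
    else:
        reason = "other"
    if dar < NEAR_NULL_LIMIT:
        guess = "likely NULL pointer + offset 0x%x" % dar
    else:
        guess = "not near NULL"
    return kind, reason, guess


if __name__ == "__main__":
    # Values taken from the oops above: DAR 0x70, DSISR 0x40000000.
    print(classify_fault(0x70, 0x40000000))
```

With the values from this oops it reports a load with no translation, at
offset 0x70 from what is probably a NULL pointer.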

>TASK = c000001f6c7dfc40[19010] 'eehd' THREAD: c000001f42fb4000 CPU: 6
>GPR00: 0000000000000001 c000001f42fb7c10 c000000000bd3a28 c000001f80ab0800 
>GPR04: c000001f7c57d418 0000000000000380 c000001f7c57e070 c000000000ed5360 
>GPR08: 0000000000000000 c000000000c77088 0000000000000000 0000000000000001 
>GPR12: 0000000044008088 c00000000eda1500 00000000019ffa78 0000000000a70000 
>GPR16: 00000000000000bb c000000000a9f754 c000000000963230 000000000000005e 
>GPR20: 0000000001b37e80 00000000000000bb 0000000000000000 c000000000b0ad90 
>GPR24: 0000000000000000 c000000000b10588 0000000000000001 c000001f80ab0800 
>GPR28: 0000000000000000 c000001f80ab0828 0000000000000000 c000001f7ee10000 
>NIP [c000000000055af8] .eeh_add_device_tree_late+0x58/0xf0

This is the function where it happened: eeh_add_device_tree_late, at offset
0x58 into its 0xf0 bytes of code.

>LR [c000000000033204] .pcibios_finish_adding_to_bus+0x34/0x50
>Call Trace:
>[c000001f42fb7c10] [00000000fdffffff] 0xfdffffff (unreliable)
>[c000001f42fb7ca0] [c000000000033204] .pcibios_finish_adding_to_bus+0x34/0x50
>[c000001f42fb7d20] [c000000000059a5c] .pcibios_add_pci_devices+0x7c/0x190
>[c000001f42fb7db0] [c000000000057a6c] .eeh_reset_device+0xfc/0x1a0
>[c000001f42fb7e50] [c000000000057e18] .handle_eeh_events+0x308/0x480
>[c000001f42fb7f00] [c0000000000584dc] .eeh_event_handler+0x13c/0x1d0
>[c000001f42fb7f90] [c00000000002099c] .kernel_thread+0x54/0x70

And here is the backtrace. You can see that you got an EEH event, which
triggered an EEH reset, which in turn called pcibios_add_pci_devices(), and
so on.
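
To read the chain in calling order rather than innermost-first, the trace
lines can be parsed and reversed. This is only a sketch: the regular
expression and the helper call_chain are written for this example and assume
the "[sp] [address] .symbol+0xoff/0xsize" layout shown above. Frames without
a symbol (such as the "unreliable" first entry) are skipped, and the faulting
function itself comes from NIP, not from the trace.

```python
import re

# Call Trace lines copied from the oops above (quote markers removed).
TRACE = """\
[c000001f42fb7c10] [00000000fdffffff] 0xfdffffff (unreliable)
[c000001f42fb7ca0] [c000000000033204] .pcibios_finish_adding_to_bus+0x34/0x50
[c000001f42fb7d20] [c000000000059a5c] .pcibios_add_pci_devices+0x7c/0x190
[c000001f42fb7db0] [c000000000057a6c] .eeh_reset_device+0xfc/0x1a0
[c000001f42fb7e50] [c000000000057e18] .handle_eeh_events+0x308/0x480
[c000001f42fb7f00] [c0000000000584dc] .eeh_event_handler+0x13c/0x1d0
[c000001f42fb7f90] [c00000000002099c] .kernel_thread+0x54/0x70
"""

# [stack pointer] [return address] .symbol+0xoffset/0xsize
FRAME_RE = re.compile(
    r"\[([0-9a-f]+)\] \[([0-9a-f]+)\] \.?(\w+)\+0x([0-9a-f]+)/0x([0-9a-f]+)"
)


def call_chain(text):
    """Return the symbol names from outermost caller to innermost callee."""
    symbols = []
    for line in text.splitlines():
        match = FRAME_RE.search(line)
        if match:
            symbols.append(match.group(3))
    return list(reversed(symbols))


if __name__ == "__main__":
    print(" -> ".join(call_chain(TRACE)))
```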

>Instruction dump:
>480000a8 60000000 ebff0000 7fbfe800 419e0098 2fbf0000 419e005c e9229eb0 
>80090008 2f800000 419e004c ebdf01d0 <e81e0070> 7fbf0000 3160ffff 7d2b0110
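
The word in angle brackets in the instruction dump is the instruction at NIP,
i.e. the one that faulted. Decoding it by hand ties the pieces together; the
sketch below handles only DS-form instructions with primary opcode 58
(ld/ldu/lwa), and decode_ld is a helper written for this example, not a
general disassembler.

```python
def decode_ld(word):
    """Decode a DS-form instruction with primary opcode 58 (ld, ldu, lwa)."""
    opcode = (word >> 26) & 0x3F
    if opcode != 58:
        raise ValueError("not a primary-opcode-58 instruction: %#010x" % word)
    rt = (word >> 21) & 0x1F          # target register
    ra = (word >> 16) & 0x1F          # base register
    disp = word & 0xFFFC              # displacement, low two bits are the XO
    if disp & 0x8000:                 # sign-extend the 16-bit displacement
        disp -= 0x10000
    mnemonic = {0: "ld", 1: "ldu", 2: "lwa"}.get(word & 0x3)
    if mnemonic is None:
        raise ValueError("reserved extended opcode in %#010x" % word)
    return "%s r%d,%d(r%d)" % (mnemonic, rt, disp, ra)


if __name__ == "__main__":
    print(decode_ld(0xEBDF01D0))  # instruction just before the fault
    print(decode_ld(0xE81E0070))  # faulting instruction at NIP
```

The faulting word decodes to a load from offset 112 (0x70) of r30, and GPR30
is 0 in the register dump, which matches the DAR of 0x70. The word before it
loads r30 from offset 464 (0x1d0) of r31, so the NULL came out of whatever
structure r31 points to. Which structure and field that is cannot be told
from the dump alone.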

Cheers,
Ben.

Thread overview: 35+ messages
2012-02-28  6:03 [PATCH v5 00/21] EEH reorganization Gavin Shan
2012-02-28  6:03 ` [PATCH 01/21] Cleanup on comments of EEH core Gavin Shan
2012-02-28  6:03 ` [PATCH 02/21] Cleanup on function names " Gavin Shan
2012-02-28  6:03 ` [PATCH 03/21] Platform dependent EEH operations Gavin Shan
2012-02-28  6:03 ` [PATCH 04/21] pSeries platform EEH initialization Gavin Shan
2012-02-28  6:03 ` [PATCH 05/21] pSeries platform EEH operation Gavin Shan
2012-02-28  6:03 ` [PATCH 06/21] pSeries platform EEH PE address retrieval Gavin Shan
2012-02-28  6:03 ` [PATCH 07/21] pSeries platform PE state retrieval Gavin Shan
2012-02-28  6:03 ` [PATCH 08/21] pSeries platform EEH wait PE state Gavin Shan
2012-02-28  6:03 ` [PATCH 09/21] pSeries platform EEH reset PE Gavin Shan
2012-02-28  6:04 ` [PATCH 10/21] pSeries platform EEH error log retrieval Gavin Shan
2012-02-28  6:04 ` [PATCH 11/21] pSeries platform EEH configure bridge Gavin Shan
2012-02-28  6:04 ` [PATCH 12/21] Cleanup on comments of EEH aux components Gavin Shan
2012-02-28  6:04 ` [PATCH 13/21] Cleanup on function names " Gavin Shan
2012-02-28  6:04 ` [PATCH 14/21] Introduce EEH device Gavin Shan
2012-02-28  6:04 ` [PATCH 15/21] Replace pci_dn with eeh_dev for EEH sysfs Gavin Shan
2012-02-28  6:04 ` [PATCH 16/21] Replace pci_dn with eeh_dev for EEH address cache Gavin Shan
2012-02-28  6:04 ` [PATCH 17/21] Replace pci_dn with eeh_dev for EEH core Gavin Shan
2012-02-28  6:04 ` [PATCH 18/21] Replace pci_dn with eeh_dev for EEH aux components Gavin Shan
2012-02-28  6:04 ` [PATCH 19/21] Replace pci_dn with eeh_dev for EEH on pSeries Gavin Shan
2012-02-28  6:04 ` [PATCH 20/21] Introduce struct eeh_stats for EEH Gavin Shan
2012-02-28 10:04   ` David Laight
2012-02-29  1:08     ` Gavin Shan
2012-02-29  2:25   ` Gavin Shan
2012-02-29 12:56   ` Michael Ellerman
2012-03-01  1:14     ` Gavin Shan
2012-03-01  1:47   ` [PATCH 20/21] Introduce struct eeh_stats for EEH - Reworked Gavin Shan
2012-02-28  6:04 ` [PATCH 21/21] pSeries platform config space access in EEH Gavin Shan
2012-02-29  3:04 ` [PATCH v5 00/21] EEH reorganization Gavin Shan
2012-04-12 21:39 ` Anton Blanchard
2012-04-13  2:03   ` Anton Blanchard
2012-04-17  1:29     ` Gavin Shan
2012-04-17  1:37       ` Anton Blanchard
2012-04-17  1:57         ` Benjamin Herrenschmidt [this message]
2012-04-17  5:30           ` Gavin Shan
