From: Benjamin Herrenschmidt <benh@au1.ibm.com>
To: Gavin Shan <shangw@linux.vnet.ibm.com>
Cc: linuxppc-dev@ozlabs.org, Anton Blanchard <anton@samba.org>
Subject: Re: [PATCH v5 00/21] EEH reorganization
Date: Tue, 17 Apr 2012 11:57:51 +1000
Message-ID: <1334627871.25353.26.camel@pasglop>
In-Reply-To: <20120417113738.0f091da4@kryten>

On Tue, 2012-04-17 at 11:37 +1000, Anton Blanchard wrote:
> 
> No. I replaced that backtrace in eeh_dn_check_failure with a WARN_ON()
> because the backtrace doesn't give us enough info. I'm submitting a
> patch for that today.
> 
> Bottom line is mstmread has been causing an EEH error since at least
> 3.0, but in 3.4 we now oops instead of recovering. The signs all point
> to the EEH rework in 3.4.

More precisely, the original oops reported by Anton decodes as follows:

>Oops: Kernel access of bad area, sig: 11 [#1]

This is typically a bad memory access.

>SMP NR_CPUS=1024 NUMA pSeries
>Modules linked in:
>NIP: c000000000055af8 LR: c000000000033204 CTR: 0000000000000000
>REGS: c000001f42fb7990 TRAP: 0300   Tainted: G        W     (3.4.0-rc2-00065-gf549e08-dirty)

TRAP: 0300 means that it's the result of a data access interrupt, i.e. a
load or store to a bad address.

>MSR: 8000000000009032 <SF,EE,ME,IR,DR,RI>  CR: 24008084  XER: 00000000
>SOFTE: 1
>CFAR: 00000000000049b8
>DAR: 0000000000000070, DSISR: 40000000

Here the DAR tells us what address was accessed. 0x70 is a strong indication
that this was an access to a NULL pointer (at offset 0x70 from that pointer).

It -might- be something else (a NULL passed to a list head, for instance),
but the idea that there's a NULL floating around is a good hint.
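
As a rough illustration of that reasoning, here is a small Python sketch that
classifies the fault from the DAR and DSISR values above. The bit masks are my
recollection of the DSISR_* definitions in the powerpc headers, and the
PAGE_SIZE threshold for "small offset from NULL" is an arbitrary choice made
for this example, so treat both as assumptions rather than a reference.

```python
# Sketch only: classify a data access fault from DAR and DSISR.
# Masks assumed to match the Linux powerpc DSISR_* definitions.
DSISR_NOHPTE = 0x40000000     # no translation found for the address
DSISR_PROTFAULT = 0x08000000  # protection fault
DSISR_ISSTORE = 0x02000000    # access was a store (otherwise a load)

# Hypothetical threshold: addresses inside the first page are treated as
# "NULL pointer plus a small field offset".
PAGE_SIZE = 0x1000


def describe_fault(dar, dsisr):
    """Return (access kind, reason, looks like NULL+offset)."""
    kind = "store" if dsisr & DSISR_ISSTORE else "load"
    if dsisr & DSISR_NOHPTE:
        reason = "no translation"
    elif dsisr & DSISR_PROTFAULT:
        reason = "protection fault"
    else:
        reason = "other"
    return kind, reason, dar < PAGE_SIZE


# Values taken from the oops: DAR 0x70, DSISR 0x40000000.
result = describe_fault(dar=0x70, dsisr=0x40000000)
print(result)  # ('load', 'no translation', True)
```

Under those assumptions the oops reads as a load from an unmapped address
0x70 bytes past NULL, which fits the NULL pointer theory.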

>TASK = c000001f6c7dfc40[19010] 'eehd' THREAD: c000001f42fb4000 CPU: 6
>GPR00: 0000000000000001 c000001f42fb7c10 c000000000bd3a28 c000001f80ab0800 
>GPR04: c000001f7c57d418 0000000000000380 c000001f7c57e070 c000000000ed5360 
>GPR08: 0000000000000000 c000000000c77088 0000000000000000 0000000000000001 
>GPR12: 0000000044008088 c00000000eda1500 00000000019ffa78 0000000000a70000 
>GPR16: 00000000000000bb c000000000a9f754 c000000000963230 000000000000005e 
>GPR20: 0000000001b37e80 00000000000000bb 0000000000000000 c000000000b0ad90 
>GPR24: 0000000000000000 c000000000b10588 0000000000000001 c000001f80ab0800 
>GPR28: 0000000000000000 c000001f80ab0828 0000000000000000 c000001f7ee10000 
>NIP [c000000000055af8] .eeh_add_device_tree_late+0x58/0xf0

This is the function where it happened (eeh_add_device_tree_late), at offset
0x58 into it.

>LR [c000000000033204] .pcibios_finish_adding_to_bus+0x34/0x50
>Call Trace:
>[c000001f42fb7c10] [00000000fdffffff] 0xfdffffff (unreliable)
>[c000001f42fb7ca0] [c000000000033204] .pcibios_finish_adding_to_bus+0x34/0x50
>[c000001f42fb7d20] [c000000000059a5c] .pcibios_add_pci_devices+0x7c/0x190
>[c000001f42fb7db0] [c000000000057a6c] .eeh_reset_device+0xfc/0x1a0
>[c000001f42fb7e50] [c000000000057e18] .handle_eeh_events+0x308/0x480
>[c000001f42fb7f00] [c0000000000584dc] .eeh_event_handler+0x13c/0x1d0
>[c000001f42fb7f90] [c00000000002099c] .kernel_thread+0x54/0x70

And here is your backtrace. Reading it from the bottom up, you can see that you
got an EEH event, which triggered an EEH reset, which in turn called
pcibios_add_pci_devices() etc...
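
The trace lists the innermost frame first, so the causal order is the reverse
of what is printed. A minimal Python sketch that pulls the function names out
of the quoted trace and prints them oldest caller first (the regular
expression is written for this particular oops layout and is an assumption,
not a general parser):

```python
import re

# Call trace lines copied from the oops above, including the ">" quoting.
TRACE = """\
>[c000001f42fb7c10] [00000000fdffffff] 0xfdffffff (unreliable)
>[c000001f42fb7ca0] [c000000000033204] .pcibios_finish_adding_to_bus+0x34/0x50
>[c000001f42fb7d20] [c000000000059a5c] .pcibios_add_pci_devices+0x7c/0x190
>[c000001f42fb7db0] [c000000000057a6c] .eeh_reset_device+0xfc/0x1a0
>[c000001f42fb7e50] [c000000000057e18] .handle_eeh_events+0x308/0x480
>[c000001f42fb7f00] [c0000000000584dc] .eeh_event_handler+0x13c/0x1d0
>[c000001f42fb7f90] [c00000000002099c] .kernel_thread+0x54/0x70
"""

# [stack pointer] [return address] .function+offset/size
FRAME = re.compile(
    r"\[[0-9a-f]+\] \[[0-9a-f]+\] \.(\w+)\+0x[0-9a-f]+/0x[0-9a-f]+"
)


def call_chain(trace):
    """Return symbolized function names, outermost caller first."""
    names = [m.group(1) for m in FRAME.finditer(trace)]
    return list(reversed(names))


chain = call_chain(TRACE)
print(" -> ".join(chain))
```

The unsymbolized first frame (marked unreliable) is skipped because it has no
function name; the faulting function itself comes from the NIP line, not from
the call trace.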

>Instruction dump:
>480000a8 60000000 ebff0000 7fbfe800 419e0098 2fbf0000 419e005c e9229eb0 
>80090008 2f800000 419e004c ebdf01d0 <e81e0070> 7fbf0000 3160ffff
>7d2b0110 
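
The word in angle brackets in the instruction dump is the instruction at NIP.
As a sketch, it can be decoded by hand. The Python below assumes the Power ISA
DS-form layout for primary opcode 58 (ld/ldu/lwa) and takes the register
values from the GPR dump above; it does not say which structure or field is
involved, only which register was NULL.

```python
def decode_ds_form(word):
    """Decode a DS-form load with primary opcode 58.

    Returns (mnemonic, rt, ra, displacement).
    """
    opcode = word >> 26
    if opcode != 58:
        raise ValueError("not a primary opcode 58 instruction")
    rt = (word >> 21) & 0x1F          # target register
    ra = (word >> 16) & 0x1F          # base register
    disp = word & 0xFFFC              # DS field with the two XO bits cleared
    if disp & 0x8000:                 # sign-extend the 16-bit displacement
        disp -= 0x10000
    mnemonic = {0: "ld", 1: "ldu", 2: "lwa"}.get(word & 0x3)
    if mnemonic is None:
        raise ValueError("reserved extended opcode")
    return mnemonic, rt, ra, disp


# Faulting instruction (marked <...> in the dump) and the one before it.
faulting = decode_ds_form(0xE81E0070)
previous = decode_ds_form(0xEBDF01D0)
print(faulting)  # ('ld', 0, 30, 112)  i.e. ld r0,0x70(r30)
print(previous)  # ('ld', 30, 31, 464) i.e. ld r30,0x1d0(r31)

# GPR30 from the register dump is 0, so the effective address is the DAR.
GPR30 = 0x0
effective_address = GPR30 + faulting[3]
print(hex(effective_address))  # 0x70
```

If that decoding is right, the pointer loaded from offset 0x1d0 of whatever
r31 points to was NULL, and the following load at offset 0x70 from it is what
faulted, matching DAR = 0x70.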

Cheers,
Ben.
