* eeh bug
@ 2007-05-17 4:46 Benjamin Herrenschmidt
2007-05-17 4:59 ` Benjamin Herrenschmidt
0 siblings, 1 reply; 4+ messages in thread
From: Benjamin Herrenschmidt @ 2007-05-17 4:46 UTC (permalink / raw)
To: Linas Vepstas; +Cc: linuxppc-dev list
Hi Linas !
While debugging some other issues, I had a couple of oopses caused by
what looks like a bug in EEH:
When an RTAS PCI config space call returns all f's, we do an eeh error
check by calling eeh_dn_check_failure(pdn->node, NULL);
The problem is that second argument... NULL for the pci_dev *. It looks
like the EEH code will try to printk pci_name of that and later on
dereference it within eehd, thus causing an oops.
I'm not sure what's the best way to fix it and have no time to dig right
now though I suppose one could walk all PCI devices in the system to
look for a match with the dev node.
Though if for some reason we don't find a match, I think we should still
handle NULL's gracefully.
Cheers,
Ben.
^ permalink raw reply [flat|nested] 4+ messages in thread
* Re: eeh bug
2007-05-17 4:46 eeh bug Benjamin Herrenschmidt
@ 2007-05-17 4:59 ` Benjamin Herrenschmidt
2007-05-17 16:44 ` Linas Vepstas
0 siblings, 1 reply; 4+ messages in thread
From: Benjamin Herrenschmidt @ 2007-05-17 4:59 UTC (permalink / raw)
To: Linas Vepstas; +Cc: linuxppc-dev list
On Thu, 2007-05-17 at 14:46 +1000, Benjamin Herrenschmidt wrote:
> Hi Linas !
>
> While debugging some other issues, I had a couple of oopses caused by
> what looks like a bug in EEH:
>
> When an RTAS PCI config space call returns all f's, we do an eeh error
> check by calling eeh_dn_check_failure(pdn->node, NULL);
>
> The problem is that second argument... NULL for the pci_dev *. It looks
> like the EEH code will try to printk pci_name of that and later on
> dereference it within eehd, thus causing an oops.
Ok, so I just added a
if (dev == NULL)
dev = pdn->pcidev;
To eeh_dn_check_failure(), and that fixes one of the NULL (name
printing), but I get another one a bit later, in pci_find_capability
called from eeh_slot_error_detail called from handle_eeh_events.
(Probably in gather_pci_data).
One thing that looks suspicions is that just before that I see:
EEH: of node=/pci/@8000000200000d3/pci@2,4
Which is not a device but the bridge above it... not sure why, maybe we
have a NULL pdn->pcidev at that level.. we should probably not sure
pci_find_capability in that code anyway and implent our own version
using RTAS in case we don't have a pci_dev around, don't you think ?
Cheers,
Ben.
^ permalink raw reply [flat|nested] 4+ messages in thread
* Re: eeh bug
2007-05-17 4:59 ` Benjamin Herrenschmidt
@ 2007-05-17 16:44 ` Linas Vepstas
2007-05-17 22:43 ` Benjamin Herrenschmidt
0 siblings, 1 reply; 4+ messages in thread
From: Linas Vepstas @ 2007-05-17 16:44 UTC (permalink / raw)
To: Benjamin Herrenschmidt; +Cc: linuxppc-dev list
On Thu, May 17, 2007 at 02:59:06PM +1000, Benjamin Herrenschmidt wrote:
> On Thu, 2007-05-17 at 14:46 +1000, Benjamin Herrenschmidt wrote:
> >
> > When an RTAS PCI config space call returns all f's, we do an eeh error
> > check by calling eeh_dn_check_failure(pdn->node, NULL);
> >
> > The problem is that second argument... NULL for the pci_dev *. It looks
> > like the EEH code will try to printk pci_name of that and later on
> > dereference it within eehd, thus causing an oops.
>
> Ok, so I just added a
>
> if (dev == NULL)
> dev = pdn->pcidev;
>
> To eeh_dn_check_failure(), and that fixes one of the NULL (name
> printing), but I get another one a bit later, in pci_find_capability
> called from eeh_slot_error_detail called from handle_eeh_events.
> (Probably in gather_pci_data).
OK, clearly I have been sloppy. The initial eeh design used pci_dev
for everything; and as time went on, I realized that the device node
made a better fit for what needed to be manipulated. So the code
migrated in that direction, but not unambiguously; it tried to
keep allegience to both ways of identifying a slot.
> One thing that looks suspicions is that just before that I see:
>
> EEH: of node=/pci/@8000000200000d3/pci@2,4
>
> Which is not a device but the bridge above it...
That's the "partition endpoint", which is what the firmware wants.
There's some ambiguity, as older systems with EADS and newer
direct-attached P5IOC slots have different relationships between
the "partition endpoint", the device, the slot, the bridge and
PHB; which of these are equivalent and which are subordinate
can be confusing.
> we should probably not sure
> pci_find_capability in that code anyway and implent our own version
> using RTAS in case we don't have a pci_dev around, don't you think ?
I'll take a look. Usually, there's no pci_dev only when its a slot
with no device plugged into it; these can still receive EEH errors
during config space i/o to the bridge (I presume that the justification
is when aluminum scrap shorts out a pci connector or something like
that). In all other cases, there's a pci_dev, which is why the
bug slipped by.
--linas
>
^ permalink raw reply [flat|nested] 4+ messages in thread
* Re: eeh bug
2007-05-17 16:44 ` Linas Vepstas
@ 2007-05-17 22:43 ` Benjamin Herrenschmidt
0 siblings, 0 replies; 4+ messages in thread
From: Benjamin Herrenschmidt @ 2007-05-17 22:43 UTC (permalink / raw)
To: Linas Vepstas; +Cc: linuxppc-dev list
On Thu, 2007-05-17 at 11:44 -0500, Linas Vepstas wrote:
> I'll take a look. Usually, there's no pci_dev only when its a slot
> with no device plugged into it; these can still receive EEH errors
> during config space i/o to the bridge (I presume that the
> justification
> is when aluminum scrap shorts out a pci connector or something like
> that). In all other cases, there's a pci_dev, which is why the
> bug slipped by.
Hrm... in that case, there's a device in. I'll try to track down why we
don't get it.
Ben.
^ permalink raw reply [flat|nested] 4+ messages in thread
end of thread, other threads:[~2007-05-17 22:43 UTC | newest]
Thread overview: 4+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2007-05-17 4:46 eeh bug Benjamin Herrenschmidt
2007-05-17 4:59 ` Benjamin Herrenschmidt
2007-05-17 16:44 ` Linas Vepstas
2007-05-17 22:43 ` Benjamin Herrenschmidt
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).