* eeh bug @ 2007-05-17 4:46 Benjamin Herrenschmidt 2007-05-17 4:59 ` Benjamin Herrenschmidt 0 siblings, 1 reply; 4+ messages in thread From: Benjamin Herrenschmidt @ 2007-05-17 4:46 UTC (permalink / raw) To: Linas Vepstas; +Cc: linuxppc-dev list Hi Linas ! While debugging some other issues, I had a couple of oopses caused by what looks like a bug in EEH: When an RTAS PCI config space call returns all f's, we do an eeh error check by calling eeh_dn_check_failure(pdn->node, NULL); The problem is that second argument... NULL for the pci_dev *. It looks like the EEH code will try to printk pci_name of that and later on dereference it within eehd, thus causing an oops. I'm not sure what's the best way to fix it and have no time to dig right now though I suppose one could walk all PCI devices in the system to look for a match with the dev node. Though if for some reason we don't find a match, I think we should still handle NULL's gracefully. Cheers, Ben. ^ permalink raw reply [flat|nested] 4+ messages in thread
* Re: eeh bug 2007-05-17 4:46 eeh bug Benjamin Herrenschmidt @ 2007-05-17 4:59 ` Benjamin Herrenschmidt 2007-05-17 16:44 ` Linas Vepstas 0 siblings, 1 reply; 4+ messages in thread From: Benjamin Herrenschmidt @ 2007-05-17 4:59 UTC (permalink / raw) To: Linas Vepstas; +Cc: linuxppc-dev list On Thu, 2007-05-17 at 14:46 +1000, Benjamin Herrenschmidt wrote: > Hi Linas ! > > While debugging some other issues, I had a couple of oopses caused by > what looks like a bug in EEH: > > When an RTAS PCI config space call returns all f's, we do an eeh error > check by calling eeh_dn_check_failure(pdn->node, NULL); > > The problem is that second argument... NULL for the pci_dev *. It looks > like the EEH code will try to printk pci_name of that and later on > dereference it within eehd, thus causing an oops. Ok, so I just added a if (dev == NULL) dev = pdn->pcidev; To eeh_dn_check_failure(), and that fixes one of the NULL (name printing), but I get another one a bit later, in pci_find_capability called from eeh_slot_error_detail called from handle_eeh_events. (Probably in gather_pci_data). One thing that looks suspicions is that just before that I see: EEH: of node=/pci/@8000000200000d3/pci@2,4 Which is not a device but the bridge above it... not sure why, maybe we have a NULL pdn->pcidev at that level.. we should probably not sure pci_find_capability in that code anyway and implent our own version using RTAS in case we don't have a pci_dev around, don't you think ? Cheers, Ben. ^ permalink raw reply [flat|nested] 4+ messages in thread
* Re: eeh bug 2007-05-17 4:59 ` Benjamin Herrenschmidt @ 2007-05-17 16:44 ` Linas Vepstas 2007-05-17 22:43 ` Benjamin Herrenschmidt 0 siblings, 1 reply; 4+ messages in thread From: Linas Vepstas @ 2007-05-17 16:44 UTC (permalink / raw) To: Benjamin Herrenschmidt; +Cc: linuxppc-dev list On Thu, May 17, 2007 at 02:59:06PM +1000, Benjamin Herrenschmidt wrote: > On Thu, 2007-05-17 at 14:46 +1000, Benjamin Herrenschmidt wrote: > > > > When an RTAS PCI config space call returns all f's, we do an eeh error > > check by calling eeh_dn_check_failure(pdn->node, NULL); > > > > The problem is that second argument... NULL for the pci_dev *. It looks > > like the EEH code will try to printk pci_name of that and later on > > dereference it within eehd, thus causing an oops. > > Ok, so I just added a > > if (dev == NULL) > dev = pdn->pcidev; > > To eeh_dn_check_failure(), and that fixes one of the NULL (name > printing), but I get another one a bit later, in pci_find_capability > called from eeh_slot_error_detail called from handle_eeh_events. > (Probably in gather_pci_data). OK, clearly I have been sloppy. The initial eeh design used pci_dev for everything; and as time went on, I realized that the device node made a better fit for what needed to be manipulated. So the code migrated in that direction, but not unambiguously; it tried to keep allegience to both ways of identifying a slot. > One thing that looks suspicions is that just before that I see: > > EEH: of node=/pci/@8000000200000d3/pci@2,4 > > Which is not a device but the bridge above it... That's the "partition endpoint", which is what the firmware wants. There's some ambiguity, as older systems with EADS and newer direct-attached P5IOC slots have different relationships between the "partition endpoint", the device, the slot, the bridge and PHB; which of these are equivalent and which are subordinate can be confusing. > we should probably not sure > pci_find_capability in that code anyway and implent our own version > using RTAS in case we don't have a pci_dev around, don't you think ? I'll take a look. Usually, there's no pci_dev only when its a slot with no device plugged into it; these can still receive EEH errors during config space i/o to the bridge (I presume that the justification is when aluminum scrap shorts out a pci connector or something like that). In all other cases, there's a pci_dev, which is why the bug slipped by. --linas > ^ permalink raw reply [flat|nested] 4+ messages in thread
* Re: eeh bug 2007-05-17 16:44 ` Linas Vepstas @ 2007-05-17 22:43 ` Benjamin Herrenschmidt 0 siblings, 0 replies; 4+ messages in thread From: Benjamin Herrenschmidt @ 2007-05-17 22:43 UTC (permalink / raw) To: Linas Vepstas; +Cc: linuxppc-dev list On Thu, 2007-05-17 at 11:44 -0500, Linas Vepstas wrote: > I'll take a look. Usually, there's no pci_dev only when its a slot > with no device plugged into it; these can still receive EEH errors > during config space i/o to the bridge (I presume that the > justification > is when aluminum scrap shorts out a pci connector or something like > that). In all other cases, there's a pci_dev, which is why the > bug slipped by. Hrm... in that case, there's a device in. I'll try to track down why we don't get it. Ben. ^ permalink raw reply [flat|nested] 4+ messages in thread
end of thread, other threads:[~2007-05-17 22:43 UTC | newest] Thread overview: 4+ messages (download: mbox.gz follow: Atom feed -- links below jump to the message on this page -- 2007-05-17 4:46 eeh bug Benjamin Herrenschmidt 2007-05-17 4:59 ` Benjamin Herrenschmidt 2007-05-17 16:44 ` Linas Vepstas 2007-05-17 22:43 ` Benjamin Herrenschmidt
This is a public inbox, see mirroring instructions for how to clone and mirror all data and code used for this inbox; as well as URLs for NNTP newsgroup(s).