* Seeing DMAR errors after multiple load/unload with SR-IOV
From: padmanabh ratnakar @ 2011-06-06 9:09 UTC
To: linux-kernel, kvm
Hi,
I am using Linux kernel 2.6.39 on an IBM x3650 M3 system.
I have used the following boot options:
intel_iommu=on iommu=pt
I was repeatedly loading and unloading my NIC driver (be2net) with num_vfs=7.
After some iterations I get the following DMAR errors:
Jun 4 03:50:20 rhel6 kernel: Uhhuh. NMI received for unknown reason 2d on CPU 0.
Jun 4 03:50:20 rhel6 kernel: Do you have a strange power saving mode enabled?
Jun 4 03:50:20 rhel6 kernel: Dazed and confused, but trying to continue
Jun 4 03:50:20 rhel6 kernel: DRHD: handling fault status reg 2
Jun 4 03:50:20 rhel6 kernel: DMAR:[DMA Read] Request device [1a:00.2] fault addr 78077000
Jun 4 03:50:20 rhel6 kernel: DMAR:[fault reason 02] Present bit in context entry is clear
I was trying to debug this, but I don't understand the iommu code well.
The physical address belongs to the printed PCI function, so there should
not have been an error.
I do not see the pci_dev (pdev) of the VFs getting removed from the
si_domain->devices list (intel-iommu.c) when the driver is unloaded and
calls pci_disable_sriov(), which frees the VF pdevs.
It looks like the issue happens when a freed pdev is allocated again: as it
is already in the list, the required initializations don't happen.
I don't know if my understanding is correct. Can anyone point me to
what the issue may be?
Thanks,
Padmanabh
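For readers following the log above: the bracketed `[1a:00.2]` in the DMAR line is the requester's PCI bus:device.function, and fault reason 02 says the context entry for that requester was not marked present. The VT-d root table is indexed by bus number and each context table by the 8-bit devfn, so it can help to split the ID into those two values. The sketch below does that in Python purely for illustration; `parse_bdf` and `devfn` are made-up helper names, not kernel or tool APIs.

```python
from typing import Tuple


def parse_bdf(bdf: str) -> Tuple[int, int, int]:
    """Split a 'bb:dd.f' string (hex fields) into (bus, device, function)."""
    bus_s, rest = bdf.split(":")
    dev_s, fn_s = rest.split(".")
    bus, dev, fn = int(bus_s, 16), int(dev_s, 16), int(fn_s, 16)
    # PCI limits: 8-bit bus, 5-bit device, 3-bit function.
    if not (0 <= bus <= 0xFF and 0 <= dev <= 0x1F and 0 <= fn <= 0x7):
        raise ValueError(f"out-of-range BDF: {bdf}")
    return bus, dev, fn


def devfn(dev: int, fn: int) -> int:
    """Pack device and function into one byte, as the kernel's PCI_DEVFN() does."""
    return ((dev & 0x1F) << 3) | (fn & 0x07)


bus, dev, fn = parse_bdf("1a:00.2")
print(f"bus=0x{bus:02x} devfn=0x{devfn(dev, fn):02x}")
```

For the device in the log this gives bus 0x1a and devfn 0x02, i.e. the faulting lookup was for context entry 2 in the context table of bus 0x1a.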
* Re: Seeing DMAR errors after multiple load/unload with SR-IOV
From: Alex Williamson @ 2011-06-06 22:17 UTC
To: padmanabh ratnakar; +Cc: linux-kernel, kvm, iommu, dwmw2

On Mon, 2011-06-06 at 14:39 +0530, padmanabh ratnakar wrote:
> [...]
> I do not see the pci_dev (pdev) of the VFs getting removed from the
> si_domain->devices list (intel-iommu.c) when the driver is unloaded and
> calls pci_disable_sriov(), which frees the VF pdevs.
> It looks like the issue happens when a freed pdev is allocated again: as it
> is already in the list, the required initializations don't happen.
>
> I don't know if my understanding is correct. Can anyone point me to
> what the issue may be?

Typically devices are removed from the domain via
drivers/pci/intel-iommu.c:device_notifier(), which is called as the
device is unbound from the driver. However, this seems to get skipped
when running in passthrough mode, so I'm not sure where that's supposed
to occur. Does it happen w/o passthrough? Also note that some
intel-iommu fixes have rolled into 3.0.0-rc2, you might want to update
and see if anything is better there.

Thanks,
Alex
* Re: Seeing DMAR errors after multiple load/unload with SR-IOV
From: Chris Wright @ 2011-06-06 22:34 UTC
To: Alex Williamson; +Cc: padmanabh ratnakar, iommu, dwmw2, linux-kernel, kvm

* Alex Williamson (alex.williamson@redhat.com) wrote:
> On Mon, 2011-06-06 at 14:39 +0530, padmanabh ratnakar wrote:
> > [...]
> > It looks like the issue happens when a freed pdev is allocated again: as it
> > is already in the list, the required initializations don't happen.
> >
> > I don't know if my understanding is correct. Can anyone point me to
> > what the issue may be?

Yes, that's correct. The (now replaced) check identity_mapping()
will succeed when the pci_dev is recycled (it's freed, but never
removed from the list; this is an issue with passthrough mode and device
creation/destruction). This false match happens w/ a brand new pci_dev
which still has the default 32bit DMA mask, so it is removed from the pt
domain. During removal, the domain_remove_one_dev_info() test that matches
only on bus/devfn (now also segment) will match despite the fact that the
info->pdev != pdev->dev.archdata.iommu. Then...Oops

> Typically devices are removed from the domain via
> drivers/pci/intel-iommu.c:device_notifier(), which is called as the
> device is unbound from the driver. However, this seems to get skipped
> when running in passthrough mode, so I'm not sure where that's supposed
> to occur. Does it happen w/o passthrough?

If you blacklist the driver then a create/delete may do similar (haven't
tested that idea).

> Also note that some
> intel-iommu fixes have rolled into 3.0.0-rc2, you might want to update
> and see if anything is better there. Thanks,

The change in identity_mapping() means we won't demote to 32-bit DMA
(drop out of pt domain), so I don't think we'll see the same issue.

thanks,
-chris
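The false match described in this message can be reproduced in miniature. The following is a toy model in Python, not kernel code: `ToyAllocator` is a made-up allocator that reuses the most recently freed address first, and the `tracked` list stands in for a device list that holds raw pointers. It only illustrates why a membership test on a pointer goes wrong once the pointer has been freed but never unlinked.

```python
class ToyAllocator:
    """Hands out integer 'addresses' and reuses freed ones first (LIFO)."""

    def __init__(self):
        self._next = 0x1000
        self._free = []

    def alloc(self) -> int:
        if self._free:
            return self._free.pop()   # recycle a freed address
        addr = self._next
        self._next += 0x100
        return addr

    def free(self, addr: int) -> None:
        self._free.append(addr)


def reload_cycle(prune_on_free: bool) -> bool:
    """Return True if a freshly created device is wrongly found in the list."""
    heap = ToyAllocator()
    tracked = []                 # stand-in for a list of device pointers

    vf = heap.alloc()            # first driver load creates a VF
    tracked.append(vf)           # the VF is recorded in the list

    if prune_on_free:
        tracked.remove(vf)       # the step that is skipped in the report
    heap.free(vf)                # driver unload frees the VF

    new_vf = heap.alloc()        # next load gets the recycled address
    return new_vf in tracked     # pointer-equality membership test


print(reload_cycle(prune_on_free=False))  # stale entry matches the new device
print(reload_cycle(prune_on_free=True))   # no stale entry, no false match
```

Without pruning, the new device inherits the old device's list membership even though none of its own setup has run, which is the situation the reply describes for the recycled pci_dev.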
* Re: Seeing DMAR errors after multiple load/unload with SR-IOV
From: padmanabh ratnakar @ 2011-06-07 6:23 UTC
To: Chris Wright; +Cc: Alex Williamson, iommu, dwmw2, linux-kernel, kvm

On Tue, Jun 7, 2011 at 4:04 AM, Chris Wright <chrisw@sous-sol.org> wrote:
> * Alex Williamson (alex.williamson@redhat.com) wrote:
> [...]
>> Typically devices are removed from the domain via
>> drivers/pci/intel-iommu.c:device_notifier(), which is called as the
>> device is unbound from the driver. However, this seems to get skipped
>> when running in passthrough mode, so I'm not sure where that's supposed
>> to occur. Does it happen w/o passthrough?

I had tried without passthrough on the RHEL 6.1 GA kernel and was seeing
hangs and panics. I will check whether non-passthrough mode works on the
latest kernel.

> If you blacklist the driver then a create/delete may do similar (haven't
> tested that idea).
>
>> Also note that some
>> intel-iommu fixes have rolled into 3.0.0-rc2, you might want to update
>> and see if anything is better there. Thanks,
>
> The change in identity_mapping() means we won't demote to 32-bit DMA
> (drop out of pt domain), so I don't think we'll see the same issue.

For testing I had made a hack in the 2.6.39 kernel which prevents
demoting to a 32bit DMA mask and thereby prevents the call to
domain_remove_one_dev_info() for the specific VF device I was using,
and it worked.
So, as you said, I may not hit the issue in the latest kernel. Will try that.

> thanks,
> -chris

Thanks for the response and suggestions.
Padmanabh
* Re: Seeing DMAR errors after multiple load/unload with SR-IOV
From: Chris Wright @ 2011-06-07 13:38 UTC
To: padmanabh ratnakar; +Cc: Chris Wright, Alex Williamson, iommu, dwmw2, linux-kernel, kvm

* padmanabh ratnakar (pratnakarlx@gmail.com) wrote:
> [...]
> For testing I had made a hack in 2.6.39 kernel which will prevent
> demoting to 32bit DMA mask
> and thereby prevent calling of domain_remove_one_dev_info() for the
> specific VF device I was using
> and it had worked.
> So as you said I may not hit the issue in latest kernel. Will try that.

I think we still leak the list entry though. Bottom line is that we
need to handle hotplug ADD_DEVICE and DEL_DEVICE notifications. We
happen to pick up ADD_DEVICE by accident, but it's all pretty sloppy.

thanks,
-chris
* Re: Seeing DMAR errors after multiple load/unload with SR-IOV
From: David Woodhouse @ 2011-06-07 13:46 UTC
To: Chris Wright; +Cc: padmanabh ratnakar, Alex Williamson, iommu, linux-kernel, kvm

On Tue, 2011-06-07 at 06:38 -0700, Chris Wright wrote:
> I think we still leak the list entry though. Bottom line is that we
> need to handle hotplug ADD_DEVICE and DEL_DEVICE notifications. We
> happen to pick up ADD_DEVICE by accident, but it's all pretty sloppy.

Yeah, keeping a list of possible stale 'pci_dev' pointers is stupid. We
should figure out the matching DMAR unit directly from the ACPI table at
ADD_DEVICE time, and store it in pdev->archdata.iommu.

I saw patches which were going in that direction...

--
dwmw2
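The direction suggested here (resolve the IOMMU unit when the device is added, keep the result on the device itself, and drop it when the device is deleted) can be sketched as a toy model. This is Python for illustration only and makes several assumptions: `ToyDevice`, `on_bus_event`, the event constants and the `SCOPE` table are all invented names, and the lookup by bus number is a deliberate simplification of how device scope is really described in the ACPI DMAR table.

```python
from dataclasses import dataclass
from typing import Dict, Optional

# Hypothetical event names, loosely mirroring add/delete bus notifications.
ADD_DEVICE = "add"
DEL_DEVICE = "del"

# Hypothetical scope table: bus number -> DMAR unit name (simplified).
SCOPE: Dict[int, str] = {0x1A: "dmar0"}


@dataclass
class ToyDevice:
    bus: int
    devfn: int
    iommu: Optional[str] = None   # stand-in for per-device IOMMU data


def on_bus_event(event: str, dev: ToyDevice,
                 scope: Dict[int, str] = SCOPE) -> None:
    """Attach IOMMU info on add and clear it on delete; no side list is kept."""
    if event == ADD_DEVICE:
        dev.iommu = scope.get(dev.bus)   # resolved once, stored on the device
    elif event == DEL_DEVICE:
        dev.iommu = None                 # nothing can go stale after removal


vf = ToyDevice(bus=0x1A, devfn=0x02)
on_bus_event(ADD_DEVICE, vf)
print(vf.iommu)
on_bus_event(DEL_DEVICE, vf)
print(vf.iommu)
```

Because the state lives on the device object and is torn down on delete, a newly created device always starts with no IOMMU data, regardless of whether its memory was recycled.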
* Re: Seeing DMAR errors after multiple load/unload with SR-IOV
From: Chris Wright @ 2011-06-07 15:10 UTC
To: David Woodhouse; +Cc: Chris Wright, padmanabh ratnakar, Alex Williamson, iommu, linux-kernel, kvm

* David Woodhouse (dwmw2@infradead.org) wrote:
> Yeah, keeping a list of possible stale 'pci_dev' pointers is stupid. We
> should figure out the matching DMAR unit directly from the ACPI table at
> ADD_DEVICE time, and store it in pdev->archdata.iommu.
>
> I saw patches which were going in that direction...

Cool, where are they? I'm working on something similar, and missed them.

thanks,
-chris
* Re: Seeing DMAR errors after multiple load/unload with SR-IOV
From: David Woodhouse @ 2011-06-07 15:33 UTC
To: Chris Wright; +Cc: padmanabh ratnakar, Alex Williamson, iommu, linux-kernel, kvm

On Tue, 2011-06-07 at 08:10 -0700, Chris Wright wrote:
> * David Woodhouse (dwmw2@infradead.org) wrote:
> > I saw patches which were going in that direction...
>
> Cool, where are they? I'm working on something similar, and missed them.

[PATCH] pci, dmar: Update dmar units devices list during hotplug

Alex was working on it.

--
dwmw2
* Re: Seeing DMAR errors after multiple load/unload with SR-IOV
From: Chris Wright @ 2011-06-07 15:35 UTC
To: David Woodhouse; +Cc: Chris Wright, padmanabh ratnakar, Alex Williamson, iommu, linux-kernel, kvm

* David Woodhouse (dwmw2@infradead.org) wrote:
> On Tue, 2011-06-07 at 08:10 -0700, Chris Wright wrote:
> > Cool, where are they? I'm working on something similar, and missed them.
>
> [PATCH] pci, dmar: Update dmar units devices list during hotplug

Oh yeah, thanks for the reminder.

thanks,
-chris
* Re: Seeing DMAR errors after multiple load/unload with SR-IOV
From: Alex Williamson @ 2011-06-07 15:40 UTC
To: David Woodhouse; +Cc: Chris Wright, padmanabh ratnakar, iommu, linux-kernel, kvm

On Tue, 2011-06-07 at 16:33 +0100, David Woodhouse wrote:
> On Tue, 2011-06-07 at 08:10 -0700, Chris Wright wrote:
> > Cool, where are they? I'm working on something similar, and missed them.
>
> [PATCH] pci, dmar: Update dmar units devices list during hotplug
>
> Alex was working on it.

Nope, I had a wip patch that did an on-the-fly lookup, that I handed off
to Yinghai, but it didn't actually work. That's when the suggestion was
made to do it at hotplug, but I'm not pursuing that right now, maybe
Yinghai is?

Thanks,
Alex