iommu.lists.linux-foundation.org archive mirror
* Re: [E1000-devel] AMD-Vi: Event logged IO_PAGE_FAULT - ixgbe Detected Tx Unit Hang - Reset adapter - master disable timed out
       [not found]       ` <njbvjb$40r$1@ger.gmane.org>
@ 2016-06-09 16:03         ` Alexander Duyck
  2016-06-09 16:57           ` Lutz Vieweg
       [not found]           ` <CAKgT0UeFM1jYTU83YFohxUHWuJeTYfWDpdFM2CDQCutmf_vXvA-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
  0 siblings, 2 replies; 12+ messages in thread
From: Alexander Duyck @ 2016-06-09 16:03 UTC (permalink / raw)
  To: Lutz Vieweg, jroedel-l3A5Bk7waGM
  Cc: e1000-devel-5NWGOfrQmneRv+LV9MX5uipxlwaOVQ5f@public.gmane.org,
	open list:INTEL IOMMU (VT-d)

On Thu, Jun 9, 2016 at 7:48 AM, Lutz Vieweg <lvml-i6VILw57VWU@public.gmane.org> wrote:
> Bad news: It happened again today:
>> Jun  9 14:40:09 computer kernel: AMD-Vi: Event logged [IO_PAGE_FAULT device=04:00.0 domain=0x000e address=0x00000000000178c0 flags=0x0050]
>> Jun  9 14:40:09 computer kernel: AMD-Vi: Event logged [IO_PAGE_FAULT device=04:00.0 domain=0x000e address=0x0000000000017900 flags=0x0050]
>> Jun  9 14:40:13 computer kernel: ixgbe 0000:04:00.0 enp4s0: Detected Tx Unit Hang#012  Tx Queue             <2>#012  TDH, TDT             <186>, <194>#012  next_to_use          <194>#012  next_to_clean        <186>#012tx_buffer_info[next_to_clean]#012  time_stamp           <11df79bf7>#012  jiffies              <11df7aac8>
>> Jun  9 14:40:13 computer kernel: ixgbe 0000:04:00.0 enp4s0: Detected Tx Unit Hang#012  Tx Queue             <3>#012  TDH, TDT             <1e4>, <2>#012  next_to_use          <2>#012  next_to_clean        <1e4>#012tx_buffer_info[next_to_clean]#012  time_stamp           <11df79a0f>#012  jiffies              <11df7aac8>
>> Jun  9 14:40:13 computer kernel: ixgbe 0000:04:00.0 enp4s0: tx hang 1 detected on queue 3, resetting adapter
>> Jun  9 14:40:13 computer kernel: ixgbe 0000:04:00.0 enp4s0: Detected Tx Unit Hang#012  Tx Queue             <24>#012  TDH, TDT             <1ec>, <2>#012  next_to_use          <2>#012  next_to_clean        <1ec>#012tx_buffer_info[next_to_clean]#012  time_stamp           <11df79a0f>#012  jiffies              <11df7aac8>
>> Jun  9 14:40:13 computer kernel: ixgbe 0000:04:00.0 enp4s0: initiating reset due to tx timeout
>> Jun  9 14:40:13 computer kernel: ixgbe 0000:04:00.0 enp4s0: tx hang 1 detected on queue 24, resetting adapter
>> Jun  9 14:40:13 computer kernel: ixgbe 0000:04:00.0 enp4s0: initiating reset due to tx timeout
>> Jun  9 14:40:13 computer kernel: ixgbe 0000:04:00.0 enp4s0: Reset adapter
>> Jun  9 14:40:13 computer kernel: ixgbe 0000:04:00.0 enp4s0: tx hang 2 detected on queue 2, resetting adapter
>> Jun  9 14:40:14 computer kernel: ixgbe 0000:04:00.0: master disable timed out
>   ...
>
> And today, no other NIC connected to the same switch saw any "glitch".
>
> I got you an "lspci -vvv" output, however, some interesting
>> "pcilib: sysfs_read_vpd: read failed: Input/output error"
> message is reported while lspci is emitting data on the NIC:
>
>> 04:00.0 Ethernet controller: Intel Corporation Ethernet Controller 10-Gigabit X540-AT2 (rev 01)
>>         Subsystem: Intel Corporation Ethernet Converged Network Adapter X540-T1
>>         Control: I/O- Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR+ FastB2B- DisINTx+
>>         Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort+ <MAbort- >SERR- <PERR- INTx-
>>         Latency: 0, Cache Line Size: 64 bytes
>>         Interrupt: pin A routed to IRQ 59
>>         Region 0: Memory at dce00000 (64-bit, prefetchable) [size=2M]
>>         Region 4: Memory at dcdfc000 (64-bit, prefetchable) [size=16K]
>>         Expansion ROM at dfd80000 [disabled] [size=512K]
>>         Capabilities: [40] Power Management version 3
>>                 Flags: PMEClk- DSI+ D1- D2- AuxCurrent=0mA PME(D0+,D1-,D2-,D3hot+,D3cold-)
>>                 Status: D0 NoSoftRst- PME-Enable- DSel=0 DScale=1 PME-
>>         Capabilities: [50] MSI: Enable- Count=1/1 Maskable+ 64bit+
>>                 Address: 0000000000000000  Data: 0000
>>                 Masking: 00000000  Pending: 00000000
>>         Capabilities: [70] MSI-X: Enable+ Count=64 Masked-
>>                 Vector table: BAR=4 offset=00000000
>>                 PBA: BAR=4 offset=00002000
>>         Capabilities: [a0] Express (v2) Endpoint, MSI 00
>>                 DevCap: MaxPayload 512 bytes, PhantFunc 0, Latency L0s <512ns, L1 <64us
>>                         ExtTag- AttnBtn- AttnInd- PwrInd- RBE+ FLReset+
>>                 DevCtl: Report errors: Correctable+ Non-Fatal+ Fatal+ Unsupported+
>>                         RlxdOrd- ExtTag- PhantFunc- AuxPwr- NoSnoop+ FLReset-
>>                         MaxPayload 128 bytes, MaxReadReq 512 bytes
>>                 DevSta: CorrErr+ UncorrErr- FatalErr- UnsuppReq+ AuxPwr- TransPend-
>>                 LnkCap: Port #0, Speed 5GT/s, Width x8, ASPM L0s L1, Exit Latency L0s <1us, L1 <8us
>>                         ClockPM- Surprise- LLActRep- BwNot-
>>                 LnkCtl: ASPM Disabled; RCB 64 bytes Disabled- CommClk+
>>                         ExtSynch- ClockPM- AutWidDis- BWInt- AutBWInt-
>>                 LnkSta: Speed 5GT/s, Width x8, TrErr- Train- SlotClk+ DLActive- BWMgmt- ABWMgmt-
>>                 DevCap2: Completion Timeout: Range ABCD, TimeoutDis+, LTR-, OBFF Not Supported
>>                 DevCtl2: Completion Timeout: 50us to 50ms, TimeoutDis-, LTR-, OBFF Disabled
>>                 LnkCtl2: Target Link Speed: 5GT/s, EnterCompliance- SpeedDis-
>>                          Transmit Margin: Normal Operating Range, EnterModifiedCompliance- ComplianceSOS-
>>                          Compliance De-emphasis: -6dB
>>                 LnkSta2: Current De-emphasis Level: -3.5dB, EqualizationComplete-, EqualizationPhase1-
>>                          EqualizationPhase2-, EqualizationPhase3-, LinkEqualizationRequest-
>>         Capabilities: [100 v2] Advanced Error Reporting
>>                 UESta:  DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt- UnxCmplt- RxOF- MalfTLP- ECRC- UnsupReq- ACSViol-
>>                 UEMsk:  DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt- UnxCmplt- RxOF- MalfTLP- ECRC- UnsupReq- ACSViol-
>>                 UESvrt: DLP+ SDES+ TLP- FCP+ CmpltTO- CmpltAbrt- UnxCmplt- RxOF+ MalfTLP+ ECRC- UnsupReq- ACSViol-
>>                 CESta:  RxErr- BadTLP- BadDLLP- Rollover- Timeout- NonFatalErr-
>>                 CEMsk:  RxErr- BadTLP- BadDLLP- Rollover- Timeout- NonFatalErr+
>>                 AERCap: First Error Pointer: 00, GenCap+ CGenEn- ChkCap+ ChkEn-
>> pcilib: sysfs_read_vpd: read failed: Input/output error
>>         Capabilities: [140 v1] Device Serial Number a0-36-9f-ff-ff-80-xx-xx
>>         Capabilities: [150 v1] Alternative Routing-ID Interpretation (ARI)
>>                 ARICap: MFVC- ACS-, Next Function: 0
>>                 ARICtl: MFVC- ACS-, Function Group: 0
>>         Capabilities: [160 v1] Single Root I/O Virtualization (SR-IOV)
>>                 IOVCap: Migration-, Interrupt Message Number: 000
>>                 IOVCtl: Enable- Migration- Interrupt- MSE- ARIHierarchy+
>>                 IOVSta: Migration-
>>                 Initial VFs: 64, Total VFs: 64, Number of VFs: 0, Function Dependency Link: 00
>>                 VF offset: 128, stride: 2, Device ID: 1515
>>                 Supported Page Size: 00000553, System Page Size: 00000001
>>                 Region 0: Memory at 0000000000000000 (64-bit, non-prefetchable)
>>                 Region 3: Memory at 0000000000000000 (64-bit, non-prefetchable)
>>                 VF Migration: offset: 00000000, BIR: 0
>>         Capabilities: [1d0 v1] Access Control Services
>>                 ACSCap: SrcValid- TransBlk- ReqRedir- CmpltRedir- UpstreamFwd- EgressCtrl- DirectTrans-
>>                 ACSCtl: SrcValid- TransBlk- ReqRedir- CmpltRedir- UpstreamFwd- EgressCtrl- DirectTrans-
>>         Kernel driver in use: ixgbe
>
> This time I'll reboot the machine, and also try "iommu=pt" as suggested
> in different places for use with 10G NICs.

That might be a good place to start.

I'm adding, or at least attempting to add, the mailing list and maintainer
for the IOMMU code.  You might want to check with the AMD-Vi IOMMU
maintainers to see if they have any other advice.  This looks like
something that may have been introduced by changes to the IOMMU code:
the ixgbe driver hasn't had any updates to its DMA mapping/unmapping
code in some time, it was working in the 4.4 kernel series, and it
still works on my system, which runs an Intel IOMMU.  So I am wondering
whether this is specifically related to changes in the AMD IOMMU code.

- Alex
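
For anyone who wants to collect these events from a larger log, here is a minimal sketch of a parser for the AMD-Vi IO_PAGE_FAULT lines quoted above. The regular expression is modelled only on the two lines in this report, so treat the field layout as an assumption for other kernel versions. The names FAULT_RE, parse_io_page_faults and SAMPLE_LOG are made up for this illustration.

```python
import re

# Pattern modelled on the two AMD-Vi lines quoted in this thread; other
# kernel versions may format the event differently (assumption).
FAULT_RE = re.compile(
    r"AMD-Vi: Event logged \[IO_PAGE_FAULT "
    r"device=(?P<device>[0-9a-f]{2}:[0-9a-f]{2}\.[0-7]) "
    r"domain=0x(?P<domain>[0-9a-f]+) "
    r"address=0x(?P<address>[0-9a-f]+) "
    r"flags=0x(?P<flags>[0-9a-f]+)\]"
)


def parse_io_page_faults(lines):
    """Return one dict per IO_PAGE_FAULT line, with numeric fields as ints."""
    faults = []
    for line in lines:
        m = FAULT_RE.search(line)
        if m is None:
            continue  # not an IO_PAGE_FAULT line
        faults.append({
            "device": m.group("device"),
            "domain": int(m.group("domain"), 16),
            "address": int(m.group("address"), 16),
            "flags": int(m.group("flags"), 16),
        })
    return faults


# The two events from the report above, copied verbatim.
SAMPLE_LOG = [
    "Jun  9 14:40:09 computer kernel: AMD-Vi: Event logged [IO_PAGE_FAULT "
    "device=04:00.0 domain=0x000e address=0x00000000000178c0 flags=0x0050]",
    "Jun  9 14:40:09 computer kernel: AMD-Vi: Event logged [IO_PAGE_FAULT "
    "device=04:00.0 domain=0x000e address=0x0000000000017900 flags=0x0050]",
]

if __name__ == "__main__":
    for fault in parse_io_page_faults(SAMPLE_LOG):
        print(fault["device"], hex(fault["address"]), hex(fault["flags"]))
```

Both faulting addresses come from device 04:00.0 in the same domain and lie 0x40 bytes apart. The sketch only extracts the fields and does not interpret the flags value.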

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: AMD-Vi: Event logged IO_PAGE_FAULT - ixgbe Detected Tx Unit Hang - Reset adapter - master disable timed out
  2016-06-09 16:03         ` [E1000-devel] AMD-Vi: Event logged IO_PAGE_FAULT - ixgbe Detected Tx Unit Hang - Reset adapter - master disable timed out Alexander Duyck
@ 2016-06-09 16:57           ` Lutz Vieweg
       [not found]             ` <5759A009.8040200-i6VILw57VWU@public.gmane.org>
       [not found]           ` <CAKgT0UeFM1jYTU83YFohxUHWuJeTYfWDpdFM2CDQCutmf_vXvA-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
  1 sibling, 1 reply; 12+ messages in thread
From: Lutz Vieweg @ 2016-06-09 16:57 UTC (permalink / raw)
  To: e1000-devel; +Cc: iommu

On 06/09/2016 06:03 PM, Alexander Duyck wrote:
>> This time I'll reboot the machine, and also try "iommu=pt" as suggested
>> in different places for use with 10G NICs.
>
> That might be a good place to start.
>
> I'm adding, or at least attempting to, the mailing list and maintainer
> for the IOMMU code.  You might want to check with the AMD-Vi IOMMU
> maintainers to see if they have any other advice as this seems like
> something that may have been introduced with changes to the IOMMU as
> the ixgbe driver hasn't had any updates to the DMA mapping/unmapping
> code in some time and it was working in the 4.4 kernel series and
> still works on my system which runs an Intel IOMMU so I am wondering
> if this may be something specifically related to changes in the AMD
> IOMMU code.

After having rebooted the system with "iommu=pt", the following change
in the dmesg output looks curious to me:

Without "iommu=pt":
> [    4.869591] iommu: Adding device 0000:04:00.0 to group 13
...
> [    4.873105] AMD-Vi: Found IOMMU at 0000:00:00.2 cap 0x40
> [    4.873347] AMD-Vi: Found IOMMU at 0000:40:00.2 cap 0x40
> [    4.873586] AMD-Vi: Interrupt remapping enabled
> [    4.874108] AMD-Vi: Lazy IO/TLB flushing enabled

With "iommu=pt":
> [    4.832580] iommu: Adding device 0000:04:00.0 to group 13
> [    4.832838] iommu: Using direct mapping for device 0000:04:00.0
...
> [    4.837074] AMD-Vi: Found IOMMU at 0000:00:00.2 cap 0x40
> [    4.837305] AMD-Vi: Found IOMMU at 0000:40:00.2 cap 0x40
> [    4.837535] AMD-Vi: Interrupt remapping enabled
> [    4.838062] AMD-Vi: Lazy IO/TLB flushing enabled
> [    4.838291] PCI-DMA: Using software bounce buffering for IO (SWIOTLB)
> [    4.838533] software IO TLB [mem 0xd3e80000-0xd7e80000] (64MB) mapped at [ffff8800d3e80000-ffff8800d7e7ffff]

I hope that doesn't mean all my network data is now passing through
an additional copy-by-CPU... that would be kind of the opposite of what
"iommu=pt" seemed to promise :-)
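
As a quick sanity check of the SWIOTLB line above, the following sketch recomputes the size of the reserved bounce-buffer pool from the logged physical range. It only verifies the arithmetic of the message; it says nothing about whether network traffic is actually copied through that pool. SWIOTLB_RE and swiotlb_size_bytes are names invented for this example, and the end address is assumed to be exclusive, which matches the "(64MB)" in the log.

```python
import re

# Matches the physical range in the "software IO TLB [mem ...]" message
# quoted above (format assumed from that single line).
SWIOTLB_RE = re.compile(r"software IO TLB \[mem 0x([0-9a-f]+)-0x([0-9a-f]+)\]")


def swiotlb_size_bytes(line):
    """Return the size of the SWIOTLB pool in bytes, or None if no match."""
    m = SWIOTLB_RE.search(line)
    if m is None:
        return None
    start = int(m.group(1), 16)
    end = int(m.group(2), 16)  # treated as exclusive, see lead-in
    return end - start


SAMPLE_LINE = (
    "[    4.838533] software IO TLB [mem 0xd3e80000-0xd7e80000] (64MB) "
    "mapped at [ffff8800d3e80000-ffff8800d7e7ffff]"
)

if __name__ == "__main__":
    size = swiotlb_size_bytes(SAMPLE_LINE)
    print(size // (1024 * 1024), "MiB reserved for bounce buffering")
```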

One more thing I find curious, but this didn't change with "iommu=pt":
> [    0.000000] AGP: Checking aperture...
> [    0.000000] AGP: No AGP bridge found
> [    0.000000] AGP: Node 0: aperture [bus addr 0x00000000-0x01ffffff] (32MB)
> [    0.000000] AGP: Your BIOS doesn't leave an aperture memory hole
> [    0.000000] AGP: Please enable the IOMMU option in the BIOS setup
> [    0.000000] AGP: This costs you 64MB of RAM
> [    0.000000] AGP: Mapping aperture over RAM [mem 0xcc000000-0xcfffffff] (65536KB)
I checked and the IOMMU option is definitely enabled in the BIOS setup.
So am I right to assume that these messages are irrelevant (since AGP as a
whole is irrelevant on this server)?

Regards,

Lutz Vieweg



_______________________________________________
E1000-devel mailing list
E1000-devel@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/e1000-devel
To learn more about Intel® Ethernet, visit http://communities.intel.com/community/wired

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [E1000-devel] AMD-Vi: Event logged IO_PAGE_FAULT - ixgbe Detected Tx Unit Hang - Reset adapter - master disable timed out
       [not found]             ` <5759A009.8040200-i6VILw57VWU@public.gmane.org>
@ 2016-06-13  2:46               ` Wan ZongShun
       [not found]                 ` <CAKT61h9cNnGDNugoWXYcpN1VjVK3Hn-VOW+TwHahj5EXzfsXgA-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
  0 siblings, 1 reply; 12+ messages in thread
From: Wan ZongShun @ 2016-06-13  2:46 UTC (permalink / raw)
  To: Lutz Vieweg
  Cc: e1000-devel-5NWGOfrQmneRv+LV9MX5uipxlwaOVQ5f,
	iommu-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA

2016-06-10 0:57 GMT+08:00 Lutz Vieweg <lvml-i6VILw57VWU@public.gmane.org>:
> On 06/09/2016 06:03 PM, Alexander Duyck wrote:
>>>
>>> This time I'll reboot the machine, and also try "iommu=pt" as suggested
>>> in different places for use with 10G NICs.
>>
>>
>> That might be a good place to start.
>>
>> I'm adding, or at least attempting to, the mailing list and maintainer
>> for the IOMMU code.  You might want to check with the AMD-Vi IOMMU
>> maintainers to see if they have any other advice as this seems like
>> something that may have been introduced with changes to the IOMMU as
>> the ixgbe driver hasn't had any updates to the DMA mapping/unmapping
>> code in some time and it was working in the 4.4 kernel series and
>> still works on my system which runs an Intel IOMMU so I am wondering
>> if this may be something specifically related to changes in the AMD
>> IOMMU code.
>
>
> After having rebooted the system with "iommu=pt", the following change
> of dmesg-output looks curious to me:
>
> Without "iommu=pt":
>>
>> [    4.869591] iommu: Adding device 0000:04:00.0 to group 13
>
> ...
>>
>> [    4.873105] AMD-Vi: Found IOMMU at 0000:00:00.2 cap 0x40
>> [    4.873347] AMD-Vi: Found IOMMU at 0000:40:00.2 cap 0x40
>> [    4.873586] AMD-Vi: Interrupt remapping enabled
>> [    4.874108] AMD-Vi: Lazy IO/TLB flushing enabled
>

OK, so there are two IOMMU controllers in your system.

>
> With "iommu=pt":
>>
>> [    4.832580] iommu: Adding device 0000:04:00.0 to group 13
>> [    4.832838] iommu: Using direct mapping for device 0000:04:00.0
>

That is right: with iommu=pt the device is put in pass-through
(direct-mapped) mode on the AMD IOMMU.

> ...
>>
>> [    4.837074] AMD-Vi: Found IOMMU at 0000:00:00.2 cap 0x40
>> [    4.837305] AMD-Vi: Found IOMMU at 0000:40:00.2 cap 0x40
>> [    4.837535] AMD-Vi: Interrupt remapping enabled
>> [    4.838062] AMD-Vi: Lazy IO/TLB flushing enabled
>> [    4.838291] PCI-DMA: Using software bounce buffering for IO (SWIOTLB)
>> [    4.838533] software IO TLB [mem 0xd3e80000-0xd7e80000] (64MB) mapped
>> at [ffff8800d3e80000-ffff8800d7e7ffff]
>
>
> I hope that doesn't mean all my network data is now passing through
> an additional copy-by-CPU... that would be kind of the opposite of what
> "iommu=pt" seemed to promise :-)

It depends.

First, I need to know whether your Ethernet card works well now that
you have set iommu=pt.

If your Ethernet card is capable of 64-bit (not just 32-bit) DMA addressing,
that is OK: you will not be impacted by the bounce buffer. But iommu=pt is a
terrible option, since it makes all devices bypass the IOMMU.

If you want further help, please try:

(1) Add the 'amd_iommu_dump' option to your kernel boot options and
send your full kernel logs and lspci info; don't add iommu=pt.
(2) Add the amd_iommu=fullflush option to the kernel boot options and
just try it.
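
To confirm after a reboot which of these options actually reached the kernel, one can inspect /proc/cmdline. The sketch below is a small, hedged helper for that. The function names and the example command line are hypothetical, and it assumes that values after "iommu=" and "amd_iommu=" may be comma-separated.

```python
def iommu_boot_options(cmdline):
    """Report which IOMMU-related options from this thread are present.

    `cmdline` is the kernel command line as one string, for example the
    contents of /proc/cmdline.
    """
    tokens = cmdline.split()
    amd_values = []
    generic_values = []
    for tok in tokens:
        if tok.startswith("amd_iommu="):
            amd_values.extend(tok[len("amd_iommu="):].split(","))
        elif tok.startswith("iommu="):
            generic_values.extend(tok[len("iommu="):].split(","))
    return {
        "amd_iommu_dump": "amd_iommu_dump" in tokens,
        "amd_iommu=fullflush": "fullflush" in amd_values,
        "iommu=pt": "pt" in generic_values,
    }


def read_cmdline(path="/proc/cmdline"):
    """Read the running kernel's command line (Linux only)."""
    with open(path) as f:
        return f.read().strip()


# Hypothetical command line, for illustration only.
EXAMPLE_CMDLINE = "root=/dev/sda1 ro amd_iommu_dump amd_iommu=fullflush"
```

On the affected machine, iommu_boot_options(read_cmdline()) should report amd_iommu_dump and amd_iommu=fullflush as present and iommu=pt as absent before the logs are collected.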


>
> One more thing I find curious, but this didn't change with "iommu=pt":
>>
>> [    0.000000] AGP: Checking aperture...
>> [    0.000000] AGP: No AGP bridge found
>> [    0.000000] AGP: Node 0: aperture [bus addr 0x00000000-0x01ffffff]
>> (32MB)
>> [    0.000000] AGP: Your BIOS doesn't leave an aperture memory hole
>> [    0.000000] AGP: Please enable the IOMMU option in the BIOS setup
>> [    0.000000] AGP: This costs you 64MB of RAM
>> [    0.000000] AGP: Mapping aperture over RAM [mem 0xcc000000-0xcfffffff]
>> (65536KB)
>
> I checked and the IOMMU-option is definitely enabled in the BIOS setup.
> So I assume right that these message are irrelevant (since AGP as a whole
> is irrelevant on this server)?

Please run 'cat /proc/iomem' and send the output.

>
> Regards,
>
> Lutz Vieweg
>
>
>
> _______________________________________________
> iommu mailing list
> iommu-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA@public.gmane.org
> https://lists.linuxfoundation.org/mailman/listinfo/iommu



-- 
---
Vincent Wan(Zongshun)
www.mcuos.com

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [E1000-devel] AMD-Vi: Event logged IO_PAGE_FAULT - ixgbe Detected Tx Unit Hang - Reset adapter - master disable timed out
       [not found]           ` <CAKgT0UeFM1jYTU83YFohxUHWuJeTYfWDpdFM2CDQCutmf_vXvA-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
@ 2016-06-13  9:08             ` Joerg Roedel
  2016-06-13 17:46               ` Lutz Vieweg
  0 siblings, 1 reply; 12+ messages in thread
From: Joerg Roedel @ 2016-06-13  9:08 UTC (permalink / raw)
  To: Alexander Duyck
  Cc: e1000-devel-5NWGOfrQmneRv+LV9MX5uipxlwaOVQ5f@public.gmane.org,
	open list:INTEL IOMMU (VT-d), Lutz Vieweg

On Thu, Jun 09, 2016 at 09:03:40AM -0700, Alexander Duyck wrote:
> >> Jun  9 14:40:09 computer kernel: AMD-Vi: Event logged [IO_PAGE_FAULT device=04:00.0 domain=0x000e address=0x00000000000178c0 flags=0x0050]
> >> Jun  9 14:40:09 computer kernel: AMD-Vi: Event logged [IO_PAGE_FAULT device=04:00.0 domain=0x000e address=0x0000000000017900 flags=0x0050]

Some more context would be helpful. Which kernel version was the last
that worked and with which version do you start to see these messages?


	Joerg

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [E1000-devel] AMD-Vi: Event logged IO_PAGE_FAULT - ixgbe Detected Tx Unit Hang - Reset adapter - master disable timed out
       [not found]                 ` <CAKT61h9cNnGDNugoWXYcpN1VjVK3Hn-VOW+TwHahj5EXzfsXgA-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
@ 2016-06-13 17:40                   ` Lutz Vieweg
       [not found]                     ` <575EEFFB.20004-i6VILw57VWU@public.gmane.org>
  2016-08-29 12:30                     ` Lutz Vieweg
  0 siblings, 2 replies; 12+ messages in thread
From: Lutz Vieweg @ 2016-06-13 17:40 UTC (permalink / raw)
  To: iommu-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA
  Cc: e1000-devel-5NWGOfrQmneRv+LV9MX5uipxlwaOVQ5f

On 06/13/2016 04:46 AM, Wan ZongShun wrote:
>> With "iommu=pt":
>>>
>>> [    4.832580] iommu: Adding device 0000:04:00.0 to group 13
>>> [    4.832838] iommu: Using direct mapping for device 0000:04:00.0
>>
>
> That is right, you will pass through AMD IOMMU when you set iommu=pt.
>
>> ...
>>>
>>> [    4.837074] AMD-Vi: Found IOMMU at 0000:00:00.2 cap 0x40
>>> [    4.837305] AMD-Vi: Found IOMMU at 0000:40:00.2 cap 0x40
>>> [    4.837535] AMD-Vi: Interrupt remapping enabled
>>> [    4.838062] AMD-Vi: Lazy IO/TLB flushing enabled
>>> [    4.838291] PCI-DMA: Using software bounce buffering for IO (SWIOTLB)
>>> [    4.838533] software IO TLB [mem 0xd3e80000-0xd7e80000] (64MB) mapped
>>> at [ffff8800d3e80000-ffff8800d7e7ffff]
>>
>>
>> I hope that doesn't mean all my network data is now passing through
>> an additional copy-by-CPU... that would be kind of the opposite of what
>> "iommu=pt" seemed to promise :-)
>
> It depends.
>
> Firstly, I need to know if your ethernet card works well now or not
> after you set iommu=pt.

Too early to tell - the NIC has now worked for the last 4 days without
failing; however, that is only about as long as it took after
the upgrade to linux-4.6.1 before the bug was first encountered.

I'd say celebration of "works with iommu=pt" has to wait for at least
two weeks or so before it is reasonably probable it works for this reason.

> If your ethernet card with 64bit(not 32bit) DMA addressable cap, that
> is ok, you will not be impacted by bounce buffer.

> But iommu=pt is a terrible option, that make all devices bypass the iommu.

Why is that terrible? The documentation I found on what iommu=pt actually
means was pretty scarce, but I noticed how many places recommend using
this option for 10G NICs.

> If you want to get further help, Please try:
>
> (1)Please add 'amd_iommu_dump' option in your kernel boot option, and
> send your full kernel logs, lspci info, don't add iommu=pt.
> (2) Add amd_iommu=fullflush option to kernel boot option, just try it.

Will try that when the NIC becomes unavailable again.

>> One more thing I find curious, but this didn't change with "iommu=pt":
>>>
>>> [    0.000000] AGP: Checking aperture...
>>> [    0.000000] AGP: No AGP bridge found
>>> [    0.000000] AGP: Node 0: aperture [bus addr 0x00000000-0x01ffffff]
>>> (32MB)
>>> [    0.000000] AGP: Your BIOS doesn't leave an aperture memory hole
>>> [    0.000000] AGP: Please enable the IOMMU option in the BIOS setup
>>> [    0.000000] AGP: This costs you 64MB of RAM
>>> [    0.000000] AGP: Mapping aperture over RAM [mem 0xcc000000-0xcfffffff]
>>> (65536KB)
>>
>> I checked and the IOMMU-option is definitely enabled in the BIOS setup.
>> So I assume right that these message are irrelevant (since AGP as a whole
>> is irrelevant on this server)?
>
> Please cat /proc/iomem, send the information.

Here it is:
> 00000000-00000fff : reserved
> 00001000-00097bff : System RAM
> 00097c00-0009ffff : reserved
> 000a0000-000bffff : PCI Bus 0000:00
> 000c0000-000c7fff : Video ROM
> 000ce800-000d43ff : Adapter ROM
> 000d4800-000d57ff : Adapter ROM
> 000e6000-000fffff : reserved
>   000f0000-000fffff : System ROM
> 00100000-d7e7ffff : System RAM
>   01000000-01688c05 : Kernel code
>   01688c06-01d4f53f : Kernel data
>   01eea000-02174fff : Kernel bss
> d7e80000-d7e8dfff : RAM buffer
> d7e8e000-d7e8ffff : reserved
> d7e90000-d7eb3fff : ACPI Tables
> d7eb4000-d7edffff : ACPI Non-volatile Storage
> d7ee0000-d7ffffff : reserved
> d9000000-daffffff : PCI Bus 0000:40
>   d9000000-d90003ff : IOAPIC 2
>   d9010000-d9013fff : amd_iommu
> db000000-dcffffff : PCI Bus 0000:00
>   db000000-dbffffff : PCI Bus 0000:01
>     db000000-dbffffff : 0000:01:04.0
>       db000000-dbffffff : mgadrmfb_vram
>   dcd00000-dcffffff : PCI Bus 0000:04
>     dcdfc000-dcdfffff : 0000:04:00.0
>       dcdfc000-dcdfffff : ixgbe
>     dce00000-dcffffff : 0000:04:00.0
>       dce00000-dcffffff : ixgbe
> dd000000-dfffffff : PCI Bus 0000:00
>   def00000-df7fffff : PCI Bus 0000:01
>     deffc000-deffffff : 0000:01:04.0
>       deffc000-deffffff : mgadrmfb_mmio
>     df000000-df7fffff : 0000:01:04.0
>   dfaf6000-dfaf6fff : 0000:00:12.1
>     dfaf6000-dfaf6fff : ohci_hcd
>   dfaf7000-dfaf7fff : 0000:00:12.0
>     dfaf7000-dfaf7fff : ohci_hcd
>   dfaf8400-dfaf87ff : 0000:00:11.0
>     dfaf8400-dfaf87ff : ahci
>   dfaf8800-dfaf88ff : 0000:00:12.2
>     dfaf8800-dfaf88ff : ehci_hcd
>   dfaf8c00-dfaf8cff : 0000:00:13.2
>     dfaf8c00-dfaf8cff : ehci_hcd
>   dfaf9000-dfaf9fff : 0000:00:13.1
>     dfaf9000-dfaf9fff : ohci_hcd
>   dfafa000-dfafafff : 0000:00:13.0
>     dfafa000-dfafafff : ohci_hcd
>   dfafb000-dfafbfff : 0000:00:14.5
>     dfafb000-dfafbfff : ohci_hcd
>   dfb00000-dfbfffff : PCI Bus 0000:02
>     dfb1c000-dfb1ffff : 0000:02:00.1
>       dfb1c000-dfb1ffff : igb
>     dfb20000-dfb3ffff : 0000:02:00.1
>     dfb40000-dfb5ffff : 0000:02:00.1
>       dfb40000-dfb5ffff : igb
>     dfb60000-dfb7ffff : 0000:02:00.1
>       dfb60000-dfb7ffff : igb
>     dfb9c000-dfb9ffff : 0000:02:00.0
>       dfb9c000-dfb9ffff : igb
>     dfba0000-dfbbffff : 0000:02:00.0
>     dfbc0000-dfbdffff : 0000:02:00.0
>       dfbc0000-dfbdffff : igb
>     dfbe0000-dfbfffff : 0000:02:00.0
>       dfbe0000-dfbfffff : igb
>   dfc00000-dfcfffff : PCI Bus 0000:03
>     dfc3c000-dfc3ffff : 0000:03:00.0
>       dfc3c000-dfc3ffff : mpt2sas
>     dfc40000-dfc7ffff : 0000:03:00.0
>       dfc40000-dfc7ffff : mpt2sas
>     dfc80000-dfcfffff : 0000:03:00.0
>   dfd00000-dfdfffff : PCI Bus 0000:04
>     dfd80000-dfdfffff : 0000:04:00.0
>   dfe00000-dfffffff : PCI Bus 0000:05
>     dfeb0000-dfebffff : 0000:05:00.0
>       dfeb0000-dfebffff : mpt2sas
>     dfec0000-dfefffff : 0000:05:00.0
>       dfec0000-dfefffff : mpt2sas
>     dff00000-dfffffff : 0000:05:00.0
> e0000000-efffffff : PCI MMCONFIG 0000 [bus 00-ff]
>   e0000000-efffffff : reserved
>     e0000000-efffffff : pnp 00:0a
> f6000000-f6003fff : amd_iommu
> fec00000-fec003ff : IOAPIC 0
> fec10000-fec1001f : pnp 00:04
> fec20000-fec203ff : IOAPIC 1
> fed00000-fed003ff : HPET 2
>   fed00000-fed003ff : PNP0103:00
> fed40000-fed44fff : PCI Bus 0000:00
> fee00000-fee00fff : Local APIC
>   fee00000-fee00fff : pnp 00:03
> ffb80000-ffbfffff : pnp 00:04
> ffe00000-ffffffff : reserved
>   ffe50000-ffe5e05f : pnp 00:04
> 100000000-2026ffffff : System RAM
> 2027000000-2027ffffff : RAM buffer
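
The listing above can be cross-checked against the lspci output from earlier in the thread. The following is a minimal sketch that parses /proc/iomem-style text and reports the size of each region claimed by a given driver. It assumes two spaces of indentation per nesting level and inclusive end addresses, as in the listing. parse_iomem and SAMPLE_IOMEM are illustrative names, and the sample is the ixgbe excerpt from above without the quote markers.

```python
import re

# "<indent><start>-<end> : <name>", addresses in hex without a 0x prefix.
IOMEM_LINE = re.compile(r"^(\s*)([0-9a-f]+)-([0-9a-f]+) : (.+)$")


def parse_iomem(text):
    """Parse /proc/iomem-style text into a list of region dicts."""
    entries = []
    for line in text.splitlines():
        m = IOMEM_LINE.match(line)
        if m is None:
            continue  # skip blank or unrecognised lines
        indent, start_hex, end_hex, name = m.groups()
        start, end = int(start_hex, 16), int(end_hex, 16)
        entries.append({
            "depth": len(indent) // 2,   # two spaces per nesting level
            "start": start,
            "end": end,
            "size": end - start + 1,     # end address is inclusive
            "name": name.strip(),
        })
    return entries


SAMPLE_IOMEM = """\
  dcd00000-dcffffff : PCI Bus 0000:04
    dcdfc000-dcdfffff : 0000:04:00.0
      dcdfc000-dcdfffff : ixgbe
    dce00000-dcffffff : 0000:04:00.0
      dce00000-dcffffff : ixgbe
"""

if __name__ == "__main__":
    for entry in parse_iomem(SAMPLE_IOMEM):
        if entry["name"] == "ixgbe":
            print(hex(entry["start"]), entry["size"] // 1024, "KiB")
```

The two ixgbe regions come out as 16 KiB at dcdfc000 and 2 MiB at dce00000, which agrees with Region 4 and Region 0 in the lspci output.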

Regards,

Lutz Vieweg

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: AMD-Vi: Event logged IO_PAGE_FAULT - ixgbe Detected Tx Unit Hang - Reset adapter - master disable timed out
  2016-06-13  9:08             ` [E1000-devel] " Joerg Roedel
@ 2016-06-13 17:46               ` Lutz Vieweg
  0 siblings, 0 replies; 12+ messages in thread
From: Lutz Vieweg @ 2016-06-13 17:46 UTC (permalink / raw)
  To: e1000-devel; +Cc: iommu

On 06/13/2016 11:08 AM, Joerg Roedel wrote:
> On Thu, Jun 09, 2016 at 09:03:40AM -0700, Alexander Duyck wrote:
>>>> Jun  9 14:40:09 computer kernel: AMD-Vi: Event logged [IO_PAGE_FAULT device=04:00.0 domain=0x000e address=0x00000000000178c0 flags=0x0050]
>>>> Jun  9 14:40:09 computer kernel: AMD-Vi: Event logged [IO_PAGE_FAULT device=04:00.0 domain=0x000e address=0x0000000000017900 flags=0x0050]
>
> Some more context would be helpful. Which kernel version was the last
> that worked and with which version do you start to see these messages?

Two servers were running linux-4.4.2 for many months,
both with 10Gbase-T NICs connected to the same switch, without
any such outage.

Both servers were recently upgraded to linux-4.6.1, and one
of the servers so far twice showed this "IO_PAGE_FAULT" symptom
within a period of ~ 7 days.

The hardware of the two servers is the same except for the
model of the Intel 10Gbase-T NIC: The server with the two failures
runs a fairly new
  Intel Corporation Ethernet Controller 10-Gigabit X540-AT2 (rev 01)
while the other server (without symptoms so far) runs a much older
  Intel Corporation 82598EB 10-Gigabit AT Network Connection (rev 01)
both using the same ixgbe driver module.

(Since both servers are working as a shared-nothing-cluster, they
pretty much do the same.)

Regards,

Lutz Vieweg



^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [E1000-devel] AMD-Vi: Event logged IO_PAGE_FAULT - ixgbe Detected Tx Unit Hang - Reset adapter - master disable timed out
       [not found]                     ` <575EEFFB.20004-i6VILw57VWU@public.gmane.org>
@ 2016-06-14  3:01                       ` Wan ZongShun
  2016-08-29 12:29                       ` Lutz Vieweg
                                         ` (3 subsequent siblings)
  4 siblings, 0 replies; 12+ messages in thread
From: Wan ZongShun @ 2016-06-14  3:01 UTC (permalink / raw)
  To: Lutz Vieweg
  Cc: e1000-devel-5NWGOfrQmneRv+LV9MX5uipxlwaOVQ5f,
	iommu-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA

2016-06-14 1:40 GMT+08:00 Lutz Vieweg <lvml-i6VILw57VWU@public.gmane.org>:
> On 06/13/2016 04:46 AM, Wan ZongShun wrote:
>>>
>>> With "iommu=pt":
>>>>
>>>>
>>>> [    4.832580] iommu: Adding device 0000:04:00.0 to group 13
>>>> [    4.832838] iommu: Using direct mapping for device 0000:04:00.0
>>>
>>>
>>
>> That is right, you will pass through AMD IOMMU when you set iommu=pt.
>>
>>> ...
>>>>
>>>>
>>>> [    4.837074] AMD-Vi: Found IOMMU at 0000:00:00.2 cap 0x40
>>>> [    4.837305] AMD-Vi: Found IOMMU at 0000:40:00.2 cap 0x40
>>>> [    4.837535] AMD-Vi: Interrupt remapping enabled
>>>> [    4.838062] AMD-Vi: Lazy IO/TLB flushing enabled
>>>> [    4.838291] PCI-DMA: Using software bounce buffering for IO (SWIOTLB)
>>>> [    4.838533] software IO TLB [mem 0xd3e80000-0xd7e80000] (64MB) mapped
>>>> at [ffff8800d3e80000-ffff8800d7e7ffff]
>>>
>>>
>>>
>>> I hope that doesn't mean all my network data is now passing through
>>> an additional copy-by-CPU... that would be kind of the opposite of what
>>> "iommu=pt" seemed to promise :-)
>>
>>
>> It depends.
>>
>> Firstly, I need to know if your ethernet card works well now or not
>> after you set iommu=pt.
>
>
> Too early to tell - the NIC worked for the last 4 days now without
> failing, however, that is only about the same time as it took after
> the upgrade to linux-4.6.1 before the bug was encountered, first.
>
> I'd say celebration of "works with iommu=pt" has to wait for at least
> two weeks or so before it is reasonably probable it works for this reason.
>
>> If your ethernet card with 64bit(not 32bit) DMA addressable cap, that
>> is ok, you will not be impacted by bounce buffer.
>
>
>> But iommu=pt is a terrible option, that make all devices bypass the iommu.
>
>
> Why is that terrible? The documentation I found on what iommu=pt actually
> means were pretty scarce, but I noticed how many places recommended to use
> this option for 10G NICs.

I suppose it will work well for your card after setting iommu=pt, but that
does not address the root cause of your issue.
iommu=pt just lets all devices in your system bypass the IOMMU; if
there are devices with only 32-bit DMA addressing capability in your system,
they will be impacted by the bounce buffer, which is bad for performance.
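
To illustrate the 32-bit versus 64-bit point, here is a deliberately simplified model of when a buffer would have to go through the bounce buffer once the IOMMU is bypassed. It is not the kernel's actual logic, only the address comparison behind it: a device can reach a buffer directly only if the whole buffer lies within its DMA mask. The function name needs_bounce and the example addresses are assumptions for this sketch. The 0x100000000 address is taken from the start of the upper "System RAM" range in the /proc/iomem listing posted in this thread.

```python
def needs_bounce(phys_addr, length, dma_mask_bits):
    """Return True if a buffer is not fully addressable by the device.

    Simplified model: the device can address physical addresses from 0 up
    to 2**dma_mask_bits - 1, and the buffer occupies
    [phys_addr, phys_addr + length - 1].
    """
    dma_mask = (1 << dma_mask_bits) - 1
    last_byte = phys_addr + length - 1
    return last_byte > dma_mask


if __name__ == "__main__":
    high_ram = 0x100000000  # first address above 4 GiB on this machine
    # A 32-bit-only device cannot reach memory above 4 GiB directly.
    print(needs_bounce(high_ram, 2048, 32))
    # A 64-bit-capable device such as the NIC in question can.
    print(needs_bounce(high_ram, 2048, 64))
    # Low memory is reachable even with a 32-bit mask.
    print(needs_bounce(0x00100000, 2048, 32))
```

With most of this server's RAM above the 4 GiB boundary, any 32-bit-only device would frequently hit the first case when running with iommu=pt.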


Wan Zongshun.

>
>> If you want to get further help, Please try:
>>
>> (1)Please add 'amd_iommu_dump' option in your kernel boot option, and
>> send your full kernel logs, lspci info, don't add iommu=pt.
>> (2) Add amd_iommu=fullflush option to kernel boot option, just try it.
>
>
> Will try that when the NIC becomes unavailable again.
>
>>> One more thing I find curious, but this didn't change with "iommu=pt":
>>>>
>>>>
>>>> [    0.000000] AGP: Checking aperture...
>>>> [    0.000000] AGP: No AGP bridge found
>>>> [    0.000000] AGP: Node 0: aperture [bus addr 0x00000000-0x01ffffff]
>>>> (32MB)
>>>> [    0.000000] AGP: Your BIOS doesn't leave an aperture memory hole
>>>> [    0.000000] AGP: Please enable the IOMMU option in the BIOS setup
>>>> [    0.000000] AGP: This costs you 64MB of RAM
>>>> [    0.000000] AGP: Mapping aperture over RAM [mem
>>>> 0xcc000000-0xcfffffff]
>>>> (65536KB)
>>>
>>>
>>> I checked and the IOMMU-option is definitely enabled in the BIOS setup.
>>> So I assume right that these message are irrelevant (since AGP as a whole
>>> is irrelevant on this server)?
>>
>>
>> Please cat /proc/iomem, send the information.
>

This AGP aperture should only be used by old GPUs, so I don't think it
affects this issue.
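
As a rough cross-check of the "Mapping aperture over RAM" message, the
following sketch looks up which /proc/iomem entries contain the aperture
range. The function names are hypothetical, and the sample text is just a
few lines copied from the listing quoted below.

```python
import re

# One /proc/iomem line: optional indent, "start-end : name" in hex.
LINE_RE = re.compile(r"^\s*([0-9a-f]+)-([0-9a-f]+) : (.+)$")

def parse_iomem(text):
    """Parse /proc/iomem-style text into (start, end, name) tuples."""
    entries = []
    for line in text.splitlines():
        m = LINE_RE.match(line)
        if m:
            entries.append((int(m.group(1), 16), int(m.group(2), 16),
                            m.group(3).strip()))
    return entries

def containing(entries, start, end):
    """Names of all entries that fully contain the range [start, end]."""
    return [name for s, e, name in entries if s <= start and end <= e]

sample = """\
00100000-d7e7ffff : System RAM
d9000000-daffffff : PCI Bus 0000:40
  d9010000-d9013fff : amd_iommu
  dcd00000-dcffffff : PCI Bus 0000:04
    dce00000-dcffffff : 0000:04:00.0
      dce00000-dcffffff : ixgbe
"""

# Aperture range from the AGP boot message: [mem 0xcc000000-0xcfffffff]
print(containing(parse_iomem(sample), 0xCC000000, 0xCFFFFFFF))
```

On the real machine the same check can be run against the full contents
of /proc/iomem (reading it with real addresses typically requires root).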

>
> Here it is:
>>
>> 00000000-00000fff : reserved
>> 00001000-00097bff : System RAM
>> 00097c00-0009ffff : reserved
>> 000a0000-000bffff : PCI Bus 0000:00
>> 000c0000-000c7fff : Video ROM
>> 000ce800-000d43ff : Adapter ROM
>> 000d4800-000d57ff : Adapter ROM
>> 000e6000-000fffff : reserved
>>   000f0000-000fffff : System ROM
>> 00100000-d7e7ffff : System RAM
>>   01000000-01688c05 : Kernel code
>>   01688c06-01d4f53f : Kernel data
>>   01eea000-02174fff : Kernel bss
>> d7e80000-d7e8dfff : RAM buffer
>> d7e8e000-d7e8ffff : reserved
>> d7e90000-d7eb3fff : ACPI Tables
>> d7eb4000-d7edffff : ACPI Non-volatile Storage
>> d7ee0000-d7ffffff : reserved
>> d9000000-daffffff : PCI Bus 0000:40
>>   d9000000-d90003ff : IOAPIC 2
>>   d9010000-d9013fff : amd_iommu
>> db000000-dcffffff : PCI Bus 0000:00
>>   db000000-dbffffff : PCI Bus 0000:01
>>     db000000-dbffffff : 0000:01:04.0
>>       db000000-dbffffff : mgadrmfb_vram
>>   dcd00000-dcffffff : PCI Bus 0000:04
>>     dcdfc000-dcdfffff : 0000:04:00.0
>>       dcdfc000-dcdfffff : ixgbe
>>     dce00000-dcffffff : 0000:04:00.0
>>       dce00000-dcffffff : ixgbe
>> dd000000-dfffffff : PCI Bus 0000:00
>>   def00000-df7fffff : PCI Bus 0000:01
>>     deffc000-deffffff : 0000:01:04.0
>>       deffc000-deffffff : mgadrmfb_mmio
>>     df000000-df7fffff : 0000:01:04.0
>>   dfaf6000-dfaf6fff : 0000:00:12.1
>>     dfaf6000-dfaf6fff : ohci_hcd
>>   dfaf7000-dfaf7fff : 0000:00:12.0
>>     dfaf7000-dfaf7fff : ohci_hcd
>>   dfaf8400-dfaf87ff : 0000:00:11.0
>>     dfaf8400-dfaf87ff : ahci
>>   dfaf8800-dfaf88ff : 0000:00:12.2
>>     dfaf8800-dfaf88ff : ehci_hcd
>>   dfaf8c00-dfaf8cff : 0000:00:13.2
>>     dfaf8c00-dfaf8cff : ehci_hcd
>>   dfaf9000-dfaf9fff : 0000:00:13.1
>>     dfaf9000-dfaf9fff : ohci_hcd
>>   dfafa000-dfafafff : 0000:00:13.0
>>     dfafa000-dfafafff : ohci_hcd
>>   dfafb000-dfafbfff : 0000:00:14.5
>>     dfafb000-dfafbfff : ohci_hcd
>>   dfb00000-dfbfffff : PCI Bus 0000:02
>>     dfb1c000-dfb1ffff : 0000:02:00.1
>>       dfb1c000-dfb1ffff : igb
>>     dfb20000-dfb3ffff : 0000:02:00.1
>>     dfb40000-dfb5ffff : 0000:02:00.1
>>       dfb40000-dfb5ffff : igb
>>     dfb60000-dfb7ffff : 0000:02:00.1
>>       dfb60000-dfb7ffff : igb
>>     dfb9c000-dfb9ffff : 0000:02:00.0
>>       dfb9c000-dfb9ffff : igb
>>     dfba0000-dfbbffff : 0000:02:00.0
>>     dfbc0000-dfbdffff : 0000:02:00.0
>>       dfbc0000-dfbdffff : igb
>>     dfbe0000-dfbfffff : 0000:02:00.0
>>       dfbe0000-dfbfffff : igb
>>   dfc00000-dfcfffff : PCI Bus 0000:03
>>     dfc3c000-dfc3ffff : 0000:03:00.0
>>       dfc3c000-dfc3ffff : mpt2sas
>>     dfc40000-dfc7ffff : 0000:03:00.0
>>       dfc40000-dfc7ffff : mpt2sas
>>     dfc80000-dfcfffff : 0000:03:00.0
>>   dfd00000-dfdfffff : PCI Bus 0000:04
>>     dfd80000-dfdfffff : 0000:04:00.0
>>   dfe00000-dfffffff : PCI Bus 0000:05
>>     dfeb0000-dfebffff : 0000:05:00.0
>>       dfeb0000-dfebffff : mpt2sas
>>     dfec0000-dfefffff : 0000:05:00.0
>>       dfec0000-dfefffff : mpt2sas
>>     dff00000-dfffffff : 0000:05:00.0
>> e0000000-efffffff : PCI MMCONFIG 0000 [bus 00-ff]
>>   e0000000-efffffff : reserved
>>     e0000000-efffffff : pnp 00:0a
>> f6000000-f6003fff : amd_iommu
>> fec00000-fec003ff : IOAPIC 0
>> fec10000-fec1001f : pnp 00:04
>> fec20000-fec203ff : IOAPIC 1
>> fed00000-fed003ff : HPET 2
>>   fed00000-fed003ff : PNP0103:00
>> fed40000-fed44fff : PCI Bus 0000:00
>> fee00000-fee00fff : Local APIC
>>   fee00000-fee00fff : pnp 00:03
>> ffb80000-ffbfffff : pnp 00:04
>> ffe00000-ffffffff : reserved
>>   ffe50000-ffe5e05f : pnp 00:04
>> 100000000-2026ffffff : System RAM
>> 2027000000-2027ffffff : RAM buffer
>
>
> Regards,
>
> Lutz Vieweg
>
> _______________________________________________
> iommu mailing list
> iommu-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA@public.gmane.org
> https://lists.linuxfoundation.org/mailman/listinfo/iommu



-- 
---
Vincent Wan(Zongshun)
www.mcuos.com

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [E1000-devel] AMD-Vi: Event logged IO_PAGE_FAULT - ixgbe Detected Tx Unit Hang - Reset adapter - master disable timed out
       [not found]                     ` <575EEFFB.20004-i6VILw57VWU@public.gmane.org>
  2016-06-14  3:01                       ` Wan ZongShun
@ 2016-08-29 12:29                       ` Lutz Vieweg
  2016-08-29 12:29                       ` Lutz Vieweg
                                         ` (2 subsequent siblings)
  4 siblings, 0 replies; 12+ messages in thread
From: Lutz Vieweg @ 2016-08-29 12:29 UTC (permalink / raw)
  To: iommu-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA
  Cc: e1000-devel-5NWGOfrQmneRv+LV9MX5uipxlwaOVQ5f

On 06/13/2016 07:40 PM, Lutz Vieweg wrote:
> On 06/13/2016 04:46 AM, Wan ZongShun wrote:
>> Firstly, I need to know if your ethernet card works well now or not
>> after you set iommu=pt.
>
> Too early to tell - the NIC worked for the last 4 days now without
> failing, however, that is only about the same time as it took after
> the upgrade to linux-4.6.1 before the bug was encountered, first.

I can now say that after using the option iommu=pt with linux-4.6.1,
the machine ran for > 2 months without problems.

For other reasons (btrfs-stuff) I had to upgrade the machine to
linux-4.7.2 last week, and the "iommu=pt" option wasn't active
after this upgrade.
It only took 4 days until the
  "AMD-Vi: Event logged IO_PAGE_FAULT...  ixgbe Detected Tx Unit Hang"
issue occurred again.

So this evening, I'll reboot linux-4.7.2 with "iommu=pt" again,
as that really seemed to help.
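
To avoid losing the option silently on the next kernel upgrade, a check
along the following lines could be run after boot. This is only a sketch:
the function names and the example command line are made up for
illustration, and it matches "iommu=pt" only as an exact token (it would
not recognise a combined form such as a comma-separated iommu= value).

```python
# /proc/cmdline is the standard Linux file that exposes the parameters
# the running kernel was booted with. Everything else here is hypothetical.

def has_boot_option(cmdline: str, option: str) -> bool:
    """Return True if option appears as a whole token in cmdline."""
    return option in cmdline.split()

def running_kernel_has(option: str, path: str = "/proc/cmdline") -> bool:
    """Check the command line of the currently running kernel."""
    with open(path) as f:
        return has_boot_option(f.read(), option)

# Invented example command lines, not taken from the affected machine.
with_pt = "BOOT_IMAGE=/vmlinuz-4.7.2 root=/dev/sda1 ro iommu=pt"
without_pt = "BOOT_IMAGE=/vmlinuz-4.7.2 root=/dev/sda1 ro"

print(has_boot_option(with_pt, "iommu=pt"))
print(has_boot_option(without_pt, "iommu=pt"))
```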

Regards,

Lutz Vieweg




^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [E1000-devel] AMD-Vi: Event logged IO_PAGE_FAULT - ixgbe Detected Tx Unit Hang - Reset adapter - master disable timed out
       [not found]                     ` <575EEFFB.20004-i6VILw57VWU@public.gmane.org>
                                         ` (3 preceding siblings ...)
  2016-08-29 12:30                       ` Lutz Vieweg
@ 2016-08-29 12:30                       ` Lutz Vieweg
  4 siblings, 0 replies; 12+ messages in thread
From: Lutz Vieweg @ 2016-08-29 12:30 UTC (permalink / raw)
  To: iommu-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA
  Cc: e1000-devel-5NWGOfrQmneRv+LV9MX5uipxlwaOVQ5f

On 06/13/2016 07:40 PM, Lutz Vieweg wrote:
> On 06/13/2016 04:46 AM, Wan ZongShun wrote:
>> Firstly, I need to know if your ethernet card works well now or not
>> after you set iommu=pt.
>
> Too early to tell: the NIC has worked for the last 4 days without
> failing; however, that is only about as long as it took after
> the upgrade to linux-4.6.1 before the bug was first encountered.

I can now say that after using the option iommu=pt with linux-4.6.1,
the machine ran for > 2 months without problems.

For other reasons (btrfs-stuff) I had to upgrade the machine to
linux-4.7.2 last week, and the "iommu=pt" option wasn't active
after this upgrade.
It only took 4 days until the
  "AMD-Vi: Event logged IO_PAGE_FAULT...  ixgbe Detected Tx Unit Hang"
issue occurred again.

So this evening, I'll reboot linux-4.7.2 with "iommu=pt" again,
as that really seemed to help.
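
Since the option was silently lost across the kernel upgrade, a small check
of the running kernel's command line may be worth running after each reboot.
The following is only a minimal sketch: the function names are hypothetical,
and it assumes that "iommu=" may carry a comma-separated list of values, of
which "pt" is one.

```python
def has_iommu_passthrough(cmdline: str) -> bool:
    """Return True if a kernel command line requests iommu=pt.

    Assumption: "iommu=" may hold several comma-separated values,
    so "iommu=noagp,pt" is treated the same as "iommu=pt".
    """
    for token in cmdline.split():
        key, sep, value = token.partition("=")
        # Match only the exact "iommu" key, not e.g. "amd_iommu".
        if key == "iommu" and sep and "pt" in value.split(","):
            return True
    return False


def read_cmdline(path: str = "/proc/cmdline") -> str:
    """Read the command line the running kernel was booted with."""
    with open(path, encoding="ascii", errors="replace") as f:
        return f.read()
```

Calling has_iommu_passthrough(read_cmdline()) on the booted machine reports
whether the option is actually in effect for the running kernel.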

Regards,

Lutz Vieweg



>> If your ethernet card has a 64-bit (not 32-bit) DMA addressing capability,
>> that is OK; you will not be impacted by the bounce buffer.
>
>> But iommu=pt is a terrible option; it makes all devices bypass the iommu.
>
> Why is that terrible? The documentation I found on what iommu=pt actually
> means was pretty scarce, but I noticed how many places recommended using
> this option for 10G NICs.
>
>> If you want to get further help, please try:
>>
>> (1) Add the 'amd_iommu_dump' option to your kernel boot options, and
>> send your full kernel logs and lspci info; don't add iommu=pt.
>> (2) Add the amd_iommu=fullflush option to the kernel boot options; just try it.
>
> Will try that when the NIC becomes unavailable again.
>
>>> One more thing I find curious, but this didn't change with "iommu=pt":
>>>>
>>>> [    0.000000] AGP: Checking aperture...
>>>> [    0.000000] AGP: No AGP bridge found
>>>> [    0.000000] AGP: Node 0: aperture [bus addr 0x00000000-0x01ffffff]
>>>> (32MB)
>>>> [    0.000000] AGP: Your BIOS doesn't leave an aperture memory hole
>>>> [    0.000000] AGP: Please enable the IOMMU option in the BIOS setup
>>>> [    0.000000] AGP: This costs you 64MB of RAM
>>>> [    0.000000] AGP: Mapping aperture over RAM [mem 0xcc000000-0xcfffffff]
>>>> (65536KB)
>>>
>>> I checked and the IOMMU-option is definitely enabled in the BIOS setup.
>>> So am I right to assume that these messages are irrelevant (since AGP as a
>>> whole is irrelevant on this server)?
>>
>> Please cat /proc/iomem, send the information.
>
> Here it is:
>> 00000000-00000fff : reserved
>> 00001000-00097bff : System RAM
>> 00097c00-0009ffff : reserved
>> 000a0000-000bffff : PCI Bus 0000:00
>> 000c0000-000c7fff : Video ROM
>> 000ce800-000d43ff : Adapter ROM
>> 000d4800-000d57ff : Adapter ROM
>> 000e6000-000fffff : reserved
>>   000f0000-000fffff : System ROM
>> 00100000-d7e7ffff : System RAM
>>   01000000-01688c05 : Kernel code
>>   01688c06-01d4f53f : Kernel data
>>   01eea000-02174fff : Kernel bss
>> d7e80000-d7e8dfff : RAM buffer
>> d7e8e000-d7e8ffff : reserved
>> d7e90000-d7eb3fff : ACPI Tables
>> d7eb4000-d7edffff : ACPI Non-volatile Storage
>> d7ee0000-d7ffffff : reserved
>> d9000000-daffffff : PCI Bus 0000:40
>>   d9000000-d90003ff : IOAPIC 2
>>   d9010000-d9013fff : amd_iommu
>> db000000-dcffffff : PCI Bus 0000:00
>>   db000000-dbffffff : PCI Bus 0000:01
>>     db000000-dbffffff : 0000:01:04.0
>>       db000000-dbffffff : mgadrmfb_vram
>>   dcd00000-dcffffff : PCI Bus 0000:04
>>     dcdfc000-dcdfffff : 0000:04:00.0
>>       dcdfc000-dcdfffff : ixgbe
>>     dce00000-dcffffff : 0000:04:00.0
>>       dce00000-dcffffff : ixgbe
>> dd000000-dfffffff : PCI Bus 0000:00
>>   def00000-df7fffff : PCI Bus 0000:01
>>     deffc000-deffffff : 0000:01:04.0
>>       deffc000-deffffff : mgadrmfb_mmio
>>     df000000-df7fffff : 0000:01:04.0
>>   dfaf6000-dfaf6fff : 0000:00:12.1
>>     dfaf6000-dfaf6fff : ohci_hcd
>>   dfaf7000-dfaf7fff : 0000:00:12.0
>>     dfaf7000-dfaf7fff : ohci_hcd
>>   dfaf8400-dfaf87ff : 0000:00:11.0
>>     dfaf8400-dfaf87ff : ahci
>>   dfaf8800-dfaf88ff : 0000:00:12.2
>>     dfaf8800-dfaf88ff : ehci_hcd
>>   dfaf8c00-dfaf8cff : 0000:00:13.2
>>     dfaf8c00-dfaf8cff : ehci_hcd
>>   dfaf9000-dfaf9fff : 0000:00:13.1
>>     dfaf9000-dfaf9fff : ohci_hcd
>>   dfafa000-dfafafff : 0000:00:13.0
>>     dfafa000-dfafafff : ohci_hcd
>>   dfafb000-dfafbfff : 0000:00:14.5
>>     dfafb000-dfafbfff : ohci_hcd
>>   dfb00000-dfbfffff : PCI Bus 0000:02
>>     dfb1c000-dfb1ffff : 0000:02:00.1
>>       dfb1c000-dfb1ffff : igb
>>     dfb20000-dfb3ffff : 0000:02:00.1
>>     dfb40000-dfb5ffff : 0000:02:00.1
>>       dfb40000-dfb5ffff : igb
>>     dfb60000-dfb7ffff : 0000:02:00.1
>>       dfb60000-dfb7ffff : igb
>>     dfb9c000-dfb9ffff : 0000:02:00.0
>>       dfb9c000-dfb9ffff : igb
>>     dfba0000-dfbbffff : 0000:02:00.0
>>     dfbc0000-dfbdffff : 0000:02:00.0
>>       dfbc0000-dfbdffff : igb
>>     dfbe0000-dfbfffff : 0000:02:00.0
>>       dfbe0000-dfbfffff : igb
>>   dfc00000-dfcfffff : PCI Bus 0000:03
>>     dfc3c000-dfc3ffff : 0000:03:00.0
>>       dfc3c000-dfc3ffff : mpt2sas
>>     dfc40000-dfc7ffff : 0000:03:00.0
>>       dfc40000-dfc7ffff : mpt2sas
>>     dfc80000-dfcfffff : 0000:03:00.0
>>   dfd00000-dfdfffff : PCI Bus 0000:04
>>     dfd80000-dfdfffff : 0000:04:00.0
>>   dfe00000-dfffffff : PCI Bus 0000:05
>>     dfeb0000-dfebffff : 0000:05:00.0
>>       dfeb0000-dfebffff : mpt2sas
>>     dfec0000-dfefffff : 0000:05:00.0
>>       dfec0000-dfefffff : mpt2sas
>>     dff00000-dfffffff : 0000:05:00.0
>> e0000000-efffffff : PCI MMCONFIG 0000 [bus 00-ff]
>>   e0000000-efffffff : reserved
>>     e0000000-efffffff : pnp 00:0a
>> f6000000-f6003fff : amd_iommu
>> fec00000-fec003ff : IOAPIC 0
>> fec10000-fec1001f : pnp 00:04
>> fec20000-fec203ff : IOAPIC 1
>> fed00000-fed003ff : HPET 2
>>   fed00000-fed003ff : PNP0103:00
>> fed40000-fed44fff : PCI Bus 0000:00
>> fee00000-fee00fff : Local APIC
>>   fee00000-fee00fff : pnp 00:03
>> ffb80000-ffbfffff : pnp 00:04
>> ffe00000-ffffffff : reserved
>>   ffe50000-ffe5e05f : pnp 00:04
>> 100000000-2026ffffff : System RAM
>> 2027000000-2027ffffff : RAM buffer
>
> Regards,
>
> Lutz Vieweg
>


end of thread

Thread overview: 12+ messages
     [not found] <loom.20160606T232112-817@post.gmane.org>
     [not found] ` <CAKgT0UfEGS_QzM1phGKRV1hDgcnAwX-BqMkyQ6KJUOv82_kCiA@mail.gmail.com>
     [not found]   ` <nj64hf$9v5$1@ger.gmane.org>
     [not found]     ` <CAKgT0UfKUkrsXqLm4KdjXgLZ6QXZp5Rf-yYA3pBSzc1=ghJ4CQ@mail.gmail.com>
     [not found]       ` <njbvjb$40r$1@ger.gmane.org>
2016-06-09 16:03         ` [E1000-devel] AMD-Vi: Event logged IO_PAGE_FAULT - ixgbe Detected Tx Unit Hang - Reset adapter - master disable timed out Alexander Duyck
2016-06-09 16:57           ` Lutz Vieweg
     [not found]             ` <5759A009.8040200-i6VILw57VWU@public.gmane.org>
2016-06-13  2:46               ` [E1000-devel] " Wan ZongShun
     [not found]                 ` <CAKT61h9cNnGDNugoWXYcpN1VjVK3Hn-VOW+TwHahj5EXzfsXgA-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
2016-06-13 17:40                   ` Lutz Vieweg
     [not found]                     ` <575EEFFB.20004-i6VILw57VWU@public.gmane.org>
2016-06-14  3:01                       ` Wan ZongShun
2016-08-29 12:29                       ` Lutz Vieweg
2016-08-29 12:29                       ` Lutz Vieweg
2016-08-29 12:30                       ` Lutz Vieweg
2016-08-29 12:30                       ` Lutz Vieweg
2016-08-29 12:30                     ` Lutz Vieweg
     [not found]           ` <CAKgT0UeFM1jYTU83YFohxUHWuJeTYfWDpdFM2CDQCutmf_vXvA-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
2016-06-13  9:08             ` [E1000-devel] " Joerg Roedel
2016-06-13 17:46               ` Lutz Vieweg
