From: Jochen Henneberg <jh@henneberg-systemdesign.com>
To: Giuseppe Cavallaro <peppe.cavallaro@st.com>,
Alexandre Torgue <alexandre.torgue@foss.st.com>,
Jose Abreu <joabreu@synopsys.com>,
"David S. Miller" <davem@davemloft.net>,
Eric Dumazet <edumazet@google.com>,
Jakub Kicinski <kuba@kernel.org>,
Paolo Abeni <pabeni@redhat.com>,
Maxime Coquelin <mcoquelin.stm32@gmail.com>,
Ong Boon Leong <boon.leong.ong@intel.com>,
netdev@vger.kernel.org, linux-stm32@st-md-mailman.stormreply.com,
linux-arm-kernel@lists.infradead.org,
linux-kernel@vger.kernel.org
Subject: Re: Issue: stmmac reset by netdev watchdog due to tx queue timeout
Date: Tue, 14 Mar 2023 13:40:11 +0100 [thread overview]
Message-ID: <87cz5bh5ew.fsf@henneberg-systemdesign.com> (raw)
In-Reply-To: <87cz6164ld.fsf@henneberg-systemdesign.com>
Jochen Henneberg <jh@henneberg-systemdesign.com> writes:
> Jochen Henneberg <jh@henneberg-systemdesign.com> writes:
>
>> I have been debugging an issue with the stmmac network driver and the
>> Intel Elkhart Lake SoC for quite some time now. The problem comes up
>> when a port forwarding is configured with iptables NAT like this:
>>
>> iptables -t nat -A PREROUTING -p tcp --dport 222 \
>> -j DNAT --to-destination 192.168.178.134:22
>> iptables -t nat -A POSTROUTING -p tcp --dst 192.168.178.134 \
>> --dport 22 j SNAT --to-source 192.168.178.138
>>
>> If I 'ssh -p 222 192.168.178.138' the sttmac is reset after some seconds
>> with:
>>
>> [ 553.050018] NETDEV WATCHDOG: eno1 (intel-eth-pci): transmit queue 0
>> timed out
>> [ 553.050048] WARNING: CPU: 2 PID: 0 at net/sched/sch_generic.c:525
>> dev_watchdog+0x23a/0x250
>> [ 553.050059] Modules linked in: nft_chain_nat xt_nat nf_nat
>> nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 xt_tcpudp nft_compat
>> nf_tables libcrc32c nfnetlink snd_seq_dummy snd_hrtimer
>> snd_hda_codec_hdmi snd_hda_codec_idt snd_hda_codec_generic
>> ledtrig_audio snd_sof_pci_intel_tgl snd_sof_intel_hda_common
>> soundwire_intel soundwire_generic_allocation soundwire_cadence
>> snd_sof_intel_hda snd_sof_pci snd_sof_xtensa_dsp snd_sof snd_sof_utils
>> snd_soc_hdac_hda snd_hda_ext_core snd_soc_acpi_intel_match binfmt_misc
>> snd_soc_acpi soundwire_bus snd_soc_core snd_compress ac97_bus
>> snd_pcm_dmaengine snd_hda_intel snd_intel_dspcfg snd_intel_sdw_acpi
>> snd_hda_codec intel_rapl_msr snd_hda_core intel_rapl_common snd_hwdep
>> x86_pkg_temp_thermal intel_powerclamp snd_pcm coretemp nls_iso8859_1
>> snd_seq_midi kvm_intel snd_seq_midi_event snd_rawmidi mei_hdcp i915
>> kvm mei_pxp snd_seq crct10dif_pclmul ghash_clmulni_intel
>> snd_seq_device sha512_ssse3 drm_buddy ttm cmdlinepart aesni_intel
>> snd_timer spi_nor crypto_simd
>> [ 553.050187] drm_display_helper mtd cryptd snd intel_cstate cec
>> rc_core joydev intel_wmi_thunderbolt drm_kms_helper soundcore
>> i2c_algo_bit mei_me syscopyarea input_leds sysfillrect mei 8250_dw
>> sysimgblt igen6_edac mac_hid intel_hid acpi_pad acpi_tad sparse_keymap
>> msr parport_pc ppdev lp parport drm pstore_blk ramoops pstore_zone
>> reed_solomon efi_pstore ip_tables x_tables autofs4 hid_logitech_hidpp
>> hid_logitech_dj hid_generic usbhid uas hid usb_storage mxl_gpy
>> polynomial mmc_block dwmac_intel spi_intel_pci i2c_i801 intel_ish_ipc
>> gpio_kempld i2c_kempld crc32_pclmul spi_intel intel_ishtp i2c_smbus
>> stmmac intel_lpss_pci ahci sdhci_pci xhci_pci intel_lpss
>> xhci_pci_renesas pcs_xpcs cqhci libahci video phylink sdhci idma64
>> kempld_core i2c_scmi wmi pinctrl_elkhartlake(+)
>> [ 553.050354] CPU: 2 PID: 0 Comm: swapper/2 Tainted: G D 6.2.0-rc8+
>> #22
>> [ 553.050360] Hardware name: Default string Default string/COMe-mEL10
>> E2, BIOS MEL1R904 11/02/2022
>> [ 553.050363] RIP: 0010:dev_watchdog+0x23a/0x250
>> [ 553.050370] Code: 00 e9 2b ff ff ff 48 89 df c6 05 de 01 83 01 01 e8
>> fb 25 f8 ff 44 89 f1 48 89 de 48 c7 c7 a8 cd c0 a6 48 89 c2 e8 24 53
>> 20 00 <0f> 0b e9 1c ff ff ff 66 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 40
>> 00
>> [ 553.050373] RSP: 0018:ffffbfc34018ce38 EFLAGS: 00010246
>> [ 553.050378] RAX: 0000000000000000 RBX: ffffa0d392b68000 RCX:
>> 0000000000000000
>> [ 553.050381] RDX: 0000000000000000 RSI: 0000000000000000 RDI:
>> 0000000000000000
>> [ 553.050384] RBP: ffffbfc34018ce68 R08: 0000000000000000 R09:
>> 0000000000000000
>> [ 553.050386] R10: 0000000000000000 R11: 0000000000000000 R12:
>> ffffa0d392b684c8
>> [ 553.050389] R13: ffffa0d392b6841c R14: 0000000000000000 R15:
>> 0000000000000000
>> [ 553.050392] FS: 0000000000000000(0000) GS:ffffa0d4e4300000(0000)
>> knlGS:0000000000000000
>> [ 553.050395] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
>> [ 553.050398] CR2: 000055da1a873000 CR3: 000000023fc10000 CR4:
>> 0000000000350ee0
>> [ 553.050402] Call Trace:
>> [ 553.050406] <IRQ>
>> [ 553.050413] ? __pfx_dev_watchdog+0x10/0x10
>> [ 553.050419] call_timer_fn+0x29/0x160
>> [ 553.050425] ? __pfx_dev_watchdog+0x10/0x10
>> [ 553.050429] __run_timers+0x259/0x310
>> [ 553.050434] run_timer_softirq+0x1d/0x40
>> [ 553.050437] __do_softirq+0xd6/0x346
>> [ 553.050444] ? hrtimer_interrupt+0x11f/0x230
>> [ 553.050449] __irq_exit_rcu+0xa2/0xd0
>> [ 553.050455] irq_exit_rcu+0xe/0x20
>> [ 553.050459] sysvec_apic_timer_interrupt+0x92/0xd0
>> [ 553.050465] </IRQ>
>> [ 553.050467] <TASK>
>> [ 553.050469] asm_sysvec_apic_timer_interrupt+0x1b/0x20
>> [ 553.050475] RIP: 0010:cpuidle_enter_state+0xde/0x6f0
>> [ 553.050481] Code: 7f 1b 5a e8 14 2f 4e ff 8b 53 04 49 89 c7 0f 1f 44
>> 00 00 31 ff e8 22 40 4d ff 80 7d d0 00 0f 85 eb 00 00 00 fb 0f 1f 44
>> 00 00 <45> 85 f6 0f 88 12 02 00 00 4d 63 ee 49 83 fd 09 0f 87 c7 04 00
>> 00
>> [ 553.050484] RSP: 0018:ffffbfc340113e38 EFLAGS: 00000246
>> [ 553.050489] RAX: 0000000000000000 RBX: ffffa0d4e433c930 RCX:
>> 0000000000000000
>> [ 553.050492] RDX: 0000000000000002 RSI: 0000000000000000 RDI:
>> 0000000000000000
>> [ 553.050494] RBP: ffffbfc340113e88 R08: 0000000000000000 R09:
>> 0000000000000000
>> [ 553.050496] R10: 0000000000000000 R11: 0000000000000000 R12:
>> ffffffffa76bb700
>> [ 553.050499] R13: 0000000000000003 R14: 0000000000000003 R15:
>> 00000080c45910fe
>> [ 553.050504] ? cpuidle_enter_state+0xce/0x6f0
>> [ 553.050508] cpuidle_enter+0x2e/0x50
>> [ 553.050512] do_idle+0x216/0x2a0
>> [ 553.050517] cpu_startup_entry+0x1d/0x20
>> [ 553.050521] start_secondary+0x122/0x160
>> [ 553.050527] secondary_startup_64_no_verify+0xe5/0xeb
>> [ 553.050535] </TASK>
>> [ 553.050537] ---[ end trace 0000000000000000 ]---
>> [ 553.050584] intel-eth-pci 0000:00:1d.1 eno1: Reset adapter.
>> [ 553.064639] intel-eth-pci 0000:00:1d.1 eno1: FPE workqueue stop
>> [ 553.065416] intel-eth-pci 0000:00:1d.1 eno1: Register
>> MEM_TYPE_PAGE_POOL RxQ-0
>> [ 553.066003] intel-eth-pci 0000:00:1d.1 eno1: Register
>> MEM_TYPE_PAGE_POOL RxQ-1
>> [ 553.066469] intel-eth-pci 0000:00:1d.1 eno1: Register
>> MEM_TYPE_PAGE_POOL RxQ-2
>> [ 553.066933] intel-eth-pci 0000:00:1d.1 eno1: Register
>> MEM_TYPE_PAGE_POOL RxQ-3
>> [ 553.067428] intel-eth-pci 0000:00:1d.1 eno1: Register
>> MEM_TYPE_PAGE_POOL RxQ-4
>> [ 553.067895] intel-eth-pci 0000:00:1d.1 eno1: Register
>> MEM_TYPE_PAGE_POOL RxQ-5
>> [ 553.068466] intel-eth-pci 0000:00:1d.1 eno1: Register
>> MEM_TYPE_PAGE_POOL RxQ-6
>> [ 553.069025] intel-eth-pci 0000:00:1d.1 eno1: Register
>> MEM_TYPE_PAGE_POOL RxQ-7
>> [ 553.082011] dwmac4: Master AXI performs any burst length
>> [ 553.082094] intel-eth-pci 0000:00:1d.1 eno1: Enabling Safety Features
>> [ 553.082138] intel-eth-pci 0000:00:1d.1 eno1: IEEE 1588-2008 Advanced
>> Timestamp supported
>> [ 553.082438] intel-eth-pci 0000:00:1d.1 eno1: registered PTP clock
>> [ 553.082641] intel-eth-pci 0000:00:1d.1 eno1: FPE workqueue start
>> [ 553.082649] intel-eth-pci 0000:00:1d.1 eno1: configuring for
>> inband/sgmii link mode
>> [ 553.083700] intel-eth-pci 0000:00:1d.1 eno1: Link is Up - 1Gbps/Full
>> - flow control off
>>
>> This does not happen during normal ssh to the EHL board, this does not
>> happen if I do port forwarding through userspace, e. g. with socat. And
>> this does not happen if two independent network interfaces are used in
>> the iptables rules. I have not observed the issue with other network
>> chips with the same kernel version so I think the issue must be in the
>> stmmac driver.
>>
>> What happens is that the OWNED bit of a DMA descriptor is not reset
>> (detected in dwmac4_wrback_get_tx_status()) and the queue's txtimer is
>> looping endlessly until the timeout resets everything. Sometimes the
>> situation is solved before the timeout, most of the times it is not but
>> i always takes long until the OWNED bit is reset. This can be reproduced
>> with 100% success.
>>
>> I have tried to understand where the issue comes from and I think it may
>> have something to do with the calls to dma_wmb() and wmb() (or more
>> precisely with missing barriers) but so far I had no success to solve it
>> and provide a patch.
>>
>> The kernel that I am running is a quite recent linux-net
>> (b60417a9f2b8). I have also tried with linux-net-next without
>> success. The issue can already be observed with v5.15.39.
>>
>> Any help or suggestion how I can debug this further would be
>> appreciated. Or if somebody else can reproduce or not reproduce the
>> issue on the given platform may help as well.
>>
>> Regards
>> -Jochen
>
> Some more details about the devices involved:
>
> intel-eth-pci 0000:00:1d.1: enabling device (0000 -> 0002)
> intel-eth-pci 0000:00:1d.1: stmmac_config_multi_msi: multi MSI enablement successful
> intel-eth-pci 0000:00:1d.1: User ID: 0x51, Synopsys ID: 0x52
> intel-eth-pci 0000:00:1d.1: DWMAC4/5
> intel-eth-pci 0000:00:1d.1: DMA HW capability register supported
> intel-eth-pci 0000:00:1d.1: RX Checksum Offload Engine supported
> intel-eth-pci 0000:00:1d.1: TX Checksum insertion supported
> intel-eth-pci 0000:00:1d.1: TSO supported
> intel-eth-pci 0000:00:1d.1: Enable RX Mitigation via HW Watchdog Timer
> intel-eth-pci 0000:00:1d.1: device MAC address 00:e0:4b:74:93:db
> intel-eth-pci 0000:00:1d.1: Enabled L3L4 Flow TC (entries=2)
> intel-eth-pci 0000:00:1d.1: Enabled RFS Flow TC (entries=10)
> intel-eth-pci 0000:00:1d.1: Enabling HW TC (entries=256, max_off=256)
> intel-eth-pci 0000:00:1d.1: TSO feature enabled
> intel-eth-pci 0000:00:1d.1: Using 32 bits DMA width
> Maxlinear Ethernet GPY115B stmmac-2:01: Firmware Version: 7.110 (0x876E)
> Maxlinear Ethernet GPY115B stmmac-2:01: attached PHY driver (mii_bus:phy_addr=stmmac-2:01>
> intel-eth-pci 0000:00:1d.1 eno1: renamed from eth0
> intel-eth-pci 0000:00:1d.1 eno1: Register MEM_TYPE_PAGE_POOL RxQ-0
> intel-eth-pci 0000:00:1d.1 eno1: Register MEM_TYPE_PAGE_POOL RxQ-1
> intel-eth-pci 0000:00:1d.1 eno1: Register MEM_TYPE_PAGE_POOL RxQ-2
> intel-eth-pci 0000:00:1d.1 eno1: Register MEM_TYPE_PAGE_POOL RxQ-3
> intel-eth-pci 0000:00:1d.1 eno1: Register MEM_TYPE_PAGE_POOL RxQ-4
> intel-eth-pci 0000:00:1d.1 eno1: Register MEM_TYPE_PAGE_POOL RxQ-5
> intel-eth-pci 0000:00:1d.1 eno1: Register MEM_TYPE_PAGE_POOL RxQ-6
> intel-eth-pci 0000:00:1d.1 eno1: Register MEM_TYPE_PAGE_POOL RxQ-7
> intel-eth-pci 0000:00:1d.1 eno1: Enabling Safety Features
> intel-eth-pci 0000:00:1d.1 eno1: IEEE 1588-2008 Advanced Timestamp supported
> intel-eth-pci 0000:00:1d.1 eno1: registered PTP clock
> intel-eth-pci 0000:00:1d.1 eno1: FPE workqueue start
> intel-eth-pci 0000:00:1d.1 eno1: configuring for inband/sgmii link mode
> intel-eth-pci 0000:00:1d.1 eno1: Link is Up - 1Gbps/Full - flow control off
>
No further help needed. I have found the root cause and will provide a
fix.
-Jochen
_______________________________________________
linux-arm-kernel mailing list
linux-arm-kernel@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/linux-arm-kernel
WARNING: multiple messages have this Message-ID (diff)
From: Jochen Henneberg <jh@henneberg-systemdesign.com>
To: Giuseppe Cavallaro <peppe.cavallaro@st.com>,
Alexandre Torgue <alexandre.torgue@foss.st.com>,
Jose Abreu <joabreu@synopsys.com>,
"David S. Miller" <davem@davemloft.net>,
Eric Dumazet <edumazet@google.com>,
Jakub Kicinski <kuba@kernel.org>, Paolo Abeni <pabeni@redhat.com>,
Maxime Coquelin <mcoquelin.stm32@gmail.com>,
Ong Boon Leong <boon.leong.ong@intel.com>,
netdev@vger.kernel.org, linux-stm32@st-md-mailman.stormreply.com,
linux-arm-kernel@lists.infradead.org,
linux-kernel@vger.kernel.org
Subject: Re: Issue: stmmac reset by netdev watchdog due to tx queue timeout
Date: Tue, 14 Mar 2023 13:40:11 +0100 [thread overview]
Message-ID: <87cz5bh5ew.fsf@henneberg-systemdesign.com> (raw)
In-Reply-To: <87cz6164ld.fsf@henneberg-systemdesign.com>
Jochen Henneberg <jh@henneberg-systemdesign.com> writes:
> Jochen Henneberg <jh@henneberg-systemdesign.com> writes:
>
>> I have been debugging an issue with the stmmac network driver and the
>> Intel Elkhart Lake SoC for quite some time now. The problem comes up
>> when a port forwarding is configured with iptables NAT like this:
>>
>> iptables -t nat -A PREROUTING -p tcp --dport 222 \
>> -j DNAT --to-destination 192.168.178.134:22
>> iptables -t nat -A POSTROUTING -p tcp --dst 192.168.178.134 \
>> --dport 22 j SNAT --to-source 192.168.178.138
>>
>> If I 'ssh -p 222 192.168.178.138' the sttmac is reset after some seconds
>> with:
>>
>> [ 553.050018] NETDEV WATCHDOG: eno1 (intel-eth-pci): transmit queue 0
>> timed out
>> [ 553.050048] WARNING: CPU: 2 PID: 0 at net/sched/sch_generic.c:525
>> dev_watchdog+0x23a/0x250
>> [ 553.050059] Modules linked in: nft_chain_nat xt_nat nf_nat
>> nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 xt_tcpudp nft_compat
>> nf_tables libcrc32c nfnetlink snd_seq_dummy snd_hrtimer
>> snd_hda_codec_hdmi snd_hda_codec_idt snd_hda_codec_generic
>> ledtrig_audio snd_sof_pci_intel_tgl snd_sof_intel_hda_common
>> soundwire_intel soundwire_generic_allocation soundwire_cadence
>> snd_sof_intel_hda snd_sof_pci snd_sof_xtensa_dsp snd_sof snd_sof_utils
>> snd_soc_hdac_hda snd_hda_ext_core snd_soc_acpi_intel_match binfmt_misc
>> snd_soc_acpi soundwire_bus snd_soc_core snd_compress ac97_bus
>> snd_pcm_dmaengine snd_hda_intel snd_intel_dspcfg snd_intel_sdw_acpi
>> snd_hda_codec intel_rapl_msr snd_hda_core intel_rapl_common snd_hwdep
>> x86_pkg_temp_thermal intel_powerclamp snd_pcm coretemp nls_iso8859_1
>> snd_seq_midi kvm_intel snd_seq_midi_event snd_rawmidi mei_hdcp i915
>> kvm mei_pxp snd_seq crct10dif_pclmul ghash_clmulni_intel
>> snd_seq_device sha512_ssse3 drm_buddy ttm cmdlinepart aesni_intel
>> snd_timer spi_nor crypto_simd
>> [ 553.050187] drm_display_helper mtd cryptd snd intel_cstate cec
>> rc_core joydev intel_wmi_thunderbolt drm_kms_helper soundcore
>> i2c_algo_bit mei_me syscopyarea input_leds sysfillrect mei 8250_dw
>> sysimgblt igen6_edac mac_hid intel_hid acpi_pad acpi_tad sparse_keymap
>> msr parport_pc ppdev lp parport drm pstore_blk ramoops pstore_zone
>> reed_solomon efi_pstore ip_tables x_tables autofs4 hid_logitech_hidpp
>> hid_logitech_dj hid_generic usbhid uas hid usb_storage mxl_gpy
>> polynomial mmc_block dwmac_intel spi_intel_pci i2c_i801 intel_ish_ipc
>> gpio_kempld i2c_kempld crc32_pclmul spi_intel intel_ishtp i2c_smbus
>> stmmac intel_lpss_pci ahci sdhci_pci xhci_pci intel_lpss
>> xhci_pci_renesas pcs_xpcs cqhci libahci video phylink sdhci idma64
>> kempld_core i2c_scmi wmi pinctrl_elkhartlake(+)
>> [ 553.050354] CPU: 2 PID: 0 Comm: swapper/2 Tainted: G D 6.2.0-rc8+
>> #22
>> [ 553.050360] Hardware name: Default string Default string/COMe-mEL10
>> E2, BIOS MEL1R904 11/02/2022
>> [ 553.050363] RIP: 0010:dev_watchdog+0x23a/0x250
>> [ 553.050370] Code: 00 e9 2b ff ff ff 48 89 df c6 05 de 01 83 01 01 e8
>> fb 25 f8 ff 44 89 f1 48 89 de 48 c7 c7 a8 cd c0 a6 48 89 c2 e8 24 53
>> 20 00 <0f> 0b e9 1c ff ff ff 66 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 40
>> 00
>> [ 553.050373] RSP: 0018:ffffbfc34018ce38 EFLAGS: 00010246
>> [ 553.050378] RAX: 0000000000000000 RBX: ffffa0d392b68000 RCX:
>> 0000000000000000
>> [ 553.050381] RDX: 0000000000000000 RSI: 0000000000000000 RDI:
>> 0000000000000000
>> [ 553.050384] RBP: ffffbfc34018ce68 R08: 0000000000000000 R09:
>> 0000000000000000
>> [ 553.050386] R10: 0000000000000000 R11: 0000000000000000 R12:
>> ffffa0d392b684c8
>> [ 553.050389] R13: ffffa0d392b6841c R14: 0000000000000000 R15:
>> 0000000000000000
>> [ 553.050392] FS: 0000000000000000(0000) GS:ffffa0d4e4300000(0000)
>> knlGS:0000000000000000
>> [ 553.050395] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
>> [ 553.050398] CR2: 000055da1a873000 CR3: 000000023fc10000 CR4:
>> 0000000000350ee0
>> [ 553.050402] Call Trace:
>> [ 553.050406] <IRQ>
>> [ 553.050413] ? __pfx_dev_watchdog+0x10/0x10
>> [ 553.050419] call_timer_fn+0x29/0x160
>> [ 553.050425] ? __pfx_dev_watchdog+0x10/0x10
>> [ 553.050429] __run_timers+0x259/0x310
>> [ 553.050434] run_timer_softirq+0x1d/0x40
>> [ 553.050437] __do_softirq+0xd6/0x346
>> [ 553.050444] ? hrtimer_interrupt+0x11f/0x230
>> [ 553.050449] __irq_exit_rcu+0xa2/0xd0
>> [ 553.050455] irq_exit_rcu+0xe/0x20
>> [ 553.050459] sysvec_apic_timer_interrupt+0x92/0xd0
>> [ 553.050465] </IRQ>
>> [ 553.050467] <TASK>
>> [ 553.050469] asm_sysvec_apic_timer_interrupt+0x1b/0x20
>> [ 553.050475] RIP: 0010:cpuidle_enter_state+0xde/0x6f0
>> [ 553.050481] Code: 7f 1b 5a e8 14 2f 4e ff 8b 53 04 49 89 c7 0f 1f 44
>> 00 00 31 ff e8 22 40 4d ff 80 7d d0 00 0f 85 eb 00 00 00 fb 0f 1f 44
>> 00 00 <45> 85 f6 0f 88 12 02 00 00 4d 63 ee 49 83 fd 09 0f 87 c7 04 00
>> 00
>> [ 553.050484] RSP: 0018:ffffbfc340113e38 EFLAGS: 00000246
>> [ 553.050489] RAX: 0000000000000000 RBX: ffffa0d4e433c930 RCX:
>> 0000000000000000
>> [ 553.050492] RDX: 0000000000000002 RSI: 0000000000000000 RDI:
>> 0000000000000000
>> [ 553.050494] RBP: ffffbfc340113e88 R08: 0000000000000000 R09:
>> 0000000000000000
>> [ 553.050496] R10: 0000000000000000 R11: 0000000000000000 R12:
>> ffffffffa76bb700
>> [ 553.050499] R13: 0000000000000003 R14: 0000000000000003 R15:
>> 00000080c45910fe
>> [ 553.050504] ? cpuidle_enter_state+0xce/0x6f0
>> [ 553.050508] cpuidle_enter+0x2e/0x50
>> [ 553.050512] do_idle+0x216/0x2a0
>> [ 553.050517] cpu_startup_entry+0x1d/0x20
>> [ 553.050521] start_secondary+0x122/0x160
>> [ 553.050527] secondary_startup_64_no_verify+0xe5/0xeb
>> [ 553.050535] </TASK>
>> [ 553.050537] ---[ end trace 0000000000000000 ]---
>> [ 553.050584] intel-eth-pci 0000:00:1d.1 eno1: Reset adapter.
>> [ 553.064639] intel-eth-pci 0000:00:1d.1 eno1: FPE workqueue stop
>> [ 553.065416] intel-eth-pci 0000:00:1d.1 eno1: Register
>> MEM_TYPE_PAGE_POOL RxQ-0
>> [ 553.066003] intel-eth-pci 0000:00:1d.1 eno1: Register
>> MEM_TYPE_PAGE_POOL RxQ-1
>> [ 553.066469] intel-eth-pci 0000:00:1d.1 eno1: Register
>> MEM_TYPE_PAGE_POOL RxQ-2
>> [ 553.066933] intel-eth-pci 0000:00:1d.1 eno1: Register
>> MEM_TYPE_PAGE_POOL RxQ-3
>> [ 553.067428] intel-eth-pci 0000:00:1d.1 eno1: Register
>> MEM_TYPE_PAGE_POOL RxQ-4
>> [ 553.067895] intel-eth-pci 0000:00:1d.1 eno1: Register
>> MEM_TYPE_PAGE_POOL RxQ-5
>> [ 553.068466] intel-eth-pci 0000:00:1d.1 eno1: Register
>> MEM_TYPE_PAGE_POOL RxQ-6
>> [ 553.069025] intel-eth-pci 0000:00:1d.1 eno1: Register
>> MEM_TYPE_PAGE_POOL RxQ-7
>> [ 553.082011] dwmac4: Master AXI performs any burst length
>> [ 553.082094] intel-eth-pci 0000:00:1d.1 eno1: Enabling Safety Features
>> [ 553.082138] intel-eth-pci 0000:00:1d.1 eno1: IEEE 1588-2008 Advanced
>> Timestamp supported
>> [ 553.082438] intel-eth-pci 0000:00:1d.1 eno1: registered PTP clock
>> [ 553.082641] intel-eth-pci 0000:00:1d.1 eno1: FPE workqueue start
>> [ 553.082649] intel-eth-pci 0000:00:1d.1 eno1: configuring for
>> inband/sgmii link mode
>> [ 553.083700] intel-eth-pci 0000:00:1d.1 eno1: Link is Up - 1Gbps/Full
>> - flow control off
>>
>> This does not happen during normal ssh to the EHL board, this does not
>> happen if I do port forwarding through userspace, e. g. with socat. And
>> this does not happen if two independent network interfaces are used in
>> the iptables rules. I have not observed the issue with other network
>> chips with the same kernel version so I think the issue must be in the
>> stmmac driver.
>>
>> What happens is that the OWNED bit of a DMA descriptor is not reset
>> (detected in dwmac4_wrback_get_tx_status()) and the queue's txtimer is
>> looping endlessly until the timeout resets everything. Sometimes the
>> situation is solved before the timeout, most of the times it is not but
>> i always takes long until the OWNED bit is reset. This can be reproduced
>> with 100% success.
>>
>> I have tried to understand where the issue comes from and I think it may
>> have something to do with the calls to dma_wmb() and wmb() (or more
>> precisely with missing barriers) but so far I had no success to solve it
>> and provide a patch.
>>
>> The kernel that I am running is a quite recent linux-net
>> (b60417a9f2b8). I have also tried with linux-net-next without
>> success. The issue can already be observed with v5.15.39.
>>
>> Any help or suggestion how I can debug this further would be
>> appreciated. Or if somebody else can reproduce or not reproduce the
>> issue on the given platform may help as well.
>>
>> Regards
>> -Jochen
>
> Some more details about the devices involved:
>
> intel-eth-pci 0000:00:1d.1: enabling device (0000 -> 0002)
> intel-eth-pci 0000:00:1d.1: stmmac_config_multi_msi: multi MSI enablement successful
> intel-eth-pci 0000:00:1d.1: User ID: 0x51, Synopsys ID: 0x52
> intel-eth-pci 0000:00:1d.1: DWMAC4/5
> intel-eth-pci 0000:00:1d.1: DMA HW capability register supported
> intel-eth-pci 0000:00:1d.1: RX Checksum Offload Engine supported
> intel-eth-pci 0000:00:1d.1: TX Checksum insertion supported
> intel-eth-pci 0000:00:1d.1: TSO supported
> intel-eth-pci 0000:00:1d.1: Enable RX Mitigation via HW Watchdog Timer
> intel-eth-pci 0000:00:1d.1: device MAC address 00:e0:4b:74:93:db
> intel-eth-pci 0000:00:1d.1: Enabled L3L4 Flow TC (entries=2)
> intel-eth-pci 0000:00:1d.1: Enabled RFS Flow TC (entries=10)
> intel-eth-pci 0000:00:1d.1: Enabling HW TC (entries=256, max_off=256)
> intel-eth-pci 0000:00:1d.1: TSO feature enabled
> intel-eth-pci 0000:00:1d.1: Using 32 bits DMA width
> Maxlinear Ethernet GPY115B stmmac-2:01: Firmware Version: 7.110 (0x876E)
> Maxlinear Ethernet GPY115B stmmac-2:01: attached PHY driver (mii_bus:phy_addr=stmmac-2:01>
> intel-eth-pci 0000:00:1d.1 eno1: renamed from eth0
> intel-eth-pci 0000:00:1d.1 eno1: Register MEM_TYPE_PAGE_POOL RxQ-0
> intel-eth-pci 0000:00:1d.1 eno1: Register MEM_TYPE_PAGE_POOL RxQ-1
> intel-eth-pci 0000:00:1d.1 eno1: Register MEM_TYPE_PAGE_POOL RxQ-2
> intel-eth-pci 0000:00:1d.1 eno1: Register MEM_TYPE_PAGE_POOL RxQ-3
> intel-eth-pci 0000:00:1d.1 eno1: Register MEM_TYPE_PAGE_POOL RxQ-4
> intel-eth-pci 0000:00:1d.1 eno1: Register MEM_TYPE_PAGE_POOL RxQ-5
> intel-eth-pci 0000:00:1d.1 eno1: Register MEM_TYPE_PAGE_POOL RxQ-6
> intel-eth-pci 0000:00:1d.1 eno1: Register MEM_TYPE_PAGE_POOL RxQ-7
> intel-eth-pci 0000:00:1d.1 eno1: Enabling Safety Features
> intel-eth-pci 0000:00:1d.1 eno1: IEEE 1588-2008 Advanced Timestamp supported
> intel-eth-pci 0000:00:1d.1 eno1: registered PTP clock
> intel-eth-pci 0000:00:1d.1 eno1: FPE workqueue start
> intel-eth-pci 0000:00:1d.1 eno1: configuring for inband/sgmii link mode
> intel-eth-pci 0000:00:1d.1 eno1: Link is Up - 1Gbps/Full - flow control off
>
No further help needed. I have found the root cause and will provide a
fix.
-Jochen
next prev parent reply other threads:[~2023-03-14 12:42 UTC|newest]
Thread overview: 6+ messages / expand[flat|nested] mbox.gz Atom feed top
2023-02-22 15:53 Issue: stmmac reset by netdev watchdog due to tx queue timeout Jochen Henneberg
2023-02-22 15:53 ` Jochen Henneberg
2023-02-22 16:32 ` Jochen Henneberg
2023-02-22 16:32 ` Jochen Henneberg
2023-03-14 12:40 ` Jochen Henneberg [this message]
2023-03-14 12:40 ` Jochen Henneberg
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=87cz5bh5ew.fsf@henneberg-systemdesign.com \
--to=jh@henneberg-systemdesign.com \
--cc=alexandre.torgue@foss.st.com \
--cc=boon.leong.ong@intel.com \
--cc=davem@davemloft.net \
--cc=edumazet@google.com \
--cc=joabreu@synopsys.com \
--cc=kuba@kernel.org \
--cc=linux-arm-kernel@lists.infradead.org \
--cc=linux-kernel@vger.kernel.org \
--cc=linux-stm32@st-md-mailman.stormreply.com \
--cc=mcoquelin.stm32@gmail.com \
--cc=netdev@vger.kernel.org \
--cc=pabeni@redhat.com \
--cc=peppe.cavallaro@st.com \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.