* net: stmmac: dwmac-meson8b: interface sometimes does not come up at boot
From: Erico Nunes @ 2022-02-02 20:18 UTC
To: Alexandre Torgue, Giuseppe Cavallaro, Jerome Brunet, Jose Abreu, Kevin Hilman, Martin Blumenstingl, Neil Armstrong, linux-amlogic, netdev

Hello,

I've been tracking down an issue with network interfaces from
meson8b-dwmac sometimes not coming up properly at boot.
The target systems are AML-S805X-CC boards (Amlogic S805X SoC); I have
a group of them as part of a CI test farm that uses nfsroot.

After hopefully ruling out potential platform/firmware and network
issues, I bisected the kernel and found that this commit makes a big
difference:

46f69ded988d2311e3be2e4c3898fc0edd7e6c5a net: stmmac: Use resolved link config in mac_link_up()

With a kernel before that commit, I am able to submit hundreds of test
jobs and the boards always start the network interface properly.

After that commit, around 30% of the jobs start hitting this:

[ 2.178078] meson8b-dwmac c9410000.ethernet eth0: PHY [0.e40908ff:08] driver [Meson GXL Internal PHY] (irq=48)
[ 2.183505] meson8b-dwmac c9410000.ethernet eth0: Register MEM_TYPE_PAGE_POOL RxQ-0
[ 2.200784] meson8b-dwmac c9410000.ethernet eth0: No Safety Features support found
[ 2.202713] meson8b-dwmac c9410000.ethernet eth0: PTP not supported by HW
[ 2.209825] meson8b-dwmac c9410000.ethernet eth0: configuring for phy/rmii link mode
[ 3.762108] meson8b-dwmac c9410000.ethernet eth0: Link is Up - 100Mbps/Full - flow control off
[ 3.783162] Sending DHCP requests ...... timed out!
[ 93.680402] meson8b-dwmac c9410000.ethernet eth0: Link is Down
[ 93.685712] IP-Config: Retrying forever (NFS root)...
[ 93.756540] meson8b-dwmac c9410000.ethernet eth0: PHY [0.e40908ff:08] driver [Meson GXL Internal PHY] (irq=48)
[ 93.763266] meson8b-dwmac c9410000.ethernet eth0: Register MEM_TYPE_PAGE_POOL RxQ-0
[ 93.779340] meson8b-dwmac c9410000.ethernet eth0: No Safety Features support found
[ 93.781336] meson8b-dwmac c9410000.ethernet eth0: PTP not supported by HW
[ 93.788088] meson8b-dwmac c9410000.ethernet eth0: configuring for phy/rmii link mode
[ 93.807459] random: fast init done
[ 95.353076] meson8b-dwmac c9410000.ethernet eth0: Link is Up - 100Mbps/Full - flow control off

This still happens with a kernel from master, currently 5.17-rc2 (less
frequently, but still often hit by CI test jobs).
The jobs usually still get to work after the interface restarts a
couple of times, but sometimes it takes 3-4 attempts.

Here is one example and full dmesg:
https://gitlab.freedesktop.org/enunes/mesa/-/jobs/16452399/raw

Note that DHCP itself does not seem to be the issue here. Besides the
fact that the problem only happens since the mentioned commit under the
same setup, I also tried setting up the boards with a static IP, but
then the interfaces just don't communicate at all from boot.

For test purposes I attempted to revert
46f69ded988d2311e3be2e4c3898fc0edd7e6c5a on top of master, but it no
longer reverts cleanly, and by reverting it manually I haven't been
able to get a working interface.

Any advice on how to further debug or fix this?

Thanks

Erico
* Re: net: stmmac: dwmac-meson8b: interface sometimes does not come up at boot
From: Vyacheslav @ 2022-02-03 13:53 UTC
To: Erico Nunes, Alexandre Torgue, Giuseppe Cavallaro, Jerome Brunet, Jose Abreu, Kevin Hilman, Martin Blumenstingl, Neil Armstrong, linux-amlogic, netdev

Hi

I have the same problem with meson8b on an Amlogic S905W SoC.
Running "ethtool -r" after startup fixes the problem.

02.02.2022 23:18, Erico Nunes wrote:
> I've been tracking down an issue with network interfaces from
> meson8b-dwmac sometimes not coming up properly at boot.
> [...]
* Re: net: stmmac: dwmac-meson8b: interface sometimes does not come up at boot
From: Jerome Brunet @ 2022-02-07 10:41 UTC
To: Erico Nunes, Alexandre Torgue, Giuseppe Cavallaro, Jose Abreu, Kevin Hilman, Martin Blumenstingl, Neil Armstrong, linux-amlogic, netdev, linux-rockchip, linux-sunxi

On Wed 02 Feb 2022 at 21:18, Erico Nunes <nunes.erico@gmail.com> wrote:
> I've been tracking down an issue with network interfaces from
> meson8b-dwmac sometimes not coming up properly at boot.
> [...]
> Any advice on how to further debug or fix this?

Hi Erico,

Thanks a lot for digging into this topic.
I'm seeing exactly the same behavior on the g12 based khadas-vim3:

* Boot stalled waiting for DHCP - with an NFS based filesystem
* Every minute, the network driver gets a reset and tries again

Sometimes it works on the first attempt, sometimes it takes up to 5
attempts. Eventually, it reaches the prompt, which might be why it went
unnoticed so far.

I think that NFS just makes the problem easier to see.
On devices with an eMMC based filesystem, I noticed that, sometimes, I
had to unplug/plug the ethernet cable to make it go.

So far, the problem is reported on all the Amlogic SoC generations we
support. I think a way forward is to ask the other users of stmmac
whether they have this problem or not - adding Allwinner and
Rockchip ML.

Since the commit you have identified is in the generic part of the
stmmac code, maybe Jose can help us understand what is going on.
* Re: net: stmmac: dwmac-meson8b: interface sometimes does not come up at boot
From: Erico Nunes @ 2022-02-20 16:51 UTC
To: Jerome Brunet
Cc: Alexandre Torgue, Giuseppe Cavallaro, Jose Abreu, Kevin Hilman, Martin Blumenstingl, Neil Armstrong, linux-amlogic, netdev, open list:ARM/Rockchip SoC..., linux-sunxi

On Mon, Feb 7, 2022 at 11:56 AM Jerome Brunet <jbrunet@baylibre.com> wrote:
> [...]
> So far, the problem is reported on all the Amlogic SoC generation we
> support. I think a way forward is to ask the the other users of
> stmmac whether they have this problem or not - adding Allwinner and
> Rockchip ML.
>
> Since the commit you have identified is in the generic part of the
> stmmac code, Maybe Jose can help us understand what is going on.

Hi all,

thanks for the feedback so far, good to know that this is not only on
my board farm.

Any more feedback about this from the people in cc?

Thanks

Erico
* Re: net: stmmac: dwmac-meson8b: interface sometimes does not come up at boot
From: Samuel Holland @ 2022-02-22 2:30 UTC
To: Erico Nunes, Jerome Brunet
Cc: Alexandre Torgue, Giuseppe Cavallaro, Jose Abreu, Kevin Hilman, Martin Blumenstingl, Neil Armstrong, linux-amlogic, netdev, open list:ARM/Rockchip SoC..., linux-sunxi

On 2/20/22 10:51 AM, Erico Nunes wrote:
> [...]
> thanks for the feedback so far, good to know that this is not only on
> my board farm.
>
> Any more feedback about this from the people in cc?

The commit in question appears to have been merged in v5.7. I have been
using kernels newer than that (including up to v5.17-rc) on various
Allwinner platforms -- A64, H3, H6, D1 -- and I have not seen anything
similar. I also don't remember seeing reports of others having Ethernet
issues at boot on Allwinner boards.

The only issue that's come up recently for us was related to runtime PM,
but that issue was traced to a commit a year later than the one you
referenced here (5ec55823438e).

Regards,
Samuel
* Re: net: stmmac: dwmac-meson8b: interface sometimes does not come up at boot
From: Heiner Kallweit @ 2022-02-26 13:53 UTC
To: Erico Nunes, Jerome Brunet
Cc: Alexandre Torgue, Giuseppe Cavallaro, Jose Abreu, Kevin Hilman, Martin Blumenstingl, Neil Armstrong, linux-amlogic, netdev, open list:ARM/Rockchip SoC..., linux-sunxi

On 20.02.2022 17:51, Erico Nunes wrote:
> [...]
> thanks for the feedback so far, good to know that this is not only on
> my board farm.
>
> Any more feedback about this from the people in cc?
>
> Thanks
>
> Erico

Just to rule out that the PHY may be involved:
- Does the issue occur with internal and/or external PHY?
- Issue still occurs in PHY polling mode? (disable PHY interrupt in dts)
* Re: net: stmmac: dwmac-meson8b: interface sometimes does not come up at boot
From: Erico Nunes @ 2022-03-02 10:33 UTC
To: Heiner Kallweit, Jerome Brunet
Cc: Alexandre Torgue, Giuseppe Cavallaro, Jose Abreu, Kevin Hilman, Martin Blumenstingl, Neil Armstrong, linux-amlogic, netdev, open list:ARM/Rockchip SoC..., linux-sunxi

On Sat, Feb 26, 2022 at 2:53 PM Heiner Kallweit <hkallweit1@gmail.com> wrote:
> Just to rule out that the PHY may be involved:
> - Does the issue occur with internal and/or external PHY?

My target boards have the internal phy only. It is not possible for me
at the moment to test it with an external phy.

> - Issue still occurs in PHY polling mode? (disable PHY interrupt in dts)

Thanks for suggesting this. I ran tests with this and it seems to be a
workaround.
With the phy interrupt enabled on recent kernels (around v5.17-rc3) I'm
able to reproduce the issue relatively easily over a batch of a hundred
jobs.
With the phy in polling mode, I have not been able to reproduce it so
far, even with several hundred jobs.

For completeness I also tested 46f69ded988d (from my initial analysis),
and setting the phy to polling mode there does not make a difference;
the issue still reproduces. So it may have been a different bug. Though
I guess at this point we can disregard that and focus on the current
kernel.

I tried adding a few debug prints and delays to the interrupt code path
in drivers/net/phy/meson-gxl.c but nothing gave me useful info so far.

Do you have more advice on how to proceed from here?

Thanks

Erico
* Re: net: stmmac: dwmac-meson8b: interface sometimes does not come up at boot
From: Heiner Kallweit @ 2022-03-02 11:01 UTC
To: Erico Nunes, Jerome Brunet, Martin Blumenstingl
Cc: Alexandre Torgue, Giuseppe Cavallaro, Jose Abreu, Kevin Hilman, Neil Armstrong, linux-amlogic, netdev, open list:ARM/Rockchip SoC..., linux-sunxi

On 02.03.2022 11:33, Erico Nunes wrote:
> On Sat, Feb 26, 2022 at 2:53 PM Heiner Kallweit <hkallweit1@gmail.com> wrote:
>> Just to rule out that the PHY may be involved:
>> - Does the issue occur with internal and/or external PHY?
>
> My target boards have the internal phy only. It is not possible for me
> at the moment to test it with an external phy.
>
>> - Issue still occurs in PHY polling mode? (disable PHY interrupt in dts)
>
> Thanks for suggesting this. I did tests with this and it seems to be a
> workaround.
> With phy interrupt on recent kernels (around v5.17-rc3) I'm able to
> reproduce the issue relatively easily over a batch of a hundred jobs.
> With my tests with the phy in polling mode, I have not been able to
> reproduce so far, even with several hundred jobs.
>
It's my understanding that in the problem case the "aneg complete"
interrupt fires, but no data flows.
This might indicate a timing issue. According to the meson PHY driver
(I don't have the datasheet) the PHY doesn't have a "link up" interrupt
source, just the mentioned "aneg complete".

Below I send an experimental patch that delays the link up processing
a little and eliminates not needed interrupt sources.
Could you please test it with PHY interrupts enabled?


By the way, to all:
I found that interrupt mode is broken in fixed (aneg disabled) mode,
because link-up isn't signaled. Experiments showed that irq source
bit 7 can be used to fix this, but this bit isn't documented in the
driver.

> For completeness I also tested 46f69ded988d (from my initial analysis)
> and setting the phy to polling mode there does not make a difference,
> issue still reproduces. So it may have been a different bug. Though I
> guess at this point we can disregard that and focus on the current
> kernel.
>
> I tried adding a few debugs and delays to the interrupt code path in
> drivers/net/phy/meson-gxl.c but nothing gave me useful info so far.
>
> Do you have more advice on how to proceed from here?
>
> Thanks
>
> Erico

Heiner


diff --git a/drivers/net/phy/meson-gxl.c b/drivers/net/phy/meson-gxl.c
index 7e7904fee..0acb3a99a 100644
--- a/drivers/net/phy/meson-gxl.c
+++ b/drivers/net/phy/meson-gxl.c
@@ -7,6 +7,7 @@
  * Author: Neil Armstrong <narmstrong@baylibre.com>
  */
 #include <linux/kernel.h>
+#include <linux/delay.h>
 #include <linux/module.h>
 #include <linux/mii.h>
 #include <linux/ethtool.h>
@@ -209,12 +210,7 @@ static int meson_gxl_config_intr(struct phy_device *phydev)
 		if (ret)
 			return ret;
 
-		val = INTSRC_ANEG_PR
-			| INTSRC_PARALLEL_FAULT
-			| INTSRC_ANEG_LP_ACK
-			| INTSRC_LINK_DOWN
-			| INTSRC_REMOTE_FAULT
-			| INTSRC_ANEG_COMPLETE;
+		val = INTSRC_LINK_DOWN | INTSRC_ANEG_COMPLETE;
 		ret = phy_write(phydev, INTSRC_MASK, val);
 	} else {
 		val = 0;
@@ -240,6 +236,9 @@ static irqreturn_t meson_gxl_handle_interrupt(struct phy_device *phydev)
 	if (irq_status == 0)
 		return IRQ_NONE;
 
+	if (irq_status & INTSRC_ANEG_COMPLETE)
+		msleep(100);
+
 	phy_trigger_machine(phydev);
 
 	return IRQ_HANDLED;
-- 
2.35.1
* Re: net: stmmac: dwmac-meson8b: interface sometimes does not come up at boot 2022-03-02 11:01 ` Heiner Kallweit @ 2022-03-02 13:39 ` Jerome Brunet 2022-03-02 16:34 ` Heiner Kallweit 0 siblings, 1 reply; 17+ messages in thread From: Jerome Brunet @ 2022-03-02 13:39 UTC (permalink / raw) To: Heiner Kallweit, Erico Nunes, Martin Blumenstingl Cc: Alexandre Torgue, Giuseppe Cavallaro, Jose Abreu, Kevin Hilman, Neil Armstrong, linux-amlogic, netdev, open list:ARM/Rockchip SoC..., linux-sunxi On Wed 02 Mar 2022 at 12:01, Heiner Kallweit <hkallweit1@gmail.com> wrote: > On 02.03.2022 11:33, Erico Nunes wrote: >> On Sat, Feb 26, 2022 at 2:53 PM Heiner Kallweit <hkallweit1@gmail.com> wrote: >>> Just to rule out that the PHY may be involved: >>> - Does the issue occur with internal and/or external PHY? >> >> My target boards have the internal phy only. It is not possible for me >> at the moment to test it with an external phy. >> >>> - Issue still occurs in PHY polling mode? (disable PHY interrupt in dts) >> >> Thanks for suggesting this. I did tests with this and it seems to be a >> workaround. >> With phy interrupt on recent kernels (around v5.17-rc3) I'm able to >> reproduce the issue relatively easily over a batch of a hundred jobs. >> With my tests with the phy in polling mode, I have not been able to >> reproduce so far, even with several hundred jobs. >> > It's my understanding that in the problem case the "aneg complete" > interrupt fires, but no data flows. > This might indicate a timing issue. According to the meson PHY driver > (I don't have the datasheet) the PHY doesn't have a "link up" interrupt > source, just the mentioned "aneg complete". > > Below I send an experimental patch that delays the link up processing > a little and eliminates not needed interrupt sources. > Could you please test it with PHY interrupts enabled? > > > By the way, to all: > I found that interrupt mode is broken in fixed (aneg disabled) mode, > because link-up isn't signaled. 
Experiments showed that irq source > bit 7 can be used to fix this, but this bit isn't documented in the > driver. > >> For completeness I also tested 46f69ded988d (from my initial analysis) >> and setting the phy to polling mode there does not make a difference, >> issue still reproduces. So it may have been a different bug. Though I >> guess at this point we can disregard that and focus on the current >> kernel. >> >> I tried adding a few debugs and delays to the interrupt code path in >> drivers/net/phy/meson-gxl.c but nothing gave me useful info so far. >> >> Do you have more advice on how to proceed from here? >> >> Thanks >> >> Erico > > Heiner Hi, I also did some tests on my side as well. Mostly with v5.10.93 ATM It is true that I can recall seeing this issue only on boards using the internal PHY (g12 and gxl board for me - I don't have meson8b boards) I tried on the u200 (g12 based). Being the ref design it has both the internal and external interfaces and I can choose. To my surprise, I could not reproduce the issue on it with the internal PHY ... until I noticed that eMMC was initialising more or less at the same time as the network. I disabled the eMMC, out of curiosity, and the issue was back. Like Heiner, I suspect a timing issue - at this stage, I can't tell if it is PHY related though. I also tried with the external phy, could not reproduce. Unfortunately, as we can see from the first test on the u200, not reproducing is not really a proof and it difficult to conclude. Like Erico, I tried bisecting but I ended up on a BT merge ... Clearly inconclusive :( Disabling the IRQ is an interesting test but, on my side, I have mixed results (on the libretech-cc this time): * I first tried quickly while bisecting, on commit 5.6.0-rc3-01434-g8d4ccd7770e7: - With IRQ => NOK - POLL => NOK Seeing Erico's report, I thought maybe I mixed things up so I tried again, doubled checked IRQ were disabled ... still broken. 
There was another commit I reproduce it without IRQ but I lost it. * I also tried on v5.10.93: - With IRQ => NOK - POLL => OK ... (well, I got bored before the issue showed up) It seems that switching to polling, in some case, changes the timings just enough to hide the issue ... but not always. Unless I forgot to consider something else ?? Ideas ? If I understand the proposed patch correctly, it is mostly about the phy IRQ. Since I reproduce without the IRQ, I suppose it is not the problem we where looking for (might still be a problem worth fixing - the phy is not "rock-solid" when it comes to aneg - I already tried stabilising it a few years ago) TBH, It bothers me that I reproduced w/o the IRQ. The idea makes sense :/ > > > diff --git a/drivers/net/phy/meson-gxl.c b/drivers/net/phy/meson-gxl.c > index 7e7904fee..0acb3a99a 100644 > --- a/drivers/net/phy/meson-gxl.c > +++ b/drivers/net/phy/meson-gxl.c > @@ -7,6 +7,7 @@ > * Author: Neil Armstrong <narmstrong@baylibre.com> > */ > #include <linux/kernel.h> > +#include <linux/delay.h> > #include <linux/module.h> > #include <linux/mii.h> > #include <linux/ethtool.h> > @@ -209,12 +210,7 @@ static int meson_gxl_config_intr(struct phy_device *phydev) > if (ret) > return ret; > > - val = INTSRC_ANEG_PR > - | INTSRC_PARALLEL_FAULT > - | INTSRC_ANEG_LP_ACK > - | INTSRC_LINK_DOWN > - | INTSRC_REMOTE_FAULT > - | INTSRC_ANEG_COMPLETE; > + val = INTSRC_LINK_DOWN | INTSRC_ANEG_COMPLETE; > ret = phy_write(phydev, INTSRC_MASK, val); > } else { > val = 0; > @@ -240,6 +236,9 @@ static irqreturn_t meson_gxl_handle_interrupt(struct phy_device *phydev) > if (irq_status == 0) > return IRQ_NONE; > > + if (irq_status & INTSRC_ANEG_COMPLETE) > + msleep(100); > + > phy_trigger_machine(phydev); > > return IRQ_HANDLED; ^ permalink raw reply [flat|nested] 17+ messages in thread
* Re: net: stmmac: dwmac-meson8b: interface sometimes does not come up at boot 2022-03-02 13:39 ` Jerome Brunet @ 2022-03-02 16:34 ` Heiner Kallweit 2022-03-06 9:40 ` Erico Nunes 0 siblings, 1 reply; 17+ messages in thread From: Heiner Kallweit @ 2022-03-02 16:34 UTC (permalink / raw) To: Jerome Brunet, Erico Nunes, Martin Blumenstingl Cc: Alexandre Torgue, Giuseppe Cavallaro, Jose Abreu, Kevin Hilman, Neil Armstrong, linux-amlogic, netdev, open list:ARM/Rockchip SoC..., linux-sunxi On 02.03.2022 14:39, Jerome Brunet wrote: > > On Wed 02 Mar 2022 at 12:01, Heiner Kallweit <hkallweit1@gmail.com> wrote: > >> On 02.03.2022 11:33, Erico Nunes wrote: >>> On Sat, Feb 26, 2022 at 2:53 PM Heiner Kallweit <hkallweit1@gmail.com> wrote: >>>> Just to rule out that the PHY may be involved: >>>> - Does the issue occur with internal and/or external PHY? >>> >>> My target boards have the internal phy only. It is not possible for me >>> at the moment to test it with an external phy. >>> >>>> - Issue still occurs in PHY polling mode? (disable PHY interrupt in dts) >>> >>> Thanks for suggesting this. I did tests with this and it seems to be a >>> workaround. >>> With phy interrupt on recent kernels (around v5.17-rc3) I'm able to >>> reproduce the issue relatively easily over a batch of a hundred jobs. >>> With my tests with the phy in polling mode, I have not been able to >>> reproduce so far, even with several hundred jobs. >>> >> It's my understanding that in the problem case the "aneg complete" >> interrupt fires, but no data flows. >> This might indicate a timing issue. According to the meson PHY driver >> (I don't have the datasheet) the PHY doesn't have a "link up" interrupt >> source, just the mentioned "aneg complete". >> >> Below I send an experimental patch that delays the link up processing >> a little and eliminates not needed interrupt sources. >> Could you please test it with PHY interrupts enabled? 
>> >> >> By the way, to all: >> I found that interrupt mode is broken in fixed (aneg disabled) mode, >> because link-up isn't signaled. Experiments showed that irq source >> bit 7 can be used to fix this, but this bit isn't documented in the >> driver. >> >>> For completeness I also tested 46f69ded988d (from my initial analysis) >>> and setting the phy to polling mode there does not make a difference, >>> issue still reproduces. So it may have been a different bug. Though I >>> guess at this point we can disregard that and focus on the current >>> kernel. >>> >>> I tried adding a few debugs and delays to the interrupt code path in >>> drivers/net/phy/meson-gxl.c but nothing gave me useful info so far. >>> >>> Do you have more advice on how to proceed from here? >>> >>> Thanks >>> >>> Erico >> >> Heiner > > Hi, > > I also did some tests on my side as well. Mostly with v5.10.93 ATM > It is true that I can recall seeing this issue only on boards using the > internal PHY (g12 and gxl board for me - I don't have meson8b boards) > > I tried on the u200 (g12 based). Being the ref design it has both > the internal and external interfaces and I can choose. > > To my surprise, I could not reproduce the issue on it with the internal > PHY ... until I noticed that eMMC was initialising more or less at the > same time as the network. > > I disabled the eMMC, out of curiosity, and the issue was back. > Like Heiner, I suspect a timing issue - at this stage, I can't tell if it > is PHY related though. > > I also tried with the external phy, could not reproduce. Unfortunately, > as we can see from the first test on the u200, not reproducing is not > really a proof and it difficult to conclude. > > Like Erico, I tried bisecting but I ended up on a BT merge ... 
Clearly > inconclusive :( > > Disabling the IRQ is an interesting test but, on my side, I have mixed > results (on the libretech-cc this time): > > * I first tried quickly while bisecting, on commit > 5.6.0-rc3-01434-g8d4ccd7770e7: > - With IRQ => NOK > - POLL => NOK > > Seeing Erico's report, I thought maybe I mixed things up so I tried again, > doubled checked IRQ were disabled ... still broken. There was another > commit I reproduce it without IRQ but I lost it. > > * I also tried on v5.10.93: > - With IRQ => NOK > - POLL => OK ... (well, I got bored before the issue showed up) > > It seems that switching to polling, in some case, changes the timings > just enough to hide the issue ... but not always. Unless I forgot to > consider something else ?? Ideas ? > When using polling the time difference between aneg complete and PHY state machine run is random in the interval 0 .. 1s. Hence there's a certain chance that the difference is too small to avoid the issue. > If I understand the proposed patch correctly, it is mostly about the phy > IRQ. Since I reproduce without the IRQ, I suppose it is not the > problem we where looking for (might still be a problem worth fixing - > the phy is not "rock-solid" when it comes to aneg - I already tried > stabilising it a few years ago) Below is a slightly improved version of the test patch. It doesn't sleep in the (threaded) interrupt handler and lets the workqueue do it. Maybe Amlogic is aware of a potentially related silicon issue? > > TBH, It bothers me that I reproduced w/o the IRQ. The idea makes > sense :/ > >> [...] 
>

diff --git a/drivers/net/phy/meson-gxl.c b/drivers/net/phy/meson-gxl.c
index 7e7904fee..a3318ae01 100644
--- a/drivers/net/phy/meson-gxl.c
+++ b/drivers/net/phy/meson-gxl.c
@@ -209,12 +209,7 @@ static int meson_gxl_config_intr(struct phy_device *phydev)
 		if (ret)
 			return ret;
 
-		val = INTSRC_ANEG_PR
-			| INTSRC_PARALLEL_FAULT
-			| INTSRC_ANEG_LP_ACK
-			| INTSRC_LINK_DOWN
-			| INTSRC_REMOTE_FAULT
-			| INTSRC_ANEG_COMPLETE;
+		val = INTSRC_LINK_DOWN | INTSRC_ANEG_COMPLETE;
 		ret = phy_write(phydev, INTSRC_MASK, val);
 	} else {
 		val = 0;
@@ -240,7 +235,10 @@ static irqreturn_t meson_gxl_handle_interrupt(struct phy_device *phydev)
 	if (irq_status == 0)
 		return IRQ_NONE;
 
-	phy_trigger_machine(phydev);
+	if (irq_status & INTSRC_ANEG_COMPLETE)
+		phy_queue_state_machine(phydev, msecs_to_jiffies(100));
+	else
+		phy_trigger_machine(phydev);
 
 	return IRQ_HANDLED;
 }
--
2.35.1
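Heiner's remark in the message above, that in polling mode the gap between
aneg completion and the next PHY state machine run is random within 0 .. 1s,
can be illustrated with a small user-space model. This is only a sketch and
not kernel code. It assumes that aneg completes at a uniformly distributed
point inside the poll period (at 1 ms granularity) and that a gap shorter
than some threshold counts as "too small"; the 100 ms threshold used in the
usage note is simply borrowed from the test patch. The function name
short_gap_permille() is hypothetical.

```c
/*
 * User-space model, not kernel code.
 *
 * Assumption: aneg completion is uniformly distributed over the poll
 * period, sampled at 1 ms granularity. Returns how many cases per
 * thousand leave less than min_gap_ms between aneg completion and the
 * next poll of the PHY state machine.
 */
static unsigned int short_gap_permille(unsigned int poll_period_ms,
				       unsigned int min_gap_ms)
{
	unsigned int gap_ms, hits = 0;

	if (poll_period_ms == 0)
		return 0;

	/* gap_ms: time from aneg completion until the next poll */
	for (gap_ms = 0; gap_ms < poll_period_ms; gap_ms++)
		if (gap_ms < min_gap_ms)
			hits++;

	return hits * 1000 / poll_period_ms;
}
```

Calling short_gap_permille(1000, 100) models a 1 s poll period with a 100 ms
threshold. Under the uniform assumption, a fixed share of boots would still
see a short gap, which would be consistent with polling mode hiding the
problem often but not always.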
* Re: net: stmmac: dwmac-meson8b: interface sometimes does not come up at boot
  2022-03-02 16:34 ` Heiner Kallweit
@ 2022-03-06  9:40 ` Erico Nunes
  2022-03-06 12:56 ` Heiner Kallweit
  0 siblings, 1 reply; 17+ messages in thread
From: Erico Nunes @ 2022-03-06 9:40 UTC
To: Heiner Kallweit
Cc: Jerome Brunet, Martin Blumenstingl, Alexandre Torgue,
    Giuseppe Cavallaro, Jose Abreu, Kevin Hilman, Neil Armstrong,
    linux-amlogic, netdev, open list:ARM/Rockchip SoC..., linux-sunxi

On Wed, Mar 2, 2022 at 5:35 PM Heiner Kallweit <hkallweit1@gmail.com> wrote:
> When using polling the time difference between aneg complete and
> PHY state machine run is random in the interval 0 .. 1s.
> Hence there's a certain chance that the difference is too small
> to avoid the issue.
>
> > If I understand the proposed patch correctly, it is mostly about the phy
> > IRQ. Since I reproduce without the IRQ, I suppose it is not the
> > problem we where looking for (might still be a problem worth fixing -
> > the phy is not "rock-solid" when it comes to aneg - I already tried
> > stabilising it a few years ago)
>
> Below is a slightly improved version of the test patch. It doesn't sleep
> in the (threaded) interrupt handler and lets the workqueue do it.
>
> Maybe Amlogic is aware of a potentially related silicon issue?
>
> >
> > TBH, It bothers me that I reproduced w/o the IRQ. The idea makes
> > sense :/
> >
> >> > [...]
> >
>
>
> diff --git a/drivers/net/phy/meson-gxl.c b/drivers/net/phy/meson-gxl.c
> index 7e7904fee..a3318ae01 100644
> --- a/drivers/net/phy/meson-gxl.c
> +++ b/drivers/net/phy/meson-gxl.c
> @@ -209,12 +209,7 @@ static int meson_gxl_config_intr(struct phy_device *phydev)
>  		if (ret)
>  			return ret;
>
> -		val = INTSRC_ANEG_PR
> -			| INTSRC_PARALLEL_FAULT
> -			| INTSRC_ANEG_LP_ACK
> -			| INTSRC_LINK_DOWN
> -			| INTSRC_REMOTE_FAULT
> -			| INTSRC_ANEG_COMPLETE;
> +		val = INTSRC_LINK_DOWN | INTSRC_ANEG_COMPLETE;
>  		ret = phy_write(phydev, INTSRC_MASK, val);
>  	} else {
>  		val = 0;
> @@ -240,7 +235,10 @@ static irqreturn_t meson_gxl_handle_interrupt(struct phy_device *phydev)
>  	if (irq_status == 0)
>  		return IRQ_NONE;
>
> -	phy_trigger_machine(phydev);
> +	if (irq_status & INTSRC_ANEG_COMPLETE)
> +		phy_queue_state_machine(phydev, msecs_to_jiffies(100));
> +	else
> +		phy_trigger_machine(phydev);
>
>  	return IRQ_HANDLED;
>  }
> --
> 2.35.1

I did a lot of testing with this patch, and it seems to improve things.
To me it completely resolves the original issue which was more easily
reproducible where I would see "Link is Up" but the interface did not
really work.
At least in over a thousand jobs, that never reproduced again with this patch.

I do see a different issue now, but it is even less frequent and
harder to reproduce. In those over a thousand jobs, I have seen it
only about 4 times.
The difference is that now when the issue happens, the link is not
even reported as Up. The output is a bit different than the original
one, but it is consistently the same output in all instances where it
reproduced. Looks like this (note that there is no longer Link is
Down/Link is Up):

[    2.186151] meson8b-dwmac c9410000.ethernet eth0: PHY [0.e40908ff:08] driver [Meson GXL Internal PHY] (irq=48)
[    2.191582] meson8b-dwmac c9410000.ethernet eth0: Register MEM_TYPE_PAGE_POOL RxQ-0
[    2.208713] meson8b-dwmac c9410000.ethernet eth0: No Safety Features support found
[    2.210673] meson8b-dwmac c9410000.ethernet eth0: PTP not supported by HW
[    2.218083] meson8b-dwmac c9410000.ethernet eth0: configuring for phy/rmii link mode
[   22.227444] Waiting up to 100 more seconds for network.
[   42.231440] Waiting up to 80 more seconds for network.
[   62.235437] Waiting up to 60 more seconds for network.
[   82.239437] Waiting up to 40 more seconds for network.
[  102.243439] Waiting up to 20 more seconds for network.
[  122.243446] Sending DHCP requests ...
[  130.113944] random: fast init done
[  134.219441] ... timed out!
[  194.559562] IP-Config: Retrying forever (NFS root)...
[  194.624630] meson8b-dwmac c9410000.ethernet eth0: PHY [0.e40908ff:08] driver [Meson GXL Internal PHY] (irq=48)
[  194.630739] meson8b-dwmac c9410000.ethernet eth0: Register MEM_TYPE_PAGE_POOL RxQ-0
[  194.649138] meson8b-dwmac c9410000.ethernet eth0: No Safety Features support found
[  194.651113] meson8b-dwmac c9410000.ethernet eth0: PTP not supported by HW
[  194.657931] meson8b-dwmac c9410000.ethernet eth0: configuring for phy/rmii link mode
[  196.313602] meson8b-dwmac c9410000.ethernet eth0: Link is Up - 100Mbps/Full - flow control off
[  196.339463] Sending DHCP requests ., OK
...


I don't remember seeing an output like this one in the previous tests.
Is there any further improvement we can do to the patch based on this?

Thanks

Erico
* Re: net: stmmac: dwmac-meson8b: interface sometimes does not come up at boot 2022-03-06 9:40 ` Erico Nunes @ 2022-03-06 12:56 ` Heiner Kallweit 2022-03-09 14:45 ` Erico Nunes 0 siblings, 1 reply; 17+ messages in thread From: Heiner Kallweit @ 2022-03-06 12:56 UTC (permalink / raw) To: Erico Nunes Cc: Jerome Brunet, Martin Blumenstingl, Alexandre Torgue, Giuseppe Cavallaro, Jose Abreu, Kevin Hilman, Neil Armstrong, linux-amlogic, netdev, open list:ARM/Rockchip SoC..., linux-sunxi On 06.03.2022 10:40, Erico Nunes wrote: > On Wed, Mar 2, 2022 at 5:35 PM Heiner Kallweit <hkallweit1@gmail.com> wrote: >> When using polling the time difference between aneg complete and >> PHY state machine run is random in the interval 0 .. 1s. >> Hence there's a certain chance that the difference is too small >> to avoid the issue. >> >>> If I understand the proposed patch correctly, it is mostly about the phy >>> IRQ. Since I reproduce without the IRQ, I suppose it is not the >>> problem we where looking for (might still be a problem worth fixing - >>> the phy is not "rock-solid" when it comes to aneg - I already tried >>> stabilising it a few years ago) >> >> Below is a slightly improved version of the test patch. It doesn't sleep >> in the (threaded) interrupt handler and lets the workqueue do it. >> >> Maybe Amlogic is aware of a potentially related silicon issue? >> >>> >>> TBH, It bothers me that I reproduced w/o the IRQ. The idea makes >>> sense :/ >>> >>>> >> [...] 
>>> >> >> >> diff --git a/drivers/net/phy/meson-gxl.c b/drivers/net/phy/meson-gxl.c >> index 7e7904fee..a3318ae01 100644 >> --- a/drivers/net/phy/meson-gxl.c >> +++ b/drivers/net/phy/meson-gxl.c >> @@ -209,12 +209,7 @@ static int meson_gxl_config_intr(struct phy_device *phydev) >> if (ret) >> return ret; >> >> - val = INTSRC_ANEG_PR >> - | INTSRC_PARALLEL_FAULT >> - | INTSRC_ANEG_LP_ACK >> - | INTSRC_LINK_DOWN >> - | INTSRC_REMOTE_FAULT >> - | INTSRC_ANEG_COMPLETE; >> + val = INTSRC_LINK_DOWN | INTSRC_ANEG_COMPLETE; >> ret = phy_write(phydev, INTSRC_MASK, val); >> } else { >> val = 0; >> @@ -240,7 +235,10 @@ static irqreturn_t meson_gxl_handle_interrupt(struct phy_device *phydev) >> if (irq_status == 0) >> return IRQ_NONE; >> >> - phy_trigger_machine(phydev); >> + if (irq_status & INTSRC_ANEG_COMPLETE) >> + phy_queue_state_machine(phydev, msecs_to_jiffies(100)); >> + else >> + phy_trigger_machine(phydev); >> >> return IRQ_HANDLED; >> } >> -- >> 2.35.1 > > I did a lot of testing with this patch, and it seems to improve things. > To me it completely resolves the original issue which was more easily > reproducible where I would see "Link is Up" but the interface did not > really work. > At least in over a thousand jobs, that never reproduced again with this patch. > > I do see a different issue now, but it is even less frequent and > harder to reproduce. In those over a thousand jobs, I have seen it > only about 4 times. > The difference is that now when the issue happens, the link is not > even reported as Up. The output is a bit different than the original > one, but it is consistently the same output in all instances where it > reproduced. 
Looks like this (note that there is no longer Link is > Down/Link is Up): > > [ 2.186151] meson8b-dwmac c9410000.ethernet eth0: PHY > [0.e40908ff:08] driver [Meson GXL Internal PHY] (irq=48) > [ 2.191582] meson8b-dwmac c9410000.ethernet eth0: Register > MEM_TYPE_PAGE_POOL RxQ-0 > [ 2.208713] meson8b-dwmac c9410000.ethernet eth0: No Safety > Features support found > [ 2.210673] meson8b-dwmac c9410000.ethernet eth0: PTP not supported by HW > [ 2.218083] meson8b-dwmac c9410000.ethernet eth0: configuring for > phy/rmii link mode > [ 22.227444] Waiting up to 100 more seconds for network. > [ 42.231440] Waiting up to 80 more seconds for network. > [ 62.235437] Waiting up to 60 more seconds for network. > [ 82.239437] Waiting up to 40 more seconds for network. > [ 102.243439] Waiting up to 20 more seconds for network. > [ 122.243446] Sending DHCP requests ... > [ 130.113944] random: fast init done > [ 134.219441] ... timed out! > [ 194.559562] IP-Config: Retrying forever (NFS root)... > [ 194.624630] meson8b-dwmac c9410000.ethernet eth0: PHY > [0.e40908ff:08] driver [Meson GXL Internal PHY] (irq=48) > [ 194.630739] meson8b-dwmac c9410000.ethernet eth0: Register > MEM_TYPE_PAGE_POOL RxQ-0 > [ 194.649138] meson8b-dwmac c9410000.ethernet eth0: No Safety > Features support found > [ 194.651113] meson8b-dwmac c9410000.ethernet eth0: PTP not supported by HW > [ 194.657931] meson8b-dwmac c9410000.ethernet eth0: configuring for > phy/rmii link mode > [ 196.313602] meson8b-dwmac c9410000.ethernet eth0: Link is Up - > 100Mbps/Full - flow control off > [ 196.339463] Sending DHCP requests ., OK > ... > > > I don't remember seeing an output like this one in the previous tests. > Is there any further improvement we can do to the patch based on this? > > Thanks > > Erico Thanks a lot for your testing efforts, much appreciated. 

You could try the following (quick and dirty) test patch that fully mimics
the vendor driver as found here:
https://github.com/khadas/linux/blob/buildroot-aml-4.9/drivers/amlogic/ethernet/phy/amlogic.c

First apply
https://git.kernel.org/pub/scm/linux/kernel/git/netdev/net.git/commit/?id=a502a8f04097e038c3daa16c5202a9538116d563
This patch is in the net tree currently and should show up in linux-next
at the beginning of the week.

On top please apply the following (it includes the test patch you're working with).

diff --git a/drivers/net/phy/meson-gxl.c b/drivers/net/phy/meson-gxl.c
index c49062ad7..92f94c8be 100644
--- a/drivers/net/phy/meson-gxl.c
+++ b/drivers/net/phy/meson-gxl.c
@@ -68,32 +68,19 @@ static int meson_gxl_open_banks(struct phy_device *phydev)
 	return phy_write(phydev, TSTCNTL, TSTCNTL_TEST_MODE);
 }
 
-static void meson_gxl_close_banks(struct phy_device *phydev)
-{
-	phy_write(phydev, TSTCNTL, 0);
-}
-
 static int meson_gxl_read_reg(struct phy_device *phydev,
 			      unsigned int bank, unsigned int reg)
 {
 	int ret;
 
-	ret = meson_gxl_open_banks(phydev);
-	if (ret)
-		goto out;
-
 	ret = phy_write(phydev, TSTCNTL, TSTCNTL_READ |
 			FIELD_PREP(TSTCNTL_REG_BANK_SEL, bank) |
 			TSTCNTL_TEST_MODE |
 			FIELD_PREP(TSTCNTL_READ_ADDRESS, reg));
 	if (ret)
-		goto out;
+		return ret;
 
-	ret = phy_read(phydev, TSTREAD1);
-out:
-	/* Close the bank access on our way out */
-	meson_gxl_close_banks(phydev);
-	return ret;
+	return phy_read(phydev, TSTREAD1);
 }
 
 static int meson_gxl_write_reg(struct phy_device *phydev,
@@ -102,29 +89,28 @@ static int meson_gxl_write_reg(struct phy_device *phydev,
 {
 	int ret;
 
-	ret = meson_gxl_open_banks(phydev);
-	if (ret)
-		goto out;
-
 	ret = phy_write(phydev, TSTWRITE, value);
 	if (ret)
-		goto out;
+		return ret;
 
-	ret = phy_write(phydev, TSTCNTL, TSTCNTL_WRITE |
-			FIELD_PREP(TSTCNTL_REG_BANK_SEL, bank) |
-			TSTCNTL_TEST_MODE |
-			FIELD_PREP(TSTCNTL_WRITE_ADDRESS, reg));
+	return phy_write(phydev, TSTCNTL, TSTCNTL_WRITE |
+			 FIELD_PREP(TSTCNTL_REG_BANK_SEL, bank) |
+			 TSTCNTL_TEST_MODE |
+			 FIELD_PREP(TSTCNTL_WRITE_ADDRESS, reg));
 
-out:
-	/* Close the bank access on our way out */
-	meson_gxl_close_banks(phydev);
-	return ret;
 }
 
 static int meson_gxl_config_init(struct phy_device *phydev)
 {
 	int ret;
 
+	phy_set_bits(phydev, 0x1b, BIT(12));
+	phy_write(phydev, 0x11, 0x0080);
+
+	meson_gxl_open_banks(phydev);
+
+	ret = meson_gxl_write_reg(phydev, BANK_ANALOG_DSP, 0x17, 0x8e0d);
+
 	/* Enable fractional PLL */
 	ret = meson_gxl_write_reg(phydev, BANK_BIST, FR_PLL_CONTROL, 0x5);
 	if (ret)
@@ -140,6 +126,10 @@ static int meson_gxl_config_init(struct phy_device *phydev)
 	if (ret)
 		return ret;
 
+	ret = meson_gxl_write_reg(phydev, BANK_ANALOG_DSP, 0x18, 0x000c);
+	ret = meson_gxl_write_reg(phydev, BANK_ANALOG_DSP, 0x17, 0x1a0c);
+	ret = meson_gxl_write_reg(phydev, BANK_ANALOG_DSP, 0x1a, 0x6400);
+
 	return 0;
 }
 
@@ -186,7 +176,7 @@ static int meson_gxl_read_status(struct phy_device *phydev)
 		if (!(wol & LPI_STATUS_RSV12) ||
 		    ((exp & EXPANSION_NWAY) && !(lpa & LPA_LPACK))) {
 			/* Looks like aneg failed after all */
-			phydev_dbg(phydev, "LPA corruption - aneg restart\n");
+			phydev_warn(phydev, "LPA corruption - aneg restart\n");
 			return genphy_restart_aneg(phydev);
 		}
 	}
@@ -243,11 +233,23 @@ static irqreturn_t meson_gxl_handle_interrupt(struct phy_device *phydev)
 	    irq_status == INTSRC_ENERGY_DETECT)
 		return IRQ_HANDLED;
 
-	phy_trigger_machine(phydev);
+	/* Give PHY some time before MAC starts sending data. This works
+	 * around an issue where network doesn't come up properly.
+	 */
+	if (irq_status & INTSRC_ANEG_COMPLETE)
+		phy_queue_state_machine(phydev, msecs_to_jiffies(100));
+	else
+		phy_trigger_machine(phydev);
 
 	return IRQ_HANDLED;
 }
 
+static void meson_gxl_link_change_notify(struct phy_device *phydev)
+{
+	if (phydev->state == PHY_RUNNING && phydev->speed == SPEED_100)
+		meson_gxl_write_reg(phydev, BANK_ANALOG_DSP, 0x14, 0xa900);
+}
+
 static struct phy_driver meson_gxl_phy[] = {
 	{
 		PHY_ID_MATCH_EXACT(0x01814400),
@@ -259,6 +261,7 @@ static struct phy_driver meson_gxl_phy[] = {
 		.read_status	= meson_gxl_read_status,
 		.config_intr	= meson_gxl_config_intr,
 		.handle_interrupt = meson_gxl_handle_interrupt,
+		.link_change_notify = meson_gxl_link_change_notify,
 		.suspend        = genphy_suspend,
 		.resume         = genphy_resume,
 	}, {
--
2.35.1
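The read/write helpers touched by the patch above build a control word for
the TSTCNTL register out of a bank number and a register address. A rough
user-space sketch of that composition follows. The bit positions below are
assumptions made for illustration (the thread does not quote the driver's
definitions), and the helper names tstcntl_read_word() and
tstcntl_write_word() are hypothetical, so check everything against
drivers/net/phy/meson-gxl.c before relying on it.

```c
#include <stdint.h>

/* Assumed field layout of the TSTCNTL control word (illustration only) */
#define TSTCNTL_READ		(1u << 15)
#define TSTCNTL_WRITE		(1u << 14)
#define TSTCNTL_BANK_SHIFT	11
#define TSTCNTL_BANK_MASK	(0x3u << TSTCNTL_BANK_SHIFT)
#define TSTCNTL_TEST_MODE	(1u << 10)
#define TSTCNTL_RADDR_SHIFT	5
#define TSTCNTL_RADDR_MASK	(0x1fu << TSTCNTL_RADDR_SHIFT)
#define TSTCNTL_WADDR_MASK	0x1fu

/* Control word that requests a read of register 'reg' in bank 'bank' */
static uint16_t tstcntl_read_word(unsigned int bank, unsigned int reg)
{
	return (uint16_t)(TSTCNTL_READ |
			  ((bank << TSTCNTL_BANK_SHIFT) & TSTCNTL_BANK_MASK) |
			  TSTCNTL_TEST_MODE |
			  ((reg << TSTCNTL_RADDR_SHIFT) & TSTCNTL_RADDR_MASK));
}

/* Control word that commits a previously staged value to 'reg' in 'bank' */
static uint16_t tstcntl_write_word(unsigned int bank, unsigned int reg)
{
	return (uint16_t)(TSTCNTL_WRITE |
			  ((bank << TSTCNTL_BANK_SHIFT) & TSTCNTL_BANK_MASK) |
			  TSTCNTL_TEST_MODE |
			  (reg & TSTCNTL_WADDR_MASK));
}
```

In this model the test-mode bit is set in every control word. That would fit
the test patch's approach of opening the banks once in config_init and no
longer closing them after each access, although the thread itself does not
spell out that reasoning.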
* Re: net: stmmac: dwmac-meson8b: interface sometimes does not come up at boot
  2022-03-06 12:56 ` Heiner Kallweit
@ 2022-03-09 14:45 ` Erico Nunes
  2022-03-09 14:57 ` Jerome Brunet
  0 siblings, 1 reply; 17+ messages in thread
From: Erico Nunes @ 2022-03-09 14:45 UTC
To: Heiner Kallweit
Cc: Jerome Brunet, Martin Blumenstingl, Alexandre Torgue,
    Giuseppe Cavallaro, Jose Abreu, Kevin Hilman, Neil Armstrong,
    linux-amlogic, netdev, open list:ARM/Rockchip SoC..., linux-sunxi

On Sun, Mar 6, 2022 at 1:56 PM Heiner Kallweit <hkallweit1@gmail.com> wrote:
> You could try the following (quick and dirty) test patch that fully mimics
> the vendor driver as found here:
> https://github.com/khadas/linux/blob/buildroot-aml-4.9/drivers/amlogic/ethernet/phy/amlogic.c
>
> First apply
> https://git.kernel.org/pub/scm/linux/kernel/git/netdev/net.git/commit/?id=a502a8f04097e038c3daa16c5202a9538116d563
> This patch is in the net tree currently and should show up in linux-next
> beginning of the week.
>
> On top please apply the following (it includes the test patch you're working with).

I triggered test jobs with this configuration (latest mainline +
a502a8f0409 + test patch for vendor driver behaviour), and the results
are pretty much the same as with the previous test patch from this
thread only.
That is, I never got the issue with non-functional link up anymore,
but I get the (rare) issue with link not going up.
The reproducibility is still extremely low, in the <1% range.

So at this point, I'm not sure how much more effort to invest into
this. Given the rate is very low and the fallback is it will just
reset the link and proceed to work, I think the situation would
already be much better with the solution from that test patch being
merged. If you propose that as a patch separately, I'm happy to test
the final submitted patch again and provide feedback there. Or if
there is another solution to try, I can try with that too.

Thanks


Erico
* Re: net: stmmac: dwmac-meson8b: interface sometimes does not come up at boot
  2022-03-09 14:45 ` Erico Nunes
@ 2022-03-09 14:57 ` Jerome Brunet
  2022-03-09 20:42 ` Heiner Kallweit
  0 siblings, 1 reply; 17+ messages in thread
From: Jerome Brunet @ 2022-03-09 14:57 UTC
To: Erico Nunes, Heiner Kallweit
Cc: Martin Blumenstingl, Alexandre Torgue, Giuseppe Cavallaro,
    Jose Abreu, Kevin Hilman, Neil Armstrong, linux-amlogic, netdev,
    open list:ARM/Rockchip SoC..., linux-sunxi

On Wed 09 Mar 2022 at 15:45, Erico Nunes <nunes.erico@gmail.com> wrote:

> On Sun, Mar 6, 2022 at 1:56 PM Heiner Kallweit <hkallweit1@gmail.com> wrote:
>> You could try the following (quick and dirty) test patch that fully mimics
>> the vendor driver as found here:
>> https://github.com/khadas/linux/blob/buildroot-aml-4.9/drivers/amlogic/ethernet/phy/amlogic.c
>>
>> First apply
>> https://git.kernel.org/pub/scm/linux/kernel/git/netdev/net.git/commit/?id=a502a8f04097e038c3daa16c5202a9538116d563
>> This patch is in the net tree currently and should show up in linux-next
>> beginning of the week.
>>
>> On top please apply the following (it includes the test patch you're working with).
>
> I triggered test jobs with this configuration (latest mainline +
> a502a8f0409 + test patch for vendor driver behaviour), and the results
> are pretty much the same as with the previous test patch from this
> thread only.
> That is, I never got the issue with non-functional link up anymore,
> but I get the (rare) issue with link not going up.
> The reproducibility is still extremely low, in the <1% range.

Low reproducibility means the problem is still there, or at least not
understood completely.

I understand the benefit from the user standpoint.

Heiner, if you are going to continue from the test patch you sent,
I would welcome some explanation with each of the changes.

We know very little about this IP and I'm not very comfortable with
tweaking/aligning with AML sdk "blindly" on a driver that has otherwise
been working well so far.

Thx

>
> So at this point, I'm not sure how much more effort to invest into
> this. Given the rate is very low and the fallback is it will just
> reset the link and proceed to work, I think the situation would
> already be much better with the solution from that test patch being
> merged. If you propose that as a patch separately, I'm happy to test
> the final submitted patch again and provide feedback there. Or if
> there is another solution to try, I can try with that too.
>
> Thanks
>
>
> Erico
* Re: net: stmmac: dwmac-meson8b: interface sometimes does not come up at boot
From: Heiner Kallweit @ 2022-03-09 20:42 UTC
To: Jerome Brunet, Erico Nunes
Cc: Martin Blumenstingl, Alexandre Torgue, Giuseppe Cavallaro, Jose Abreu, Kevin Hilman, Neil Armstrong, linux-amlogic, netdev, open list:ARM/Rockchip SoC..., linux-sunxi

On 09.03.2022 15:57, Jerome Brunet wrote:
>
> On Wed 09 Mar 2022 at 15:45, Erico Nunes <nunes.erico@gmail.com> wrote:
>
>> On Sun, Mar 6, 2022 at 1:56 PM Heiner Kallweit <hkallweit1@gmail.com> wrote:
>>> You could try the following (quick and dirty) test patch that fully mimics
>>> the vendor driver as found here:
>>> https://github.com/khadas/linux/blob/buildroot-aml-4.9/drivers/amlogic/ethernet/phy/amlogic.c
>>>
>>> First apply
>>> https://git.kernel.org/pub/scm/linux/kernel/git/netdev/net.git/commit/?id=a502a8f04097e038c3daa16c5202a9538116d563
>>> This patch is in the net tree currently and should show up in linux-next
>>> beginning of the week.
>>>
>>> On top please apply the following (it includes the test patch you're working with).
>>
>> I triggered test jobs with this configuration (latest mainline +
>> a502a8f0409 + test patch for vendor driver behaviour), and the results
>> are pretty much the same as with the previous test patch from this
>> thread only.
>> That is, I never got the issue with non-functional link up anymore,
>> but I get the (rare) issue with link not going up.
>> The reproducibility is still extremely low, in the >1% range.
>
> Low reproducibility means the problem is still there, or at least not
> understood completely.
>
> I understand the benefit from the user standpoint.
>
> Heiner if you are going to continue from the test patch you sent,
> I would welcome some explanation with each of the changes.
>
The latest test patch was purely for checking whether we see any
difference in behavior between vendor driver and the mainlined
version. It's in no way meant to be applied to mainline.

> We know very little about this IP and I'm not very comfortable with
> tweaking/aligning with AML sdk "blindly" on a driver that has otherwise
> been working well so far.
>

This touches one thing I wanted to ask anyway: Supposedly Amlogic
didn't develop its own Ethernet PHY, and if they licensed an existing
IP then it should be similar to some other existing PHY (that may
have a driver in phylib).

Then what I'll do is submit the following small change that brought
the error rate significantly down according to Erico's tests.

-	phy_trigger_machine(phydev);
+	if (irq_status & INTSRC_ANEG_COMPLETE)
+		phy_queue_state_machine(phydev, msecs_to_jiffies(100));
+	else
+		phy_trigger_machine(phydev);

> Thx
>
>>
>> So at this point, I'm not sure how much more effort to invest into
>> this. Given the rate is very low and the fallback is it will just
>> reset the link and proceed to work, I think the situation would
>> already be much better with the solution from that test patch being
>> merged. If you propose that as a patch separately, I'm happy to test
>> the final submitted patch again and provide feedback there. Or if
>> there is another solution to try, I can try with that too.
>>
>> Thanks
>>
>>
>> Erico
>
Heiner
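For readers following the small diff in Heiner's message, here is a minimal userspace model of the decision it makes. It is only a sketch and not the driver code: the function name state_machine_delay_ms() is hypothetical, and the bit position used for INTSRC_ANEG_COMPLETE is an assumption for illustration. In the kernel the two branches call phy_queue_state_machine() and phy_trigger_machine(), as in the diff.

```c
#include <stdint.h>

/* Assumed bit position, for illustration only; the driver has the real one. */
#define INTSRC_ANEG_COMPLETE (1u << 6)

/*
 * Model of the proposed change: how long to wait (in milliseconds)
 * before the PHY state machine runs after an interrupt.
 *
 * - Autonegotiation-complete interrupt: defer by 100 ms, mirroring
 *   phy_queue_state_machine(phydev, msecs_to_jiffies(100)).
 * - Any other interrupt source: run without delay, mirroring
 *   phy_trigger_machine(phydev).
 */
unsigned int state_machine_delay_ms(uint32_t irq_status)
{
	if (irq_status & INTSRC_ANEG_COMPLETE)
		return 100;

	return 0;
}
```

The point of the model is that only the "aneg complete" event is delayed; every other interrupt source is handled as before.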
[parent not found: <CACdvmAhcyNXViJgk6o6oAoYvAjAg-NFD74Eym_nGHJx3YAqjzw@mail.gmail.com>]
* Re: net: stmmac: dwmac-meson8b: interface sometimes does not come up at boot
From: Jerome Brunet @ 2022-06-13 9:10 UTC
To: Da Xue, Heiner Kallweit
Cc: Erico Nunes, Martin Blumenstingl, Alexandre Torgue, Giuseppe Cavallaro, Jose Abreu, Kevin Hilman, Neil Armstrong, linux-amlogic, netdev, open list:ARM/Rockchip SoC..., linux-sunxi

On Sat 11 Jun 2022 at 17:00, Da Xue <da@lessconfused.com> wrote:

> On Wed, Mar 9, 2022 at 3:42 PM Heiner Kallweit <hkallweit1@gmail.com> wrote:
>
> > On 09.03.2022 15:57, Jerome Brunet wrote:
> > >
> > > On Wed 09 Mar 2022 at 15:45, Erico Nunes <nunes.erico@gmail.com> wrote:
> > >
> > >> On Sun, Mar 6, 2022 at 1:56 PM Heiner Kallweit <hkallweit1@gmail.com> wrote:
> > >>> You could try the following (quick and dirty) test patch that fully mimics
> > >>> the vendor driver as found here:
> > >>> https://github.com/khadas/linux/blob/buildroot-aml-4.9/drivers/amlogic/ethernet/phy/amlogic.c
> > >>>
> > >>> First apply
> > >>> https://git.kernel.org/pub/scm/linux/kernel/git/netdev/net.git/commit/?id=a502a8f04097e038c3daa16c5202a9538116d563
> > >>> This patch is in the net tree currently and should show up in linux-next
> > >>> beginning of the week.
> > >>>
> > >>> On top please apply the following (it includes the test patch you're working with).
> > >>
> > >> I triggered test jobs with this configuration (latest mainline +
> > >> a502a8f0409 + test patch for vendor driver behaviour), and the results
> > >> are pretty much the same as with the previous test patch from this
> > >> thread only.
> > >> That is, I never got the issue with non-functional link up anymore,
> > >> but I get the (rare) issue with link not going up.
> > >> The reproducibility is still extremely low, in the >1% range.
> > >
> > > Low reproducibility means the problem is still there, or at least not
> > > understood completely.
> > >
> > > I understand the benefit from the user standpoint.
> > >
> > > Heiner if you are going to continue from the test patch you sent,
> > > I would welcome some explanation with each of the changes.
> > >
> > The latest test patch was purely for checking whether we see any
> > difference in behavior between vendor driver and the mainlined
> > version. It's in no way meant to be applied to mainline.
> >
> > > We know very little about this IP and I'm not very comfortable with
> > > tweaking/aligning with AML sdk "blindly" on a driver that has otherwise
> > > been working well so far.
> > >
> >
> > This touches one thing I wanted to ask anyway: Supposedly Amlogic
> > didn't develop its own Ethernet PHY, and if they licensed an existing
> > IP then it should be similar to some other existing PHY (that may
> > have a driver in phylib).
> >
> > Then what I'll do is submit the following small change that brought
> > the error rate significantly down according to Erico's tests.
> >
> > -	phy_trigger_machine(phydev);
> > +	if (irq_status & INTSRC_ANEG_COMPLETE)
> > +		phy_queue_state_machine(phydev, msecs_to_jiffies(100));
> > +	else
> > +		phy_trigger_machine(phydev);
> >
> > > Thx
> > >
> > >>
> > >> So at this point, I'm not sure how much more effort to invest into
> > >> this. Given the rate is very low and the fallback is it will just
> > >> reset the link and proceed to work, I think the situation would
> > >> already be much better with the solution from that test patch being
> > >> merged. If you propose that as a patch separately, I'm happy to test
> > >> the final submitted patch again and provide feedback there. Or if
> > >> there is another solution to try, I can try with that too.
> > >>
> > >> Thanks
> > >>
> > >>
> > >> Erico
> > >
> > Heiner
>
> To help reproduce this problem, I have had this problem for as long as I can remember and it still occurs with this patch.

Same here, on both gxl and g12a. Occurrence remains unchanged.

The issue is even reproduced if the PHY is switched to polling mode, so
the merged change, related to the IRQ handling, is very unlikely to fix
the problem.

>
> This doesn't happen on first boot most of the time. It happens on reboot consistently. I have tested with AML-S805X-CC board, AML-S905X-CC V1, and V2 boards.
>

On my side, I confirm the network never seems to get stuck in u-boot but
it might break in Linux, even on the first boot after a power up from
what I have seen so far.

> I am on u-boot 22.04 with 5.18.3 which includes the patch.
> u-boot brings up ethernet on start and can grab an IP.
> Linux brings up ethernet and can grab an IP.
> reboot
> u-boot can grab an IP.
> Linux does not get anything.
> I have to do ip link set dev eth0 down && up once or more to get ethernet to work again.
> Sometimes it spams meson8b-dwmac c9410000.ethernet eth0: Reset adapter. If it spams this, ethernet is dead and can't be recovered.

I tried several things, none showing any improvement so far
* Make sure LPI/EEE is disabled
* Add the ethernet reset from the main controller on the MAC
* Test the various DMA modes of STMMAC
* Port the differences from u-boot and the vendor kernel in the Phy driver

I have also tried to go back in time, as far as v4.19, but the problem is
actually already there. It occurs a lot less though.
Since v5.6+ the occurrence is quite high: approx 1 in 4 boots
On v4.19: 1 in 50 boots - up to 150.

>

When the problem happens
* link is reported up
* ifconfig / MAC is claiming to be sending packets (Tx increasing - no Rx)
* I see no traffic with wireshark

The packets are getting lost somewhere. Can't say for sure if it is in
the MAC or the PHY.

> This is fixed via power cycle so I'm assuming some register is not reset or maybe the IP is stuck.
> `ethtool -r eth0` also seems to work around the problem.

This triggers the restart of so many things, it is close to an un/replug of
the ethernet cable :/

> Best,
> Da Xue
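The symptom Jerome describes (Tx counter increasing, Rx counter stuck) could be checked from userspace with something like the following. This is only a sketch under stated assumptions: it reads the standard /sys/class/net/<ifname>/statistics/ counters, and the helper names read_net_counter() and tx_without_rx() are hypothetical, not taken from any tool mentioned in the thread. The heuristic will also fire on a link that is simply idle on the receive side, so it is a hint rather than a diagnosis.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdio.h>

/*
 * Read one counter (e.g. "rx_packets" or "tx_packets") from
 * /sys/class/net/<ifname>/statistics/<counter>.
 * Returns true on success and stores the value in *out.
 */
bool read_net_counter(const char *ifname, const char *counter,
		      unsigned long long *out)
{
	char path[256];
	FILE *f;
	bool ok;
	int n;

	n = snprintf(path, sizeof(path), "/sys/class/net/%s/statistics/%s",
		     ifname, counter);
	if (n < 0 || (size_t)n >= sizeof(path))
		return false;	/* path did not fit in the buffer */

	f = fopen(path, "r");
	if (!f)
		return false;	/* no such interface or counter */

	ok = fscanf(f, "%llu", out) == 1;
	fclose(f);
	return ok;
}

/*
 * Hypothetical heuristic: between two samples, the Tx counter advanced
 * while the Rx counter did not move at all.
 */
bool tx_without_rx(unsigned long long rx_before, unsigned long long rx_after,
		   unsigned long long tx_before, unsigned long long tx_after)
{
	return tx_after > tx_before && rx_after == rx_before;
}
```

A caller would sample rx_packets and tx_packets for eth0 with read_net_counter(), wait a few seconds while a DHCP request or ping is in flight, sample again, and pass the four values to tx_without_rx().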
* Re: net: stmmac: dwmac-meson8b: interface sometimes does not come up at boot
From: Anand Moon @ 2022-07-15 5:35 UTC
To: Jerome Brunet
Cc: Da Xue, Heiner Kallweit, Erico Nunes, Martin Blumenstingl, Alexandre Torgue, Giuseppe Cavallaro, Jose Abreu, Kevin Hilman, Neil Armstrong, linux-amlogic, netdev, open list:ARM/Rockchip SoC..., linux-sunxi

Hi Jerome

On Mon, 13 Jun 2022 at 15:10, Jerome Brunet <jbrunet@baylibre.com> wrote:
>
>
> On Sat 11 Jun 2022 at 17:00, Da Xue <da@lessconfused.com> wrote:
>
> > On Wed, Mar 9, 2022 at 3:42 PM Heiner Kallweit <hkallweit1@gmail.com> wrote:
> >
> > > On 09.03.2022 15:57, Jerome Brunet wrote:
> > > >
> > > > On Wed 09 Mar 2022 at 15:45, Erico Nunes <nunes.erico@gmail.com> wrote:
> > > >
> > > >> On Sun, Mar 6, 2022 at 1:56 PM Heiner Kallweit <hkallweit1@gmail.com> wrote:
> > > >>> You could try the following (quick and dirty) test patch that fully mimics
> > > >>> the vendor driver as found here:
> > > >>> https://github.com/khadas/linux/blob/buildroot-aml-4.9/drivers/amlogic/ethernet/phy/amlogic.c
> > > >>>
> > > >>> First apply
> > > >>> https://git.kernel.org/pub/scm/linux/kernel/git/netdev/net.git/commit/?id=a502a8f04097e038c3daa16c5202a9538116d563
> > > >>> This patch is in the net tree currently and should show up in linux-next
> > > >>> beginning of the week.
> > > >>>
> > > >>> On top please apply the following (it includes the test patch you're working with).
> > > >>
> > > >> I triggered test jobs with this configuration (latest mainline +
> > > >> a502a8f0409 + test patch for vendor driver behaviour), and the results
> > > >> are pretty much the same as with the previous test patch from this
> > > >> thread only.
> > > >> That is, I never got the issue with non-functional link up anymore,
> > > >> but I get the (rare) issue with link not going up.
> > > >> The reproducibility is still extremely low, in the >1% range.
> > > >
> > > > Low reproducibility means the problem is still there, or at least not
> > > > understood completely.
> > > >
> > > > I understand the benefit from the user standpoint.
> > > >
> > > > Heiner if you are going to continue from the test patch you sent,
> > > > I would welcome some explanation with each of the changes.
> > > >
> > > The latest test patch was purely for checking whether we see any
> > > difference in behavior between vendor driver and the mainlined
> > > version. It's in no way meant to be applied to mainline.
> > >
> > > > We know very little about this IP and I'm not very comfortable with
> > > > tweaking/aligning with AML sdk "blindly" on a driver that has otherwise
> > > > been working well so far.
> > > >
> > >
> > > This touches one thing I wanted to ask anyway: Supposedly Amlogic
> > > didn't develop its own Ethernet PHY, and if they licensed an existing
> > > IP then it should be similar to some other existing PHY (that may
> > > have a driver in phylib).
> > >
> > > Then what I'll do is submit the following small change that brought
> > > the error rate significantly down according to Erico's tests.
> > >
> > > -	phy_trigger_machine(phydev);
> > > +	if (irq_status & INTSRC_ANEG_COMPLETE)
> > > +		phy_queue_state_machine(phydev, msecs_to_jiffies(100));
> > > +	else
> > > +		phy_trigger_machine(phydev);
> > >
> > > > Thx
> > > >
> > > >>
> > > >> So at this point, I'm not sure how much more effort to invest into
> > > >> this. Given the rate is very low and the fallback is it will just
> > > >> reset the link and proceed to work, I think the situation would
> > > >> already be much better with the solution from that test patch being
> > > >> merged. If you propose that as a patch separately, I'm happy to test
> > > >> the final submitted patch again and provide feedback there. Or if
> > > >> there is another solution to try, I can try with that too.
> > > >>
> > > >> Thanks
> > > >>
> > > >>
> > > >> Erico
> > > >
> > > Heiner
> >
> > To help reproduce this problem, I have had this problem for as long as I can remember and it still occurs with this patch.
>
> Same here, on both gxl and g12a. Occurrence remains unchanged.
>
> The issue is even reproduced if the PHY is switched to polling mode, so
> the merged change, related to the IRQ handling, is very unlikely to fix
> the problem.
>
> >
> > This doesn't happen on first boot most of the time. It happens on reboot consistently. I have tested with AML-S805X-CC board, AML-S905X-CC V1, and V2 boards.
> >
>
> On my side, I confirm the network never seems to get stuck in u-boot but
> it might break in Linux, even on the first boot after a power up from
> what I have seen so far.
>
> > I am on u-boot 22.04 with 5.18.3 which includes the patch.
> > u-boot brings up ethernet on start and can grab an IP.
> > Linux brings up ethernet and can grab an IP.
> > reboot
> > u-boot can grab an IP.
> > Linux does not get anything.
> > I have to do ip link set dev eth0 down && up once or more to get ethernet to work again.
> > Sometimes it spams meson8b-dwmac c9410000.ethernet eth0: Reset adapter. If it spams this, ethernet is dead and can't be recovered.
>
> I tried several things, none showing any improvement so far
> * Make sure LPI/EEE is disabled
> * Add the ethernet reset from the main controller on the MAC
> * Test the various DMA modes of STMMAC
> * Port the differences from u-boot and the vendor kernel in the Phy driver
>
> I have also tried to go back in time, as far as v4.19, but the problem is
> actually already there. It occurs a lot less though.
> Since v5.6+ the occurrence is quite high: approx 1 in 4 boots
> On v4.19: 1 in 50 boots - up to 150.
>
> >
>
> When the problem happens
> * link is reported up
> * ifconfig / MAC is claiming to be sending packets (Tx increasing - no Rx)
> * I see no traffic with wireshark
>
> The packets are getting lost somewhere. Can't say for sure if it is in
> the MAC or the PHY.
>
> > This is fixed via power cycle so I'm assuming some register is not reset or maybe the IP is stuck.
> > `ethtool -r eth0` also seems to work around the problem.
>
> This triggers the restart of so many things, it is close to an un/replug of
> the ethernet cable :/
>

Have you tried setting up a regulator for ethernet and implementing
runtime power management?

Best Regards
-Anand

> > Best,
> > Da Xue
Thread overview: 17+ messages
2022-02-02 20:18 net: stmmac: dwmac-meson8b: interface sometimes does not come up at boot Erico Nunes
2022-02-03 13:53 ` Vyacheslav
2022-02-07 10:41 ` Jerome Brunet
2022-02-20 16:51 ` Erico Nunes
2022-02-22 2:30 ` Samuel Holland
2022-02-26 13:53 ` Heiner Kallweit
2022-03-02 10:33 ` Erico Nunes
2022-03-02 11:01 ` Heiner Kallweit
2022-03-02 13:39 ` Jerome Brunet
2022-03-02 16:34 ` Heiner Kallweit
2022-03-06 9:40 ` Erico Nunes
2022-03-06 12:56 ` Heiner Kallweit
2022-03-09 14:45 ` Erico Nunes
2022-03-09 14:57 ` Jerome Brunet
2022-03-09 20:42 ` Heiner Kallweit
[not found] ` <CACdvmAhcyNXViJgk6o6oAoYvAjAg-NFD74Eym_nGHJx3YAqjzw@mail.gmail.com>
2022-06-13 9:10 ` Jerome Brunet
2022-07-15 5:35 ` Anand Moon