Re: [PATCH RFC net-next v2 0/3] net: stmmac: approach 2 to solve EEE LPI reset issues

public inbox for linux-tegra@vger.kernel.org
 help / color / mirror / Atom feed

* Re: [PATCH RFC net-next v2 0/3] net: stmmac: approach 2 to solve EEE LPI reset issues
       [not found] <Z8m-CRucPxDW5zZK@shell.armlinux.org.uk>
@ 2025-03-07 16:11 ` Jon Hunter
  2025-03-07 17:07   ` Russell King (Oracle)
  0 siblings, 1 reply; 6+ messages in thread
From: Jon Hunter @ 2025-03-07 16:11 UTC (permalink / raw)
  To: Russell King (Oracle), Thierry Reding, Lad, Prabhakar
  Cc: Alexandre Torgue, Andrew Lunn, Andrew Lunn, David S. Miller,
	Eric Dumazet, Heiner Kallweit, Jakub Kicinski, linux-arm-kernel,
	linux-stm32, Maxime Coquelin, netdev, Paolo Abeni,
	linux-tegra@vger.kernel.org

Hi Russell,

On 06/03/2025 15:23, Russell King (Oracle) wrote:
> Hi,
> 
> This is a second approach to solving the STMMAC reset issues caused by
> the lack of receive clock from the PHY where the media is in low power
> mode with a PHY that supports receive clock-stop.
> 
> The first approach centred around only addressing the issue in the
> resume path, but it seems to also happen when the platform glue module
> is removed and re-inserted (Jon - can you check whether that's also
> the case for you please?)
> 
> As this is more targetted, I've dropped the patches from this series
> which move the call to phylink_resume(), so the link may still come
> up too early on resume - but that's something I also intend to fix.
> 
> This is experimental - so I value test reports for this change.


The subject indicates 3 patches, but I only see 2 patches? Can you 
confirm if there are 2 or 3?

So far I have only tested to resume case with the 2 patches to make that 
that is working but on Tegra186, which has been the most problematic, it 
is not working reliably on top of next-20250305.

Cheers,
Jon

-- 
nvpublic


^ permalink raw reply	[flat|nested] 6+ messages in thread

* Re: [PATCH RFC net-next v2 0/3] net: stmmac: approach 2 to solve EEE LPI reset issues
  2025-03-07 16:11 ` [PATCH RFC net-next v2 0/3] net: stmmac: approach 2 to solve EEE LPI reset issues Jon Hunter
@ 2025-03-07 17:07   ` Russell King (Oracle)
  2025-03-10 14:20     ` Jon Hunter
  0 siblings, 1 reply; 6+ messages in thread
From: Russell King (Oracle) @ 2025-03-07 17:07 UTC (permalink / raw)
  To: Jon Hunter
  Cc: Thierry Reding, Lad, Prabhakar, Alexandre Torgue, Andrew Lunn,
	Andrew Lunn, David S. Miller, Eric Dumazet, Heiner Kallweit,
	Jakub Kicinski, linux-arm-kernel, linux-stm32, Maxime Coquelin,
	netdev, Paolo Abeni, linux-tegra@vger.kernel.org

[-- Attachment #1: Type: text/plain, Size: 1584 bytes --]

On Fri, Mar 07, 2025 at 04:11:19PM +0000, Jon Hunter wrote:
> Hi Russell,
> 
> On 06/03/2025 15:23, Russell King (Oracle) wrote:
> > Hi,
> > 
> > This is a second approach to solving the STMMAC reset issues caused by
> > the lack of receive clock from the PHY where the media is in low power
> > mode with a PHY that supports receive clock-stop.
> > 
> > The first approach centred around only addressing the issue in the
> > resume path, but it seems to also happen when the platform glue module
> > is removed and re-inserted (Jon - can you check whether that's also
> > the case for you please?)
> > 
> > As this is more targetted, I've dropped the patches from this series
> > which move the call to phylink_resume(), so the link may still come
> > up too early on resume - but that's something I also intend to fix.
> > 
> > This is experimental - so I value test reports for this change.
> 
> 
> The subject indicates 3 patches, but I only see 2 patches? Can you confirm
> if there are 2 or 3?

Yes, 2 patches is correct.

> So far I have only tested to resume case with the 2 patches to make that
> that is working but on Tegra186, which has been the most problematic, it is
> not working reliably on top of next-20250305.

To confirm, you're seeing stmmac_reset() sporadically timing out on
resume even with these patches appled? That's rather disappointing.

Do either of the two attached diffs make any difference?

Thanks for testing!

-- 
RMK's Patch system: https://www.armlinux.org.uk/developer/patches/
FTTP is here! 80Mbps down 10Mbps up. Decent connectivity at last!

[-- Attachment #2: stmmac-block-rx-clk-stop.diff --]
[-- Type: text/x-diff, Size: 1098 bytes --]

diff --git a/drivers/net/ethernet/stmicro/stmmac/stmmac_main.c b/drivers/net/ethernet/stmicro/stmmac/stmmac_main.c
index 8d3cae5b43c5..63d30e09c095 100644
--- a/drivers/net/ethernet/stmicro/stmmac/stmmac_main.c
+++ b/drivers/net/ethernet/stmicro/stmmac/stmmac_main.c
@@ -3108,9 +3108,7 @@ static int stmmac_init_dma_engine(struct stmmac_priv *priv)
 		priv->plat->dma_cfg->atds = 1;
 
 	/* Note that the PHY clock must be running for reset to complete. */
-	phylink_rx_clk_stop_block(priv->phylink);
 	ret = stmmac_reset(priv, priv->ioaddr);
-	phylink_rx_clk_stop_unblock(priv->phylink);
 	if (ret) {
 		netdev_err(priv->dev, "Failed to reset the dma\n");
 		return ret;
@@ -3480,7 +3478,9 @@ static int stmmac_hw_setup(struct net_device *dev, bool ptp_register)
 		phylink_pcs_pre_init(priv->phylink, priv->hw->phylink_pcs);
 
 	/* DMA initialization and SW reset */
+	phylink_rx_clk_stop_block(priv->phylink);
 	ret = stmmac_init_dma_engine(priv);
+	phylink_rx_clk_stop_unblock(priv->phylink);
 	if (ret < 0) {
 		netdev_err(priv->dev, "%s: DMA engine initialization failed\n",
 			   __func__);

[-- Attachment #3: stmmac-block-rx-clk-stop-2.diff --]
[-- Type: text/x-diff, Size: 1303 bytes --]

diff --git a/drivers/net/ethernet/stmicro/stmmac/stmmac_main.c b/drivers/net/ethernet/stmicro/stmmac/stmmac_main.c
index 8d3cae5b43c5..bebc9f98c875 100644
--- a/drivers/net/ethernet/stmicro/stmmac/stmmac_main.c
+++ b/drivers/net/ethernet/stmicro/stmmac/stmmac_main.c
@@ -3108,9 +3108,7 @@ static int stmmac_init_dma_engine(struct stmmac_priv *priv)
 		priv->plat->dma_cfg->atds = 1;
 
 	/* Note that the PHY clock must be running for reset to complete. */
-	phylink_rx_clk_stop_block(priv->phylink);
 	ret = stmmac_reset(priv, priv->ioaddr);
-	phylink_rx_clk_stop_unblock(priv->phylink);
 	if (ret) {
 		netdev_err(priv->dev, "Failed to reset the dma\n");
 		return ret;
@@ -4045,7 +4043,9 @@ static int __stmmac_open(struct net_device *dev,
 		}
 	}
 
+	phylink_rx_clk_stop_block(priv->phylink);
 	ret = stmmac_hw_setup(dev, true);
+	phylink_rx_clk_stop_unblock(priv->phylink);
 	if (ret < 0) {
 		netdev_err(priv->dev, "%s: Hw setup failed\n", __func__);
 		goto init_error;
@@ -7949,7 +7949,9 @@ int stmmac_resume(struct device *dev)
 	stmmac_free_tx_skbufs(priv);
 	stmmac_clear_descriptors(priv, &priv->dma_conf);
 
+	phylink_rx_clk_stop_block(priv->phylink);
 	stmmac_hw_setup(ndev, false);
+	phylink_rx_clk_stop_unblock(priv->phylink);
 	stmmac_init_coalesce(priv);
 	stmmac_set_rx_mode(ndev);
 

^ permalink raw reply related	[flat|nested] 6+ messages in thread

* Re: [PATCH RFC net-next v2 0/3] net: stmmac: approach 2 to solve EEE LPI reset issues
  2025-03-07 17:07   ` Russell King (Oracle)
@ 2025-03-10 14:20     ` Jon Hunter
  2025-03-11 13:25       ` Jon Hunter
  0 siblings, 1 reply; 6+ messages in thread
From: Jon Hunter @ 2025-03-10 14:20 UTC (permalink / raw)
  To: Russell King (Oracle)
  Cc: Thierry Reding, Lad, Prabhakar, Alexandre Torgue, Andrew Lunn,
	Andrew Lunn, David S. Miller, Eric Dumazet, Heiner Kallweit,
	Jakub Kicinski, linux-arm-kernel, linux-stm32, Maxime Coquelin,
	netdev, Paolo Abeni, linux-tegra@vger.kernel.org


On 07/03/2025 17:07, Russell King (Oracle) wrote:
> On Fri, Mar 07, 2025 at 04:11:19PM +0000, Jon Hunter wrote:
>> Hi Russell,
>>
>> On 06/03/2025 15:23, Russell King (Oracle) wrote:
>>> Hi,
>>>
>>> This is a second approach to solving the STMMAC reset issues caused by
>>> the lack of receive clock from the PHY where the media is in low power
>>> mode with a PHY that supports receive clock-stop.
>>>
>>> The first approach centred around only addressing the issue in the
>>> resume path, but it seems to also happen when the platform glue module
>>> is removed and re-inserted (Jon - can you check whether that's also
>>> the case for you please?)
>>>
>>> As this is more targetted, I've dropped the patches from this series
>>> which move the call to phylink_resume(), so the link may still come
>>> up too early on resume - but that's something I also intend to fix.
>>>
>>> This is experimental - so I value test reports for this change.
>>
>>
>> The subject indicates 3 patches, but I only see 2 patches? Can you confirm
>> if there are 2 or 3?
> 
> Yes, 2 patches is correct.
> 
>> So far I have only tested to resume case with the 2 patches to make that
>> that is working but on Tegra186, which has been the most problematic, it is
>> not working reliably on top of next-20250305.
> 
> To confirm, you're seeing stmmac_reset() sporadically timing out on
> resume even with these patches appled? That's rather disappointing.

So I am no longer seeing the reset fail, from what I can see, but now
NFS is not responding after resume ...

[   49.825094] Enabling non-boot CPUs ...
[   49.829760] Detected PIPT I-cache on CPU1
[   49.832694] CPU features: SANITY CHECK: Unexpected variation in SYS_CTR_EL0. Boot CPU: 0x0000008444c004, CPU1: 0x0000009444c004
[   49.844120] CPU features: SANITY CHECK: Unexpected variation in SYS_ID_AA64DFR0_EL1. Boot CPU: 0x00000010305106, CPU1: 0x00000010305116
[   49.856231] CPU features: SANITY CHECK: Unexpected variation in SYS_ID_DFR0_EL1. Boot CPU: 0x00000003010066, CPU1: 0x00000003001066
[   49.868081] CPU1: Booted secondary processor 0x0000000000 [0x4e0f0030]
[   49.875389] CPU1 is up
[   49.877187] Detected PIPT I-cache on CPU2
[   49.880824] CPU features: SANITY CHECK: Unexpected variation in SYS_CTR_EL0. Boot CPU: 0x0000008444c004, CPU2: 0x0000009444c004
[   49.892266] CPU features: SANITY CHECK: Unexpected variation in SYS_ID_AA64DFR0_EL1. Boot CPU: 0x00000010305106, CPU2: 0x00000010305116
[   49.904467] CPU features: SANITY CHECK: Unexpected variation in SYS_ID_DFR0_EL1. Boot CPU: 0x00000003010066, CPU2: 0x00000003001066
[   49.916257] CPU2: Booted secondary processor 0x0000000001 [0x4e0f0030]
[   49.923610] CPU2 is up
[   49.925194] Detected PIPT I-cache on CPU3
[   49.929010] CPU3: Booted secondary processor 0x0000000101 [0x411fd073]
[   49.935866] CPU3 is up
[   49.937983] Detected PIPT I-cache on CPU4
[   49.941824] CPU4: Booted secondary processor 0x0000000102 [0x411fd073]
[   49.948593] CPU4 is up
[   49.950810] Detected PIPT I-cache on CPU5
[   49.954651] CPU5: Booted secondary processor 0x0000000103 [0x411fd073]
[   49.961431] CPU5 is up
[   50.069784] dwc-eth-dwmac 2490000.ethernet eth0: configuring for phy/rgmii link mode
[   50.077634] dwmac4: Master AXI performs any burst length
[   50.080718] dwc-eth-dwmac 2490000.ethernet eth0: No Safety Features support found
[   50.088172] dwc-eth-dwmac 2490000.ethernet eth0: IEEE 1588-2008 Advanced Timestamp supported
[   50.096851] dwc-eth-dwmac 2490000.ethernet eth0: Link is Up - 1Gbps/Full - flow control rx/tx
[   50.110897] usb-conn-gpio 3520000.padctl:ports:usb2-0:connector: repeated role: device
[   50.113922] tegra-xusb 3530000.usb: Firmware timestamp: 2020-07-06 13:39:28 UTC
[   50.147552] OOM killer enabled.
[   50.148441] Restarting tasks ... done.
[   50.152552] VDDIO_SDMMC3_AP: voltage operation not allowed
[   50.154761] random: crng reseeded on system resumption
[   50.162912] PM: suspend exit
[   50.212215] VDDIO_SDMMC3_AP: voltage operation not allowed
[   50.271578] VDDIO_SDMMC3_AP: voltage operation not allowed
[   50.338597] VDDIO_SDMMC3_AP: voltage operation not allowed
[  234.474848] nfs: server 10.26.51.252 not responding, still trying
[  234.538769] nfs: server 10.26.51.252 not responding, still trying
[  237.546922] nfs: server 10.26.51.252 not responding, still trying
[  254.762753] nfs: server 10.26.51.252 not responding, timed out
[  254.762771] nfs: server 10.26.51.252 not responding, timed out
[  254.766376] nfs: server 10.26.51.252 not responding, timed out
[  254.766392] nfs: server 10.26.51.252 not responding, timed out
[  254.783778] nfs: server 10.26.51.252 not responding, timed out
[  254.789582] nfs: server 10.26.51.252 not responding, timed out
[  254.795421] nfs: server 10.26.51.252 not responding, timed out
[  254.801193] nfs: server 10.26.51.252 not responding, timed out

> Do either of the two attached diffs make any difference?

I will try these next.

Thanks
Jon

-- 
nvpublic


^ permalink raw reply	[flat|nested] 6+ messages in thread

* Re: [PATCH RFC net-next v2 0/3] net: stmmac: approach 2 to solve EEE LPI reset issues
  2025-03-10 14:20     ` Jon Hunter
@ 2025-03-11 13:25       ` Jon Hunter
  2025-03-11 13:58         ` Russell King (Oracle)
  0 siblings, 1 reply; 6+ messages in thread
From: Jon Hunter @ 2025-03-11 13:25 UTC (permalink / raw)
  To: Russell King (Oracle)
  Cc: Thierry Reding, Lad, Prabhakar, Alexandre Torgue, Andrew Lunn,
	Andrew Lunn, David S. Miller, Eric Dumazet, Heiner Kallweit,
	Jakub Kicinski, linux-arm-kernel, linux-stm32, Maxime Coquelin,
	netdev, Paolo Abeni, linux-tegra@vger.kernel.org


On 10/03/2025 14:20, Jon Hunter wrote:
> 
> On 07/03/2025 17:07, Russell King (Oracle) wrote:
>> On Fri, Mar 07, 2025 at 04:11:19PM +0000, Jon Hunter wrote:
>>> Hi Russell,
>>>
>>> On 06/03/2025 15:23, Russell King (Oracle) wrote:
>>>> Hi,
>>>>
>>>> This is a second approach to solving the STMMAC reset issues caused by
>>>> the lack of receive clock from the PHY where the media is in low power
>>>> mode with a PHY that supports receive clock-stop.
>>>>
>>>> The first approach centred around only addressing the issue in the
>>>> resume path, but it seems to also happen when the platform glue module
>>>> is removed and re-inserted (Jon - can you check whether that's also
>>>> the case for you please?)
>>>>
>>>> As this is more targetted, I've dropped the patches from this series
>>>> which move the call to phylink_resume(), so the link may still come
>>>> up too early on resume - but that's something I also intend to fix.
>>>>
>>>> This is experimental - so I value test reports for this change.
>>>
>>>
>>> The subject indicates 3 patches, but I only see 2 patches? Can you 
>>> confirm
>>> if there are 2 or 3?
>>
>> Yes, 2 patches is correct.
>>
>>> So far I have only tested to resume case with the 2 patches to make that
>>> that is working but on Tegra186, which has been the most problematic, 
>>> it is
>>> not working reliably on top of next-20250305.
>>
>> To confirm, you're seeing stmmac_reset() sporadically timing out on
>> resume even with these patches appled? That's rather disappointing.
> 
> So I am no longer seeing the reset fail, from what I can see, but now
> NFS is not responding after resume ...
> 
> [   49.825094] Enabling non-boot CPUs ...
> [   49.829760] Detected PIPT I-cache on CPU1
> [   49.832694] CPU features: SANITY CHECK: Unexpected variation in 
> SYS_CTR_EL0. Boot CPU: 0x0000008444c004, CPU1: 0x0000009444c004
> [   49.844120] CPU features: SANITY CHECK: Unexpected variation in 
> SYS_ID_AA64DFR0_EL1. Boot CPU: 0x00000010305106, CPU1: 0x00000010305116
> [   49.856231] CPU features: SANITY CHECK: Unexpected variation in 
> SYS_ID_DFR0_EL1. Boot CPU: 0x00000003010066, CPU1: 0x00000003001066
> [   49.868081] CPU1: Booted secondary processor 0x0000000000 [0x4e0f0030]
> [   49.875389] CPU1 is up
> [   49.877187] Detected PIPT I-cache on CPU2
> [   49.880824] CPU features: SANITY CHECK: Unexpected variation in 
> SYS_CTR_EL0. Boot CPU: 0x0000008444c004, CPU2: 0x0000009444c004
> [   49.892266] CPU features: SANITY CHECK: Unexpected variation in 
> SYS_ID_AA64DFR0_EL1. Boot CPU: 0x00000010305106, CPU2: 0x00000010305116
> [   49.904467] CPU features: SANITY CHECK: Unexpected variation in 
> SYS_ID_DFR0_EL1. Boot CPU: 0x00000003010066, CPU2: 0x00000003001066
> [   49.916257] CPU2: Booted secondary processor 0x0000000001 [0x4e0f0030]
> [   49.923610] CPU2 is up
> [   49.925194] Detected PIPT I-cache on CPU3
> [   49.929010] CPU3: Booted secondary processor 0x0000000101 [0x411fd073]
> [   49.935866] CPU3 is up
> [   49.937983] Detected PIPT I-cache on CPU4
> [   49.941824] CPU4: Booted secondary processor 0x0000000102 [0x411fd073]
> [   49.948593] CPU4 is up
> [   49.950810] Detected PIPT I-cache on CPU5
> [   49.954651] CPU5: Booted secondary processor 0x0000000103 [0x411fd073]
> [   49.961431] CPU5 is up
> [   50.069784] dwc-eth-dwmac 2490000.ethernet eth0: configuring for phy/ 
> rgmii link mode
> [   50.077634] dwmac4: Master AXI performs any burst length
> [   50.080718] dwc-eth-dwmac 2490000.ethernet eth0: No Safety Features 
> support found
> [   50.088172] dwc-eth-dwmac 2490000.ethernet eth0: IEEE 1588-2008 
> Advanced Timestamp supported
> [   50.096851] dwc-eth-dwmac 2490000.ethernet eth0: Link is Up - 1Gbps/ 
> Full - flow control rx/tx
> [   50.110897] usb-conn-gpio 3520000.padctl:ports:usb2-0:connector: 
> repeated role: device
> [   50.113922] tegra-xusb 3530000.usb: Firmware timestamp: 2020-07-06 
> 13:39:28 UTC
> [   50.147552] OOM killer enabled.
> [   50.148441] Restarting tasks ... done.
> [   50.152552] VDDIO_SDMMC3_AP: voltage operation not allowed
> [   50.154761] random: crng reseeded on system resumption
> [   50.162912] PM: suspend exit
> [   50.212215] VDDIO_SDMMC3_AP: voltage operation not allowed
> [   50.271578] VDDIO_SDMMC3_AP: voltage operation not allowed
> [   50.338597] VDDIO_SDMMC3_AP: voltage operation not allowed
> [  234.474848] nfs: server 10.26.51.252 not responding, still trying
> [  234.538769] nfs: server 10.26.51.252 not responding, still trying
> [  237.546922] nfs: server 10.26.51.252 not responding, still trying
> [  254.762753] nfs: server 10.26.51.252 not responding, timed out
> [  254.762771] nfs: server 10.26.51.252 not responding, timed out
> [  254.766376] nfs: server 10.26.51.252 not responding, timed out
> [  254.766392] nfs: server 10.26.51.252 not responding, timed out
> [  254.783778] nfs: server 10.26.51.252 not responding, timed out
> [  254.789582] nfs: server 10.26.51.252 not responding, timed out
> [  254.795421] nfs: server 10.26.51.252 not responding, timed out
> [  254.801193] nfs: server 10.26.51.252 not responding, timed out
> 
>> Do either of the two attached diffs make any difference?
> 
> I will try these next.


I tried both of the diffs, but both had the same problem as above and
I see these nfs timeouts after resuming. What works the best is the
original change you proposed (this is based upon the latest two
patches) ...

diff --git a/drivers/net/ethernet/stmicro/stmmac/stmmac_main.c b/drivers/net/ethernet/stmicro/stmmac/stmmac_main.c
index e2146d3aee74..48a646b76a29 100644
--- a/drivers/net/ethernet/stmicro/stmmac/stmmac_main.c
+++ b/drivers/net/ethernet/stmicro/stmmac/stmmac_main.c
@@ -3109,10 +3109,7 @@ static int stmmac_init_dma_engine(struct stmmac_priv *priv)
         if (priv->extend_desc && (priv->mode == STMMAC_RING_MODE))
                 priv->plat->dma_cfg->atds = 1;
  
-       /* Note that the PHY clock must be running for reset to complete. */
-       phylink_rx_clk_stop_block(priv->phylink);
         ret = stmmac_reset(priv, priv->ioaddr);
-       phylink_rx_clk_stop_unblock(priv->phylink);
         if (ret) {
                 netdev_err(priv->dev, "Failed to reset the dma\n");
                 return ret;
@@ -7951,6 +7948,8 @@ int stmmac_resume(struct device *dev)
         rtnl_lock();
         mutex_lock(&priv->lock);
  
+       /* Note that the PHY clock must be running for reset to complete. */
+       phylink_rx_clk_stop_block(priv->phylink);
         stmmac_reset_queues_param(priv);
  
         stmmac_free_tx_skbufs(priv);
@@ -7961,6 +7960,7 @@ int stmmac_resume(struct device *dev)
         stmmac_set_rx_mode(ndev);
  
         stmmac_restore_hw_vlan_rx_fltr(priv, ndev, priv->hw);
+       phylink_rx_clk_stop_unblock(priv->phylink);
  
         stmmac_enable_all_queues(priv);
         stmmac_enable_all_dma_irq(priv);

-- 
nvpublic


^ permalink raw reply related	[flat|nested] 6+ messages in thread

* Re: [PATCH RFC net-next v2 0/3] net: stmmac: approach 2 to solve EEE LPI reset issues
  2025-03-11 13:25       ` Jon Hunter
@ 2025-03-11 13:58         ` Russell King (Oracle)
  2025-03-11 17:32           ` Jon Hunter
  0 siblings, 1 reply; 6+ messages in thread
From: Russell King (Oracle) @ 2025-03-11 13:58 UTC (permalink / raw)
  To: Jon Hunter
  Cc: Thierry Reding, Lad, Prabhakar, Alexandre Torgue, Andrew Lunn,
	Andrew Lunn, David S. Miller, Eric Dumazet, Heiner Kallweit,
	Jakub Kicinski, linux-arm-kernel, linux-stm32, Maxime Coquelin,
	netdev, Paolo Abeni, linux-tegra@vger.kernel.org

On Tue, Mar 11, 2025 at 01:25:58PM +0000, Jon Hunter wrote:
> 
> On 10/03/2025 14:20, Jon Hunter wrote:
> > 
> > On 07/03/2025 17:07, Russell King (Oracle) wrote:
> > > On Fri, Mar 07, 2025 at 04:11:19PM +0000, Jon Hunter wrote:
> > > > Hi Russell,
> > > > 
> > > > On 06/03/2025 15:23, Russell King (Oracle) wrote:
> > > > > Hi,
> > > > > 
> > > > > This is a second approach to solving the STMMAC reset issues caused by
> > > > > the lack of receive clock from the PHY where the media is in low power
> > > > > mode with a PHY that supports receive clock-stop.
> > > > > 
> > > > > The first approach centred around only addressing the issue in the
> > > > > resume path, but it seems to also happen when the platform glue module
> > > > > is removed and re-inserted (Jon - can you check whether that's also
> > > > > the case for you please?)
> > > > > 
> > > > > As this is more targetted, I've dropped the patches from this series
> > > > > which move the call to phylink_resume(), so the link may still come
> > > > > up too early on resume - but that's something I also intend to fix.
> > > > > 
> > > > > This is experimental - so I value test reports for this change.
> > > > 
> > > > 
> > > > The subject indicates 3 patches, but I only see 2 patches? Can
> > > > you confirm
> > > > if there are 2 or 3?
> > > 
> > > Yes, 2 patches is correct.
> > > 
> > > > So far I have only tested to resume case with the 2 patches to make that
> > > > that is working but on Tegra186, which has been the most
> > > > problematic, it is
> > > > not working reliably on top of next-20250305.
> > > 
> > > To confirm, you're seeing stmmac_reset() sporadically timing out on
> > > resume even with these patches appled? That's rather disappointing.
> > 
> > So I am no longer seeing the reset fail, from what I can see, but now
> > NFS is not responding after resume ...
> > 
> > [   49.825094] Enabling non-boot CPUs ...
> > [   49.829760] Detected PIPT I-cache on CPU1
> > [   49.832694] CPU features: SANITY CHECK: Unexpected variation in
> > SYS_CTR_EL0. Boot CPU: 0x0000008444c004, CPU1: 0x0000009444c004
> > [   49.844120] CPU features: SANITY CHECK: Unexpected variation in
> > SYS_ID_AA64DFR0_EL1. Boot CPU: 0x00000010305106, CPU1: 0x00000010305116
> > [   49.856231] CPU features: SANITY CHECK: Unexpected variation in
> > SYS_ID_DFR0_EL1. Boot CPU: 0x00000003010066, CPU1: 0x00000003001066
> > [   49.868081] CPU1: Booted secondary processor 0x0000000000 [0x4e0f0030]
> > [   49.875389] CPU1 is up
> > [   49.877187] Detected PIPT I-cache on CPU2
> > [   49.880824] CPU features: SANITY CHECK: Unexpected variation in
> > SYS_CTR_EL0. Boot CPU: 0x0000008444c004, CPU2: 0x0000009444c004
> > [   49.892266] CPU features: SANITY CHECK: Unexpected variation in
> > SYS_ID_AA64DFR0_EL1. Boot CPU: 0x00000010305106, CPU2: 0x00000010305116
> > [   49.904467] CPU features: SANITY CHECK: Unexpected variation in
> > SYS_ID_DFR0_EL1. Boot CPU: 0x00000003010066, CPU2: 0x00000003001066
> > [   49.916257] CPU2: Booted secondary processor 0x0000000001 [0x4e0f0030]
> > [   49.923610] CPU2 is up
> > [   49.925194] Detected PIPT I-cache on CPU3
> > [   49.929010] CPU3: Booted secondary processor 0x0000000101 [0x411fd073]
> > [   49.935866] CPU3 is up
> > [   49.937983] Detected PIPT I-cache on CPU4
> > [   49.941824] CPU4: Booted secondary processor 0x0000000102 [0x411fd073]
> > [   49.948593] CPU4 is up
> > [   49.950810] Detected PIPT I-cache on CPU5
> > [   49.954651] CPU5: Booted secondary processor 0x0000000103 [0x411fd073]
> > [   49.961431] CPU5 is up
> > [   50.069784] dwc-eth-dwmac 2490000.ethernet eth0: configuring for phy/
> > rgmii link mode
> > [   50.077634] dwmac4: Master AXI performs any burst length
> > [   50.080718] dwc-eth-dwmac 2490000.ethernet eth0: No Safety Features
> > support found
> > [   50.088172] dwc-eth-dwmac 2490000.ethernet eth0: IEEE 1588-2008
> > Advanced Timestamp supported
> > [   50.096851] dwc-eth-dwmac 2490000.ethernet eth0: Link is Up - 1Gbps/
> > Full - flow control rx/tx
> > [   50.110897] usb-conn-gpio 3520000.padctl:ports:usb2-0:connector:
> > repeated role: device
> > [   50.113922] tegra-xusb 3530000.usb: Firmware timestamp: 2020-07-06
> > 13:39:28 UTC
> > [   50.147552] OOM killer enabled.
> > [   50.148441] Restarting tasks ... done.
> > [   50.152552] VDDIO_SDMMC3_AP: voltage operation not allowed
> > [   50.154761] random: crng reseeded on system resumption
> > [   50.162912] PM: suspend exit
> > [   50.212215] VDDIO_SDMMC3_AP: voltage operation not allowed
> > [   50.271578] VDDIO_SDMMC3_AP: voltage operation not allowed
> > [   50.338597] VDDIO_SDMMC3_AP: voltage operation not allowed
> > [  234.474848] nfs: server 10.26.51.252 not responding, still trying
> > [  234.538769] nfs: server 10.26.51.252 not responding, still trying
> > [  237.546922] nfs: server 10.26.51.252 not responding, still trying
> > [  254.762753] nfs: server 10.26.51.252 not responding, timed out
> > [  254.762771] nfs: server 10.26.51.252 not responding, timed out
> > [  254.766376] nfs: server 10.26.51.252 not responding, timed out
> > [  254.766392] nfs: server 10.26.51.252 not responding, timed out
> > [  254.783778] nfs: server 10.26.51.252 not responding, timed out
> > [  254.789582] nfs: server 10.26.51.252 not responding, timed out
> > [  254.795421] nfs: server 10.26.51.252 not responding, timed out
> > [  254.801193] nfs: server 10.26.51.252 not responding, timed out
> > 
> > > Do either of the two attached diffs make any difference?
> > 
> > I will try these next.
> 
> 
> I tried both of the diffs, but both had the same problem as above and
> I see these nfs timeouts after resuming. What works the best is the
> original change you proposed (this is based upon the latest two
> patches) ...

I'm wondering whether there's something else which needs the RX clock
running in order to take effect.

> diff --git a/drivers/net/ethernet/stmicro/stmmac/stmmac_main.c b/drivers/net/ethernet/stmicro/stmmac/stmmac_main.c
> index e2146d3aee74..48a646b76a29 100644
> --- a/drivers/net/ethernet/stmicro/stmmac/stmmac_main.c
> +++ b/drivers/net/ethernet/stmicro/stmmac/stmmac_main.c
> @@ -3109,10 +3109,7 @@ static int stmmac_init_dma_engine(struct stmmac_priv *priv)
>         if (priv->extend_desc && (priv->mode == STMMAC_RING_MODE))
>                 priv->plat->dma_cfg->atds = 1;
> -       /* Note that the PHY clock must be running for reset to complete. */
> -       phylink_rx_clk_stop_block(priv->phylink);
>         ret = stmmac_reset(priv, priv->ioaddr);
> -       phylink_rx_clk_stop_unblock(priv->phylink);
>         if (ret) {
>                 netdev_err(priv->dev, "Failed to reset the dma\n");
>                 return ret;
> @@ -7951,6 +7948,8 @@ int stmmac_resume(struct device *dev)
>         rtnl_lock();
>         mutex_lock(&priv->lock);
> +       /* Note that the PHY clock must be running for reset to complete. */
> +       phylink_rx_clk_stop_block(priv->phylink);
>         stmmac_reset_queues_param(priv);
>         stmmac_free_tx_skbufs(priv);
> @@ -7961,6 +7960,7 @@ int stmmac_resume(struct device *dev)
>         stmmac_set_rx_mode(ndev);
>         stmmac_restore_hw_vlan_rx_fltr(priv, ndev, priv->hw);
> +       phylink_rx_clk_stop_unblock(priv->phylink);
>         stmmac_enable_all_queues(priv);
>         stmmac_enable_all_dma_irq(priv);

If you haven't already, can you try shrinking down the number of
functions that are within the block..unblock region please?

Looking at the functions called:

stmmac_reset_queues_param()
stmmac_free_tx_skbufs()
stmmac_clear_descriptors()
	These look like it's only manipulating software state

stmmac_hw_setup()
	We know this calls stmmac_reset() and thus needs the blocking

stmmac_init_coalesce()
	Looks like it's only manipulating software state

stmmac_set_rx_mode()
	This manipulates GMAC_RXQ_CTRL4, GMAC_HASH_TAB(*),
	GMAC_ADDR_HIGH(*), GMAC_ADDR_LOW(*), and GMAC_PACKET_FILTER.

stmmac_restore_hw_vlan_rx_fltr()
	This manipulates GMAC_VLAN_TAG_DATA, GMAC_VLAN_TAG,
	GMAC_VLAN_HASH_TABLE, GMAC_VLAN_TAG

I wonder whether the last two also require the RX clock to be running.

The reason I want to track this down is that we may need to add
block..unblock elsewhere in the driver to ensure that the RX clock
is running when configuration is done elsewhere.

Thanks.

-- 
RMK's Patch system: https://www.armlinux.org.uk/developer/patches/
FTTP is here! 80Mbps down 10Mbps up. Decent connectivity at last!

^ permalink raw reply	[flat|nested] 6+ messages in thread

* Re: [PATCH RFC net-next v2 0/3] net: stmmac: approach 2 to solve EEE LPI reset issues
  2025-03-11 13:58         ` Russell King (Oracle)
@ 2025-03-11 17:32           ` Jon Hunter
  0 siblings, 0 replies; 6+ messages in thread
From: Jon Hunter @ 2025-03-11 17:32 UTC (permalink / raw)
  To: Russell King (Oracle)
  Cc: Thierry Reding, Lad, Prabhakar, Alexandre Torgue, Andrew Lunn,
	Andrew Lunn, David S. Miller, Eric Dumazet, Heiner Kallweit,
	Jakub Kicinski, linux-arm-kernel, linux-stm32, Maxime Coquelin,
	netdev, Paolo Abeni, linux-tegra@vger.kernel.org


On 11/03/2025 13:58, Russell King (Oracle) wrote:

...

> I'm wondering whether there's something else which needs the RX clock
> running in order to take effect.
> 
>> diff --git a/drivers/net/ethernet/stmicro/stmmac/stmmac_main.c b/drivers/net/ethernet/stmicro/stmmac/stmmac_main.c
>> index e2146d3aee74..48a646b76a29 100644
>> --- a/drivers/net/ethernet/stmicro/stmmac/stmmac_main.c
>> +++ b/drivers/net/ethernet/stmicro/stmmac/stmmac_main.c
>> @@ -3109,10 +3109,7 @@ static int stmmac_init_dma_engine(struct stmmac_priv *priv)
>>          if (priv->extend_desc && (priv->mode == STMMAC_RING_MODE))
>>                  priv->plat->dma_cfg->atds = 1;
>> -       /* Note that the PHY clock must be running for reset to complete. */
>> -       phylink_rx_clk_stop_block(priv->phylink);
>>          ret = stmmac_reset(priv, priv->ioaddr);
>> -       phylink_rx_clk_stop_unblock(priv->phylink);
>>          if (ret) {
>>                  netdev_err(priv->dev, "Failed to reset the dma\n");
>>                  return ret;
>> @@ -7951,6 +7948,8 @@ int stmmac_resume(struct device *dev)
>>          rtnl_lock();
>>          mutex_lock(&priv->lock);
>> +       /* Note that the PHY clock must be running for reset to complete. */
>> +       phylink_rx_clk_stop_block(priv->phylink);
>>          stmmac_reset_queues_param(priv);
>>          stmmac_free_tx_skbufs(priv);
>> @@ -7961,6 +7960,7 @@ int stmmac_resume(struct device *dev)
>>          stmmac_set_rx_mode(ndev);
>>          stmmac_restore_hw_vlan_rx_fltr(priv, ndev, priv->hw);
>> +       phylink_rx_clk_stop_unblock(priv->phylink);
>>          stmmac_enable_all_queues(priv);
>>          stmmac_enable_all_dma_irq(priv);
> 
> If you haven't already, can you try shrinking down the number of
> functions that are within the block..unblock region please?

It seems that at a minimum I need to block/unblock around the following
functions ...

diff --git a/drivers/net/ethernet/stmicro/stmmac/stmmac_main.c b/drivers/net/ethernet/stmicro/stmmac/stmmac_main.c
index e2146d3aee74..46c343088b1f 100644
--- a/drivers/net/ethernet/stmicro/stmmac/stmmac_main.c
+++ b/drivers/net/ethernet/stmicro/stmmac/stmmac_main.c
@@ -3109,10 +3109,7 @@ static int stmmac_init_dma_engine(struct stmmac_priv *priv)
         if (priv->extend_desc && (priv->mode == STMMAC_RING_MODE))
                 priv->plat->dma_cfg->atds = 1;
  
-       /* Note that the PHY clock must be running for reset to complete. */
-       phylink_rx_clk_stop_block(priv->phylink);
         ret = stmmac_reset(priv, priv->ioaddr);
-       phylink_rx_clk_stop_unblock(priv->phylink);
         if (ret) {
                 netdev_err(priv->dev, "Failed to reset the dma\n");
                 return ret;
@@ -7953,10 +7950,13 @@ int stmmac_resume(struct device *dev)
  
         stmmac_reset_queues_param(priv);
  
+       /* Note that the PHY clock must be running for reset to complete. */
+       phylink_rx_clk_stop_block(priv->phylink);
         stmmac_free_tx_skbufs(priv);
         stmmac_clear_descriptors(priv, &priv->dma_conf);
  
         stmmac_hw_setup(ndev, false);
+       phylink_rx_clk_stop_unblock(priv->phylink);
         stmmac_init_coalesce(priv);
         stmmac_set_rx_mode(ndev);

> Looking at the functions called:
> 
> stmmac_reset_queues_param()
> stmmac_free_tx_skbufs()
> stmmac_clear_descriptors()
> 	These look like it's only manipulating software state

So it appears that the last two need to be in the block/unblock region
and ...
  
> stmmac_hw_setup()
> 	We know this calls stmmac_reset() and thus needs the blocking

... this one, which is no surprise, but the others are OK. Please
note so far I have only tested on the Tegra186 board which seems
to be the most sensitive.

Cheers
Jon

-- 
nvpublic


^ permalink raw reply related	[flat|nested] 6+ messages in thread

end of thread, other threads:[~2025-03-11 17:32 UTC | newest]

Thread overview: 6+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
     [not found] <Z8m-CRucPxDW5zZK@shell.armlinux.org.uk>
2025-03-07 16:11 ` [PATCH RFC net-next v2 0/3] net: stmmac: approach 2 to solve EEE LPI reset issues Jon Hunter
2025-03-07 17:07   ` Russell King (Oracle)
2025-03-10 14:20     ` Jon Hunter
2025-03-11 13:25       ` Jon Hunter
2025-03-11 13:58         ` Russell King (Oracle)
2025-03-11 17:32           ` Jon Hunter

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox