netdev.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* [PATCH v3 0/3] mlxbf_gige: Fix several bugs
@ 2023-09-22 17:36 Asmaa Mnebhi
  2023-09-22 17:36 ` [PATCH v3 1/3] mlxbf_gige: Fix kernel panic at shutdown Asmaa Mnebhi
                   ` (2 more replies)
  0 siblings, 3 replies; 11+ messages in thread
From: Asmaa Mnebhi @ 2023-09-22 17:36 UTC (permalink / raw)
  To: davem, edumazet, kuba, pabeni, olteanv; +Cc: Asmaa Mnebhi, netdev, davthompson

Fix several bugs in the mlxbf_gige driver:
1) panic at shutdown
2) no ip assigned although link is up
3) clogged port due to RX pause frames

Asmaa Mnebhi (3):
  mlxbf_gige: Fix kernel panic at shutdown
  mlxbf_gige: Fix intermittent no ip issue
  mlxbf_gige: Enable the GigE port in mlxbf_gige_open

 .../mellanox/mlxbf_gige/mlxbf_gige_main.c     | 47 +++++++++----------
 .../mellanox/mlxbf_gige/mlxbf_gige_rx.c       |  9 ++--
 2 files changed, 28 insertions(+), 28 deletions(-)

-- 
2.30.1


^ permalink raw reply	[flat|nested] 11+ messages in thread

* [PATCH v3 1/3] mlxbf_gige: Fix kernel panic at shutdown
  2023-09-22 17:36 [PATCH v3 0/3] mlxbf_gige: Fix several bugs Asmaa Mnebhi
@ 2023-09-22 17:36 ` Asmaa Mnebhi
  2023-09-22 17:44   ` Florian Fainelli
  2023-09-22 17:36 ` [PATCH v3 2/3] mlxbf_gige: Fix intermittent no ip issue Asmaa Mnebhi
  2023-09-22 17:36 ` [PATCH v3 3/3] mlxbf_gige: Enable the GigE port in mlxbf_gige_open Asmaa Mnebhi
  2 siblings, 1 reply; 11+ messages in thread
From: Asmaa Mnebhi @ 2023-09-22 17:36 UTC (permalink / raw)
  To: davem, edumazet, kuba, pabeni, olteanv; +Cc: Asmaa Mnebhi, netdev, davthompson

There is a race condition happening during shutdown due to pending napi transactions.
Since mlxbf_gige_poll is still running, it tries to access a NULL pointer and as a
result causes a kernel panic:

[  284.074822] Unable to handle kernel NULL pointer dereference at virtual address 0000000000000070
...
[  284.322326] Call trace:
[  284.324757]  mlxbf_gige_handle_tx_complete+0xc8/0x170 [mlxbf_gige]
[  284.330924]  mlxbf_gige_poll+0x54/0x160 [mlxbf_gige]
[  284.335876]  __napi_poll+0x40/0x1c8
[  284.339353]  net_rx_action+0x314/0x3a0
[  284.343086]  __do_softirq+0x128/0x334
[  284.346734]  run_ksoftirqd+0x54/0x6c
[  284.350294]  smpboot_thread_fn+0x14c/0x190
[  284.354375]  kthread+0x10c/0x110
[  284.357588]  ret_from_fork+0x10/0x20
[  284.361150] Code: 8b070000 f9000ea0 f95056c0 f86178a1 (b9407002)
[  284.367227] ---[ end trace a18340bbb9ea2fa7 ]---

To fix this, invoke mlxbf_gige_remove to disable and dequeue napi during shutdown,
and also return in the case where "priv" is NULL in the poll function.

Fixes: f92e1869d74e ("Add Mellanox BlueField Gigabit Ethernet driver")
Signed-off-by: Asmaa Mnebhi <asmaa@nvidia.com>
Reviewed-by: David Thompson <davthompson@nvidia.com>
---
v2->v3:
- Add the logic to clean the port to the remove() function
v1-v2:
- make mlxbf_gige_shutdown() the same as the mlxbf_gige_remove()

 .../mellanox/mlxbf_gige/mlxbf_gige_main.c     | 21 ++++++++-----------
 .../mellanox/mlxbf_gige/mlxbf_gige_rx.c       |  3 +++
 2 files changed, 12 insertions(+), 12 deletions(-)

diff --git a/drivers/net/ethernet/mellanox/mlxbf_gige/mlxbf_gige_main.c b/drivers/net/ethernet/mellanox/mlxbf_gige/mlxbf_gige_main.c
index 694de9513b9f..74185b02daa0 100644
--- a/drivers/net/ethernet/mellanox/mlxbf_gige/mlxbf_gige_main.c
+++ b/drivers/net/ethernet/mellanox/mlxbf_gige/mlxbf_gige_main.c
@@ -471,24 +471,21 @@ static int mlxbf_gige_probe(struct platform_device *pdev)
 	return err;
 }
 
-static int mlxbf_gige_remove(struct platform_device *pdev)
+static void mlxbf_gige_remove(struct platform_device *pdev)
 {
 	struct mlxbf_gige *priv = platform_get_drvdata(pdev);
 
+	if (!priv)
+		return;
+
+	writeq(0, priv->base + MLXBF_GIGE_INT_EN);
+        mlxbf_gige_clean_port(priv);
 	unregister_netdev(priv->netdev);
 	phy_disconnect(priv->netdev->phydev);
 	mlxbf_gige_mdio_remove(priv);
-	platform_set_drvdata(pdev, NULL);
-
-	return 0;
-}
-
-static void mlxbf_gige_shutdown(struct platform_device *pdev)
-{
-	struct mlxbf_gige *priv = platform_get_drvdata(pdev);
-
 	writeq(0, priv->base + MLXBF_GIGE_INT_EN);
 	mlxbf_gige_clean_port(priv);
+	platform_set_drvdata(pdev, NULL);
 }
 
 static const struct acpi_device_id __maybe_unused mlxbf_gige_acpi_match[] = {
@@ -499,8 +496,8 @@ MODULE_DEVICE_TABLE(acpi, mlxbf_gige_acpi_match);
 
 static struct platform_driver mlxbf_gige_driver = {
 	.probe = mlxbf_gige_probe,
-	.remove = mlxbf_gige_remove,
-	.shutdown = mlxbf_gige_shutdown,
+	.remove_new = mlxbf_gige_remove,
+	.shutdown = mlxbf_gige_remove,
 	.driver = {
 		.name = KBUILD_MODNAME,
 		.acpi_match_table = ACPI_PTR(mlxbf_gige_acpi_match),
diff --git a/drivers/net/ethernet/mellanox/mlxbf_gige/mlxbf_gige_rx.c b/drivers/net/ethernet/mellanox/mlxbf_gige/mlxbf_gige_rx.c
index 0d5a41a2ae01..cfb8fb957f0c 100644
--- a/drivers/net/ethernet/mellanox/mlxbf_gige/mlxbf_gige_rx.c
+++ b/drivers/net/ethernet/mellanox/mlxbf_gige/mlxbf_gige_rx.c
@@ -298,6 +298,9 @@ int mlxbf_gige_poll(struct napi_struct *napi, int budget)
 
 	priv = container_of(napi, struct mlxbf_gige, napi);
 
+	if (!priv)
+		return 0;
+
 	mlxbf_gige_handle_tx_complete(priv);
 
 	do {
-- 
2.30.1


^ permalink raw reply related	[flat|nested] 11+ messages in thread

* [PATCH v3 2/3] mlxbf_gige: Fix intermittent no ip issue
  2023-09-22 17:36 [PATCH v3 0/3] mlxbf_gige: Fix several bugs Asmaa Mnebhi
  2023-09-22 17:36 ` [PATCH v3 1/3] mlxbf_gige: Fix kernel panic at shutdown Asmaa Mnebhi
@ 2023-09-22 17:36 ` Asmaa Mnebhi
  2023-09-22 17:47   ` Florian Fainelli
  2023-09-22 17:36 ` [PATCH v3 3/3] mlxbf_gige: Enable the GigE port in mlxbf_gige_open Asmaa Mnebhi
  2 siblings, 1 reply; 11+ messages in thread
From: Asmaa Mnebhi @ 2023-09-22 17:36 UTC (permalink / raw)
  To: davem, edumazet, kuba, pabeni, olteanv; +Cc: Asmaa Mnebhi, netdev, davthompson

Although the link is up, there is no ip assigned on a setup with high background
traffic. Nothing is transmitted nor received.
The RX error count keeps on increasing. After several minutes, the RX error count
stagnates and the GigE interface finally gets an ip.

The issue is in the mlxbf_gige_rx_init function. As soon as the RX DMA is enabled,
the RX CI reaches the max of 128, and it becomes equal to RX PI. RX CI doesn't decrease
since the code hasn't ran phy_start yet.

The solution is to move the rx init after phy_start.

Fixes: f92e1869d74e ("Add Mellanox BlueField Gigabit Ethernet driver")
Signed-off-by: Asmaa Mnebhi <asmaa@nvidia.com>
Reviewed-by: David Thompson <davthompson@nvidia.com>
---
v2->v3:
- No changes
v1->v2:
- No changes

 .../ethernet/mellanox/mlxbf_gige/mlxbf_gige_main.c | 14 +++++++-------
 .../ethernet/mellanox/mlxbf_gige/mlxbf_gige_rx.c   |  6 +++---
 2 files changed, 10 insertions(+), 10 deletions(-)

diff --git a/drivers/net/ethernet/mellanox/mlxbf_gige/mlxbf_gige_main.c b/drivers/net/ethernet/mellanox/mlxbf_gige/mlxbf_gige_main.c
index 74185b02daa0..fd4fac1ca26c 100644
--- a/drivers/net/ethernet/mellanox/mlxbf_gige/mlxbf_gige_main.c
+++ b/drivers/net/ethernet/mellanox/mlxbf_gige/mlxbf_gige_main.c
@@ -147,14 +147,14 @@ static int mlxbf_gige_open(struct net_device *netdev)
 	 */
 	priv->valid_polarity = 0;
 
-	err = mlxbf_gige_rx_init(priv);
+	phy_start(phydev);
+
+	err = mlxbf_gige_tx_init(priv);
 	if (err)
 		goto free_irqs;
-	err = mlxbf_gige_tx_init(priv);
+	err = mlxbf_gige_rx_init(priv);
 	if (err)
-		goto rx_deinit;
-
-	phy_start(phydev);
+		goto tx_deinit;
 
 	netif_napi_add(netdev, &priv->napi, mlxbf_gige_poll);
 	napi_enable(&priv->napi);
@@ -176,8 +176,8 @@ static int mlxbf_gige_open(struct net_device *netdev)
 
 	return 0;
 
-rx_deinit:
-	mlxbf_gige_rx_deinit(priv);
+tx_deinit:
+	mlxbf_gige_tx_deinit(priv);
 
 free_irqs:
 	mlxbf_gige_free_irqs(priv);
diff --git a/drivers/net/ethernet/mellanox/mlxbf_gige/mlxbf_gige_rx.c b/drivers/net/ethernet/mellanox/mlxbf_gige/mlxbf_gige_rx.c
index cfb8fb957f0c..d82feeabb061 100644
--- a/drivers/net/ethernet/mellanox/mlxbf_gige/mlxbf_gige_rx.c
+++ b/drivers/net/ethernet/mellanox/mlxbf_gige/mlxbf_gige_rx.c
@@ -142,6 +142,9 @@ int mlxbf_gige_rx_init(struct mlxbf_gige *priv)
 	writeq(MLXBF_GIGE_RX_MAC_FILTER_COUNT_PASS_EN,
 	       priv->base + MLXBF_GIGE_RX_MAC_FILTER_COUNT_PASS);
 
+	writeq(ilog2(priv->rx_q_entries),
+	       priv->base + MLXBF_GIGE_RX_WQE_SIZE_LOG2);
+
 	/* Clear MLXBF_GIGE_INT_MASK 'receive pkt' bit to
 	 * indicate readiness to receive interrupts
 	 */
@@ -154,9 +157,6 @@ int mlxbf_gige_rx_init(struct mlxbf_gige *priv)
 	data |= MLXBF_GIGE_RX_DMA_EN;
 	writeq(data, priv->base + MLXBF_GIGE_RX_DMA);
 
-	writeq(ilog2(priv->rx_q_entries),
-	       priv->base + MLXBF_GIGE_RX_WQE_SIZE_LOG2);
-
 	return 0;
 
 free_wqe_and_skb:
-- 
2.30.1


^ permalink raw reply related	[flat|nested] 11+ messages in thread

* [PATCH v3 3/3] mlxbf_gige: Enable the GigE port in mlxbf_gige_open
  2023-09-22 17:36 [PATCH v3 0/3] mlxbf_gige: Fix several bugs Asmaa Mnebhi
  2023-09-22 17:36 ` [PATCH v3 1/3] mlxbf_gige: Fix kernel panic at shutdown Asmaa Mnebhi
  2023-09-22 17:36 ` [PATCH v3 2/3] mlxbf_gige: Fix intermittent no ip issue Asmaa Mnebhi
@ 2023-09-22 17:36 ` Asmaa Mnebhi
  2023-09-22 17:48   ` Florian Fainelli
  2 siblings, 1 reply; 11+ messages in thread
From: Asmaa Mnebhi @ 2023-09-22 17:36 UTC (permalink / raw)
  To: davem, edumazet, kuba, pabeni, olteanv; +Cc: Asmaa Mnebhi, netdev, davthompson

At the moment, the GigE port is enabled in the mlxbf_gige_probe
function. If the mlxbf_gige_open is not executed, this could cause
pause frames to increase in the case where there is high backgroud
traffic. This results in clogging the port.
So move enabling the OOB port to mlxbf_gige_open.

Fixes: f92e1869d74e ("Add Mellanox BlueField Gigabit Ethernet driver")
Signed-off-by: Asmaa Mnebhi <asmaa@nvidia.com>
Reviewed-by: David Thompson <davthompson@nvidia.com>
---
v2->v3:
- No changes
v1->v2:
- Fix typo: "base" to "priv->base"

 .../ethernet/mellanox/mlxbf_gige/mlxbf_gige_main.c   | 12 ++++++------
 1 file changed, 6 insertions(+), 6 deletions(-)

diff --git a/drivers/net/ethernet/mellanox/mlxbf_gige/mlxbf_gige_main.c b/drivers/net/ethernet/mellanox/mlxbf_gige/mlxbf_gige_main.c
index fd4fac1ca26c..e74946b802a9 100644
--- a/drivers/net/ethernet/mellanox/mlxbf_gige/mlxbf_gige_main.c
+++ b/drivers/net/ethernet/mellanox/mlxbf_gige/mlxbf_gige_main.c
@@ -130,9 +130,15 @@ static int mlxbf_gige_open(struct net_device *netdev)
 {
 	struct mlxbf_gige *priv = netdev_priv(netdev);
 	struct phy_device *phydev = netdev->phydev;
+	u64 control;
 	u64 int_en;
 	int err;
 
+	/* Perform general init of GigE block */
+	control = readq(priv->base + MLXBF_GIGE_CONTROL);
+	control |= MLXBF_GIGE_CONTROL_PORT_EN;
+	writeq(control, priv->base + MLXBF_GIGE_CONTROL);
+
 	err = mlxbf_gige_request_irqs(priv);
 	if (err)
 		return err;
@@ -365,7 +371,6 @@ static int mlxbf_gige_probe(struct platform_device *pdev)
 	void __iomem *plu_base;
 	void __iomem *base;
 	int addr, phy_irq;
-	u64 control;
 	int err;
 
 	base = devm_platform_ioremap_resource(pdev, MLXBF_GIGE_RES_MAC);
@@ -380,11 +385,6 @@ static int mlxbf_gige_probe(struct platform_device *pdev)
 	if (IS_ERR(plu_base))
 		return PTR_ERR(plu_base);
 
-	/* Perform general init of GigE block */
-	control = readq(base + MLXBF_GIGE_CONTROL);
-	control |= MLXBF_GIGE_CONTROL_PORT_EN;
-	writeq(control, base + MLXBF_GIGE_CONTROL);
-
 	netdev = devm_alloc_etherdev(&pdev->dev, sizeof(*priv));
 	if (!netdev)
 		return -ENOMEM;
-- 
2.30.1


^ permalink raw reply related	[flat|nested] 11+ messages in thread

* Re: [PATCH v3 1/3] mlxbf_gige: Fix kernel panic at shutdown
  2023-09-22 17:36 ` [PATCH v3 1/3] mlxbf_gige: Fix kernel panic at shutdown Asmaa Mnebhi
@ 2023-09-22 17:44   ` Florian Fainelli
  2023-10-13 14:58     ` Asmaa Mnebhi
  0 siblings, 1 reply; 11+ messages in thread
From: Florian Fainelli @ 2023-09-22 17:44 UTC (permalink / raw)
  To: Asmaa Mnebhi, davem, edumazet, kuba, pabeni, olteanv; +Cc: netdev, davthompson



On 9/22/2023 10:36 AM, Asmaa Mnebhi wrote:
> There is a race condition happening during shutdown due to pending napi transactions.
> Since mlxbf_gige_poll is still running, it tries to access a NULL pointer and as a
> result causes a kernel panic:
> 
> [  284.074822] Unable to handle kernel NULL pointer dereference at virtual address 0000000000000070
> ...
> [  284.322326] Call trace:
> [  284.324757]  mlxbf_gige_handle_tx_complete+0xc8/0x170 [mlxbf_gige]
> [  284.330924]  mlxbf_gige_poll+0x54/0x160 [mlxbf_gige]
> [  284.335876]  __napi_poll+0x40/0x1c8
> [  284.339353]  net_rx_action+0x314/0x3a0
> [  284.343086]  __do_softirq+0x128/0x334
> [  284.346734]  run_ksoftirqd+0x54/0x6c
> [  284.350294]  smpboot_thread_fn+0x14c/0x190
> [  284.354375]  kthread+0x10c/0x110
> [  284.357588]  ret_from_fork+0x10/0x20
> [  284.361150] Code: 8b070000 f9000ea0 f95056c0 f86178a1 (b9407002)
> [  284.367227] ---[ end trace a18340bbb9ea2fa7 ]---
> 
> To fix this, invoke mlxbf_gige_remove to disable and dequeue napi during shutdown,
> and also return in the case where "priv" is NULL in the poll function.
> 
> Fixes: f92e1869d74e ("Add Mellanox BlueField Gigabit Ethernet driver")
> Signed-off-by: Asmaa Mnebhi <asmaa@nvidia.com>
> Reviewed-by: David Thompson <davthompson@nvidia.com>
> ---
> v2->v3:
> - Add the logic to clean the port to the remove() function
> v1-v2:
> - make mlxbf_gige_shutdown() the same as the mlxbf_gige_remove()
> 
>   .../mellanox/mlxbf_gige/mlxbf_gige_main.c     | 21 ++++++++-----------
>   .../mellanox/mlxbf_gige/mlxbf_gige_rx.c       |  3 +++
>   2 files changed, 12 insertions(+), 12 deletions(-)
> 
> diff --git a/drivers/net/ethernet/mellanox/mlxbf_gige/mlxbf_gige_main.c b/drivers/net/ethernet/mellanox/mlxbf_gige/mlxbf_gige_main.c
> index 694de9513b9f..74185b02daa0 100644
> --- a/drivers/net/ethernet/mellanox/mlxbf_gige/mlxbf_gige_main.c
> +++ b/drivers/net/ethernet/mellanox/mlxbf_gige/mlxbf_gige_main.c
> @@ -471,24 +471,21 @@ static int mlxbf_gige_probe(struct platform_device *pdev)
>   	return err;
>   }
>   
> -static int mlxbf_gige_remove(struct platform_device *pdev)
> +static void mlxbf_gige_remove(struct platform_device *pdev)
>   {
>   	struct mlxbf_gige *priv = platform_get_drvdata(pdev);
>   
> +	if (!priv)
> +		return;
> +
> +	writeq(0, priv->base + MLXBF_GIGE_INT_EN);
> +        mlxbf_gige_clean_port(priv);
>   	unregister_netdev(priv->netdev);
>   	phy_disconnect(priv->netdev->phydev);
>   	mlxbf_gige_mdio_remove(priv);
> -	platform_set_drvdata(pdev, NULL);
> -
> -	return 0;
> -}
> -
> -static void mlxbf_gige_shutdown(struct platform_device *pdev)
> -{
> -	struct mlxbf_gige *priv = platform_get_drvdata(pdev);
> -
>   	writeq(0, priv->base + MLXBF_GIGE_INT_EN);
>   	mlxbf_gige_clean_port(priv);
> +	platform_set_drvdata(pdev, NULL);
>   }
>   
>   static const struct acpi_device_id __maybe_unused mlxbf_gige_acpi_match[] = {
> @@ -499,8 +496,8 @@ MODULE_DEVICE_TABLE(acpi, mlxbf_gige_acpi_match);
>   
>   static struct platform_driver mlxbf_gige_driver = {
>   	.probe = mlxbf_gige_probe,
> -	.remove = mlxbf_gige_remove,
> -	.shutdown = mlxbf_gige_shutdown,
> +	.remove_new = mlxbf_gige_remove,
> +	.shutdown = mlxbf_gige_remove,
>   	.driver = {
>   		.name = KBUILD_MODNAME,
>   		.acpi_match_table = ACPI_PTR(mlxbf_gige_acpi_match),
> diff --git a/drivers/net/ethernet/mellanox/mlxbf_gige/mlxbf_gige_rx.c b/drivers/net/ethernet/mellanox/mlxbf_gige/mlxbf_gige_rx.c
> index 0d5a41a2ae01..cfb8fb957f0c 100644
> --- a/drivers/net/ethernet/mellanox/mlxbf_gige/mlxbf_gige_rx.c
> +++ b/drivers/net/ethernet/mellanox/mlxbf_gige/mlxbf_gige_rx.c
> @@ -298,6 +298,9 @@ int mlxbf_gige_poll(struct napi_struct *napi, int budget)
>   
>   	priv = container_of(napi, struct mlxbf_gige, napi);
>   
> +	if (!priv)
> +		return 0;

Do you still need this test even after you unregistered the network 
device in your shutdown routine?
-- 
Florian

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [PATCH v3 2/3] mlxbf_gige: Fix intermittent no ip issue
  2023-09-22 17:36 ` [PATCH v3 2/3] mlxbf_gige: Fix intermittent no ip issue Asmaa Mnebhi
@ 2023-09-22 17:47   ` Florian Fainelli
  2023-10-13 15:00     ` Asmaa Mnebhi
  0 siblings, 1 reply; 11+ messages in thread
From: Florian Fainelli @ 2023-09-22 17:47 UTC (permalink / raw)
  To: Asmaa Mnebhi, davem, edumazet, kuba, pabeni, olteanv; +Cc: netdev, davthompson



On 9/22/2023 10:36 AM, Asmaa Mnebhi wrote:
> Although the link is up, there is no ip assigned on a setup with high background
> traffic. Nothing is transmitted nor received.
> The RX error count keeps on increasing. After several minutes, the RX error count
> stagnates and the GigE interface finally gets an ip.
> 
> The issue is in the mlxbf_gige_rx_init function. As soon as the RX DMA is enabled,
> the RX CI reaches the max of 128, and it becomes equal to RX PI. RX CI doesn't decrease
> since the code hasn't ran phy_start yet.
> 
> The solution is to move the rx init after phy_start.
> 
> Fixes: f92e1869d74e ("Add Mellanox BlueField Gigabit Ethernet driver")
> Signed-off-by: Asmaa Mnebhi <asmaa@nvidia.com>
> Reviewed-by: David Thompson <davthompson@nvidia.com>

This seems fine, but your description of the problem still looks like 
there may be a more fundamental ordering issue when you enable your RX 
pipe here.

It seems to me like you should enable it from "inner" as in closest to 
the CPU/DMA subsystem towards "outer" which is the MAC and finally the PHY.

It should be fine to enable your RX DMA as long as you keep the MAC's RX 
disabled, and then you can enable your MAC's RX enable and later start 
the PHY.
-- 
Florian

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [PATCH v3 3/3] mlxbf_gige: Enable the GigE port in mlxbf_gige_open
  2023-09-22 17:36 ` [PATCH v3 3/3] mlxbf_gige: Enable the GigE port in mlxbf_gige_open Asmaa Mnebhi
@ 2023-09-22 17:48   ` Florian Fainelli
  0 siblings, 0 replies; 11+ messages in thread
From: Florian Fainelli @ 2023-09-22 17:48 UTC (permalink / raw)
  To: Asmaa Mnebhi, davem, edumazet, kuba, pabeni, olteanv; +Cc: netdev, davthompson



On 9/22/2023 10:36 AM, Asmaa Mnebhi wrote:
> At the moment, the GigE port is enabled in the mlxbf_gige_probe
> function. If the mlxbf_gige_open is not executed, this could cause
> pause frames to increase in the case where there is high backgroud
> traffic. This results in clogging the port.
> So move enabling the OOB port to mlxbf_gige_open.
> 
> Fixes: f92e1869d74e ("Add Mellanox BlueField Gigabit Ethernet driver")
> Signed-off-by: Asmaa Mnebhi <asmaa@nvidia.com>
> Reviewed-by: David Thompson <davthompson@nvidia.com>

Reviewed-by: Florian Fainelli <florian.fainelli@broadcom.com>
-- 
Florian

^ permalink raw reply	[flat|nested] 11+ messages in thread

* RE: [PATCH v3 1/3] mlxbf_gige: Fix kernel panic at shutdown
  2023-09-22 17:44   ` Florian Fainelli
@ 2023-10-13 14:58     ` Asmaa Mnebhi
  2024-01-05 16:05       ` Asmaa Mnebhi
  0 siblings, 1 reply; 11+ messages in thread
From: Asmaa Mnebhi @ 2023-10-13 14:58 UTC (permalink / raw)
  To: Florian Fainelli, davem@davemloft.net, edumazet@google.com,
	kuba@kernel.org, pabeni@redhat.com, olteanv@gmail.com
  Cc: netdev@vger.kernel.org, David Thompson

> >   	priv = container_of(napi, struct mlxbf_gige, napi);
> >
> > +	if (!priv)
> > +		return 0;
> 
> Do you still need this test even after you unregistered the network device in
> your shutdown routine?

always good to check variables but if want me to remove it I can.

^ permalink raw reply	[flat|nested] 11+ messages in thread

* RE: [PATCH v3 2/3] mlxbf_gige: Fix intermittent no ip issue
  2023-09-22 17:47   ` Florian Fainelli
@ 2023-10-13 15:00     ` Asmaa Mnebhi
  2023-10-16 17:45       ` Asmaa Mnebhi
  0 siblings, 1 reply; 11+ messages in thread
From: Asmaa Mnebhi @ 2023-10-13 15:00 UTC (permalink / raw)
  To: Florian Fainelli, davem@davemloft.net, edumazet@google.com,
	kuba@kernel.org, pabeni@redhat.com, olteanv@gmail.com
  Cc: netdev@vger.kernel.org, David Thompson



> > Although the link is up, there is no ip assigned on a setup with high
> > background traffic. Nothing is transmitted nor received.
> > The RX error count keeps on increasing. After several minutes, the RX
> > error count stagnates and the GigE interface finally gets an ip.
> >
> > The issue is in the mlxbf_gige_rx_init function. As soon as the RX DMA
> > is enabled, the RX CI reaches the max of 128, and it becomes equal to
> > RX PI. RX CI doesn't decrease since the code hasn't ran phy_start yet.
> >
> > The solution is to move the rx init after phy_start.
> >
> > Fixes: f92e1869d74e ("Add Mellanox BlueField Gigabit Ethernet driver")
> > Signed-off-by: Asmaa Mnebhi <asmaa@nvidia.com>
> > Reviewed-by: David Thompson <davthompson@nvidia.com>
> 
> This seems fine, but your description of the problem still looks like there may
> be a more fundamental ordering issue when you enable your RX pipe here.
> 
> It seems to me like you should enable it from "inner" as in closest to the
> CPU/DMA subsystem towards "outer" which is the MAC and finally the PHY.
> 
> It should be fine to enable your RX DMA as long as you keep the MAC's RX
> disabled, and then you can enable your MAC's RX enable and later start the
> PHY.

Thanks for your feedback Florian. I will take a look and address your comments shortly. Sorry for the delayed response, I was OOO.

^ permalink raw reply	[flat|nested] 11+ messages in thread

* RE: [PATCH v3 2/3] mlxbf_gige: Fix intermittent no ip issue
  2023-10-13 15:00     ` Asmaa Mnebhi
@ 2023-10-16 17:45       ` Asmaa Mnebhi
  0 siblings, 0 replies; 11+ messages in thread
From: Asmaa Mnebhi @ 2023-10-16 17:45 UTC (permalink / raw)
  To: Florian Fainelli, davem@davemloft.net, edumazet@google.com,
	kuba@kernel.org, pabeni@redhat.com, olteanv@gmail.com
  Cc: netdev@vger.kernel.org, David Thompson

> > > Although the link is up, there is no ip assigned on a setup with
> > > high background traffic. Nothing is transmitted nor received.
> > > The RX error count keeps on increasing. After several minutes, the
> > > RX error count stagnates and the GigE interface finally gets an ip.
> > >
> > > The issue is in the mlxbf_gige_rx_init function. As soon as the RX
> > > DMA is enabled, the RX CI reaches the max of 128, and it becomes
> > > equal to RX PI. RX CI doesn't decrease since the code hasn't ran phy_start
> yet.
> > >
> > > The solution is to move the rx init after phy_start.
> > >
> > > Fixes: f92e1869d74e ("Add Mellanox BlueField Gigabit Ethernet
> > > driver")
> > > Signed-off-by: Asmaa Mnebhi <asmaa@nvidia.com>
> > > Reviewed-by: David Thompson <davthompson@nvidia.com>
> >
> > This seems fine, but your description of the problem still looks like
> > there may be a more fundamental ordering issue when you enable your RX
> pipe here.
> >
> > It seems to me like you should enable it from "inner" as in closest to
> > the CPU/DMA subsystem towards "outer" which is the MAC and finally the
> PHY.
> >
> > It should be fine to enable your RX DMA as long as you keep the MAC's
> > RX disabled, and then you can enable your MAC's RX enable and later
> > start the PHY.
> 
> Thanks for your feedback Florian. I will take a look and address your
> comments shortly. Sorry for the delayed response, I was OOO.
Hi Florian,

We would like to maintain the code as is because we need to set the RX DMA after the MAC RX filters and the RX rings are setup (in mlxbf_gige_rx_init()).
The PHY start logic needs to be done  before that, otherwise, there is a chance we would encounter this bug where our MAC RX consumer index (CI) equals our MAC RX production index (PI) and that results in a MAC state that cannot be solved until we cleanup the MAC again. Note that this bug is difficult to reproduce. Our QA had to run the reboot test and have a setup with really high background traffic.

Thanks.
Asmaa

^ permalink raw reply	[flat|nested] 11+ messages in thread

* RE: [PATCH v3 1/3] mlxbf_gige: Fix kernel panic at shutdown
  2023-10-13 14:58     ` Asmaa Mnebhi
@ 2024-01-05 16:05       ` Asmaa Mnebhi
  0 siblings, 0 replies; 11+ messages in thread
From: Asmaa Mnebhi @ 2024-01-05 16:05 UTC (permalink / raw)
  To: Florian Fainelli, davem@davemloft.net, edumazet@google.com,
	kuba@kernel.org, pabeni@redhat.com, olteanv@gmail.com
  Cc: netdev@vger.kernel.org, David Thompson

> > >   	priv = container_of(napi, struct mlxbf_gige, napi);
> > >
> > > +	if (!priv)
> > > +		return 0;
> >
> > Do you still need this test even after you unregistered the network
> > device in your shutdown routine?
> 
> always good to check variables but if want me to remove it I can.

I will resubmit this patch with addressed comments after our QA has done more thorough testing. 

Thank you.
Asmaa

^ permalink raw reply	[flat|nested] 11+ messages in thread

end of thread, other threads:[~2024-01-05 16:05 UTC | newest]

Thread overview: 11+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2023-09-22 17:36 [PATCH v3 0/3] mlxbf_gige: Fix several bugs Asmaa Mnebhi
2023-09-22 17:36 ` [PATCH v3 1/3] mlxbf_gige: Fix kernel panic at shutdown Asmaa Mnebhi
2023-09-22 17:44   ` Florian Fainelli
2023-10-13 14:58     ` Asmaa Mnebhi
2024-01-05 16:05       ` Asmaa Mnebhi
2023-09-22 17:36 ` [PATCH v3 2/3] mlxbf_gige: Fix intermittent no ip issue Asmaa Mnebhi
2023-09-22 17:47   ` Florian Fainelli
2023-10-13 15:00     ` Asmaa Mnebhi
2023-10-16 17:45       ` Asmaa Mnebhi
2023-09-22 17:36 ` [PATCH v3 3/3] mlxbf_gige: Enable the GigE port in mlxbf_gige_open Asmaa Mnebhi
2023-09-22 17:48   ` Florian Fainelli

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).