From: Leon Romanovsky <leon@kernel.org>
To: Saeed Mahameed <saeed@kernel.org>
Cc: "David S. Miller" <davem@davemloft.net>,
Jakub Kicinski <kuba@kernel.org>, Paolo Abeni <pabeni@redhat.com>,
Eric Dumazet <edumazet@google.com>,
Saeed Mahameed <saeedm@nvidia.com>,
netdev@vger.kernel.org, Tariq Toukan <tariqt@nvidia.com>,
Shay Drory <shayd@nvidia.com>, Moshe Shemesh <moshe@nvidia.com>
Subject: Re: [net 04/12] net/mlx5: Avoid recovery in probe flows
Date: Thu, 29 Dec 2022 08:33:45 +0200
Message-ID: <Y600yfAjhObdtaJb@unreal>
In-Reply-To: <20221228194331.70419-5-saeed@kernel.org>
On Wed, Dec 28, 2022 at 11:43:23AM -0800, Saeed Mahameed wrote:
> From: Shay Drory <shayd@nvidia.com>
>
> Currently, recovery is done without considering whether the device is
> still in probe flow.
> This may lead to recovery before device have finished probed
> successfully. e.g.: while mlx5_init_one() is running. Recovery flow is
> using functionality that is loaded only by mlx5_init_one(), and there
> is no point in running recovery without mlx5_init_one() finished
> successfully.
>
> Fix it by waiting for probe flow to finish and checking whether the
> device is probed before trying to perform recovery.
>
> Fixes: 51d138c2610a ("net/mlx5: Fix health error state handling")
> Signed-off-by: Shay Drory <shayd@nvidia.com>
> Reviewed-by: Moshe Shemesh <moshe@nvidia.com>
> Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
> ---
> drivers/net/ethernet/mellanox/mlx5/core/health.c | 6 ++++++
> 1 file changed, 6 insertions(+)
>
> diff --git a/drivers/net/ethernet/mellanox/mlx5/core/health.c b/drivers/net/ethernet/mellanox/mlx5/core/health.c
> index 86ed87d704f7..96417c5feed7 100644
> --- a/drivers/net/ethernet/mellanox/mlx5/core/health.c
> +++ b/drivers/net/ethernet/mellanox/mlx5/core/health.c
> @@ -674,6 +674,12 @@ static void mlx5_fw_fatal_reporter_err_work(struct work_struct *work)
>  	dev = container_of(priv, struct mlx5_core_dev, priv);
>  	devlink = priv_to_devlink(dev);
>
> +	mutex_lock(&dev->intf_state_mutex);
> +	if (test_bit(MLX5_DROP_NEW_HEALTH_WORK, &health->flags)) {
> +		mlx5_core_err(dev, "health works are not permitted at this stage\n");
> +		return;
> +	}
This bit is already checked when health recovery is queued in mlx5_trigger_health_work().
void mlx5_trigger_health_work(struct mlx5_core_dev *dev)
{
	struct mlx5_core_health *health = &dev->priv.health;
	unsigned long flags;

	spin_lock_irqsave(&health->wq_lock, flags);
	if (!test_bit(MLX5_DROP_NEW_HEALTH_WORK, &health->flags))
		queue_work(health->wq, &health->fatal_report_work);
	else
		mlx5_core_err(dev, "new health works are not permitted at this stage\n");
	spin_unlock_irqrestore(&health->wq_lock, flags);
}
You probably need to move this check up into the poll_health() routine and
change intf_state_mutex to a spinlock.
Another option is to start health polling only once init is complete.
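To illustrate the first option, here is a minimal userspace sketch, not mlx5 code. Every name in it (struct health_model, model_poll_health(), model_probe_done()) is hypothetical, and a pthread mutex stands in for the kernel spinlock. The idea is that the poller takes a lock that is safe in its context and checks both the "drop new work" flag and a "probe finished" flag before it queues recovery, so the work handler never runs against a half-initialized device.

```c
#include <pthread.h>
#include <stdbool.h>

/* Hypothetical userspace model; none of these names are mlx5 APIs. */
struct health_model {
	pthread_mutex_t lock;	/* stands in for a kernel spinlock */
	bool drop_new_work;	/* models MLX5_DROP_NEW_HEALTH_WORK */
	bool probed;		/* models "probe finished successfully" */
	int queued;		/* number of recovery works queued so far */
};

static void health_model_init(struct health_model *h)
{
	pthread_mutex_init(&h->lock, NULL);
	h->drop_new_work = false;
	h->probed = false;
	h->queued = 0;
}

/* Called at the end of a successful probe. */
static void model_probe_done(struct health_model *h)
{
	pthread_mutex_lock(&h->lock);
	h->probed = true;
	pthread_mutex_unlock(&h->lock);
}

/*
 * Called by the poller. Recovery is queued only if a fatal error was
 * seen, new work is still permitted, and the device is fully probed.
 * Returns true if recovery work was queued.
 */
static bool model_poll_health(struct health_model *h, bool fatal_seen)
{
	bool queued = false;

	if (!fatal_seen)
		return false;

	pthread_mutex_lock(&h->lock);
	if (!h->drop_new_work && h->probed) {
		h->queued++;
		queued = true;
	}
	pthread_mutex_unlock(&h->lock);

	return queued;
}
```

In this model, a fatal error reported before model_probe_done() is ignored, while the same error reported afterwards queues exactly one recovery work. The second option (starting the poller only after init completes) would make the probed check unnecessary.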
Thanks
> +	mutex_unlock(&dev->intf_state_mutex);
>  	enter_error_state(dev, false);
>  	if (IS_ERR_OR_NULL(health->fw_fatal_reporter)) {
>  		devl_lock(devlink);
> --
> 2.38.1
>