Re: [pull request][net 00/10] mlx5 fixes 2023-02-07

netdev.vger.kernel.org archive mirror
 help / color / mirror / Atom feed

From: Rahul Rameshbabu <rrameshbabu@nvidia.com>
To: Saeed Mahameed <saeedm@nvidia.com>
Cc: Vadim Fedorenko <vadfed@meta.com>,
	Saeed Mahameed <saeed@kernel.org>,
	Jakub Kicinski <kuba@kernel.org>,
	"netdev@vger.kernel.org" <netdev@vger.kernel.org>,
	Tariq Toukan <tariqt@nvidia.com>
Subject: Re: [pull request][net 00/10] mlx5 fixes 2023-02-07
Date: Wed, 08 Feb 2023 13:16:12 -0800	[thread overview]
Message-ID: <871qmzoo2r.fsf@nvidia.com> (raw)
In-Reply-To: <DM5PR12MB134054EC92BC13E36B6C5711B3D89@DM5PR12MB1340.namprd12.prod.outlook.com> (Saeed Mahameed's message of "Wed, 8 Feb 2023 12:52:55 -0800")

On Wed, 08 Feb, 2023 12:52:55 -0800 Saeed Mahameed <saeedm@nvidia.com> wrote:
> Hi Vadim,
>
> We have some new findings internally and Rahul is testing your patches,
> he found some issues where the patches don't handle the case where only drops are happening, meanings no OOO.
>
> Rahul can share more details, he's still working on this and I believe we will have a fully detailed follow-up by the end of the week.

One thing I noticed was the conditional in mlx5e_ptp_ts_cqe_ooo in v5
does handle OOO but considers the monotomically increasing case of 1,3,4
for example to be OOO as well (a resync does not occur when I tested
this case).

A simple patch I made to verify this.

diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en/ptp.c b/drivers/net/ethernet/mellanox/mlx5/core/en/ptp.c
index ae75e230170b..dfa5c53bd0d5 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/en/ptp.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en/ptp.c
@@ -125,6 +125,8 @@ static void mlx5e_ptp_handle_ts_cqe(struct mlx5e_ptpsq *ptpsq,
 	struct sk_buff *skb;
 	ktime_t hwtstamp;
 
+	pr_info("wqe_counter value: %u\n", skb_id);
+
 	if (unlikely(MLX5E_RX_ERR_CQE(cqe))) {
 		skb = mlx5e_skb_fifo_pop(&ptpsq->skb_fifo);
 		ptpsq->cq_stats->err_cqe++;
@@ -133,6 +135,7 @@ static void mlx5e_ptp_handle_ts_cqe(struct mlx5e_ptpsq *ptpsq,
 
 	if (mlx5e_ptp_ts_cqe_drop(ptpsq, skb_cc, skb_id)) {
 		if (mlx5e_ptp_ts_cqe_ooo(ptpsq, skb_id)) {
+			pr_info("Marked ooo wqe_counter: %u\n", skb_id);
 			/* already handled by a previous resync */
 			ptpsq->cq_stats->ooo_cqe_drop++;
 			return;
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en_tx.c b/drivers/net/ethernet/mellanox/mlx5/core/en_tx.c
index f7897ddb29c5..8582f0535e21 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/en_tx.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en_tx.c
@@ -646,7 +646,7 @@ static void mlx5e_cqe_ts_id_eseg(struct mlx5e_ptpsq *ptpsq, struct sk_buff *skb,
 				 struct mlx5_wqe_eth_seg *eseg)
 {
 	if (ptpsq->ts_cqe_ctr_mask && unlikely(skb_shinfo(skb)->tx_flags & SKBTX_HW_TSTAMP))
-		eseg->flow_table_metadata = cpu_to_be32(ptpsq->skb_fifo_pc &
+		eseg->flow_table_metadata = cpu_to_be32((ptpsq->skb_fifo_pc * 2) &
 							ptpsq->ts_cqe_ctr_mask);
 }
 
Basically, I multiply the wqe_counter written in the WQE by two. The
thing here is we have a situation where we have "lost" a CQE with
wqe_counter index of one, but the patch treats that as OOO, which
basically disables our normal resiliency path for resyncs on drops. At
that point, the patch could just remove the resync logic altogether when
a drop is detected.

What I noticed then was that the case of 0,2 was marked as OOO even
though out of order would be something like 0,2,1.

  [Feb 8 02:40] wqe_counter value: 0
  [ +24.199404] wqe_counter value: 2
  [  +0.001041] Marked ooo wqe_counter: 2

I acknowledge the OOO issue but not sure the patch as is, correctly
solves the issue.

>
> Sorry for the late update but these new findings are only from yesterday.
>
> Thanks,
> Saeed.
>
>  
> -------------------------------------------------------------------------------------------------------------------------
> From: Vadim Fedorenko <vadfed@meta.com>
> Sent: Wednesday, February 8, 2023 4:40 AM
> To: Saeed Mahameed <saeed@kernel.org>; Jakub Kicinski <kuba@kernel.org>
> Cc: Saeed Mahameed <saeedm@nvidia.com>; netdev@vger.kernel.org <netdev@vger.kernel.org>; Tariq Toukan <tariqt@nvidia.com>
> Subject: Re: [pull request][net 00/10] mlx5 fixes 2023-02-07 
>  
> On 08/02/2023 03:02, Saeed Mahameed wrote:
>> From: Saeed Mahameed <saeedm@nvidia.com>
>> 
>> This series provides bug fixes to mlx5 driver.
>> Please pull and let me know if there is any problem.
>> 
> Still no patches for PTP queue? That's a bit wierd.
> Do you think that they are not ready to be in -net?

-- Rahul Rameshbabu

next prev parent reply	other threads:[~2023-02-08 21:16 UTC|newest]

Thread overview: 18+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2023-02-08  3:02 [pull request][net 00/10] mlx5 fixes 2023-02-07 Saeed Mahameed
2023-02-08  3:02 ` [net 01/10] net/mlx5e: Update rx ring hw mtu upon each rx-fcs flag change Saeed Mahameed
2023-02-09  5:00   ` patchwork-bot+netdevbpf
2023-02-08  3:02 ` [net 02/10] net/mlx5: DR, Fix potential race in dr_rule_create_rule_nic Saeed Mahameed
2023-02-08  3:02 ` [net 03/10] net/mlx5: Bridge, fix ageing of peer FDB entries Saeed Mahameed
2023-02-08  3:02 ` [net 04/10] net/mlx5e: Fix crash unsetting rx-vlan-filter in switchdev mode Saeed Mahameed
2023-02-08  3:02 ` [net 05/10] net/mlx5e: IPoIB, Show unknown speed instead of error Saeed Mahameed
2023-02-08  3:02 ` [net 06/10] net/mlx5: Store page counters in a single array Saeed Mahameed
2023-02-08  3:02 ` [net 07/10] net/mlx5: Expose SF firmware pages counter Saeed Mahameed
2023-02-08  3:03 ` [net 08/10] net/mlx5: fw_tracer, Clear load bit when freeing string DBs buffers Saeed Mahameed
2023-02-08  3:03 ` [net 09/10] net/mlx5: fw_tracer, Zero consumer index when reloading the tracer Saeed Mahameed
2023-02-08  3:03 ` [net 10/10] net/mlx5: Serialize module cleanup with reload and remove Saeed Mahameed
2023-02-08 12:40 ` [pull request][net 00/10] mlx5 fixes 2023-02-07 Vadim Fedorenko
     [not found]   ` <DM5PR12MB134054EC92BC13E36B6C5711B3D89@DM5PR12MB1340.namprd12.prod.outlook.com>
2023-02-08 21:16     ` Rahul Rameshbabu [this message]
2023-02-08 21:36       ` Vadim Fedorenko
2023-02-08 23:52         ` Rahul Rameshbabu
2023-02-09  1:10           ` Vadim Fedorenko
2023-02-15  3:19             ` Rahul Rameshbabu

find likely ancestor, descendant, or conflicting patches for this message:
( dfblob:ae75e230170 dfblob:dfa5c53bd0d dfblob:f7897ddb29c
dfblob:8582f0535e2 )
 OR (
bs:"Re: [pull request][net 00/10] mlx5 fixes 2023-02-07" )
	(help)

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=871qmzoo2r.fsf@nvidia.com \
    --to=rrameshbabu@nvidia.com \
    --cc=kuba@kernel.org \
    --cc=netdev@vger.kernel.org \
    --cc=saeed@kernel.org \
    --cc=saeedm@nvidia.com \
    --cc=tariqt@nvidia.com \
    --cc=vadfed@meta.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

Be sure your reply has a Subject: header at the top and a blank line before the message body.

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).