All of lore.kernel.org
 help / color / mirror / Atom feed
From: Vadim Fedorenko <vadfed@meta.com>
To: Gal Pressman <gal@nvidia.com>
Cc: "netdev@vger.kernel.org" <netdev@vger.kernel.org>,
	Aya Levin <ayal@nvidia.com>, Saeed Mahameed <saeedm@nvidia.com>,
	Jakub Kicinski <kuba@kernel.org>
Subject: Re: [PATCH net 1/2] mlx5: fix possible ptp queue fifo overflow
Date: Wed, 25 Jan 2023 14:42:08 +0000	[thread overview]
Message-ID: <45d08ca1-e156-c482-777d-df2bc48dffed@meta.com> (raw)
In-Reply-To: <46b57864-5a1a-7707-442c-b53e14d3a6b8@nvidia.com>

On 24/01/2023 14:39, Gal Pressman wrote:
> On 23/01/2023 19:24, Vadim Fedorenko wrote:
>>   > Hi Vadim,
>>   >
>>
>>   >> +		ptpsq->cq_stats->ooo_cqe++;
>>   >> +		return false;
>>   >> +	}
>>   >
>>   >I honestly don't understand how this could happen, can you please
>>   >provide more information about your issue? Did you actually witness ooo
>>   >completions or is it a theoretical issue?
>>   >We know ptp CQEs can be dropped in some rare cases (that's the reason we
>>   >implemented this resync flow), but completions should always arrive
>>   >in-order.
>>
>> I was also surprised to see OOO completions but it's the reality. With a
>> little bit of debug I found this issue:
> 
> Where are these prints added? I assume inside the 'if
> (mlx5e_ptp_ts_cqe_drop())' statement?
> 
>>
>> [65578.231710] FIFO drop found, skb_cc = 141, skb_id = 140
> 
> Is this the first drop? In order for skb_cc to reach 141 it means it has
> already seen skb_id 140 (and consumed it). But here we see skb_id 140
> again? Is it a duplicate completion? Or is it a full wraparound?
> I'm now realising that the naming of the variables is very confusing,
> skb_cc isn't really the consumer counter, it is the cosumer index
> (masked value).
> 
>> [65578.293358] FIFO drop found, skb_cc = 141, skb_id = 143
> 
> How come we see the same skb_cc twice? When a drop is found we increment it.
> 
>> [65578.301240] FIFO drop found, skb_cc = 145, skb_id = 142
>> [65578.365277] FIFO drop found, skb_cc = 173, skb_id = 141
>> [65578.426681] FIFO drop found, skb_cc = 173, skb_id = 145
>> [65578.488089] FIFO drop found, skb_cc = 173, skb_id = 146
>> [65578.549489] FIFO drop found, skb_cc = 173, skb_id = 147
>> [65578.610897] FIFO drop found, skb_cc = 173, skb_id = 148
>> [65578.672301] FIFO drop found, skb_cc = 173, skb_id = 149
> 
> Confusing :S, did you manage to make sense out of these prints? We need
> prints when !dropped as well, otherwise it's impossible to tell when a
> wraparound occurred.
> 
> Anyway, I'd like to zoom out for a second, the whole fifo was designed
> under the assumption that completions are in-order (this is a guarantee
> for all SQs, not just ptp ones!), this fix seems more of a bandage that
> potentially hides a more severe issue.
> 
>>
>> It really shows that CQE are coming OOO sometimes.
> 
> Can we reproduce it somehow?
> Can you please try to update your firmware version? I'm quite confident
> that this issue is fixed already.
> 
I added some debug prints on top of the patches to show skb_cc and 
skb_id for every packet that is found by mlx5e_ptp_ts_cqe_drop() and 10 
packets after. The output is in https://dpaste.org/rMybA/raw

It clearly shows that some reordering is happening in CQE.
I'm open to add more debug info if you need it.

  parent reply	other threads:[~2023-01-25 14:44 UTC|newest]

Thread overview: 16+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2023-01-22 16:16 [PATCH net 0/2] mlx5: bugfixes for ptp fifo queue Vadim Fedorenko
2023-01-22 16:16 ` [PATCH net 1/2] mlx5: fix possible ptp queue fifo overflow Vadim Fedorenko
2023-01-23  7:20   ` Leon Romanovsky
2023-01-23 14:19     ` Vadim Fedorenko
2023-01-23 12:32   ` Gal Pressman
2023-01-23 23:49     ` Vadim Fedorenko
2023-01-23 17:24   ` Vadim Fedorenko
2023-01-24 14:39     ` Gal Pressman
2023-01-24 15:09       ` Vadim Fedorenko
2023-01-25 14:42       ` Vadim Fedorenko [this message]
2023-01-25 20:33         ` Saeed Mahameed
2023-01-25 21:24           ` Vadim Fedorenko
2023-01-22 16:16 ` [PATCH net 2/2] mlx5: fix skb leak while fifo resync Vadim Fedorenko
2023-01-23 12:38   ` Gal Pressman
2023-01-23 16:52     ` Vadim Fedorenko
2023-01-24  2:03     ` Jakub Kicinski

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=45d08ca1-e156-c482-777d-df2bc48dffed@meta.com \
    --to=vadfed@meta.com \
    --cc=ayal@nvidia.com \
    --cc=gal@nvidia.com \
    --cc=kuba@kernel.org \
    --cc=netdev@vger.kernel.org \
    --cc=saeedm@nvidia.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.