* [PATCH 0/1] net/mlx5: query_mcia_reg fail logging at debug severity
@ 2025-10-28 19:40 Matthew W Carlis
2025-10-28 19:40 ` [PATCH 1/1] " Matthew W Carlis
2025-10-28 20:27 ` [PATCH 0/1] " Gal Pressman
0 siblings, 2 replies; 13+ messages in thread
From: Matthew W Carlis @ 2025-10-28 19:40 UTC (permalink / raw)
To: netdev; +Cc: saeedm, tariqt, mbloch, ashishk, msaggi, adailey,
Matthew W Carlis
The commit message pretty much has most of my thoughts on this one. We
see this message generated all the time because users are trying to
determine if there is an SFP module installed in a port or when trying
to get module info from a port that doesn't have an SFP module. This
is not really an error at all & therefore the message is undesirable.
Downgrading the log severity for this message to dbg allows users to
control it via dyndbg & therefore "up to the user to decide" if they
care about it.
Matthew W Carlis (1):
net/mlx5: query_mcia_reg fail logging at debug severity
drivers/net/ethernet/mellanox/mlx5/core/port.c | 4 ++--
1 file changed, 2 insertions(+), 2 deletions(-)
--
2.46.0
^ permalink raw reply [flat|nested] 13+ messages in thread
* [PATCH 1/1] net/mlx5: query_mcia_reg fail logging at debug severity
2025-10-28 19:40 [PATCH 0/1] net/mlx5: query_mcia_reg fail logging at debug severity Matthew W Carlis
@ 2025-10-28 19:40 ` Matthew W Carlis
2025-10-28 20:28 ` Gal Pressman
2025-10-28 22:43 ` Jakub Kicinski
2025-10-28 20:27 ` [PATCH 0/1] " Gal Pressman
1 sibling, 2 replies; 13+ messages in thread
From: Matthew W Carlis @ 2025-10-28 19:40 UTC (permalink / raw)
To: netdev; +Cc: saeedm, tariqt, mbloch, ashishk, msaggi, adailey,
Matthew W Carlis
Whenever a user or automation runs ethtool -m <eth> or an equivalent
to mlx5 device & there is not any SFP module in that device the
kernel log is spammed with ""query_mcia_reg failed: status:" which
is really not that informative to the user who already knows that
their command failed. Since the severity is logged at error severity
the log message cannot be disabled via dyndbg etc...
Signed-off-by: Matthew W Carlis <mattc@purestorage.com>
100.0% drivers/net/ethernet/mellanox/mlx5/core/
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/port.c b/drivers/net/ethernet/mellanox/mlx5/core/port.c
index aa9f2b0a77d3..e1c93a96e479 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/port.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/port.c
@@ -310,7 +310,7 @@ static int mlx5_query_module_id(struct mlx5_core_dev *dev, int module_num,
status = MLX5_GET(mcia_reg, out, status);
if (status) {
- mlx5_core_err(dev, "query_mcia_reg failed: status: 0x%x\n",
+ mlx5_core_dbg(dev, "query_mcia_reg failed: status: 0x%x\n",
status);
return -EIO;
}
@@ -394,7 +394,7 @@ static int mlx5_query_mcia(struct mlx5_core_dev *dev,
status = MLX5_GET(mcia_reg, out, status);
if (status) {
- mlx5_core_err(dev, "query_mcia_reg failed: status: 0x%x\n",
+ mlx5_core_dbg(dev, "query_mcia_reg failed: status: 0x%x\n",
status);
return -EIO;
}
--
2.46.0
^ permalink raw reply related [flat|nested] 13+ messages in thread
* Re: [PATCH 0/1] net/mlx5: query_mcia_reg fail logging at debug severity
2025-10-28 19:40 [PATCH 0/1] net/mlx5: query_mcia_reg fail logging at debug severity Matthew W Carlis
2025-10-28 19:40 ` [PATCH 1/1] " Matthew W Carlis
@ 2025-10-28 20:27 ` Gal Pressman
1 sibling, 0 replies; 13+ messages in thread
From: Gal Pressman @ 2025-10-28 20:27 UTC (permalink / raw)
To: Matthew W Carlis, netdev; +Cc: saeedm, tariqt, mbloch, ashishk, msaggi, adailey
On 28/10/2025 21:40, Matthew W Carlis wrote:
> The commit message pretty much has most of my thoughts on this one. We
> see this message generated all the time because users are trying to
> determine if there is an SFP module installed in a port or when trying
> to get module info from a port that doesn't have an SFP module. This
> is not really an error at all
It is an error, as evident by the fact that you only changed the log
level, not the error return value.
We cannot query the module EEPROM if there's no module to query.
> & therefore the message is undesirable.
Arguable.
This is a privileged command, and error messages are intended for
errors. It's important to have the value of the status field in the logs
in case of an error.
^ permalink raw reply [flat|nested] 13+ messages in thread
* Re: [PATCH 1/1] net/mlx5: query_mcia_reg fail logging at debug severity
2025-10-28 19:40 ` [PATCH 1/1] " Matthew W Carlis
@ 2025-10-28 20:28 ` Gal Pressman
2025-10-28 21:48 ` Matthew W Carlis
2025-10-28 22:43 ` Jakub Kicinski
1 sibling, 1 reply; 13+ messages in thread
From: Gal Pressman @ 2025-10-28 20:28 UTC (permalink / raw)
To: Matthew W Carlis, netdev; +Cc: saeedm, tariqt, mbloch, ashishk, msaggi, adailey
On 28/10/2025 21:40, Matthew W Carlis wrote:
> Whenever a user or automation runs ethtool -m <eth> or an equivalent
> to mlx5 device & there is not any SFP module in that device the
> kernel log is spammed with ""query_mcia_reg failed: status:" which
> is really not that informative to the user who already knows that
> their command failed.
Assuming the user knows why the command failed is a big assumption, most
users don't. These status errors are indicative of more than one type of
error.
And if he knows, I would expect him to not run the command again?
BTW, we have a bug that returns success in case of an error in module
EEPROM query so the user doesn't even know that the command failed other
than seeing the error message.
We'll push a fix for that in the next few days.
Since the severity is logged at error severity
> the log message cannot be disabled via dyndbg etc...
>
> Signed-off-by: Matthew W Carlis <mattc@purestorage.com>
>
> 100.0% drivers/net/ethernet/mellanox/mlx5/core/
> diff --git a/drivers/net/ethernet/mellanox/mlx5/core/port.c b/drivers/net/ethernet/mellanox/mlx5/core/port.c
> index aa9f2b0a77d3..e1c93a96e479 100644
> --- a/drivers/net/ethernet/mellanox/mlx5/core/port.c
> +++ b/drivers/net/ethernet/mellanox/mlx5/core/port.c
> @@ -310,7 +310,7 @@ static int mlx5_query_module_id(struct mlx5_core_dev *dev, int module_num,
>
> status = MLX5_GET(mcia_reg, out, status);
> if (status) {
> - mlx5_core_err(dev, "query_mcia_reg failed: status: 0x%x\n",
> + mlx5_core_dbg(dev, "query_mcia_reg failed: status: 0x%x\n",
> status);
> return -EIO;
> }
> @@ -394,7 +394,7 @@ static int mlx5_query_mcia(struct mlx5_core_dev *dev,
>
> status = MLX5_GET(mcia_reg, out, status);
> if (status) {
> - mlx5_core_err(dev, "query_mcia_reg failed: status: 0x%x\n",
> + mlx5_core_dbg(dev, "query_mcia_reg failed: status: 0x%x\n",
> status);
> return -EIO;
> }
Both of these functions are called from mlx5e_get_module_eeprom() which
has its own error print regardless, so this doesn't really prevent the
"spam".
^ permalink raw reply [flat|nested] 13+ messages in thread
* Re: [PATCH 1/1] net/mlx5: query_mcia_reg fail logging at debug severity
2025-10-28 20:28 ` Gal Pressman
@ 2025-10-28 21:48 ` Matthew W Carlis
2025-10-29 7:22 ` Gal Pressman
0 siblings, 1 reply; 13+ messages in thread
From: Matthew W Carlis @ 2025-10-28 21:48 UTC (permalink / raw)
To: gal; +Cc: adailey, ashishk, mattc, mbloch, msaggi, netdev, saeedm, tariqt
Tue, 28 Oct 2025 22:27:39, Gal Pressman wrote:
> And if he knows, I would expect him to not run the command again?
Sometimes a user is a script or an inventory automation tool.
> It is an error, as evident by the fact that you only changed the log
> level, not the error return value.
I don't know of any strict convention in terms of when error return codes
should have associated log messages. I wonder if there is something more targeted
that could be done. For example, if there is a "physical presence" mechanism
& a module is not present simply skip the logging.
Cheers!
- Matt
^ permalink raw reply [flat|nested] 13+ messages in thread
* Re: [PATCH 1/1] net/mlx5: query_mcia_reg fail logging at debug severity
2025-10-28 19:40 ` [PATCH 1/1] " Matthew W Carlis
2025-10-28 20:28 ` Gal Pressman
@ 2025-10-28 22:43 ` Jakub Kicinski
2025-10-29 7:33 ` Gal Pressman
1 sibling, 1 reply; 13+ messages in thread
From: Jakub Kicinski @ 2025-10-28 22:43 UTC (permalink / raw)
To: Matthew W Carlis; +Cc: netdev, saeedm, tariqt, mbloch, ashishk, msaggi, adailey
On Tue, 28 Oct 2025 13:40:11 -0600 Matthew W Carlis wrote:
> Whenever a user or automation runs ethtool -m <eth> or an equivalent
> to mlx5 device & there is not any SFP module in that device the
> kernel log is spammed with ""query_mcia_reg failed: status:" which
> is really not that informative to the user who already knows that
> their command failed. Since the severity is logged at error severity
> the log message cannot be disabled via dyndbg etc...
+1 from me FWIW. I wonder if we're hitting this on the same class
of systems but recently this started hitting at Meta as well.
Millions of log entries a day because some ports don't have an SFP
plugged in :|
^ permalink raw reply [flat|nested] 13+ messages in thread
* Re: [PATCH 1/1] net/mlx5: query_mcia_reg fail logging at debug severity
2025-10-28 21:48 ` Matthew W Carlis
@ 2025-10-29 7:22 ` Gal Pressman
0 siblings, 0 replies; 13+ messages in thread
From: Gal Pressman @ 2025-10-29 7:22 UTC (permalink / raw)
To: Matthew W Carlis; +Cc: adailey, ashishk, mbloch, msaggi, netdev, saeedm, tariqt
On 28/10/2025 23:48, Matthew W Carlis wrote:
> Tue, 28 Oct 2025 22:27:39, Gal Pressman wrote:
>> And if he knows, I would expect him to not run the command again?
>
> Sometimes a user is a script or an inventory automation tool.
>
>> It is an error, as evident by the fact that you only changed the log
>> level, not the error return value.
>
> I don't know of any strict convention in terms of when error return codes
> should have associated log messages. I wonder if there is something more targeted
> that could be done. For example, if there is a "physical presence" mechanism
> & a module is not present simply skip the logging.
Maybe the problem is that because we returned zero on error the script
would wrongly continue to try the queries?
Does it stop when an error is detected?
^ permalink raw reply [flat|nested] 13+ messages in thread
* Re: [PATCH 1/1] net/mlx5: query_mcia_reg fail logging at debug severity
2025-10-28 22:43 ` Jakub Kicinski
@ 2025-10-29 7:33 ` Gal Pressman
2025-10-29 16:49 ` Matthew W Carlis
0 siblings, 1 reply; 13+ messages in thread
From: Gal Pressman @ 2025-10-29 7:33 UTC (permalink / raw)
To: Jakub Kicinski, Matthew W Carlis
Cc: netdev, saeedm, tariqt, mbloch, ashishk, msaggi, adailey
On 29/10/2025 0:43, Jakub Kicinski wrote:
> On Tue, 28 Oct 2025 13:40:11 -0600 Matthew W Carlis wrote:
>> Whenever a user or automation runs ethtool -m <eth> or an equivalent
>> to mlx5 device & there is not any SFP module in that device the
>> kernel log is spammed with ""query_mcia_reg failed: status:" which
>> is really not that informative to the user who already knows that
>> their command failed. Since the severity is logged at error severity
>> the log message cannot be disabled via dyndbg etc...
>
> +1 from me FWIW. I wonder if we're hitting this on the same class
> of systems but recently this started hitting at Meta as well.
> Millions of log entries a day because some ports don't have an SFP
> plugged in :|
>
Allow me to split the discussion to two questions:
1. Is this an error?
2. Should it be logged?
Do we agree that the answer to #1 is yes?
For #2, I think it should, but we can probably improve the situation
with extack instead of a print.
^ permalink raw reply [flat|nested] 13+ messages in thread
* Re: [PATCH 1/1] net/mlx5: query_mcia_reg fail logging at debug severity
2025-10-29 7:33 ` Gal Pressman
@ 2025-10-29 16:49 ` Matthew W Carlis
2025-10-29 23:33 ` Jakub Kicinski
0 siblings, 1 reply; 13+ messages in thread
From: Matthew W Carlis @ 2025-10-29 16:49 UTC (permalink / raw)
To: gal; +Cc: adailey, ashishk, kuba, mattc, mbloch, msaggi, netdev, saeedm,
tariqt
On Wed, 29 Oct 2025, Gal Pressman wrote:
> Allow me to split the discussion to two questions:
> 1. Is this an error?
> 2. Should it be logged?
>
> Do we agree that the answer to #1 is yes?
>
> For #2, I think it should, but we can probably improve the situation
> with extack instead of a print.
I think its an 'expected error' if the module is not present. I agree.
For 2 I think if the user runs "ethtool -m" on a port with no module,
they received an error message stating something along the lines of
"module not present" and the kernel didn't have any log messages about
it that would be near to 'the best' solution. I specifically suffer pain
from the case of an unused port (no module installed).
I took a peek at the SFP+ and QSFP connector specifications. It looks
like they do have pins reserved for module presence so I'm wondering if
the firmware exposes a mechanism for the mlx driver to check the module
presence. This would clear up essentially every issue with this particular
case because I have curious people seeing the existing message & then
questioning me about whether is the cause of some other problem
(much of the time not even a networking problem) when in the port is
simply 'unused'.
Cheers!
-Matt
^ permalink raw reply [flat|nested] 13+ messages in thread
* Re: [PATCH 1/1] net/mlx5: query_mcia_reg fail logging at debug severity
2025-10-29 16:49 ` Matthew W Carlis
@ 2025-10-29 23:33 ` Jakub Kicinski
2025-10-30 14:48 ` Gal Pressman
0 siblings, 1 reply; 13+ messages in thread
From: Jakub Kicinski @ 2025-10-29 23:33 UTC (permalink / raw)
To: Matthew W Carlis
Cc: gal, adailey, ashishk, mbloch, msaggi, netdev, saeedm, tariqt
On Wed, 29 Oct 2025 10:49:24 -0600 Matthew W Carlis wrote:
> On Wed, 29 Oct 2025, Gal Pressman wrote:
> > Allow me to split the discussion to two questions:
> > 1. Is this an error?
> > 2. Should it be logged?
> >
> > Do we agree that the answer to #1 is yes?
> >
> > For #2, I think it should, but we can probably improve the situation
> > with extack instead of a print.
>
> I think its an 'expected error' if the module is not present. I agree.
>
> For 2 I think if the user runs "ethtool -m" on a port with no module,
> they received an error message stating something along the lines of
> "module not present" and the kernel didn't have any log messages about
> it that would be near to 'the best' solution.
I assume you mean error message specifically from the CLI or whatever
API the user is exercising? If so I agree.
The system logs are for fatal / unexpected conditions. AFAIU returning
-EIO is the _expected_ way to find out that module is not plugged in.
If there's a better API I suppose we can make ethtool call it first
to avoid the error.
^ permalink raw reply [flat|nested] 13+ messages in thread
* Re: [PATCH 1/1] net/mlx5: query_mcia_reg fail logging at debug severity
2025-10-29 23:33 ` Jakub Kicinski
@ 2025-10-30 14:48 ` Gal Pressman
2025-10-30 20:09 ` Matthew W Carlis
0 siblings, 1 reply; 13+ messages in thread
From: Gal Pressman @ 2025-10-30 14:48 UTC (permalink / raw)
To: Jakub Kicinski, Matthew W Carlis
Cc: adailey, ashishk, mbloch, msaggi, netdev, saeedm, tariqt
On 30/10/2025 1:33, Jakub Kicinski wrote:
> On Wed, 29 Oct 2025 10:49:24 -0600 Matthew W Carlis wrote:
>> On Wed, 29 Oct 2025, Gal Pressman wrote:
>>> Allow me to split the discussion to two questions:
>>> 1. Is this an error?
>>> 2. Should it be logged?
>>>
>>> Do we agree that the answer to #1 is yes?
>>>
>>> For #2, I think it should, but we can probably improve the situation
>>> with extack instead of a print.
>>
>> I think its an 'expected error' if the module is not present. I agree.
>>
>> For 2 I think if the user runs "ethtool -m" on a port with no module,
>> they received an error message stating something along the lines of
>> "module not present" and the kernel didn't have any log messages about
>> it that would be near to 'the best' solution.
>
> I assume you mean error message specifically from the CLI or whatever
> API the user is exercising? If so I agree.
>
> The system logs are for fatal / unexpected conditions. AFAIU returning
> -EIO is the _expected_ way to find out that module is not plugged in.
> If there's a better API I suppose we can make ethtool call it first
> to avoid the error.
There are other cases which will return -EIO, but do not necessarily
mean that the module is disconnected: unresponsive module, i2c error,
disabled module, unsupported module, etc.
We cannot differentiate between them without the status print.
Changing the log level makes things more difficult, as most production
servers will not enable the debug print, and the logs would be harder to
analyze.
I asked before, maybe these automatic tools that keep querying the
module continue to do so because of the success return code, and that
will be resolved soon?
^ permalink raw reply [flat|nested] 13+ messages in thread
* Re: [PATCH 1/1] net/mlx5: query_mcia_reg fail logging at debug severity
2025-10-30 14:48 ` Gal Pressman
@ 2025-10-30 20:09 ` Matthew W Carlis
2025-11-02 13:09 ` Gal Pressman
0 siblings, 1 reply; 13+ messages in thread
From: Matthew W Carlis @ 2025-10-30 20:09 UTC (permalink / raw)
To: gal; +Cc: adailey, ashishk, kuba, mattc, mbloch, msaggi, netdev, saeedm,
tariqt
On Thu, 30 Oct 2025 16:48:54, Gal Pressman wrote:
> I asked before, maybe these automatic tools that keep querying the
> module continue to do so because of the success return code, and that
> will be resolved soon?
I went and ran the command on a port with no module and agree that it does
return zero.
It doesn't solve the problem for me... Part of the reason automation will
periodically run the command is for inventory purposes. We want to roughly
know when a module was installed or removed. Then, as far as I can tell the
only way to know if a module is installed is by running this command which
has the consequence of generating log spam when there is not a module installed.
> Changing the log level makes things more difficult, as most production
> servers will not enable the debug print, and the logs would be harder to
> analyze.
I care less about the logging level... I suspect it should be info or dbg, but
we just don't want log lines generated when we're trying to figure out if some
module is installed.
Cheers!
-Matt
^ permalink raw reply [flat|nested] 13+ messages in thread
* Re: [PATCH 1/1] net/mlx5: query_mcia_reg fail logging at debug severity
2025-10-30 20:09 ` Matthew W Carlis
@ 2025-11-02 13:09 ` Gal Pressman
0 siblings, 0 replies; 13+ messages in thread
From: Gal Pressman @ 2025-11-02 13:09 UTC (permalink / raw)
To: Matthew W Carlis
Cc: adailey, ashishk, kuba, mbloch, msaggi, netdev, saeedm, tariqt
On 30/10/2025 22:09, Matthew W Carlis wrote:
> On Thu, 30 Oct 2025 16:48:54, Gal Pressman wrote:
>> I asked before, maybe these automatic tools that keep querying the
>> module continue to do so because of the success return code, and that
>> will be resolved soon?
>
> I went and ran the command on a port with no module and agree that it does
> return zero.
>
> It doesn't solve the problem for me... Part of the reason automation will
> periodically run the command is for inventory purposes. We want to roughly
> know when a module was installed or removed. Then, as far as I can tell the
> only way to know if a module is installed is by running this command which
> has the consequence of generating log spam when there is not a module installed.
>
>> Changing the log level makes things more difficult, as most production
>> servers will not enable the debug print, and the logs would be harder to
>> analyze.
>
> I care less about the logging level... I suspect it should be info or dbg, but
> we just don't want log lines generated when we're trying to figure out if some
> module is installed.
I have an idea on how to make the situation better, will hopefully have
something soon.
^ permalink raw reply [flat|nested] 13+ messages in thread
end of thread, other threads:[~2025-11-02 13:09 UTC | newest]
Thread overview: 13+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2025-10-28 19:40 [PATCH 0/1] net/mlx5: query_mcia_reg fail logging at debug severity Matthew W Carlis
2025-10-28 19:40 ` [PATCH 1/1] " Matthew W Carlis
2025-10-28 20:28 ` Gal Pressman
2025-10-28 21:48 ` Matthew W Carlis
2025-10-29 7:22 ` Gal Pressman
2025-10-28 22:43 ` Jakub Kicinski
2025-10-29 7:33 ` Gal Pressman
2025-10-29 16:49 ` Matthew W Carlis
2025-10-29 23:33 ` Jakub Kicinski
2025-10-30 14:48 ` Gal Pressman
2025-10-30 20:09 ` Matthew W Carlis
2025-11-02 13:09 ` Gal Pressman
2025-10-28 20:27 ` [PATCH 0/1] " Gal Pressman
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).