netdev.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* [PATCH iwl-net v5] i40e: fix: remove needless retries of NVM update
@ 2024-06-25 18:49 Aleksandr Loktionov
  2024-06-26  9:28 ` [Intel-wired-lan] " Przemek Kitszel
                   ` (2 more replies)
  0 siblings, 3 replies; 6+ messages in thread
From: Aleksandr Loktionov @ 2024-06-25 18:49 UTC (permalink / raw)
  To: intel-wired-lan, anthony.l.nguyen, aleksandr.loktionov
  Cc: netdev, Kelvin Kang, Arkadiusz Kubalewski

Remove wrong EIO to EGAIN conversion and pass all errors as is.

After commit 230f3d53a547 ("i40e: remove i40e_status"), which should only
replace F/W specific error codes with Linux kernel generic, all EIO errors
suddenly started to be converted into EAGAIN which leads nvmupdate to retry
until it timeouts and sometimes fails after more than 20 minutes in the
middle of NVM update, so NVM becomes corrupted.

The bug affects users only at the time when they try to update NVM, and
only F/W versions that generate errors while nvmupdate. For example, X710DA2
with 0x8000ECB7 F/W is affected, but there are probably more...

Command for reproduction is just NVM update:
 ./nvmupdate64

In the log instead of:
 i40e_nvmupd_exec_aq err I40E_ERR_ADMIN_QUEUE_ERROR aq_err I40E_AQ_RC_ENOMEM)
appears:
 i40e_nvmupd_exec_aq err -EIO aq_err I40E_AQ_RC_ENOMEM
 i40e: eeprom check failed (-5), Tx/Rx traffic disabled

The problematic code did silently convert EIO into EAGAIN which forced
nvmupdate to ignore EAGAIN error and retry the same operation until timeout.
That's why NVM update takes 20+ minutes to finish with the fail in the end.

Fixes: 230f3d53a547 ("i40e: remove i40e_status")
Co-developed-by: Kelvin Kang <kelvin.kang@intel.com>
Signed-off-by: Kelvin Kang <kelvin.kang@intel.com>
Reviewed-by: Arkadiusz Kubalewski <arkadiusz.kubalewski@intel.com>
Signed-off-by: Aleksandr Loktionov <aleksandr.loktionov@intel.com>
---
v4->v5 commit message update
https://lore.kernel.org/netdev/20240618132111.3193963-1-aleksandr.loktionov@intel.com/T/#u
v3->v4 commit message update
v2->v3 commit messege typos
v1->v2 commit message update
---
 drivers/net/ethernet/intel/i40e/i40e_adminq.h | 4 ----
 1 file changed, 4 deletions(-)

diff --git a/drivers/net/ethernet/intel/i40e/i40e_adminq.h b/drivers/net/ethernet/intel/i40e/i40e_adminq.h
index ee86d2c..55b5bb8 100644
--- a/drivers/net/ethernet/intel/i40e/i40e_adminq.h
+++ b/drivers/net/ethernet/intel/i40e/i40e_adminq.h
@@ -109,10 +109,6 @@ static inline int i40e_aq_rc_to_posix(int aq_ret, int aq_rc)
 		-EFBIG,      /* I40E_AQ_RC_EFBIG */
 	};
 
-	/* aq_rc is invalid if AQ timed out */
-	if (aq_ret == -EIO)
-		return -EAGAIN;
-
 	if (!((u32)aq_rc < (sizeof(aq_to_posix) / sizeof((aq_to_posix)[0]))))
 		return -ERANGE;
 
-- 
2.25.1


^ permalink raw reply related	[flat|nested] 6+ messages in thread

* Re: [Intel-wired-lan] [PATCH iwl-net v5] i40e: fix: remove needless retries of NVM update
  2024-06-25 18:49 [PATCH iwl-net v5] i40e: fix: remove needless retries of NVM update Aleksandr Loktionov
@ 2024-06-26  9:28 ` Przemek Kitszel
  2024-06-27 17:33 ` Simon Horman
  2024-07-08 12:59 ` Brelinski, Tony
  2 siblings, 0 replies; 6+ messages in thread
From: Przemek Kitszel @ 2024-06-26  9:28 UTC (permalink / raw)
  To: Aleksandr Loktionov, anthony.l.nguyen
  Cc: netdev, Kelvin Kang, Arkadiusz Kubalewski, intel-wired-lan

On 6/25/24 20:49, Aleksandr Loktionov wrote:
> Remove wrong EIO to EGAIN conversion and pass all errors as is.
> 
> After commit 230f3d53a547 ("i40e: remove i40e_status"), which should only
> replace F/W specific error codes with Linux kernel generic, all EIO errors
> suddenly started to be converted into EAGAIN which leads nvmupdate to retry
> until it timeouts and sometimes fails after more than 20 minutes in the
> middle of NVM update, so NVM becomes corrupted.
> 
> The bug affects users only at the time when they try to update NVM, and
> only F/W versions that generate errors while nvmupdate. For example, X710DA2
> with 0x8000ECB7 F/W is affected, but there are probably more...
> 
> Command for reproduction is just NVM update:
>   ./nvmupdate64
> 
> In the log instead of:
>   i40e_nvmupd_exec_aq err I40E_ERR_ADMIN_QUEUE_ERROR aq_err I40E_AQ_RC_ENOMEM)
> appears:
>   i40e_nvmupd_exec_aq err -EIO aq_err I40E_AQ_RC_ENOMEM
>   i40e: eeprom check failed (-5), Tx/Rx traffic disabled
> 
> The problematic code did silently convert EIO into EAGAIN which forced
> nvmupdate to ignore EAGAIN error and retry the same operation until timeout.
> That's why NVM update takes 20+ minutes to finish with the fail in the end.
> 
> Fixes: 230f3d53a547 ("i40e: remove i40e_status")
> Co-developed-by: Kelvin Kang <kelvin.kang@intel.com>
> Signed-off-by: Kelvin Kang <kelvin.kang@intel.com>
> Reviewed-by: Arkadiusz Kubalewski <arkadiusz.kubalewski@intel.com>
> Signed-off-by: Aleksandr Loktionov <aleksandr.loktionov@intel.com>
> ---
> v4->v5 commit message update
> https://lore.kernel.org/netdev/20240618132111.3193963-1-aleksandr.loktionov@intel.com/T/#u
> v3->v4 commit message update
> v2->v3 commit messege typos
> v1->v2 commit message update
> ---
>   drivers/net/ethernet/intel/i40e/i40e_adminq.h | 4 ----
>   1 file changed, 4 deletions(-)
> 
> diff --git a/drivers/net/ethernet/intel/i40e/i40e_adminq.h b/drivers/net/ethernet/intel/i40e/i40e_adminq.h
> index ee86d2c..55b5bb8 100644
> --- a/drivers/net/ethernet/intel/i40e/i40e_adminq.h
> +++ b/drivers/net/ethernet/intel/i40e/i40e_adminq.h
> @@ -109,10 +109,6 @@ static inline int i40e_aq_rc_to_posix(int aq_ret, int aq_rc)
>   		-EFBIG,      /* I40E_AQ_RC_EFBIG */
>   	};
>   
> -	/* aq_rc is invalid if AQ timed out */
> -	if (aq_ret == -EIO)
> -		return -EAGAIN;
> -
>   	if (!((u32)aq_rc < (sizeof(aq_to_posix) / sizeof((aq_to_posix)[0]))))
>   		return -ERANGE;
>   

Reviewed-by: Przemek Kitszel <przemyslaw.kitszel@intel.com>

^ permalink raw reply	[flat|nested] 6+ messages in thread

* Re: [PATCH iwl-net v5] i40e: fix: remove needless retries of NVM update
  2024-06-25 18:49 [PATCH iwl-net v5] i40e: fix: remove needless retries of NVM update Aleksandr Loktionov
  2024-06-26  9:28 ` [Intel-wired-lan] " Przemek Kitszel
@ 2024-06-27 17:33 ` Simon Horman
  2024-07-08 15:38   ` [Intel-wired-lan] " Loktionov, Aleksandr
  2024-07-08 12:59 ` Brelinski, Tony
  2 siblings, 1 reply; 6+ messages in thread
From: Simon Horman @ 2024-06-27 17:33 UTC (permalink / raw)
  To: Aleksandr Loktionov
  Cc: intel-wired-lan, anthony.l.nguyen, netdev, Kelvin Kang,
	Arkadiusz Kubalewski

On Tue, Jun 25, 2024 at 08:49:53PM +0200, Aleksandr Loktionov wrote:
> Remove wrong EIO to EGAIN conversion and pass all errors as is.
> 
> After commit 230f3d53a547 ("i40e: remove i40e_status"), which should only
> replace F/W specific error codes with Linux kernel generic, all EIO errors
> suddenly started to be converted into EAGAIN which leads nvmupdate to retry
> until it timeouts and sometimes fails after more than 20 minutes in the
> middle of NVM update, so NVM becomes corrupted.
> 
> The bug affects users only at the time when they try to update NVM, and
> only F/W versions that generate errors while nvmupdate. For example, X710DA2
> with 0x8000ECB7 F/W is affected, but there are probably more...
> 
> Command for reproduction is just NVM update:
>  ./nvmupdate64
> 
> In the log instead of:
>  i40e_nvmupd_exec_aq err I40E_ERR_ADMIN_QUEUE_ERROR aq_err I40E_AQ_RC_ENOMEM)
> appears:
>  i40e_nvmupd_exec_aq err -EIO aq_err I40E_AQ_RC_ENOMEM
>  i40e: eeprom check failed (-5), Tx/Rx traffic disabled
> 
> The problematic code did silently convert EIO into EAGAIN which forced
> nvmupdate to ignore EAGAIN error and retry the same operation until timeout.
> That's why NVM update takes 20+ minutes to finish with the fail in the end.
> 
> Fixes: 230f3d53a547 ("i40e: remove i40e_status")
> Co-developed-by: Kelvin Kang <kelvin.kang@intel.com>
> Signed-off-by: Kelvin Kang <kelvin.kang@intel.com>
> Reviewed-by: Arkadiusz Kubalewski <arkadiusz.kubalewski@intel.com>
> Signed-off-by: Aleksandr Loktionov <aleksandr.loktionov@intel.com>

Hi Aleksandr,

Maybe I'm reading things wrong, I have concerns :(

Amongst other things, the cited commit:
1. Maps a number of different I40E_ERR_* values to -EIO; and
2. Maps checks on different I40E_ERR_* values to -EIO

My concern is that the code may now incorrectly match against -EIO
for cases where it would not have previously matched when more
specific error codes.

In the case at hand:
1. -EIO is returned in place of I40E_ERR_ADMIN_QUEUE_ERROR
2. i40e_aq_rc_to_posix checks for -EIO in place of I40E_ERR_ADMIN_QUEUE_TIMEOUT

As you point out, we are now in a bad place.
Which your patch addresses.

But what about a different case where:
1. -EIO is returned in place of I40E_ERR_ADMIN_QUEUE_TIMEOUT
2. i40e_aq_rc_to_posix checks for -EIO in place of I40E_ERR_ADMIN_QUEUE_TIMEOUT

In this scenario the, the code without your patch is correct,
and with your patch it seems incorrect.

Perhaps only the scenario you are fixing occurs.
If so, all good. But it's not obvious to me that is the case.

I'm likewise concerned by other conditions on -EIO
introduced by the cited commit.

> ---
> v4->v5 commit message update
> https://lore.kernel.org/netdev/20240618132111.3193963-1-aleksandr.loktionov@intel.com/T/#u
> v3->v4 commit message update
> v2->v3 commit messege typos
> v1->v2 commit message update
> ---
>  drivers/net/ethernet/intel/i40e/i40e_adminq.h | 4 ----
>  1 file changed, 4 deletions(-)
> 
> diff --git a/drivers/net/ethernet/intel/i40e/i40e_adminq.h b/drivers/net/ethernet/intel/i40e/i40e_adminq.h
> index ee86d2c..55b5bb8 100644
> --- a/drivers/net/ethernet/intel/i40e/i40e_adminq.h
> +++ b/drivers/net/ethernet/intel/i40e/i40e_adminq.h
> @@ -109,10 +109,6 @@ static inline int i40e_aq_rc_to_posix(int aq_ret, int aq_rc)
>  		-EFBIG,      /* I40E_AQ_RC_EFBIG */
>  	};
>  
> -	/* aq_rc is invalid if AQ timed out */
> -	if (aq_ret == -EIO)
> -		return -EAGAIN;
> -

Perhaps it has already been covered, but with this change the aq_ret
argument of this function is longer used.  It could be removed as a
follow-up for iwl-next.

>  	if (!((u32)aq_rc < (sizeof(aq_to_posix) / sizeof((aq_to_posix)[0]))))
>  		return -ERANGE;
>  
> -- 
> 2.25.1
> 
> 

^ permalink raw reply	[flat|nested] 6+ messages in thread

* RE: [Intel-wired-lan] [PATCH iwl-net v5] i40e: fix: remove needless retries of NVM update
  2024-06-25 18:49 [PATCH iwl-net v5] i40e: fix: remove needless retries of NVM update Aleksandr Loktionov
  2024-06-26  9:28 ` [Intel-wired-lan] " Przemek Kitszel
  2024-06-27 17:33 ` Simon Horman
@ 2024-07-08 12:59 ` Brelinski, Tony
  2 siblings, 0 replies; 6+ messages in thread
From: Brelinski, Tony @ 2024-07-08 12:59 UTC (permalink / raw)
  To: Loktionov, Aleksandr, intel-wired-lan@lists.osuosl.org,
	Nguyen, Anthony L, Loktionov, Aleksandr
  Cc: netdev@vger.kernel.org, Kang, Kelvin, Kubalewski, Arkadiusz

> -----Original Message-----
> From: Intel-wired-lan <intel-wired-lan-bounces@osuosl.org> On Behalf Of
> Aleksandr Loktionov
> Sent: Tuesday, June 25, 2024 11:50 AM
> To: intel-wired-lan@lists.osuosl.org; Nguyen, Anthony L
> <anthony.l.nguyen@intel.com>; Loktionov, Aleksandr
> <aleksandr.loktionov@intel.com>
> Cc: netdev@vger.kernel.org; Kang, Kelvin <kelvin.kang@intel.com>;
> Kubalewski, Arkadiusz <arkadiusz.kubalewski@intel.com>
> Subject: [Intel-wired-lan] [PATCH iwl-net v5] i40e: fix: remove needless retries
> of NVM update
>
> Remove wrong EIO to EGAIN conversion and pass all errors as is.
>
> After commit 230f3d53a547 ("i40e: remove i40e_status"), which should only
> replace F/W specific error codes with Linux kernel generic, all EIO errors
> suddenly started to be converted into EAGAIN which leads nvmupdate to
> retry until it timeouts and sometimes fails after more than 20 minutes in the
> middle of NVM update, so NVM becomes corrupted.
>
> The bug affects users only at the time when they try to update NVM, and only
> F/W versions that generate errors while nvmupdate. For example, X710DA2
> with 0x8000ECB7 F/W is affected, but there are probably more...
>
> Command for reproduction is just NVM update:
>  ./nvmupdate64
>
> In the log instead of:
>  i40e_nvmupd_exec_aq err I40E_ERR_ADMIN_QUEUE_ERROR aq_err
> I40E_AQ_RC_ENOMEM)
> appears:
>  i40e_nvmupd_exec_aq err -EIO aq_err I40E_AQ_RC_ENOMEM
>  i40e: eeprom check failed (-5), Tx/Rx traffic disabled
>
> The problematic code did silently convert EIO into EAGAIN which forced
> nvmupdate to ignore EAGAIN error and retry the same operation until
> timeout.
> That's why NVM update takes 20+ minutes to finish with the fail in the end.
>
> Fixes: 230f3d53a547 ("i40e: remove i40e_status")
> Co-developed-by: Kelvin Kang <kelvin.kang@intel.com>
> Signed-off-by: Kelvin Kang <kelvin.kang@intel.com>
> Reviewed-by: Arkadiusz Kubalewski <arkadiusz.kubalewski@intel.com>
> Signed-off-by: Aleksandr Loktionov <aleksandr.loktionov@intel.com>
> ---
> v4->v5 commit message update
> https://lore.kernel.org/netdev/20240618132111.3193963-1-
> aleksandr.loktionov@intel.com/T/#u
> v3->v4 commit message update
> v2->v3 commit messege typos
> v1->v2 commit message update
> ---
>  drivers/net/ethernet/intel/i40e/i40e_adminq.h | 4 ----
>  1 file changed, 4 deletions(-)

Tested-by: Tony Brelinski <tony.brelinski@intel.com>


^ permalink raw reply	[flat|nested] 6+ messages in thread

* RE: [Intel-wired-lan] [PATCH iwl-net v5] i40e: fix: remove needless retries of NVM update
  2024-06-27 17:33 ` Simon Horman
@ 2024-07-08 15:38   ` Loktionov, Aleksandr
  2024-07-09  7:42     ` Simon Horman
  0 siblings, 1 reply; 6+ messages in thread
From: Loktionov, Aleksandr @ 2024-07-08 15:38 UTC (permalink / raw)
  To: Simon Horman
  Cc: Nguyen, Anthony L, Kang, Kelvin, Kubalewski, Arkadiusz,
	intel-wired-lan@lists.osuosl.org, netdev@vger.kernel.org



> -----Original Message-----
> From: Intel-wired-lan <intel-wired-lan-bounces@osuosl.org> On Behalf
> Of Simon Horman
> Sent: Thursday, June 27, 2024 7:34 PM
> To: Loktionov, Aleksandr <aleksandr.loktionov@intel.com>
> Cc: Nguyen, Anthony L <anthony.l.nguyen@intel.com>; Kang, Kelvin
> <kelvin.kang@intel.com>; Kubalewski, Arkadiusz
> <arkadiusz.kubalewski@intel.com>; intel-wired-lan@lists.osuosl.org;
> netdev@vger.kernel.org
> Subject: Re: [Intel-wired-lan] [PATCH iwl-net v5] i40e: fix: remove
> needless retries of NVM update
> 
> On Tue, Jun 25, 2024 at 08:49:53PM +0200, Aleksandr Loktionov wrote:
> > Remove wrong EIO to EGAIN conversion and pass all errors as is.
> >
> > After commit 230f3d53a547 ("i40e: remove i40e_status"), which should
> > only replace F/W specific error codes with Linux kernel generic, all
> > EIO errors suddenly started to be converted into EAGAIN which leads
> > nvmupdate to retry until it timeouts and sometimes fails after more
> > than 20 minutes in the middle of NVM update, so NVM becomes
> corrupted.
> >
> > The bug affects users only at the time when they try to update NVM,
> > and only F/W versions that generate errors while nvmupdate. For
> > example, X710DA2 with 0x8000ECB7 F/W is affected, but there are
> probably more...
> >
> > Command for reproduction is just NVM update:
> >  ./nvmupdate64
> >
> > In the log instead of:
> >  i40e_nvmupd_exec_aq err I40E_ERR_ADMIN_QUEUE_ERROR aq_err
> > I40E_AQ_RC_ENOMEM)
> > appears:
> >  i40e_nvmupd_exec_aq err -EIO aq_err I40E_AQ_RC_ENOMEM
> >  i40e: eeprom check failed (-5), Tx/Rx traffic disabled
> >
> > The problematic code did silently convert EIO into EAGAIN which
> forced
> > nvmupdate to ignore EAGAIN error and retry the same operation until
> timeout.
> > That's why NVM update takes 20+ minutes to finish with the fail in
> the end.
> >
> > Fixes: 230f3d53a547 ("i40e: remove i40e_status")
> > Co-developed-by: Kelvin Kang <kelvin.kang@intel.com>
> > Signed-off-by: Kelvin Kang <kelvin.kang@intel.com>
> > Reviewed-by: Arkadiusz Kubalewski <arkadiusz.kubalewski@intel.com>
> > Signed-off-by: Aleksandr Loktionov <aleksandr.loktionov@intel.com>
> 
> Hi Aleksandr,
> 
> Maybe I'm reading things wrong, I have concerns :(
> 
> Amongst other things, the cited commit:
> 1. Maps a number of different I40E_ERR_* values to -EIO; and 2. Maps
> checks on different I40E_ERR_* values to -EIO
> 
> My concern is that the code may now incorrectly match against -EIO for
> cases where it would not have previously matched when more specific
> error codes.
> 
> In the case at hand:
> 1. -EIO is returned in place of I40E_ERR_ADMIN_QUEUE_ERROR 2.
> i40e_aq_rc_to_posix checks for -EIO in place of
> I40E_ERR_ADMIN_QUEUE_TIMEOUT
> 
> As you point out, we are now in a bad place.
> Which your patch addresses.
> 
> But what about a different case where:
> 1. -EIO is returned in place of I40E_ERR_ADMIN_QUEUE_TIMEOUT 2.
> i40e_aq_rc_to_posix checks for -EIO in place of
> I40E_ERR_ADMIN_QUEUE_TIMEOUT
> 
> In this scenario the, the code without your patch is correct, and with
> your patch it seems incorrect.
> 
> Perhaps only the scenario you are fixing occurs.
> If so, all good. But it's not obvious to me that is the case.
> 
> I'm likewise concerned by other conditions on -EIO introduced by the
> cited commit.

This commit do not introduce -EIO errors.
Before 230f3d53a547 ("i40e: remove i40e_status") some specific F/W error codes were
converted into -EAGAIN by i40e_aq_rc_to_posix(), but now all error codes are already
Linux kernel codes, so there is no way to distinguish special F/W codes and convert
them into -EAGAIN.

Our validation has been tested regressions of current patch and gave signed off.

Do you propose change 
	if (aq_ret == -EIO)
		return -EAGAIN;
into

	if (aq_ret == -EIO)
		return -EIO;
?

It will require additional testing...

> > ---
> > v4->v5 commit message update
> > https://lore.kernel.org/netdev/20240618132111.3193963-1-
> aleksandr.lokt
> > ionov@intel.com/T/#u
> > v3->v4 commit message update
> > v2->v3 commit messege typos
> > v1->v2 commit message update
> > ---
> >  drivers/net/ethernet/intel/i40e/i40e_adminq.h | 4 ----
> >  1 file changed, 4 deletions(-)
> >
> > diff --git a/drivers/net/ethernet/intel/i40e/i40e_adminq.h
> > b/drivers/net/ethernet/intel/i40e/i40e_adminq.h
> > index ee86d2c..55b5bb8 100644
> > --- a/drivers/net/ethernet/intel/i40e/i40e_adminq.h
> > +++ b/drivers/net/ethernet/intel/i40e/i40e_adminq.h
> > @@ -109,10 +109,6 @@ static inline int i40e_aq_rc_to_posix(int
> aq_ret, int aq_rc)
> >  		-EFBIG,      /* I40E_AQ_RC_EFBIG */
> >  	};
> >
> > -	/* aq_rc is invalid if AQ timed out */
> > -	if (aq_ret == -EIO)
> > -		return -EAGAIN;
> > -
> 
> Perhaps it has already been covered, but with this change the aq_ret
> argument of this function is longer used.  It could be removed as a
> follow-up for iwl-next.
> 
> >  	if (!((u32)aq_rc < (sizeof(aq_to_posix) /
> sizeof((aq_to_posix)[0]))))
> >  		return -ERANGE;
> >
> > --
> > 2.25.1
> >
> >

^ permalink raw reply	[flat|nested] 6+ messages in thread

* Re: [Intel-wired-lan] [PATCH iwl-net v5] i40e: fix: remove needless retries of NVM update
  2024-07-08 15:38   ` [Intel-wired-lan] " Loktionov, Aleksandr
@ 2024-07-09  7:42     ` Simon Horman
  0 siblings, 0 replies; 6+ messages in thread
From: Simon Horman @ 2024-07-09  7:42 UTC (permalink / raw)
  To: Loktionov, Aleksandr
  Cc: Nguyen, Anthony L, Kang, Kelvin, Kubalewski, Arkadiusz,
	intel-wired-lan@lists.osuosl.org, netdev@vger.kernel.org

On Mon, Jul 08, 2024 at 03:38:11PM +0000, Loktionov, Aleksandr wrote:
> 
> 
> > -----Original Message-----
> > From: Intel-wired-lan <intel-wired-lan-bounces@osuosl.org> On Behalf
> > Of Simon Horman
> > Sent: Thursday, June 27, 2024 7:34 PM
> > To: Loktionov, Aleksandr <aleksandr.loktionov@intel.com>
> > Cc: Nguyen, Anthony L <anthony.l.nguyen@intel.com>; Kang, Kelvin
> > <kelvin.kang@intel.com>; Kubalewski, Arkadiusz
> > <arkadiusz.kubalewski@intel.com>; intel-wired-lan@lists.osuosl.org;
> > netdev@vger.kernel.org
> > Subject: Re: [Intel-wired-lan] [PATCH iwl-net v5] i40e: fix: remove
> > needless retries of NVM update
> > 
> > On Tue, Jun 25, 2024 at 08:49:53PM +0200, Aleksandr Loktionov wrote:
> > > Remove wrong EIO to EGAIN conversion and pass all errors as is.
> > >
> > > After commit 230f3d53a547 ("i40e: remove i40e_status"), which should
> > > only replace F/W specific error codes with Linux kernel generic, all
> > > EIO errors suddenly started to be converted into EAGAIN which leads
> > > nvmupdate to retry until it timeouts and sometimes fails after more
> > > than 20 minutes in the middle of NVM update, so NVM becomes
> > corrupted.
> > >
> > > The bug affects users only at the time when they try to update NVM,
> > > and only F/W versions that generate errors while nvmupdate. For
> > > example, X710DA2 with 0x8000ECB7 F/W is affected, but there are
> > probably more...
> > >
> > > Command for reproduction is just NVM update:
> > >  ./nvmupdate64
> > >
> > > In the log instead of:
> > >  i40e_nvmupd_exec_aq err I40E_ERR_ADMIN_QUEUE_ERROR aq_err
> > > I40E_AQ_RC_ENOMEM)
> > > appears:
> > >  i40e_nvmupd_exec_aq err -EIO aq_err I40E_AQ_RC_ENOMEM
> > >  i40e: eeprom check failed (-5), Tx/Rx traffic disabled
> > >
> > > The problematic code did silently convert EIO into EAGAIN which
> > forced
> > > nvmupdate to ignore EAGAIN error and retry the same operation until
> > timeout.
> > > That's why NVM update takes 20+ minutes to finish with the fail in
> > the end.
> > >
> > > Fixes: 230f3d53a547 ("i40e: remove i40e_status")
> > > Co-developed-by: Kelvin Kang <kelvin.kang@intel.com>
> > > Signed-off-by: Kelvin Kang <kelvin.kang@intel.com>
> > > Reviewed-by: Arkadiusz Kubalewski <arkadiusz.kubalewski@intel.com>
> > > Signed-off-by: Aleksandr Loktionov <aleksandr.loktionov@intel.com>
> > 
> > Hi Aleksandr,
> > 
> > Maybe I'm reading things wrong, I have concerns :(
> > 
> > Amongst other things, the cited commit:
> > 1. Maps a number of different I40E_ERR_* values to -EIO; and 2. Maps
> > checks on different I40E_ERR_* values to -EIO
> > 
> > My concern is that the code may now incorrectly match against -EIO for
> > cases where it would not have previously matched when more specific
> > error codes.
> > 
> > In the case at hand:
> > 1. -EIO is returned in place of I40E_ERR_ADMIN_QUEUE_ERROR 2.
> > i40e_aq_rc_to_posix checks for -EIO in place of
> > I40E_ERR_ADMIN_QUEUE_TIMEOUT
> > 
> > As you point out, we are now in a bad place.
> > Which your patch addresses.
> > 
> > But what about a different case where:
> > 1. -EIO is returned in place of I40E_ERR_ADMIN_QUEUE_TIMEOUT 2.
> > i40e_aq_rc_to_posix checks for -EIO in place of
> > I40E_ERR_ADMIN_QUEUE_TIMEOUT
> > 
> > In this scenario the, the code without your patch is correct, and with
> > your patch it seems incorrect.
> > 
> > Perhaps only the scenario you are fixing occurs.
> > If so, all good. But it's not obvious to me that is the case.
> > 
> > I'm likewise concerned by other conditions on -EIO introduced by the
> > cited commit.
> 
> This commit do not introduce -EIO errors.
> Before 230f3d53a547 ("i40e: remove i40e_status") some specific F/W error codes were
> converted into -EAGAIN by i40e_aq_rc_to_posix(), but now all error codes are already
> Linux kernel codes, so there is no way to distinguish special F/W codes and convert
> them into -EAGAIN.

Right, this last part is the nub of my concern.

> Our validation has been tested regressions of current patch and gave signed off.
> 
> Do you propose change 
> 	if (aq_ret == -EIO)
> 		return -EAGAIN;
> into
> 
> 	if (aq_ret == -EIO)
> 		return -EIO;
> ?
> 
> It will require additional testing...

If the problem I described is indeed a problem then a suspect a more
invasive change is required, to differentiate between the different
cases previously covered by internal error codes.

However, that is speculation on my part.
While your patch has been tested.

So I suggest, contrary to my previous email, that this patch moves forwards.

IOW, I am not blocking progress of this patch (anymore).

...

^ permalink raw reply	[flat|nested] 6+ messages in thread

end of thread, other threads:[~2024-07-09  7:42 UTC | newest]

Thread overview: 6+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2024-06-25 18:49 [PATCH iwl-net v5] i40e: fix: remove needless retries of NVM update Aleksandr Loktionov
2024-06-26  9:28 ` [Intel-wired-lan] " Przemek Kitszel
2024-06-27 17:33 ` Simon Horman
2024-07-08 15:38   ` [Intel-wired-lan] " Loktionov, Aleksandr
2024-07-09  7:42     ` Simon Horman
2024-07-08 12:59 ` Brelinski, Tony

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).