linux-kernel.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* Re: [PATCH] infiniband:Fix error checking in the function ocrdma_dereg_mr
       [not found] <1441384543-32447-1-git-send-email-xerofoify@gmail.com>
@ 2015-09-04 16:45 ` Doug Ledford
  2015-09-04 17:01   ` Jason Gunthorpe
  2015-09-14 11:48   ` Selvin Xavier
  0 siblings, 2 replies; 3+ messages in thread
From: Doug Ledford @ 2015-09-04 16:45 UTC (permalink / raw)
  To: Nicholas Krause, selvin.xavier
  Cc: devesh.sharma, mitesh.ahuja, sean.hefty, hal.rosenstock,
	linux-rdma, linux-kernel

[-- Attachment #1: Type: text/plain, Size: 3033 bytes --]

On 09/04/2015 12:35 PM, Nicholas Krause wrote:
> This fixes error checking in the function ocrdma_dereg_mr to properly
> check if the call to ocrdma_mbx_alloc fails as if so we must exit this
> function immediately and return the error code to the caller of the
> function ocrdma_dereg_mr to signal a non recoverable error has occurred
> when calling this function that needs to be handled by the caller in its
> own intended error paths.

This is probably not the right solution here.

Emulex will need to speak to this as it involves firmware interactions.
 But, in a nutshell, there are only a few very specific ways in which a
mailbox command to a device should ever be expected to fail:

1) The device firmware is hung.  In which case we have bigger problems
than the application can deal with and the driver needs to take its own
actions, not leave it up to the calling kernel code or application.

2) The mr is still in active use.  I'm not even sure that this will
cause a failure.  It depends on the firmware.  If the firmware knows it
is still in use (by a queued work request) and fails to dereg the mr,
then I could see returning that error code.  However, if the firmware
simply does the dereg whether the mr is in use or not (presumably
causing any work request still using this mr to fail once it finally
attempts to use the mr), then there wouldn't be an error to return here.

3) Unspecified PCI error resulting in an inability to communicate with
the card at all.  This falls under the same category as #1 and is beyond
the scope of a calling application.

So, depending on Emulex's answer in regards to #2 above, I doubt this
patch is the right thing to do.  For #1 and #3, the card will certainly
have to be reset before it can be used again, and will loose all of its
registrations anyway, so we would need to *not* return from an error
condition and we would need to proceed to clean up all of the dangling
resources and then return a 0 result as the application should not be
involved in the recovery that must take place.

> Signed-off-by: Nicholas Krause <xerofoify@gmail.com>
> ---
>  drivers/infiniband/hw/ocrdma/ocrdma_verbs.c | 5 ++++-
>  1 file changed, 4 insertions(+), 1 deletion(-)
> 
> diff --git a/drivers/infiniband/hw/ocrdma/ocrdma_verbs.c b/drivers/infiniband/hw/ocrdma/ocrdma_verbs.c
> index bc84cd4..ea7dfc5b 100644
> --- a/drivers/infiniband/hw/ocrdma/ocrdma_verbs.c
> +++ b/drivers/infiniband/hw/ocrdma/ocrdma_verbs.c
> @@ -969,8 +969,11 @@ int ocrdma_dereg_mr(struct ib_mr *ib_mr)
>  {
>  	struct ocrdma_mr *mr = get_ocrdma_mr(ib_mr);
>  	struct ocrdma_dev *dev = get_ocrdma_dev(ib_mr->device);
> +	int ret = 0;
>  
> -	(void) ocrdma_mbx_dealloc_lkey(dev, mr->hwmr.fr_mr, mr->hwmr.lkey);
> +	ret =  ocrdma_mbx_dealloc_lkey(dev, mr->hwmr.fr_mr, mr->hwmr.lkey);
> +	if (ret)
> +		return ret;
>  
>  	ocrdma_free_mr_pbl_tbl(dev, &mr->hwmr);
>  
> 


-- 
Doug Ledford <dledford@redhat.com>
              GPG KeyID: 0E572FDD



[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 884 bytes --]

^ permalink raw reply	[flat|nested] 3+ messages in thread

* Re: [PATCH] infiniband:Fix error checking in the function ocrdma_dereg_mr
  2015-09-04 16:45 ` [PATCH] infiniband:Fix error checking in the function ocrdma_dereg_mr Doug Ledford
@ 2015-09-04 17:01   ` Jason Gunthorpe
  2015-09-14 11:48   ` Selvin Xavier
  1 sibling, 0 replies; 3+ messages in thread
From: Jason Gunthorpe @ 2015-09-04 17:01 UTC (permalink / raw)
  To: Doug Ledford
  Cc: Nicholas Krause, selvin.xavier, devesh.sharma, mitesh.ahuja,
	sean.hefty, hal.rosenstock, linux-rdma, linux-kernel

On Fri, Sep 04, 2015 at 12:45:38PM -0400, Doug Ledford wrote:
> On 09/04/2015 12:35 PM, Nicholas Krause wrote:
> > This fixes error checking in the function ocrdma_dereg_mr to properly
> > check if the call to ocrdma_mbx_alloc fails as if so we must exit this
> > function immediately and return the error code to the caller of the
> > function ocrdma_dereg_mr to signal a non recoverable error has occurred
> > when calling this function that needs to be handled by the caller in its
> > own intended error paths.
> 
> This is probably not the right solution here.

Indeed, it is my long term goal to remove error reporting from
'dealloc' style driver functions.

Only Emulex's driver reports errors here, and *nothing* pays attention
to them, or has any recovery strategy.

The best iterm option is to WARN_ON here:

> > -	(void) ocrdma_mbx_dealloc_lkey(dev, mr->hwmr.fr_mr, mr->hwmr.lkey);
> > +	ret =  ocrdma_mbx_dealloc_lkey(dev, mr->hwmr.fr_mr, mr->hwmr.lkey);

And Emulex should work to fix their driver so dealloc failures are
either impossible or cause a full device reset of some kind.

Jason

^ permalink raw reply	[flat|nested] 3+ messages in thread

* Re: [PATCH] infiniband:Fix error checking in the function ocrdma_dereg_mr
  2015-09-04 16:45 ` [PATCH] infiniband:Fix error checking in the function ocrdma_dereg_mr Doug Ledford
  2015-09-04 17:01   ` Jason Gunthorpe
@ 2015-09-14 11:48   ` Selvin Xavier
  1 sibling, 0 replies; 3+ messages in thread
From: Selvin Xavier @ 2015-09-14 11:48 UTC (permalink / raw)
  To: Doug Ledford
  Cc: Nicholas Krause, Devesh Sharma, Mitesh Ahuja, sean.hefty,
	hal.rosenstock, linux-rdma, linux-kernel

Hi Doug/Jason/Nicholas,



Apologies for the delay in response.

I consulted with our f/w and h/w teams regarding this issue.  The current
ocrdma code seems right in handling the lkey dealloc failure from FW.
ocrdma_mbx_dealloc_lkey can fail only if f the FW is hung or PCI errors
( points 1 and 3 in Doug's mail).

Doug,
Please see inline  for replies to your specific queries


On Fri, Sep 4, 2015 at 10:15 PM, Doug Ledford <dledford@redhat.com> wrote:
> On 09/04/2015 12:35 PM, Nicholas Krause wrote:
>> This fixes error checking in the function ocrdma_dereg_mr to properly
>> check if the call to ocrdma_mbx_alloc fails as if so we must exit this
>> function immediately and return the error code to the caller of the
>> function ocrdma_dereg_mr to signal a non recoverable error has occurred
>> when calling this function that needs to be handled by the caller in its
>> own intended error paths.
>
> This is probably not the right solution here.
>
> Emulex will need to speak to this as it involves firmware interactions.
>  But, in a nutshell, there are only a few very specific ways in which a
> mailbox command to a device should ever be expected to fail:
>
> 1) The device firmware is hung.  In which case we have bigger problems
> than the application can deal with and the driver needs to take its own
> actions, not leave it up to the calling kernel code or application.
>
> 2) The mr is still in active use.  I'm not even sure that this will
> cause a failure.  It depends on the firmware.  If the firmware knows it
> is still in use (by a queued work request) and fails to dereg the mr,
> then I could see returning that error code.  However, if the firmware
> simply does the dereg whether the mr is in use or not (presumably
> causing any work request still using this mr to fail once it finally
> attempts to use the mr), then there wouldn't be an error to return here.
>

FW will not fail  dealloc_lkey if the MRs are already in use.
Future work-requests will fail when it attempts to use this MR.
Hence driver is ignoring the return value ocrdma_mbx_dealloc_lkey.
So it is safe for the stack to free up the resources.
So i feel changes suggested by Nicholas is not necessary.

> 3) Unspecified PCI error resulting in an inability to communicate with
> the card at all.  This falls under the same category as #1 and is beyond
> the scope of a calling application.
>
> So, depending on Emulex's answer in regards to #2 above, I doubt this
> patch is the right thing to do.  For #1 and #3, the card will certainly
> have to be reset before it can be used again, and will loose all of its
> registrations anyway, so we would need to *not* return from an error
> condition and we would need to proceed to clean up all of the dangling
> resources and then return a 0 result as the application should not be
> involved in the recovery that must take place.
>

PCI error and FW hang needs to be handled as mentioned above by Doug.
As of now reboot is the only solution, for this very rare scenario. Is there any
support from the stack framework to cleanup kernel and application resources,
as this error path will be common for all other vendor drivers?

Thanks,
Selvin Xavier

^ permalink raw reply	[flat|nested] 3+ messages in thread

end of thread, other threads:[~2015-09-14 11:48 UTC | newest]

Thread overview: 3+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
     [not found] <1441384543-32447-1-git-send-email-xerofoify@gmail.com>
2015-09-04 16:45 ` [PATCH] infiniband:Fix error checking in the function ocrdma_dereg_mr Doug Ledford
2015-09-04 17:01   ` Jason Gunthorpe
2015-09-14 11:48   ` Selvin Xavier

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).