From mboxrd@z Thu Jan 1 00:00:00 1970 From: Yanjun Zhu Subject: Re: [PATCH 1/1] net/mlx4_core: avoid resetting HCA when accessing an offline device Date: Wed, 18 Apr 2018 13:46:18 +0800 Message-ID: <82f1c53e-f0f6-2ad6-8f70-2f5ac58d560b@oracle.com> References: <1523840527-22746-1-git-send-email-yanjun.zhu@oracle.com> <6dd17e45-e27e-8451-42ab-1a4551d3a651@mellanox.com> Mime-Version: 1.0 Content-Type: text/plain; charset=utf-8; format=flowed Content-Transfer-Encoding: 8bit To: Tariq Toukan , netdev@vger.kernel.org, linux-rdma@vger.kernel.org, haakon.bugge@oracle.com Return-path: Received: from userp2130.oracle.com ([156.151.31.86]:57346 "EHLO userp2130.oracle.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1751720AbeDRFq2 (ORCPT ); Wed, 18 Apr 2018 01:46:28 -0400 In-Reply-To: <6dd17e45-e27e-8451-42ab-1a4551d3a651@mellanox.com> Content-Language: en-US Sender: netdev-owner@vger.kernel.org List-ID: On 2018/4/17 23:37, Tariq Toukan wrote: > > > On 16/04/2018 4:02 AM, Zhu Yanjun wrote: >> While a faulty cable is used or HCA firmware error, HCA device will >> be offline. When the driver is accessing this offline device, the >> following call trace will pop out. >> >> " >> ... >>    [] dump_stack+0x63/0x81 >>    [] panic+0xcc/0x21b >>    [] mlx4_enter_error_state+0xba/0xf0 [mlx4_core] >>    [] mlx4_cmd_reset_flow+0x38/0x60 [mlx4_core] >>    [] mlx4_cmd_poll+0xc1/0x2e0 [mlx4_core] >>    [] __mlx4_cmd+0xb0/0x160 [mlx4_core] >>    [] mlx4_SENSE_PORT+0x54/0xd0 [mlx4_core] >>    [] mlx4_dev_cap+0x4a4/0xb50 [mlx4_core] >> ... >> " >> In the above call trace, the function mlx4_cmd_poll calls the function >> mlx4_cmd_post to access the HCA while HCA is offline. Then mlx4_cmd_post >> returns an error -EIO. Per -EIO, the function mlx4_cmd_poll calls >> mlx4_cmd_reset_flow to reset HCA. And the above call trace pops out. >> >> This is not reasonable. Since HCA device is offline when it is being >> accessed, it should not be reset again. >> >> In this patch, since HCA is offline, the function mlx4_cmd_post returns >> an error -EINVAL. Per -EINVAL, the function mlx4_cmd_poll directly >> returns >> instead of resetting HCA. >> >> CC: Srinivas Eeda >> CC: Junxiao Bi >> Suggested-by: Håkon Bugge >> Signed-off-by: Zhu Yanjun >> --- >>   drivers/net/ethernet/mellanox/mlx4/cmd.c | 8 ++++++++ >>   1 file changed, 8 insertions(+) >> >> diff --git a/drivers/net/ethernet/mellanox/mlx4/cmd.c >> b/drivers/net/ethernet/mellanox/mlx4/cmd.c >> index 6a9086d..f1c8c42 100644 >> --- a/drivers/net/ethernet/mellanox/mlx4/cmd.c >> +++ b/drivers/net/ethernet/mellanox/mlx4/cmd.c >> @@ -451,6 +451,8 @@ static int mlx4_cmd_post(struct mlx4_dev *dev, >> u64 in_param, u64 out_param, >>            * Device is going through error recovery >>            * and cannot accept commands. >>            */ >> +        mlx4_err(dev, "%s : Device is in error recovery.\n", __func__); >> +        ret = -EINVAL; >>           goto out; >>       } >>   @@ -657,6 +659,9 @@ static int mlx4_cmd_poll(struct mlx4_dev *dev, >> u64 in_param, u64 *out_param, >>       } >>     out_reset: >> +    if (err == -EINVAL) >> +        goto out; >> + > > See below. > >>       if (err) >>           err = mlx4_cmd_reset_flow(dev, op, op_modifier, err); >>   out: >> @@ -766,6 +771,9 @@ static int mlx4_cmd_wait(struct mlx4_dev *dev, >> u64 in_param, u64 *out_param, >>           *out_param = context->out_param; >>     out_reset: >> +    if (err == -EINVAL) >> +        goto out; >> + >>       if (err) > > Instead, just do here: if (err && err != -EINVAL) > >>           err = mlx4_cmd_reset_flow(dev, op, op_modifier, err); >>   out: >> > > I am not sure this does not mistakenly cover other cases that already > exist and have (err == -EINVAL). > > For example, this line is hard to predict: > err = mlx4_status_to_errno > and later on, we might get into > if (mlx4_closing_cmd_fatal_error(op, stat)) > which leads to out_reset. Thanks a lot. Sure. I agree with you that "err = mlx4_status_to_errno" and "if (mlx4_closing_cmd_fatal_error(op, stat))" will also make "err=-EINVAL". This will mistakenly go to out instead of resetting HCA device. I will make a new patch to avoid the above error. Zhu Yanjun > > We must have a deeper look at this. > But a better option is, change the error indication to uniquely > indicate "already in error recovery". >