From: Yanjun Zhu <yanjun.zhu@oracle.com>
To: Tariq Toukan <tariqt@mellanox.com>,
netdev@vger.kernel.org, linux-rdma@vger.kernel.org,
haakon.bugge@oracle.com
Subject: Re: [PATCH 1/1] net/mlx4_core: avoid resetting HCA when accessing an offline device
Date: Wed, 18 Apr 2018 13:46:18 +0800 [thread overview]
Message-ID: <82f1c53e-f0f6-2ad6-8f70-2f5ac58d560b@oracle.com> (raw)
In-Reply-To: <6dd17e45-e27e-8451-42ab-1a4551d3a651@mellanox.com>
On 2018/4/17 23:37, Tariq Toukan wrote:
>
>
> On 16/04/2018 4:02 AM, Zhu Yanjun wrote:
>> While a faulty cable is used or HCA firmware error, HCA device will
>> be offline. When the driver is accessing this offline device, the
>> following call trace will pop out.
>>
>> "
>> ...
>> [<ffffffff816e4842>] dump_stack+0x63/0x81
>> [<ffffffff816e459e>] panic+0xcc/0x21b
>> [<ffffffffa03e5f8a>] mlx4_enter_error_state+0xba/0xf0 [mlx4_core]
>> [<ffffffffa03e7298>] mlx4_cmd_reset_flow+0x38/0x60 [mlx4_core]
>> [<ffffffffa03e7381>] mlx4_cmd_poll+0xc1/0x2e0 [mlx4_core]
>> [<ffffffffa03e9f00>] __mlx4_cmd+0xb0/0x160 [mlx4_core]
>> [<ffffffffa0406934>] mlx4_SENSE_PORT+0x54/0xd0 [mlx4_core]
>> [<ffffffffa03f5f54>] mlx4_dev_cap+0x4a4/0xb50 [mlx4_core]
>> ...
>> "
>> In the above call trace, the function mlx4_cmd_poll calls the function
>> mlx4_cmd_post to access the HCA while HCA is offline. Then mlx4_cmd_post
>> returns an error -EIO. Per -EIO, the function mlx4_cmd_poll calls
>> mlx4_cmd_reset_flow to reset HCA. And the above call trace pops out.
>>
>> This is not reasonable. Since HCA device is offline when it is being
>> accessed, it should not be reset again.
>>
>> In this patch, since HCA is offline, the function mlx4_cmd_post returns
>> an error -EINVAL. Per -EINVAL, the function mlx4_cmd_poll directly
>> returns
>> instead of resetting HCA.
>>
>> CC: Srinivas Eeda <srinivas.eeda@oracle.com>
>> CC: Junxiao Bi <junxiao.bi@oracle.com>
>> Suggested-by: Håkon Bugge <haakon.bugge@oracle.com>
>> Signed-off-by: Zhu Yanjun <yanjun.zhu@oracle.com>
>> ---
>> drivers/net/ethernet/mellanox/mlx4/cmd.c | 8 ++++++++
>> 1 file changed, 8 insertions(+)
>>
>> diff --git a/drivers/net/ethernet/mellanox/mlx4/cmd.c
>> b/drivers/net/ethernet/mellanox/mlx4/cmd.c
>> index 6a9086d..f1c8c42 100644
>> --- a/drivers/net/ethernet/mellanox/mlx4/cmd.c
>> +++ b/drivers/net/ethernet/mellanox/mlx4/cmd.c
>> @@ -451,6 +451,8 @@ static int mlx4_cmd_post(struct mlx4_dev *dev,
>> u64 in_param, u64 out_param,
>> * Device is going through error recovery
>> * and cannot accept commands.
>> */
>> + mlx4_err(dev, "%s : Device is in error recovery.\n", __func__);
>> + ret = -EINVAL;
>> goto out;
>> }
>> @@ -657,6 +659,9 @@ static int mlx4_cmd_poll(struct mlx4_dev *dev,
>> u64 in_param, u64 *out_param,
>> }
>> out_reset:
>> + if (err == -EINVAL)
>> + goto out;
>> +
>
> See below.
>
>> if (err)
>> err = mlx4_cmd_reset_flow(dev, op, op_modifier, err);
>> out:
>> @@ -766,6 +771,9 @@ static int mlx4_cmd_wait(struct mlx4_dev *dev,
>> u64 in_param, u64 *out_param,
>> *out_param = context->out_param;
>> out_reset:
>> + if (err == -EINVAL)
>> + goto out;
>> +
>> if (err)
>
> Instead, just do here: if (err && err != -EINVAL)
>
>> err = mlx4_cmd_reset_flow(dev, op, op_modifier, err);
>> out:
>>
>
> I am not sure this does not mistakenly cover other cases that already
> exist and have (err == -EINVAL).
>
> For example, this line is hard to predict:
> err = mlx4_status_to_errno
> and later on, we might get into
> if (mlx4_closing_cmd_fatal_error(op, stat))
> which leads to out_reset.
Thanks a lot.
Sure. I agree with you that "err = mlx4_status_to_errno" and "if
(mlx4_closing_cmd_fatal_error(op, stat))" will also make "err=-EINVAL".
This will mistakenly go to out instead of resetting HCA device.
I will make a new patch to avoid the above error.
Zhu Yanjun
>
> We must have a deeper look at this.
> But a better option is, change the error indication to uniquely
> indicate "already in error recovery".
>
prev parent reply other threads:[~2018-04-18 5:46 UTC|newest]
Thread overview: 5+ messages / expand[flat|nested] mbox.gz Atom feed top
2018-04-16 1:02 [PATCH 1/1] net/mlx4_core: avoid resetting HCA when accessing an offline device Zhu Yanjun
2018-04-16 16:51 ` David Miller
2018-04-17 7:05 ` Tariq Toukan
2018-04-17 15:37 ` Tariq Toukan
2018-04-18 5:46 ` Yanjun Zhu [this message]
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=82f1c53e-f0f6-2ad6-8f70-2f5ac58d560b@oracle.com \
--to=yanjun.zhu@oracle.com \
--cc=haakon.bugge@oracle.com \
--cc=linux-rdma@vger.kernel.org \
--cc=netdev@vger.kernel.org \
--cc=tariqt@mellanox.com \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).