* [PATCH 3/3] nvme-core: fix crash when nvme_enable_aen timeout
@ 2020-08-20 3:54 Chao Leng
2020-08-20 4:30 ` Sagi Grimberg
0 siblings, 1 reply; 5+ messages in thread
From: Chao Leng @ 2020-08-20 3:54 UTC (permalink / raw)
To: linux-nvme; +Cc: linux-block, kbusch, axboe, hch, sagi, lengchao
A crash happens When we test nvme over roce with link blink. The
reason: nvme_enable_aen falsely start async_event_work when set
sync_event feature timeout, but async_event_sqe and qp of the queue
already be freeed when timeout. if async_event_work scheduling is
delayed for busy cpu, crash happens because use after free.
log:
[ 2229.253424] nvme nvme0: I/O 21 QID 0 timeout
[ 2229.253427] nvme nvme0: starting error recovery
[ 2229.354181] nvme nvme0: Failed to configure AEN (cfg 100)
[ 2229.354373] BUG: kernel NULL pointer dereference, address: 0000000000000000
[ 2229.354928] #PF: supervisor write access in kernel mode
[ 2229.357945] #PF: error_code(0x0002) - not-present page
[ 2229.361009] PGD 0 P4D 0
[ 2229.364052] Oops: 0002 [#1] SMP PTI
[ 2229.367132] CPU: 4 PID: 17561 Comm: kworker/u12:0 Kdump: loaded Tainted: G OE 5.7.8 #1
[ 2229.369124] nvme nvme0: Reconnecting in 10 seconds...
[ 2229.370412] Hardware name: Huawei RH1288 V3/BC11HGSC0,
BIOS 5.03 07/25/2018
[ 2229.370421] Workqueue: nvme-wq nvme_async_event_work [nvme_core]
[ 2229.380029] RIP: 0010:nvme_rdma_submit_async_event+0x74/0x160
[nvme_rdma]
[ 2229.383408] Code: 48 85 c0 0f 84 e4 00 00 00 48 8b 40 50 48 85 c0 74
0f b9 01 00 00 00 ba 40 00 00 00 e8 25 d5 7a f9 48 8d 7b 08 48 89 d9 31
c0 <48> c7 03 00 00 00 00 48 83 e7 f8 48 c7 43 38 00 00 00 00 48 29 f9
[ 2229.391164] RSP: 0018:ffffa864c14fbe30 EFLAGS: 00010246
[ 2229.395215] RAX: 0000000000000000 RBX: 0000000000000000 RCX: 0000000000000000
[ 2229.399478] RDX: 0000000000000040 RSI: 000000024d89bc00 RDI: 0000000000000008
[ 2229.403785] RBP: ffff9267c76b82f8 R08: 0000000000000000 R09: 0071772d656d766e
[ 2229.408223] R10: 8080808080808080 R11: 0000000000000000 R12: ffff9267bf6ba800
[ 2229.412758] R13: ffff9267ae980000 R14: 0ffff9267d6ba8a0 R15: ffff9267c76b8c20
[ 2229.417401] FS: 0000000000000000(0000) GS:ffff9267ffd00000(0000) knlGS:0000000000000000
[ 2229.422360] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 2229.427046] CR2: 0000000000000000 CR3: 0000000237188006 CR4: 00000000001606e0
[ 2229.431925] Call Trace:
[ 2229.436834] ? __switch_to_asm+0x34/0x70
[ 2229.441867] nvme_async_event_work+0x5d/0xc0 [nvme_core]
[ 2229.447057] process_one_work+0x1a7/0x370
[ 2229.452314] worker_thread+0x30/0x380
[ 2229.457634] ? max_active_store+0x80/0x80
[ 2229.463033] kthread+0x112/0x130
[ 2229.468482] ? __kthread_parkme+0x70/0x70
[ 2229.474031] ret_from_fork+0x35/0x40
nvme_enable_aen should not queue async_event_work when set aync_event
feature timeout. Based on the patch: set the flag:NVME_REQ_CANCELLED
for NVME_SC_HOST_ABORTED_CMD and NVME_SC_HOST_PATH_ERROR, check ruturn
value, if less than 0, do not queue async_event_work and return error.
Signed-off-by: Chao Leng <lengchao@huawei.com>
---
drivers/nvme/host/core.c | 14 ++++++++++----
1 file changed, 10 insertions(+), 4 deletions(-)
diff --git a/drivers/nvme/host/core.c b/drivers/nvme/host/core.c
index 74f76aa78b02..f4c347fe925a 100644
--- a/drivers/nvme/host/core.c
+++ b/drivers/nvme/host/core.c
@@ -1422,21 +1422,25 @@ EXPORT_SYMBOL_GPL(nvme_set_queue_count);
(NVME_AEN_CFG_NS_ATTR | NVME_AEN_CFG_FW_ACT | \
NVME_AEN_CFG_ANA_CHANGE | NVME_AEN_CFG_DISC_CHANGE)
-static void nvme_enable_aen(struct nvme_ctrl *ctrl)
+static int nvme_enable_aen(struct nvme_ctrl *ctrl)
{
u32 result, supported_aens = ctrl->oaes & NVME_AEN_SUPPORTED;
int status;
if (!supported_aens)
- return;
+ return 0;
status = nvme_set_features(ctrl, NVME_FEAT_ASYNC_EVENT, supported_aens,
NULL, 0, &result);
- if (status)
+ if (status) {
dev_warn(ctrl->device, "Failed to configure AEN (cfg %x)\n",
supported_aens);
+ if (status < 0)
+ return status;
+ }
queue_work(nvme_wq, &ctrl->async_event_work);
+ return 0;
}
/*
@@ -4343,12 +4347,14 @@ void nvme_start_ctrl(struct nvme_ctrl *ctrl)
{
nvme_start_keep_alive(ctrl);
- nvme_enable_aen(ctrl);
+ if (nvme_enable_aen(ctrl))
+ goto out;
if (ctrl->queue_count > 1) {
nvme_queue_scan(ctrl);
nvme_start_queues(ctrl);
}
+out:
ctrl->created = true;
}
EXPORT_SYMBOL_GPL(nvme_start_ctrl);
--
2.16.4
^ permalink raw reply related [flat|nested] 5+ messages in thread
* Re: [PATCH 3/3] nvme-core: fix crash when nvme_enable_aen timeout
2020-08-20 3:54 [PATCH 3/3] nvme-core: fix crash when nvme_enable_aen timeout Chao Leng
@ 2020-08-20 4:30 ` Sagi Grimberg
2020-08-20 6:43 ` Chao Leng
0 siblings, 1 reply; 5+ messages in thread
From: Sagi Grimberg @ 2020-08-20 4:30 UTC (permalink / raw)
To: Chao Leng, linux-nvme; +Cc: linux-block, kbusch, axboe, hch
> diff --git a/drivers/nvme/host/core.c b/drivers/nvme/host/core.c
> index 74f76aa78b02..f4c347fe925a 100644
> --- a/drivers/nvme/host/core.c
> +++ b/drivers/nvme/host/core.c
> @@ -1422,21 +1422,25 @@ EXPORT_SYMBOL_GPL(nvme_set_queue_count);
> (NVME_AEN_CFG_NS_ATTR | NVME_AEN_CFG_FW_ACT | \
> NVME_AEN_CFG_ANA_CHANGE | NVME_AEN_CFG_DISC_CHANGE)
>
> -static void nvme_enable_aen(struct nvme_ctrl *ctrl)
> +static int nvme_enable_aen(struct nvme_ctrl *ctrl)
> {
> u32 result, supported_aens = ctrl->oaes & NVME_AEN_SUPPORTED;
> int status;
>
> if (!supported_aens)
> - return;
> + return 0;
>
> status = nvme_set_features(ctrl, NVME_FEAT_ASYNC_EVENT, supported_aens,
> NULL, 0, &result);
> - if (status)
> + if (status) {
> dev_warn(ctrl->device, "Failed to configure AEN (cfg %x)\n",
> supported_aens);
> + if (status < 0)
> + return status;
Why do you need to check status < 0, you need to fail it regardless.
^ permalink raw reply [flat|nested] 5+ messages in thread
* Re: [PATCH 3/3] nvme-core: fix crash when nvme_enable_aen timeout
2020-08-20 4:30 ` Sagi Grimberg
@ 2020-08-20 6:43 ` Chao Leng
2020-08-21 7:49 ` Christoph Hellwig
0 siblings, 1 reply; 5+ messages in thread
From: Chao Leng @ 2020-08-20 6:43 UTC (permalink / raw)
To: Sagi Grimberg, linux-nvme; +Cc: linux-block, kbusch, axboe, hch
On 2020/8/20 12:30, Sagi Grimberg wrote:
>
>> diff --git a/drivers/nvme/host/core.c b/drivers/nvme/host/core.c
>> index 74f76aa78b02..f4c347fe925a 100644
>> --- a/drivers/nvme/host/core.c
>> +++ b/drivers/nvme/host/core.c
>> @@ -1422,21 +1422,25 @@ EXPORT_SYMBOL_GPL(nvme_set_queue_count);
>> (NVME_AEN_CFG_NS_ATTR | NVME_AEN_CFG_FW_ACT | \
>> NVME_AEN_CFG_ANA_CHANGE | NVME_AEN_CFG_DISC_CHANGE)
>> -static void nvme_enable_aen(struct nvme_ctrl *ctrl)
>> +static int nvme_enable_aen(struct nvme_ctrl *ctrl)
>> {
>> u32 result, supported_aens = ctrl->oaes & NVME_AEN_SUPPORTED;
>> int status;
>> if (!supported_aens)
>> - return;
>> + return 0;
>> status = nvme_set_features(ctrl, NVME_FEAT_ASYNC_EVENT, supported_aens,
>> NULL, 0, &result);
>> - if (status)
>> + if (status) {
>> dev_warn(ctrl->device, "Failed to configure AEN (cfg %x)\n",
>> supported_aens);
>> + if (status < 0)
>> + return status;
>
> Why do you need to check status < 0, you need to fail it regardless.
agree.
Just want to keep the old logic. I guess the old logic: if supported_aens
is true, the result of set features can ignore.
If there is no objection to doing so, I will resend the patch later.
^ permalink raw reply [flat|nested] 5+ messages in thread
* Re: [PATCH 3/3] nvme-core: fix crash when nvme_enable_aen timeout
2020-08-20 6:43 ` Chao Leng
@ 2020-08-21 7:49 ` Christoph Hellwig
2020-08-21 20:17 ` Sagi Grimberg
0 siblings, 1 reply; 5+ messages in thread
From: Christoph Hellwig @ 2020-08-21 7:49 UTC (permalink / raw)
To: Chao Leng; +Cc: Sagi Grimberg, linux-nvme, linux-block, kbusch, axboe, hch
On Thu, Aug 20, 2020 at 02:43:20PM +0800, Chao Leng wrote:
>>> -static void nvme_enable_aen(struct nvme_ctrl *ctrl)
>>> +static int nvme_enable_aen(struct nvme_ctrl *ctrl)
>>> {
>>> u32 result, supported_aens = ctrl->oaes & NVME_AEN_SUPPORTED;
>>> int status;
>>> if (!supported_aens)
>>> - return;
>>> + return 0;
>>> status = nvme_set_features(ctrl, NVME_FEAT_ASYNC_EVENT, supported_aens,
>>> NULL, 0, &result);
>>> - if (status)
>>> + if (status) {
>>> dev_warn(ctrl->device, "Failed to configure AEN (cfg %x)\n",
>>> supported_aens);
>>> + if (status < 0)
>>> + return status;
>>
>> Why do you need to check status < 0, you need to fail it regardless.
>
> agree.
> Just want to keep the old logic. I guess the old logic: if supported_aens
> is true, the result of set features can ignore.
>
> If there is no objection to doing so, I will resend the patch later.
In the past we've dedice to ignore real NVMe errors in various
spots as the functionality wasn't deemed critical. I think that is
pretty sloppy and we should only do that where we really have to.
^ permalink raw reply [flat|nested] 5+ messages in thread
* Re: [PATCH 3/3] nvme-core: fix crash when nvme_enable_aen timeout
2020-08-21 7:49 ` Christoph Hellwig
@ 2020-08-21 20:17 ` Sagi Grimberg
0 siblings, 0 replies; 5+ messages in thread
From: Sagi Grimberg @ 2020-08-21 20:17 UTC (permalink / raw)
To: Christoph Hellwig, Chao Leng; +Cc: linux-nvme, linux-block, kbusch, axboe
>>>> -static void nvme_enable_aen(struct nvme_ctrl *ctrl)
>>>> +static int nvme_enable_aen(struct nvme_ctrl *ctrl)
>>>> {
>>>> u32 result, supported_aens = ctrl->oaes & NVME_AEN_SUPPORTED;
>>>> int status;
>>>> if (!supported_aens)
>>>> - return;
>>>> + return 0;
>>>> status = nvme_set_features(ctrl, NVME_FEAT_ASYNC_EVENT, supported_aens,
>>>> NULL, 0, &result);
>>>> - if (status)
>>>> + if (status) {
>>>> dev_warn(ctrl->device, "Failed to configure AEN (cfg %x)\n",
>>>> supported_aens);
>>>> + if (status < 0)
>>>> + return status;
>>>
>>> Why do you need to check status < 0, you need to fail it regardless.
>>
>> agree.
>> Just want to keep the old logic. I guess the old logic: if supported_aens
>> is true, the result of set features can ignore.
>>
>> If there is no objection to doing so, I will resend the patch later.
>
> In the past we've dedice to ignore real NVMe errors in various
> spots as the functionality wasn't deemed critical. I think that is
> pretty sloppy and we should only do that where we really have to.
Agreed.
^ permalink raw reply [flat|nested] 5+ messages in thread
end of thread, other threads:[~2020-08-21 20:17 UTC | newest]
Thread overview: 5+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2020-08-20 3:54 [PATCH 3/3] nvme-core: fix crash when nvme_enable_aen timeout Chao Leng
2020-08-20 4:30 ` Sagi Grimberg
2020-08-20 6:43 ` Chao Leng
2020-08-21 7:49 ` Christoph Hellwig
2020-08-21 20:17 ` Sagi Grimberg
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox