public inbox for linux-block@vger.kernel.org
 help / color / mirror / Atom feed
* [PATCH 3/3] nvme-core: fix crash when nvme_enable_aen timeout
@ 2020-08-20  3:54 Chao Leng
  2020-08-20  4:30 ` Sagi Grimberg
  0 siblings, 1 reply; 5+ messages in thread
From: Chao Leng @ 2020-08-20  3:54 UTC (permalink / raw)
  To: linux-nvme; +Cc: linux-block, kbusch, axboe, hch, sagi, lengchao

A crash happens When we test nvme over roce with link blink. The
reason: nvme_enable_aen falsely start async_event_work when set
sync_event feature timeout, but async_event_sqe and qp of the queue
already be freeed when timeout. if async_event_work scheduling is
delayed for busy cpu, crash happens because use after free.

log:
[ 2229.253424] nvme nvme0: I/O 21 QID 0 timeout
[ 2229.253427] nvme nvme0: starting error recovery
[ 2229.354181] nvme nvme0: Failed to configure AEN (cfg 100)
[ 2229.354373] BUG: kernel NULL pointer dereference, address: 0000000000000000
[ 2229.354928] #PF: supervisor write access in kernel mode
[ 2229.357945] #PF: error_code(0x0002) - not-present page
[ 2229.361009] PGD 0 P4D 0
[ 2229.364052] Oops: 0002 [#1] SMP PTI
[ 2229.367132] CPU: 4 PID: 17561 Comm: kworker/u12:0 Kdump: loaded Tainted: G           OE     5.7.8 #1
[ 2229.369124] nvme nvme0: Reconnecting in 10 seconds...
[ 2229.370412] Hardware name: Huawei RH1288 V3/BC11HGSC0,
BIOS 5.03 07/25/2018
[ 2229.370421] Workqueue: nvme-wq nvme_async_event_work [nvme_core]
[ 2229.380029] RIP: 0010:nvme_rdma_submit_async_event+0x74/0x160
[nvme_rdma]
[ 2229.383408] Code: 48 85 c0 0f 84 e4 00 00 00 48 8b 40 50 48 85 c0 74
0f b9 01 00 00 00 ba 40 00 00 00 e8 25 d5 7a f9 48 8d 7b 08 48 89 d9 31
c0 <48> c7 03 00 00 00 00 48 83 e7 f8 48 c7 43 38 00 00 00 00 48 29 f9
[ 2229.391164] RSP: 0018:ffffa864c14fbe30 EFLAGS: 00010246
[ 2229.395215] RAX: 0000000000000000 RBX: 0000000000000000 RCX: 0000000000000000
[ 2229.399478] RDX: 0000000000000040 RSI: 000000024d89bc00 RDI: 0000000000000008
[ 2229.403785] RBP: ffff9267c76b82f8 R08: 0000000000000000 R09: 0071772d656d766e
[ 2229.408223] R10: 8080808080808080 R11: 0000000000000000 R12: ffff9267bf6ba800
[ 2229.412758] R13: ffff9267ae980000 R14: 0ffff9267d6ba8a0 R15: ffff9267c76b8c20
[ 2229.417401] FS:  0000000000000000(0000) GS:ffff9267ffd00000(0000) knlGS:0000000000000000
[ 2229.422360] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 2229.427046] CR2: 0000000000000000 CR3: 0000000237188006 CR4: 00000000001606e0
[ 2229.431925] Call Trace:
[ 2229.436834]  ? __switch_to_asm+0x34/0x70
[ 2229.441867]  nvme_async_event_work+0x5d/0xc0 [nvme_core]
[ 2229.447057]  process_one_work+0x1a7/0x370
[ 2229.452314]  worker_thread+0x30/0x380
[ 2229.457634]  ? max_active_store+0x80/0x80
[ 2229.463033]  kthread+0x112/0x130
[ 2229.468482]  ? __kthread_parkme+0x70/0x70
[ 2229.474031]  ret_from_fork+0x35/0x40

nvme_enable_aen should not queue async_event_work when set aync_event
feature timeout. Based on the patch: set the flag:NVME_REQ_CANCELLED
for NVME_SC_HOST_ABORTED_CMD and NVME_SC_HOST_PATH_ERROR, check ruturn
value, if less than 0, do not queue async_event_work and return error.

Signed-off-by: Chao Leng <lengchao@huawei.com>
---
 drivers/nvme/host/core.c | 14 ++++++++++----
 1 file changed, 10 insertions(+), 4 deletions(-)

diff --git a/drivers/nvme/host/core.c b/drivers/nvme/host/core.c
index 74f76aa78b02..f4c347fe925a 100644
--- a/drivers/nvme/host/core.c
+++ b/drivers/nvme/host/core.c
@@ -1422,21 +1422,25 @@ EXPORT_SYMBOL_GPL(nvme_set_queue_count);
 	(NVME_AEN_CFG_NS_ATTR | NVME_AEN_CFG_FW_ACT | \
 	 NVME_AEN_CFG_ANA_CHANGE | NVME_AEN_CFG_DISC_CHANGE)
 
-static void nvme_enable_aen(struct nvme_ctrl *ctrl)
+static int nvme_enable_aen(struct nvme_ctrl *ctrl)
 {
 	u32 result, supported_aens = ctrl->oaes & NVME_AEN_SUPPORTED;
 	int status;
 
 	if (!supported_aens)
-		return;
+		return 0;
 
 	status = nvme_set_features(ctrl, NVME_FEAT_ASYNC_EVENT, supported_aens,
 			NULL, 0, &result);
-	if (status)
+	if (status) {
 		dev_warn(ctrl->device, "Failed to configure AEN (cfg %x)\n",
 			 supported_aens);
+		if (status < 0)
+			return status;
+	}
 
 	queue_work(nvme_wq, &ctrl->async_event_work);
+	return 0;
 }
 
 /*
@@ -4343,12 +4347,14 @@ void nvme_start_ctrl(struct nvme_ctrl *ctrl)
 {
 	nvme_start_keep_alive(ctrl);
 
-	nvme_enable_aen(ctrl);
+	if (nvme_enable_aen(ctrl))
+		goto out;
 
 	if (ctrl->queue_count > 1) {
 		nvme_queue_scan(ctrl);
 		nvme_start_queues(ctrl);
 	}
+out:
 	ctrl->created = true;
 }
 EXPORT_SYMBOL_GPL(nvme_start_ctrl);
-- 
2.16.4


^ permalink raw reply related	[flat|nested] 5+ messages in thread

* Re: [PATCH 3/3] nvme-core: fix crash when nvme_enable_aen timeout
  2020-08-20  3:54 [PATCH 3/3] nvme-core: fix crash when nvme_enable_aen timeout Chao Leng
@ 2020-08-20  4:30 ` Sagi Grimberg
  2020-08-20  6:43   ` Chao Leng
  0 siblings, 1 reply; 5+ messages in thread
From: Sagi Grimberg @ 2020-08-20  4:30 UTC (permalink / raw)
  To: Chao Leng, linux-nvme; +Cc: linux-block, kbusch, axboe, hch


> diff --git a/drivers/nvme/host/core.c b/drivers/nvme/host/core.c
> index 74f76aa78b02..f4c347fe925a 100644
> --- a/drivers/nvme/host/core.c
> +++ b/drivers/nvme/host/core.c
> @@ -1422,21 +1422,25 @@ EXPORT_SYMBOL_GPL(nvme_set_queue_count);
>   	(NVME_AEN_CFG_NS_ATTR | NVME_AEN_CFG_FW_ACT | \
>   	 NVME_AEN_CFG_ANA_CHANGE | NVME_AEN_CFG_DISC_CHANGE)
>   
> -static void nvme_enable_aen(struct nvme_ctrl *ctrl)
> +static int nvme_enable_aen(struct nvme_ctrl *ctrl)
>   {
>   	u32 result, supported_aens = ctrl->oaes & NVME_AEN_SUPPORTED;
>   	int status;
>   
>   	if (!supported_aens)
> -		return;
> +		return 0;
>   
>   	status = nvme_set_features(ctrl, NVME_FEAT_ASYNC_EVENT, supported_aens,
>   			NULL, 0, &result);
> -	if (status)
> +	if (status) {
>   		dev_warn(ctrl->device, "Failed to configure AEN (cfg %x)\n",
>   			 supported_aens);
> +		if (status < 0)
> +			return status;

Why do you need to check status < 0, you need to fail it regardless.

^ permalink raw reply	[flat|nested] 5+ messages in thread

* Re: [PATCH 3/3] nvme-core: fix crash when nvme_enable_aen timeout
  2020-08-20  4:30 ` Sagi Grimberg
@ 2020-08-20  6:43   ` Chao Leng
  2020-08-21  7:49     ` Christoph Hellwig
  0 siblings, 1 reply; 5+ messages in thread
From: Chao Leng @ 2020-08-20  6:43 UTC (permalink / raw)
  To: Sagi Grimberg, linux-nvme; +Cc: linux-block, kbusch, axboe, hch



On 2020/8/20 12:30, Sagi Grimberg wrote:
> 
>> diff --git a/drivers/nvme/host/core.c b/drivers/nvme/host/core.c
>> index 74f76aa78b02..f4c347fe925a 100644
>> --- a/drivers/nvme/host/core.c
>> +++ b/drivers/nvme/host/core.c
>> @@ -1422,21 +1422,25 @@ EXPORT_SYMBOL_GPL(nvme_set_queue_count);
>>       (NVME_AEN_CFG_NS_ATTR | NVME_AEN_CFG_FW_ACT | \
>>        NVME_AEN_CFG_ANA_CHANGE | NVME_AEN_CFG_DISC_CHANGE)
>> -static void nvme_enable_aen(struct nvme_ctrl *ctrl)
>> +static int nvme_enable_aen(struct nvme_ctrl *ctrl)
>>   {
>>       u32 result, supported_aens = ctrl->oaes & NVME_AEN_SUPPORTED;
>>       int status;
>>       if (!supported_aens)
>> -        return;
>> +        return 0;
>>       status = nvme_set_features(ctrl, NVME_FEAT_ASYNC_EVENT, supported_aens,
>>               NULL, 0, &result);
>> -    if (status)
>> +    if (status) {
>>           dev_warn(ctrl->device, "Failed to configure AEN (cfg %x)\n",
>>                supported_aens);
>> +        if (status < 0)
>> +            return status;
> 
> Why do you need to check status < 0, you need to fail it regardless.

agree.
Just want to keep the old logic. I guess the old logic: if supported_aens
is true, the result of set features can ignore.

If there is no objection to doing so, I will resend the patch later.

^ permalink raw reply	[flat|nested] 5+ messages in thread

* Re: [PATCH 3/3] nvme-core: fix crash when nvme_enable_aen timeout
  2020-08-20  6:43   ` Chao Leng
@ 2020-08-21  7:49     ` Christoph Hellwig
  2020-08-21 20:17       ` Sagi Grimberg
  0 siblings, 1 reply; 5+ messages in thread
From: Christoph Hellwig @ 2020-08-21  7:49 UTC (permalink / raw)
  To: Chao Leng; +Cc: Sagi Grimberg, linux-nvme, linux-block, kbusch, axboe, hch

On Thu, Aug 20, 2020 at 02:43:20PM +0800, Chao Leng wrote:
>>> -static void nvme_enable_aen(struct nvme_ctrl *ctrl)
>>> +static int nvme_enable_aen(struct nvme_ctrl *ctrl)
>>>   {
>>>       u32 result, supported_aens = ctrl->oaes & NVME_AEN_SUPPORTED;
>>>       int status;
>>>       if (!supported_aens)
>>> -        return;
>>> +        return 0;
>>>       status = nvme_set_features(ctrl, NVME_FEAT_ASYNC_EVENT, supported_aens,
>>>               NULL, 0, &result);
>>> -    if (status)
>>> +    if (status) {
>>>           dev_warn(ctrl->device, "Failed to configure AEN (cfg %x)\n",
>>>                supported_aens);
>>> +        if (status < 0)
>>> +            return status;
>>
>> Why do you need to check status < 0, you need to fail it regardless.
>
> agree.
> Just want to keep the old logic. I guess the old logic: if supported_aens
> is true, the result of set features can ignore.
>
> If there is no objection to doing so, I will resend the patch later.

In the past we've dedice to ignore real NVMe errors in various
spots as the functionality wasn't deemed critical.  I think that is
pretty sloppy and we should only do that where we really have to.

^ permalink raw reply	[flat|nested] 5+ messages in thread

* Re: [PATCH 3/3] nvme-core: fix crash when nvme_enable_aen timeout
  2020-08-21  7:49     ` Christoph Hellwig
@ 2020-08-21 20:17       ` Sagi Grimberg
  0 siblings, 0 replies; 5+ messages in thread
From: Sagi Grimberg @ 2020-08-21 20:17 UTC (permalink / raw)
  To: Christoph Hellwig, Chao Leng; +Cc: linux-nvme, linux-block, kbusch, axboe


>>>> -static void nvme_enable_aen(struct nvme_ctrl *ctrl)
>>>> +static int nvme_enable_aen(struct nvme_ctrl *ctrl)
>>>>    {
>>>>        u32 result, supported_aens = ctrl->oaes & NVME_AEN_SUPPORTED;
>>>>        int status;
>>>>        if (!supported_aens)
>>>> -        return;
>>>> +        return 0;
>>>>        status = nvme_set_features(ctrl, NVME_FEAT_ASYNC_EVENT, supported_aens,
>>>>                NULL, 0, &result);
>>>> -    if (status)
>>>> +    if (status) {
>>>>            dev_warn(ctrl->device, "Failed to configure AEN (cfg %x)\n",
>>>>                 supported_aens);
>>>> +        if (status < 0)
>>>> +            return status;
>>>
>>> Why do you need to check status < 0, you need to fail it regardless.
>>
>> agree.
>> Just want to keep the old logic. I guess the old logic: if supported_aens
>> is true, the result of set features can ignore.
>>
>> If there is no objection to doing so, I will resend the patch later.
> 
> In the past we've dedice to ignore real NVMe errors in various
> spots as the functionality wasn't deemed critical.  I think that is
> pretty sloppy and we should only do that where we really have to.

Agreed.

^ permalink raw reply	[flat|nested] 5+ messages in thread

end of thread, other threads:[~2020-08-21 20:17 UTC | newest]

Thread overview: 5+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2020-08-20  3:54 [PATCH 3/3] nvme-core: fix crash when nvme_enable_aen timeout Chao Leng
2020-08-20  4:30 ` Sagi Grimberg
2020-08-20  6:43   ` Chao Leng
2020-08-21  7:49     ` Christoph Hellwig
2020-08-21 20:17       ` Sagi Grimberg

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox