* Re: [PATCH 1/1] nvme: fix use after free when disconnect a reconnecting ctrl
2021-11-04 7:13 ` [PATCH 1/1] nvme: fix use after free when disconnect a reconnecting ctrl Ruozhu Li
@ 2021-11-04 12:26 ` Sagi Grimberg
2021-11-04 23:23 ` James Smart
2021-11-05 1:34 ` liruozhu
2021-11-11 4:09 ` liruozhu
` (2 subsequent siblings)
3 siblings, 2 replies; 12+ messages in thread
From: Sagi Grimberg @ 2021-11-04 12:26 UTC (permalink / raw)
To: Ruozhu Li, linux-nvme
> A crash happens when I try to disconnect a reconnecting ctrl:
>
> 1) The network was cut off when the connection was just established,
> scan work hang there waiting for some IOs complete.Those IOs were
> retrying because we return BLK_STS_RESOURCE to blk in reconnecting.
>
> 2) After a while, I tried to disconnect this connection.This procedure
> also hung because it tried to obtain ctrl->scan_lock.It should be noted
> that now we have switched the controller state to NVME_CTRL_DELETING.
>
> 3) In nvme_check_ready(), we always return true when ctrl->state is
> NVME_CTRL_DELETING, so those retrying IOs were issued to the bottom
> device which was already freed.
>
> To fix this, when ctrl->state is NVME_CTRL_DELETING, issue cmd to bottom
> device only when queue state is live.If not, return host path error to blk.
>
> Signed-off-by: Ruozhu Li <liruozhu@huawei.com>
> ---
> drivers/nvme/host/core.c | 1 +
> drivers/nvme/host/nvme.h | 2 +-
> 2 files changed, 2 insertions(+), 1 deletion(-)
>
> diff --git a/drivers/nvme/host/core.c b/drivers/nvme/host/core.c
> index 838b5e2058be..752203ad7639 100644
> --- a/drivers/nvme/host/core.c
> +++ b/drivers/nvme/host/core.c
> @@ -666,6 +666,7 @@ blk_status_t nvme_fail_nonready_command(struct nvme_ctrl *ctrl,
> struct request *rq)
> {
> if (ctrl->state != NVME_CTRL_DELETING_NOIO &&
> + ctrl->state != NVME_CTRL_DELETING &&
Please explain why you need this change? As suggested by the name
only DELETING_NOIO does not accept I/O, and if we return
BLK_STS_RESOURCE we can get into an endless loop of resubmission.
> ctrl->state != NVME_CTRL_DEAD &&
> !test_bit(NVME_CTRL_FAILFAST_EXPIRED, &ctrl->flags) &&
> !blk_noretry_request(rq) && !(rq->cmd_flags & REQ_NVME_MPATH))
> diff --git a/drivers/nvme/host/nvme.h b/drivers/nvme/host/nvme.h
> index b334af8aa264..9b095ee01364 100644
> --- a/drivers/nvme/host/nvme.h
> +++ b/drivers/nvme/host/nvme.h
> @@ -709,7 +709,7 @@ static inline bool nvme_check_ready(struct nvme_ctrl *ctrl, struct request *rq,
> return true;
> if (ctrl->ops->flags & NVME_F_FABRICS &&
> ctrl->state == NVME_CTRL_DELETING)
> - return true;
> + return queue_live;
I agree with this change. I thought I've already seen this change from
James in the past.
^ permalink raw reply [flat|nested] 12+ messages in thread* Re: [PATCH 1/1] nvme: fix use after free when disconnect a reconnecting ctrl
2021-11-04 12:26 ` Sagi Grimberg
@ 2021-11-04 23:23 ` James Smart
2021-11-05 1:55 ` liruozhu
2021-11-05 1:34 ` liruozhu
1 sibling, 1 reply; 12+ messages in thread
From: James Smart @ 2021-11-04 23:23 UTC (permalink / raw)
To: Sagi Grimberg, Ruozhu Li, linux-nvme
On 11/4/2021 5:26 AM, Sagi Grimberg wrote:
>
>> A crash happens when I try to disconnect a reconnecting ctrl:
>>
>> 1) The network was cut off when the connection was just established,
>> scan work hang there waiting for some IOs complete.Those IOs were
>> retrying because we return BLK_STS_RESOURCE to blk in reconnecting.
>>
>> 2) After a while, I tried to disconnect this connection.This procedure
>> also hung because it tried to obtain ctrl->scan_lock.It should be noted
>> that now we have switched the controller state to NVME_CTRL_DELETING.
>>
>> 3) In nvme_check_ready(), we always return true when ctrl->state is
>> NVME_CTRL_DELETING, so those retrying IOs were issued to the bottom
>> device which was already freed.
>>
>> To fix this, when ctrl->state is NVME_CTRL_DELETING, issue cmd to bottom
>> device only when queue state is live.If not, return host path error to
>> blk.
>>
>> Signed-off-by: Ruozhu Li <liruozhu@huawei.com>
>> ---
>> drivers/nvme/host/core.c | 1 +
>> drivers/nvme/host/nvme.h | 2 +-
>> 2 files changed, 2 insertions(+), 1 deletion(-)
>>
>> diff --git a/drivers/nvme/host/core.c b/drivers/nvme/host/core.c
>> index 838b5e2058be..752203ad7639 100644
>> --- a/drivers/nvme/host/core.c
>> +++ b/drivers/nvme/host/core.c
>> @@ -666,6 +666,7 @@ blk_status_t nvme_fail_nonready_command(struct
>> nvme_ctrl *ctrl,
>> struct request *rq)
>> {
>> if (ctrl->state != NVME_CTRL_DELETING_NOIO &&
>> + ctrl->state != NVME_CTRL_DELETING &&
>
> Please explain why you need this change? As suggested by the name
> only DELETING_NOIO does not accept I/O, and if we return
> BLK_STS_RESOURCE we can get into an endless loop of resubmission.
Before the change below (if fabrics and DELETING, return queue_live),
when DELETING, fabrics always would have returned true and never called
the nvme_fail_nonready_command() routine.
But with the change, we now have DELETING cases where qlive is false
calling this routine. Its possible some of those may have returned
BLK_STS_RESOURCE and gotten into the endless loop. The !DELETING check
keeps the same behavior as prior while forcing the new DELETING requests
to return host_path_error.
I think the change is ok.
>
>> ctrl->state != NVME_CTRL_DEAD &&
>> !test_bit(NVME_CTRL_FAILFAST_EXPIRED, &ctrl->flags) &&
>> !blk_noretry_request(rq) && !(rq->cmd_flags & REQ_NVME_MPATH))
>> diff --git a/drivers/nvme/host/nvme.h b/drivers/nvme/host/nvme.h
>> index b334af8aa264..9b095ee01364 100644
>> --- a/drivers/nvme/host/nvme.h
>> +++ b/drivers/nvme/host/nvme.h
>> @@ -709,7 +709,7 @@ static inline bool nvme_check_ready(struct
>> nvme_ctrl *ctrl, struct request *rq,
>> return true;
>> if (ctrl->ops->flags & NVME_F_FABRICS &&
>> ctrl->state == NVME_CTRL_DELETING)
>> - return true;
>> + return queue_live;
>
> I agree with this change. I thought I've already seen this change from
> James in the past.
>
this new test was added when when nvmf_check_ready() moved to
nvme_check_ready, as fabrics need to do GET/SET_PROPERTIES for register
access on shutdown (CC, CSTS) whereas PCI doesn't. So it was keeping
the fabrics unconditional return true to let them through.
It's ok to qualify it as to whether the transport has the queue live.
-- james
^ permalink raw reply [flat|nested] 12+ messages in thread* Re: [PATCH 1/1] nvme: fix use after free when disconnect a reconnecting ctrl
2021-11-04 23:23 ` James Smart
@ 2021-11-05 1:55 ` liruozhu
0 siblings, 0 replies; 12+ messages in thread
From: liruozhu @ 2021-11-05 1:55 UTC (permalink / raw)
To: James Smart, Sagi Grimberg, linux-nvme
On 2021/11/5 7:23, James Smart wrote:
> On 11/4/2021 5:26 AM, Sagi Grimberg wrote:
>>
>>> A crash happens when I try to disconnect a reconnecting ctrl:
>>>
>>> 1) The network was cut off when the connection was just established,
>>> scan work hang there waiting for some IOs complete.Those IOs were
>>> retrying because we return BLK_STS_RESOURCE to blk in reconnecting.
>>>
>>> 2) After a while, I tried to disconnect this connection.This procedure
>>> also hung because it tried to obtain ctrl->scan_lock.It should be noted
>>> that now we have switched the controller state to NVME_CTRL_DELETING.
>>>
>>> 3) In nvme_check_ready(), we always return true when ctrl->state is
>>> NVME_CTRL_DELETING, so those retrying IOs were issued to the bottom
>>> device which was already freed.
>>>
>>> To fix this, when ctrl->state is NVME_CTRL_DELETING, issue cmd to
>>> bottom
>>> device only when queue state is live.If not, return host path error
>>> to blk.
>>>
>>> Signed-off-by: Ruozhu Li <liruozhu@huawei.com>
>>> ---
>>> drivers/nvme/host/core.c | 1 +
>>> drivers/nvme/host/nvme.h | 2 +-
>>> 2 files changed, 2 insertions(+), 1 deletion(-)
>>>
>>> diff --git a/drivers/nvme/host/core.c b/drivers/nvme/host/core.c
>>> index 838b5e2058be..752203ad7639 100644
>>> --- a/drivers/nvme/host/core.c
>>> +++ b/drivers/nvme/host/core.c
>>> @@ -666,6 +666,7 @@ blk_status_t nvme_fail_nonready_command(struct
>>> nvme_ctrl *ctrl,
>>> struct request *rq)
>>> {
>>> if (ctrl->state != NVME_CTRL_DELETING_NOIO &&
>>> + ctrl->state != NVME_CTRL_DELETING &&
>>
>> Please explain why you need this change? As suggested by the name
>> only DELETING_NOIO does not accept I/O, and if we return
>> BLK_STS_RESOURCE we can get into an endless loop of resubmission.
>
> Before the change below (if fabrics and DELETING, return queue_live),
> when DELETING, fabrics always would have returned true and never
> called the nvme_fail_nonready_command() routine.
>
> But with the change, we now have DELETING cases where qlive is false
> calling this routine. Its possible some of those may have returned
> BLK_STS_RESOURCE and gotten into the endless loop. The !DELETING check
> keeps the same behavior as prior while forcing the new DELETING
> requests to return host_path_error.
>
> I think the change is ok.
>
>
>>
>>> ctrl->state != NVME_CTRL_DEAD &&
>>> !test_bit(NVME_CTRL_FAILFAST_EXPIRED, &ctrl->flags) &&
>>> !blk_noretry_request(rq) && !(rq->cmd_flags &
>>> REQ_NVME_MPATH))
>>> diff --git a/drivers/nvme/host/nvme.h b/drivers/nvme/host/nvme.h
>>> index b334af8aa264..9b095ee01364 100644
>>> --- a/drivers/nvme/host/nvme.h
>>> +++ b/drivers/nvme/host/nvme.h
>>> @@ -709,7 +709,7 @@ static inline bool nvme_check_ready(struct
>>> nvme_ctrl *ctrl, struct request *rq,
>>> return true;
>>> if (ctrl->ops->flags & NVME_F_FABRICS &&
>>> ctrl->state == NVME_CTRL_DELETING)
>>> - return true;
>>> + return queue_live;
>>
>> I agree with this change. I thought I've already seen this change from
>> James in the past.
>>
>
> this new test was added when when nvmf_check_ready() moved to
> nvme_check_ready, as fabrics need to do GET/SET_PROPERTIES for
> register access on shutdown (CC, CSTS) whereas PCI doesn't. So it was
> keeping the fabrics unconditional return true to let them through.
>
> It's ok to qualify it as to whether the transport has the queue live.
>
> -- james
> .
Thanks for your reviewing.
-- ruozhu
.
^ permalink raw reply [flat|nested] 12+ messages in thread
* Re: [PATCH 1/1] nvme: fix use after free when disconnect a reconnecting ctrl
2021-11-04 12:26 ` Sagi Grimberg
2021-11-04 23:23 ` James Smart
@ 2021-11-05 1:34 ` liruozhu
2021-11-13 10:04 ` liruozhu
1 sibling, 1 reply; 12+ messages in thread
From: liruozhu @ 2021-11-05 1:34 UTC (permalink / raw)
To: Sagi Grimberg, linux-nvme
On 2021/11/4 20:26, Sagi Grimberg wrote:
>
>> A crash happens when I try to disconnect a reconnecting ctrl:
>>
>> 1) The network was cut off when the connection was just established,
>> scan work hang there waiting for some IOs complete.Those IOs were
>> retrying because we return BLK_STS_RESOURCE to blk in reconnecting.
>>
>> 2) After a while, I tried to disconnect this connection.This procedure
>> also hung because it tried to obtain ctrl->scan_lock.It should be noted
>> that now we have switched the controller state to NVME_CTRL_DELETING.
>>
>> 3) In nvme_check_ready(), we always return true when ctrl->state is
>> NVME_CTRL_DELETING, so those retrying IOs were issued to the bottom
>> device which was already freed.
>>
>> To fix this, when ctrl->state is NVME_CTRL_DELETING, issue cmd to bottom
>> device only when queue state is live.If not, return host path error
>> to blk.
>>
>> Signed-off-by: Ruozhu Li <liruozhu@huawei.com>
>> ---
>> drivers/nvme/host/core.c | 1 +
>> drivers/nvme/host/nvme.h | 2 +-
>> 2 files changed, 2 insertions(+), 1 deletion(-)
>>
>> diff --git a/drivers/nvme/host/core.c b/drivers/nvme/host/core.c
>> index 838b5e2058be..752203ad7639 100644
>> --- a/drivers/nvme/host/core.c
>> +++ b/drivers/nvme/host/core.c
>> @@ -666,6 +666,7 @@ blk_status_t nvme_fail_nonready_command(struct
>> nvme_ctrl *ctrl,
>> struct request *rq)
>> {
>> if (ctrl->state != NVME_CTRL_DELETING_NOIO &&
>> + ctrl->state != NVME_CTRL_DELETING &&
>
> Please explain why you need this change? As suggested by the name
> only DELETING_NOIO does not accept I/O, and if we return
> BLK_STS_RESOURCE we can get into an endless loop of resubmission.
I just added the handling of the DELETING state here, did not modify the
DELETING_NOIO case.
Thanks,
Ruozhu
>
>> ctrl->state != NVME_CTRL_DEAD &&
>> !test_bit(NVME_CTRL_FAILFAST_EXPIRED, &ctrl->flags) &&
>> !blk_noretry_request(rq) && !(rq->cmd_flags & REQ_NVME_MPATH))
>> diff --git a/drivers/nvme/host/nvme.h b/drivers/nvme/host/nvme.h
>> index b334af8aa264..9b095ee01364 100644
>> --- a/drivers/nvme/host/nvme.h
>> +++ b/drivers/nvme/host/nvme.h
>> @@ -709,7 +709,7 @@ static inline bool nvme_check_ready(struct
>> nvme_ctrl *ctrl, struct request *rq,
>> return true;
>> if (ctrl->ops->flags & NVME_F_FABRICS &&
>> ctrl->state == NVME_CTRL_DELETING)
>> - return true;
>> + return queue_live;
>
> I agree with this change. I thought I've already seen this change from
> James in the past.
> .
^ permalink raw reply [flat|nested] 12+ messages in thread* Re: [PATCH 1/1] nvme: fix use after free when disconnect a reconnecting ctrl
2021-11-05 1:34 ` liruozhu
@ 2021-11-13 10:04 ` liruozhu
2021-11-14 10:20 ` Sagi Grimberg
0 siblings, 1 reply; 12+ messages in thread
From: liruozhu @ 2021-11-13 10:04 UTC (permalink / raw)
To: Sagi Grimberg, linux-nvme
Hi sagi,
On 2021/11/5 9:34, liruozhu wrote:
> On 2021/11/4 20:26, Sagi Grimberg wrote:
>>
>>> A crash happens when I try to disconnect a reconnecting ctrl:
>>>
>>> 1) The network was cut off when the connection was just established,
>>> scan work hang there waiting for some IOs complete.Those IOs were
>>> retrying because we return BLK_STS_RESOURCE to blk in reconnecting.
>>>
>>> 2) After a while, I tried to disconnect this connection.This procedure
>>> also hung because it tried to obtain ctrl->scan_lock.It should be noted
>>> that now we have switched the controller state to NVME_CTRL_DELETING.
>>>
>>> 3) In nvme_check_ready(), we always return true when ctrl->state is
>>> NVME_CTRL_DELETING, so those retrying IOs were issued to the bottom
>>> device which was already freed.
>>>
>>> To fix this, when ctrl->state is NVME_CTRL_DELETING, issue cmd to
>>> bottom
>>> device only when queue state is live.If not, return host path error
>>> to blk.
>>>
>>> Signed-off-by: Ruozhu Li <liruozhu@huawei.com>
>>> ---
>>> drivers/nvme/host/core.c | 1 +
>>> drivers/nvme/host/nvme.h | 2 +-
>>> 2 files changed, 2 insertions(+), 1 deletion(-)
>>>
>>> diff --git a/drivers/nvme/host/core.c b/drivers/nvme/host/core.c
>>> index 838b5e2058be..752203ad7639 100644
>>> --- a/drivers/nvme/host/core.c
>>> +++ b/drivers/nvme/host/core.c
>>> @@ -666,6 +666,7 @@ blk_status_t nvme_fail_nonready_command(struct
>>> nvme_ctrl *ctrl,
>>> struct request *rq)
>>> {
>>> if (ctrl->state != NVME_CTRL_DELETING_NOIO &&
>>> + ctrl->state != NVME_CTRL_DELETING &&
>>
>> Please explain why you need this change? As suggested by the name
>> only DELETING_NOIO does not accept I/O, and if we return
>> BLK_STS_RESOURCE we can get into an endless loop of resubmission.
>
> I just added the handling of the DELETING state here, did not modify
> the DELETING_NOIO case.
>
> Thanks,
> Ruozhu
>
I'm not sure if I explained it clearly, my English is not very good.
If you think there is still a problem with this patch, please tell me.
Thanks,
Ruozhu
>>
>>> ctrl->state != NVME_CTRL_DEAD &&
>>> !test_bit(NVME_CTRL_FAILFAST_EXPIRED, &ctrl->flags) &&
>>> !blk_noretry_request(rq) && !(rq->cmd_flags &
>>> REQ_NVME_MPATH))
>>> diff --git a/drivers/nvme/host/nvme.h b/drivers/nvme/host/nvme.h
>>> index b334af8aa264..9b095ee01364 100644
>>> --- a/drivers/nvme/host/nvme.h
>>> +++ b/drivers/nvme/host/nvme.h
>>> @@ -709,7 +709,7 @@ static inline bool nvme_check_ready(struct
>>> nvme_ctrl *ctrl, struct request *rq,
>>> return true;
>>> if (ctrl->ops->flags & NVME_F_FABRICS &&
>>> ctrl->state == NVME_CTRL_DELETING)
>>> - return true;
>>> + return queue_live;
>>
>> I agree with this change. I thought I've already seen this change from
>> James in the past.
>> .
^ permalink raw reply [flat|nested] 12+ messages in thread* Re: [PATCH 1/1] nvme: fix use after free when disconnect a reconnecting ctrl
2021-11-13 10:04 ` liruozhu
@ 2021-11-14 10:20 ` Sagi Grimberg
0 siblings, 0 replies; 12+ messages in thread
From: Sagi Grimberg @ 2021-11-14 10:20 UTC (permalink / raw)
To: liruozhu, linux-nvme
>>>> diff --git a/drivers/nvme/host/core.c b/drivers/nvme/host/core.c
>>>> index 838b5e2058be..752203ad7639 100644
>>>> --- a/drivers/nvme/host/core.c
>>>> +++ b/drivers/nvme/host/core.c
>>>> @@ -666,6 +666,7 @@ blk_status_t nvme_fail_nonready_command(struct
>>>> nvme_ctrl *ctrl,
>>>> struct request *rq)
>>>> {
>>>> if (ctrl->state != NVME_CTRL_DELETING_NOIO &&
>>>> + ctrl->state != NVME_CTRL_DELETING &&
>>>
>>> Please explain why you need this change? As suggested by the name
>>> only DELETING_NOIO does not accept I/O, and if we return
>>> BLK_STS_RESOURCE we can get into an endless loop of resubmission.
>>
>> I just added the handling of the DELETING state here, did not modify
>> the DELETING_NOIO case.
>>
>> Thanks,
>> Ruozhu
>>
> I'm not sure if I explained it clearly, my English is not very good.
>
> If you think there is still a problem with this patch, please tell me.
Naa, re-thinking this I think it's reasonable to complete the command
for DELETING if the queue is not live...
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
^ permalink raw reply [flat|nested] 12+ messages in thread
* Re: [PATCH 1/1] nvme: fix use after free when disconnect a reconnecting ctrl
2021-11-04 7:13 ` [PATCH 1/1] nvme: fix use after free when disconnect a reconnecting ctrl Ruozhu Li
2021-11-04 12:26 ` Sagi Grimberg
@ 2021-11-11 4:09 ` liruozhu
2021-11-25 3:20 ` liruozhu
2021-12-07 12:45 ` liruozhu
3 siblings, 0 replies; 12+ messages in thread
From: liruozhu @ 2021-11-11 4:09 UTC (permalink / raw)
To: linux-nvme; +Cc: sagi, James Smart
Hi all,
Are there any other suggestions for this problem?
Thanks,
Ruozhu
On 2021/11/4 15:13, Ruozhu Li wrote:
> A crash happens when I try to disconnect a reconnecting ctrl:
>
> 1) The network was cut off when the connection was just established,
> scan work hang there waiting for some IOs complete.Those IOs were
> retrying because we return BLK_STS_RESOURCE to blk in reconnecting.
>
> 2) After a while, I tried to disconnect this connection.This procedure
> also hung because it tried to obtain ctrl->scan_lock.It should be noted
> that now we have switched the controller state to NVME_CTRL_DELETING.
>
> 3) In nvme_check_ready(), we always return true when ctrl->state is
> NVME_CTRL_DELETING, so those retrying IOs were issued to the bottom
> device which was already freed.
>
> To fix this, when ctrl->state is NVME_CTRL_DELETING, issue cmd to bottom
> device only when queue state is live.If not, return host path error to blk.
>
> Signed-off-by: Ruozhu Li <liruozhu@huawei.com>
> ---
> drivers/nvme/host/core.c | 1 +
> drivers/nvme/host/nvme.h | 2 +-
> 2 files changed, 2 insertions(+), 1 deletion(-)
>
> diff --git a/drivers/nvme/host/core.c b/drivers/nvme/host/core.c
> index 838b5e2058be..752203ad7639 100644
> --- a/drivers/nvme/host/core.c
> +++ b/drivers/nvme/host/core.c
> @@ -666,6 +666,7 @@ blk_status_t nvme_fail_nonready_command(struct nvme_ctrl *ctrl,
> struct request *rq)
> {
> if (ctrl->state != NVME_CTRL_DELETING_NOIO &&
> + ctrl->state != NVME_CTRL_DELETING &&
> ctrl->state != NVME_CTRL_DEAD &&
> !test_bit(NVME_CTRL_FAILFAST_EXPIRED, &ctrl->flags) &&
> !blk_noretry_request(rq) && !(rq->cmd_flags & REQ_NVME_MPATH))
> diff --git a/drivers/nvme/host/nvme.h b/drivers/nvme/host/nvme.h
> index b334af8aa264..9b095ee01364 100644
> --- a/drivers/nvme/host/nvme.h
> +++ b/drivers/nvme/host/nvme.h
> @@ -709,7 +709,7 @@ static inline bool nvme_check_ready(struct nvme_ctrl *ctrl, struct request *rq,
> return true;
> if (ctrl->ops->flags & NVME_F_FABRICS &&
> ctrl->state == NVME_CTRL_DELETING)
> - return true;
> + return queue_live;
> return __nvme_check_ready(ctrl, rq, queue_live);
> }
> int nvme_submit_sync_cmd(struct request_queue *q, struct nvme_command *cmd,
^ permalink raw reply [flat|nested] 12+ messages in thread* Re: [PATCH 1/1] nvme: fix use after free when disconnect a reconnecting ctrl
2021-11-04 7:13 ` [PATCH 1/1] nvme: fix use after free when disconnect a reconnecting ctrl Ruozhu Li
2021-11-04 12:26 ` Sagi Grimberg
2021-11-11 4:09 ` liruozhu
@ 2021-11-25 3:20 ` liruozhu
2021-12-07 12:45 ` liruozhu
3 siblings, 0 replies; 12+ messages in thread
From: liruozhu @ 2021-11-25 3:20 UTC (permalink / raw)
To: linux-nvme, hch
Ping...
On 2021/11/4 15:13, Ruozhu Li wrote:
> A crash happens when I try to disconnect a reconnecting ctrl:
>
> 1) The network was cut off when the connection was just established,
> scan work hang there waiting for some IOs complete.Those IOs were
> retrying because we return BLK_STS_RESOURCE to blk in reconnecting.
>
> 2) After a while, I tried to disconnect this connection.This procedure
> also hung because it tried to obtain ctrl->scan_lock.It should be noted
> that now we have switched the controller state to NVME_CTRL_DELETING.
>
> 3) In nvme_check_ready(), we always return true when ctrl->state is
> NVME_CTRL_DELETING, so those retrying IOs were issued to the bottom
> device which was already freed.
>
> To fix this, when ctrl->state is NVME_CTRL_DELETING, issue cmd to bottom
> device only when queue state is live.If not, return host path error to blk.
>
> Signed-off-by: Ruozhu Li <liruozhu@huawei.com>
> ---
> drivers/nvme/host/core.c | 1 +
> drivers/nvme/host/nvme.h | 2 +-
> 2 files changed, 2 insertions(+), 1 deletion(-)
>
> diff --git a/drivers/nvme/host/core.c b/drivers/nvme/host/core.c
> index 838b5e2058be..752203ad7639 100644
> --- a/drivers/nvme/host/core.c
> +++ b/drivers/nvme/host/core.c
> @@ -666,6 +666,7 @@ blk_status_t nvme_fail_nonready_command(struct nvme_ctrl *ctrl,
> struct request *rq)
> {
> if (ctrl->state != NVME_CTRL_DELETING_NOIO &&
> + ctrl->state != NVME_CTRL_DELETING &&
> ctrl->state != NVME_CTRL_DEAD &&
> !test_bit(NVME_CTRL_FAILFAST_EXPIRED, &ctrl->flags) &&
> !blk_noretry_request(rq) && !(rq->cmd_flags & REQ_NVME_MPATH))
> diff --git a/drivers/nvme/host/nvme.h b/drivers/nvme/host/nvme.h
> index b334af8aa264..9b095ee01364 100644
> --- a/drivers/nvme/host/nvme.h
> +++ b/drivers/nvme/host/nvme.h
> @@ -709,7 +709,7 @@ static inline bool nvme_check_ready(struct nvme_ctrl *ctrl, struct request *rq,
> return true;
> if (ctrl->ops->flags & NVME_F_FABRICS &&
> ctrl->state == NVME_CTRL_DELETING)
> - return true;
> + return queue_live;
> return __nvme_check_ready(ctrl, rq, queue_live);
> }
> int nvme_submit_sync_cmd(struct request_queue *q, struct nvme_command *cmd,
^ permalink raw reply [flat|nested] 12+ messages in thread* Re: [PATCH 1/1] nvme: fix use after free when disconnect a reconnecting ctrl
2021-11-04 7:13 ` [PATCH 1/1] nvme: fix use after free when disconnect a reconnecting ctrl Ruozhu Li
` (2 preceding siblings ...)
2021-11-25 3:20 ` liruozhu
@ 2021-12-07 12:45 ` liruozhu
2021-12-07 17:23 ` Christoph Hellwig
3 siblings, 1 reply; 12+ messages in thread
From: liruozhu @ 2021-12-07 12:45 UTC (permalink / raw)
To: hch; +Cc: linux-nvme
On 2021/11/4 15:13, Ruozhu Li wrote:
> A crash happens when I try to disconnect a reconnecting ctrl:
>
> 1) The network was cut off when the connection was just established,
> scan work hang there waiting for some IOs complete.Those IOs were
> retrying because we return BLK_STS_RESOURCE to blk in reconnecting.
>
> 2) After a while, I tried to disconnect this connection.This procedure
> also hung because it tried to obtain ctrl->scan_lock.It should be noted
> that now we have switched the controller state to NVME_CTRL_DELETING.
>
> 3) In nvme_check_ready(), we always return true when ctrl->state is
> NVME_CTRL_DELETING, so those retrying IOs were issued to the bottom
> device which was already freed.
>
> To fix this, when ctrl->state is NVME_CTRL_DELETING, issue cmd to bottom
> device only when queue state is live.If not, return host path error to blk.
>
> Signed-off-by: Ruozhu Li <liruozhu@huawei.com>
> ---
> drivers/nvme/host/core.c | 1 +
> drivers/nvme/host/nvme.h | 2 +-
> 2 files changed, 2 insertions(+), 1 deletion(-)
>
> diff --git a/drivers/nvme/host/core.c b/drivers/nvme/host/core.c
> index 838b5e2058be..752203ad7639 100644
> --- a/drivers/nvme/host/core.c
> +++ b/drivers/nvme/host/core.c
> @@ -666,6 +666,7 @@ blk_status_t nvme_fail_nonready_command(struct nvme_ctrl *ctrl,
> struct request *rq)
> {
> if (ctrl->state != NVME_CTRL_DELETING_NOIO &&
> + ctrl->state != NVME_CTRL_DELETING &&
> ctrl->state != NVME_CTRL_DEAD &&
> !test_bit(NVME_CTRL_FAILFAST_EXPIRED, &ctrl->flags) &&
> !blk_noretry_request(rq) && !(rq->cmd_flags & REQ_NVME_MPATH))
> diff --git a/drivers/nvme/host/nvme.h b/drivers/nvme/host/nvme.h
> index b334af8aa264..9b095ee01364 100644
> --- a/drivers/nvme/host/nvme.h
> +++ b/drivers/nvme/host/nvme.h
> @@ -709,7 +709,7 @@ static inline bool nvme_check_ready(struct nvme_ctrl *ctrl, struct request *rq,
> return true;
> if (ctrl->ops->flags & NVME_F_FABRICS &&
> ctrl->state == NVME_CTRL_DELETING)
> - return true;
> + return queue_live;
> return __nvme_check_ready(ctrl, rq, queue_live);
> }
> int nvme_submit_sync_cmd(struct request_queue *q, struct nvme_command *cmd,
Hi Christoph,
friendly ping...
thanks,
Ruozhu
^ permalink raw reply [flat|nested] 12+ messages in thread