* Should NVME_SC_INVALID_NS be translated to BLK_STS_IOERR instead of BLK_STS_NOTSUPP so that multipath(both native and dm) can failover on the failure?
@ 2023-12-04  7:58 Jirong Feng
  2023-12-04  8:47 ` Sagi Grimberg
  2023-12-05  4:37 ` Keith Busch
  0 siblings, 2 replies; 28+ messages in thread

From: Jirong Feng @ 2023-12-04 7:58 UTC (permalink / raw)
To: Keith Busch, Jens Axboe, Christoph Hellwig, Sagi Grimberg
Cc: linux-nvme, peng.xiao

Hi all,

I have two storage servers, each with an NVMe SSD. Recently I've been trying nvmet-tcp with DRBD; the steps are:

1. Configure DRBD for the two SSDs in two-primary mode, so that each server can accept IO on its DRBD device.
2. On each server, add the corresponding DRBD device to the nvmet subsystem with the same device UUID, so that multipath on the host side can group them into one device (my fabric type is tcp).
3. On the client host, nvme discover & connect to both servers, making sure the DM multipath device is generated and both paths are online.
4. Run fio randread on the DM device continuously.
5. On the server whose multipath status is active, under the nvmet namespace configfs directory, execute "echo 0 > enable" to disable the namespace.

What I expect is that IO is automatically retried and switched to the other storage server by multipath, and fio goes on. But what I actually see is an "Operation not supported" error, and fio fails and stops. I've also tried an iSCSI target: after I delete the mapped LUN from the ACL, fio continues running without any error.

My kernel version is 4.18.0-147.5.1 (RHEL 8.1). After checking out the kernel code, I found that:

1. On the target side, nvmet returns NVME_SC_INVALID_NS to the host because the namespace is not found.
2. On the host side, the nvme driver translates this error to BLK_STS_NOTSUPP for the block layer.
3. Multipath calls blk_path_error() to decide whether to retry.
4. In blk_path_error(), BLK_STS_NOTSUPP is not considered a path error, so it returns false and multipath does not retry.
I've also checked out the master branch from origin; the behavior is almost the same. The iSCSI target goes through a similar process; the only difference is that TCM_NON_EXISTENT_LUN is translated to BLK_STS_IOERR, which blk_path_error() does consider a path error.

So my question is as in the subject: is it reasonable to translate NVME_SC_INVALID_NS to BLK_STS_IOERR, just like the iSCSI target does? Should multipath failover on this error?

Thanks & Best Regards,
Jirong

^ permalink raw reply	[flat|nested] 28+ messages in thread
* Re: Should NVME_SC_INVALID_NS be translated to BLK_STS_IOERR instead of BLK_STS_NOTSUPP so that multipath(both native and dm) can failover on the failure?
  2023-12-04  7:58 Should NVME_SC_INVALID_NS be translated to BLK_STS_IOERR instead of BLK_STS_NOTSUPP so that multipath(both native and dm) can failover on the failure? Jirong Feng
@ 2023-12-04  8:47 ` Sagi Grimberg
  2023-12-05  3:54   ` Jirong Feng
  2023-12-05  4:37 ` Keith Busch
  1 sibling, 1 reply; 28+ messages in thread

From: Sagi Grimberg @ 2023-12-04 8:47 UTC (permalink / raw)
To: Jirong Feng, Keith Busch, Jens Axboe, Christoph Hellwig
Cc: linux-nvme, peng.xiao

> Hi all,
>
> I have two storage servers, each of which has an NVMe SSD. Recently I'm
> trying nvmet-tcp with DRBD, steps are:
> 1. Configure DRBD for the two SSDs in two-primary mode, so that each
> server can accept IO on DRBD device.
> 2. On each server, add the corresponding DRBD device to nvmet subsystem
> with same device uuid, so that multipath on the host side can group them
> into one device(My fabric type is tcp).
> 3. On client host, nvme discover & connect the both servers, making sure
> DM multipath device is generated, and both paths are online.
> 4. Execute fio randread on DM device continuously.
> 5. On the server whose multipath status is active, under nvmet namespace
> configfs directory, execute "echo 0 > enable" to disable the namespace.
> what I expect is that IO can be automatically retried and switched to
> the other storage server by multipath, fio goes on. But actually I see
> an "Operation not supported" error, and fio fails and stops. I've also
> tried iSCSI target, after I delete mapped lun from acl, fio continues
> running without any error.
>
> My kernel version is 4.18.0-147.5.1(rhel 8.1). After checked out the
> kernel code, I found that:
> 1. On target side, nvmet returns NVME_SC_INVALID_NS to host due to
> namespace not found.
> 2. On host side, nvme driver translates this error to BLK_STS_NOTSUPP
> for block layer.
> 3. Multipath calls for function blk_path_error() to decide whether to
> retry.
> 4. In function blk_path_error(), BLK_STS_NOTSUPP is not considered to be
> a path error, so it returns false, multipath will not retry.
> I've also checked out the master branch from origin, it's almost the
> same. In iSCSI target, the process is similar, the only difference is
> that TCM_NON_EXISTENT_LUN will be translated to BLK_STS_IOERR, which is
> considered to be a path error in function blk_path_error().
>
> So my question is as the subject...Is it reasonable to translate
> NVME_SC_INVALID_NS to BLK_STS_IOERR just like what iSCSI target does?
> Should multipath failover on this error?

The host issued IO to a non-existing namespace. Semantically it is not an IO error in the sense that it's retryable.

btw, AFAICT TCM_NON_EXISTENT_LUN does return an ILLEGAL_REQUEST; however, the host chooses to ignore that particular additional sense specifically.

While I guess similar behavior could be done in nvme, the question is: why is a non-existent namespace failure a retryable error? The namespace is gone...

Thoughts?

Perhaps what you are seeking is a soft way to disable a namespace, based on your test case?
* Re: Should NVME_SC_INVALID_NS be translated to BLK_STS_IOERR instead of BLK_STS_NOTSUPP so that multipath(both native and dm) can failover on the failure?
  2023-12-04  8:47 ` Sagi Grimberg
@ 2023-12-05  3:54   ` Jirong Feng
  0 siblings, 0 replies; 28+ messages in thread

From: Jirong Feng @ 2023-12-05 3:54 UTC (permalink / raw)
To: Sagi Grimberg, Keith Busch, Jens Axboe, Christoph Hellwig
Cc: linux-nvme, peng.xiao

Hi Sagi,

On the one hand, from multipath's perspective, if one path fails with NVME_SC_INVALID_NS, does that mean the namespace is invalid on all the other paths as well? If not, then a retry on another path also makes sense; and if so, I'm afraid there's not much choice but IO error, despite the semantic mismatch...

On the other hand, it seems there's currently no soft way to disable a namespace in nvmet. To support one, nvmet would need to differentiate between a disabled and a non-existent namespace, but it just doesn't. After that, we might need a new NVMe status code for a disabled namespace, something like NVME_SC_DISABLED_NS I guess, translated to BLK_STS_OFFLINE, which blk_path_error() does consider a path error?

Regards,
Jirong

On 2023/12/4 16:47, Sagi Grimberg wrote:
>
>> Hi all,
>>
>> I have two storage servers, each of which has an NVMe SSD. Recently
>> I'm trying nvmet-tcp with DRBD, steps are:
>> 1. Configure DRBD for the two SSDs in two-primary mode, so that each
>> server can accept IO on DRBD device.
>> 2. On each server, add the corresponding DRBD device to nvmet
>> subsystem with same device uuid, so that multipath on the host side
>> can group them into one device(My fabric type is tcp).
>> 3. On client host, nvme discover & connect the both servers, making
>> sure DM multipath device is generated, and both paths are online.
>> 4. Execute fio randread on DM device continuously.
>> 5. On the server whose multipath status is active, under nvmet
>> namespace configfs directory, execute "echo 0 > enable" to disable
>> the namespace.
>> what I expect is that IO can be automatically retried and switched to
>> the other storage server by multipath, fio goes on. But actually I
>> see an "Operation not supported" error, and fio fails and stops. I've
>> also tried iSCSI target, after I delete mapped lun from acl, fio
>> continues running without any error.
>>
>> My kernel version is 4.18.0-147.5.1(rhel 8.1). After checked out the
>> kernel code, I found that:
>> 1. On target side, nvmet returns NVME_SC_INVALID_NS to host due to
>> namespace not found.
>> 2. On host side, nvme driver translates this error to BLK_STS_NOTSUPP
>> for block layer.
>> 3. Multipath calls for function blk_path_error() to decide whether to
>> retry.
>> 4. In function blk_path_error(), BLK_STS_NOTSUPP is not considered to
>> be a path error, so it returns false, multipath will not retry.
>> I've also checked out the master branch from origin, it's almost the
>> same. In iSCSI target, the process is similar, the only difference is
>> that TCM_NON_EXISTENT_LUN will be translated to BLK_STS_IOERR, which
>> is considered to be a path error in function blk_path_error().
>>
>> So my question is as the subject...Is it reasonable to translate
>> NVME_SC_INVALID_NS to BLK_STS_IOERR just like what iSCSI target does?
>> Should multipath failover on this error?
>
> The host issued IO to a non-existing namespace. Semantically it is not
> an IO error in the sense that its retryable.
>
> btw, AFAICT TCM_NON_EXISTENT_LUN does return an ILLEGAL_REQUEST however
> the host chooses to ignore the particular additional sense specifically.
>
> While I guess similar behavior could be done in nvme, the question is
> why is a non-existent namespace failure a retryable error? the namespace
> is gone...
>
> Thoughts?
>
> Perhaps what you are seeking is a soft way to disable a namespace based
> on your test case?
* Re: Should NVME_SC_INVALID_NS be translated to BLK_STS_IOERR instead of BLK_STS_NOTSUPP so that multipath(both native and dm) can failover on the failure?
  2023-12-04  7:58 Should NVME_SC_INVALID_NS be translated to BLK_STS_IOERR instead of BLK_STS_NOTSUPP so that multipath(both native and dm) can failover on the failure? Jirong Feng
  2023-12-04  8:47 ` Sagi Grimberg
@ 2023-12-05  4:37 ` Keith Busch
  2023-12-05  4:40   ` Christoph Hellwig
  1 sibling, 1 reply; 28+ messages in thread

From: Keith Busch @ 2023-12-05 4:37 UTC (permalink / raw)
To: Jirong Feng
Cc: Jens Axboe, Christoph Hellwig, Sagi Grimberg, linux-nvme, peng.xiao

On Mon, Dec 04, 2023 at 03:58:14PM +0800, Jirong Feng wrote:
> Hi all,
>
> I have two storage servers, each of which has an NVMe SSD. Recently I'm
> trying nvmet-tcp with DRBD, steps are:
> 1. Configure DRBD for the two SSDs in two-primary mode, so that each server
> can accept IO on DRBD device.
> 2. On each server, add the corresponding DRBD device to nvmet subsystem with
> same device uuid, so that multipath on the host side can group them into one
> device(My fabric type is tcp).
> 3. On client host, nvme discover & connect the both servers, making sure DM
> multipath device is generated, and both paths are online.
> 4. Execute fio randread on DM device continuously.
> 5. On the server whose multipath status is active, under nvmet namespace
> configfs directory, execute "echo 0 > enable" to disable the namespace.
> what I expect is that IO can be automatically retried and switched to the
> other storage server by multipath, fio goes on. But actually I see an
> "Operation not supported" error, and fio fails and stops. I've also tried
> iSCSI target, after I delete mapped lun from acl, fio continues running
> without any error.
>
> My kernel version is 4.18.0-147.5.1(rhel 8.1). After checked out the kernel
> code, I found that:
> 1. On target side, nvmet returns NVME_SC_INVALID_NS to host due to namespace
> not found.

So the controller through that path used to be able to access the Namespace, then suddenly lost the ability to do so, but some other path can still access it if we retry on a failover/alternate path? I think your target is returning the wrong error code. It should be SCT/SC 301h, Asymmetric Access Persistent Loss (NVME_SC_ANA_PERSISTENT_LOSS), for what you're describing.
* Re: Should NVME_SC_INVALID_NS be translated to BLK_STS_IOERR instead of BLK_STS_NOTSUPP so that multipath(both native and dm) can failover on the failure?
  2023-12-05  4:37 ` Keith Busch
@ 2023-12-05  4:40   ` Christoph Hellwig
  2023-12-05  5:18     ` Keith Busch
  2023-12-05  8:50     ` Sagi Grimberg
  0 siblings, 2 replies; 28+ messages in thread

From: Christoph Hellwig @ 2023-12-05 4:40 UTC (permalink / raw)
To: Keith Busch
Cc: Jirong Feng, Jens Axboe, Christoph Hellwig, Sagi Grimberg, linux-nvme, peng.xiao

On Mon, Dec 04, 2023 at 09:37:56PM -0700, Keith Busch wrote:
> So the controller through that path used to be able to access the
> Namespace, then suddenly lost ability to do so, but some other path can
> still access it if we retry on a failover/alternate path? I think your
> target is returning the wrong error code. It should be SCT/SC 301h,
> Asymmetric Access Persistent Loss (NVME_SC_ANA_PERSISTENT_LOSS), for
> what you're describing.

Yes, assuming ANA is actually supported by the controllers..
* Re: Should NVME_SC_INVALID_NS be translated to BLK_STS_IOERR instead of BLK_STS_NOTSUPP so that multipath(both native and dm) can failover on the failure?
  2023-12-05  4:40 ` Christoph Hellwig
@ 2023-12-05  5:18   ` Keith Busch
  2023-12-05  7:06     ` Jirong Feng
  0 siblings, 1 reply; 28+ messages in thread

From: Keith Busch @ 2023-12-05 5:18 UTC (permalink / raw)
To: Christoph Hellwig
Cc: Jirong Feng, Jens Axboe, Sagi Grimberg, linux-nvme, peng.xiao

On Tue, Dec 05, 2023 at 05:40:35AM +0100, Christoph Hellwig wrote:
> On Mon, Dec 04, 2023 at 09:37:56PM -0700, Keith Busch wrote:
> > So the controller through that path used to be able to access the
> > Namespace, then suddenly lost ability to do so, but some other path can
> > still access it if we retry on a failover/alternate path? I think your
> > target is returning the wrong error code. It should be SCT/SC 301h,
> > Asymmetric Access Persistent Loss (NVME_SC_ANA_PERSISTENT_LOSS), for
> > what you're describing.
>
> Yes, assuming ANA is actually supported by the controllers..

Even without ANA, "Invalid Namespace" is still the wrong status code when dynamic namespace attachment is supported. If the namespace still exists in the subsystem but is not attached to the controller processing a command (i.e. "inactive"), the return needs to be Invalid Field in Command:

  Specifying an inactive namespace identifier (refer to section 3.2.1.4)
  in a command that uses the namespace identifier shall cause the
  controller to abort the command with a status code of Invalid Field in
  Command
* Re: Should NVME_SC_INVALID_NS be translated to BLK_STS_IOERR instead of BLK_STS_NOTSUPP so that multipath(both native and dm) can failover on the failure?
  2023-12-05  5:18 ` Keith Busch
@ 2023-12-05  7:06   ` Jirong Feng
  0 siblings, 0 replies; 28+ messages in thread

From: Jirong Feng @ 2023-12-05 7:06 UTC (permalink / raw)
To: Keith Busch, Christoph Hellwig
Cc: Jens Axboe, Sagi Grimberg, linux-nvme, peng.xiao

As far as I know, in the current implementation of nvmet_parse_io_cmd() in drivers/nvme/target/core.c, nvmet_req_find_ns() is called before nvmet_check_ana_state(), so I believe nvmet currently returns NVME_SC_INVALID_NS once a namespace is disabled, regardless of whether ANA is supported. In nvmet, a disabled namespace acts like it does not exist. nvmet_check_ana_state() requires req->ns, which is assigned in nvmet_req_find_ns(). If the namespace is unknown, nvmet can't know the state of its ANA group either. So, to better conform to the specification, nvmet does need to differentiate between a disabled and a non-existent namespace?

Moreover, even if nvmet returns NVME_SC_INVALID_FIELD to the host, that status code is still translated to BLK_STS_NOTSUPP, so multipath won't retry either...

On 2023/12/5 13:18, Keith Busch wrote:
> On Tue, Dec 05, 2023 at 05:40:35AM +0100, Christoph Hellwig wrote:
>> On Mon, Dec 04, 2023 at 09:37:56PM -0700, Keith Busch wrote:
>>> So the controller through that path used to be able to access the
>>> Namespace, then suddenly lost ability to do so, but some other path can
>>> still access it if we retry on a failover/alternate path? I think your
>>> target is returning the wrong error code. It should be SCT/SC 301h,
>>> Asymmetric Access Persistent Loss (NVME_SC_ANA_PERSISTENT_LOSS), for
>>> what you're describing.
>> Yes, assuming ANA is actually supported by the controllers..
> Even without ANA, "Invalid Namespace" is still the wrong status code
> when dynamic namespace attachment is supported. If the namespace still
> exists in the subsystem but not attached to the controller processing a
> command (i.e. "inactive"), the return needs be Invalid Field in Command:
>
>   Specifying an inactive namespace identifier (refer to section 3.2.1.4)
>   in a command that uses the namespace identifier shall cause the
>   controller to abort the command with a status code of Invalid Field in
>   Command
* Re: Should NVME_SC_INVALID_NS be translated to BLK_STS_IOERR instead of BLK_STS_NOTSUPP so that multipath(both native and dm) can failover on the failure?
  2023-12-05  4:40 ` Christoph Hellwig
  2023-12-05  5:18   ` Keith Busch
@ 2023-12-05  8:50   ` Sagi Grimberg
  2023-12-25 11:25     ` Jirong Feng
  1 sibling, 1 reply; 28+ messages in thread

From: Sagi Grimberg @ 2023-12-05 8:50 UTC (permalink / raw)
To: Christoph Hellwig, Keith Busch
Cc: Jirong Feng, Jens Axboe, linux-nvme, peng.xiao

>> So the controller through that path used to be able to access the
>> Namespace, then suddenly lost ability to do so, but some other path can
>> still access it if we retry on a failover/alternate path? I think your
>> target is returning the wrong error code. It should be SCT/SC 301h,
>> Asymmetric Access Persistent Loss (NVME_SC_ANA_PERSISTENT_LOSS), for
>> what you're describing.
>
> Yes, assuming ANA is actually supported by the controllers..

It's a good point (should probably be ANA inaccessible). But semantically, this status applies to all the namespaces in the ANA group. And the host will not see an updated ANA log page, which will then override the ns ANA state back?

The patch below can handle the return status, but it's not clear what behavior we want...
--
diff --git a/drivers/nvme/target/configfs.c b/drivers/nvme/target/configfs.c
index e307a044b1a1..5fd5e74a41a8 100644
--- a/drivers/nvme/target/configfs.c
+++ b/drivers/nvme/target/configfs.c
@@ -726,6 +726,19 @@ static struct configfs_attribute *nvmet_ns_attrs[] = {
 	NULL,
 };
 
+bool nvmet_subsys_nsid_exists(struct nvmet_subsys *subsys, u32 nsid)
+{
+	struct config_item *ns_item;
+	char name[4] = {};
+
+	if (sprintf(name, "%u\n", nsid) <= 0)
+		return false;
+	mutex_lock(&subsys->namespaces_group.cg_subsys->su_mutex);
+	ns_item = config_group_find_item(&subsys->namespaces_group, name);
+	mutex_unlock(&subsys->namespaces_group.cg_subsys->su_mutex);
+	return ns_item != NULL;
+}
+
 static void nvmet_ns_release(struct config_item *item)
 {
 	struct nvmet_ns *ns = to_nvmet_ns(item);
diff --git a/drivers/nvme/target/core.c b/drivers/nvme/target/core.c
index 3935165048e7..426ced914a21 100644
--- a/drivers/nvme/target/core.c
+++ b/drivers/nvme/target/core.c
@@ -425,11 +425,15 @@ void nvmet_stop_keep_alive_timer(struct nvmet_ctrl *ctrl)
 u16 nvmet_req_find_ns(struct nvmet_req *req)
 {
 	u32 nsid = le32_to_cpu(req->cmd->common.nsid);
+	struct nvmet_subsys *subsys = nvmet_req_subsys(req);
 
-	req->ns = xa_load(&nvmet_req_subsys(req)->namespaces, nsid);
+	req->ns = xa_load(&subsys->namespaces, nsid);
 	if (unlikely(!req->ns)) {
 		req->error_loc = offsetof(struct nvme_common_command, nsid);
-		return NVME_SC_INVALID_NS | NVME_SC_DNR;
+		if (nvmet_subsys_nsid_exists(subsys, nsid))
+			return NVME_ANA_PERSISTENT_LOSS;
+		else
+			return NVME_SC_INVALID_NS | NVME_SC_DNR;
 	}
 
 	percpu_ref_get(&req->ns->ref);
diff --git a/drivers/nvme/target/nvmet.h b/drivers/nvme/target/nvmet.h
index 6c8acebe1a1a..477416abf85a 100644
--- a/drivers/nvme/target/nvmet.h
+++ b/drivers/nvme/target/nvmet.h
@@ -542,6 +542,7 @@ void nvmet_subsys_disc_changed(struct nvmet_subsys *subsys,
 		struct nvmet_host *host);
 void nvmet_add_async_event(struct nvmet_ctrl *ctrl, u8 event_type,
 		u8 event_info, u8 log_page);
+bool nvmet_subsys_nsid_exists(struct nvmet_subsys *subsys, u32 nsid);
 
 #define NVMET_QUEUE_SIZE	1024
 #define NVMET_NR_QUEUES		128
--
* Re: Should NVME_SC_INVALID_NS be translated to BLK_STS_IOERR instead of BLK_STS_NOTSUPP so that multipath(both native and dm) can failover on the failure? 2023-12-05 8:50 ` Sagi Grimberg @ 2023-12-25 11:25 ` Jirong Feng 2023-12-25 11:40 ` Sagi Grimberg 0 siblings, 1 reply; 28+ messages in thread From: Jirong Feng @ 2023-12-25 11:25 UTC (permalink / raw) To: Sagi Grimberg, Christoph Hellwig, Keith Busch Cc: Jens Axboe, linux-nvme, peng.xiao hi all, any updates about this case? thanks ^ permalink raw reply [flat|nested] 28+ messages in thread
* Re: Should NVME_SC_INVALID_NS be translated to BLK_STS_IOERR instead of BLK_STS_NOTSUPP so that multipath(both native and dm) can failover on the failure?
  2023-12-25 11:25 ` Jirong Feng
@ 2023-12-25 11:40   ` Sagi Grimberg
  2023-12-25 12:14     ` Jirong Feng
  0 siblings, 1 reply; 28+ messages in thread

From: Sagi Grimberg @ 2023-12-25 11:40 UTC (permalink / raw)
To: Jirong Feng, Christoph Hellwig, Keith Busch
Cc: Jens Axboe, linux-nvme, peng.xiao

> hi all,
>
> any updates about this case?

I think we weren't able to find a suitable ANA status that would cause the host to failover as you expect.

Perhaps nvmet should return NVME_SC_INTERNAL_PATH_ERROR? If the nvmet ns is disabled, that is somewhat equivalent, at least under some interpretation... The only part that is unclear is what the host will do if it gets an ANA status but sees no change when it reads the ANA log page...

Did you test the patch I sent in one of the earlier replies?
* Re: Should NVME_SC_INVALID_NS be translated to BLK_STS_IOERR instead of BLK_STS_NOTSUPP so that multipath(both native and dm) can failover on the failure? 2023-12-25 11:40 ` Sagi Grimberg @ 2023-12-25 12:14 ` Jirong Feng 2023-12-26 13:27 ` Jirong Feng 0 siblings, 1 reply; 28+ messages in thread From: Jirong Feng @ 2023-12-25 12:14 UTC (permalink / raw) To: Sagi Grimberg, Christoph Hellwig, Keith Busch Cc: Jens Axboe, linux-nvme, peng.xiao > > Did you test the patch I sent in one of the replies before? > not yet, I'll test it tomorrow ASAP. thanks ^ permalink raw reply [flat|nested] 28+ messages in thread
* Re: Should NVME_SC_INVALID_NS be translated to BLK_STS_IOERR instead of BLK_STS_NOTSUPP so that multipath(both native and dm) can failover on the failure?
  2023-12-25 12:14 ` Jirong Feng
@ 2023-12-26 13:27   ` Jirong Feng
  2024-01-01  9:51     ` Sagi Grimberg
  0 siblings, 1 reply; 28+ messages in thread

From: Jirong Feng @ 2023-12-26 13:27 UTC (permalink / raw)
To: Sagi Grimberg, Christoph Hellwig, Keith Busch
Cc: Jens Axboe, linux-nvme, peng.xiao

I've tested the patch based on kernel version 6.6.0. It doesn't seem to work... Here are my steps & results:

1. Create a VM and make & install the kernel from source with the patch applied.

[root@fjr-nvmet-1 ~]# uname -r
6.6.0-mytest+

2. Clone that VM.

3. Create a shared volume and attach it to both VMs.

4. Configure nvmet as below:

VM1:

o- / .................................................................... [...]
  o- hosts .............................................................. [...]
  o- ports .............................................................. [...]
  | o- 1 ....... [trtype=tcp, traddr=192.168.111.99, trsvcid=4420, inline_data_size=262144]
  |   o- ana_groups ..................................................... [...]
  |   | o- 1 .............................................. [state=optimized]
  |   o- referrals ...................................................... [...]
  |   o- subsystems ..................................................... [...]
  |     o- nqn.2014-08.org.nvmexpress:NVMf:uuid:cf4bb93c-949f-4532-a5c1-b8bd267a4e06 ... [...]
  o- subsystems ......................................................... [...]
    o- nqn.2014-08.org.nvmexpress:NVMf:uuid:cf4bb93c-949f-4532-a5c1-b8bd267a4e06 [version=1.3, allow_any=1, serial=308df2776344fdd17cba]
      o- allowed_hosts .................................................. [...]
      o- namespaces ..................................................... [...]
        o- 1 ... [path=/dev/vdc, uuid=cf4bb93c-949f-4532-a5c1-b8bd267a4e06, grpid=1, enabled]

VM2:

o- / .................................................................... [...]
  o- hosts .............................................................. [...]
  o- ports .............................................................. [...]
  | o- 1 ...... [trtype=tcp, traddr=192.168.111.111, trsvcid=4420, inline_data_size=262144]
  |   o- ana_groups ..................................................... [...]
  |   | o- 1 .............................................. [state=optimized]
  |   o- referrals ...................................................... [...]
  |   o- subsystems ..................................................... [...]
  |     o- nqn.2014-08.org.nvmexpress:NVMf:uuid:cf4bb93c-949f-4532-a5c1-b8bd267a4e06 ... [...]
  o- subsystems ......................................................... [...]
    o- nqn.2014-08.org.nvmexpress:NVMf:uuid:cf4bb93c-949f-4532-a5c1-b8bd267a4e06 [version=1.3, allow_any=1, serial=0dcf77d36826000cc5a0]
      o- allowed_hosts .................................................. [...]
      o- namespaces ..................................................... [...]
        o- 1 ... [path=/dev/vdc, uuid=cf4bb93c-949f-4532-a5c1-b8bd267a4e06, grpid=1, enabled]

5. Create a host VM (CentOS 8.1, kernel version 4.18.0-147.3.1.el8_1.aarch64) and configure dm multipath:

[root@fjr-vm1 ~]# cat /etc/multipath/conf.d/nvme.conf
devices {
    device {
        vendor "NVME"
        product "Linux"
        path_selector "round-robin 0"
        path_grouping_policy failover
        uid_attribute ID_SERIAL
        prio "ANA"
        path_checker "none"
        #rr_min_io 100
        #rr_min_io_rq "1"
        #fast_io_fail_tmo 15
        #dev_loss_tmo 600
        #rr_weight uniform
        rr_weight priorities
        failback immediate
        no_path_retry queue
    }
}

6. Connect nvme on the host; finally it looks like:

[root@fjr-vm1 ~]# nvme list
Node             SN                   Model  Namespace  Usage                  Format       FW Rev
---------------- -------------------- ------ ---------- ---------------------- ------------ --------
/dev/nvme0n1     0dcf77d36826000cc5a0 Linux  1          107.37 GB / 107.37 GB  512 B + 0 B  6.6.0-my
/dev/nvme1n1     308df2776344fdd17cba Linux  1          107.37 GB / 107.37 GB  512 B + 0 B  6.6.0-my

[root@fjr-vm1 ~]# nvme list-subsys
nvme-subsys0 - NQN=nqn.2014-08.org.nvmexpress:NVMf:uuid:cf4bb93c-949f-4532-a5c1-b8bd267a4e06
\
 +- nvme0 tcp traddr=192.168.111.111 trsvcid=4420 live
 +- nvme1 tcp traddr=192.168.111.99 trsvcid=4420 live

[root@fjr-vm1 ~]# multipath -ll
mpatha (Linux_0dcf77d36826000cc5a0) dm-0 NVME,Linux
size=100G features='1 queue_if_no_path' hwhandler='0' wp=rw
|-+- policy='round-robin 0' prio=50 status=active
| `- 1:1:1:1 nvme1n1 259:1 active ready running
`-+- policy='round-robin 0' prio=50 status=enabled
  `- 0:1:1:1 nvme0n1 259:0 active ready running

7. Run fio on the host and disable the namespace on the VM corresponding to nvme1; the same error appears again:

fio: io_u error on file /dev/dm-0: Operation not supported: write offset=14734594048, buflen=4096
fio: io_u error on file /dev/dm-0: Operation not supported: write offset=106607394816, buflen=4096
fio: pid=16076, err=95/file:io_u.c:1747, func=io_u error, error=Operation not supported
* Re: Should NVME_SC_INVALID_NS be translated to BLK_STS_IOERR instead of BLK_STS_NOTSUPP so that multipath(both native and dm) can failover on the failure? 2023-12-26 13:27 ` Jirong Feng @ 2024-01-01 9:51 ` Sagi Grimberg 2024-01-02 10:33 ` Jirong Feng 0 siblings, 1 reply; 28+ messages in thread From: Sagi Grimberg @ 2024-01-01 9:51 UTC (permalink / raw) To: Jirong Feng, Christoph Hellwig, Keith Busch Cc: Jens Axboe, linux-nvme, peng.xiao > I've tested the patch basing on kernel version 6.6.0. It seems not > working... Can you paste the log output (host and controller)? > > here's my steps & results: > > 1. create a VM and make & install kernel from source applying the patch. > > [root@fjr-nvmet-1 ~]# uname -r > 6.6.0-mytest+ > > 2. clone that VM. > > 3. create a shared volume and attach to the both VMs. > > 4. config nvmet as below: > > VM1: > > o- / > ......................................................................................................................... [...] > o- hosts > ................................................................................................................... [...] > o- ports > ................................................................................................................... [...] > | o- 1 ................................................ [trtype=tcp, > traddr=192.168.111.99, trsvcid=4420, inline_data_size=262144] > | o- ana_groups > .......................................................................................................... [...] > | | o- 1 > ..................................................................................................... [state=optimized] > | o- referrals > ........................................................................................................... [...] > | o- subsystems > .......................................................................................................... [...] 
> | o- > nqn.2014-08.org.nvmexpress:NVMf:uuid:cf4bb93c-949f-4532-a5c1-b8bd267a4e06 ......................................... [...] > o- subsystems > .............................................................................................................. [...] > o- > nqn.2014-08.org.nvmexpress:NVMf:uuid:cf4bb93c-949f-4532-a5c1-b8bd267a4e06 [version=1.3, allow_any=1, serial=308df2776344fdd17cba] > o- allowed_hosts > ....................................................................................................... [...] > o- namespaces > .......................................................................................................... [...] > o- 1 .......................................... [path=/dev/vdc, > uuid=cf4bb93c-949f-4532-a5c1-b8bd267a4e06, grpid=1, enabled] > > VM2: > > o- / > ......................................................................................................................... [...] > o- hosts > ................................................................................................................... [...] > o- ports > ................................................................................................................... [...] > | o- 1 ............................................... [trtype=tcp, > traddr=192.168.111.111, trsvcid=4420, inline_data_size=262144] > | o- ana_groups > .......................................................................................................... [...] > | | o- 1 > ..................................................................................................... [state=optimized] > | o- referrals > ........................................................................................................... [...] > | o- subsystems > .......................................................................................................... [...] 
> | o- > nqn.2014-08.org.nvmexpress:NVMf:uuid:cf4bb93c-949f-4532-a5c1-b8bd267a4e06 ......................................... [...] > o- subsystems > .............................................................................................................. [...] > o- > nqn.2014-08.org.nvmexpress:NVMf:uuid:cf4bb93c-949f-4532-a5c1-b8bd267a4e06 [version=1.3, allow_any=1, serial=0dcf77d36826000cc5a0] > o- allowed_hosts > ....................................................................................................... [...] > o- namespaces > .......................................................................................................... [...] > o- 1 .......................................... [path=/dev/vdc, > uuid=cf4bb93c-949f-4532-a5c1-b8bd267a4e06, grpid=1, enabled] > > 5. create a host vm(CentOS 8.1, kernel version > 4.18.0-147.3.1.el8_1.aarch64), config dm multipath > > [root@fjr-vm1 ~]# cat /etc/multipath/conf.d/nvme.conf > devices { > device { > vendor "NVME" > product "Linux" > path_selector "round-robin 0" > path_grouping_policy failover > uid_attribute ID_SERIAL > prio "ANA" > path_checker "none" > #rr_min_io 100 > #rr_min_io_rq "1" > #fast_io_fail_tmo 15 > #dev_loss_tmo 600 > #rr_weight uniform > rr_weight priorities > failback immediate > no_path_retry queue > } > } > > > 6. 
connect nvme on host, finally it looks like: > > [root@fjr-vm1 ~]# nvme list > Node SN Model Namespace > Usage Format FW Rev > ---------------- -------------------- > ---------------------------------------- --------- > -------------------------- ---------------- -------- > /dev/nvme0n1 0dcf77d36826000cc5a0 > Linux 1 107.37 GB / 107.37 > GB 512 B + 0 B 6.6.0-my > /dev/nvme1n1 308df2776344fdd17cba > Linux 1 107.37 GB / 107.37 > GB 512 B + 0 B 6.6.0-my > > [root@fjr-vm1 ~]# nvme list-subsys > nvme-subsys0 - > NQN=nqn.2014-08.org.nvmexpress:NVMf:uuid:cf4bb93c-949f-4532-a5c1-b8bd267a4e06 > \ > +- nvme0 tcp traddr=192.168.111.111 trsvcid=4420 live > +- nvme1 tcp traddr=192.168.111.99 trsvcid=4420 live > > [root@fjr-vm1 ~]# multipath -ll > mpatha (Linux_0dcf77d36826000cc5a0) dm-0 NVME,Linux > size=100G features='1 queue_if_no_path' hwhandler='0' wp=rw > |-+- policy='round-robin 0' prio=50 status=active > | `- 1:1:1:1 nvme1n1 259:1 active ready running > `-+- policy='round-robin 0' prio=50 status=enabled > `- 0:1:1:1 nvme0n1 259:0 active ready running > > 7. execute fio on host, and disable namespace on the vm corresponding to > nvme1, the same error goes again: > > fio: io_u error on file /dev/dm-0: Operation not supported: write > offset=14734594048, buflen=4096 > fio: io_u error on file /dev/dm-0: Operation not supported: write > offset=106607394816, buflen=4096 > fio: pid=16076, err=95/file:io_u.c:1747, func=io_u error, > error=Operation not supported > > > ^ permalink raw reply [flat|nested] 28+ messages in thread
* Re: Should NVME_SC_INVALID_NS be translated to BLK_STS_IOERR instead of BLK_STS_NOTSUPP so that multipath(both native and dm) can failover on the failure? 2024-01-01 9:51 ` Sagi Grimberg @ 2024-01-02 10:33 ` Jirong Feng 2024-01-02 12:46 ` Sagi Grimberg 0 siblings, 1 reply; 28+ messages in thread From: Jirong Feng @ 2024-01-02 10:33 UTC (permalink / raw) To: Sagi Grimberg, Christoph Hellwig, Keith Busch Cc: Jens Axboe, linux-nvme, peng.xiao After removing the '\n' from the line `if (sprintf(name, "%u\n", nsid) <= 0)` in nvmet_subsys_nsid_exists() (added by the patch), the patch works exactly as I expect: fio keeps running and multipath fails over. > Can you paste the log output (host and controller)? > host: [Tue Jan 2 10:22:11 2024] print_req_error: 8 callbacks suppressed [Tue Jan 2 10:22:11 2024] print_req_error: I/O error, dev nvme1n1, sector 186257448 flags ca01 [Tue Jan 2 10:22:11 2024] device-mapper: multipath: Failing path 259:1. [Tue Jan 2 10:22:11 2024] nvme nvme1: rescanning namespaces. [Tue Jan 2 10:22:11 2024] device-mapper: multipath round-robin: repeat_count > 1 is deprecated, using 1 instead target: [Tue Jan 2 10:21:57 2024] nvmet: ctrl 1 update keep-alive timer for 15 secs [Tue Jan 2 10:22:12 2024] nvmet: jirong add: returning NVME_ANA_PERSISTENT_LOSS [Tue Jan 2 10:22:12 2024] nvmet_tcp: failed cmd 00000000de551a59 id 37 opcode 1, data_len: 4096 [Tue Jan 2 10:22:12 2024] nvmet: ctrl 1 reschedule traffic based keep-alive timer [Tue Jan 2 10:22:17 2024] nvmet: ctrl 1 update keep-alive timer for 15 secs ^ permalink raw reply [flat|nested] 28+ messages in thread
* Re: Should NVME_SC_INVALID_NS be translated to BLK_STS_IOERR instead of BLK_STS_NOTSUPP so that multipath(both native and dm) can failover on the failure? 2024-01-02 10:33 ` Jirong Feng @ 2024-01-02 12:46 ` Sagi Grimberg 2024-01-03 10:24 ` Jirong Feng 0 siblings, 1 reply; 28+ messages in thread From: Sagi Grimberg @ 2024-01-02 12:46 UTC (permalink / raw) To: Jirong Feng, Christoph Hellwig, Keith Busch Cc: Jens Axboe, linux-nvme, peng.xiao > in function nvmet_subsys_nsid_exists() added in the patch, remove the > '\n' from line `if (sprintf(name, "%u\n", nsid) <= 0)`, the patch works > exactly as what I expect, fio keeps running and multipath does failover. OK, can you please check nvme native mpath as well? What I suspect will happen is that the host will be unable to fail over, because it will re-read the ana log page, not find anything wrong with the actual path, and hence just retry on the same namespace. That may be transient because there should be an AEN on the way to the host to remove the (path'd) namespace altogether. The status reporting is not consistent with the actual ana state, hence the approach has a semantic problem. Perhaps we want another error that is a path status, but not semantically an ana status. Can you try returning NVME_SC_CTRL_PATH_ERROR instead of NVME_SC_ANA_PERSISTENT_LOSS? ^ permalink raw reply [flat|nested] 28+ messages in thread
* Re: Should NVME_SC_INVALID_NS be translated to BLK_STS_IOERR instead of BLK_STS_NOTSUPP so that multipath(both native and dm) can failover on the failure? 2024-01-02 12:46 ` Sagi Grimberg @ 2024-01-03 10:24 ` Jirong Feng 2024-01-04 11:56 ` Sagi Grimberg 0 siblings, 1 reply; 28+ messages in thread From: Jirong Feng @ 2024-01-03 10:24 UTC (permalink / raw) To: Sagi Grimberg, Christoph Hellwig, Keith Busch Cc: Jens Axboe, linux-nvme, peng.xiao > OK, can you please check nvme native mpath as well? switch to nvme native mpath: [root@fjr-vm1 ~]# nvme list-subsys nvme-subsys0 - NQN=nqn.2014-08.org.nvmexpress:NVMf:uuid:cf4bb93c-949f-4532-a5c1-b8bd267a4e06 \ +- nvme0 tcp traddr=192.168.111.99 trsvcid=4420 live +- nvme1 tcp traddr=192.168.111.111 trsvcid=4420 live [root@fjr-vm1 ~]# multipath -ll uuid.cf4bb93c-949f-4532-a5c1-b8bd267a4e06 [nvme]:nvme0n1 NVMe,Linux,6.6.0-my size=209715200 features='n/a' hwhandler='ANA' wp=rw |-+- policy='n/a' prio=50 status=optimized | `- 0:0:1 nvme0c0n1 0:0 n/a optimized live `-+- policy='n/a' prio=50 status=optimized `- 0:1:1 nvme0c1n1 0:0 n/a optimized live fio still keeps running without any error, just for this time. (see below) host dmesg: [Wed Jan 3 07:42:55 2024] nvme nvme0: reschedule traffic based keep-alive timer [Wed Jan 3 07:42:55 2024] nvme nvme1: reschedule traffic based keep-alive timer [Wed Jan 3 07:43:00 2024] nvme nvme0: reschedule traffic based keep-alive timer [Wed Jan 3 07:43:00 2024] nvme nvme1: reschedule traffic based keep-alive timer [Wed Jan 3 07:43:05 2024] nvme nvme0: connecting queue 0 [Wed Jan 3 07:43:05 2024] nvme nvme0: ANA group 1: optimized. [Wed Jan 3 07:43:05 2024] nvme nvme0: creating 4 I/O queues. [Wed Jan 3 07:43:05 2024] nvme nvme0: connecting queue 1 [Wed Jan 3 07:43:05 2024] nvme nvme0: connecting queue 2 [Wed Jan 3 07:43:05 2024] nvme nvme0: connecting queue 3 [Wed Jan 3 07:43:05 2024] nvme nvme0: connecting queue 4 [Wed Jan 3 07:43:05 2024] nvme nvme0: rescanning namespaces. 
[Wed Jan 3 07:43:05 2024] nvme nvme0: connecting queue 0 [Wed Jan 3 07:43:05 2024] nvme nvme0: ANA group 1: optimized. [Wed Jan 3 07:43:05 2024] nvme nvme0: creating 4 I/O queues. [Wed Jan 3 07:43:05 2024] nvme nvme0: connecting queue 1 [Wed Jan 3 07:43:05 2024] nvme nvme0: connecting queue 2 [Wed Jan 3 07:43:05 2024] nvme nvme0: connecting queue 3 [Wed Jan 3 07:43:05 2024] nvme nvme0: connecting queue 4 [Wed Jan 3 07:43:05 2024] nvme nvme1: reschedule traffic based keep-alive timer [Wed Jan 3 07:43:10 2024] nvme nvme0: reschedule traffic based keep-alive timer [Wed Jan 3 07:43:10 2024] nvme nvme1: reschedule traffic based keep-alive timer target dmesg: [Wed Jan 3 07:41:23 2024] nvmet: ctrl 1 update keep-alive timer for 15 secs [Wed Jan 3 07:41:33 2024] nvmet: ctrl 1 update keep-alive timer for 15 secs [Wed Jan 3 07:41:43 2024] nvmet: ctrl 1 update keep-alive timer for 15 secs [Wed Jan 3 07:41:58 2024] nvmet: ctrl 1 reschedule traffic based keep-alive timer [Wed Jan 3 07:42:14 2024] nvmet: ctrl 1 reschedule traffic based keep-alive timer [Wed Jan 3 07:42:29 2024] nvmet: ctrl 1 reschedule traffic based keep-alive timer [Wed Jan 3 07:42:44 2024] nvmet: ctrl 1 reschedule traffic based keep-alive timer [Wed Jan 3 07:43:00 2024] nvmet: ctrl 1 reschedule traffic based keep-alive timer [Wed Jan 3 07:43:04 2024] nvmet: fjr add: returning NVME_ANA_PERSISTENT_LOSS [Wed Jan 3 07:43:04 2024] nvmet_tcp: failed cmd 0000000034dfe760 id 14 opcode 1, data_len: 4096 [Wed Jan 3 07:43:04 2024] nvmet: got cmd 12 while CC.EN == 0 on qid = 0 [Wed Jan 3 07:43:04 2024] nvmet_tcp: failed cmd 00000000228b330a id 31 opcode 12, data_len: 0 [Wed Jan 3 07:43:04 2024] nvmet: ctrl 2 start keep-alive timer for 15 secs [Wed Jan 3 07:43:04 2024] nvmet: ctrl 1 stop keep-alive [Wed Jan 3 07:43:04 2024] nvmet: creating nvm controller 2 for subsystem nqn.2014-08.org.nvmexpress:NVMf:uuid:cf4bb93c-949f-4532-a5c1-b8bd267a4e06 for NQN nqn.2014-08.org.nvmexpress:uuid:1d8f7c82-9deb-4bc8-8292-5ff32ee3a2be. 
[Wed Jan 3 07:43:04 2024] nvmet: adding queue 1 to ctrl 2. [Wed Jan 3 07:43:04 2024] nvmet: adding queue 2 to ctrl 2. [Wed Jan 3 07:43:04 2024] nvmet: adding queue 3 to ctrl 2. [Wed Jan 3 07:43:04 2024] nvmet: adding queue 4 to ctrl 2. [Wed Jan 3 07:43:04 2024] nvmet: fjr add: returning NVME_ANA_PERSISTENT_LOSS [Wed Jan 3 07:43:04 2024] nvmet_tcp: failed cmd 00000000d9d3dba9 id 100 opcode 1, data_len: 4096 [Wed Jan 3 07:43:04 2024] nvmet: ctrl 1 start keep-alive timer for 15 secs [Wed Jan 3 07:43:04 2024] nvmet: ctrl 2 stop keep-alive [Wed Jan 3 07:43:04 2024] nvmet: creating nvm controller 1 for subsystem nqn.2014-08.org.nvmexpress:NVMf:uuid:cf4bb93c-949f-4532-a5c1-b8bd267a4e06 for NQN nqn.2014-08.org.nvmexpress:uuid:1d8f7c82-9deb-4bc8-8292-5ff32ee3a2be. [Wed Jan 3 07:43:04 2024] nvmet: adding queue 1 to ctrl 1. [Wed Jan 3 07:43:04 2024] nvmet: adding queue 2 to ctrl 1. [Wed Jan 3 07:43:04 2024] nvmet: adding queue 3 to ctrl 1. [Wed Jan 3 07:43:04 2024] nvmet: adding queue 4 to ctrl 1. [Wed Jan 3 07:43:14 2024] nvmet: ctrl 1 update keep-alive timer for 15 secs > > Can you try returning NVME_SC_CTRL_PATH_ERROR instead of > NVME_SC_ANA_PERSISTENT_LOSS ? I enabled/disabled again and again, found that fio keeps running for most time, but occasionally(about 10% or less) fails and stops with error. fio: io_u error on file /dev/nvme0n1: Input/output error: write offset=100662296576, buflen=4096 fio: pid=1485, err=5/file:io_u.c:1747, func=io_u error, error=Input/output error fio_iops: (groupid=0, jobs=1): err= 5 (file:io_u.c:1747, func=io_u error, error=Input/output error): pid=1485: Wed Jan 3 08:44:09 2024 host dmesg: [Wed Jan 3 08:44:06 2024] nvme nvme1: reschedule traffic based keep-alive timer [Wed Jan 3 08:44:07 2024] nvme nvme0: reschedule traffic based keep-alive timer [Wed Jan 3 08:44:09 2024] nvme nvme0: connecting queue 0 [Wed Jan 3 08:44:09 2024] nvme nvme0: ANA group 1: optimized. [Wed Jan 3 08:44:09 2024] nvme nvme0: creating 4 I/O queues. 
[Wed Jan 3 08:44:09 2024] nvme nvme0: connecting queue 1 [Wed Jan 3 08:44:09 2024] nvme nvme0: connecting queue 2 [Wed Jan 3 08:44:09 2024] nvme nvme0: connecting queue 3 [Wed Jan 3 08:44:09 2024] nvme nvme0: connecting queue 4 [Wed Jan 3 08:44:09 2024] nvme nvme0: rescanning namespaces. [Wed Jan 3 08:44:09 2024] Buffer I/O error on dev nvme0n1, logical block 0, async page read [Wed Jan 3 08:44:09 2024] nvme0n1: unable to read partition table [Wed Jan 3 08:44:09 2024] Buffer I/O error on dev nvme0n1, logical block 6, async page read [Wed Jan 3 08:44:11 2024] nvme nvme1: reschedule traffic based keep-alive timer [Wed Jan 3 08:44:14 2024] nvme nvme0: reschedule traffic based keep-alive timer target dmesg: [Wed Jan 3 08:44:08 2024] nvmet: fjr add: returning NVME_SC_CTRL_PATH_ERROR [Wed Jan 3 08:44:08 2024] nvmet_tcp: failed cmd 00000000c11e0ae7 id 53 opcode 1, data_len: 4096 [Wed Jan 3 08:44:08 2024] nvmet: fjr add: returning NVME_SC_CTRL_PATH_ERROR [Wed Jan 3 08:44:08 2024] nvmet_tcp: failed cmd 00000000e0d12c37 id 54 opcode 1, data_len: 4096 [Wed Jan 3 08:44:08 2024] nvmet: ctrl 2 start keep-alive timer for 15 secs [Wed Jan 3 08:44:08 2024] nvmet: ctrl 1 stop keep-alive [Wed Jan 3 08:44:08 2024] nvmet: creating nvm controller 2 for subsystem nqn.2014-08.org.nvmexpress:NVMf:uuid:cf4bb93c-949f-4532-a5c1-b8bd267a4e06 for NQN nqn.2014-08.org.nvmexpress:uuid:1d8f7c82-9deb-4bc8-8292-5ff32ee3a2be. [Wed Jan 3 08:44:08 2024] nvmet: adding queue 1 to ctrl 2. [Wed Jan 3 08:44:08 2024] nvmet: adding queue 2 to ctrl 2. [Wed Jan 3 08:44:08 2024] nvmet: adding queue 3 to ctrl 2. [Wed Jan 3 08:44:08 2024] nvmet: adding queue 4 to ctrl 2. [Wed Jan 3 08:44:18 2024] nvmet: ctrl 2 update keep-alive timer for 15 secs [Wed Jan 3 08:44:28 2024] nvmet: ctrl 2 update keep-alive timer for 15 secs then back to returning NVME_ANA_PERSISTENT_LOSS, fio occasionally fails too. log output are pretty the same. then back to dm multipath, for about 50 times enable/disable, fio never fails. 
^ permalink raw reply [flat|nested] 28+ messages in thread
* Re: Should NVME_SC_INVALID_NS be translated to BLK_STS_IOERR instead of BLK_STS_NOTSUPP so that multipath(both native and dm) can failover on the failure? 2024-01-03 10:24 ` Jirong Feng @ 2024-01-04 11:56 ` Sagi Grimberg 2024-01-30 9:36 ` Jirong Feng 0 siblings, 1 reply; 28+ messages in thread From: Sagi Grimberg @ 2024-01-04 11:56 UTC (permalink / raw) To: Jirong Feng, Christoph Hellwig, Keith Busch Cc: Jens Axboe, linux-nvme, peng.xiao On 1/3/24 12:24, Jirong Feng wrote: >> OK, can you please check nvme native mpath as well? > > switch to nvme native mpath: > > [root@fjr-vm1 ~]# nvme list-subsys > nvme-subsys0 - > NQN=nqn.2014-08.org.nvmexpress:NVMf:uuid:cf4bb93c-949f-4532-a5c1-b8bd267a4e06 > \ > +- nvme0 tcp traddr=192.168.111.99 trsvcid=4420 live > +- nvme1 tcp traddr=192.168.111.111 trsvcid=4420 live > [root@fjr-vm1 ~]# multipath -ll > uuid.cf4bb93c-949f-4532-a5c1-b8bd267a4e06 [nvme]:nvme0n1 > NVMe,Linux,6.6.0-my > size=209715200 features='n/a' hwhandler='ANA' wp=rw > |-+- policy='n/a' prio=50 status=optimized > | `- 0:0:1 nvme0c0n1 0:0 n/a optimized live > `-+- policy='n/a' prio=50 status=optimized > `- 0:1:1 nvme0c1n1 0:0 n/a optimized live > > fio still keeps running without any error, just for this time. (see below) > > host dmesg: > > [Wed Jan 3 07:42:55 2024] nvme nvme0: reschedule traffic based > keep-alive timer > [Wed Jan 3 07:42:55 2024] nvme nvme1: reschedule traffic based > keep-alive timer > [Wed Jan 3 07:43:00 2024] nvme nvme0: reschedule traffic based > keep-alive timer > [Wed Jan 3 07:43:00 2024] nvme nvme1: reschedule traffic based > keep-alive timer > [Wed Jan 3 07:43:05 2024] nvme nvme0: connecting queue 0 > [Wed Jan 3 07:43:05 2024] nvme nvme0: ANA group 1: optimized. > [Wed Jan 3 07:43:05 2024] nvme nvme0: creating 4 I/O queues. 
> [Wed Jan 3 07:43:05 2024] nvme nvme0: connecting queue 1 > [Wed Jan 3 07:43:05 2024] nvme nvme0: connecting queue 2 > [Wed Jan 3 07:43:05 2024] nvme nvme0: connecting queue 3 > [Wed Jan 3 07:43:05 2024] nvme nvme0: connecting queue 4 > [Wed Jan 3 07:43:05 2024] nvme nvme0: rescanning namespaces. > [Wed Jan 3 07:43:05 2024] nvme nvme0: connecting queue 0 > [Wed Jan 3 07:43:05 2024] nvme nvme0: ANA group 1: optimized. > [Wed Jan 3 07:43:05 2024] nvme nvme0: creating 4 I/O queues. > [Wed Jan 3 07:43:05 2024] nvme nvme0: connecting queue 1 > [Wed Jan 3 07:43:05 2024] nvme nvme0: connecting queue 2 > [Wed Jan 3 07:43:05 2024] nvme nvme0: connecting queue 3 > [Wed Jan 3 07:43:05 2024] nvme nvme0: connecting queue 4 > [Wed Jan 3 07:43:05 2024] nvme nvme1: reschedule traffic based > keep-alive timer > [Wed Jan 3 07:43:10 2024] nvme nvme0: reschedule traffic based > keep-alive timer > [Wed Jan 3 07:43:10 2024] nvme nvme1: reschedule traffic based > keep-alive timer > > target dmesg: > > [Wed Jan 3 07:41:23 2024] nvmet: ctrl 1 update keep-alive timer for 15 > secs > [Wed Jan 3 07:41:33 2024] nvmet: ctrl 1 update keep-alive timer for 15 > secs > [Wed Jan 3 07:41:43 2024] nvmet: ctrl 1 update keep-alive timer for 15 > secs > [Wed Jan 3 07:41:58 2024] nvmet: ctrl 1 reschedule traffic based > keep-alive timer > [Wed Jan 3 07:42:14 2024] nvmet: ctrl 1 reschedule traffic based > keep-alive timer > [Wed Jan 3 07:42:29 2024] nvmet: ctrl 1 reschedule traffic based > keep-alive timer > [Wed Jan 3 07:42:44 2024] nvmet: ctrl 1 reschedule traffic based > keep-alive timer > [Wed Jan 3 07:43:00 2024] nvmet: ctrl 1 reschedule traffic based > keep-alive timer > [Wed Jan 3 07:43:04 2024] nvmet: fjr add: returning > NVME_ANA_PERSISTENT_LOSS > [Wed Jan 3 07:43:04 2024] nvmet_tcp: failed cmd 0000000034dfe760 id 14 > opcode 1, data_len: 4096 > [Wed Jan 3 07:43:04 2024] nvmet: got cmd 12 while CC.EN == 0 on qid = 0 > [Wed Jan 3 07:43:04 2024] nvmet_tcp: failed cmd 00000000228b330a id 31 > opcode 
12, data_len: 0 > [Wed Jan 3 07:43:04 2024] nvmet: ctrl 2 start keep-alive timer for 15 secs > [Wed Jan 3 07:43:04 2024] nvmet: ctrl 1 stop keep-alive > [Wed Jan 3 07:43:04 2024] nvmet: creating nvm controller 2 for > subsystem > nqn.2014-08.org.nvmexpress:NVMf:uuid:cf4bb93c-949f-4532-a5c1-b8bd267a4e06 for NQN nqn.2014-08.org.nvmexpress:uuid:1d8f7c82-9deb-4bc8-8292-5ff32ee3a2be. > [Wed Jan 3 07:43:04 2024] nvmet: adding queue 1 to ctrl 2. > [Wed Jan 3 07:43:04 2024] nvmet: adding queue 2 to ctrl 2. > [Wed Jan 3 07:43:04 2024] nvmet: adding queue 3 to ctrl 2. > [Wed Jan 3 07:43:04 2024] nvmet: adding queue 4 to ctrl 2. > [Wed Jan 3 07:43:04 2024] nvmet: fjr add: returning > NVME_ANA_PERSISTENT_LOSS > [Wed Jan 3 07:43:04 2024] nvmet_tcp: failed cmd 00000000d9d3dba9 id 100 > opcode 1, data_len: 4096 > [Wed Jan 3 07:43:04 2024] nvmet: ctrl 1 start keep-alive timer for 15 secs > [Wed Jan 3 07:43:04 2024] nvmet: ctrl 2 stop keep-alive > [Wed Jan 3 07:43:04 2024] nvmet: creating nvm controller 1 for > subsystem > nqn.2014-08.org.nvmexpress:NVMf:uuid:cf4bb93c-949f-4532-a5c1-b8bd267a4e06 for NQN nqn.2014-08.org.nvmexpress:uuid:1d8f7c82-9deb-4bc8-8292-5ff32ee3a2be. > [Wed Jan 3 07:43:04 2024] nvmet: adding queue 1 to ctrl 1. > [Wed Jan 3 07:43:04 2024] nvmet: adding queue 2 to ctrl 1. > [Wed Jan 3 07:43:04 2024] nvmet: adding queue 3 to ctrl 1. > [Wed Jan 3 07:43:04 2024] nvmet: adding queue 4 to ctrl 1. > [Wed Jan 3 07:43:14 2024] nvmet: ctrl 1 update keep-alive timer for 15 > secs > >> >> Can you try returning NVME_SC_CTRL_PATH_ERROR instead of >> NVME_SC_ANA_PERSISTENT_LOSS ? > > I enabled/disabled again and again, found that fio keeps running for > most time, but occasionally(about 10% or less) fails and stops with error. 
> > fio: io_u error on file /dev/nvme0n1: Input/output error: write > offset=100662296576, buflen=4096 > fio: pid=1485, err=5/file:io_u.c:1747, func=io_u error, > error=Input/output error > > fio_iops: (groupid=0, jobs=1): err= 5 (file:io_u.c:1747, func=io_u > error, error=Input/output error): pid=1485: Wed Jan 3 08:44:09 2024 > > host dmesg: > > [Wed Jan 3 08:44:06 2024] nvme nvme1: reschedule traffic based > keep-alive timer > [Wed Jan 3 08:44:07 2024] nvme nvme0: reschedule traffic based > keep-alive timer > [Wed Jan 3 08:44:09 2024] nvme nvme0: connecting queue 0 > [Wed Jan 3 08:44:09 2024] nvme nvme0: ANA group 1: optimized. > [Wed Jan 3 08:44:09 2024] nvme nvme0: creating 4 I/O queues. > [Wed Jan 3 08:44:09 2024] nvme nvme0: connecting queue 1 > [Wed Jan 3 08:44:09 2024] nvme nvme0: connecting queue 2 > [Wed Jan 3 08:44:09 2024] nvme nvme0: connecting queue 3 > [Wed Jan 3 08:44:09 2024] nvme nvme0: connecting queue 4 > [Wed Jan 3 08:44:09 2024] nvme nvme0: rescanning namespaces. > [Wed Jan 3 08:44:09 2024] Buffer I/O error on dev nvme0n1, logical > block 0, async page read > [Wed Jan 3 08:44:09 2024] nvme0n1: unable to read partition table > [Wed Jan 3 08:44:09 2024] Buffer I/O error on dev nvme0n1, logical > block 6, async page read > [Wed Jan 3 08:44:11 2024] nvme nvme1: reschedule traffic based > keep-alive timer > [Wed Jan 3 08:44:14 2024] nvme nvme0: reschedule traffic based > keep-alive timer > > target dmesg: > > [Wed Jan 3 08:44:08 2024] nvmet: fjr add: returning > NVME_SC_CTRL_PATH_ERROR > [Wed Jan 3 08:44:08 2024] nvmet_tcp: failed cmd 00000000c11e0ae7 id 53 > opcode 1, data_len: 4096 > [Wed Jan 3 08:44:08 2024] nvmet: fjr add: returning > NVME_SC_CTRL_PATH_ERROR > [Wed Jan 3 08:44:08 2024] nvmet_tcp: failed cmd 00000000e0d12c37 id 54 > opcode 1, data_len: 4096 > [Wed Jan 3 08:44:08 2024] nvmet: ctrl 2 start keep-alive timer for 15 secs > [Wed Jan 3 08:44:08 2024] nvmet: ctrl 1 stop keep-alive > [Wed Jan 3 08:44:08 2024] nvmet: creating nvm 
controller 2 for > subsystem > nqn.2014-08.org.nvmexpress:NVMf:uuid:cf4bb93c-949f-4532-a5c1-b8bd267a4e06 for NQN nqn.2014-08.org.nvmexpress:uuid:1d8f7c82-9deb-4bc8-8292-5ff32ee3a2be. > [Wed Jan 3 08:44:08 2024] nvmet: adding queue 1 to ctrl 2. > [Wed Jan 3 08:44:08 2024] nvmet: adding queue 2 to ctrl 2. > [Wed Jan 3 08:44:08 2024] nvmet: adding queue 3 to ctrl 2. > [Wed Jan 3 08:44:08 2024] nvmet: adding queue 4 to ctrl 2. > [Wed Jan 3 08:44:18 2024] nvmet: ctrl 2 update keep-alive timer for 15 > secs > [Wed Jan 3 08:44:28 2024] nvmet: ctrl 2 update keep-alive timer for 15 > secs > > then back to returning NVME_ANA_PERSISTENT_LOSS, fio occasionally fails > too. log output are pretty the same. > > then back to dm multipath, for about 50 times enable/disable, fio never > fails. > Hmm, its interesting why you fail only in particular ios and not every io. I suspect that there is a timing issue here. Looking at the code, I suspect that ios continue being sent to the path'd namespace although they shouldn't. The reason is that if we return an ana error, then the host will re-read the ana log page again and find the namespace eligible for IO (the action of disable/enable namespace does not impact the ana log), or, we return a path error which is not ana error, in this case the host will not re-read the ana log page, and the namespace will be re-selected in the next IO (or at least nothing prevents it). First of all, I think that the most suitable status for nvmet to return in this case is: NVME_SC_INTERNAL_PATH_ERROR From the spec: Internal Path Error: The command was not completed as the result of a controller internal error that is specific to the controller processing the command. Retries for the request function should be based on the setting of the DNR bit (refer to Figure 92). In the host code, I don't see any reference to such error status returned by the controller. 
So I think we may want to pair it with something like (this untested hunk): -- diff --git a/drivers/nvme/host/multipath.c b/drivers/nvme/host/multipath.c index 0a88d7bdc5e3..0fb82056ba5f 100644 --- a/drivers/nvme/host/multipath.c +++ b/drivers/nvme/host/multipath.c @@ -97,6 +97,14 @@ void nvme_failover_req(struct request *req) if (nvme_is_ana_error(status) && ns->ctrl->ana_log_buf) { set_bit(NVME_NS_ANA_PENDING, &ns->flags); queue_work(nvme_wq, &ns->ctrl->ana_work); + } else if ((status & 0x7ff) == NVME_SC_INTERNAL_PATH_ERROR) { + /* + * The ctrl is telling us it is unable to reach the + * ns in a way that does not impact the entire ana + * group. The only way we can stop sending io to this + * specific namespace is by clearing its ready bit. + */ + clear_bit(NVME_NS_READY, &ns->flags); } spin_lock_irqsave(&ns->head->requeue_lock, flags); -- Keith, Christoph, do you agree that the host action when it sees an error status like NVME_SC_INTERNAL_PATH_ERROR it needs to stop sending IO to the namespace but not change anything related to ana? ^ permalink raw reply related [flat|nested] 28+ messages in thread
* Re: Should NVME_SC_INVALID_NS be translated to BLK_STS_IOERR instead of BLK_STS_NOTSUPP so that multipath(both native and dm) can failover on the failure? 2024-01-04 11:56 ` Sagi Grimberg @ 2024-01-30 9:36 ` Jirong Feng 2024-01-30 11:29 ` Sagi Grimberg 0 siblings, 1 reply; 28+ messages in thread From: Jirong Feng @ 2024-01-30 9:36 UTC (permalink / raw) To: Sagi Grimberg, Christoph Hellwig, Keith Busch Cc: Jens Axboe, linux-nvme, peng.xiao Now I suspect that my testcase is inappropriate for nvme native multipath. According to the base spec, chapter 2.4.1, nvme native multipath aims at accessing a certain namespace through multiple paths, not at grouping different namespaces into one device. Therefore, in the fabrics case, a namespace must belong to one subsystem on a single target server. Looking at the latest host-side nvme driver code, the host does refuse namespaces reporting the same uuid on two different subsystems (in function nvme_global_check_duplicate_ids), which is exactly what my setup does. The testcase seems to be a misuse of nvme native multipath. However, the testcase is pretty reasonable for dm-mpath. In a cloud scenario, we usually need a volume to be synced and exposed on multiple target servers for high availability reasons. dm-mpath can do that, but only if we choose to group by serial. Namespaces from different subsystems reporting different uuids but the same serial can be recognized as one device by dm-mpath. The only remaining problem seems to be the status code that dm-mpath needs in order to fail over. Native mpath should not encounter this case. Please correct me if I'm wrong :) Thanks ^ permalink raw reply [flat|nested] 28+ messages in thread
* Re: Should NVME_SC_INVALID_NS be translated to BLK_STS_IOERR instead of BLK_STS_NOTSUPP so that multipath(both native and dm) can failover on the failure? 2024-01-30 9:36 ` Jirong Feng @ 2024-01-30 11:29 ` Sagi Grimberg 2024-01-31 6:25 ` Christoph Hellwig 0 siblings, 1 reply; 28+ messages in thread From: Sagi Grimberg @ 2024-01-30 11:29 UTC (permalink / raw) To: Jirong Feng, Christoph Hellwig, Keith Busch Cc: Jens Axboe, linux-nvme, peng.xiao > Now I suspect that my testcase is inappropriate for nvme native multipath. > according to the base spec, chapter 2.4.1, nvme native multipath aims at > accessing a certain namespace through multiple paths, not how to group > different namespaces into one device. Therefore, in fabrics' case, a > namespace must belong to one subsystem on a single target server. Nothing restricts a subsystem to a single server. You can expose the same subsystem from two different servers afair (as well as the same nsid uuid, ana groups etc). > Looking at > the latest code of nvme driver host, the host does refuse those namespaces > reporting the same uuid on two different subsystems(in function > nvme_global_check_duplicate_ids), which is exactly what I'm doing. The > testcase seems to be a misuse of nvme native multipath. multipathing scope is within a subsystem as far as linux is concerned. > However, the testcase is pretty reasonable for dm-mpath. In a cloud > scenario, > we usually need a volume to be synced and exposed on multiple target > servers > for high availability reason. dm-mpath can do that, only if we choose > group by > serial. Namespaces from different subsystems reporting different uuid, but > with same serial, can be recognized as one device by dm-mpath. > > The only problem seems just to be the returning code for dm-mpath to > failover. native mpath should not encounter this case. 
> > please correct me if I'm wrong :) As mentioned, afair (Hannes can correct me if I'm wrong) you can make an nvmet subsystem span more than one server, assuming that the backend device is consistent (i.e. using drbd). The only thing you need to pay attention to is that the cntlid ranges do not overlap across the servers that expose the nvmet subsystem (cntlid_min/max configfs attributes). ^ permalink raw reply [flat|nested] 28+ messages in thread
* Re: Should NVME_SC_INVALID_NS be translated to BLK_STS_IOERR instead of BLK_STS_NOTSUPP so that multipath(both native and dm) can failover on the failure? 2024-01-30 11:29 ` Sagi Grimberg @ 2024-01-31 6:25 ` Christoph Hellwig 2024-03-20 3:17 ` Jirong Feng 0 siblings, 1 reply; 28+ messages in thread From: Christoph Hellwig @ 2024-01-31 6:25 UTC (permalink / raw) To: Sagi Grimberg Cc: Jirong Feng, Christoph Hellwig, Keith Busch, Jens Axboe, linux-nvme, peng.xiao On Tue, Jan 30, 2024 at 01:29:33PM +0200, Sagi Grimberg wrote: > As mentioned, afair (Hannes can correct me if I'm wrong) you can > make an nvmet subsystem span more than 1 server assuming that the > backend device is consistent (i.e. using drbd). The only thing that > you need to pay attention is that the cntlid range is not overlapping > in each of the servers that expose the nvmet subsystem (cntlid_min/max > configfs attributes). You can even make a subsystem span multiple "servers" without shared storage, but in that case you'd better not allow simultaneous access to any given namespace through paths pointing to the different "servers". ANA comes in pretty handy for that. ^ permalink raw reply [flat|nested] 28+ messages in thread
* Re: Should NVME_SC_INVALID_NS be translated to BLK_STS_IOERR instead of BLK_STS_NOTSUPP so that multipath(both native and dm) can failover on the failure? 2024-01-31 6:25 ` Christoph Hellwig @ 2024-03-20 3:17 ` Jirong Feng 2024-03-20 8:51 ` Sagi Grimberg 0 siblings, 1 reply; 28+ messages in thread From: Jirong Feng @ 2024-03-20 3:17 UTC (permalink / raw) To: Sagi Grimberg Cc: Keith Busch, Jens Axboe, linux-nvme, peng.xiao, Christoph Hellwig Hi, just kindly asking: how about the previous patch? Will it be merged? ^ permalink raw reply [flat|nested] 28+ messages in thread
* Re: Should NVME_SC_INVALID_NS be translated to BLK_STS_IOERR instead of BLK_STS_NOTSUPP so that multipath(both native and dm) can failover on the failure? 2024-03-20 3:17 ` Jirong Feng @ 2024-03-20 8:51 ` Sagi Grimberg 2024-03-21 3:06 ` Jirong Feng 0 siblings, 1 reply; 28+ messages in thread From: Sagi Grimberg @ 2024-03-20 8:51 UTC (permalink / raw) To: Jirong Feng Cc: Keith Busch, Jens Axboe, linux-nvme, peng.xiao, Christoph Hellwig On 20/03/2024 5:17, Jirong Feng wrote: > hi, > > just kindly ask, how about the previous patch? will it be merged? > Hey Jirong, We do not yet understand if this works for Linux nvme-mpath (which iirc requires a suggested host-side patch). Once we understand that we can take the changes to mainline. ^ permalink raw reply [flat|nested] 28+ messages in thread
* Re: Should NVME_SC_INVALID_NS be translated to BLK_STS_IOERR instead of BLK_STS_NOTSUPP so that multipath(both native and dm) can failover on the failure? 2024-03-20 8:51 ` Sagi Grimberg @ 2024-03-21 3:06 ` Jirong Feng 2024-04-07 22:28 ` Sagi Grimberg 0 siblings, 1 reply; 28+ messages in thread From: Jirong Feng @ 2024-03-21 3:06 UTC (permalink / raw) To: Sagi Grimberg Cc: Keith Busch, Jens Axboe, linux-nvme, peng.xiao, Christoph Hellwig > Hey Jirong, > > We do not yet understand if this works for Linux nvme-mpath (which > iirc requires a suggested host-side patch). > Once we understand that we can take the changes to mainline. > My last test (native multipath) was done on kernel 4.18.0-147.3.1.el8_1, where the result was occasional failure. Referring to your previous reply, this time I changed the cntlid_min/cntlid_max ranges, making it a single subsystem exposed from different targets. Then I retested; here are the results: 1. On kernel 4.18.0-147.3.1.el8_1, the failure still occurs. 2. On kernel 6.6.0, no failure (about 50 runs). 3. On kernel 6.6.0 with your host-side patch applied, no failure. ^ permalink raw reply [flat|nested] 28+ messages in thread
* Re: Should NVME_SC_INVALID_NS be translated to BLK_STS_IOERR instead of BLK_STS_NOTSUPP so that multipath (both native and dm) can failover on the failure?
  2024-03-21  3:06 ` Jirong Feng
@ 2024-04-07 22:28 ` Sagi Grimberg
  2024-04-12  7:52   ` Jirong Feng
  0 siblings, 1 reply; 28+ messages in thread
From: Sagi Grimberg @ 2024-04-07 22:28 UTC (permalink / raw)
To: Jirong Feng
Cc: Keith Busch, Jens Axboe, linux-nvme, peng.xiao, Christoph Hellwig

On 21/03/2024 5:06, Jirong Feng wrote:
>
> My last test (native multipath) was done on kernel 4.18.0-147.3.1.el8_1;
> the result was an occasional failure.
>
> Following your previous reply to my question, this time I changed the
> cntlid_min/cntlid_max ranges so that the two targets present a single
> subsystem, then retested. Here are the results:
>
> 1. On kernel 4.18.0-147.3.1.el8_1, the failure still occurs.
> 2. On kernel 6.6.0, no failure (about 50 runs).
> 3. On kernel 6.6.0 with your host-side patch applied, no failure.

So essentially there is no need for the host-side patch? Interesting. Are you sure?

Can you please also try with the mpath iopolicy set to round-robin? I'm asking because I cannot understand what is preventing this path from being selected again and again for I/O...

^ permalink raw reply	[flat|nested] 28+ messages in thread
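[Editor's note: for native NVMe multipath, the I/O policy Sagi asks about can be switched at runtime through sysfs. A sketch, assuming the subsystem instance is `nvme-subsys0` (check your own instance name under `/sys/class/nvme-subsystem/`):]

```shell
# Show the available policies; the active one is bracketed
cat /sys/class/nvme-subsystem/nvme-subsys0/iopolicy

# Switch path selection from the default (numa) to round-robin
echo round-robin > /sys/class/nvme-subsystem/nvme-subsys0/iopolicy
```

With round-robin, I/O is spread across all optimized paths instead of sticking to the NUMA-closest one, which makes it much easier to observe whether a failed path keeps being selected.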
* Re: Should NVME_SC_INVALID_NS be translated to BLK_STS_IOERR instead of BLK_STS_NOTSUPP so that multipath (both native and dm) can failover on the failure?
  2024-04-07 22:28 ` Sagi Grimberg
@ 2024-04-12  7:52 ` Jirong Feng
  2024-04-12  8:57   ` Sagi Grimberg
  0 siblings, 1 reply; 28+ messages in thread
From: Jirong Feng @ 2024-04-12  7:52 UTC (permalink / raw)
To: Sagi Grimberg
Cc: Keith Busch, Jens Axboe, linux-nvme, peng.xiao, Christoph Hellwig

> So essentially there is no need for the host-side patch? Interesting.
> Are you sure?

At least no failure has been observed on the newer kernel (6.6.0) so far; I can only say that I've run the test hundreds of times. In addition, I have some scripts that enable/disable the namespace continually, so we can keep observing for a few more days.

> Can you please also try with the mpath iopolicy set to round-robin?

All my previous tests were done with round-robin. I retested again today with both round-robin and numa; the results are still the same.

> I'm asking because I cannot understand what is preventing this path
> from being selected again and again for I/O...

Perhaps we need to dive into the code of the old version (4.18.0-147.3.1.el8_1) and see what's different? Or should I try applying the host-side patch to the old version and test again?

Thanks

^ permalink raw reply	[flat|nested] 28+ messages in thread
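[Editor's note: the thread does not include Jirong's enable/disable scripts; a minimal sketch of such a stress loop on the target side might look like the following, with the NQN and namespace ID (`testnqn`, `1`) being hypothetical:]

```shell
# Toggle an nvmet namespace on and off to exercise host-side failover.
# Each disable should force the host to fail I/O over to the other path;
# each enable should restore the path.
NS=/sys/kernel/config/nvmet/subsystems/testnqn/namespaces/1
while true; do
    echo 0 > "$NS/enable"   # namespace disappears; target returns an error
    sleep 10
    echo 1 > "$NS/enable"   # namespace comes back
    sleep 10
done
```

Run this on one target while fio keeps issuing reads on the host's multipath device; a correct failover setup keeps fio running through every toggle.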
* Re: Should NVME_SC_INVALID_NS be translated to BLK_STS_IOERR instead of BLK_STS_NOTSUPP so that multipath (both native and dm) can failover on the failure?
  2024-04-12  7:52 ` Jirong Feng
@ 2024-04-12  8:57 ` Sagi Grimberg
  2024-04-22  9:47   ` Sagi Grimberg
  0 siblings, 1 reply; 28+ messages in thread
From: Sagi Grimberg @ 2024-04-12  8:57 UTC (permalink / raw)
To: Jirong Feng
Cc: Keith Busch, Jens Axboe, linux-nvme, peng.xiao, Christoph Hellwig

On 12/04/2024 10:52, Jirong Feng wrote:
>> So essentially there is no need for the host-side patch? Interesting.
>> Are you sure?
>
> At least no failure has been observed on the newer kernel (6.6.0) so
> far; I can only say that I've run the test hundreds of times.
>
> In addition, I have some scripts that enable/disable the namespace
> continually, so we can keep observing for a few more days.
>
>> Can you please also try with the mpath iopolicy set to round-robin?
>
> All my previous tests were done with round-robin. I retested again
> today with both round-robin and numa; the results are still the same.
>
>> I'm asking because I cannot understand what is preventing this path
>> from being selected again and again for I/O...
>
> Perhaps we need to dive into the code of the old version
> (4.18.0-147.3.1.el8_1) and see what's different?
>
> Or should I try applying the host-side patch to the old version and
> test again?

What I think you want is to trace whether the path on which you disabled the namespace is actually being selected over and over again, and failed over...

Can you please activate tracing and see where your mpath commands are actually being sent? I'd trace nvme_setup_cmd and check that, once you disable one nvmet ns, it is no longer selected by the mpath namespace as a valid ns.

^ permalink raw reply	[flat|nested] 28+ messages in thread
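[Editor's note: the tracing Sagi suggests uses the kernel's nvme tracepoints via tracefs. A sketch; on older kernels the mount point may be `/sys/kernel/debug/tracing` instead of `/sys/kernel/tracing`:]

```shell
# Enable the nvme_setup_cmd tracepoint on the host
echo 1 > /sys/kernel/tracing/events/nvme/nvme_setup_cmd/enable

# Stream trace output live: each line shows which controller and
# namespace a command was issued to. Disable the nvmet namespace on
# one target and check whether commands keep going to the dead path.
cat /sys/kernel/tracing/trace_pipe
```

When done, disable the event again (`echo 0` to the same file) to stop the trace overhead.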
* Re: Should NVME_SC_INVALID_NS be translated to BLK_STS_IOERR instead of BLK_STS_NOTSUPP so that multipath (both native and dm) can failover on the failure?
  2024-04-12  8:57 ` Sagi Grimberg
@ 2024-04-22  9:47 ` Sagi Grimberg
  2024-04-23  3:15   ` Jirong Feng
  0 siblings, 1 reply; 28+ messages in thread
From: Sagi Grimberg @ 2024-04-22  9:47 UTC (permalink / raw)
To: Jirong Feng
Cc: Keith Busch, Jens Axboe, linux-nvme, peng.xiao, Christoph Hellwig

On 12/04/2024 11:57, Sagi Grimberg wrote:
> What I think you want is to trace whether the path on which you
> disabled the namespace is actually being selected over and over again,
> and failed over...
>
> Can you please activate tracing and see where your mpath commands are
> actually being sent?
>
> I'd trace nvme_setup_cmd and check that, once you disable one nvmet
> ns, it is no longer selected by the mpath namespace as a valid ns.

Any update on this?

^ permalink raw reply	[flat|nested] 28+ messages in thread
* Re: Should NVME_SC_INVALID_NS be translated to BLK_STS_IOERR instead of BLK_STS_NOTSUPP so that multipath (both native and dm) can failover on the failure?
  2024-04-22  9:47 ` Sagi Grimberg
@ 2024-04-23  3:15 ` Jirong Feng
  0 siblings, 0 replies; 28+ messages in thread
From: Jirong Feng @ 2024-04-23  3:15 UTC (permalink / raw)
To: Sagi Grimberg
Cc: Keith Busch, Jens Axboe, linux-nvme, peng.xiao, Christoph Hellwig

Sorry for not replying in time.

> Any update on this?

Not yet; I've just been too busy with work these days. I'll find some time this week. Thanks for your attention to this case.

^ permalink raw reply	[flat|nested] 28+ messages in thread
end of thread, other threads: [~2024-04-23  3:16 UTC | newest]

Thread overview: 28+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2023-12-04  7:58 Should NVME_SC_INVALID_NS be translated to BLK_STS_IOERR instead of BLK_STS_NOTSUPP so that multipath (both native and dm) can failover on the failure? Jirong Feng
2023-12-04  8:47 ` Sagi Grimberg
2023-12-05  3:54   ` Jirong Feng
2023-12-05  4:37 ` Keith Busch
2023-12-05  4:40   ` Christoph Hellwig
2023-12-05  5:18     ` Keith Busch
2023-12-05  7:06       ` Jirong Feng
2023-12-05  8:50         ` Sagi Grimberg
2023-12-25 11:25           ` Jirong Feng
2023-12-25 11:40           ` Sagi Grimberg
2023-12-25 12:14             ` Jirong Feng
2023-12-26 13:27             ` Jirong Feng
2024-01-01  9:51               ` Sagi Grimberg
2024-01-02 10:33                 ` Jirong Feng
2024-01-02 12:46                   ` Sagi Grimberg
2024-01-03 10:24                     ` Jirong Feng
2024-01-04 11:56                       ` Sagi Grimberg
2024-01-30  9:36                         ` Jirong Feng
2024-01-30 11:29                           ` Sagi Grimberg
2024-01-31  6:25                             ` Christoph Hellwig
2024-03-20  3:17                               ` Jirong Feng
2024-03-20  8:51                                 ` Sagi Grimberg
2024-03-21  3:06                                   ` Jirong Feng
2024-04-07 22:28                                     ` Sagi Grimberg
2024-04-12  7:52                                       ` Jirong Feng
2024-04-12  8:57                                         ` Sagi Grimberg
2024-04-22  9:47                                           ` Sagi Grimberg
2024-04-23  3:15                                             ` Jirong Feng