* Should NVME_SC_INVALID_NS be translated to BLK_STS_IOERR instead of BLK_STS_NOTSUPP so that multipath(both native and dm) can failover on the failure?
@ 2023-12-04  7:58 Jirong Feng
  2023-12-04  8:47 ` Sagi Grimberg
  2023-12-05  4:37 ` Keith Busch
  0 siblings, 2 replies; 28+ messages in thread

From: Jirong Feng @ 2023-12-04 7:58 UTC (permalink / raw)
To: Keith Busch, Jens Axboe, Christoph Hellwig, Sagi Grimberg
Cc: linux-nvme, peng.xiao

Hi all,

I have two storage servers, each with an NVMe SSD. Recently I've been trying nvmet-tcp with DRBD; the steps are:

1. Configure DRBD for the two SSDs in two-primary mode, so that each server can accept IO on its DRBD device.
2. On each server, add the corresponding DRBD device to the nvmet subsystem with the same device UUID, so that multipath on the host side can group them into one device (my fabric type is tcp).
3. On the client host, nvme discover & connect to both servers, making sure the DM multipath device is generated and both paths are online.
4. Run fio randread on the DM device continuously.
5. On the server whose multipath status is active, under the nvmet namespace configfs directory, execute "echo 0 > enable" to disable the namespace.

What I expect is that IO is automatically retried and switched to the other storage server by multipath, and fio goes on. But what I actually see is an "Operation not supported" error, and fio fails and stops. I've also tried an iSCSI target: after I delete the mapped LUN from the ACL, fio continues running without any error.

My kernel version is 4.18.0-147.5.1 (RHEL 8.1). After checking out the kernel code, I found that:

1. On the target side, nvmet returns NVME_SC_INVALID_NS to the host because the namespace is not found.
2. On the host side, the nvme driver translates this error to BLK_STS_NOTSUPP for the block layer.
3. Multipath calls blk_path_error() to decide whether to retry.
4. In blk_path_error(), BLK_STS_NOTSUPP is not considered a path error, so it returns false and multipath does not retry.
I've also checked out the master branch from origin; the behavior is almost the same. The iSCSI target goes through a similar process; the only difference is that TCM_NON_EXISTENT_LUN is translated to BLK_STS_IOERR, which blk_path_error() does consider a path error.

So my question is as in the subject: is it reasonable to translate NVME_SC_INVALID_NS to BLK_STS_IOERR, just like the iSCSI target does? Should multipath failover on this error?

Thanks & Best Regards,
Jirong

^ permalink raw reply	[flat|nested] 28+ messages in thread
* Re: Should NVME_SC_INVALID_NS be translated to BLK_STS_IOERR instead of BLK_STS_NOTSUPP so that multipath(both native and dm) can failover on the failure?
  2023-12-04  7:58 Should NVME_SC_INVALID_NS be translated to BLK_STS_IOERR instead of BLK_STS_NOTSUPP so that multipath(both native and dm) can failover on the failure? Jirong Feng
@ 2023-12-04  8:47 ` Sagi Grimberg
  2023-12-05  3:54   ` Jirong Feng
  2023-12-05  4:37 ` Keith Busch
  1 sibling, 1 reply; 28+ messages in thread

From: Sagi Grimberg @ 2023-12-04 8:47 UTC (permalink / raw)
To: Jirong Feng, Keith Busch, Jens Axboe, Christoph Hellwig
Cc: linux-nvme, peng.xiao

> Hi all,
>
> I have two storage servers, each of which has an NVMe SSD. Recently I'm
> trying nvmet-tcp with DRBD, steps are:
> 1. Configure DRBD for the two SSDs in two-primary mode, so that each
> server can accept IO on DRBD device.
> 2. On each server, add the corresponding DRBD device to nvmet subsystem
> with same device uuid, so that multipath on the host side can group them
> into one device(My fabric type is tcp).
> 3. On client host, nvme discover & connect the both servers, making sure
> DM multipath device is generated, and both paths are online.
> 4. Execute fio randread on DM device continuously.
> 5. On the server whose multipath status is active, under nvmet namespace
> configfs directory, execute "echo 0 > enable" to disable the namespace.
> what I expect is that IO can be automatically retried and switched to
> the other storage server by multipath, fio goes on. But actually I see
> an "Operation not supported" error, and fio fails and stops. I've also
> tried iSCSI target, after I delete mapped lun from acl, fio continues
> running without any error.
>
> My kernel version is 4.18.0-147.5.1(rhel 8.1). After checked out the
> kernel code, I found that:
> 1. On target side, nvmet returns NVME_SC_INVALID_NS to host due to
> namespace not found.
> 2. On host side, nvme driver translates this error to BLK_STS_NOTSUPP
> for block layer.
> 3. Multipath calls for function blk_path_error() to decide whether to
> retry.
> 4. In function blk_path_error(), BLK_STS_NOTSUPP is not considered to be
> a path error, so it returns false, multipath will not retry.
> I've also checked out the master branch from origin, it's almost the
> same. In iSCSI target, the process is similar, the only difference is
> that TCM_NON_EXISTENT_LUN will be translated to BLK_STS_IOERR, which is
> considered to be a path error in function blk_path_error().
>
> So my question is as the subject...Is it reasonable to translate
> NVME_SC_INVALID_NS to BLK_STS_IOERR just like what iSCSI target does?
> Should multipath failover on this error?

The host issued IO to a non-existing namespace. Semantically it is not an IO error in the sense that it's retryable.

btw, AFAICT TCM_NON_EXISTENT_LUN does return an ILLEGAL_REQUEST; however, the host chooses to ignore that particular additional sense specifically.

While I guess similar behavior could be done in nvme, the question is: why is a non-existent namespace failure a retryable error? The namespace is gone...

Thoughts?

Perhaps what you are seeking is a soft way to disable a namespace, based on your test case?
* Re: Should NVME_SC_INVALID_NS be translated to BLK_STS_IOERR instead of BLK_STS_NOTSUPP so that multipath(both native and dm) can failover on the failure?
  2023-12-04  8:47 ` Sagi Grimberg
@ 2023-12-05  3:54   ` Jirong Feng
  0 siblings, 0 replies; 28+ messages in thread

From: Jirong Feng @ 2023-12-05 3:54 UTC (permalink / raw)
To: Sagi Grimberg, Keith Busch, Jens Axboe, Christoph Hellwig
Cc: linux-nvme, peng.xiao

Hi Sagi,

On the one hand, from multipath's perspective, if one path fails with NVME_SC_INVALID_NS, does that mean the namespace is invalid on all the other paths as well? If not, then a retry on another path also makes sense; and if so, I'm afraid there's not much choice but IO error, despite the semantic mismatch...

On the other hand, it seems there's currently no soft way to disable a namespace in nvmet. To support one, nvmet would need to differentiate between a disabled and a non-existent namespace, but it just doesn't. After that, we might need a new NVMe status code for a disabled namespace, something like NVME_SC_DISABLED_NS I guess, translated to BLK_STS_OFFLINE, which blk_path_error() does consider a path error?

Regards,
Jirong

On 2023/12/4 16:47, Sagi Grimberg wrote:
>
>> Hi all,
>>
>> I have two storage servers, each of which has an NVMe SSD. Recently
>> I'm trying nvmet-tcp with DRBD, steps are:
>> 1. Configure DRBD for the two SSDs in two-primary mode, so that each
>> server can accept IO on DRBD device.
>> 2. On each server, add the corresponding DRBD device to nvmet
>> subsystem with same device uuid, so that multipath on the host side
>> can group them into one device(My fabric type is tcp).
>> 3. On client host, nvme discover & connect the both servers, making
>> sure DM multipath device is generated, and both paths are online.
>> 4. Execute fio randread on DM device continuously.
>> 5. On the server whose multipath status is active, under nvmet
>> namespace configfs directory, execute "echo 0 > enable" to disable
>> the namespace.
>> what I expect is that IO can be automatically retried and switched to
>> the other storage server by multipath, fio goes on. But actually I
>> see an "Operation not supported" error, and fio fails and stops. I've
>> also tried iSCSI target, after I delete mapped lun from acl, fio
>> continues running without any error.
>>
>> My kernel version is 4.18.0-147.5.1(rhel 8.1). After checked out the
>> kernel code, I found that:
>> 1. On target side, nvmet returns NVME_SC_INVALID_NS to host due to
>> namespace not found.
>> 2. On host side, nvme driver translates this error to BLK_STS_NOTSUPP
>> for block layer.
>> 3. Multipath calls for function blk_path_error() to decide whether to
>> retry.
>> 4. In function blk_path_error(), BLK_STS_NOTSUPP is not considered to
>> be a path error, so it returns false, multipath will not retry.
>> I've also checked out the master branch from origin, it's almost the
>> same. In iSCSI target, the process is similar, the only difference is
>> that TCM_NON_EXISTENT_LUN will be translated to BLK_STS_IOERR, which
>> is considered to be a path error in function blk_path_error().
>>
>> So my question is as the subject...Is it reasonable to translate
>> NVME_SC_INVALID_NS to BLK_STS_IOERR just like what iSCSI target does?
>> Should multipath failover on this error?
>
> The host issued IO to a non-existing namespace. Semantically it is not
> an IO error in the sense that its retryable.
>
> btw, AFAICT TCM_NON_EXISTENT_LUN does return an ILLEGAL_REQUEST however
> the host chooses to ignore the particular additional sense specifically.
>
> While I guess similar behavior could be done in nvme, the question is
> why is a non-existent namespace failure a retryable error? the namespace
> is gone...
>
> Thoughts?
>
> Perhaps what you are seeking is a soft way to disable a namespace based
> on your test case?
* Re: Should NVME_SC_INVALID_NS be translated to BLK_STS_IOERR instead of BLK_STS_NOTSUPP so that multipath(both native and dm) can failover on the failure?
  2023-12-04  7:58 Should NVME_SC_INVALID_NS be translated to BLK_STS_IOERR instead of BLK_STS_NOTSUPP so that multipath(both native and dm) can failover on the failure? Jirong Feng
  2023-12-04  8:47 ` Sagi Grimberg
@ 2023-12-05  4:37 ` Keith Busch
  2023-12-05  4:40   ` Christoph Hellwig
  1 sibling, 1 reply; 28+ messages in thread

From: Keith Busch @ 2023-12-05 4:37 UTC (permalink / raw)
To: Jirong Feng
Cc: Jens Axboe, Christoph Hellwig, Sagi Grimberg, linux-nvme, peng.xiao

On Mon, Dec 04, 2023 at 03:58:14PM +0800, Jirong Feng wrote:
> Hi all,
>
> I have two storage servers, each of which has an NVMe SSD. Recently I'm
> trying nvmet-tcp with DRBD, steps are:
> 1. Configure DRBD for the two SSDs in two-primary mode, so that each server
> can accept IO on DRBD device.
> 2. On each server, add the corresponding DRBD device to nvmet subsystem with
> same device uuid, so that multipath on the host side can group them into one
> device(My fabric type is tcp).
> 3. On client host, nvme discover & connect the both servers, making sure DM
> multipath device is generated, and both paths are online.
> 4. Execute fio randread on DM device continuously.
> 5. On the server whose multipath status is active, under nvmet namespace
> configfs directory, execute "echo 0 > enable" to disable the namespace.
> what I expect is that IO can be automatically retried and switched to the
> other storage server by multipath, fio goes on. But actually I see an
> "Operation not supported" error, and fio fails and stops. I've also tried
> iSCSI target, after I delete mapped lun from acl, fio continues running
> without any error.
>
> My kernel version is 4.18.0-147.5.1(rhel 8.1). After checked out the kernel
> code, I found that:
> 1. On target side, nvmet returns NVME_SC_INVALID_NS to host due to namespace
> not found.

So the controller through that path used to be able to access the Namespace, then suddenly lost the ability to do so, but some other path can still access it if we retry on a failover/alternate path? I think your target is returning the wrong error code. It should be SCT/SC 301h, Asymmetric Access Persistent Loss (NVME_SC_ANA_PERSISTENT_LOSS), for what you're describing.
* Re: Should NVME_SC_INVALID_NS be translated to BLK_STS_IOERR instead of BLK_STS_NOTSUPP so that multipath(both native and dm) can failover on the failure?
  2023-12-05  4:37 ` Keith Busch
@ 2023-12-05  4:40   ` Christoph Hellwig
  2023-12-05  5:18     ` Keith Busch
  2023-12-05  8:50     ` Sagi Grimberg
  0 siblings, 2 replies; 28+ messages in thread

From: Christoph Hellwig @ 2023-12-05 4:40 UTC (permalink / raw)
To: Keith Busch
Cc: Jirong Feng, Jens Axboe, Christoph Hellwig, Sagi Grimberg, linux-nvme, peng.xiao

On Mon, Dec 04, 2023 at 09:37:56PM -0700, Keith Busch wrote:
> So the controller through that path used to be able to access the
> Namespace, then suddenly lost ability to do so, but some other path can
> still access it if we retry on a failover/alternate path? I think your
> target is returning the wrong error code. It should be SCT/SC 301h,
> Asymmetric Access Persistent Loss (NVME_SC_ANA_PERSISTENT_LOSS), for
> what you're describing.

Yes, assuming ANA is actually supported by the controllers..
* Re: Should NVME_SC_INVALID_NS be translated to BLK_STS_IOERR instead of BLK_STS_NOTSUPP so that multipath(both native and dm) can failover on the failure?
  2023-12-05  4:40 ` Christoph Hellwig
@ 2023-12-05  5:18   ` Keith Busch
  2023-12-05  7:06     ` Jirong Feng
  0 siblings, 1 reply; 28+ messages in thread

From: Keith Busch @ 2023-12-05 5:18 UTC (permalink / raw)
To: Christoph Hellwig
Cc: Jirong Feng, Jens Axboe, Sagi Grimberg, linux-nvme, peng.xiao

On Tue, Dec 05, 2023 at 05:40:35AM +0100, Christoph Hellwig wrote:
> On Mon, Dec 04, 2023 at 09:37:56PM -0700, Keith Busch wrote:
> > So the controller through that path used to be able to access the
> > Namespace, then suddenly lost ability to do so, but some other path can
> > still access it if we retry on a failover/alternate path? I think your
> > target is returning the wrong error code. It should be SCT/SC 301h,
> > Asymmetric Access Persistent Loss (NVME_SC_ANA_PERSISTENT_LOSS), for
> > what you're describing.
>
> Yes, assuming ANA is actually supported by the controllers..

Even without ANA, "Invalid Namespace" is still the wrong status code when dynamic namespace attachment is supported. If the namespace still exists in the subsystem but is not attached to the controller processing a command (i.e. "inactive"), the return needs to be Invalid Field in Command:

  Specifying an inactive namespace identifier (refer to section 3.2.1.4)
  in a command that uses the namespace identifier shall cause the
  controller to abort the command with a status code of Invalid Field in
  Command
* Re: Should NVME_SC_INVALID_NS be translated to BLK_STS_IOERR instead of BLK_STS_NOTSUPP so that multipath(both native and dm) can failover on the failure?
  2023-12-05  5:18 ` Keith Busch
@ 2023-12-05  7:06   ` Jirong Feng
  0 siblings, 0 replies; 28+ messages in thread

From: Jirong Feng @ 2023-12-05 7:06 UTC (permalink / raw)
To: Keith Busch, Christoph Hellwig
Cc: Jens Axboe, Sagi Grimberg, linux-nvme, peng.xiao

As far as I know, in the current implementation of nvmet_parse_io_cmd() in drivers/nvme/target/core.c, nvmet_req_find_ns() is called before nvmet_check_ana_state(), so I believe nvmet currently returns NVME_SC_INVALID_NS once a namespace is disabled, regardless of whether ANA is supported. In nvmet, a disabled namespace acts like it does not exist. nvmet_check_ana_state() requires req->ns, which is assigned in nvmet_req_find_ns(). If the namespace is unknown, nvmet can't know the state of its ANA group either. So, to better conform to the specification, nvmet does need to differentiate between a disabled and a non-existent namespace?

Moreover, even if nvmet returns NVME_SC_INVALID_FIELD to the host, that status code is still translated to BLK_STS_NOTSUPP, so multipath won't retry either...

On 2023/12/5 13:18, Keith Busch wrote:
> On Tue, Dec 05, 2023 at 05:40:35AM +0100, Christoph Hellwig wrote:
>> On Mon, Dec 04, 2023 at 09:37:56PM -0700, Keith Busch wrote:
>>> So the controller through that path used to be able to access the
>>> Namespace, then suddenly lost ability to do so, but some other path can
>>> still access it if we retry on a failover/alternate path? I think your
>>> target is returning the wrong error code. It should be SCT/SC 301h,
>>> Asymmetric Access Persistent Loss (NVME_SC_ANA_PERSISTENT_LOSS), for
>>> what you're describing.
>> Yes, assuming ANA is actually supported by the controllers..
> Even without ANA, "Invalid Namespace" is still the wrong status code
> when dynamic namespace attachment is supported. If the namespace still
> exists in the subsystem but not attached to the controller processing a
> command (i.e. "inactive"), the return needs be Invalid Field in Command:
>
>   Specifying an inactive namespace identifier (refer to section 3.2.1.4)
>   in a command that uses the namespace identifier shall cause the
>   controller to abort the command with a status code of Invalid Field in
>   Command
* Re: Should NVME_SC_INVALID_NS be translated to BLK_STS_IOERR instead of BLK_STS_NOTSUPP so that multipath(both native and dm) can failover on the failure?
  2023-12-05  4:40 ` Christoph Hellwig
  2023-12-05  5:18   ` Keith Busch
@ 2023-12-05  8:50   ` Sagi Grimberg
  2023-12-25 11:25     ` Jirong Feng
  1 sibling, 1 reply; 28+ messages in thread

From: Sagi Grimberg @ 2023-12-05 8:50 UTC (permalink / raw)
To: Christoph Hellwig, Keith Busch
Cc: Jirong Feng, Jens Axboe, linux-nvme, peng.xiao

>> So the controller through that path used to be able to access the
>> Namespace, then suddenly lost ability to do so, but some other path can
>> still access it if we retry on a failover/alternate path? I think your
>> target is returning the wrong error code. It should be SCT/SC 301h,
>> Asymmetric Access Persistent Loss (NVME_SC_ANA_PERSISTENT_LOSS), for
>> what you're describing.
>
> Yes, assuming ANA is actually supported by the controllers..

It's a good point (should probably be ANA inaccessible). But semantically, this status applies to all the namespaces in the ANA group. And the host will not see an updated ANA log page, which will then override the ns ANA state back?

The patch below can handle the return status, but it's not clear what behavior we want...
--
diff --git a/drivers/nvme/target/configfs.c b/drivers/nvme/target/configfs.c
index e307a044b1a1..5fd5e74a41a8 100644
--- a/drivers/nvme/target/configfs.c
+++ b/drivers/nvme/target/configfs.c
@@ -726,6 +726,19 @@ static struct configfs_attribute *nvmet_ns_attrs[] = {
 	NULL,
 };
 
+bool nvmet_subsys_nsid_exists(struct nvmet_subsys *subsys, u32 nsid)
+{
+	struct config_item *ns_item;
+	char name[4] = {};
+
+	if (sprintf(name, "%u\n", nsid) <= 0)
+		return false;
+	mutex_lock(&subsys->namespaces_group.cg_subsys->su_mutex);
+	ns_item = config_group_find_item(&subsys->namespaces_group, name);
+	mutex_unlock(&subsys->namespaces_group.cg_subsys->su_mutex);
+	return ns_item != NULL;
+}
+
 static void nvmet_ns_release(struct config_item *item)
 {
 	struct nvmet_ns *ns = to_nvmet_ns(item);
diff --git a/drivers/nvme/target/core.c b/drivers/nvme/target/core.c
index 3935165048e7..426ced914a21 100644
--- a/drivers/nvme/target/core.c
+++ b/drivers/nvme/target/core.c
@@ -425,11 +425,15 @@ void nvmet_stop_keep_alive_timer(struct nvmet_ctrl *ctrl)
 u16 nvmet_req_find_ns(struct nvmet_req *req)
 {
 	u32 nsid = le32_to_cpu(req->cmd->common.nsid);
+	struct nvmet_subsys *subsys = nvmet_req_subsys(req);
 
-	req->ns = xa_load(&nvmet_req_subsys(req)->namespaces, nsid);
+	req->ns = xa_load(&subsys->namespaces, nsid);
 	if (unlikely(!req->ns)) {
 		req->error_loc = offsetof(struct nvme_common_command, nsid);
-		return NVME_SC_INVALID_NS | NVME_SC_DNR;
+		if (nvmet_subsys_nsid_exists(subsys, nsid))
+			return NVME_ANA_PERSISTENT_LOSS;
+		else
+			return NVME_SC_INVALID_NS | NVME_SC_DNR;
 	}
 
 	percpu_ref_get(&req->ns->ref);
diff --git a/drivers/nvme/target/nvmet.h b/drivers/nvme/target/nvmet.h
index 6c8acebe1a1a..477416abf85a 100644
--- a/drivers/nvme/target/nvmet.h
+++ b/drivers/nvme/target/nvmet.h
@@ -542,6 +542,7 @@ void nvmet_subsys_disc_changed(struct nvmet_subsys *subsys,
 		struct nvmet_host *host);
 void nvmet_add_async_event(struct nvmet_ctrl *ctrl, u8 event_type,
 		u8 event_info, u8 log_page);
+bool nvmet_subsys_nsid_exists(struct nvmet_subsys *subsys, u32 nsid);
 
 #define NVMET_QUEUE_SIZE	1024
 #define NVMET_NR_QUEUES		128
--
* Re: Should NVME_SC_INVALID_NS be translated to BLK_STS_IOERR instead of BLK_STS_NOTSUPP so that multipath(both native and dm) can failover on the failure? 2023-12-05 8:50 ` Sagi Grimberg @ 2023-12-25 11:25 ` Jirong Feng 2023-12-25 11:40 ` Sagi Grimberg 0 siblings, 1 reply; 28+ messages in thread From: Jirong Feng @ 2023-12-25 11:25 UTC (permalink / raw) To: Sagi Grimberg, Christoph Hellwig, Keith Busch Cc: Jens Axboe, linux-nvme, peng.xiao hi all, any updates about this case? thanks ^ permalink raw reply [flat|nested] 28+ messages in thread
* Re: Should NVME_SC_INVALID_NS be translated to BLK_STS_IOERR instead of BLK_STS_NOTSUPP so that multipath(both native and dm) can failover on the failure?
  2023-12-25 11:25 ` Jirong Feng
@ 2023-12-25 11:40   ` Sagi Grimberg
  2023-12-25 12:14     ` Jirong Feng
  0 siblings, 1 reply; 28+ messages in thread

From: Sagi Grimberg @ 2023-12-25 11:40 UTC (permalink / raw)
To: Jirong Feng, Christoph Hellwig, Keith Busch
Cc: Jens Axboe, linux-nvme, peng.xiao

> hi all,
>
> any updates about this case?

I think we weren't able to find a suitable ANA status that would cause the host to failover as you expect.

Perhaps nvmet should return NVME_SC_INTERNAL_PATH_ERROR? If the nvmet ns is disabled, that is somewhat equivalent, at least under some interpretation... The only part that is unclear is what the host will do if it gets an ANA status but sees no change when it reads the ANA log page...

Did you test the patch I sent in one of the earlier replies?
* Re: Should NVME_SC_INVALID_NS be translated to BLK_STS_IOERR instead of BLK_STS_NOTSUPP so that multipath(both native and dm) can failover on the failure? 2023-12-25 11:40 ` Sagi Grimberg @ 2023-12-25 12:14 ` Jirong Feng 2023-12-26 13:27 ` Jirong Feng 0 siblings, 1 reply; 28+ messages in thread From: Jirong Feng @ 2023-12-25 12:14 UTC (permalink / raw) To: Sagi Grimberg, Christoph Hellwig, Keith Busch Cc: Jens Axboe, linux-nvme, peng.xiao > > Did you test the patch I sent in one of the replies before? > not yet, I'll test it tomorrow ASAP. thanks ^ permalink raw reply [flat|nested] 28+ messages in thread
* Re: Should NVME_SC_INVALID_NS be translated to BLK_STS_IOERR instead of BLK_STS_NOTSUPP so that multipath(both native and dm) can failover on the failure?
  2023-12-25 12:14 ` Jirong Feng
@ 2023-12-26 13:27   ` Jirong Feng
  2024-01-01  9:51     ` Sagi Grimberg
  0 siblings, 1 reply; 28+ messages in thread

From: Jirong Feng @ 2023-12-26 13:27 UTC (permalink / raw)
To: Sagi Grimberg, Christoph Hellwig, Keith Busch
Cc: Jens Axboe, linux-nvme, peng.xiao

I've tested the patch based on kernel version 6.6.0. It doesn't seem to work... Here are my steps & results:

1. Create a VM and make & install the kernel from source with the patch applied.

[root@fjr-nvmet-1 ~]# uname -r
6.6.0-mytest+

2. Clone that VM.

3. Create a shared volume and attach it to both VMs.

4. Configure nvmet as below:

VM1:

o- / .................................................................... [...]
  o- hosts .............................................................. [...]
  o- ports .............................................................. [...]
  | o- 1 ....... [trtype=tcp, traddr=192.168.111.99, trsvcid=4420, inline_data_size=262144]
  |   o- ana_groups ..................................................... [...]
  |   | o- 1 .............................................. [state=optimized]
  |   o- referrals ...................................................... [...]
  |   o- subsystems ..................................................... [...]
  |     o- nqn.2014-08.org.nvmexpress:NVMf:uuid:cf4bb93c-949f-4532-a5c1-b8bd267a4e06 ... [...]
  o- subsystems ......................................................... [...]
    o- nqn.2014-08.org.nvmexpress:NVMf:uuid:cf4bb93c-949f-4532-a5c1-b8bd267a4e06 [version=1.3, allow_any=1, serial=308df2776344fdd17cba]
      o- allowed_hosts .................................................. [...]
      o- namespaces ..................................................... [...]
        o- 1 ... [path=/dev/vdc, uuid=cf4bb93c-949f-4532-a5c1-b8bd267a4e06, grpid=1, enabled]

VM2:

o- / .................................................................... [...]
  o- hosts .............................................................. [...]
  o- ports .............................................................. [...]
  | o- 1 ...... [trtype=tcp, traddr=192.168.111.111, trsvcid=4420, inline_data_size=262144]
  |   o- ana_groups ..................................................... [...]
  |   | o- 1 .............................................. [state=optimized]
  |   o- referrals ...................................................... [...]
  |   o- subsystems ..................................................... [...]
  |     o- nqn.2014-08.org.nvmexpress:NVMf:uuid:cf4bb93c-949f-4532-a5c1-b8bd267a4e06 ... [...]
  o- subsystems ......................................................... [...]
    o- nqn.2014-08.org.nvmexpress:NVMf:uuid:cf4bb93c-949f-4532-a5c1-b8bd267a4e06 [version=1.3, allow_any=1, serial=0dcf77d36826000cc5a0]
      o- allowed_hosts .................................................. [...]
      o- namespaces ..................................................... [...]
        o- 1 ... [path=/dev/vdc, uuid=cf4bb93c-949f-4532-a5c1-b8bd267a4e06, grpid=1, enabled]

5. Create a host VM (CentOS 8.1, kernel version 4.18.0-147.3.1.el8_1.aarch64) and configure dm multipath:

[root@fjr-vm1 ~]# cat /etc/multipath/conf.d/nvme.conf
devices {
    device {
        vendor "NVME"
        product "Linux"
        path_selector "round-robin 0"
        path_grouping_policy failover
        uid_attribute ID_SERIAL
        prio "ANA"
        path_checker "none"
        #rr_min_io 100
        #rr_min_io_rq "1"
        #fast_io_fail_tmo 15
        #dev_loss_tmo 600
        #rr_weight uniform
        rr_weight priorities
        failback immediate
        no_path_retry queue
    }
}

6. Connect nvme on the host; finally it looks like:

[root@fjr-vm1 ~]# nvme list
Node             SN                   Model  Namespace  Usage                  Format       FW Rev
---------------- -------------------- ------ ---------- ---------------------- ------------ --------
/dev/nvme0n1     0dcf77d36826000cc5a0 Linux  1          107.37 GB / 107.37 GB  512 B + 0 B  6.6.0-my
/dev/nvme1n1     308df2776344fdd17cba Linux  1          107.37 GB / 107.37 GB  512 B + 0 B  6.6.0-my

[root@fjr-vm1 ~]# nvme list-subsys
nvme-subsys0 - NQN=nqn.2014-08.org.nvmexpress:NVMf:uuid:cf4bb93c-949f-4532-a5c1-b8bd267a4e06
\
 +- nvme0 tcp traddr=192.168.111.111 trsvcid=4420 live
 +- nvme1 tcp traddr=192.168.111.99 trsvcid=4420 live

[root@fjr-vm1 ~]# multipath -ll
mpatha (Linux_0dcf77d36826000cc5a0) dm-0 NVME,Linux
size=100G features='1 queue_if_no_path' hwhandler='0' wp=rw
|-+- policy='round-robin 0' prio=50 status=active
| `- 1:1:1:1 nvme1n1 259:1 active ready running
`-+- policy='round-robin 0' prio=50 status=enabled
  `- 0:1:1:1 nvme0n1 259:0 active ready running

7. Run fio on the host and disable the namespace on the VM corresponding to nvme1; the same error appears again:

fio: io_u error on file /dev/dm-0: Operation not supported: write offset=14734594048, buflen=4096
fio: io_u error on file /dev/dm-0: Operation not supported: write offset=106607394816, buflen=4096
fio: pid=16076, err=95/file:io_u.c:1747, func=io_u error, error=Operation not supported
* Re: Should NVME_SC_INVALID_NS be translated to BLK_STS_IOERR instead of BLK_STS_NOTSUPP so that multipath(both native and dm) can failover on the failure? 2023-12-26 13:27 ` Jirong Feng @ 2024-01-01 9:51 ` Sagi Grimberg 2024-01-02 10:33 ` Jirong Feng 0 siblings, 1 reply; 28+ messages in thread From: Sagi Grimberg @ 2024-01-01 9:51 UTC (permalink / raw) To: Jirong Feng, Christoph Hellwig, Keith Busch Cc: Jens Axboe, linux-nvme, peng.xiao > I've tested the patch basing on kernel version 6.6.0. It seems not > working... Can you paste the log output (host and controller)? > > here's my steps & results: > > 1. create a VM and make & install kernel from source applying the patch. > > [root@fjr-nvmet-1 ~]# uname -r > 6.6.0-mytest+ > > 2. clone that VM. > > 3. create a shared volume and attach to the both VMs. > > 4. config nvmet as below: > > VM1: > > o- / > ......................................................................................................................... [...] > o- hosts > ................................................................................................................... [...] > o- ports > ................................................................................................................... [...] > | o- 1 ................................................ [trtype=tcp, > traddr=192.168.111.99, trsvcid=4420, inline_data_size=262144] > | o- ana_groups > .......................................................................................................... [...] > | | o- 1 > ..................................................................................................... [state=optimized] > | o- referrals > ........................................................................................................... [...] > | o- subsystems > .......................................................................................................... [...] 
> | o- > nqn.2014-08.org.nvmexpress:NVMf:uuid:cf4bb93c-949f-4532-a5c1-b8bd267a4e06 ......................................... [...] > o- subsystems > .............................................................................................................. [...] > o- > nqn.2014-08.org.nvmexpress:NVMf:uuid:cf4bb93c-949f-4532-a5c1-b8bd267a4e06 [version=1.3, allow_any=1, serial=308df2776344fdd17cba] > o- allowed_hosts > ....................................................................................................... [...] > o- namespaces > .......................................................................................................... [...] > o- 1 .......................................... [path=/dev/vdc, > uuid=cf4bb93c-949f-4532-a5c1-b8bd267a4e06, grpid=1, enabled] > > VM2: > > o- / > ......................................................................................................................... [...] > o- hosts > ................................................................................................................... [...] > o- ports > ................................................................................................................... [...] > | o- 1 ............................................... [trtype=tcp, > traddr=192.168.111.111, trsvcid=4420, inline_data_size=262144] > | o- ana_groups > .......................................................................................................... [...] > | | o- 1 > ..................................................................................................... [state=optimized] > | o- referrals > ........................................................................................................... [...] > | o- subsystems > .......................................................................................................... [...] 
> | o- > nqn.2014-08.org.nvmexpress:NVMf:uuid:cf4bb93c-949f-4532-a5c1-b8bd267a4e06 ......................................... [...] > o- subsystems > .............................................................................................................. [...] > o- > nqn.2014-08.org.nvmexpress:NVMf:uuid:cf4bb93c-949f-4532-a5c1-b8bd267a4e06 [version=1.3, allow_any=1, serial=0dcf77d36826000cc5a0] > o- allowed_hosts > ....................................................................................................... [...] > o- namespaces > .......................................................................................................... [...] > o- 1 .......................................... [path=/dev/vdc, > uuid=cf4bb93c-949f-4532-a5c1-b8bd267a4e06, grpid=1, enabled] > > 5. create a host vm(CentOS 8.1, kernel version > 4.18.0-147.3.1.el8_1.aarch64), config dm multipath > > [root@fjr-vm1 ~]# cat /etc/multipath/conf.d/nvme.conf > devices { > device { > vendor "NVME" > product "Linux" > path_selector "round-robin 0" > path_grouping_policy failover > uid_attribute ID_SERIAL > prio "ANA" > path_checker "none" > #rr_min_io 100 > #rr_min_io_rq "1" > #fast_io_fail_tmo 15 > #dev_loss_tmo 600 > #rr_weight uniform > rr_weight priorities > failback immediate > no_path_retry queue > } > } > > > 6. 
connect nvme on host, finally it looks like: > > [root@fjr-vm1 ~]# nvme list > Node SN Model Namespace > Usage Format FW Rev > ---------------- -------------------- > ---------------------------------------- --------- > -------------------------- ---------------- -------- > /dev/nvme0n1 0dcf77d36826000cc5a0 > Linux 1 107.37 GB / 107.37 > GB 512 B + 0 B 6.6.0-my > /dev/nvme1n1 308df2776344fdd17cba > Linux 1 107.37 GB / 107.37 > GB 512 B + 0 B 6.6.0-my > > [root@fjr-vm1 ~]# nvme list-subsys > nvme-subsys0 - > NQN=nqn.2014-08.org.nvmexpress:NVMf:uuid:cf4bb93c-949f-4532-a5c1-b8bd267a4e06 > \ > +- nvme0 tcp traddr=192.168.111.111 trsvcid=4420 live > +- nvme1 tcp traddr=192.168.111.99 trsvcid=4420 live > > [root@fjr-vm1 ~]# multipath -ll > mpatha (Linux_0dcf77d36826000cc5a0) dm-0 NVME,Linux > size=100G features='1 queue_if_no_path' hwhandler='0' wp=rw > |-+- policy='round-robin 0' prio=50 status=active > | `- 1:1:1:1 nvme1n1 259:1 active ready running > `-+- policy='round-robin 0' prio=50 status=enabled > `- 0:1:1:1 nvme0n1 259:0 active ready running > > 7. execute fio on host, and disable namespace on the vm corresponding to > nvme1, the same error goes again: > > fio: io_u error on file /dev/dm-0: Operation not supported: write > offset=14734594048, buflen=4096 > fio: io_u error on file /dev/dm-0: Operation not supported: write > offset=106607394816, buflen=4096 > fio: pid=16076, err=95/file:io_u.c:1747, func=io_u error, > error=Operation not supported > > > ^ permalink raw reply [flat|nested] 28+ messages in thread
* Re: Should NVME_SC_INVALID_NS be translated to BLK_STS_IOERR instead of BLK_STS_NOTSUPP so that multipath(both native and dm) can failover on the failure? 2024-01-01 9:51 ` Sagi Grimberg @ 2024-01-02 10:33 ` Jirong Feng 2024-01-02 12:46 ` Sagi Grimberg 0 siblings, 1 reply; 28+ messages in thread From: Jirong Feng @ 2024-01-02 10:33 UTC (permalink / raw) To: Sagi Grimberg, Christoph Hellwig, Keith Busch Cc: Jens Axboe, linux-nvme, peng.xiao After removing the '\n' from the line `if (sprintf(name, "%u\n", nsid) <= 0)` in nvmet_subsys_nsid_exists() (added by the patch), the patch works exactly as I expect: fio keeps running and multipath fails over. > Can you paste the log output (host and controller)? > host: [Tue Jan 2 10:22:11 2024] print_req_error: 8 callbacks suppressed [Tue Jan 2 10:22:11 2024] print_req_error: I/O error, dev nvme1n1, sector 186257448 flags ca01 [Tue Jan 2 10:22:11 2024] device-mapper: multipath: Failing path 259:1. [Tue Jan 2 10:22:11 2024] nvme nvme1: rescanning namespaces. [Tue Jan 2 10:22:11 2024] device-mapper: multipath round-robin: repeat_count > 1 is deprecated, using 1 instead target: [Tue Jan 2 10:21:57 2024] nvmet: ctrl 1 update keep-alive timer for 15 secs [Tue Jan 2 10:22:12 2024] nvmet: jirong add: returning NVME_ANA_PERSISTENT_LOSS [Tue Jan 2 10:22:12 2024] nvmet_tcp: failed cmd 00000000de551a59 id 37 opcode 1, data_len: 4096 [Tue Jan 2 10:22:12 2024] nvmet: ctrl 1 reschedule traffic based keep-alive timer [Tue Jan 2 10:22:17 2024] nvmet: ctrl 1 update keep-alive timer for 15 secs ^ permalink raw reply [flat|nested] 28+ messages in thread
* Re: Should NVME_SC_INVALID_NS be translated to BLK_STS_IOERR instead of BLK_STS_NOTSUPP so that multipath(both native and dm) can failover on the failure? 2024-01-02 10:33 ` Jirong Feng @ 2024-01-02 12:46 ` Sagi Grimberg 2024-01-03 10:24 ` Jirong Feng 0 siblings, 1 reply; 28+ messages in thread From: Sagi Grimberg @ 2024-01-02 12:46 UTC (permalink / raw) To: Jirong Feng, Christoph Hellwig, Keith Busch Cc: Jens Axboe, linux-nvme, peng.xiao > in function nvmet_subsys_nsid_exists() added in the patch, remove the > '\n' from line `if (sprintf(name, "%u\n", nsid) <= 0)`, the patch works > exactly as what I expect, fio keeps running and multipath does failover. OK, can you please check nvme native mpath as well? What I suspect will happen is that the host will be unable to fail over, because it will re-read the ana log page, not find anything wrong with the actual path, and hence just retry on the same namespace. That may be transient because there should be an AEN on the way to the host to remove the (path'd) namespace altogether. The status reporting is not consistent with the actual ana state, hence the approach has a semantic problem. Perhaps we want another error that is a path status, but not semantically an ana status. Can you try returning NVME_SC_CTRL_PATH_ERROR instead of NVME_SC_ANA_PERSISTENT_LOSS? ^ permalink raw reply [flat|nested] 28+ messages in thread
* Re: Should NVME_SC_INVALID_NS be translated to BLK_STS_IOERR instead of BLK_STS_NOTSUPP so that multipath(both native and dm) can failover on the failure? 2024-01-02 12:46 ` Sagi Grimberg @ 2024-01-03 10:24 ` Jirong Feng 2024-01-04 11:56 ` Sagi Grimberg 0 siblings, 1 reply; 28+ messages in thread From: Jirong Feng @ 2024-01-03 10:24 UTC (permalink / raw) To: Sagi Grimberg, Christoph Hellwig, Keith Busch Cc: Jens Axboe, linux-nvme, peng.xiao > OK, can you please check nvme native mpath as well? switch to nvme native mpath: [root@fjr-vm1 ~]# nvme list-subsys nvme-subsys0 - NQN=nqn.2014-08.org.nvmexpress:NVMf:uuid:cf4bb93c-949f-4532-a5c1-b8bd267a4e06 \ +- nvme0 tcp traddr=192.168.111.99 trsvcid=4420 live +- nvme1 tcp traddr=192.168.111.111 trsvcid=4420 live [root@fjr-vm1 ~]# multipath -ll uuid.cf4bb93c-949f-4532-a5c1-b8bd267a4e06 [nvme]:nvme0n1 NVMe,Linux,6.6.0-my size=209715200 features='n/a' hwhandler='ANA' wp=rw |-+- policy='n/a' prio=50 status=optimized | `- 0:0:1 nvme0c0n1 0:0 n/a optimized live `-+- policy='n/a' prio=50 status=optimized `- 0:1:1 nvme0c1n1 0:0 n/a optimized live fio still keeps running without any error, just for this time. (see below) host dmesg: [Wed Jan 3 07:42:55 2024] nvme nvme0: reschedule traffic based keep-alive timer [Wed Jan 3 07:42:55 2024] nvme nvme1: reschedule traffic based keep-alive timer [Wed Jan 3 07:43:00 2024] nvme nvme0: reschedule traffic based keep-alive timer [Wed Jan 3 07:43:00 2024] nvme nvme1: reschedule traffic based keep-alive timer [Wed Jan 3 07:43:05 2024] nvme nvme0: connecting queue 0 [Wed Jan 3 07:43:05 2024] nvme nvme0: ANA group 1: optimized. [Wed Jan 3 07:43:05 2024] nvme nvme0: creating 4 I/O queues. [Wed Jan 3 07:43:05 2024] nvme nvme0: connecting queue 1 [Wed Jan 3 07:43:05 2024] nvme nvme0: connecting queue 2 [Wed Jan 3 07:43:05 2024] nvme nvme0: connecting queue 3 [Wed Jan 3 07:43:05 2024] nvme nvme0: connecting queue 4 [Wed Jan 3 07:43:05 2024] nvme nvme0: rescanning namespaces. 
[Wed Jan 3 07:43:05 2024] nvme nvme0: connecting queue 0 [Wed Jan 3 07:43:05 2024] nvme nvme0: ANA group 1: optimized. [Wed Jan 3 07:43:05 2024] nvme nvme0: creating 4 I/O queues. [Wed Jan 3 07:43:05 2024] nvme nvme0: connecting queue 1 [Wed Jan 3 07:43:05 2024] nvme nvme0: connecting queue 2 [Wed Jan 3 07:43:05 2024] nvme nvme0: connecting queue 3 [Wed Jan 3 07:43:05 2024] nvme nvme0: connecting queue 4 [Wed Jan 3 07:43:05 2024] nvme nvme1: reschedule traffic based keep-alive timer [Wed Jan 3 07:43:10 2024] nvme nvme0: reschedule traffic based keep-alive timer [Wed Jan 3 07:43:10 2024] nvme nvme1: reschedule traffic based keep-alive timer target dmesg: [Wed Jan 3 07:41:23 2024] nvmet: ctrl 1 update keep-alive timer for 15 secs [Wed Jan 3 07:41:33 2024] nvmet: ctrl 1 update keep-alive timer for 15 secs [Wed Jan 3 07:41:43 2024] nvmet: ctrl 1 update keep-alive timer for 15 secs [Wed Jan 3 07:41:58 2024] nvmet: ctrl 1 reschedule traffic based keep-alive timer [Wed Jan 3 07:42:14 2024] nvmet: ctrl 1 reschedule traffic based keep-alive timer [Wed Jan 3 07:42:29 2024] nvmet: ctrl 1 reschedule traffic based keep-alive timer [Wed Jan 3 07:42:44 2024] nvmet: ctrl 1 reschedule traffic based keep-alive timer [Wed Jan 3 07:43:00 2024] nvmet: ctrl 1 reschedule traffic based keep-alive timer [Wed Jan 3 07:43:04 2024] nvmet: fjr add: returning NVME_ANA_PERSISTENT_LOSS [Wed Jan 3 07:43:04 2024] nvmet_tcp: failed cmd 0000000034dfe760 id 14 opcode 1, data_len: 4096 [Wed Jan 3 07:43:04 2024] nvmet: got cmd 12 while CC.EN == 0 on qid = 0 [Wed Jan 3 07:43:04 2024] nvmet_tcp: failed cmd 00000000228b330a id 31 opcode 12, data_len: 0 [Wed Jan 3 07:43:04 2024] nvmet: ctrl 2 start keep-alive timer for 15 secs [Wed Jan 3 07:43:04 2024] nvmet: ctrl 1 stop keep-alive [Wed Jan 3 07:43:04 2024] nvmet: creating nvm controller 2 for subsystem nqn.2014-08.org.nvmexpress:NVMf:uuid:cf4bb93c-949f-4532-a5c1-b8bd267a4e06 for NQN nqn.2014-08.org.nvmexpress:uuid:1d8f7c82-9deb-4bc8-8292-5ff32ee3a2be. 
[Wed Jan 3 07:43:04 2024] nvmet: adding queue 1 to ctrl 2. [Wed Jan 3 07:43:04 2024] nvmet: adding queue 2 to ctrl 2. [Wed Jan 3 07:43:04 2024] nvmet: adding queue 3 to ctrl 2. [Wed Jan 3 07:43:04 2024] nvmet: adding queue 4 to ctrl 2. [Wed Jan 3 07:43:04 2024] nvmet: fjr add: returning NVME_ANA_PERSISTENT_LOSS [Wed Jan 3 07:43:04 2024] nvmet_tcp: failed cmd 00000000d9d3dba9 id 100 opcode 1, data_len: 4096 [Wed Jan 3 07:43:04 2024] nvmet: ctrl 1 start keep-alive timer for 15 secs [Wed Jan 3 07:43:04 2024] nvmet: ctrl 2 stop keep-alive [Wed Jan 3 07:43:04 2024] nvmet: creating nvm controller 1 for subsystem nqn.2014-08.org.nvmexpress:NVMf:uuid:cf4bb93c-949f-4532-a5c1-b8bd267a4e06 for NQN nqn.2014-08.org.nvmexpress:uuid:1d8f7c82-9deb-4bc8-8292-5ff32ee3a2be. [Wed Jan 3 07:43:04 2024] nvmet: adding queue 1 to ctrl 1. [Wed Jan 3 07:43:04 2024] nvmet: adding queue 2 to ctrl 1. [Wed Jan 3 07:43:04 2024] nvmet: adding queue 3 to ctrl 1. [Wed Jan 3 07:43:04 2024] nvmet: adding queue 4 to ctrl 1. [Wed Jan 3 07:43:14 2024] nvmet: ctrl 1 update keep-alive timer for 15 secs > > Can you try returning NVME_SC_CTRL_PATH_ERROR instead of > NVME_SC_ANA_PERSISTENT_LOSS ? I enabled/disabled again and again, found that fio keeps running for most time, but occasionally(about 10% or less) fails and stops with error. fio: io_u error on file /dev/nvme0n1: Input/output error: write offset=100662296576, buflen=4096 fio: pid=1485, err=5/file:io_u.c:1747, func=io_u error, error=Input/output error fio_iops: (groupid=0, jobs=1): err= 5 (file:io_u.c:1747, func=io_u error, error=Input/output error): pid=1485: Wed Jan 3 08:44:09 2024 host dmesg: [Wed Jan 3 08:44:06 2024] nvme nvme1: reschedule traffic based keep-alive timer [Wed Jan 3 08:44:07 2024] nvme nvme0: reschedule traffic based keep-alive timer [Wed Jan 3 08:44:09 2024] nvme nvme0: connecting queue 0 [Wed Jan 3 08:44:09 2024] nvme nvme0: ANA group 1: optimized. [Wed Jan 3 08:44:09 2024] nvme nvme0: creating 4 I/O queues. 
[Wed Jan 3 08:44:09 2024] nvme nvme0: connecting queue 1 [Wed Jan 3 08:44:09 2024] nvme nvme0: connecting queue 2 [Wed Jan 3 08:44:09 2024] nvme nvme0: connecting queue 3 [Wed Jan 3 08:44:09 2024] nvme nvme0: connecting queue 4 [Wed Jan 3 08:44:09 2024] nvme nvme0: rescanning namespaces. [Wed Jan 3 08:44:09 2024] Buffer I/O error on dev nvme0n1, logical block 0, async page read [Wed Jan 3 08:44:09 2024] nvme0n1: unable to read partition table [Wed Jan 3 08:44:09 2024] Buffer I/O error on dev nvme0n1, logical block 6, async page read [Wed Jan 3 08:44:11 2024] nvme nvme1: reschedule traffic based keep-alive timer [Wed Jan 3 08:44:14 2024] nvme nvme0: reschedule traffic based keep-alive timer target dmesg: [Wed Jan 3 08:44:08 2024] nvmet: fjr add: returning NVME_SC_CTRL_PATH_ERROR [Wed Jan 3 08:44:08 2024] nvmet_tcp: failed cmd 00000000c11e0ae7 id 53 opcode 1, data_len: 4096 [Wed Jan 3 08:44:08 2024] nvmet: fjr add: returning NVME_SC_CTRL_PATH_ERROR [Wed Jan 3 08:44:08 2024] nvmet_tcp: failed cmd 00000000e0d12c37 id 54 opcode 1, data_len: 4096 [Wed Jan 3 08:44:08 2024] nvmet: ctrl 2 start keep-alive timer for 15 secs [Wed Jan 3 08:44:08 2024] nvmet: ctrl 1 stop keep-alive [Wed Jan 3 08:44:08 2024] nvmet: creating nvm controller 2 for subsystem nqn.2014-08.org.nvmexpress:NVMf:uuid:cf4bb93c-949f-4532-a5c1-b8bd267a4e06 for NQN nqn.2014-08.org.nvmexpress:uuid:1d8f7c82-9deb-4bc8-8292-5ff32ee3a2be. [Wed Jan 3 08:44:08 2024] nvmet: adding queue 1 to ctrl 2. [Wed Jan 3 08:44:08 2024] nvmet: adding queue 2 to ctrl 2. [Wed Jan 3 08:44:08 2024] nvmet: adding queue 3 to ctrl 2. [Wed Jan 3 08:44:08 2024] nvmet: adding queue 4 to ctrl 2. [Wed Jan 3 08:44:18 2024] nvmet: ctrl 2 update keep-alive timer for 15 secs [Wed Jan 3 08:44:28 2024] nvmet: ctrl 2 update keep-alive timer for 15 secs then back to returning NVME_ANA_PERSISTENT_LOSS, fio occasionally fails too. log output are pretty the same. then back to dm multipath, for about 50 times enable/disable, fio never fails. 
^ permalink raw reply [flat|nested] 28+ messages in thread
* Re: Should NVME_SC_INVALID_NS be translated to BLK_STS_IOERR instead of BLK_STS_NOTSUPP so that multipath(both native and dm) can failover on the failure? 2024-01-03 10:24 ` Jirong Feng @ 2024-01-04 11:56 ` Sagi Grimberg 2024-01-30 9:36 ` Jirong Feng 0 siblings, 1 reply; 28+ messages in thread From: Sagi Grimberg @ 2024-01-04 11:56 UTC (permalink / raw) To: Jirong Feng, Christoph Hellwig, Keith Busch Cc: Jens Axboe, linux-nvme, peng.xiao On 1/3/24 12:24, Jirong Feng wrote: >> OK, can you please check nvme native mpath as well? > > switch to nvme native mpath: > > [root@fjr-vm1 ~]# nvme list-subsys > nvme-subsys0 - > NQN=nqn.2014-08.org.nvmexpress:NVMf:uuid:cf4bb93c-949f-4532-a5c1-b8bd267a4e06 > \ > +- nvme0 tcp traddr=192.168.111.99 trsvcid=4420 live > +- nvme1 tcp traddr=192.168.111.111 trsvcid=4420 live > [root@fjr-vm1 ~]# multipath -ll > uuid.cf4bb93c-949f-4532-a5c1-b8bd267a4e06 [nvme]:nvme0n1 > NVMe,Linux,6.6.0-my > size=209715200 features='n/a' hwhandler='ANA' wp=rw > |-+- policy='n/a' prio=50 status=optimized > | `- 0:0:1 nvme0c0n1 0:0 n/a optimized live > `-+- policy='n/a' prio=50 status=optimized > `- 0:1:1 nvme0c1n1 0:0 n/a optimized live > > fio still keeps running without any error, just for this time. (see below) > > host dmesg: > > [Wed Jan 3 07:42:55 2024] nvme nvme0: reschedule traffic based > keep-alive timer > [Wed Jan 3 07:42:55 2024] nvme nvme1: reschedule traffic based > keep-alive timer > [Wed Jan 3 07:43:00 2024] nvme nvme0: reschedule traffic based > keep-alive timer > [Wed Jan 3 07:43:00 2024] nvme nvme1: reschedule traffic based > keep-alive timer > [Wed Jan 3 07:43:05 2024] nvme nvme0: connecting queue 0 > [Wed Jan 3 07:43:05 2024] nvme nvme0: ANA group 1: optimized. > [Wed Jan 3 07:43:05 2024] nvme nvme0: creating 4 I/O queues. 
> [Wed Jan 3 07:43:05 2024] nvme nvme0: connecting queue 1 > [Wed Jan 3 07:43:05 2024] nvme nvme0: connecting queue 2 > [Wed Jan 3 07:43:05 2024] nvme nvme0: connecting queue 3 > [Wed Jan 3 07:43:05 2024] nvme nvme0: connecting queue 4 > [Wed Jan 3 07:43:05 2024] nvme nvme0: rescanning namespaces. > [Wed Jan 3 07:43:05 2024] nvme nvme0: connecting queue 0 > [Wed Jan 3 07:43:05 2024] nvme nvme0: ANA group 1: optimized. > [Wed Jan 3 07:43:05 2024] nvme nvme0: creating 4 I/O queues. > [Wed Jan 3 07:43:05 2024] nvme nvme0: connecting queue 1 > [Wed Jan 3 07:43:05 2024] nvme nvme0: connecting queue 2 > [Wed Jan 3 07:43:05 2024] nvme nvme0: connecting queue 3 > [Wed Jan 3 07:43:05 2024] nvme nvme0: connecting queue 4 > [Wed Jan 3 07:43:05 2024] nvme nvme1: reschedule traffic based > keep-alive timer > [Wed Jan 3 07:43:10 2024] nvme nvme0: reschedule traffic based > keep-alive timer > [Wed Jan 3 07:43:10 2024] nvme nvme1: reschedule traffic based > keep-alive timer > > target dmesg: > > [Wed Jan 3 07:41:23 2024] nvmet: ctrl 1 update keep-alive timer for 15 > secs > [Wed Jan 3 07:41:33 2024] nvmet: ctrl 1 update keep-alive timer for 15 > secs > [Wed Jan 3 07:41:43 2024] nvmet: ctrl 1 update keep-alive timer for 15 > secs > [Wed Jan 3 07:41:58 2024] nvmet: ctrl 1 reschedule traffic based > keep-alive timer > [Wed Jan 3 07:42:14 2024] nvmet: ctrl 1 reschedule traffic based > keep-alive timer > [Wed Jan 3 07:42:29 2024] nvmet: ctrl 1 reschedule traffic based > keep-alive timer > [Wed Jan 3 07:42:44 2024] nvmet: ctrl 1 reschedule traffic based > keep-alive timer > [Wed Jan 3 07:43:00 2024] nvmet: ctrl 1 reschedule traffic based > keep-alive timer > [Wed Jan 3 07:43:04 2024] nvmet: fjr add: returning > NVME_ANA_PERSISTENT_LOSS > [Wed Jan 3 07:43:04 2024] nvmet_tcp: failed cmd 0000000034dfe760 id 14 > opcode 1, data_len: 4096 > [Wed Jan 3 07:43:04 2024] nvmet: got cmd 12 while CC.EN == 0 on qid = 0 > [Wed Jan 3 07:43:04 2024] nvmet_tcp: failed cmd 00000000228b330a id 31 > opcode 
12, data_len: 0 > [Wed Jan 3 07:43:04 2024] nvmet: ctrl 2 start keep-alive timer for 15 secs > [Wed Jan 3 07:43:04 2024] nvmet: ctrl 1 stop keep-alive > [Wed Jan 3 07:43:04 2024] nvmet: creating nvm controller 2 for > subsystem > nqn.2014-08.org.nvmexpress:NVMf:uuid:cf4bb93c-949f-4532-a5c1-b8bd267a4e06 for NQN nqn.2014-08.org.nvmexpress:uuid:1d8f7c82-9deb-4bc8-8292-5ff32ee3a2be. > [Wed Jan 3 07:43:04 2024] nvmet: adding queue 1 to ctrl 2. > [Wed Jan 3 07:43:04 2024] nvmet: adding queue 2 to ctrl 2. > [Wed Jan 3 07:43:04 2024] nvmet: adding queue 3 to ctrl 2. > [Wed Jan 3 07:43:04 2024] nvmet: adding queue 4 to ctrl 2. > [Wed Jan 3 07:43:04 2024] nvmet: fjr add: returning > NVME_ANA_PERSISTENT_LOSS > [Wed Jan 3 07:43:04 2024] nvmet_tcp: failed cmd 00000000d9d3dba9 id 100 > opcode 1, data_len: 4096 > [Wed Jan 3 07:43:04 2024] nvmet: ctrl 1 start keep-alive timer for 15 secs > [Wed Jan 3 07:43:04 2024] nvmet: ctrl 2 stop keep-alive > [Wed Jan 3 07:43:04 2024] nvmet: creating nvm controller 1 for > subsystem > nqn.2014-08.org.nvmexpress:NVMf:uuid:cf4bb93c-949f-4532-a5c1-b8bd267a4e06 for NQN nqn.2014-08.org.nvmexpress:uuid:1d8f7c82-9deb-4bc8-8292-5ff32ee3a2be. > [Wed Jan 3 07:43:04 2024] nvmet: adding queue 1 to ctrl 1. > [Wed Jan 3 07:43:04 2024] nvmet: adding queue 2 to ctrl 1. > [Wed Jan 3 07:43:04 2024] nvmet: adding queue 3 to ctrl 1. > [Wed Jan 3 07:43:04 2024] nvmet: adding queue 4 to ctrl 1. > [Wed Jan 3 07:43:14 2024] nvmet: ctrl 1 update keep-alive timer for 15 > secs > >> >> Can you try returning NVME_SC_CTRL_PATH_ERROR instead of >> NVME_SC_ANA_PERSISTENT_LOSS ? > > I enabled/disabled again and again, found that fio keeps running for > most time, but occasionally(about 10% or less) fails and stops with error. 
> > fio: io_u error on file /dev/nvme0n1: Input/output error: write > offset=100662296576, buflen=4096 > fio: pid=1485, err=5/file:io_u.c:1747, func=io_u error, > error=Input/output error > > fio_iops: (groupid=0, jobs=1): err= 5 (file:io_u.c:1747, func=io_u > error, error=Input/output error): pid=1485: Wed Jan 3 08:44:09 2024 > > host dmesg: > > [Wed Jan 3 08:44:06 2024] nvme nvme1: reschedule traffic based > keep-alive timer > [Wed Jan 3 08:44:07 2024] nvme nvme0: reschedule traffic based > keep-alive timer > [Wed Jan 3 08:44:09 2024] nvme nvme0: connecting queue 0 > [Wed Jan 3 08:44:09 2024] nvme nvme0: ANA group 1: optimized. > [Wed Jan 3 08:44:09 2024] nvme nvme0: creating 4 I/O queues. > [Wed Jan 3 08:44:09 2024] nvme nvme0: connecting queue 1 > [Wed Jan 3 08:44:09 2024] nvme nvme0: connecting queue 2 > [Wed Jan 3 08:44:09 2024] nvme nvme0: connecting queue 3 > [Wed Jan 3 08:44:09 2024] nvme nvme0: connecting queue 4 > [Wed Jan 3 08:44:09 2024] nvme nvme0: rescanning namespaces. > [Wed Jan 3 08:44:09 2024] Buffer I/O error on dev nvme0n1, logical > block 0, async page read > [Wed Jan 3 08:44:09 2024] nvme0n1: unable to read partition table > [Wed Jan 3 08:44:09 2024] Buffer I/O error on dev nvme0n1, logical > block 6, async page read > [Wed Jan 3 08:44:11 2024] nvme nvme1: reschedule traffic based > keep-alive timer > [Wed Jan 3 08:44:14 2024] nvme nvme0: reschedule traffic based > keep-alive timer > > target dmesg: > > [Wed Jan 3 08:44:08 2024] nvmet: fjr add: returning > NVME_SC_CTRL_PATH_ERROR > [Wed Jan 3 08:44:08 2024] nvmet_tcp: failed cmd 00000000c11e0ae7 id 53 > opcode 1, data_len: 4096 > [Wed Jan 3 08:44:08 2024] nvmet: fjr add: returning > NVME_SC_CTRL_PATH_ERROR > [Wed Jan 3 08:44:08 2024] nvmet_tcp: failed cmd 00000000e0d12c37 id 54 > opcode 1, data_len: 4096 > [Wed Jan 3 08:44:08 2024] nvmet: ctrl 2 start keep-alive timer for 15 secs > [Wed Jan 3 08:44:08 2024] nvmet: ctrl 1 stop keep-alive > [Wed Jan 3 08:44:08 2024] nvmet: creating nvm 
controller 2 for > subsystem > nqn.2014-08.org.nvmexpress:NVMf:uuid:cf4bb93c-949f-4532-a5c1-b8bd267a4e06 for NQN nqn.2014-08.org.nvmexpress:uuid:1d8f7c82-9deb-4bc8-8292-5ff32ee3a2be. > [Wed Jan 3 08:44:08 2024] nvmet: adding queue 1 to ctrl 2. > [Wed Jan 3 08:44:08 2024] nvmet: adding queue 2 to ctrl 2. > [Wed Jan 3 08:44:08 2024] nvmet: adding queue 3 to ctrl 2. > [Wed Jan 3 08:44:08 2024] nvmet: adding queue 4 to ctrl 2. > [Wed Jan 3 08:44:18 2024] nvmet: ctrl 2 update keep-alive timer for 15 > secs > [Wed Jan 3 08:44:28 2024] nvmet: ctrl 2 update keep-alive timer for 15 > secs > > then back to returning NVME_ANA_PERSISTENT_LOSS, fio occasionally fails > too. log output are pretty the same. > > then back to dm multipath, for about 50 times enable/disable, fio never > fails. > Hmm, its interesting why you fail only in particular ios and not every io. I suspect that there is a timing issue here. Looking at the code, I suspect that ios continue being sent to the path'd namespace although they shouldn't. The reason is that if we return an ana error, then the host will re-read the ana log page again and find the namespace eligible for IO (the action of disable/enable namespace does not impact the ana log), or, we return a path error which is not ana error, in this case the host will not re-read the ana log page, and the namespace will be re-selected in the next IO (or at least nothing prevents it). First of all, I think that the most suitable status for nvmet to return in this case is: NVME_SC_INTERNAL_PATH_ERROR From the spec: Internal Path Error: The command was not completed as the result of a controller internal error that is specific to the controller processing the command. Retries for the request function should be based on the setting of the DNR bit (refer to Figure 92). In the host code, I don't see any reference to such error status returned by the controller. 
So I think we may want to pair it with something like (this untested hunk): -- diff --git a/drivers/nvme/host/multipath.c b/drivers/nvme/host/multipath.c index 0a88d7bdc5e3..0fb82056ba5f 100644 --- a/drivers/nvme/host/multipath.c +++ b/drivers/nvme/host/multipath.c @@ -97,6 +97,14 @@ void nvme_failover_req(struct request *req) if (nvme_is_ana_error(status) && ns->ctrl->ana_log_buf) { set_bit(NVME_NS_ANA_PENDING, &ns->flags); queue_work(nvme_wq, &ns->ctrl->ana_work); + } else if ((status & 0x7ff) == NVME_SC_INTERNAL_PATH_ERROR) { + /* + * The ctrl is telling us it is unable to reach the + * ns in a way that does not impact the entire ana + * group. The only way we can stop sending io to this + * specific namespace is by clearing its ready bit. + */ + clear_bit(NVME_NS_READY, &ns->flags); } spin_lock_irqsave(&ns->head->requeue_lock, flags); -- Keith, Christoph, do you agree that the host action when it sees an error status like NVME_SC_INTERNAL_PATH_ERROR it needs to stop sending IO to the namespace but not change anything related to ana? ^ permalink raw reply related [flat|nested] 28+ messages in thread
* Re: Should NVME_SC_INVALID_NS be translated to BLK_STS_IOERR instead of BLK_STS_NOTSUPP so that multipath(both native and dm) can failover on the failure? 2024-01-04 11:56 ` Sagi Grimberg @ 2024-01-30 9:36 ` Jirong Feng 2024-01-30 11:29 ` Sagi Grimberg 0 siblings, 1 reply; 28+ messages in thread From: Jirong Feng @ 2024-01-30 9:36 UTC (permalink / raw) To: Sagi Grimberg, Christoph Hellwig, Keith Busch Cc: Jens Axboe, linux-nvme, peng.xiao Now I suspect that my testcase is inappropriate for nvme native multipath. According to the base spec, chapter 2.4.1, nvme native multipath aims at accessing a certain namespace through multiple paths, not at grouping different namespaces into one device. Therefore, in the fabrics case, a namespace must belong to one subsystem on a single target server. Looking at the latest host-side nvme driver code, the host does refuse namespaces reporting the same uuid on two different subsystems (in function nvme_global_check_duplicate_ids), which is exactly what my setup does. The testcase seems to be a misuse of nvme native multipath. However, the testcase is pretty reasonable for dm-mpath. In a cloud scenario, we usually need a volume to be synced and exposed on multiple target servers for high availability reasons. dm-mpath can do that, but only if we choose to group by serial. Namespaces from different subsystems reporting different uuids but the same serial can be recognized as one device by dm-mpath. The only remaining problem seems to be the status code that dm-mpath needs in order to fail over. Native mpath should not encounter this case. Please correct me if I'm wrong :) Thanks ^ permalink raw reply [flat|nested] 28+ messages in thread
* Re: Should NVME_SC_INVALID_NS be translated to BLK_STS_IOERR instead of BLK_STS_NOTSUPP so that multipath(both native and dm) can failover on the failure? 2024-01-30 9:36 ` Jirong Feng @ 2024-01-30 11:29 ` Sagi Grimberg 2024-01-31 6:25 ` Christoph Hellwig 0 siblings, 1 reply; 28+ messages in thread From: Sagi Grimberg @ 2024-01-30 11:29 UTC (permalink / raw) To: Jirong Feng, Christoph Hellwig, Keith Busch Cc: Jens Axboe, linux-nvme, peng.xiao > Now I suspect that my testcase is inappropriate for nvme native multipath. > according to the base spec, chapter 2.4.1, nvme native multipath aims at > accessing a certain namespace through multiple paths, not how to group > different namespaces into one device. Therefore, in fabrics' case, a > namespace must belong to one subsystem on a single target server. Nothing restricts a subsystem to a single server. You can expose the same subsystem from two different servers afair (as well as the same nsid uuid, ana groups etc). > Looking at > the latest code of nvme driver host, the host does refuse those namespaces > reporting the same uuid on two different subsystems(in function > nvme_global_check_duplicate_ids), which is exactly what I'm doing. The > testcase seems to be a misuse of nvme native multipath. multipathing scope is within a subsystem as far as linux is concerned. > However, the testcase is pretty reasonable for dm-mpath. In a cloud > scenario, > we usually need a volume to be synced and exposed on multiple target > servers > for high availability reason. dm-mpath can do that, only if we choose > group by > serial. Namespaces from different subsystems reporting different uuid, but > with same serial, can be recognized as one device by dm-mpath. > > The only problem seems just to be the returning code for dm-mpath to > failover. native mpath should not encounter this case. 
> > please correct me if I'm wrong :) As mentioned, afair (Hannes can correct me if I'm wrong) you can make an nvmet subsystem span more than one server, assuming that the backend device is consistent (i.e. using drbd). The only thing you need to pay attention to is that the cntlid ranges do not overlap across the servers that expose the nvmet subsystem (cntlid_min/max configfs attributes). ^ permalink raw reply [flat|nested] 28+ messages in thread
* Re: Should NVME_SC_INVALID_NS be translated to BLK_STS_IOERR instead of BLK_STS_NOTSUPP so that multipath(both native and dm) can failover on the failure? 2024-01-30 11:29 ` Sagi Grimberg @ 2024-01-31 6:25 ` Christoph Hellwig 2024-03-20 3:17 ` Jirong Feng 0 siblings, 1 reply; 28+ messages in thread From: Christoph Hellwig @ 2024-01-31 6:25 UTC (permalink / raw) To: Sagi Grimberg Cc: Jirong Feng, Christoph Hellwig, Keith Busch, Jens Axboe, linux-nvme, peng.xiao On Tue, Jan 30, 2024 at 01:29:33PM +0200, Sagi Grimberg wrote: > As mentioned, afair (Hannes can correct me if I'm wrong) you can > make an nvmet subsystem span more than 1 server assuming that the > backend device is consistent (i.e. using drbd). The only thing that > you need to pay attention is that the cntlid range is not overlapping > in each of the servers that expose the nvmet subsystem (cntlid_min/max > configfs attributes). You can even make a subsystem span multiple "servers" without shared storage, but in that case you'd better not allow simultaneous access to any given namespace through paths pointing to the different "servers". ANA comes in pretty handy for that. ^ permalink raw reply [flat|nested] 28+ messages in thread
* Re: Should NVME_SC_INVALID_NS be translated to BLK_STS_IOERR instead of BLK_STS_NOTSUPP so that multipath(both native and dm) can failover on the failure? 2024-01-31 6:25 ` Christoph Hellwig @ 2024-03-20 3:17 ` Jirong Feng 2024-03-20 8:51 ` Sagi Grimberg 0 siblings, 1 reply; 28+ messages in thread From: Jirong Feng @ 2024-03-20 3:17 UTC (permalink / raw) To: Sagi Grimberg Cc: Keith Busch, Jens Axboe, linux-nvme, peng.xiao, Christoph Hellwig Hi, just kindly asking: how about the previous patch? Will it be merged? ^ permalink raw reply [flat|nested] 28+ messages in thread
* Re: Should NVME_SC_INVALID_NS be translated to BLK_STS_IOERR instead of BLK_STS_NOTSUPP so that multipath(both native and dm) can failover on the failure? 2024-03-20 3:17 ` Jirong Feng @ 2024-03-20 8:51 ` Sagi Grimberg 2024-03-21 3:06 ` Jirong Feng 0 siblings, 1 reply; 28+ messages in thread From: Sagi Grimberg @ 2024-03-20 8:51 UTC (permalink / raw) To: Jirong Feng Cc: Keith Busch, Jens Axboe, linux-nvme, peng.xiao, Christoph Hellwig On 20/03/2024 5:17, Jirong Feng wrote: > hi, > > just kindly ask, how about the previous patch? will it be merged? > Hey Jirong, We do not yet understand if this works for Linux nvme-mpath (which iirc requires a suggested host-side patch). Once we understand that we can take the changes to mainline. ^ permalink raw reply [flat|nested] 28+ messages in thread
* Re: Should NVME_SC_INVALID_NS be translated to BLK_STS_IOERR instead of BLK_STS_NOTSUPP so that multipath(both native and dm) can failover on the failure? 2024-03-20 8:51 ` Sagi Grimberg @ 2024-03-21 3:06 ` Jirong Feng 2024-04-07 22:28 ` Sagi Grimberg 0 siblings, 1 reply; 28+ messages in thread From: Jirong Feng @ 2024-03-21 3:06 UTC (permalink / raw) To: Sagi Grimberg Cc: Keith Busch, Jens Axboe, linux-nvme, peng.xiao, Christoph Hellwig > Hey Jirong, > > We do not yet understand if this works for Linux nvme-mpath (which > iirc requires a suggested host-side patch). > Once we understand that we can take the changes to mainline. > My last test (native multipath) was done on kernel 4.18.0-147.3.1.el8_1, where the result was occasional failure. Referring to your previous reply, this time I changed the cntlid_min/cntlid_max ranges, making it a single subsystem exposed from different targets. Then I retested; here are the results: 1. On kernel 4.18.0-147.3.1.el8_1, the failure still occurs. 2. On kernel 6.6.0, no failure (about 50 runs). 3. On kernel 6.6.0 with your host-side patch applied, no failure. ^ permalink raw reply [flat|nested] 28+ messages in thread
* Re: Should NVME_SC_INVALID_NS be translated to BLK_STS_IOERR instead of BLK_STS_NOTSUPP so that multipath (both native and dm) can failover on the failure?
  2024-03-21  3:06 ` Jirong Feng
@ 2024-04-07 22:28 ` Sagi Grimberg
  2024-04-12  7:52   ` Jirong Feng
  0 siblings, 1 reply; 28+ messages in thread
From: Sagi Grimberg @ 2024-04-07 22:28 UTC (permalink / raw)
To: Jirong Feng
Cc: Keith Busch, Jens Axboe, linux-nvme, peng.xiao, Christoph Hellwig

On 21/03/2024 5:06, Jirong Feng wrote:
>
> My last test (native multipath) was done on kernel 4.18.0-147.3.1.el8_1;
> the result was an occasional failure.
>
> Following your previous reply to my question, this time I changed the
> cntlid_min/cntlid_max ranges so that the two targets present a single
> subsystem, then retested. Here are the results:
>
> 1. On kernel 4.18.0-147.3.1.el8_1, the failure still occurs.
> 2. On kernel 6.6.0, no failure (about 50 runs).
> 3. On kernel 6.6.0 with your host-side patch applied, no failure.

So essentially there is no need for the host-side patch? Interesting. Are you sure?

Can you please also try with the mpath iopolicy set to round-robin? I'm asking because I cannot understand what is preventing this path from being selected again and again for I/O...

^ permalink raw reply	[flat|nested] 28+ messages in thread
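[Editor's note: for native NVMe multipath, the I/O policy Sagi asks about can be switched at runtime through sysfs. A sketch, assuming the subsystem instance is `nvme-subsys0` (check your own instance name under `/sys/class/nvme-subsystem/`):]

```shell
# Show the available policies; the active one is bracketed
cat /sys/class/nvme-subsystem/nvme-subsys0/iopolicy

# Switch path selection from the default (numa) to round-robin
echo round-robin > /sys/class/nvme-subsystem/nvme-subsys0/iopolicy
```

With round-robin, I/O is spread across all optimized paths instead of sticking to the NUMA-closest one, which makes it much easier to observe whether a failed path keeps being selected.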
* Re: Should NVME_SC_INVALID_NS be translated to BLK_STS_IOERR instead of BLK_STS_NOTSUPP so that multipath (both native and dm) can failover on the failure?
  2024-04-07 22:28 ` Sagi Grimberg
@ 2024-04-12  7:52 ` Jirong Feng
  2024-04-12  8:57   ` Sagi Grimberg
  0 siblings, 1 reply; 28+ messages in thread
From: Jirong Feng @ 2024-04-12  7:52 UTC (permalink / raw)
To: Sagi Grimberg
Cc: Keith Busch, Jens Axboe, linux-nvme, peng.xiao, Christoph Hellwig

> So essentially there is no need for the host-side patch? Interesting.
> Are you sure?

At least no failure has been observed on the newer kernel (6.6.0) so far; I can only say that I've run the test hundreds of times. In addition, I have some scripts that enable/disable the namespace continually, so we can keep observing for a few more days.

> Can you please also try with the mpath iopolicy set to round-robin?

All my previous tests were done with round-robin. I retested again today with both round-robin and numa; the results are still the same.

> I'm asking because I cannot understand what is preventing this path
> from being selected again and again for I/O...

Perhaps we need to dive into the code of the old version (4.18.0-147.3.1.el8_1) and see what's different? Or should I try applying the host-side patch to the old version and test again?

Thanks

^ permalink raw reply	[flat|nested] 28+ messages in thread
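[Editor's note: the thread does not include Jirong's enable/disable scripts; a minimal sketch of such a stress loop on the target side might look like the following, with the NQN and namespace ID (`testnqn`, `1`) being hypothetical:]

```shell
# Toggle an nvmet namespace on and off to exercise host-side failover.
# Each disable should force the host to fail I/O over to the other path;
# each enable should restore the path.
NS=/sys/kernel/config/nvmet/subsystems/testnqn/namespaces/1
while true; do
    echo 0 > "$NS/enable"   # namespace disappears; target returns an error
    sleep 10
    echo 1 > "$NS/enable"   # namespace comes back
    sleep 10
done
```

Run this on one target while fio keeps issuing reads on the host's multipath device; a correct failover setup keeps fio running through every toggle.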
* Re: Should NVME_SC_INVALID_NS be translated to BLK_STS_IOERR instead of BLK_STS_NOTSUPP so that multipath (both native and dm) can failover on the failure?
  2024-04-12  7:52 ` Jirong Feng
@ 2024-04-12  8:57 ` Sagi Grimberg
  2024-04-22  9:47   ` Sagi Grimberg
  0 siblings, 1 reply; 28+ messages in thread
From: Sagi Grimberg @ 2024-04-12  8:57 UTC (permalink / raw)
To: Jirong Feng
Cc: Keith Busch, Jens Axboe, linux-nvme, peng.xiao, Christoph Hellwig

On 12/04/2024 10:52, Jirong Feng wrote:
>> So essentially there is no need for the host-side patch? Interesting.
>> Are you sure?
>
> At least no failure has been observed on the newer kernel (6.6.0) so
> far; I can only say that I've run the test hundreds of times.
>
> In addition, I have some scripts that enable/disable the namespace
> continually, so we can keep observing for a few more days.
>
>> Can you please also try with the mpath iopolicy set to round-robin?
>
> All my previous tests were done with round-robin. I retested again
> today with both round-robin and numa; the results are still the same.
>
>> I'm asking because I cannot understand what is preventing this path
>> from being selected again and again for I/O...
>
> Perhaps we need to dive into the code of the old version
> (4.18.0-147.3.1.el8_1) and see what's different?
>
> Or should I try applying the host-side patch to the old version and
> test again?

What I think you want is to trace whether the path on which you disabled the namespace is actually being selected over and over again, and failed over...

Can you please activate tracing and see where your mpath commands are actually being sent? I'd trace nvme_setup_cmd and check that, once you disable one nvmet ns, it is no longer selected by the mpath namespace as a valid ns.

^ permalink raw reply	[flat|nested] 28+ messages in thread
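[Editor's note: the tracing Sagi suggests uses the kernel's nvme tracepoints via tracefs. A sketch; on older kernels the mount point may be `/sys/kernel/debug/tracing` instead of `/sys/kernel/tracing`:]

```shell
# Enable the nvme_setup_cmd tracepoint on the host
echo 1 > /sys/kernel/tracing/events/nvme/nvme_setup_cmd/enable

# Stream trace output live: each line shows which controller and
# namespace a command was issued to. Disable the nvmet namespace on
# one target and check whether commands keep going to the dead path.
cat /sys/kernel/tracing/trace_pipe
```

When done, disable the event again (`echo 0` to the same file) to stop the trace overhead.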
* Re: Should NVME_SC_INVALID_NS be translated to BLK_STS_IOERR instead of BLK_STS_NOTSUPP so that multipath (both native and dm) can failover on the failure?
  2024-04-12  8:57 ` Sagi Grimberg
@ 2024-04-22  9:47 ` Sagi Grimberg
  2024-04-23  3:15   ` Jirong Feng
  0 siblings, 1 reply; 28+ messages in thread
From: Sagi Grimberg @ 2024-04-22  9:47 UTC (permalink / raw)
To: Jirong Feng
Cc: Keith Busch, Jens Axboe, linux-nvme, peng.xiao, Christoph Hellwig

On 12/04/2024 11:57, Sagi Grimberg wrote:
> What I think you want is to trace whether the path on which you
> disabled the namespace is actually being selected over and over again,
> and failed over...
>
> Can you please activate tracing and see where your mpath commands are
> actually being sent?
>
> I'd trace nvme_setup_cmd and check that, once you disable one nvmet
> ns, it is no longer selected by the mpath namespace as a valid ns.

Any update on this?

^ permalink raw reply	[flat|nested] 28+ messages in thread
* Re: Should NVME_SC_INVALID_NS be translated to BLK_STS_IOERR instead of BLK_STS_NOTSUPP so that multipath (both native and dm) can failover on the failure?
  2024-04-22  9:47 ` Sagi Grimberg
@ 2024-04-23  3:15 ` Jirong Feng
  0 siblings, 0 replies; 28+ messages in thread
From: Jirong Feng @ 2024-04-23  3:15 UTC (permalink / raw)
To: Sagi Grimberg
Cc: Keith Busch, Jens Axboe, linux-nvme, peng.xiao, Christoph Hellwig

Sorry for not replying in time.

> Any update on this?

Not yet; I've just been too busy with work these days. I'll find some time this week. Thanks for your attention to this case.

^ permalink raw reply	[flat|nested] 28+ messages in thread
end of thread, other threads: [~2024-04-23  3:16 UTC | newest]

Thread overview: 28+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2023-12-04  7:58 Should NVME_SC_INVALID_NS be translated to BLK_STS_IOERR instead of BLK_STS_NOTSUPP so that multipath (both native and dm) can failover on the failure? Jirong Feng
2023-12-04  8:47 ` Sagi Grimberg
2023-12-05  3:54   ` Jirong Feng
2023-12-05  4:37 ` Keith Busch
2023-12-05  4:40   ` Christoph Hellwig
2023-12-05  5:18     ` Keith Busch
2023-12-05  7:06       ` Jirong Feng
2023-12-05  8:50         ` Sagi Grimberg
2023-12-25 11:25           ` Jirong Feng
2023-12-25 11:40           ` Sagi Grimberg
2023-12-25 12:14             ` Jirong Feng
2023-12-26 13:27             ` Jirong Feng
2024-01-01  9:51               ` Sagi Grimberg
2024-01-02 10:33                 ` Jirong Feng
2024-01-02 12:46                   ` Sagi Grimberg
2024-01-03 10:24                     ` Jirong Feng
2024-01-04 11:56                       ` Sagi Grimberg
2024-01-30  9:36                         ` Jirong Feng
2024-01-30 11:29                           ` Sagi Grimberg
2024-01-31  6:25                             ` Christoph Hellwig
2024-03-20  3:17                               ` Jirong Feng
2024-03-20  8:51                                 ` Sagi Grimberg
2024-03-21  3:06                                   ` Jirong Feng
2024-04-07 22:28                                     ` Sagi Grimberg
2024-04-12  7:52                                       ` Jirong Feng
2024-04-12  8:57                                         ` Sagi Grimberg
2024-04-22  9:47                                           ` Sagi Grimberg
2024-04-23  3:15                                             ` Jirong Feng