* blktests failures with v6.11-rc1 kernel
@ 2024-08-02  9:09 Shinichiro Kawasaki
  2024-08-02 12:16 ` Nilay Shroff
  0 siblings, 1 reply; 9+ messages in thread
From: Shinichiro Kawasaki @ 2024-08-02  9:09 UTC (permalink / raw)
  To: linux-block@vger.kernel.org, linux-nvme@lists.infradead.org,
	linux-scsi@vger.kernel.org, nbd@other.debian.org,
	linux-rdma@vger.kernel.org

Hi all,

I ran the latest blktests (git hash: 25efe2a1948d) with the v6.11-rc1 kernel.
I also checked the CKI project run results with that kernel. In total, I
observed three failures, as listed below.

Compared with the previous report for the v6.10 kernel [1], the two failures
of dm/002 and nbd/001,002 were addressed by blktests-side fixes. The srp/002
failure with the v6.10 kernel was addressed by a kernel-side fix (thanks!).
However, srp/002 showed a new failure symptom with the v6.11-rc1 kernel.

[1] https://lore.kernel.org/linux-block/ym5pkn7dam4vb7zmeegba4hq2avkvirjyojo4aaveseag2xyvw@j5auxpxbdkpf/

List of failures
================
#1: nvme/041 (fc transport)
#2: srp/002
#3: nvme/052 (CKI failure)


Failure description
===================

#1: nvme/041 (fc transport)

   With the trtype=fc configuration, nvme/041 fails:

  nvme/041 (Create authenticated connections)                  [failed]
      runtime  2.677s  ...  4.823s
      --- tests/nvme/041.out      2023-11-29 12:57:17.206898664 +0900
      +++ /home/shin/Blktests/blktests/results/nodev/nvme/041.out.bad     2024-03-19 14:50:56.399101323 +0900
      @@ -2,5 +2,5 @@
       Test unauthenticated connection (should fail)
       disconnected 0 controller(s)
       Test authenticated connection
      -disconnected 1 controller(s)
      +disconnected 0 controller(s)
       Test complete
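
   For reference, the fc transport here is selected via the usual blktests nvme
   configuration. A typical invocation looks like the following (the exact
   variable name depends on the blktests version; older versions use
   nvme_trtype=fc instead of NVMET_TRTYPES):

     NVMET_TRTYPES=fc ./check nvme/041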

   nvme/044 had the same failure symptom up to kernel v6.9. A solution was
   suggested and discussed in February 2024 [2].

   [2] https://lore.kernel.org/linux-nvme/20240221132404.6311-1-dwagner@suse.de/

#2: srp/002

   New "atomic queue limits API" was introduce to the scsi sd driver, and it
   created a circular lock dependency. A fix patch candidate is available [3].

   [3] https://lore.kernel.org/linux-block/20240801054234.540532-1-shinichiro.kawasaki@wdc.com/

#3: nvme/052 (CKI failure)

   The CKI project reported that nvme/052 fails occasionally [4].
   This needs further debug effort.

  nvme/052 (tr=loop) (Test file-ns creation/deletion under one subsystem) [failed]
      runtime    ...  22.209s
      --- tests/nvme/052.out	2024-07-30 18:38:29.041716566 -0400
      +++ /mnt/tests/gitlab.com/redhat/centos-stream/tests/kernel/kernel-tests/-/archive/production/kernel-tests-production.zip/storage/blktests/nvme/nvme-loop/blktests/results/nodev_tr_loop/nvme/052.out.bad	2024-07-30 18:45:35.438067452 -0400
      @@ -1,2 +1,4 @@
       Running nvme/052
      +cat: /sys/block/nvme1n2/uuid: No such file or directory
      +cat: /sys/block/nvme1n2/uuid: No such file or directory
       Test complete

   [4] https://datawarehouse.cki-project.org/kcidb/tests/13669275


* Re: blktests failures with v6.11-rc1 kernel
  2024-08-02  9:09 blktests failures with v6.11-rc1 kernel Shinichiro Kawasaki
@ 2024-08-02 12:16 ` Nilay Shroff
  2024-08-02 12:34   ` Shinichiro Kawasaki
  0 siblings, 1 reply; 9+ messages in thread
From: Nilay Shroff @ 2024-08-02 12:16 UTC (permalink / raw)
  To: Shinichiro Kawasaki, linux-block@vger.kernel.org,
	linux-nvme@lists.infradead.org, linux-scsi@vger.kernel.org,
	nbd@other.debian.org, linux-rdma@vger.kernel.org



On 8/2/24 14:39, Shinichiro Kawasaki wrote:
> 
> #3: nvme/052 (CKI failure)
> 
>    The CKI project reported that nvme/052 fails occasionally [4].
>    This needs further debug effort.
> 
>   nvme/052 (tr=loop) (Test file-ns creation/deletion under one subsystem) [failed]
>       runtime    ...  22.209s
>       --- tests/nvme/052.out	2024-07-30 18:38:29.041716566 -0400
>       +++ /mnt/tests/gitlab.com/redhat/centos-stream/tests/kernel/kernel-tests/-/archive/production/kernel-tests-production.zip/storage/blktests/nvme/nvme-loop/blktests/results/nodev_tr_loop/nvme/052.out.bad	2024-07-30 18:45:35.438067452 -0400
>       @@ -1,2 +1,4 @@
>        Running nvme/052
>       +cat: /sys/block/nvme1n2/uuid: No such file or directory
>       +cat: /sys/block/nvme1n2/uuid: No such file or directory
>        Test complete
> 
>    [4] https://datawarehouse.cki-project.org/kcidb/tests/13669275

I just checked the console logs of nvme/052, and from the logs it's
apparent that all namespaces were created successfully, so it's strange
that the test couldn't access "/sys/block/nvme1n2/uuid". Do you know
whether there's any chance of simultaneous blktests runs on this machine?
 
On my test machine, I couldn't reproduce this issue on 6.11-rc1 kernel.

Thanks,
--Nilay
 


* Re: blktests failures with v6.11-rc1 kernel
  2024-08-02 12:16 ` Nilay Shroff
@ 2024-08-02 12:34   ` Shinichiro Kawasaki
  2024-08-02 16:49     ` Nilay Shroff
  0 siblings, 1 reply; 9+ messages in thread
From: Shinichiro Kawasaki @ 2024-08-02 12:34 UTC (permalink / raw)
  To: Nilay Shroff
  Cc: linux-block@vger.kernel.org, linux-nvme@lists.infradead.org,
	linux-scsi@vger.kernel.org, nbd@other.debian.org,
	linux-rdma@vger.kernel.org, Yi Zhang

CC+: Yi Zhang,

On Aug 02, 2024 / 17:46, Nilay Shroff wrote:
> 
> 
> On 8/2/24 14:39, Shinichiro Kawasaki wrote:
> > 
> > #3: nvme/052 (CKI failure)
> > 
> >    The CKI project reported that nvme/052 fails occasionally [4].
> >    This needs further debug effort.
> > 
> >   nvme/052 (tr=loop) (Test file-ns creation/deletion under one subsystem) [failed]
> >       runtime    ...  22.209s
> >       --- tests/nvme/052.out	2024-07-30 18:38:29.041716566 -0400
> >       +++ /mnt/tests/gitlab.com/redhat/centos-stream/tests/kernel/kernel-tests/-/archive/production/kernel-tests-production.zip/storage/blktests/nvme/nvme-loop/blktests/results/nodev_tr_loop/nvme/052.out.bad	2024-07-30 18:45:35.438067452 -0400
> >       @@ -1,2 +1,4 @@
> >        Running nvme/052
> >       +cat: /sys/block/nvme1n2/uuid: No such file or directory
> >       +cat: /sys/block/nvme1n2/uuid: No such file or directory
> >        Test complete
> > 
> >    [4] https://datawarehouse.cki-project.org/kcidb/tests/13669275
> 
> I just checked the console logs of the nvme/052 and from the logs it's 
> apparent that all namespaces were created successfully and so it's strange
> to see that the test couldn't access "/sys/block/nvme1n2/uuid".

I agree that it's strange. I think the "No such file or directory" error
happened in _find_nvme_ns(), which checks the existence of the uuid file
before running the cat command. I have no idea why the error happens.
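
For reference, the relevant part of _find_nvme_ns() in common/nvme looks
roughly like this (a paraphrased sketch; the loop header in particular is
approximate):

    for ns in /sys/block/nvme*; do
            if ! [[ "${ns}" =~ nvme[0-9]+n[0-9]+ ]]; then
                    continue
            fi
            [ -e "${ns}/uuid" ] || continue    # step 1: the uuid file exists here
            uuid=$(cat "${ns}/uuid")           # step 2: read it (this is what fails)
            if [[ "${subsys_uuid}" == "${uuid}" ]]; then
                    basename "${ns}"
            fi
    done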

> Do you know
> if there's any chance of simultaneous blktests running on this machine?

The error was observed in the CKI project test environment. I'm not sure
whether such simultaneous runs can happen in that environment.

Yi, in case you have any comment, please share.

>  
> On my test machine, I couldn't reproduce this issue on 6.11-rc1 kernel.

I tried on my two test machines (QEMU and bare metal), and couldn't reproduce
it on either.


* Re: blktests failures with v6.11-rc1 kernel
  2024-08-02 12:34   ` Shinichiro Kawasaki
@ 2024-08-02 16:49     ` Nilay Shroff
  2024-08-13  7:06       ` Yi Zhang
  0 siblings, 1 reply; 9+ messages in thread
From: Nilay Shroff @ 2024-08-02 16:49 UTC (permalink / raw)
  To: Shinichiro Kawasaki
  Cc: linux-block@vger.kernel.org, linux-nvme@lists.infradead.org,
	linux-scsi@vger.kernel.org, nbd@other.debian.org,
	linux-rdma@vger.kernel.org, Yi Zhang



On 8/2/24 18:04, Shinichiro Kawasaki wrote:
> CC+: Yi Zhang,
> 
> On Aug 02, 2024 / 17:46, Nilay Shroff wrote:
>>
>>
>> On 8/2/24 14:39, Shinichiro Kawasaki wrote:
>>>
>>> #3: nvme/052 (CKI failure)
>>>
>>>    The CKI project reported that nvme/052 fails occasionally [4].
>>>    This needs further debug effort.
>>>
>>>   nvme/052 (tr=loop) (Test file-ns creation/deletion under one subsystem) [failed]
>>>       runtime    ...  22.209s
>>>       --- tests/nvme/052.out	2024-07-30 18:38:29.041716566 -0400
>>>       +++ /mnt/tests/gitlab.com/redhat/centos-stream/tests/kernel/kernel-tests/-/archive/production/kernel-tests-production.zip/storage/blktests/nvme/nvme-loop/blktests/results/nodev_tr_loop/nvme/052.out.bad	2024-07-30 18:45:35.438067452 -0400
>>>       @@ -1,2 +1,4 @@
>>>        Running nvme/052
>>>       +cat: /sys/block/nvme1n2/uuid: No such file or directory
>>>       +cat: /sys/block/nvme1n2/uuid: No such file or directory
>>>        Test complete
>>>
>>>    [4] https://datawarehouse.cki-project.org/kcidb/tests/13669275
>>
>> I just checked the console logs of the nvme/052 and from the logs it's 
>> apparent that all namespaces were created successfully and so it's strange
>> to see that the test couldn't access "/sys/block/nvme1n2/uuid".
> 
> I agree that it's strange. I think the "No such file or directory" error
> happened in _find_nvme_ns(), and it checks existence of the uuid file before
> the cat command. I have no idea why the error happens.
> 
Yes, exactly, and these two operations (checking the existence of the uuid
file and running the cat command) are not atomic. So the only plausible theory
I have at this time is that if the namespace is deleted after the uuid
existence check but before the cat command is executed, then this issue may
manifest. Furthermore, as you mentioned, this issue is seen on the test
machine only occasionally, so I asked whether there's a possibility of
simultaneous blktests or some other tests running on this system.
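
To spell out the window (a hypothetical interleaving for illustration, not
actual blktests code):

    ns=/sys/block/nvme1n2
    if [ -e "${ns}/uuid" ]; then       # existence check passes here
            # ... async namespace removal completes and deletes ${ns} ...
            uuid=$(cat "${ns}/uuid")   # cat: .../uuid: No such file or directory
    fi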

Thanks,
--Nilay


* Re: blktests failures with v6.11-rc1 kernel
  2024-08-02 16:49     ` Nilay Shroff
@ 2024-08-13  7:06       ` Yi Zhang
  2024-08-14 12:35         ` Nilay Shroff
  0 siblings, 1 reply; 9+ messages in thread
From: Yi Zhang @ 2024-08-13  7:06 UTC (permalink / raw)
  To: Nilay Shroff
  Cc: Shinichiro Kawasaki, linux-block@vger.kernel.org,
	linux-nvme@lists.infradead.org, linux-scsi@vger.kernel.org,
	nbd@other.debian.org, linux-rdma@vger.kernel.org

On Sat, Aug 3, 2024 at 12:49 AM Nilay Shroff <nilay@linux.ibm.com> wrote:
>
>
>
> On 8/2/24 18:04, Shinichiro Kawasaki wrote:
> > CC+: Yi Zhang,
> >
> > On Aug 02, 2024 / 17:46, Nilay Shroff wrote:
> >>
> >>
> >> On 8/2/24 14:39, Shinichiro Kawasaki wrote:
> >>>
> >>> #3: nvme/052 (CKI failure)
> >>>
> >>>    The CKI project reported that nvme/052 fails occasionally [4].
> >>>    This needs further debug effort.
> >>>
> >>>   nvme/052 (tr=loop) (Test file-ns creation/deletion under one subsystem) [failed]
> >>>       runtime    ...  22.209s
> >>>       --- tests/nvme/052.out        2024-07-30 18:38:29.041716566 -0400
> >>>       +++ /mnt/tests/gitlab.com/redhat/centos-stream/tests/kernel/kernel-tests/-/archive/production/kernel-tests-production.zip/storage/blktests/nvme/nvme-loop/blktests/results/nodev_tr_loop/nvme/052.out.bad     2024-07-30 18:45:35.438067452 -0400
> >>>       @@ -1,2 +1,4 @@
> >>>        Running nvme/052
> >>>       +cat: /sys/block/nvme1n2/uuid: No such file or directory
> >>>       +cat: /sys/block/nvme1n2/uuid: No such file or directory
> >>>        Test complete
> >>>
> >>>    [4] https://datawarehouse.cki-project.org/kcidb/tests/13669275
> >>
> >> I just checked the console logs of the nvme/052 and from the logs it's
> >> apparent that all namespaces were created successfully and so it's strange
> >> to see that the test couldn't access "/sys/block/nvme1n2/uuid".
> >
> > I agree that it's strange. I think the "No such file or directory" error
> > happened in _find_nvme_ns(), and it checks existence of the uuid file before
> > the cat command. I have no idea why the error happens.
> >
> Yes exactly, and these two operations (checking the existence of uuid
> and cat command) are not atomic. So the only plausible theory I have at this
> time is "if namespace is deleted after checking the existence of uuid but
> before cat command is executed" then this issue may potentially manifests.
> Furthermore, as you mentioned, this issue is seen on the test machine
> occasionally, so I asked if there's a possibility of simultaneous blktest
> or some other tests running on this system.

There are no simultaneous tests running during the CKI test runs.
I reproduced the failure on that server, and it can always be reproduced
within 5 runs:
# sh a.sh
==============================0
nvme/052 (tr=loop) (Test file-ns creation/deletion under one subsystem) [passed]
    runtime  21.496s  ...  21.398s
==============================1
nvme/052 (tr=loop) (Test file-ns creation/deletion under one subsystem) [failed]
    runtime  21.398s  ...  21.974s
    --- tests/nvme/052.out 2024-08-10 00:30:06.989814226 -0400
    +++ /root/blktests/results/nodev_tr_loop/nvme/052.out.bad
2024-08-13 02:53:51.635047928 -0400
    @@ -1,2 +1,5 @@
     Running nvme/052
    +cat: /sys/block/nvme1n2/uuid: No such file or directory
    +cat: /sys/block/nvme1n2/uuid: No such file or directory
    +cat: /sys/block/nvme1n2/uuid: No such file or directory
     Test complete
# uname -r
6.11.0-rc3
[root@hpe-rl300gen11-04 blktests]# lsblk
NAME                                MAJ:MIN RM   SIZE RO TYPE MOUNTPOINTS
zram0                               252:0    0     8G  0 disk [SWAP]
nvme0n1                             259:0    0 447.1G  0 disk
├─nvme0n1p1                         259:1    0   600M  0 part /boot/efi
├─nvme0n1p2                         259:2    0     1G  0 part /boot
└─nvme0n1p3                         259:3    0 445.5G  0 part
  └─fedora_hpe--rl300gen11--04-root 253:0    0 445.5G  0 lvm  /
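
(a.sh simply repeats the test case; the actual script is not shown, but a
hypothetical reconstruction that matches the output above would be:)

    #!/bin/sh
    # hypothetical reconstruction of a.sh: repeat nvme/052 with the loop
    # transport and print an iteration marker before each run
    for i in $(seq 0 4); do
            echo "==============================$i"
            NVMET_TRTYPES=loop ./check nvme/052
    done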


>
> Thanks,
> --Nilay
>


-- 
Best Regards,
  Yi Zhang



* Re: blktests failures with v6.11-rc1 kernel
  2024-08-13  7:06       ` Yi Zhang
@ 2024-08-14 12:35         ` Nilay Shroff
  2024-08-19 12:34           ` Shinichiro Kawasaki
  0 siblings, 1 reply; 9+ messages in thread
From: Nilay Shroff @ 2024-08-14 12:35 UTC (permalink / raw)
  To: Yi Zhang
  Cc: Shinichiro Kawasaki, linux-block@vger.kernel.org,
	linux-nvme@lists.infradead.org, linux-scsi@vger.kernel.org,
	nbd@other.debian.org, linux-rdma@vger.kernel.org



On 8/13/24 12:36, Yi Zhang wrote:
> On Sat, Aug 3, 2024 at 12:49 AM Nilay Shroff <nilay@linux.ibm.com> wrote:
> 
> There are no simultaneous tests during the CKI tests running.
> I reproduced the failure on that server and always can be reproduced
> within 5 times:
> # sh a.sh
> ==============================0
> nvme/052 (tr=loop) (Test file-ns creation/deletion under one subsystem) [passed]
>     runtime  21.496s  ...  21.398s
> ==============================1
> nvme/052 (tr=loop) (Test file-ns creation/deletion under one subsystem) [failed]
>     runtime  21.398s  ...  21.974s
>     --- tests/nvme/052.out 2024-08-10 00:30:06.989814226 -0400
>     +++ /root/blktests/results/nodev_tr_loop/nvme/052.out.bad
> 2024-08-13 02:53:51.635047928 -0400
>     @@ -1,2 +1,5 @@
>      Running nvme/052
>     +cat: /sys/block/nvme1n2/uuid: No such file or directory
>     +cat: /sys/block/nvme1n2/uuid: No such file or directory
>     +cat: /sys/block/nvme1n2/uuid: No such file or directory
>      Test complete
> # uname -r
> 6.11.0-rc3

We may need to debug this further. Is it possible to patch blktests and
collect some details when this issue manifests? If so, could you please
apply the diff below and re-run your test? This patch captures the output
of "nvme list" and the sysfs attribute tree created under the namespace head
node, and stores those details in the 052.full file.

diff --git a/common/nvme b/common/nvme
index 9e78f3e..780b5e3 100644
--- a/common/nvme
+++ b/common/nvme
@@ -589,8 +589,23 @@ _find_nvme_ns() {
                if ! [[ "${ns}" =~ nvme[0-9]+n[0-9]+ ]]; then
                        continue
                fi
+               echo -e "\nBefore ${ns}/uuid check:\n" >> ${FULL}
+               echo -e "\n`nvme list -v`\n" >> ${FULL}
+               echo -e "\n`tree ${ns}`\n" >> ${FULL}
+
                [ -e "${ns}/uuid" ] || continue
                uuid=$(cat "${ns}/uuid")
+
+               if [ "$?" = "1" ]; then
+                       echo -e "\nFailed to read $ns/uuid\n" >> ${FULL}
+                       echo "`nvme list -v`" >> ${FULL}
+                       if [ -d "${ns}" ]; then
+                               echo -e "\n`tree ${ns}`\n" >> ${FULL}
+                       else
+                               echo -e "\n${ns} doesn't exist!\n" >> ${FULL}
+                       fi
+               fi
+
                if [[ "${subsys_uuid}" == "${uuid}" ]]; then
                        basename "${ns}"
                fi


After applying the above diff, when this issue occurs on your system copy this 
file "</path/to/blktests>/results/nodev_tr_loop/nvme/052.full" and send it across. 
This may give us some clue about what might be going wrong. 

Thanks,
--Nilay



* Re: blktests failures with v6.11-rc1 kernel
  2024-08-14 12:35         ` Nilay Shroff
@ 2024-08-19 12:34           ` Shinichiro Kawasaki
  2024-08-19 13:35             ` Nilay Shroff
  0 siblings, 1 reply; 9+ messages in thread
From: Shinichiro Kawasaki @ 2024-08-19 12:34 UTC (permalink / raw)
  To: Nilay Shroff
  Cc: Yi Zhang, linux-block@vger.kernel.org,
	linux-nvme@lists.infradead.org, linux-scsi@vger.kernel.org,
	nbd@other.debian.org, linux-rdma@vger.kernel.org

On Aug 14, 2024 / 18:05, Nilay Shroff wrote:
> 
> 
> On 8/13/24 12:36, Yi Zhang wrote:
> > On Sat, Aug 3, 2024 at 12:49 AM Nilay Shroff <nilay@linux.ibm.com> wrote:
> > 
> > There are no simultaneous tests during the CKI tests running.
> > I reproduced the failure on that server and always can be reproduced
> > within 5 times:
> > # sh a.sh
> > ==============================0
> > nvme/052 (tr=loop) (Test file-ns creation/deletion under one subsystem) [passed]
> >     runtime  21.496s  ...  21.398s
> > ==============================1
> > nvme/052 (tr=loop) (Test file-ns creation/deletion under one subsystem) [failed]
> >     runtime  21.398s  ...  21.974s
> >     --- tests/nvme/052.out 2024-08-10 00:30:06.989814226 -0400
> >     +++ /root/blktests/results/nodev_tr_loop/nvme/052.out.bad
> > 2024-08-13 02:53:51.635047928 -0400
> >     @@ -1,2 +1,5 @@
> >      Running nvme/052
> >     +cat: /sys/block/nvme1n2/uuid: No such file or directory
> >     +cat: /sys/block/nvme1n2/uuid: No such file or directory
> >     +cat: /sys/block/nvme1n2/uuid: No such file or directory
> >      Test complete
> > # uname -r
> > 6.11.0-rc3
> 
> We may need to debug this further. Is it possible to patch blktest and 
> collect some details when this issue manifests? If yes then can you please
> apply the below diff and re-run your test? This patch would capture output 
> of "nvme list" and "sysfs attribute tree created under namespace head node"
> and store those details in 052.full file. 
> 
> diff --git a/common/nvme b/common/nvme
> index 9e78f3e..780b5e3 100644
> --- a/common/nvme
> +++ b/common/nvme
> @@ -589,8 +589,23 @@ _find_nvme_ns() {
>                 if ! [[ "${ns}" =~ nvme[0-9]+n[0-9]+ ]]; then
>                         continue
>                 fi
> +               echo -e "\nBefore ${ns}/uuid check:\n" >> ${FULL}
> +               echo -e "\n`nvme list -v`\n" >> ${FULL}
> +               echo -e "\n`tree ${ns}`\n" >> ${FULL}
> +
>                 [ -e "${ns}/uuid" ] || continue
>                 uuid=$(cat "${ns}/uuid")
> +
> +               if [ "$?" = "1" ]; then
> +                       echo -e "\nFailed to read $ns/uuid\n" >> ${FULL}
> +                       echo "`nvme list -v`" >> ${FULL}
> +                       if [ -d "${ns}" ]; then
> +                               echo -e "\n`tree ${ns}`\n" >> ${FULL}
> +                       else
> +                               echo -e "\n${ns} doesn't exist!\n" >> ${FULL}
> +                       fi
> +               fi
> +
>                 if [[ "${subsys_uuid}" == "${uuid}" ]]; then
>                         basename "${ns}"
>                 fi
> 
> 
> After applying the above diff, when this issue occurs on your system copy this 
> file "</path/to/blktests>/results/nodev_tr_loop/nvme/052.full" and send it across. 
> This may give us some clue about what might be going wrong. 

Nilay, thank you for this suggestion. To follow it, I tried to recreate the
failure again, and managed to do it :) When I repeat the test case 20 or 40
times on one of my test machines, the failure is observed in a stable manner.

I applied your debug patch above to blktests, then repeated the test case.
Unfortunately, the failure disappeared: when I repeated the test case 100
times, the failure was not observed. I guess the debug echos changed the
timing of the accesses to the sysfs uuid file, and the failure disappeared.

This helped me think about the cause. The test case repeats _create_nvmet_ns
and _remove_nvmet_ns, so it repeatedly creates and removes the sysfs uuid
file. I guess that when _remove_nvmet_ns echoes 0 to ${nvmet_ns_path}/enable
to remove the namespace, it does not wait for the removal work to complete.
Then, when _find_nvme_ns() checks the existence of the sysfs uuid file, it
sees the sysfs uuid file that the previous _remove_nvmet_ns left behind. When
it then cats the sysfs uuid file, it fails because the file has been removed
before it is recreated by the next _create_nvmet_ns.
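
For reference, _remove_nvmet_ns roughly does the following (paraphrased from
common/nvme; details may differ), and nothing here waits for the host side
namespace device to go away:

    _remove_nvmet_ns() {
            local nvmet_subsystem="$1"
            local nsid="$2"
            local cfs_path="/sys/kernel/config/nvmet"    # NVMET_CFS in blktests
            local nvmet_ns_path="${cfs_path}/subsystems/${nvmet_subsystem}/namespaces/${nsid}"

            echo 0 > "${nvmet_ns_path}/enable"   # disable the target namespace
            rmdir "${nvmet_ns_path}"             # remove its configfs directory
    }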

Based on this guess, I created the patch below. It modifies the test case to
wait until the namespace device disappears after calling _remove_nvmet_ns. (I
assume that the sysfs uuid file disappears when the device file disappears.)
With this patch, the failure was not observed when repeating the test 100
times. I also reverted the kernel commit ff0ffe5b7c3c ("nvme: fix namespace
removal list") from v6.11-rc4, and confirmed that the test case with this
change can still detect the regression.

I will do some more confirmation. If it goes well, I will post this change as
a formal patch.

diff --git a/tests/nvme/052 b/tests/nvme/052
index cf6061a..469cefd 100755
--- a/tests/nvme/052
+++ b/tests/nvme/052
@@ -39,15 +39,32 @@ nvmf_wait_for_ns() {
 		ns=$(_find_nvme_ns "${uuid}")
 	done
 
+	echo "$ns"
 	return 0
 }
 
+nvmf_wait_for_ns_removal() {
+	local ns=$1 i
+
+	for ((i = 0; i < 10; i++)); do
+		if [[ ! -e /dev/$ns ]]; then
+			return
+		fi
+		sleep .1
+		echo "wait removal of $ns" >> "$FULL"
+	done
+
+	if [[ -e /dev/$ns ]]; then
+		echo "Failed to remove the namespace $"
+	fi
+}
+
 test() {
 	echo "Running ${TEST_NAME}"
 
 	_setup_nvmet
 
-	local iterations=20
+	local iterations=20 ns
 
 	_nvmet_target_setup
 
@@ -63,7 +80,7 @@ test() {
 		_create_nvmet_ns "${def_subsysnqn}" "${i}" "$(_nvme_def_file_path).$i" "${uuid}"
 
 		# wait until async request is processed and ns is created
-		nvmf_wait_for_ns "${uuid}"
+		ns=$(nvmf_wait_for_ns "${uuid}")
 		if [ $? -eq 1 ]; then
 			echo "FAIL"
 			rm "$(_nvme_def_file_path).$i"
@@ -71,6 +88,7 @@ test() {
 		fi
 
 		_remove_nvmet_ns "${def_subsysnqn}" "${i}"
+		nvmf_wait_for_ns_removal "$ns"
 		rm "$(_nvme_def_file_path).$i"
 	}
 	done


* Re: blktests failures with v6.11-rc1 kernel
  2024-08-19 12:34           ` Shinichiro Kawasaki
@ 2024-08-19 13:35             ` Nilay Shroff
  2024-08-20  1:50               ` Yi Zhang
  0 siblings, 1 reply; 9+ messages in thread
From: Nilay Shroff @ 2024-08-19 13:35 UTC (permalink / raw)
  To: Shinichiro Kawasaki
  Cc: Yi Zhang, linux-block@vger.kernel.org,
	linux-nvme@lists.infradead.org, linux-scsi@vger.kernel.org,
	nbd@other.debian.org, linux-rdma@vger.kernel.org



On 8/19/24 18:04, Shinichiro Kawasaki wrote:
> On Aug 14, 2024 / 18:05, Nilay Shroff wrote:
>>
>>
>> On 8/13/24 12:36, Yi Zhang wrote:
>>> On Sat, Aug 3, 2024 at 12:49 AM Nilay Shroff <nilay@linux.ibm.com> wrote:
>>>
>>> There are no simultaneous tests during the CKI tests running.
>>> I reproduced the failure on that server and always can be reproduced
>>> within 5 times:
>>> # sh a.sh
>>> ==============================0
>>> nvme/052 (tr=loop) (Test file-ns creation/deletion under one subsystem) [passed]
>>>     runtime  21.496s  ...  21.398s
>>> ==============================1
>>> nvme/052 (tr=loop) (Test file-ns creation/deletion under one subsystem) [failed]
>>>     runtime  21.398s  ...  21.974s
>>>     --- tests/nvme/052.out 2024-08-10 00:30:06.989814226 -0400
>>>     +++ /root/blktests/results/nodev_tr_loop/nvme/052.out.bad
>>> 2024-08-13 02:53:51.635047928 -0400
>>>     @@ -1,2 +1,5 @@
>>>      Running nvme/052
>>>     +cat: /sys/block/nvme1n2/uuid: No such file or directory
>>>     +cat: /sys/block/nvme1n2/uuid: No such file or directory
>>>     +cat: /sys/block/nvme1n2/uuid: No such file or directory
>>>      Test complete
>>> # uname -r
>>> 6.11.0-rc3
>>
>> We may need to debug this further. Is it possible to patch blktest and 
>> collect some details when this issue manifests? If yes then can you please
>> apply the below diff and re-run your test? This patch would capture output 
>> of "nvme list" and "sysfs attribute tree created under namespace head node"
>> and store those details in 052.full file. 
>>
>> diff --git a/common/nvme b/common/nvme
>> index 9e78f3e..780b5e3 100644
>> --- a/common/nvme
>> +++ b/common/nvme
>> @@ -589,8 +589,23 @@ _find_nvme_ns() {
>>                 if ! [[ "${ns}" =~ nvme[0-9]+n[0-9]+ ]]; then
>>                         continue
>>                 fi
>> +               echo -e "\nBefore ${ns}/uuid check:\n" >> ${FULL}
>> +               echo -e "\n`nvme list -v`\n" >> ${FULL}
>> +               echo -e "\n`tree ${ns}`\n" >> ${FULL}
>> +
>>                 [ -e "${ns}/uuid" ] || continue
>>                 uuid=$(cat "${ns}/uuid")
>> +
>> +               if [ "$?" = "1" ]; then
>> +                       echo -e "\nFailed to read $ns/uuid\n" >> ${FULL}
>> +                       echo "`nvme list -v`" >> ${FULL}
>> +                       if [ -d "${ns}" ]; then
>> +                               echo -e "\n`tree ${ns}`\n" >> ${FULL}
>> +                       else
>> +                               echo -e "\n${ns} doesn't exist!\n" >> ${FULL}
>> +                       fi
>> +               fi
>> +
>>                 if [[ "${subsys_uuid}" == "${uuid}" ]]; then
>>                         basename "${ns}"
>>                 fi
>>
>>
>> After applying the above diff, when this issue occurs on your system copy this 
>> file "</path/to/blktests>/results/nodev_tr_loop/nvme/052.full" and send it across. 
>> This may give us some clue about what might be going wrong. 
> 
> Nilay, thank you for this suggestion. To follow it, I tried to recreate the
> failure again, and managed to do it :) When I repeat the test case 20 or 40
> times one of my test machines, the failure is observed in stable manner.

Shinichiro, I am glad that you were able to recreate this issue.

> I applied your debug patch above to blktests, then I repeated the test case.
> Unfortunately, the failure disappeared. When I repeat the test case 100 times,
> the failure was not observed. I guess the echos for debug changed the timing to
> access the sysfs uuid file, then the failure disappeared.

Yes, this could be possible. BTW, Yi tried the same patch, and with the patch
applied, this issue could still be reproduced on Yi's testbed!

> This helped me think about the cause. The test case repeats _create_nvmet_ns
> and _remove_nvmet_ns. Then, it repeats creating and removing the sysfs uuid
> file. I guess when _remove_nvmet_ns echos 0 to ${nvemt_ns_path}/enable to
> remove the namespace, it does not wait for the completion of the removal work.
> Then, when _find_nvme_ns() checks existence of the sysfs uuid file, it refers to
> the sysfs uuid file that the previous _remove_nvmet_ns left. When it does cat
> to the sysfs uuid file, it fails because the sysfs uuid file has got removed,
> before recreating it for the next _create_nvmet_ns.

I agree with your assessment of the plausible cause of this issue. I just reviewed
the nvme target kernel code, and it's now apparent to me that we need to wait for
the removal of the namespace before we re-create the next namespace. I think this
is a miss.
> 
> Based on this guess, I created a patch below. It modifies the test case to wait
> for the namespace device disappears after calling _remove_nvmet_ns. (I assume
> that the sysfs uuid file disappears when the device file disappears). With
> this patch, the failure was not observed by repeating it 100 times. I also
> reverted the kernel commit ff0ffe5b7c3c ("nvme: fix namespace removal list")
> from v6.11-rc4, then confirmed that the test case with this change still can
> detect the regression.
> 
I am pretty sure that your patch would solve this issue.  

> I will do some more confirmation. If it goes well, will post this change as
> a formal patch.
> 
> diff --git a/tests/nvme/052 b/tests/nvme/052
> index cf6061a..469cefd 100755
> --- a/tests/nvme/052
> +++ b/tests/nvme/052
> @@ -39,15 +39,32 @@ nvmf_wait_for_ns() {
>  		ns=$(_find_nvme_ns "${uuid}")
>  	done
>  
> +	echo "$ns"
>  	return 0
>  }
>  
> +nvmf_wait_for_ns_removal() {
> +	local ns=$1 i
> +
> +	for ((i = 0; i < 10; i++)); do
> +		if [[ ! -e /dev/$ns ]]; then
> +			return
> +		fi
> +		sleep .1
> +		echo "wait removal of $ns" >> "$FULL"
> +	done
> +
> +	if [[ -e /dev/$ns ]]; then
> +		echo "Failed to remove the namespace $"
> +	fi
> +}
> +
>  test() {
>  	echo "Running ${TEST_NAME}"
>  
>  	_setup_nvmet
>  
> -	local iterations=20
> +	local iterations=20 ns
>  
>  	_nvmet_target_setup
>  
> @@ -63,7 +80,7 @@ test() {
>  		_create_nvmet_ns "${def_subsysnqn}" "${i}" "$(_nvme_def_file_path).$i" "${uuid}"
>  
>  		# wait until async request is processed and ns is created
> -		nvmf_wait_for_ns "${uuid}"
> +		ns=$(nvmf_wait_for_ns "${uuid}")
>  		if [ $? -eq 1 ]; then
>  			echo "FAIL"
>  			rm "$(_nvme_def_file_path).$i"
> @@ -71,6 +88,7 @@ test() {
>  		fi
>  
>  		_remove_nvmet_ns "${def_subsysnqn}" "${i}"
> +		nvmf_wait_for_ns_removal "$ns"
>  		rm "$(_nvme_def_file_path).$i"
>  	}
>  	done

I think there's a formatting issue in the above patch. I see some stray
characters which you may clean up/fix later when you send the formal patch.

Yi, I think you may also try the above patch on your testbed and confirm the result.

Thanks,
--Nilay


* Re: blktests failures with v6.11-rc1 kernel
  2024-08-19 13:35             ` Nilay Shroff
@ 2024-08-20  1:50               ` Yi Zhang
  0 siblings, 0 replies; 9+ messages in thread
From: Yi Zhang @ 2024-08-20  1:50 UTC (permalink / raw)
  To: Nilay Shroff
  Cc: Shinichiro Kawasaki, linux-block@vger.kernel.org,
	linux-nvme@lists.infradead.org, linux-scsi@vger.kernel.org,
	nbd@other.debian.org, linux-rdma@vger.kernel.org

On Mon, Aug 19, 2024 at 9:35 PM Nilay Shroff <nilay@linux.ibm.com> wrote:
>
>
>
> On 8/19/24 18:04, Shinichiro Kawasaki wrote:
> > On Aug 14, 2024 / 18:05, Nilay Shroff wrote:
> >>
> >>
> >> On 8/13/24 12:36, Yi Zhang wrote:
> >>> On Sat, Aug 3, 2024 at 12:49 AM Nilay Shroff <nilay@linux.ibm.com> wrote:
> >>>
> >>> There are no simultaneous tests during the CKI tests running.
> >>> I reproduced the failure on that server and always can be reproduced
> >>> within 5 times:
> >>> # sh a.sh
> >>> ==============================0
> >>> nvme/052 (tr=loop) (Test file-ns creation/deletion under one subsystem) [passed]
> >>>     runtime  21.496s  ...  21.398s
> >>> ==============================1
> >>> nvme/052 (tr=loop) (Test file-ns creation/deletion under one subsystem) [failed]
> >>>     runtime  21.398s  ...  21.974s
> >>>     --- tests/nvme/052.out 2024-08-10 00:30:06.989814226 -0400
> >>>     +++ /root/blktests/results/nodev_tr_loop/nvme/052.out.bad
> >>> 2024-08-13 02:53:51.635047928 -0400
> >>>     @@ -1,2 +1,5 @@
> >>>      Running nvme/052
> >>>     +cat: /sys/block/nvme1n2/uuid: No such file or directory
> >>>     +cat: /sys/block/nvme1n2/uuid: No such file or directory
> >>>     +cat: /sys/block/nvme1n2/uuid: No such file or directory
> >>>      Test complete
> >>> # uname -r
> >>> 6.11.0-rc3
> >>
> >> We may need to debug this further. Is it possible to patch blktest and
> >> collect some details when this issue manifests? If yes then can you please
> >> apply the below diff and re-run your test? This patch would capture output
> >> of "nvme list" and "sysfs attribute tree created under namespace head node"
> >> and store those details in 052.full file.
> >>
> >> diff --git a/common/nvme b/common/nvme
> >> index 9e78f3e..780b5e3 100644
> >> --- a/common/nvme
> >> +++ b/common/nvme
> >> @@ -589,8 +589,23 @@ _find_nvme_ns() {
> >>                 if ! [[ "${ns}" =~ nvme[0-9]+n[0-9]+ ]]; then
> >>                         continue
> >>                 fi
> >> +               echo -e "\nBefore ${ns}/uuid check:\n" >> ${FULL}
> >> +               echo -e "\n`nvme list -v`\n" >> ${FULL}
> >> +               echo -e "\n`tree ${ns}`\n" >> ${FULL}
> >> +
> >>                 [ -e "${ns}/uuid" ] || continue
> >>                 uuid=$(cat "${ns}/uuid")
> >> +
> >> +               if [ "$?" = "1" ]; then
> >> +                       echo -e "\nFailed to read $ns/uuid\n" >> ${FULL}
> >> +                       echo "`nvme list -v`" >> ${FULL}
> >> +                       if [ -d "${ns}" ]; then
> >> +                               echo -e "\n`tree ${ns}`\n" >> ${FULL}
> >> +                       else
> >> +                               echo -e "\n${ns} doesn't exist!\n" >> ${FULL}
> >> +                       fi
> >> +               fi
> >> +
> >>                 if [[ "${subsys_uuid}" == "${uuid}" ]]; then
> >>                         basename "${ns}"
> >>                 fi
> >>
> >>
> >> After applying the above diff, when this issue occurs on your system copy this
> >> file "</path/to/blktests>/results/nodev_tr_loop/nvme/052.full" and send it across.
> >> This may give us some clue about what might be going wrong.
> >
> > Nilay, thank you for this suggestion. To follow it, I tried to recreate the
> > failure again, and managed to do it :) When I repeat the test case 20 or 40
> > times one of my test machines, the failure is observed in stable manner.
>
> Shinichiro, I am glad that you were able to recreate this issue.
>
> > I applied your debug patch above to blktests, then I repeated the test case.
> > Unfortunately, the failure disappeared. When I repeat the test case 100 times,
> > the failure was not observed. I guess the echos for debug changed the timing to
> > access the sysfs uuid file, then the failure disappeared.
>
> Yes this could be possible. BTW, Yi tried the same patch and with the patch applied,
> this issue could be still reproduced on Yi's testbed!!
> > This helped me think about the cause. The test case repeats _create_nvmet_ns
> > and _remove_nvmet_ns. Then, it repeats creating and removing the sysfs uuid
> > file. I guess when _remove_nvmet_ns echos 0 to ${nvemt_ns_path}/enable to
> > remove the namespace, it does not wait for the completion of the removal work.
> > Then, when _find_nvme_ns() checks existence of the sysfs uuid file, it refers to
> > the sysfs uuid file that the previous _remove_nvmet_ns left. When it does cat
> > to the sysfs uuid file, it fails because the sysfs uuid file has got removed,
> > before recreating it for the next _create_nvmet_ns.
>
> I agree with your assessment about the plausible cause of this issue. I just reviewed
> the nvme target kernel code and it's now apparent to me that we need to wait for the
> removal of the namespace before we re-create the next namespace. I think this is a miss.
> >
> > Based on this guess, I created a patch below. It modifies the test case to wait
> > for the namespace device disappears after calling _remove_nvmet_ns. (I assume
> > that the sysfs uuid file disappears when the device file disappears). With
> > this patch, the failure was not observed by repeating it 100 times. I also
> > reverted the kernel commit ff0ffe5b7c3c ("nvme: fix namespace removal list")
> > from v6.11-rc4, then confirmed that the test case with this change still can
> > detect the regression.
> >
> I am pretty sure that your patch would solve this issue.
>
> > I will do some more confirmation. If it goes well, will post this change as
> > a formal patch.
> >
> > diff --git a/tests/nvme/052 b/tests/nvme/052
> > index cf6061a..469cefd 100755
> > --- a/tests/nvme/052
> > +++ b/tests/nvme/052
> > @@ -39,15 +39,32 @@ nvmf_wait_for_ns() {
> >               ns=$(_find_nvme_ns "${uuid}")
> >       done
> >
> > +     echo "$ns"
> >       return 0
> >  }
> >
> > +nvmf_wait_for_ns_removal() {
> > +     local ns=$1 i
> > +
> > +     for ((i = 0; i < 10; i++)); do
> > +             if [[ ! -e /dev/$ns ]]; then
> > +                     return
> > +             fi
> > +             sleep .1
> > +             echo "wait removal of $ns" >> "$FULL"
> > +     done
> > +
> > +     if [[ -e /dev/$ns ]]; then
> > +             echo "Failed to remove the namespace $"
> > +     fi
> > +}
> > +
> >  test() {
> >       echo "Running ${TEST_NAME}"
> >
> >       _setup_nvmet
> >
> > -     local iterations=20
> > +     local iterations=20 ns
> >
> >       _nvmet_target_setup
> >
> > @@ -63,7 +80,7 @@ test() {
> >               _create_nvmet_ns "${def_subsysnqn}" "${i}" "$(_nvme_def_file_path).$i" "${uuid}"
> >
> >               # wait until async request is processed and ns is created
> > -             nvmf_wait_for_ns "${uuid}"
> > +             ns=$(nvmf_wait_for_ns "${uuid}")
> >               if [ $? -eq 1 ]; then
> >                       echo "FAIL"
> >                       rm "$(_nvme_def_file_path).$i"
> > @@ -71,6 +88,7 @@ test() {
> >               fi
> >
> >               _remove_nvmet_ns "${def_subsysnqn}" "${i}"
> > +             nvmf_wait_for_ns_removal "$ns"
> >               rm "$(_nvme_def_file_path).$i"
> >       }
> >       done
>
> I think there's some formatting issue in the above patch. I see some stray characters
> which you may cleanup/fix later when you send the formal patch.
>
> Yi, I think you may also try the above patch on your testbed and confirm the result.

Nilay/Shinichiro

Confirmed the failure cannot be reproduced with this patch now.

>
> Thanks,
> --Nilay
>


-- 
Best Regards,
  Yi Zhang


