* nvme_tcp: nvme connect failed after execute stress-ng: unshare
From: Yi Zhang @ 2020-09-10  4:35 UTC
To: sagi, gscrivan, colin.king; +Cc: linux-nvme

Hello,

I recently found that nvme-tcp connect always fails [1] after running
stress-ng: unshare [2]. Bisecting shows the failure was introduced by
commit [3]; connecting works fine again after reverting it.
I'm not sure whether this is a test case issue or a kernel issue; could
anyone help take a look?

[1]
# sh test.sh
+ ./stress-ng/stress-ng --unshare 0 --timeout 5 --log-file unshare.log
stress-ng: info:  [355534] dispatching hogs: 32 unshare
stress-ng: info:  [355534] successful run completed in 5.04s
+ modprobe null-blk nr-devices=1
+ modprobe nvmet-tcp
+ modprobe nvme-tcp
+ nvmetcli restore tcp.json
+ nvme connect -t tcp -n nqn.2014-08.org.nvmexpress.discovery -a 127.0.0.1 -s 4420
Failed to write to /dev/nvme-fabrics: Input/output error

# dmesg | tail -9
[  700.012299] null_blk: module loaded
[  700.073415] nvmet: adding nsid 1 to subsystem blktests-subsystem-1
[  700.073923] nvmet_tcp: enabling port 0 (127.0.0.1:4420)
[  715.291020] nvmet: ctrl 1 keep-alive timer (15 seconds) expired!
[  715.297031] nvmet: ctrl 1 fatal error occurred!
[  749.939898] nvmet: creating controller 1 for subsystem nqn.2014-08.org.nvmexpress.discovery for NQN nqn.2014-08.org.nvmexpress:uuid:e405e6bb-8e28-4a73-b338-3fddb5746b8c.
[  763.417376] nvme nvme0: queue 0: timeout request 0x0 type 4
[  763.422979] nvme nvme0: Connect command failed, error wo/DNR bit: 881
[  763.429419] nvme nvme0: failed to connect queue: 0 ret=881

# uname -r
5.9.0-rc4

[2] stress-ng: unshare case
https://github.com/ColinIanKing/stress-ng.git
https://github.com/ColinIanKing/stress-ng/blob/master/stress-unshare.c

[3]
commit e1eb26fa62d04ec0955432be1aa8722a97cb52e7
Author: Giuseppe Scrivano <gscrivan@redhat.com>
Date:   Sun Jun 7 21:40:10 2020 -0700

    ipc/namespace.c: use a work queue to free_ipc

    the reason is to avoid a delay caused by the synchronize_rcu() call in
    kern_umount() when the mqueue mount is freed.

[4]
# cat tcp.json
{
  "hosts": [],
  "ports": [
    {
      "addr": {
        "adrfam": "ipv4",
        "traddr": "127.0.0.1",
        "treq": "not specified",
        "trsvcid": "4420",
        "trtype": "tcp"
      },
      "portid": 0,
      "referrals": [],
      "subsystems": [
        "blktests-subsystem-1"
      ]
    }
  ],
  "subsystems": [
    {
      "allowed_hosts": [],
      "attr": {
        "allow_any_host": "1",
        "cntlid_max": "65519",
        "cntlid_min": "1",
        "model": "Linux",
        "pi_enable": "0",
        "serial": "7d833f5501f6b240",
        "version": "1.3"
      },
      "namespaces": [
        {
          "device": {
            "nguid": "00000000-0000-0000-0000-000000000000",
            "path": "/dev/nullb0",
            "uuid": "b07c7eef-8428-47bf-8e79-26ec8c30f334"
          },
          "enable": 1,
          "nsid": 1
        }
      ],
      "nqn": "blktests-subsystem-1"
    }
  ]
}

Best Regards,
Yi Zhang
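For reference, the reproduction in [1] boils down to a short script along
these lines. This is only a sketch reconstructed from the traced commands
above; the stress-ng checkout path, the use of set -x style tracing, and the
location of tcp.json (the file shown in [4]) are assumptions, not the exact
test.sh.

#!/bin/sh
# Sketch of test.sh: run the unshare stressor, then set up an nvmet-tcp
# target backed by null_blk and attempt a discovery connect over loopback.
set -ex    # "-x" matches the "+" trace lines in [1]; "-e" is an assumption

# stress-ng unshare stressor for 5 seconds (assumed built in ./stress-ng)
./stress-ng/stress-ng --unshare 0 --timeout 5 --log-file unshare.log

# nvmet-tcp target on 127.0.0.1:4420 with one null_blk-backed namespace
modprobe null-blk nr-devices=1
modprobe nvmet-tcp
modprobe nvme-tcp
nvmetcli restore tcp.json    # tcp.json as listed in [4]

# this discovery connect is the step that fails with an I/O error
nvme connect -t tcp -n nqn.2014-08.org.nvmexpress.discovery -a 127.0.0.1 -s 4420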
* Re: nvme_tcp: nvme connect failed after execute stress-ng: unshare
From: Sagi Grimberg @ 2020-09-14 23:50 UTC
To: Yi Zhang, gscrivan, colin.king; +Cc: linux-nvme

> Hello,
>
> I recently found that nvme-tcp connect always fails [1] after running
> stress-ng: unshare [2]. Bisecting shows the failure was introduced by
> commit [3]; connecting works fine again after reverting it.
> I'm not sure whether this is a test case issue or a kernel issue; could
> anyone help take a look?

Is this failure persistent or transient?

> [...]
* Re: nvme_tcp: nvme connect failed after execute stress-ng: unshare
From: Yi Zhang @ 2020-09-15  1:40 UTC
To: Sagi Grimberg, gscrivan, colin.king; +Cc: linux-nvme

Hi Sagi

On 9/15/20 7:50 AM, Sagi Grimberg wrote:
>
>> Hello,
>>
>> I recently found that nvme-tcp connect always fails [1] after running
>> stress-ng: unshare [2]. Bisecting shows the failure was introduced by
>> commit [3]; connecting works fine again after reverting it.
>> I'm not sure whether this is a test case issue or a kernel issue; could
>> anyone help take a look?
>
> Is this failure persistent or transient?
>
It's persistent, and most recent CKI jobs with the 5.8 stable kernel also
show this failure.

>> [...]
* Re: nvme_tcp: nvme connect failed after execute stress-ng: unshare
From: Yi Zhang @ 2020-10-28 12:19 UTC
To: Sagi Grimberg, gscrivan; +Cc: colin.king, linux-nvme

On 9/15/20 9:40 AM, Yi Zhang wrote:
> Hi Sagi
>
> On 9/15/20 7:50 AM, Sagi Grimberg wrote:
>>
>>> Hello,
>>>
>>> I recently found that nvme-tcp connect always fails [1] after running
>>> stress-ng: unshare [2]. Bisecting shows the failure was introduced by
>>> commit [3]; connecting works fine again after reverting it.
>>> I'm not sure whether this is a test case issue or a kernel issue; could
>>> anyone help take a look?
>>
>> Is this failure persistent or transient?
>>
> It's persistent, and most recent CKI jobs with the 5.8 stable kernel also
> show this failure.
>
Found a "WARNING: possible circular locking dependency detected" splat with
the latest 5.10.0-rc1:

[  230.353213] run blktests nvme/003 at 2020-10-28 07:41:08
[  230.666092] loop: module loaded
[  230.870124] nvmet: adding nsid 1 to subsystem blktests-subsystem-1
[  230.902252] nvmet_tcp: enabling port 0 (127.0.0.1:4420)
[  246.097201] nvmet: ctrl 1 keep-alive timer (15 seconds) expired!
[  246.104886] nvmet: ctrl 1 fatal error occurred!
[  246.113796] ======================================================
[  246.120042] WARNING: possible circular locking dependency detected
[  246.126288] 5.10.0-rc1 #1 Not tainted
[  246.129987] ------------------------------------------------------
[  246.136231] kworker/0:2/251 is trying to acquire lock:
[  246.141423] ffff00012ec0b830 ((work_completion)(&queue->io_work)){+.+.}-{0:0}, at: __flush_work+0x48/0x4f0
[  246.151202] but task is already holding lock:
[  246.157095] ffff8000147e3dd8 ((work_completion)(&queue->release_work)){+.+.}-{0:0}, at: process_one_work+0x1bc/0x670
[  246.167743] which lock already depends on the new lock.
[  246.176007] the existing dependency chain (in reverse order) is:
[  246.183569] -> #2 ((work_completion)(&queue->release_work)){+.+.}-{0:0}:
[  246.191849]        validate_chain+0x5f0/0xa00
[  246.196251]        __lock_acquire+0x3b0/0xb70
[  246.200652]        lock_acquire+0x104/0x428
[  246.204878]        process_one_work+0x21c/0x670
[  246.209456]        worker_thread+0x54/0x408
[  246.213684]        kthread+0x11c/0x148
[  246.217473]        ret_from_fork+0x10/0x18
[  246.221611] -> #1 ((wq_completion)events){+.+.}-{0:0}:
[  246.228602]        validate_chain+0x5f0/0xa00
[  246.233005]        __lock_acquire+0x3b0/0xb70
[  246.237407]        lock_acquire+0x104/0x428
[  246.241634]        flush_workqueue+0x8c/0x4a8
[  246.246041]        nvmet_tcp_install_queue+0x100/0x110 [nvmet_tcp]
[  246.252309]        nvmet_install_queue+0xc4/0x150 [nvmet]
[  246.257775]        nvmet_execute_admin_connect+0xf4/0x1e0 [nvmet]
[  246.263935]        nvmet_tcp_io_work+0x8dc/0x970 [nvmet_tcp]
[  246.269654]        process_one_work+0x268/0x670
[  246.274231]        worker_thread+0x54/0x408
[  246.278458]        kthread+0x11c/0x148
[  246.282246]        ret_from_fork+0x10/0x18
[  246.288493] -> #0 ((work_completion)(&queue->io_work)){+.+.}-{0:0}:
[  246.300559]        check_noncircular+0x100/0x128
[  246.307367]        check_prev_add+0xa4/0x668
[  246.313777]        validate_chain+0x5f0/0xa00
[  246.320230]        __lock_acquire+0x3b0/0xb70
[  246.326652]        lock_acquire+0x104/0x428
[  246.332879]        __flush_work+0x70/0x4f0
[  246.338994]        flush_work+0x1c/0x28
[  246.344833]        nvmet_tcp_release_queue_work+0xd0/0x230 [nvmet_tcp]
[  246.353405]        process_one_work+0x268/0x670
[  246.359955]        worker_thread+0x54/0x408
[  246.366131]        kthread+0x11c/0x148
[  246.371843]        ret_from_fork+0x10/0x18
[  246.377876] other info that might help us debug this:
[  246.391416] Chain exists of:
                 (work_completion)(&queue->io_work) --> (wq_completion)events --> (work_completion)(&queue->release_work)
[  246.411895] Possible unsafe locking scenario:
[  246.421363]        CPU0                    CPU1
[  246.427687]        ----                    ----
[  246.433978]   lock((work_completion)(&queue->release_work));
[  246.441413]                                lock((wq_completion)events);
[  246.449816]                                lock((work_completion)(&queue->release_work));
[  246.459780]   lock((work_completion)(&queue->io_work));
[  246.466808]  *** DEADLOCK ***
[  246.478008] 2 locks held by kworker/0:2/251:
[  246.484053]  #0: ffff000100051b38 ((wq_completion)events){+.+.}-{0:0}, at: process_one_work+0x1bc/0x670
[  246.495355]  #1: ffff8000147e3dd8 ((work_completion)(&queue->release_work)){+.+.}-{0:0}, at: process_one_work+0x1bc/0x670
[  246.508301] stack backtrace:
[  246.516512] CPU: 0 PID: 251 Comm: kworker/0:2 Not tainted 5.10.0-rc1 #1
[  246.525121] Hardware name: GIGABYTE R120-T32-00/MT30-GS1-00, BIOS F02 08/06/2019
[  246.534560] Workqueue: events nvmet_tcp_release_queue_work [nvmet_tcp]
[  246.543156] Call trace:
[  246.547634]  dump_backtrace+0x0/0x1c8
[  246.553325]  show_stack+0x20/0x70
[  246.558653]  dump_stack+0xf8/0x164
[  246.564043]  print_circular_bug.isra.0+0x234/0x290
[  246.570834]  check_noncircular+0x100/0x128
[  246.576931]  check_prev_add+0xa4/0x668
[  246.582676]  validate_chain+0x5f0/0xa00
[  246.588512]  __lock_acquire+0x3b0/0xb70
[  246.594346]  lock_acquire+0x104/0x428
[  246.599986]  __flush_work+0x70/0x4f0
[  246.605506]  flush_work+0x1c/0x28
[  246.610765]  nvmet_tcp_release_queue_work+0xd0/0x230 [nvmet_tcp]
[  246.618768]  process_one_work+0x268/0x670
[  246.624777]  worker_thread+0x54/0x408
[  246.630438]  kthread+0x11c/0x148
[  246.635661]  ret_from_fork+0x10/0x18
[  295.375201] nvme nvme0: queue 0: timeout request 0x0 type 4
[  295.383074] nvme nvme0: Connect command failed, error wo/DNR bit: 881
[  295.391912] nvme nvme0: failed to connect queue: 0 ret=881
[  479.792299] nvmet: creating controller 1 for subsystem nqn.2014-08.org.nvmexpress.discovery for NQN nqn.2014-08.org.nvmexpress:uuid:0e657e35bc5f402480543a94c7a292ed.

>>> [...]
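Per the first dmesg line above, the splat was hit while blktests case
nvme/003 was running over tcp on a lockdep-enabled kernel. Assuming a
blktests tree where the nvme_trtype variable selects the transport (this
invocation is an assumption, not taken from the report), it can presumably
be re-run with something like:

#!/bin/sh
# Sketch: run the unshare stressor as in [1], then the blktests nvme/003
# case over tcp. Paths and the nvme_trtype setting are assumptions.
./stress-ng/stress-ng --unshare 0 --timeout 5

cd blktests
nvme_trtype=tcp ./check nvme/003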
Thread overview: 4+ messages
[not found] <1594444669.711245.1599710442490.JavaMail.zimbra@redhat.com>
2020-09-10 4:35 ` nvme_tcp: nvme connect failed after execute stress-ng: unshare Yi Zhang
2020-09-14 23:50 ` Sagi Grimberg
2020-09-15 1:40 ` Yi Zhang
2020-10-28 12:19 ` Yi Zhang