nvmf host shutdown hangs when nvmf controllers are in recovery/reconnect

linux-nvme.lists.infradead.org archive mirror
 help / color / mirror / Atom feed

From: swise@opengridcomputing.com (Steve Wise)
Subject: nvmf host shutdown hangs when nvmf controllers are in recovery/reconnect
Date: Tue, 23 Aug 2016 09:58:13 -0500	[thread overview]
Message-ID: <00e301d1fd4e$c5868960$50939c20$@opengridcomputing.com> (raw)
In-Reply-To: <00df01d1fd4d$10ea8890$32bf99b0$@opengridcomputing.com>

> Hey guys, when I force an nvmf host into kato recovery/reconnect mode by
> killing the target, and then reboot the host, it hangs forever because the
> nvmf host controllers never get a delete command, so they stay stuck in
> reconnect state.
> 
> Here is the dmesg log:
> 
> <... one nvmf device connected...>
> 
> [  255.079939] nvme nvme1: creating 32 I/O queues.
> [  255.377218] nvme nvme1: new ctrl: NQN "test-ram0", addr 10.0.1.14:4420
> 
> 
> <... target rebooted here via 'reboot -f'...>
> 
> [  264.768555] cxgb4 0000:83:00.4: Port 0 link down, reason: Link Down
> [  264.777520] cxgb4 0000:83:00.4 eth10: link down
> [  265.177225] nvme nvme1: RECV for CQE 0xffff88101d6f3568 failed with
> status WR flushed (5)
> [  265.177306] nvme nvme1: reconnecting in 10 seconds
> [  265.748213] cxgb4 0000:82:00.4: Port 0 link down, reason: Link Down
> [  265.755478] cxgb4 0000:82:00.4 eth2: link down
> [  266.183927] mlx4_en: eth14: Link Down
> [  276.387127] nvme nvme1: rdma_resolve_addr wait failed (-110).
> [  283.116153] nvme nvme1: Failed reconnect attempt, requeueing...
> 
> <... host 'reboot' issued here...>
> 
> Stopping certmonger: [  OK  ]
> 
> Running guests on default URI: no running guests.
> 
> Stopping libvirtd daemon: [  OK  ]
> Stopping atd: [  OK  ]
> Shutting down console mouse services: [  OK  ]
> Stopping ksmtuned: [  OK  ]
> Stopping abrt daemon: [  OK  ]
> Stopping sshd: [  OK  ]
> Stopping mcelog
> Stopping xinetd: [  OK  ]
> Stopping crond: [  OK  ]
> Stopping automount: [  OK  ]
> Stopping HAL daemon: [  OK  ]
> Stopping block device availability: Deactivating block devices:
> [  OK  ]
> Stopping cgdcbxd: [  OK  ]
> Stopping lldpad: [  OK  ]
> Stopping system message bus: [  OK  ]
> Shutting down ca[  290.560113] CacheFiles: File cache on sda2
> unregistering
> chefilesd: [  290.566076] FS-Cache: Withdrawing cache "mycache"
> [  OK  ]
> Stopping rpcbind: [  OK  ]
> Stopping auditd: [  290.809894] audit: type=1305 audit(1471963093.850:82):
> audit_pid=0 old=3011 auid=4294967295 ses=4294967295 res=1
> [  OK  ]
> [  290.908238] audit: type=1305 audit(1471963093.948:83): audit_enabled=0
> old=1 auid=4294967295 ses=4294967295 res=1
> Shutting down system logger: [  OK  ]
> Shutting down interface eth8:  [  OK  ]
> Shutting down loopback interface:  [  OK  ]
> Stopping cgconfig service: [  OK  ]
> Stopping virt-who: [  OK  ]
> [  294.307812] nvme nvme1: rdma_resolve_addr wait failed (-110).
> [  301.035260] nvme nvme1: Failed reconnect attempt, requeueing...
> [  312.228468] nvme nvme1: rdma_resolve_addr wait failed (-110).
> [  312.234310] nvme nvme1: Failed reconnect attempt, requeueing...
> [  323.492871] nvme nvme1: rdma_resolve_addr wait failed (-110).
> [  323.498713] nvme nvme1: Failed reconnect attempt, requeueing...
> [  334.757296] nvme nvme1: rdma_resolve_addr wait failed (-110).
> [  334.763162] nvme nvme1: Failed reconnect attempt, requeueing...
> 
> <..stuck forever...>
> 

Eventually I see this stuck thread:

[  492.971125] INFO: task vgs:4755 blocked for more than 120 seconds.
[  492.977409]       Tainted: G            E
4.8.0-rc2nvmf-4.8-rc-rebased-rc2-harsha+ #16
[  492.985606] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this
message.
[  492.993538] vgs             D ffff880fefadf8e8     0  4755   4754 0x10000080
[  493.000749]  ffff880fefadf8e8 ffffffff81c0d4c0 ffff880fefa1ab80
ffff88103415fd30
[  493.010277]  00000001e7fb1240 ffff880fefadf8b8 ffff880fefadc008
ffff88103ee19300
[  493.019728]  7fffffffffffffff 0000000000000000 0000000000000000
ffff880fefadf938
[  493.029121] Call Trace:
[  493.033295]  [<ffffffff816ddde0>] schedule+0x40/0xb0
[  493.039937]  [<ffffffff816e0a8d>] schedule_timeout+0x2ad/0x410
[  493.047364]  [<ffffffff8132d6d2>] ? blk_flush_plug_list+0x132/0x2e0
[  493.055234]  [<ffffffff810fe67c>] ? ktime_get+0x4c/0xc0
[  493.061957]  [<ffffffff8132c92c>] ? generic_make_request+0xfc/0x1d0
[  493.069721]  [<ffffffff816dd6c4>] io_schedule_timeout+0xa4/0x110
[  493.077160]  [<ffffffff81269cb9>] dio_await_one+0x99/0xe0
[  493.083973]  [<ffffffff8126d359>] do_blockdev_direct_IO+0x919/0xc00
[  493.091636]  [<ffffffff81267350>] ? I_BDEV+0x20/0x20
[  493.097946]  [<ffffffff81267350>] ? I_BDEV+0x20/0x20
[  493.104195]  [<ffffffff8115527b>] ? rb_reserve_next_event+0xdb/0x230
[  493.111831]  [<ffffffff811547ba>] ? rb_commit+0x10a/0x1a0
[  493.118450]  [<ffffffff8126d67a>] __blockdev_direct_IO+0x3a/0x40
[  493.125657]  [<ffffffff81267b83>] blkdev_direct_IO+0x43/0x50
[  493.132468]  [<ffffffff81199ef7>] generic_file_read_iter+0xf7/0x110
[  493.139890]  [<ffffffff81267657>] blkdev_read_iter+0x37/0x40
[  493.146664]  [<ffffffff8122b15c>] __vfs_read+0xfc/0x120
[  493.153009]  [<ffffffff8122b22e>] vfs_read+0xae/0xf0
[  493.158877]  [<ffffffff81249633>] ? __fdget+0x13/0x20
[  493.164810]  [<ffffffff8122bd36>] SyS_read+0x56/0xc0
[  493.170651]  [<ffffffff81003e7d>] do_syscall_64+0x7d/0x230
[  493.177027]  [<ffffffff8106f057>] ? do_page_fault+0x37/0x90
[  493.183474]  [<ffffffff816e1921>] entry_SYSCALL64_slow_path+0x25/0x25

next      parent reply	other threads:[~2016-08-23 14:58 UTC|newest]

Thread overview: 9+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
     [not found] <00df01d1fd4d$10ea8890$32bf99b0$@opengridcomputing.com>
2016-08-23 14:58 ` Steve Wise [this message]
2016-08-23 14:46 nvmf host shutdown hangs when nvmf controllers are in recovery/reconnect Steve Wise
2016-08-24 10:40 ` Sagi Grimberg
2016-08-24 11:20   ` Sagi Grimberg
2016-08-24 20:25     ` Steve Wise
     [not found]     ` <021d01d1fe45$af92ff60$0eb8fe20$@opengridcomputing.com>
2016-08-24 20:34       ` Steve Wise
     [not found]       ` <022201d1fe46$e85649f0$b902ddd0$@opengridcomputing.com>
2016-08-24 20:47         ` Steve Wise
2016-08-25 21:58     ` Sagi Grimberg
2016-08-25 22:05       ` Steve Wise

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to='00e301d1fd4e$c5868960$50939c20$@opengridcomputing.com' \
    --to=swise@opengridcomputing.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

Be sure your reply has a Subject: header at the top and a blank line before the message body.

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).