From: mlin@kernel.org (Ming Lin)
Subject: target crash / host hang with nvme-all.3 branch of nvme-fabrics
Date: Mon, 27 Jun 2016 10:26:36 -0700	[thread overview]
Message-ID: <1467048396.7205.3.camel@ssi> (raw)
In-Reply-To: <20160616203437.GA19079@lst.de>

On Thu, 2016-06-16 at 22:34 +0200, 'Christoph Hellwig' wrote:
> On Thu, Jun 16, 2016 at 03:28:06PM -0500, Steve Wise wrote:
> > > Just to follow, does Christoph's patch fix the crash?
> > 
> > It does. 
> 
> Unfortunately I think it's still wrong because it will only delete
> a single queue per controller.  We'll probably need something
> like this instead, which does the same thing but also has a retry
> loop for additional queues:
> 
> 
> diff --git a/drivers/nvme/target/rdma.c b/drivers/nvme/target/rdma.c
> index b1c6e5b..425b55c 100644
> --- a/drivers/nvme/target/rdma.c
> +++ b/drivers/nvme/target/rdma.c
> @@ -1293,19 +1293,20 @@ static int nvmet_rdma_cm_handler(struct rdma_cm_id *cm_id,
>  
>  static void nvmet_rdma_delete_ctrl(struct nvmet_ctrl *ctrl)
>  {
> -	struct nvmet_rdma_queue *queue, *next;
> -	static LIST_HEAD(del_list);
> +	struct nvmet_rdma_queue *queue;
>  
> +restart:
>  	mutex_lock(&nvmet_rdma_queue_mutex);
> -	list_for_each_entry_safe(queue, next,
> -			&nvmet_rdma_queue_list, queue_list) {
> -		if (queue->nvme_sq.ctrl->cntlid == ctrl->cntlid)
> -			list_move_tail(&queue->queue_list, &del_list);
> +	list_for_each_entry(queue, &nvmet_rdma_queue_list, queue_list) {
> +		if (queue->nvme_sq.ctrl == ctrl) {
> +			list_del_init(&queue->queue_list);
> +			mutex_unlock(&nvmet_rdma_queue_mutex);
> +
> +			__nvmet_rdma_queue_disconnect(queue);
> +			goto restart;
> +		}
>  	}
>  	mutex_unlock(&nvmet_rdma_queue_mutex);
> -
> -	list_for_each_entry_safe(queue, next, &del_list, queue_list)
> -		nvmet_rdma_queue_disconnect(queue);
>  }
>  
>  static int nvmet_rdma_add_port(struct nvmet_port *port)
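
For readers following along: the goto restart above is needed because
__nvmet_rdma_queue_disconnect can block, so the mutex has to be dropped
before calling it, and once it is dropped the iterator can no longer be
trusted. Below is a minimal userspace sketch of the same
unlink/unlock/teardown/restart idiom -- the names are invented and a
pthread mutex stands in for the kernel one; it is an illustration, not
kernel code.

/* restart-idiom.c -- build with: cc -pthread restart-idiom.c */
#include <pthread.h>
#include <stdio.h>
#include <stdlib.h>

struct queue {
	int ctrl_id;
	struct queue *next;
};

static pthread_mutex_t queue_lock = PTHREAD_MUTEX_INITIALIZER;
static struct queue *queue_list;

/* Blocking teardown; must NOT be called with queue_lock held. */
static void queue_disconnect(struct queue *q)
{
	printf("disconnecting queue of ctrl %d\n", q->ctrl_id);
	free(q);
}

static void delete_ctrl(int ctrl_id)
{
	struct queue *q, **pp;

restart:
	pthread_mutex_lock(&queue_lock);
	for (pp = &queue_list; (q = *pp) != NULL; pp = &q->next) {
		if (q->ctrl_id == ctrl_id) {
			*pp = q->next;			/* unlink under the lock */
			pthread_mutex_unlock(&queue_lock);
			queue_disconnect(q);		/* may block */
			goto restart;			/* list may have changed */
		}
	}
	pthread_mutex_unlock(&queue_lock);
}

int main(void)
{
	/* two controllers, two queues each */
	for (int i = 0; i < 4; i++) {
		struct queue *q = malloc(sizeof(*q));
		q->ctrl_id = i & 1;
		q->next = queue_list;
		queue_list = q;
	}
	delete_ctrl(1);		/* tears down both queues of ctrl 1 */
	return 0;
}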

I ran the test below over the weekend on the host side (nvmf-all.3):

#!/bin/bash
# Flap the interface used by the NVMe-oF host, with a randomized
# 10-17 second period, to force repeated disconnect/reconnect cycles.

while true ; do
	ifconfig eth5 down
	sleep $(( 10 + ($RANDOM & 0x7) ))
	ifconfig eth5 up
	sleep $(( 10 + ($RANDOM & 0x7) ))
done

Then the target side hit the crash below:

[122730.252874] nvmet: creating controller 1 for NQN nqn.2014-08.org.nvmexpress:NVMf:uuid:53ea06bc-e1d0-4d59-a6e9-138684f3662b.
[122730.281665] nvmet: adding queue 1 to ctrl 1.
[122730.287133] nvmet: adding queue 2 to ctrl 1.
[122730.292672] nvmet: adding queue 3 to ctrl 1.
[122730.298197] nvmet: adding queue 4 to ctrl 1.
[122730.303742] nvmet: adding queue 5 to ctrl 1.
[122730.309375] nvmet: adding queue 6 to ctrl 1.
[122730.315015] nvmet: adding queue 7 to ctrl 1.
[122730.320688] nvmet: adding queue 8 to ctrl 1.
[122732.014747] mlx4_en: eth4: Link Down
[122745.298422] nvmet: ctrl 1 keep-alive timer (15 seconds) expired!
[122745.305601] BUG: unable to handle kernel paging request at 0000010173180018
[122745.313755] IP: [<ffffffffc08bb7fa>] nvmet_rdma_delete_ctrl+0x4a/0xa0 [nvmet_rdma]
[122745.322513] PGD 0
[122745.325667] Oops: 0000 [#1] PREEMPT SMP
[122745.462435] CPU: 0 PID: 4849 Comm: kworker/0:3 Tainted: G           OE   4.7.0-rc2+ #256
[122745.472376] Hardware name: Dell Inc. OptiPlex 7010/0773VG, BIOS A12 01/10/2013
[122745.481433] Workqueue: events nvmet_keep_alive_timer [nvmet]
[122745.488909] task: ffff880035346a00 ti: ffff8800d1078000 task.ti: ffff8800d1078000
[122745.498246] RIP: 0010:[<ffffffffc08bb7fa>]  [<ffffffffc08bb7fa>] nvmet_rdma_delete_ctrl+0x4a/0xa0 [nvmet_rdma]
[122745.510170] RSP: 0018:ffff8800d107bdf0  EFLAGS: 00010207
[122745.517384] RAX: 0000010173180100 RBX: 000001017317ffe0 RCX: 0000000000000000
[122745.526464] RDX: 0000010173180100 RSI: ffff88012020dc28 RDI: ffffffffc08bf080
[122745.535566] RBP: ffff8800d107be00 R08: 0000000000000000 R09: ffff8800c7bd7bc0
[122745.544715] R10: 000000000000f000 R11: 0000000000015c68 R12: ffff8800b85c3400
[122745.553873] R13: ffff88012021ac00 R14: 0000000000000000 R15: ffff880120216300
[122745.563025] FS:  0000000000000000(0000) GS:ffff880120200000(0000) knlGS:0000000000000000
[122745.573152] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[122745.580923] CR2: 0000010173180018 CR3: 0000000001c06000 CR4: 00000000001406f0
[122745.590125] Stack:
[122745.594187]  ffff8800b85c34c8 ffff8800b85c34c8 ffff8800d107be18 ffffffffc090d0be
[122745.603771]  ffff8800d7cfaf00 ffff8800d107be60 ffffffff81083019 ffff8800d7cfaf30
[122745.613374]  0000000035346a00 ffff880120216320 ffff8800d7cfaf30 ffff880035346a00
[122745.622992] Call Trace:
[122745.627593]  [<ffffffffc090d0be>] nvmet_keep_alive_timer+0x2e/0x40 [nvmet]
[122745.636685]  [<ffffffff81083019>] process_one_work+0x159/0x370
[122745.644745]  [<ffffffff81083356>] worker_thread+0x126/0x490
[122745.652545]  [<ffffffff816f17fe>] ? __schedule+0x1de/0x590
[122745.660217]  [<ffffffff81083230>] ? process_one_work+0x370/0x370
[122745.668387]  [<ffffffff81088864>] kthread+0xc4/0xe0
[122745.675437]  [<ffffffff816f571f>] ret_from_fork+0x1f/0x40
[122745.683025]  [<ffffffff810887a0>] ? kthread_create_on_node+0x170/0x170

(gdb) list *nvmet_rdma_delete_ctrl+0x4a
0x82a is in nvmet_rdma_delete_ctrl (/home/mlin/linux-nvmeof/drivers/nvme/target/rdma.c:1301).
1296		struct nvmet_rdma_queue *queue;
1297	
1298	restart:
1299		mutex_lock(&nvmet_rdma_queue_mutex);
1300		list_for_each_entry(queue, &nvmet_rdma_queue_list, queue_list) {
1301			if (queue->nvme_sq.ctrl == ctrl) {
1302				list_del_init(&queue->queue_list);
1303				mutex_unlock(&nvmet_rdma_queue_mutex);
1304	
1305				__nvmet_rdma_queue_disconnect(queue);
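
For reference, the listing above comes from running gdb against the
module built with debug info, roughly (the object path here is the
usual build location, stated as an assumption):

  $ gdb drivers/nvme/target/nvmet-rdma.ko
  (gdb) list *nvmet_rdma_delete_ctrl+0x4a

So the fault is on the queue->nvme_sq.ctrl dereference at rdma.c:1301,
and the queue pointer itself looks bogus (RBX=000001017317ffe0, with
CR2 = RBX + 0x38), as if the iterator walked onto a stale or corrupted
list entry.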


Thread overview: 50+ messages
2016-06-16 14:53 target crash / host hang with nvme-all.3 branch of nvme-fabrics Steve Wise
2016-06-16 14:57 ` Christoph Hellwig
2016-06-16 15:10   ` Christoph Hellwig
2016-06-16 15:17     ` Steve Wise
2016-06-16 19:11     ` Sagi Grimberg
2016-06-16 20:38       ` Christoph Hellwig
2016-06-16 21:37         ` Sagi Grimberg
2016-06-16 21:40           ` Sagi Grimberg
2016-06-21 16:01           ` Christoph Hellwig
2016-06-22 10:22             ` Sagi Grimberg
2016-06-16 15:24   ` Steve Wise
2016-06-16 16:41     ` Steve Wise
2016-06-16 15:56   ` Steve Wise
2016-06-16 19:55     ` Sagi Grimberg
2016-06-16 19:59       ` Steve Wise
2016-06-16 20:07         ` Sagi Grimberg
2016-06-16 20:12           ` Steve Wise
2016-06-16 20:27             ` Ming Lin
2016-06-16 20:28               ` Steve Wise
2016-06-16 20:34                 ` 'Christoph Hellwig'
2016-06-16 20:49                   ` Steve Wise
2016-06-16 21:06                     ` Steve Wise
2016-06-16 21:42                       ` Sagi Grimberg
2016-06-16 21:47                         ` Ming Lin
2016-06-16 21:53                           ` Steve Wise
2016-06-16 21:46                       ` Steve Wise
2016-06-27 22:29                       ` Ming Lin
2016-06-28  9:14                         ` 'Christoph Hellwig'
2016-06-28 14:15                           ` Steve Wise
2016-06-28 15:51                             ` 'Christoph Hellwig'
2016-06-28 16:31                               ` Steve Wise
2016-06-28 16:49                                 ` Ming Lin
2016-06-28 19:20                                   ` Steve Wise
2016-06-28 19:43                                     ` Steve Wise
2016-06-28 21:04                                       ` Ming Lin
2016-06-29 14:11                                         ` Steve Wise
2016-06-27 17:26                   ` Ming Lin [this message]
2016-06-16 20:35           ` Steve Wise
2016-06-16 20:01       ` Steve Wise
2016-06-17 14:05       ` Steve Wise
     [not found]       ` <005f01d1c8a1$5a229240$0e67b6c0$@opengridcomputing.com>
2016-06-17 14:16         ` Steve Wise
2016-06-17 17:20           ` Ming Lin
2016-06-19 11:57             ` Sagi Grimberg
2016-06-21 14:18               ` Steve Wise
2016-06-21 17:33                 ` Ming Lin
2016-06-21 17:59                   ` Steve Wise
     [not found]               ` <006e01d1cbc7$d0d9cc40$728d64c0$@opengridcomputing.com>
2016-06-22 13:42                 ` Steve Wise
2016-06-27 14:19                   ` Steve Wise
2016-06-28  8:50                     ` 'Christoph Hellwig'
2016-07-04  9:57                       ` Yoichi Hayakawa
