From: mlin@kernel.org (Ming Lin)
Subject: target crash / host hang with nvme-all.3 branch of nvme-fabrics
Date: Mon, 27 Jun 2016 10:26:36 -0700	[thread overview]
Message-ID: <1467048396.7205.3.camel@ssi> (raw)
In-Reply-To: <20160616203437.GA19079@lst.de>

On Thu, 2016-06-16 at 22:34 +0200, 'Christoph Hellwig' wrote:
> On Thu, Jun 16, 2016 at 03:28:06PM -0500, Steve Wise wrote:
> > > Just to follow, does Christoph's patch fix the crash?
> > 
> > It does. 
> 
> Unfortunately I think it's still wrong because it will only delete
> a single queue per controller.  We'll probably need something
> like this instead, which does the same thing but also has a retry
> loop for additional queues:
> 
> 
> diff --git a/drivers/nvme/target/rdma.c b/drivers/nvme/target/rdma.c
> index b1c6e5b..425b55c 100644
> --- a/drivers/nvme/target/rdma.c
> +++ b/drivers/nvme/target/rdma.c
> @@ -1293,19 +1293,20 @@ static int nvmet_rdma_cm_handler(struct rdma_cm_id *cm_id,
>  
>  static void nvmet_rdma_delete_ctrl(struct nvmet_ctrl *ctrl)
>  {
> -	struct nvmet_rdma_queue *queue, *next;
> -	static LIST_HEAD(del_list);
> +	struct nvmet_rdma_queue *queue;
>  
> +restart:
>  	mutex_lock(&nvmet_rdma_queue_mutex);
> -	list_for_each_entry_safe(queue, next,
> -			&nvmet_rdma_queue_list, queue_list) {
> -		if (queue->nvme_sq.ctrl->cntlid == ctrl->cntlid)
> -			list_move_tail(&queue->queue_list, &del_list);
> +	list_for_each_entry(queue, &nvmet_rdma_queue_list, queue_list) {
> +		if (queue->nvme_sq.ctrl == ctrl) {
> +			list_del_init(&queue->queue_list);
> +			mutex_unlock(&nvmet_rdma_queue_mutex);
> +
> +			__nvmet_rdma_queue_disconnect(queue);
> +			goto restart;
> +		}
>  	}
>  	mutex_unlock(&nvmet_rdma_queue_mutex);
> -
> -	list_for_each_entry_safe(queue, next, &del_list, queue_list)
> -		nvmet_rdma_queue_disconnect(queue);
>  }
>  
>  static int nvmet_rdma_add_port(struct nvmet_port *port)
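
For readers following the quoted patch: the "unlink one entry under the lock, drop the lock, tear it down, then rescan from the head" loop can be sketched in user space roughly as below. This is only an illustration under stated assumptions. `struct queue`, `queue_add()`, `queue_disconnect()` and `delete_ctrl()` are hypothetical stand-ins, a pthread mutex replaces the kernel mutex, and the hand-rolled circular list only imitates what `list_del_init()` does. It is not the nvmet code.

```c
#include <pthread.h>
#include <stddef.h>

/* Hypothetical stand-in for a per-connection queue; not the nvmet struct. */
struct queue {
	struct queue *prev, *next;	/* circular doubly linked list linkage */
	const void *ctrl;		/* owning controller, compared by pointer */
	int disconnected;		/* set by the stand-in teardown */
};

/* List head points at itself when the list is empty. */
static struct queue queue_list = { &queue_list, &queue_list, NULL, 0 };
static pthread_mutex_t queue_mutex = PTHREAD_MUTEX_INITIALIZER;

/* Append a queue to the global list under the mutex. */
static void queue_add(struct queue *q, const void *ctrl)
{
	q->ctrl = ctrl;
	q->disconnected = 0;
	pthread_mutex_lock(&queue_mutex);
	q->prev = queue_list.prev;
	q->next = &queue_list;
	queue_list.prev->next = q;
	queue_list.prev = q;
	pthread_mutex_unlock(&queue_mutex);
}

/* Stand-in for teardown work that must run without the mutex held. */
static void queue_disconnect(struct queue *q)
{
	q->disconnected = 1;
}

/* Tear down every queue owned by ctrl; returns how many were found. */
static int delete_ctrl(const void *ctrl)
{
	struct queue *q;
	int n = 0;

restart:
	pthread_mutex_lock(&queue_mutex);
	for (q = queue_list.next; q != &queue_list; q = q->next) {
		if (q->ctrl == ctrl) {
			/* Unlink and self-link the entry while still locked. */
			q->prev->next = q->next;
			q->next->prev = q->prev;
			q->prev = q->next = q;
			pthread_mutex_unlock(&queue_mutex);

			queue_disconnect(q);
			n++;
			/* The list may have changed while unlocked: rescan. */
			goto restart;
		}
	}
	pthread_mutex_unlock(&queue_mutex);
	return n;
}
```

The rescan costs a fresh walk per deleted entry, but it never trusts a cursor that was read before the lock was dropped. The sketch does nothing about entries that are freed while still linked, so it does not by itself rule out the kind of crash reported below.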

I ran the test below over the weekend on the host side (nvmf-all.3):

#!/bin/bash

while [ 1 ] ; do
	ifconfig eth5 down ; sleep $(( 10 + ($RANDOM & 0x7) )); ifconfig eth5 up ;sleep $(( 10 + ($RANDOM & 0x7) ))
done

The target side then hit the crash below:

[122730.252874] nvmet: creating controller 1 for NQN nqn.2014-08.org.nvmexpress:NVMf:uuid:53ea06bc-e1d0-4d59-a6e9-138684f3662b.
[122730.281665] nvmet: adding queue 1 to ctrl 1.
[122730.287133] nvmet: adding queue 2 to ctrl 1.
[122730.292672] nvmet: adding queue 3 to ctrl 1.
[122730.298197] nvmet: adding queue 4 to ctrl 1.
[122730.303742] nvmet: adding queue 5 to ctrl 1.
[122730.309375] nvmet: adding queue 6 to ctrl 1.
[122730.315015] nvmet: adding queue 7 to ctrl 1.
[122730.320688] nvmet: adding queue 8 to ctrl 1.
[122732.014747] mlx4_en: eth4: Link Down
[122745.298422] nvmet: ctrl 1 keep-alive timer (15 seconds) expired!
[122745.305601] BUG: unable to handle kernel paging request at 0000010173180018
[122745.313755] IP: [<ffffffffc08bb7fa>] nvmet_rdma_delete_ctrl+0x4a/0xa0 [nvmet_rdma]
[122745.322513] PGD 0
[122745.325667] Oops: 0000 [#1] PREEMPT SMP
[122745.462435] CPU: 0 PID: 4849 Comm: kworker/0:3 Tainted: G           OE   4.7.0-rc2+ #256
[122745.472376] Hardware name: Dell Inc. OptiPlex 7010/0773VG, BIOS A12 01/10/2013
[122745.481433] Workqueue: events nvmet_keep_alive_timer [nvmet]
[122745.488909] task: ffff880035346a00 ti: ffff8800d1078000 task.ti: ffff8800d1078000
[122745.498246] RIP: 0010:[<ffffffffc08bb7fa>]  [<ffffffffc08bb7fa>] nvmet_rdma_delete_ctrl+0x4a/0xa0 [nvmet_rdma]
[122745.510170] RSP: 0018:ffff8800d107bdf0  EFLAGS: 00010207
[122745.517384] RAX: 0000010173180100 RBX: 000001017317ffe0 RCX: 0000000000000000
[122745.526464] RDX: 0000010173180100 RSI: ffff88012020dc28 RDI: ffffffffc08bf080
[122745.535566] RBP: ffff8800d107be00 R08: 0000000000000000 R09: ffff8800c7bd7bc0
[122745.544715] R10: 000000000000f000 R11: 0000000000015c68 R12: ffff8800b85c3400
[122745.553873] R13: ffff88012021ac00 R14: 0000000000000000 R15: ffff880120216300
[122745.563025] FS:  0000000000000000(0000) GS:ffff880120200000(0000) knlGS:0000000000000000
[122745.573152] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[122745.580923] CR2: 0000010173180018 CR3: 0000000001c06000 CR4: 00000000001406f0
[122745.590125] Stack:
[122745.594187]  ffff8800b85c34c8 ffff8800b85c34c8 ffff8800d107be18 ffffffffc090d0be
[122745.603771]  ffff8800d7cfaf00 ffff8800d107be60 ffffffff81083019 ffff8800d7cfaf30
[122745.613374]  0000000035346a00 ffff880120216320 ffff8800d7cfaf30 ffff880035346a00
[122745.622992] Call Trace:
[122745.627593]  [<ffffffffc090d0be>] nvmet_keep_alive_timer+0x2e/0x40 [nvmet]
[122745.636685]  [<ffffffff81083019>] process_one_work+0x159/0x370
[122745.644745]  [<ffffffff81083356>] worker_thread+0x126/0x490
[122745.652545]  [<ffffffff816f17fe>] ? __schedule+0x1de/0x590
[122745.660217]  [<ffffffff81083230>] ? process_one_work+0x370/0x370
[122745.668387]  [<ffffffff81088864>] kthread+0xc4/0xe0
[122745.675437]  [<ffffffff816f571f>] ret_from_fork+0x1f/0x40
[122745.683025]  [<ffffffff810887a0>] ? kthread_create_on_node+0x170/0x170

(gdb) list *nvmet_rdma_delete_ctrl+0x4a
0x82a is in nvmet_rdma_delete_ctrl (/home/mlin/linux-nvmeof/drivers/nvme/target/rdma.c:1301).
1296		struct nvmet_rdma_queue *queue;
1297	
1298	restart:
1299		mutex_lock(&nvmet_rdma_queue_mutex);
1300		list_for_each_entry(queue, &nvmet_rdma_queue_list, queue_list) {
1301			if (queue->nvme_sq.ctrl == ctrl) {
1302				list_del_init(&queue->queue_list);
1303				mutex_unlock(&nvmet_rdma_queue_mutex);
1304	
1305				__nvmet_rdma_queue_disconnect(queue);
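
As a reading aid for the register dump, the arithmetic below relates the faulting address to the registers. The values are copied from the oops. The helper names are hypothetical, and the interpretation in the comments is an assumption that nothing in this message confirms.

```c
#include <stdint.h>
#include <stdio.h>

/* Values copied verbatim from the oops above. */
static const uint64_t cr2 = 0x0000010173180018ULL;	/* faulting address */
static const uint64_t rbx = 0x000001017317ffe0ULL;
static const uint64_t rax = 0x0000010173180100ULL;	/* RDX holds the same value */

/* Distance from RBX to the address that faulted. */
static uint64_t fault_offset(void)
{
	return cr2 - rbx;
}

/* Distance from RBX up to the value in RAX/RDX. */
static uint64_t entry_offset(void)
{
	return rax - rbx;
}

/*
 * Assumed reading: RBX is the `queue` cursor, derived from a list pointer
 * in RAX/RDX, and the fault is the member load on line 1301.
 */
static void print_offsets(void)
{
	printf("CR2 - RBX = 0x%llx\n", (unsigned long long)fault_offset());
	printf("RAX - RBX = 0x%llx\n", (unsigned long long)entry_offset());
}
```

Calling `print_offsets()` gives 0x38 and 0x120. Neither RAX nor RBX resembles the ffff88... kernel addresses in the other registers, which would fit a list entry whose link pointer was no longer valid when the loop followed it.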
