linux-scsi.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
From: Doug Ledford <dledford@redhat.com>
To: Robert LeBlanc <robert@leblancnet.us>,
	"Nicholas A. Bellinger" <nab@linux-iscsi.org>
Cc: Zhu Lingshan <lszhu@suse.com>,
	linux-rdma <linux-rdma@vger.kernel.org>,
	linux-scsi@vger.kernel.org, Sagi Grimberg <sagi@grimberg.me>,
	Christoph Hellwig <hch@lst.de>
Subject: Re: iscsi_trx going into D state
Date: Thu, 22 Dec 2016 14:15:43 -0500	[thread overview]
Message-ID: <6beeb461-0391-cb29-8cf7-74f4c8feda34@redhat.com> (raw)
In-Reply-To: <CAANLjFpoWruzYEDohQrAKCxp9NSGhZM3mv=x+9+L7x1+fwYN6w@mail.gmail.com>


[-- Attachment #1.1: Type: text/plain, Size: 9202 bytes --]

On 12/21/2016 6:39 PM, Robert LeBlanc wrote:
> I hit a new backtrace today, hopefully it adds something.
> 
> # cat /proc/19659/stack
> [<ffffffff815304d1>] iscsit_stop_session+0x1b1/0x1c0
> [<ffffffff81521c62>] iscsi_check_for_session_reinstatement+0x1e2/0x270
> [<ffffffff81524660>] iscsi_target_check_for_existing_instances+0x30/0x40
> [<ffffffff815247a8>] iscsi_target_do_login+0x138/0x630
> [<ffffffff815259be>] iscsi_target_start_negotiation+0x4e/0xa0
> [<ffffffff8152355e>] __iscsi_target_login_thread+0x83e/0xf20
> [<ffffffff81523c64>] iscsi_target_login_thread+0x24/0x30
> [<ffffffff810a3059>] kthread+0xd9/0xf0
> [<ffffffff817732d5>] ret_from_fork+0x25/0x30
> [<ffffffffffffffff>] 0xffffffffffffffff
> 
> # cat /proc/21342/stack
> [<ffffffffa0292b10>] __ib_drain_sq+0x190/0x1c0 [ib_core]
> [<ffffffffa0292b65>] ib_drain_sq+0x25/0x30 [ib_core]
> [<ffffffffa0292d72>] ib_drain_qp+0x12/0x30 [ib_core]
> [<ffffffffa062c5ff>] isert_wait_conn+0x5f/0x2d0 [ib_isert]
> [<ffffffff815309b7>] iscsit_close_connection+0x157/0x860
> [<ffffffff8151f10b>] iscsit_take_action_for_connection_exit+0x7b/0xf0
> [<ffffffff81530265>] iscsi_target_rx_thread+0x95/0xa0
> [<ffffffff810a3059>] kthread+0xd9/0xf0
> [<ffffffff817732d5>] ret_from_fork+0x25/0x30
> [<ffffffffffffffff>] 0xffffffffffffffff
> 
> # ps aux | grep iscsi | grep D
> root     19659  0.0  0.0      0     0 ?        D    16:12   0:00 [iscsi_np]
> root     21342  0.0  0.0      0     0 ?        D    16:29   0:00 [iscsi_trx]
> ----------------
> Robert LeBlanc
> PGP Fingerprint 79A2 9CA4 6CC4 45DD A904  C70E E654 3BB2 FA62 B9F1

That looks suspiciously like the __ib_drain_sq is stuck forever waiting
on a completion that never comes.

> 
> On Thu, Dec 15, 2016 at 1:38 PM, Robert LeBlanc <robert@leblancnet.us> wrote:
>> Nicholas,
>>
>> I've found that the kernels I used were not able to be inspected using
>> crash and I could not build the debug info for them. So I built a new
>> 4.9 kernel and verified that I could inspect the crash. It is located
>> at [1].
>>
>> [1] http://mirrors.betterservers.com/trace/crash2.tar.xz
>> ----------------
>> Robert LeBlanc
>> PGP Fingerprint 79A2 9CA4 6CC4 45DD A904  C70E E654 3BB2 FA62 B9F1
>>
>>
>> On Mon, Dec 12, 2016 at 4:57 PM, Robert LeBlanc <robert@leblancnet.us> wrote:
>>> Nicholas,
>>>
>>> After lots of set backs and having to give up trying to get kernel
>>> dumps on our "production" systems, I've been able to work out the
>>> issues we had with kdump and replicate the issue on my dev boxes. I
>>> have dumps from 4.4.30 and 4.9-rc8 (makedumpfile would not dump, so it
>>> is a straight copy of /proc/vmcore from the crash kernel). In each
>>> crash directory, I put a details.txt file that has the process IDs
>>> that were having problems and a brief description of the set-up at the
>>> time. This was mostly replicated by starting fio and pulling the
>>> Infiniband cable until fio gave up. This hardware also has Mellanox
>>> ConnectX4-LX cards and I also replicated the issue over RoCE using 4.9
>>> since it has the drivers in-box. Please let me know if you need more
>>> info, I can test much faster now. The cores/kernels/modules are
>>> located at [1].
>>>
>>> [1] http://mirrors.betterservers.com/trace/crash.tar.xz
>>>
>>> Thanks,
>>> Robert
>>> ----------------
>>> Robert LeBlanc
>>> PGP Fingerprint 79A2 9CA4 6CC4 45DD A904  C70E E654 3BB2 FA62 B9F1
>>>
>>>
>>> On Fri, Nov 4, 2016 at 3:57 PM, Robert LeBlanc <robert@leblancnet.us> wrote:
>>>> We hit this yesterday, this time it was on the tx thread (the other
>>>> ones before seem to be on the rx thread). We weren't able to get a
>>>> kernel dump on this. We'll try to get one next time.
>>>>
>>>> # ps axuw | grep "D.*iscs[i]"
>>>> root     12383  0.0  0.0      0     0 ?        D    Nov03   0:04 [iscsi_np]
>>>> root     23016  0.0  0.0      0     0 ?        D    Nov03   0:00 [iscsi_ttx]
>>>> root     23018  0.0  0.0      0     0 ?        D    Nov03   0:00 [iscsi_ttx]
>>>> # cat /proc/12383/stack
>>>> [<ffffffff814f24af>] iscsit_stop_session+0x19f/0x1d0
>>>> [<ffffffff814e3c66>] iscsi_check_for_session_reinstatement+0x1e6/0x270
>>>> [<ffffffff814e6620>] iscsi_target_check_for_existing_instances+0x30/0x40
>>>> [<ffffffff814e6770>] iscsi_target_do_login+0x140/0x640
>>>> [<ffffffff814e7b0c>] iscsi_target_start_negotiation+0x1c/0xb0
>>>> [<ffffffff814e585b>] iscsi_target_login_thread+0xa9b/0xfc0
>>>> [<ffffffff8109d7c8>] kthread+0xd8/0xf0
>>>> [<ffffffff81721a8f>] ret_from_fork+0x3f/0x70
>>>> [<ffffffffffffffff>] 0xffffffffffffffff
>>>> # cat /proc/23016/stack
>>>> [<ffffffff814ce0d9>] target_wait_for_sess_cmds+0x49/0x1a0
>>>> [<ffffffffa058b92b>] isert_wait_conn+0x1ab/0x2f0 [ib_isert]
>>>> [<ffffffff814f2642>] iscsit_close_connection+0x162/0x870
>>>> [<ffffffff814e110f>] iscsit_take_action_for_connection_exit+0x7f/0x100
>>>> [<ffffffff814f122a>] iscsi_target_tx_thread+0x1aa/0x1d0
>>>> [<ffffffff8109d7c8>] kthread+0xd8/0xf0
>>>> [<ffffffff81721a8f>] ret_from_fork+0x3f/0x70
>>>> [<ffffffffffffffff>] 0xffffffffffffffff
>>>> # cat /proc/23018/stack
>>>> [<ffffffff814ce0d9>] target_wait_for_sess_cmds+0x49/0x1a0
>>>> [<ffffffffa058b92b>] isert_wait_conn+0x1ab/0x2f0 [ib_isert]
>>>> [<ffffffff814f2642>] iscsit_close_connection+0x162/0x870
>>>> [<ffffffff814e110f>] iscsit_take_action_for_connection_exit+0x7f/0x100
>>>> [<ffffffff814f122a>] iscsi_target_tx_thread+0x1aa/0x1d0
>>>> [<ffffffff8109d7c8>] kthread+0xd8/0xf0
>>>> [<ffffffff81721a8f>] ret_from_fork+0x3f/0x70
>>>> [<ffffffffffffffff>] 0xffffffffffffffff
>>>>
>>>> From dmesg:
>>>> [  394.476332] INFO: rcu_sched self-detected stall on CPU
>>>> [  394.476334]  20-...: (23976 ticks this GP)
>>>> idle=edd/140000000000001/0 softirq=292/292 fqs=18788
>>>> [  394.476336]   (t=24003 jiffies g=3146 c=3145 q=0)
>>>> [  394.476337] Task dump for CPU 20:
>>>> [  394.476338] kworker/u68:2   R  running task        0 12906      2 0x00000008
>>>> [  394.476345] Workqueue: isert_comp_wq isert_cq_work [ib_isert]
>>>> [  394.476346]  ffff883f2fe38000 00000000f805705e ffff883f7fd03da8
>>>> ffffffff810ac8ff
>>>> [  394.476347]  0000000000000014 ffffffff81adb680 ffff883f7fd03dc0
>>>> ffffffff810af239
>>>> [  394.476348]  0000000000000015 ffff883f7fd03df0 ffffffff810e1cd0
>>>> ffff883f7fd17b80
>>>> [  394.476348] Call Trace:
>>>> [  394.476354]  <IRQ>  [<ffffffff810ac8ff>] sched_show_task+0xaf/0x110
>>>> [  394.476355]  [<ffffffff810af239>] dump_cpu_task+0x39/0x40
>>>> [  394.476357]  [<ffffffff810e1cd0>] rcu_dump_cpu_stacks+0x80/0xb0
>>>> [  394.476359]  [<ffffffff810e6100>] rcu_check_callbacks+0x540/0x820
>>>> [  394.476360]  [<ffffffff810afe11>] ? account_system_time+0x81/0x110
>>>> [  394.476363]  [<ffffffff810faa60>] ? tick_sched_do_timer+0x50/0x50
>>>> [  394.476364]  [<ffffffff810eb599>] update_process_times+0x39/0x60
>>>> [  394.476365]  [<ffffffff810fa815>] tick_sched_handle.isra.17+0x25/0x60
>>>> [  394.476366]  [<ffffffff810faa9d>] tick_sched_timer+0x3d/0x70
>>>> [  394.476368]  [<ffffffff810ec182>] __hrtimer_run_queues+0x102/0x290
>>>> [  394.476369]  [<ffffffff810ec668>] hrtimer_interrupt+0xa8/0x1a0
>>>> [  394.476372]  [<ffffffff81052c65>] local_apic_timer_interrupt+0x35/0x60
>>>> [  394.476374]  [<ffffffff8172423d>] smp_apic_timer_interrupt+0x3d/0x50
>>>> [  394.476376]  [<ffffffff817224f7>] apic_timer_interrupt+0x87/0x90
>>>> [  394.476379]  <EOI>  [<ffffffff810d71be>] ? console_unlock+0x41e/0x4e0
>>>> [  394.476380]  [<ffffffff810d757c>] vprintk_emit+0x2fc/0x500
>>>> [  394.476382]  [<ffffffff810d78ff>] vprintk_default+0x1f/0x30
>>>> [  394.476384]  [<ffffffff81174dde>] printk+0x5d/0x74
>>>> [  394.476388]  [<ffffffff814bce21>] transport_lookup_cmd_lun+0x1d1/0x200
>>>> [  394.476390]  [<ffffffff814ee8c0>] iscsit_setup_scsi_cmd+0x230/0x540
>>>> [  394.476392]  [<ffffffffa058dbf3>] isert_rx_do_work+0x3f3/0x7f0 [ib_isert]
>>>> [  394.476394]  [<ffffffffa058e174>] isert_cq_work+0x184/0x770 [ib_isert]
>>>> [  394.476396]  [<ffffffff8109740f>] process_one_work+0x14f/0x400
>>>> [  394.476397]  [<ffffffff81097c84>] worker_thread+0x114/0x470
>>>> [  394.476398]  [<ffffffff8171d32a>] ? __schedule+0x34a/0x7f0
>>>> [  394.476399]  [<ffffffff81097b70>] ? rescuer_thread+0x310/0x310
>>>> [  394.476400]  [<ffffffff8109d7c8>] kthread+0xd8/0xf0
>>>> [  394.476402]  [<ffffffff8109d6f0>] ? kthread_park+0x60/0x60
>>>> [  394.476403]  [<ffffffff81721a8f>] ret_from_fork+0x3f/0x70
>>>> [  394.476404]  [<ffffffff8109d6f0>] ? kthread_park+0x60/0x60
>>>> [  405.716632] Unexpected ret: -104 send data 360
>>>> [  405.721711] tx_data returned -32, expecting 360.
>>>> ----------------
>>>> Robert LeBlanc
>>>> PGP Fingerprint 79A2 9CA4 6CC4 45DD A904  C70E E654 3BB2 FA62 B9F1

When you combine this trace with the newest one, it really makes me
thing there is something of a bad interaction between the new drain cq
API and the iser/isert implementation to use said API.  Sagi, Christoph?

-- 
Doug Ledford <dledford@redhat.com>
    GPG Key ID: B826A3330E572FDD
    Key fingerprint = AE6B 1BDA 122B 23B4 265B  1274 B826 A333 0E57 2FDD


[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 884 bytes --]

  reply	other threads:[~2016-12-22 19:16 UTC|newest]

Thread overview: 42+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2016-09-30 17:14 iscsi_trx going into D state Robert LeBlanc
     [not found] ` <CAANLjFoj9-qscJOSf2jtKYt2+4cQxMHNJ9q2QTey4wyG5OTSAA-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
2016-10-04  7:55   ` Johannes Thumshirn
     [not found]     ` <20161004075545.j52mg3a2jckrchlp-qw2SdCWA0PpjqqEj2zc+bA@public.gmane.org>
2016-10-04  9:11       ` Hannes Reinecke
2016-10-04 11:46         ` Christoph Hellwig
     [not found]           ` <20161004114642.GA2377-wEGCiKHe2LqWVfeAwA7xHQ@public.gmane.org>
2016-10-04 16:39             ` Robert LeBlanc
2016-10-05 17:40           ` Robert LeBlanc
2016-10-05 18:03             ` Christoph Hellwig
2016-10-05 18:19               ` Robert LeBlanc
2016-10-08  2:59 ` Zhu Lingshan
2016-10-17 16:32   ` Robert LeBlanc
     [not found]     ` <CAANLjFobXiBO2tXxTBB-8BQjM8FC0wmxdxQvEd6Rp=1LZkrvpA-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
2016-10-17 19:03       ` Robert LeBlanc
2016-10-17 19:11       ` Robert LeBlanc
     [not found]         ` <CAANLjFoh+C8QE=qcPKqUUG3SnH2EMmS7DWZ5D4AD7yWMxoK0Zw-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
2016-10-18  3:06           ` Zhu Lingshan
     [not found]             ` <4fc72e32-26fb-96bd-8a0d-814eef712b43-IBi9RG/b67k@public.gmane.org>
2016-10-18  4:42               ` Robert LeBlanc
2016-10-18  7:05                 ` Nicholas A. Bellinger
2016-10-18  7:52                   ` Nicholas A. Bellinger
     [not found]                   ` <1476774332.8490.43.camel-XoQW25Eq2zviZyQQd+hFbcojREIfoBdhmpATvIKMPHk@public.gmane.org>
2016-10-18 22:13                     ` Robert LeBlanc
     [not found]                       ` <CAANLjFqXt5r=c9F75vjeK=_zLa8zCS1priLuZo=A1ZSHKZ=1Bw-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
2016-10-19  6:25                         ` Nicholas A. Bellinger
     [not found]                           ` <1476858359.8490.97.camel-XoQW25Eq2zviZyQQd+hFbcojREIfoBdhmpATvIKMPHk@public.gmane.org>
2016-10-19 16:41                             ` Robert LeBlanc
     [not found]                               ` <CAANLjFoGEi29goybqsvEg6trystEkurVz52P8SwqGUSNV1jdSw-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
2016-10-29 22:29                                 ` Nicholas A. Bellinger
     [not found]                                   ` <1477780190.22703.47.camel-XoQW25Eq2zviZyQQd+hFbcojREIfoBdhmpATvIKMPHk@public.gmane.org>
2016-10-31 16:34                                     ` Robert LeBlanc
     [not found]                                       ` <CAANLjFpkEVmO83r5YWh=hCnN=AUf9bvrrCyVJHc-=CRpc3P0vQ-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
2016-11-04 21:57                                         ` Robert LeBlanc
     [not found]                                           ` <CAANLjFqoHuSq2SsNZ4J2uvAQGPg0F1tpxeJuAQT1oM1hXQ0wew-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
2016-12-12 23:57                                             ` Robert LeBlanc
     [not found]                                               ` <CAANLjFpYT62G86w-r00+shJUyrPd68BS64y8f9OZemz_5kojzg-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
2016-12-15 20:38                                                 ` Robert LeBlanc
     [not found]                                                   ` <CAANLjFon+re7eMriFjnFfR-4SnzxR4LLSb2qcwhfkb7ODbuTwg-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
2016-12-21 23:39                                                     ` Robert LeBlanc
2016-12-22 19:15                                                       ` Doug Ledford [this message]
2016-12-27 20:22                                                         ` Robert LeBlanc
     [not found]                                                           ` <CAANLjFq2ib0H+W3RFVAdqvWF8_qDOkM5mvmAhVh0x4Usha2dOg-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
2016-12-27 20:58                                                             ` Robert LeBlanc
     [not found]                                                               ` <CAANLjFqRskoM7dn_zj_-V=uUb5KYq0OLLdLLuC4Uuba4+mq5Vw-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
2016-12-28 20:39                                                                 ` Robert LeBlanc
2016-12-28 20:58                                                                   ` Robert LeBlanc
     [not found]                                                                     ` <CAANLjFpbE9-B8qWtU5nDfg4+t+kD8TSVy0JOfN+zuFYsZ05_Dg-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
2016-12-29 21:23                                                                       ` Robert LeBlanc
     [not found]                                                                         ` <CAANLjFpEpJ4647u9R-7phf68fw--pOfThbp5Sntd4c7DdRSwwQ-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
2016-12-29 23:57                                                                           ` Robert LeBlanc
     [not found]                                                                             ` <CAANLjFooGrt51a9rOy8TKMyXyxBYmGEPm=h1YJm81Nj6YS=5yg-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
2016-12-30 23:07                                                                               ` Robert LeBlanc
     [not found]                                                                                 ` <CAANLjFrZrTPUuzP_NjkgG5h_YwwYKEWT-KzVjTvuXZ1d04z6Fg-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
2017-01-03 20:07                                                                                   ` Robert LeBlanc
     [not found]                                                                                     ` <CAANLjFpSnQ7ApOK5HDRHXQQeQNGWLUv4e+2N=_e-zBeziYm5tw-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
2017-01-04  0:11                                                                                       ` Robert LeBlanc
2017-01-06 17:06                                                                                         ` Laurence Oberman
2017-01-06 19:12                                                                                           ` Robert LeBlanc
2017-01-12 21:22                                                                                             ` Robert LeBlanc
2017-01-12 21:26                                                                                               ` Robert LeBlanc
2017-01-13 15:10                                                                                                 ` Laurence Oberman
     [not found]                                                                                                   ` <1449740553.15880491.1484320214006.JavaMail.zimbra-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
2017-01-13 23:38                                                                                                     ` Robert LeBlanc
     [not found]                                                                                                       ` <CAANLjFrFxasp6e=jWq4FwPFjRLgX-nwHc5n+eYRTz9EjTCAQ5g-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
2017-01-15 18:15                                                                                                         ` Laurence Oberman

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=6beeb461-0391-cb29-8cf7-74f4c8feda34@redhat.com \
    --to=dledford@redhat.com \
    --cc=hch@lst.de \
    --cc=linux-rdma@vger.kernel.org \
    --cc=linux-scsi@vger.kernel.org \
    --cc=lszhu@suse.com \
    --cc=nab@linux-iscsi.org \
    --cc=robert@leblancnet.us \
    --cc=sagi@grimberg.me \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).