From: Sagi Grimberg <sagig-LDSdmyG8hGV8YrgS2mwiifqBs+8SCbDb@public.gmane.org>
To: Adam Mazur <adam.mazur-yCD69WgB1YhWk0Htik3J/w@public.gmane.org>,
linux-rdma-u79uwXL29TY76Z2rM5mHXA@public.gmane.org,
target-devel
<target-devel-u79uwXL29TY76Z2rM5mHXA@public.gmane.org>
Cc: "Nicholas A. Bellinger"
<nab-IzHhD5pYlfBP7FQvKIMDCQ@public.gmane.org>,
Oren Duer <oren-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>
Subject: Re: CRASH 3.18-rc2, 3.17.1, isert_connect_request
Date: Tue, 04 Nov 2014 18:44:57 +0200
Message-ID: <54590289.9020404@dev.mellanox.co.il>
In-Reply-To: <54589351.1080007-yCD69WgB1YhWk0Htik3J/w@public.gmane.org>
On 11/4/2014 10:50 AM, Adam Mazur wrote:
> On 03.11.2014 at 12:50, Adam Mazur wrote:
>> On 03.11.2014 at 12:27, Sagi Grimberg wrote:
>>> On 11/3/2014 12:28 PM, Adam Mazur wrote:
>>>> Can someone help us with these crashes? We are not able to reproduce
>>>> them on demand; the crash takes anywhere from 30 minutes to a few hours
>>>> to appear. We've seen it on kernels 3.17.1 and 3.18-rc2.
>>>>
>>>
>>> Hey Adam,
>>>
>>> CC'ing target-devel mailing list (where iser target is maintained).
>>>
>>> I ran into this issue as well, and I actually have a fix for it in
>>> the pipeline. I'm planning to test it together with a few other fixes
>>> for a little while longer before I submit the code.
>>>
>>> In general, this crash occurs due to a race between tpg shutdown (or
>>> np disable) and RDMA_CM connect requests arriving in parallel. The iser
>>> target tries to reference a tpg attribute while np->tpg_np is
>>> actually NULL.
>>>
>>> How many targets/initiators/portals did you use? HCA?
>>
>> Hi Sagi,
>>
>> There are about 300 targets (lvm volumes), 4 initiators, two portals.
>>
>> HCA (from lspci):
>> 05:00.0 InfiniBand: Mellanox Technologies MT25204 [InfiniHost III Lx HCA] (rev 20)
>> Subsystem: Mellanox Technologies MT25204 [InfiniHost III Lx HCA]
>> Flags: bus master, fast devsel, latency 0, IRQ 46
>> Memory at df500000 (64-bit, non-prefetchable) [size=1M]
>> Memory at de800000 (64-bit, prefetchable) [size=8M]
>> Capabilities: [40] Power Management version 2
>> Capabilities: [48] Vital Product Data
>> Capabilities: [90] MSI: Enable- Count=1/32 Maskable- 64bit+
>> Capabilities: [84] MSI-X: Enable+ Count=32 Masked-
>> Capabilities: [60] Express Endpoint, MSI 00
>> Kernel driver in use: ib_mthca
>>
>>
>> root@portal-1:~# mstflint -d 05:00.0 q
>> Image type: Failsafe
>> FW Version: 1.2.0
>> I.S. Version: 1
>> Device ID: 25204
>> Chip Revision: A0
>> Description: Node Port1 Sys image
>> GUIDs: 0005ad00000c75c8 0005ad00000c75c9 0005ad00000c75cb
>> Board ID: (MT_0260000002)
>> VSD:
>> PSID: MT_0260000002
>>
>>
>> root@portal-2:~# mstflint -d 05:00.0 q
>> Image type: Failsafe
>> I.S. Version: 1
>> Chip Revision: A0
>> Description: Node Port1 Sys image
>> GUIDs: 0005ad00000c7010 0005ad00000c7011 0005ad00000c7013
>> Board ID: (MT_0260000002)
>> VSD:
>> PSID: MT_0260000002
>>
>>
>>> Would it be possible to send you some patches to test as well?
>>
>> Absolutely, we can immediately test any patch on any kernel version.
>>
>> Thanks
>> Adam
>
>
> The race is supposedly caused by a login DDoS from initiators that are
> not PI-aware; our initiators were running kernels from 3.2 to 3.17.
This bug has nothing to do with the initiators or their awareness of PI.
The race itself is related to PI, though.
> After we upgraded all of them to kernels > 3.15, the new targets seem
> to be stable. However, this shows that the race is still lurking
> somewhere, as you pointed out.
Yeah, the race is still there.
I have some patches under testing that need cleaning up before they go
on the mailing list...
> Thank you for the feedback. Later we will try to prepare a test case
> that might expose the crash.
I think a full target stack unload while lots of initiators are
connected should trigger this race...
Sagi.