public inbox for linux-rdma@vger.kernel.org
 help / color / mirror / Atom feed
* IB/ipoib: fix dangling pointer reference to ipoib_neigh and ipoib_path -when will it go upstream?
@ 2010-07-12  4:57 Pradeep Satyanarayana
       [not found] ` <4C3AA0A2.3090406-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8@public.gmane.org>
  0 siblings, 1 reply; 15+ messages in thread
From: Pradeep Satyanarayana @ 2010-07-12  4:57 UTC (permalink / raw)
  To: Roland Dreier; +Cc: linux-rdma

Roland,

I realize that the following patch: 

https://patchwork.kernel.org/patch/97243/

is queued in your backlog of patches and unlikely that it will go into 2.6.35.
What are the chances that it will make it into 2.6.36? This patch has fixed a
a rarely seen crash and we would like it to go upstream ASAP.

Thanks
Pradeep

--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: IB/ipoib: fix dangling pointer reference to ipoib_neigh and ipoib_path -when will it go upstream?
       [not found] ` <4C3AA0A2.3090406-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8@public.gmane.org>
@ 2010-07-12 10:07   ` Bart Van Assche
       [not found]     ` <AANLkTilkXY0vDA4dtq9cMIO2shX-kw0ZlwadVjm4QMui-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
  2010-07-12 21:21   ` Roland Dreier
  1 sibling, 1 reply; 15+ messages in thread
From: Bart Van Assche @ 2010-07-12 10:07 UTC (permalink / raw)
  To: Pradeep Satyanarayana; +Cc: Roland Dreier, linux-rdma, Ralph Campbell

On Mon, Jul 12, 2010 at 6:57 AM, Pradeep Satyanarayana
<pradeeps-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8@public.gmane.org> wrote:
> I realize that the following patch:
>
> https://patchwork.kernel.org/patch/97243/
>
> is queued in your backlog of patches and unlikely that it will go into 2.6.35.
> What are the chances that it will make it into 2.6.36? This patch has fixed a
> a rarely seen crash and we would like it to go upstream ASAP.

The following comment was made on that patch by Ralph Campbell (see
also http://www.mail-archive.com/linux-rdma-u79uwXL29TY76Z2rM5mHXA@public.gmane.org/msg03125.html):

"Quite right. I should also use list_for_each_entry_safe(). I will fix this."

This makes me wonder whether version three of this patch can go in unmodified ?

Bart.
--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: IB/ipoib: fix dangling pointer reference to ipoib_neigh and ipoib_path -when will it go upstream?
       [not found]     ` <AANLkTilkXY0vDA4dtq9cMIO2shX-kw0ZlwadVjm4QMui-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
@ 2010-07-12 10:20       ` Pradeep Satyanarayana
       [not found]         ` <4C3AEC65.1090304-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8@public.gmane.org>
  0 siblings, 1 reply; 15+ messages in thread
From: Pradeep Satyanarayana @ 2010-07-12 10:20 UTC (permalink / raw)
  To: Bart Van Assche; +Cc: Roland Dreier, linux-rdma, Ralph Campbell

Bart Van Assche wrote:
> On Mon, Jul 12, 2010 at 6:57 AM, Pradeep Satyanarayana
> <pradeeps-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8@public.gmane.org> wrote:
>> I realize that the following patch:
>>
>> https://patchwork.kernel.org/patch/97243/
>>
>> is queued in your backlog of patches and unlikely that it will go into 2.6.35.
>> What are the chances that it will make it into 2.6.36? This patch has fixed a
>> a rarely seen crash and we would like it to go upstream ASAP.
> 
> The following comment was made on that patch by Ralph Campbell (see
> also http://www.mail-archive.com/linux-rdma-u79uwXL29TY76Z2rM5mHXA@public.gmane.org/msg03125.html):
> 
> "Quite right. I should also use list_for_each_entry_safe(). I will fix this."
> 
> This makes me wonder whether version three of this patch can go in unmodified ?

There was a version 4 that followed. That was what I was referring to.

Thanks
Pradeep

--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: IB/ipoib: fix dangling pointer reference to ipoib_neigh and ipoib_path -when will it go upstream?
       [not found]         ` <4C3AEC65.1090304-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8@public.gmane.org>
@ 2010-07-12 10:35           ` Bart Van Assche
       [not found]             ` <AANLkTil7Vbpg4n_TsR71F1mK5Sq6kzI0blgVjJ7_lebB-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
  0 siblings, 1 reply; 15+ messages in thread
From: Bart Van Assche @ 2010-07-12 10:35 UTC (permalink / raw)
  To: Pradeep Satyanarayana; +Cc: Roland Dreier, linux-rdma, Ralph Campbell

On Mon, Jul 12, 2010 at 12:20 PM, Pradeep Satyanarayana
<pradeeps-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8@public.gmane.org> wrote:
> Bart Van Assche wrote:
>> On Mon, Jul 12, 2010 at 6:57 AM, Pradeep Satyanarayana
>> <pradeeps-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8@public.gmane.org> wrote:
>>> I realize that the following patch:
>>>
>>> https://patchwork.kernel.org/patch/97243/
>>>
>>> is queued in your backlog of patches and unlikely that it will go into 2.6.35.
>>> What are the chances that it will make it into 2.6.36? This patch has fixed a
>>> a rarely seen crash and we would like it to go upstream ASAP.
>>
>> The following comment was made on that patch by Ralph Campbell (see
>> also http://www.mail-archive.com/linux-rdma-u79uwXL29TY76Z2rM5mHXA@public.gmane.org/msg03125.html):
>>
>> "Quite right. I should also use list_for_each_entry_safe(). I will fix this."
>>
>> This makes me wonder whether version three of this patch can go in unmodified ?
>
> There was a version 4 that followed. That was what I was referring to.

Thanks for the info -- I had missed version four of that patch. Now
that I had a look at it, why does the comment above
ipoib_cm_flush_path() say that it removes all entries while the loop
inside that function is stopped after the first entry has been found
and removed ? Why does that function use list_for_each_entry_safe()
while only a single entry is removed ?

Bart.
--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: IB/ipoib: fix dangling pointer reference to ipoib_neigh and ipoib_path -when will it go upstream?
       [not found] ` <4C3AA0A2.3090406-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8@public.gmane.org>
  2010-07-12 10:07   ` Bart Van Assche
@ 2010-07-12 21:21   ` Roland Dreier
       [not found]     ` <adapqysxup6.fsf-BjVyx320WGW9gfZ95n9DRSW4+XlvGpQz@public.gmane.org>
  1 sibling, 1 reply; 15+ messages in thread
From: Roland Dreier @ 2010-07-12 21:21 UTC (permalink / raw)
  To: Pradeep Satyanarayana; +Cc: linux-rdma

 > I realize that the following patch: 
 > 
 > https://patchwork.kernel.org/patch/97243/
 > 
 > is queued in your backlog of patches and unlikely that it will go into 2.6.35.
 > What are the chances that it will make it into 2.6.36? This patch has fixed a
 > a rarely seen crash and we would like it to go upstream ASAP.

How much testing and/or review have you done for this patch?  What is
the impact of the bug in the cases you have seen?

 - R.
-- 
Roland Dreier <rolandd-FYB4Gu1CFyUAvxtiuMwx3w@public.gmane.org> || For corporate legal information go to:
http://www.cisco.com/web/about/doing_business/legal/cri/index.html
--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: IB/ipoib: fix dangling pointer reference to ipoib_neigh and ipoib_path -when will it go upstream?
       [not found]     ` <adapqysxup6.fsf-BjVyx320WGW9gfZ95n9DRSW4+XlvGpQz@public.gmane.org>
@ 2010-07-13  6:07       ` Pradeep Satyanarayana
       [not found]         ` <4C3C02B2.9040408-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8@public.gmane.org>
  0 siblings, 1 reply; 15+ messages in thread
From: Pradeep Satyanarayana @ 2010-07-13  6:07 UTC (permalink / raw)
  To: Roland Dreier; +Cc: linux-rdma

Roland Dreier wrote:
>  > I realize that the following patch: 
>  > 
>  > https://patchwork.kernel.org/patch/97243/
>  > 
>  > is queued in your backlog of patches and unlikely that it will go into 2.6.35.
>  > What are the chances that it will make it into 2.6.36? This patch has fixed a
>  > a rarely seen crash and we would like it to go upstream ASAP.
> 
> How much testing and/or review have you done for this patch?  What is
> the impact of the bug in the cases you have seen?

I guess I came to a premature conclusion. One set of tests ran fine and I made that
conclusion. Another set of tests caused the following crash:

12:mon> t
[link register   ] d00000000ec950a8 .cm_destroy_id+0x3d8/0x520 [ib_cm]
[c0000007544d2d40] d00000000ec95098 .cm_destroy_id+0x3c8/0x520 [ib_cm] (unreliable)
[c0000007544d2e10] d00000000ef9b590 .ipoib_cm_free_rx_reap_list+0xc8/0x180 [ib_ipoib]
[c0000007544d2ed0] d00000000ef9e1cc .ipoib_cm_dev_stop+0x23c/0x360 [ib_ipoib]
[c0000007544d2f90] d00000000ef94d94 .ipoib_ib_dev_stop+0xe4/0x4b0 [ib_ipoib]
[c0000007544d30f0] d00000000ef91d68 .ipoib_stop+0x88/0x178 [ib_ipoib]
[c0000007544d3180] c0000000004ea1cc .dev_close+0xdc/0x148
[c0000007544d3210] c0000000004e9790 .dev_change_flags+0x1f0/0x288
[c0000007544d32b0] c0000000004f6a0c .do_setlink+0x2d4/0x498
[c0000007544d3390] c0000000004f8b20 .rtnl_newlink+0x520/0x5f0
[c0000007544d35a0] c0000000004f853c .rtnetlink_rcv_msg+0x24c/0x310
[c0000007544d3650] c000000000513a38 .netlink_rcv_skb+0xf0/0x128
[c0000007544d36e0] c0000000004f82cc .rtnetlink_rcv+0x34/0x58
[c0000007544d3770] c0000000005134c0 .netlink_unicast+0x498/0x4b0
[c0000007544d3840] c000000000514218 .netlink_sendmsg+0x288/0x390
[c0000007544d3920] c0000000004cf538 .sock_sendmsg+0x118/0x158
[c0000007544d3b30] c0000000004cf720 .SyS_sendmsg+0x1a8/0x348
[c0000007544d3d70] c0000000004cdbdc .SyS_socketcall+0x174/0x338
[c0000007544d3e30] c0000000000085b4 syscall_exit+0x0/0x40
--- Exception: c01 (System Call) at 00000fffa25c35d0
SP (fffc5a12fe0) is in userspace
12:mon> e
cpu 0x12: Vector: 300 (Data Access) at [c0000007544d2ac0]
    pc: c0000000002ef854: .rb_erase+0x64/0x180
    lr: d00000000ec950a8: .cm_destroy_id+0x3d8/0x520 [ib_cm]
    sp: c0000007544d2d40
   msr: 8000000000001432
   dar: 10000000100
 dsisr: 40000000
  current = 0xc0000006948dc240
  paca    = 0xc000000000f64a00
    pid   = 21593, comm = ip
12:mon>

Thanks
Pradeep

--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: IB/ipoib: fix dangling pointer reference to ipoib_neigh and ipoib_path -when will it go upstream?
       [not found]         ` <4C3C02B2.9040408-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8@public.gmane.org>
@ 2010-07-13 14:45           ` Roland Dreier
       [not found]             ` <adaeif7xwwz.fsf-BjVyx320WGW9gfZ95n9DRSW4+XlvGpQz@public.gmane.org>
  0 siblings, 1 reply; 15+ messages in thread
From: Roland Dreier @ 2010-07-13 14:45 UTC (permalink / raw)
  To: Pradeep Satyanarayana; +Cc: linux-rdma

 > I guess I came to a premature conclusion. One set of tests ran fine and I made that
 > conclusion. Another set of tests caused the following crash:

I don't really know how to interpret this.  Is this crash new, or is it
the same crash you were hoping this patch fixed?

 - R.
-- 
Roland Dreier <rolandd-FYB4Gu1CFyUAvxtiuMwx3w@public.gmane.org> || For corporate legal information go to:
http://www.cisco.com/web/about/doing_business/legal/cri/index.html
--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: IB/ipoib: fix dangling pointer reference to ipoib_neigh and ipoib_path -when will it go upstream?
       [not found]             ` <AANLkTil7Vbpg4n_TsR71F1mK5Sq6kzI0blgVjJ7_lebB-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
@ 2010-07-13 19:23               ` Ralph Campbell
  0 siblings, 0 replies; 15+ messages in thread
From: Ralph Campbell @ 2010-07-13 19:23 UTC (permalink / raw)
  To: Bart Van Assche; +Cc: Pradeep Satyanarayana, Roland Dreier, linux-rdma

On Mon, 2010-07-12 at 03:35 -0700, Bart Van Assche wrote:
> On Mon, Jul 12, 2010 at 12:20 PM, Pradeep Satyanarayana
> <pradeeps-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8@public.gmane.org> wrote:
> > Bart Van Assche wrote:
> >> On Mon, Jul 12, 2010 at 6:57 AM, Pradeep Satyanarayana
> >> <pradeeps-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8@public.gmane.org> wrote:
> >>> I realize that the following patch:
> >>>
> >>> https://patchwork.kernel.org/patch/97243/
> >>>
> >>> is queued in your backlog of patches and unlikely that it will go into 2.6.35.
> >>> What are the chances that it will make it into 2.6.36? This patch has fixed a
> >>> a rarely seen crash and we would like it to go upstream ASAP.
> >>
> >> The following comment was made on that patch by Ralph Campbell (see
> >> also http://www.mail-archive.com/linux-rdma-u79uwXL29TY76Z2rM5mHXA@public.gmane.org/msg03125.html):
> >>
> >> "Quite right. I should also use list_for_each_entry_safe(). I will fix this."
> >>
> >> This makes me wonder whether version three of this patch can go in unmodified ?
> >
> > There was a version 4 that followed. That was what I was referring to.
> 
> Thanks for the info -- I had missed version four of that patch. Now
> that I had a look at it, why does the comment above
> ipoib_cm_flush_path() say that it removes all entries while the loop
> inside that function is stopped after the first entry has been found
> and removed ? Why does that function use list_for_each_entry_safe()
> while only a single entry is removed ?
> 
> Bart.

There is only one matching entry on the list at any given time
so I guess list_for_each_entry_safe() isn't required.

--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: IB/ipoib: fix dangling pointer reference to ipoib_neigh and ipoib_path -when will it go upstream?
       [not found]             ` <adaeif7xwwz.fsf-BjVyx320WGW9gfZ95n9DRSW4+XlvGpQz@public.gmane.org>
@ 2010-07-14  2:54               ` Pradeep Satyanarayana
       [not found]                 ` <4C3D26D0.3090508-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8@public.gmane.org>
  0 siblings, 1 reply; 15+ messages in thread
From: Pradeep Satyanarayana @ 2010-07-14  2:54 UTC (permalink / raw)
  To: Roland Dreier; +Cc: linux-rdma

Roland Dreier wrote:
>  > I guess I came to a premature conclusion. One set of tests ran fine and I made that
>  > conclusion. Another set of tests caused the following crash:
> 
> I don't really know how to interpret this.  Is this crash new, or is it
> the same crash you were hoping this patch fixed?

This is a new crash.

Thanks
Pradeep

--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: IB/ipoib: fix dangling pointer reference to ipoib_neigh and ipoib_path -when will it go upstream?
       [not found]                 ` <4C3D26D0.3090508-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8@public.gmane.org>
@ 2010-07-14 16:25                   ` Pradeep Satyanarayana
       [not found]                     ` <4C3DE512.3020903-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8@public.gmane.org>
  0 siblings, 1 reply; 15+ messages in thread
From: Pradeep Satyanarayana @ 2010-07-14 16:25 UTC (permalink / raw)
  To: Roland Dreier; +Cc: linux-rdma

Pradeep Satyanarayana wrote:
> Roland Dreier wrote:
>>  > I guess I came to a premature conclusion. One set of tests ran fine and I made that
>>  > conclusion. Another set of tests caused the following crash:
>>
>> I don't really know how to interpret this.  Is this crash new, or is it
>> the same crash you were hoping this patch fixed?
> 
> This is a new crash.

I see other manifestations resulting in different crashes :

:mon> t
[c00000074603ba20] d0000000193527ac .ipoib_neigh_flush+0x6c/0x350 [ib_ipoib]
[c00000074603bb10] d000000019356dac .ipoib_mcast_free+0x74/0x2a0 [ib_ipoib]
[c00000074603bbe0] d000000019358558 .ipoib_mcast_restart_task+0x3d0/0x560 [ib_ipoib]
[c00000074603bd40] c0000000000c6fe4 .run_workqueue+0xf4/0x1e0
[c00000074603be00] c0000000000c7190 .worker_thread+0xc0/0x180
[c00000074603bed0] c0000000000ccf4c .kthread+0xb4/0xc0
[c00000074603bf90] c0000000000309fc .kernel_thread+0x54/0x70
9:mon> e
cpu 0x9: Vector: 300 (Data Access) at [c00000074603b720]
    pc: c0000000005ac390: ._spin_lock+0x20/0xc8
    lr: d0000000193527ac: .ipoib_neigh_flush+0x6c/0x350 [ib_ipoib]
    sp: c00000074603b9a0
   msr: 8000000000009032
   dar: 3a0
 dsisr: 40000000
  current = 0xc000000756ce8b00
  paca    = 0xc000000000f63800
    pid   = 18095, comm = ipoib
9:mon>

Thanks
Pradeep

--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: IB/ipoib: fix dangling pointer reference to ipoib_neigh and ipoib_path -when will it go upstream?
       [not found]                     ` <4C3DE512.3020903-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8@public.gmane.org>
@ 2010-07-15 11:56                       ` Pradeep Satyanarayana
       [not found]                         ` <4C3EF754.4060502-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8@public.gmane.org>
  0 siblings, 1 reply; 15+ messages in thread
From: Pradeep Satyanarayana @ 2010-07-15 11:56 UTC (permalink / raw)
  To: Ralph Campbell, Roland Dreier; +Cc: linux-rdma

Pradeep Satyanarayana wrote:
> Pradeep Satyanarayana wrote:
>> Roland Dreier wrote:
>>>  > I guess I came to a premature conclusion. One set of tests ran fine and I made that
>>>  > conclusion. Another set of tests caused the following crash:
>>>
>>> I don't really know how to interpret this.  Is this crash new, or is it
>>> the same crash you were hoping this patch fixed?
>> This is a new crash.
> 
> I see other manifestations resulting in different crashes :
> 
> :mon> t
> [c00000074603ba20] d0000000193527ac .ipoib_neigh_flush+0x6c/0x350 [ib_ipoib]
> [c00000074603bb10] d000000019356dac .ipoib_mcast_free+0x74/0x2a0 [ib_ipoib]
> [c00000074603bbe0] d000000019358558 .ipoib_mcast_restart_task+0x3d0/0x560 [ib_ipoib]
> [c00000074603bd40] c0000000000c6fe4 .run_workqueue+0xf4/0x1e0
> [c00000074603be00] c0000000000c7190 .worker_thread+0xc0/0x180
> [c00000074603bed0] c0000000000ccf4c .kthread+0xb4/0xc0
> [c00000074603bf90] c0000000000309fc .kernel_thread+0x54/0x70
> 9:mon> e
> cpu 0x9: Vector: 300 (Data Access) at [c00000074603b720]
>     pc: c0000000005ac390: ._spin_lock+0x20/0xc8
>     lr: d0000000193527ac: .ipoib_neigh_flush+0x6c/0x350 [ib_ipoib]
>     sp: c00000074603b9a0
>    msr: 8000000000009032
>    dar: 3a0
>  dsisr: 40000000
>   current = 0xc000000756ce8b00
>   paca    = 0xc000000000f63800
>     pid   = 18095, comm = ipoib
> 9:mon>

Recreating the crash has been tricky. I have tried several several hundred times today
to unload and reload IPoIB while there is traffic and no crashes happened. I took
a closer look at the IPoIB CM code and I see a few things that look suspicious.

In the ipoib_cm_send() path no priv->lock is held, whereas the priv->lock is held before 
calling ipoib_cm_destroy_tx(). This is true with and without Ralph's patch (fix dangling pointer).
Is this a potential race?

In Roland's git tree I do see a test_and_clear_bit(IPOIB_FLAG_INITIALIZED, &tx->flags) in 
ipoib_cm_destroy_tx() which seems to be missing in Ralph's patch. In Ralph's patch) there is a 
clear_bit(IPOIB_FLAG_OPER_UP, &tx->flags) called before calling ipoib_cm_destroy_tx() only in 
select cases. Was that intended?

Thanks
Pradeep

--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: IB/ipoib: fix dangling pointer reference to ipoib_neigh and ipoib_path -when will it go upstream?
       [not found]                         ` <4C3EF754.4060502-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8@public.gmane.org>
@ 2010-07-15 19:59                           ` Ralph Campbell
  2010-07-16  1:29                           ` Ralph Campbell
  1 sibling, 0 replies; 15+ messages in thread
From: Ralph Campbell @ 2010-07-15 19:59 UTC (permalink / raw)
  To: Pradeep Satyanarayana; +Cc: Roland Dreier, linux-rdma

I will write up a description of the locking as I
understand it and the changes I made. Give
me a day or two to write it up and check it.

On Thu, 2010-07-15 at 04:56 -0700, Pradeep Satyanarayana wrote:
> Pradeep Satyanarayana wrote:
> > Pradeep Satyanarayana wrote:
> >> Roland Dreier wrote:
> >>>  > I guess I came to a premature conclusion. One set of tests ran fine and I made that
> >>>  > conclusion. Another set of tests caused the following crash:
> >>>
> >>> I don't really know how to interpret this.  Is this crash new, or is it
> >>> the same crash you were hoping this patch fixed?
> >> This is a new crash.
> > 
> > I see other manifestations resulting in different crashes :
> > 
> > :mon> t
> > [c00000074603ba20] d0000000193527ac .ipoib_neigh_flush+0x6c/0x350 [ib_ipoib]
> > [c00000074603bb10] d000000019356dac .ipoib_mcast_free+0x74/0x2a0 [ib_ipoib]
> > [c00000074603bbe0] d000000019358558 .ipoib_mcast_restart_task+0x3d0/0x560 [ib_ipoib]
> > [c00000074603bd40] c0000000000c6fe4 .run_workqueue+0xf4/0x1e0
> > [c00000074603be00] c0000000000c7190 .worker_thread+0xc0/0x180
> > [c00000074603bed0] c0000000000ccf4c .kthread+0xb4/0xc0
> > [c00000074603bf90] c0000000000309fc .kernel_thread+0x54/0x70
> > 9:mon> e
> > cpu 0x9: Vector: 300 (Data Access) at [c00000074603b720]
> >     pc: c0000000005ac390: ._spin_lock+0x20/0xc8
> >     lr: d0000000193527ac: .ipoib_neigh_flush+0x6c/0x350 [ib_ipoib]
> >     sp: c00000074603b9a0
> >    msr: 8000000000009032
> >    dar: 3a0
> >  dsisr: 40000000
> >   current = 0xc000000756ce8b00
> >   paca    = 0xc000000000f63800
> >     pid   = 18095, comm = ipoib
> > 9:mon>
> 
> Recreating the crash has been tricky. I have tried several several hundred times today
> to unload and reload IPoIB while there is traffic and no crashes happened. I took
> a closer look at the IPoIB CM code and I see a few things that look suspicious.
> 
> In the ipoib_cm_send() path no priv->lock is held, whereas the priv->lock is held before 
> calling ipoib_cm_destroy_tx(). This is true with and without Ralph's patch (fix dangling pointer).
> Is this a potential race?
> 
> In Roland's git tree I do see a test_and_clear_bit(IPOIB_FLAG_INITIALIZED, &tx->flags) in 
> ipoib_cm_destroy_tx() which seems to be missing in Ralph's patch. In Ralph's patch) there is a 
> clear_bit(IPOIB_FLAG_OPER_UP, &tx->flags) called before calling ipoib_cm_destroy_tx() only in 
> select cases. Was that intended?
> 
> Thanks
> Pradeep
> 


--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: IB/ipoib: fix dangling pointer reference to ipoib_neigh and ipoib_path -when will it go upstream?
       [not found]                         ` <4C3EF754.4060502-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8@public.gmane.org>
  2010-07-15 19:59                           ` Ralph Campbell
@ 2010-07-16  1:29                           ` Ralph Campbell
       [not found]                             ` <1279243768.31421.48.camel-/vjeY7uYZjrPXfVEPVhPGq6RkeBMCJyt@public.gmane.org>
  1 sibling, 1 reply; 15+ messages in thread
From: Ralph Campbell @ 2010-07-16  1:29 UTC (permalink / raw)
  To: Pradeep Satyanarayana; +Cc: Roland Dreier, linux-rdma

On Thu, 2010-07-15 at 04:56 -0700, Pradeep Satyanarayana wrote:
> Pradeep Satyanarayana wrote:
> > Pradeep Satyanarayana wrote:
> >> Roland Dreier wrote:
> >>>  > I guess I came to a premature conclusion. One set of tests ran fine and I made that
> >>>  > conclusion. Another set of tests caused the following crash:
> >>>
> >>> I don't really know how to interpret this.  Is this crash new, or is it
> >>> the same crash you were hoping this patch fixed?
> >> This is a new crash.
> > 
> > I see other manifestations resulting in different crashes :
> > 
> > :mon> t
> > [c00000074603ba20] d0000000193527ac .ipoib_neigh_flush+0x6c/0x350 [ib_ipoib]
> > [c00000074603bb10] d000000019356dac .ipoib_mcast_free+0x74/0x2a0 [ib_ipoib]
> > [c00000074603bbe0] d000000019358558 .ipoib_mcast_restart_task+0x3d0/0x560 [ib_ipoib]
> > [c00000074603bd40] c0000000000c6fe4 .run_workqueue+0xf4/0x1e0
> > [c00000074603be00] c0000000000c7190 .worker_thread+0xc0/0x180
> > [c00000074603bed0] c0000000000ccf4c .kthread+0xb4/0xc0
> > [c00000074603bf90] c0000000000309fc .kernel_thread+0x54/0x70
> > 9:mon> e
> > cpu 0x9: Vector: 300 (Data Access) at [c00000074603b720]
> >     pc: c0000000005ac390: ._spin_lock+0x20/0xc8
> >     lr: d0000000193527ac: .ipoib_neigh_flush+0x6c/0x350 [ib_ipoib]
> >     sp: c00000074603b9a0
> >    msr: 8000000000009032
> >    dar: 3a0
> >  dsisr: 40000000
> >   current = 0xc000000756ce8b00
> >   paca    = 0xc000000000f63800
> >     pid   = 18095, comm = ipoib
> > 9:mon>
> 
> Recreating the crash has been tricky. I have tried several several hundred times today
> to unload and reload IPoIB while there is traffic and no crashes happened. I took
> a closer look at the IPoIB CM code and I see a few things that look suspicious.
> 
> In the ipoib_cm_send() path no priv->lock is held, whereas the priv->lock is held before 
> calling ipoib_cm_destroy_tx(). This is true with and without Ralph's patch (fix dangling pointer).
> Is this a potential race?

ipoib_cm_send() is only called by ipoib_start_xmit() so it is protected
by netif_tx_lock(dev) or stopping the ipoib network device.
It all depends on what pointer or data structure you think is being
accessed while free or being modified without the proper protection.

> In Roland's git tree I do see a test_and_clear_bit(IPOIB_FLAG_INITIALIZED, &tx->flags) in 
> ipoib_cm_destroy_tx() which seems to be missing in Ralph's patch. In Ralph's patch) there is a 
> clear_bit(IPOIB_FLAG_OPER_UP, &tx->flags) called before calling ipoib_cm_destroy_tx() only in 
> select cases. Was that intended?

The v4 patch comments explain the changes:
http://www.spinics.net/lists/linux-rdma/msg03733.html
Basically, IPOIB_FLAG_INITIALIZED now means that the struct ipoib_cm_tx
has completed the RC QP creation process via the CM instead of simply
when ipoib_cm_create_tx() allocates the structure.
The test and clear was used to indicate the struct ipoib_cm_tx
had been put on the destroy list and the reaper thread woken up.
Now ipoib_cm_destroy_tx() uses the tx->neigh pointer != NULL to
indicate that ipoib_cm_destroy_tx() has started the destroy process.
ipoib_cm_destroy_tx() is only called when netif_tx_lock() and priv->lock
are held to protect tx->neigh.

> Thanks
> Pradeep

The longer write up on locking is turning out to be very complex.
I will keep working on it but I think it will be just as hard
to understand as slogging through the code.

--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: IB/ipoib: fix dangling pointer reference to ipoib_neigh and ipoib_path -when will it go upstream?
       [not found]                             ` <1279243768.31421.48.camel-/vjeY7uYZjrPXfVEPVhPGq6RkeBMCJyt@public.gmane.org>
@ 2010-07-16  9:13                               ` Pradeep Satyanarayana
       [not found]                                 ` <4C4022BC.3030401-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8@public.gmane.org>
  0 siblings, 1 reply; 15+ messages in thread
From: Pradeep Satyanarayana @ 2010-07-16  9:13 UTC (permalink / raw)
  To: Ralph Campbell; +Cc: Roland Dreier, linux-rdma

Ralph Campbell wrote:
> On Thu, 2010-07-15 at 04:56 -0700, Pradeep Satyanarayana wrote:
>> Pradeep Satyanarayana wrote:
>>> Pradeep Satyanarayana wrote:
>>>> Roland Dreier wrote:
>>>>>  > I guess I came to a premature conclusion. One set of tests ran fine and I made that
>>>>>  > conclusion. Another set of tests caused the following crash:
>>>>>
>>>>> I don't really know how to interpret this.  Is this crash new, or is it
>>>>> the same crash you were hoping this patch fixed?
>>>> This is a new crash.
>>> I see other manifestations resulting in different crashes :
>>>
>>> :mon> t
>>> [c00000074603ba20] d0000000193527ac .ipoib_neigh_flush+0x6c/0x350 [ib_ipoib]
>>> [c00000074603bb10] d000000019356dac .ipoib_mcast_free+0x74/0x2a0 [ib_ipoib]
>>> [c00000074603bbe0] d000000019358558 .ipoib_mcast_restart_task+0x3d0/0x560 [ib_ipoib]
>>> [c00000074603bd40] c0000000000c6fe4 .run_workqueue+0xf4/0x1e0
>>> [c00000074603be00] c0000000000c7190 .worker_thread+0xc0/0x180
>>> [c00000074603bed0] c0000000000ccf4c .kthread+0xb4/0xc0
>>> [c00000074603bf90] c0000000000309fc .kernel_thread+0x54/0x70
>>> 9:mon> e
>>> cpu 0x9: Vector: 300 (Data Access) at [c00000074603b720]
>>>     pc: c0000000005ac390: ._spin_lock+0x20/0xc8
>>>     lr: d0000000193527ac: .ipoib_neigh_flush+0x6c/0x350 [ib_ipoib]
>>>     sp: c00000074603b9a0
>>>    msr: 8000000000009032
>>>    dar: 3a0
>>>  dsisr: 40000000
>>>   current = 0xc000000756ce8b00
>>>   paca    = 0xc000000000f63800
>>>     pid   = 18095, comm = ipoib
>>> 9:mon>
>> Recreating the crash has been tricky. I have tried several several hundred times today
>> to unload and reload IPoIB while there is traffic and no crashes happened. I took
>> a closer look at the IPoIB CM code and I see a few things that look suspicious.
>>
>> In the ipoib_cm_send() path no priv->lock is held, whereas the priv->lock is held before 
>> calling ipoib_cm_destroy_tx(). This is true with and without Ralph's patch (fix dangling pointer).
>> Is this a potential race?
> 
> ipoib_cm_send() is only called by ipoib_start_xmit() so it is protected
> by netif_tx_lock(dev) or stopping the ipoib network device.

I still see one case in ipoib_neigh_cleanup() wherein ipoib_cm_destroy_tx() appears to be called
without netif_tx_lock(dev) held. Is that correct?

Thanks
Pradeep

--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: IB/ipoib: fix dangling pointer reference to ipoib_neigh and ipoib_path -when will it go upstream?
       [not found]                                 ` <4C4022BC.3030401-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8@public.gmane.org>
@ 2010-07-16 17:43                                   ` Ralph Campbell
  0 siblings, 0 replies; 15+ messages in thread
From: Ralph Campbell @ 2010-07-16 17:43 UTC (permalink / raw)
  To: Pradeep Satyanarayana; +Cc: Roland Dreier, linux-rdma

On Fri, 2010-07-16 at 02:13 -0700, Pradeep Satyanarayana wrote:
> Ralph Campbell wrote:
> > On Thu, 2010-07-15 at 04:56 -0700, Pradeep Satyanarayana wrote:
> >> Pradeep Satyanarayana wrote:
> >>> Pradeep Satyanarayana wrote:
> >>>> Roland Dreier wrote:
> >>>>>  > I guess I came to a premature conclusion. One set of tests ran fine and I made that
> >>>>>  > conclusion. Another set of tests caused the following crash:
> >>>>>
> >>>>> I don't really know how to interpret this.  Is this crash new, or is it
> >>>>> the same crash you were hoping this patch fixed?
> >>>> This is a new crash.
> >>> I see other manifestations resulting in different crashes :
> >>>
> >>> :mon> t
> >>> [c00000074603ba20] d0000000193527ac .ipoib_neigh_flush+0x6c/0x350 [ib_ipoib]
> >>> [c00000074603bb10] d000000019356dac .ipoib_mcast_free+0x74/0x2a0 [ib_ipoib]
> >>> [c00000074603bbe0] d000000019358558 .ipoib_mcast_restart_task+0x3d0/0x560 [ib_ipoib]
> >>> [c00000074603bd40] c0000000000c6fe4 .run_workqueue+0xf4/0x1e0
> >>> [c00000074603be00] c0000000000c7190 .worker_thread+0xc0/0x180
> >>> [c00000074603bed0] c0000000000ccf4c .kthread+0xb4/0xc0
> >>> [c00000074603bf90] c0000000000309fc .kernel_thread+0x54/0x70
> >>> 9:mon> e
> >>> cpu 0x9: Vector: 300 (Data Access) at [c00000074603b720]
> >>>     pc: c0000000005ac390: ._spin_lock+0x20/0xc8
> >>>     lr: d0000000193527ac: .ipoib_neigh_flush+0x6c/0x350 [ib_ipoib]
> >>>     sp: c00000074603b9a0
> >>>    msr: 8000000000009032
> >>>    dar: 3a0
> >>>  dsisr: 40000000
> >>>   current = 0xc000000756ce8b00
> >>>   paca    = 0xc000000000f63800
> >>>     pid   = 18095, comm = ipoib
> >>> 9:mon>
> >> Recreating the crash has been tricky. I have tried several several hundred times today
> >> to unload and reload IPoIB while there is traffic and no crashes happened. I took
> >> a closer look at the IPoIB CM code and I see a few things that look suspicious.
> >>
> >> In the ipoib_cm_send() path no priv->lock is held, whereas the priv->lock is held before 
> >> calling ipoib_cm_destroy_tx(). This is true with and without Ralph's patch (fix dangling pointer).
> >> Is this a potential race?
> > 
> > ipoib_cm_send() is only called by ipoib_start_xmit() so it is protected
> > by netif_tx_lock(dev) or stopping the ipoib network device.
> 
> I still see one case in ipoib_neigh_cleanup() wherein ipoib_cm_destroy_tx() appears to be called
> without netif_tx_lock(dev) held. Is that correct?
> 
> Thanks
> Pradeep

ipoib_neigh_cleanup() is called by neigh_cleanup_and_release() when
freeing a struct neighbour. I assume the Linux network stack is
not going to call into the IPoIB driver to send sk_buffs in that
case but I could be wrong. If it can, then you are correct that
the netif_tx_lock(dev) should be acquired.

--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 15+ messages in thread

end of thread, other threads:[~2010-07-16 17:43 UTC | newest]

Thread overview: 15+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2010-07-12  4:57 IB/ipoib: fix dangling pointer reference to ipoib_neigh and ipoib_path -when will it go upstream? Pradeep Satyanarayana
     [not found] ` <4C3AA0A2.3090406-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8@public.gmane.org>
2010-07-12 10:07   ` Bart Van Assche
     [not found]     ` <AANLkTilkXY0vDA4dtq9cMIO2shX-kw0ZlwadVjm4QMui-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
2010-07-12 10:20       ` Pradeep Satyanarayana
     [not found]         ` <4C3AEC65.1090304-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8@public.gmane.org>
2010-07-12 10:35           ` Bart Van Assche
     [not found]             ` <AANLkTil7Vbpg4n_TsR71F1mK5Sq6kzI0blgVjJ7_lebB-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
2010-07-13 19:23               ` Ralph Campbell
2010-07-12 21:21   ` Roland Dreier
     [not found]     ` <adapqysxup6.fsf-BjVyx320WGW9gfZ95n9DRSW4+XlvGpQz@public.gmane.org>
2010-07-13  6:07       ` Pradeep Satyanarayana
     [not found]         ` <4C3C02B2.9040408-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8@public.gmane.org>
2010-07-13 14:45           ` Roland Dreier
     [not found]             ` <adaeif7xwwz.fsf-BjVyx320WGW9gfZ95n9DRSW4+XlvGpQz@public.gmane.org>
2010-07-14  2:54               ` Pradeep Satyanarayana
     [not found]                 ` <4C3D26D0.3090508-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8@public.gmane.org>
2010-07-14 16:25                   ` Pradeep Satyanarayana
     [not found]                     ` <4C3DE512.3020903-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8@public.gmane.org>
2010-07-15 11:56                       ` Pradeep Satyanarayana
     [not found]                         ` <4C3EF754.4060502-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8@public.gmane.org>
2010-07-15 19:59                           ` Ralph Campbell
2010-07-16  1:29                           ` Ralph Campbell
     [not found]                             ` <1279243768.31421.48.camel-/vjeY7uYZjrPXfVEPVhPGq6RkeBMCJyt@public.gmane.org>
2010-07-16  9:13                               ` Pradeep Satyanarayana
     [not found]                                 ` <4C4022BC.3030401-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8@public.gmane.org>
2010-07-16 17:43                                   ` Ralph Campbell

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox