* IB/ipoib: fix dangling pointer reference to ipoib_neigh and ipoib_path -when will it go upstream?
@ 2010-07-12 4:57 Pradeep Satyanarayana
[not found] ` <4C3AA0A2.3090406-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8@public.gmane.org>
0 siblings, 1 reply; 15+ messages in thread
From: Pradeep Satyanarayana @ 2010-07-12 4:57 UTC (permalink / raw)
To: Roland Dreier; +Cc: linux-rdma
Roland,
I realize that the following patch:
https://patchwork.kernel.org/patch/97243/
is queued in your backlog of patches and unlikely that it will go into 2.6.35.
What are the chances that it will make it into 2.6.36? This patch has fixed a
a rarely seen crash and we would like it to go upstream ASAP.
Thanks
Pradeep
--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
^ permalink raw reply [flat|nested] 15+ messages in thread
* Re: IB/ipoib: fix dangling pointer reference to ipoib_neigh and ipoib_path -when will it go upstream?
[not found] ` <4C3AA0A2.3090406-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8@public.gmane.org>
@ 2010-07-12 10:07 ` Bart Van Assche
[not found] ` <AANLkTilkXY0vDA4dtq9cMIO2shX-kw0ZlwadVjm4QMui-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
2010-07-12 21:21 ` Roland Dreier
1 sibling, 1 reply; 15+ messages in thread
From: Bart Van Assche @ 2010-07-12 10:07 UTC (permalink / raw)
To: Pradeep Satyanarayana; +Cc: Roland Dreier, linux-rdma, Ralph Campbell
On Mon, Jul 12, 2010 at 6:57 AM, Pradeep Satyanarayana
<pradeeps-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8@public.gmane.org> wrote:
> I realize that the following patch:
>
> https://patchwork.kernel.org/patch/97243/
>
> is queued in your backlog of patches and unlikely that it will go into 2.6.35.
> What are the chances that it will make it into 2.6.36? This patch has fixed a
> a rarely seen crash and we would like it to go upstream ASAP.
The following comment was made on that patch by Ralph Campbell (see
also http://www.mail-archive.com/linux-rdma-u79uwXL29TY76Z2rM5mHXA@public.gmane.org/msg03125.html):
"Quite right. I should also use list_for_each_entry_safe(). I will fix this."
This makes me wonder whether version three of this patch can go in unmodified ?
Bart.
--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
^ permalink raw reply [flat|nested] 15+ messages in thread
* Re: IB/ipoib: fix dangling pointer reference to ipoib_neigh and ipoib_path -when will it go upstream?
[not found] ` <AANLkTilkXY0vDA4dtq9cMIO2shX-kw0ZlwadVjm4QMui-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
@ 2010-07-12 10:20 ` Pradeep Satyanarayana
[not found] ` <4C3AEC65.1090304-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8@public.gmane.org>
0 siblings, 1 reply; 15+ messages in thread
From: Pradeep Satyanarayana @ 2010-07-12 10:20 UTC (permalink / raw)
To: Bart Van Assche; +Cc: Roland Dreier, linux-rdma, Ralph Campbell
Bart Van Assche wrote:
> On Mon, Jul 12, 2010 at 6:57 AM, Pradeep Satyanarayana
> <pradeeps-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8@public.gmane.org> wrote:
>> I realize that the following patch:
>>
>> https://patchwork.kernel.org/patch/97243/
>>
>> is queued in your backlog of patches and unlikely that it will go into 2.6.35.
>> What are the chances that it will make it into 2.6.36? This patch has fixed a
>> a rarely seen crash and we would like it to go upstream ASAP.
>
> The following comment was made on that patch by Ralph Campbell (see
> also http://www.mail-archive.com/linux-rdma-u79uwXL29TY76Z2rM5mHXA@public.gmane.org/msg03125.html):
>
> "Quite right. I should also use list_for_each_entry_safe(). I will fix this."
>
> This makes me wonder whether version three of this patch can go in unmodified ?
There was a version 4 that followed. That was what I was referring to.
Thanks
Pradeep
--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
^ permalink raw reply [flat|nested] 15+ messages in thread
* Re: IB/ipoib: fix dangling pointer reference to ipoib_neigh and ipoib_path -when will it go upstream?
[not found] ` <4C3AEC65.1090304-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8@public.gmane.org>
@ 2010-07-12 10:35 ` Bart Van Assche
[not found] ` <AANLkTil7Vbpg4n_TsR71F1mK5Sq6kzI0blgVjJ7_lebB-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
0 siblings, 1 reply; 15+ messages in thread
From: Bart Van Assche @ 2010-07-12 10:35 UTC (permalink / raw)
To: Pradeep Satyanarayana; +Cc: Roland Dreier, linux-rdma, Ralph Campbell
On Mon, Jul 12, 2010 at 12:20 PM, Pradeep Satyanarayana
<pradeeps-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8@public.gmane.org> wrote:
> Bart Van Assche wrote:
>> On Mon, Jul 12, 2010 at 6:57 AM, Pradeep Satyanarayana
>> <pradeeps-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8@public.gmane.org> wrote:
>>> I realize that the following patch:
>>>
>>> https://patchwork.kernel.org/patch/97243/
>>>
>>> is queued in your backlog of patches and unlikely that it will go into 2.6.35.
>>> What are the chances that it will make it into 2.6.36? This patch has fixed a
>>> a rarely seen crash and we would like it to go upstream ASAP.
>>
>> The following comment was made on that patch by Ralph Campbell (see
>> also http://www.mail-archive.com/linux-rdma-u79uwXL29TY76Z2rM5mHXA@public.gmane.org/msg03125.html):
>>
>> "Quite right. I should also use list_for_each_entry_safe(). I will fix this."
>>
>> This makes me wonder whether version three of this patch can go in unmodified ?
>
> There was a version 4 that followed. That was what I was referring to.
Thanks for the info -- I had missed version four of that patch. Now
that I had a look at it, why does the comment above
ipoib_cm_flush_path() say that it removes all entries while the loop
inside that function is stopped after the first entry has been found
and removed ? Why does that function use list_for_each_entry_safe()
while only a single entry is removed ?
Bart.
--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
^ permalink raw reply [flat|nested] 15+ messages in thread
* Re: IB/ipoib: fix dangling pointer reference to ipoib_neigh and ipoib_path -when will it go upstream?
[not found] ` <4C3AA0A2.3090406-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8@public.gmane.org>
2010-07-12 10:07 ` Bart Van Assche
@ 2010-07-12 21:21 ` Roland Dreier
[not found] ` <adapqysxup6.fsf-BjVyx320WGW9gfZ95n9DRSW4+XlvGpQz@public.gmane.org>
1 sibling, 1 reply; 15+ messages in thread
From: Roland Dreier @ 2010-07-12 21:21 UTC (permalink / raw)
To: Pradeep Satyanarayana; +Cc: linux-rdma
> I realize that the following patch:
>
> https://patchwork.kernel.org/patch/97243/
>
> is queued in your backlog of patches and unlikely that it will go into 2.6.35.
> What are the chances that it will make it into 2.6.36? This patch has fixed a
> a rarely seen crash and we would like it to go upstream ASAP.
How much testing and/or review have you done for this patch? What is
the impact of the bug in the cases you have seen?
- R.
--
Roland Dreier <rolandd-FYB4Gu1CFyUAvxtiuMwx3w@public.gmane.org> || For corporate legal information go to:
http://www.cisco.com/web/about/doing_business/legal/cri/index.html
--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
^ permalink raw reply [flat|nested] 15+ messages in thread
* Re: IB/ipoib: fix dangling pointer reference to ipoib_neigh and ipoib_path -when will it go upstream?
[not found] ` <adapqysxup6.fsf-BjVyx320WGW9gfZ95n9DRSW4+XlvGpQz@public.gmane.org>
@ 2010-07-13 6:07 ` Pradeep Satyanarayana
[not found] ` <4C3C02B2.9040408-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8@public.gmane.org>
0 siblings, 1 reply; 15+ messages in thread
From: Pradeep Satyanarayana @ 2010-07-13 6:07 UTC (permalink / raw)
To: Roland Dreier; +Cc: linux-rdma
Roland Dreier wrote:
> > I realize that the following patch:
> >
> > https://patchwork.kernel.org/patch/97243/
> >
> > is queued in your backlog of patches and unlikely that it will go into 2.6.35.
> > What are the chances that it will make it into 2.6.36? This patch has fixed a
> > a rarely seen crash and we would like it to go upstream ASAP.
>
> How much testing and/or review have you done for this patch? What is
> the impact of the bug in the cases you have seen?
I guess I came to a premature conclusion. One set of tests ran fine and I made that
conclusion. Another set of tests caused the following crash:
12:mon> t
[link register ] d00000000ec950a8 .cm_destroy_id+0x3d8/0x520 [ib_cm]
[c0000007544d2d40] d00000000ec95098 .cm_destroy_id+0x3c8/0x520 [ib_cm] (unreliable)
[c0000007544d2e10] d00000000ef9b590 .ipoib_cm_free_rx_reap_list+0xc8/0x180 [ib_ipoib]
[c0000007544d2ed0] d00000000ef9e1cc .ipoib_cm_dev_stop+0x23c/0x360 [ib_ipoib]
[c0000007544d2f90] d00000000ef94d94 .ipoib_ib_dev_stop+0xe4/0x4b0 [ib_ipoib]
[c0000007544d30f0] d00000000ef91d68 .ipoib_stop+0x88/0x178 [ib_ipoib]
[c0000007544d3180] c0000000004ea1cc .dev_close+0xdc/0x148
[c0000007544d3210] c0000000004e9790 .dev_change_flags+0x1f0/0x288
[c0000007544d32b0] c0000000004f6a0c .do_setlink+0x2d4/0x498
[c0000007544d3390] c0000000004f8b20 .rtnl_newlink+0x520/0x5f0
[c0000007544d35a0] c0000000004f853c .rtnetlink_rcv_msg+0x24c/0x310
[c0000007544d3650] c000000000513a38 .netlink_rcv_skb+0xf0/0x128
[c0000007544d36e0] c0000000004f82cc .rtnetlink_rcv+0x34/0x58
[c0000007544d3770] c0000000005134c0 .netlink_unicast+0x498/0x4b0
[c0000007544d3840] c000000000514218 .netlink_sendmsg+0x288/0x390
[c0000007544d3920] c0000000004cf538 .sock_sendmsg+0x118/0x158
[c0000007544d3b30] c0000000004cf720 .SyS_sendmsg+0x1a8/0x348
[c0000007544d3d70] c0000000004cdbdc .SyS_socketcall+0x174/0x338
[c0000007544d3e30] c0000000000085b4 syscall_exit+0x0/0x40
--- Exception: c01 (System Call) at 00000fffa25c35d0
SP (fffc5a12fe0) is in userspace
12:mon> e
cpu 0x12: Vector: 300 (Data Access) at [c0000007544d2ac0]
pc: c0000000002ef854: .rb_erase+0x64/0x180
lr: d00000000ec950a8: .cm_destroy_id+0x3d8/0x520 [ib_cm]
sp: c0000007544d2d40
msr: 8000000000001432
dar: 10000000100
dsisr: 40000000
current = 0xc0000006948dc240
paca = 0xc000000000f64a00
pid = 21593, comm = ip
12:mon>
Thanks
Pradeep
--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
^ permalink raw reply [flat|nested] 15+ messages in thread
* Re: IB/ipoib: fix dangling pointer reference to ipoib_neigh and ipoib_path -when will it go upstream?
[not found] ` <4C3C02B2.9040408-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8@public.gmane.org>
@ 2010-07-13 14:45 ` Roland Dreier
[not found] ` <adaeif7xwwz.fsf-BjVyx320WGW9gfZ95n9DRSW4+XlvGpQz@public.gmane.org>
0 siblings, 1 reply; 15+ messages in thread
From: Roland Dreier @ 2010-07-13 14:45 UTC (permalink / raw)
To: Pradeep Satyanarayana; +Cc: linux-rdma
> I guess I came to a premature conclusion. One set of tests ran fine and I made that
> conclusion. Another set of tests caused the following crash:
I don't really know how to interpret this. Is this crash new, or is it
the same crash you were hoping this patch fixed?
- R.
--
Roland Dreier <rolandd-FYB4Gu1CFyUAvxtiuMwx3w@public.gmane.org> || For corporate legal information go to:
http://www.cisco.com/web/about/doing_business/legal/cri/index.html
--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
^ permalink raw reply [flat|nested] 15+ messages in thread
* Re: IB/ipoib: fix dangling pointer reference to ipoib_neigh and ipoib_path -when will it go upstream?
[not found] ` <AANLkTil7Vbpg4n_TsR71F1mK5Sq6kzI0blgVjJ7_lebB-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
@ 2010-07-13 19:23 ` Ralph Campbell
0 siblings, 0 replies; 15+ messages in thread
From: Ralph Campbell @ 2010-07-13 19:23 UTC (permalink / raw)
To: Bart Van Assche; +Cc: Pradeep Satyanarayana, Roland Dreier, linux-rdma
On Mon, 2010-07-12 at 03:35 -0700, Bart Van Assche wrote:
> On Mon, Jul 12, 2010 at 12:20 PM, Pradeep Satyanarayana
> <pradeeps-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8@public.gmane.org> wrote:
> > Bart Van Assche wrote:
> >> On Mon, Jul 12, 2010 at 6:57 AM, Pradeep Satyanarayana
> >> <pradeeps-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8@public.gmane.org> wrote:
> >>> I realize that the following patch:
> >>>
> >>> https://patchwork.kernel.org/patch/97243/
> >>>
> >>> is queued in your backlog of patches and unlikely that it will go into 2.6.35.
> >>> What are the chances that it will make it into 2.6.36? This patch has fixed a
> >>> a rarely seen crash and we would like it to go upstream ASAP.
> >>
> >> The following comment was made on that patch by Ralph Campbell (see
> >> also http://www.mail-archive.com/linux-rdma-u79uwXL29TY76Z2rM5mHXA@public.gmane.org/msg03125.html):
> >>
> >> "Quite right. I should also use list_for_each_entry_safe(). I will fix this."
> >>
> >> This makes me wonder whether version three of this patch can go in unmodified ?
> >
> > There was a version 4 that followed. That was what I was referring to.
>
> Thanks for the info -- I had missed version four of that patch. Now
> that I had a look at it, why does the comment above
> ipoib_cm_flush_path() say that it removes all entries while the loop
> inside that function is stopped after the first entry has been found
> and removed ? Why does that function use list_for_each_entry_safe()
> while only a single entry is removed ?
>
> Bart.
There is only one matching entry on the list at any given time
so I guess list_for_each_entry_safe() isn't required.
--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
^ permalink raw reply [flat|nested] 15+ messages in thread
* Re: IB/ipoib: fix dangling pointer reference to ipoib_neigh and ipoib_path -when will it go upstream?
[not found] ` <adaeif7xwwz.fsf-BjVyx320WGW9gfZ95n9DRSW4+XlvGpQz@public.gmane.org>
@ 2010-07-14 2:54 ` Pradeep Satyanarayana
[not found] ` <4C3D26D0.3090508-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8@public.gmane.org>
0 siblings, 1 reply; 15+ messages in thread
From: Pradeep Satyanarayana @ 2010-07-14 2:54 UTC (permalink / raw)
To: Roland Dreier; +Cc: linux-rdma
Roland Dreier wrote:
> > I guess I came to a premature conclusion. One set of tests ran fine and I made that
> > conclusion. Another set of tests caused the following crash:
>
> I don't really know how to interpret this. Is this crash new, or is it
> the same crash you were hoping this patch fixed?
This is a new crash.
Thanks
Pradeep
--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
^ permalink raw reply [flat|nested] 15+ messages in thread
* Re: IB/ipoib: fix dangling pointer reference to ipoib_neigh and ipoib_path -when will it go upstream?
[not found] ` <4C3D26D0.3090508-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8@public.gmane.org>
@ 2010-07-14 16:25 ` Pradeep Satyanarayana
[not found] ` <4C3DE512.3020903-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8@public.gmane.org>
0 siblings, 1 reply; 15+ messages in thread
From: Pradeep Satyanarayana @ 2010-07-14 16:25 UTC (permalink / raw)
To: Roland Dreier; +Cc: linux-rdma
Pradeep Satyanarayana wrote:
> Roland Dreier wrote:
>> > I guess I came to a premature conclusion. One set of tests ran fine and I made that
>> > conclusion. Another set of tests caused the following crash:
>>
>> I don't really know how to interpret this. Is this crash new, or is it
>> the same crash you were hoping this patch fixed?
>
> This is a new crash.
I see other manifestations resulting in different crashes :
:mon> t
[c00000074603ba20] d0000000193527ac .ipoib_neigh_flush+0x6c/0x350 [ib_ipoib]
[c00000074603bb10] d000000019356dac .ipoib_mcast_free+0x74/0x2a0 [ib_ipoib]
[c00000074603bbe0] d000000019358558 .ipoib_mcast_restart_task+0x3d0/0x560 [ib_ipoib]
[c00000074603bd40] c0000000000c6fe4 .run_workqueue+0xf4/0x1e0
[c00000074603be00] c0000000000c7190 .worker_thread+0xc0/0x180
[c00000074603bed0] c0000000000ccf4c .kthread+0xb4/0xc0
[c00000074603bf90] c0000000000309fc .kernel_thread+0x54/0x70
9:mon> e
cpu 0x9: Vector: 300 (Data Access) at [c00000074603b720]
pc: c0000000005ac390: ._spin_lock+0x20/0xc8
lr: d0000000193527ac: .ipoib_neigh_flush+0x6c/0x350 [ib_ipoib]
sp: c00000074603b9a0
msr: 8000000000009032
dar: 3a0
dsisr: 40000000
current = 0xc000000756ce8b00
paca = 0xc000000000f63800
pid = 18095, comm = ipoib
9:mon>
Thanks
Pradeep
--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
^ permalink raw reply [flat|nested] 15+ messages in thread
* Re: IB/ipoib: fix dangling pointer reference to ipoib_neigh and ipoib_path -when will it go upstream?
[not found] ` <4C3DE512.3020903-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8@public.gmane.org>
@ 2010-07-15 11:56 ` Pradeep Satyanarayana
[not found] ` <4C3EF754.4060502-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8@public.gmane.org>
0 siblings, 1 reply; 15+ messages in thread
From: Pradeep Satyanarayana @ 2010-07-15 11:56 UTC (permalink / raw)
To: Ralph Campbell, Roland Dreier; +Cc: linux-rdma
Pradeep Satyanarayana wrote:
> Pradeep Satyanarayana wrote:
>> Roland Dreier wrote:
>>> > I guess I came to a premature conclusion. One set of tests ran fine and I made that
>>> > conclusion. Another set of tests caused the following crash:
>>>
>>> I don't really know how to interpret this. Is this crash new, or is it
>>> the same crash you were hoping this patch fixed?
>> This is a new crash.
>
> I see other manifestations resulting in different crashes :
>
> :mon> t
> [c00000074603ba20] d0000000193527ac .ipoib_neigh_flush+0x6c/0x350 [ib_ipoib]
> [c00000074603bb10] d000000019356dac .ipoib_mcast_free+0x74/0x2a0 [ib_ipoib]
> [c00000074603bbe0] d000000019358558 .ipoib_mcast_restart_task+0x3d0/0x560 [ib_ipoib]
> [c00000074603bd40] c0000000000c6fe4 .run_workqueue+0xf4/0x1e0
> [c00000074603be00] c0000000000c7190 .worker_thread+0xc0/0x180
> [c00000074603bed0] c0000000000ccf4c .kthread+0xb4/0xc0
> [c00000074603bf90] c0000000000309fc .kernel_thread+0x54/0x70
> 9:mon> e
> cpu 0x9: Vector: 300 (Data Access) at [c00000074603b720]
> pc: c0000000005ac390: ._spin_lock+0x20/0xc8
> lr: d0000000193527ac: .ipoib_neigh_flush+0x6c/0x350 [ib_ipoib]
> sp: c00000074603b9a0
> msr: 8000000000009032
> dar: 3a0
> dsisr: 40000000
> current = 0xc000000756ce8b00
> paca = 0xc000000000f63800
> pid = 18095, comm = ipoib
> 9:mon>
Recreating the crash has been tricky. I have tried several several hundred times today
to unload and reload IPoIB while there is traffic and no crashes happened. I took
a closer look at the IPoIB CM code and I see a few things that look suspicious.
In the ipoib_cm_send() path no priv->lock is held, whereas the priv->lock is held before
calling ipoib_cm_destroy_tx(). This is true with and without Ralph's patch (fix dangling pointer).
Is this a potential race?
In Roland's git tree I do see a test_and_clear_bit(IPOIB_FLAG_INITIALIZED, &tx->flags) in
ipoib_cm_destroy_tx() which seems to be missing in Ralph's patch. In Ralph's patch) there is a
clear_bit(IPOIB_FLAG_OPER_UP, &tx->flags) called before calling ipoib_cm_destroy_tx() only in
select cases. Was that intended?
Thanks
Pradeep
--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
^ permalink raw reply [flat|nested] 15+ messages in thread
* Re: IB/ipoib: fix dangling pointer reference to ipoib_neigh and ipoib_path -when will it go upstream?
[not found] ` <4C3EF754.4060502-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8@public.gmane.org>
@ 2010-07-15 19:59 ` Ralph Campbell
2010-07-16 1:29 ` Ralph Campbell
1 sibling, 0 replies; 15+ messages in thread
From: Ralph Campbell @ 2010-07-15 19:59 UTC (permalink / raw)
To: Pradeep Satyanarayana; +Cc: Roland Dreier, linux-rdma
I will write up a description of the locking as I
understand it and the changes I made. Give
me a day or two to write it up and check it.
On Thu, 2010-07-15 at 04:56 -0700, Pradeep Satyanarayana wrote:
> Pradeep Satyanarayana wrote:
> > Pradeep Satyanarayana wrote:
> >> Roland Dreier wrote:
> >>> > I guess I came to a premature conclusion. One set of tests ran fine and I made that
> >>> > conclusion. Another set of tests caused the following crash:
> >>>
> >>> I don't really know how to interpret this. Is this crash new, or is it
> >>> the same crash you were hoping this patch fixed?
> >> This is a new crash.
> >
> > I see other manifestations resulting in different crashes :
> >
> > :mon> t
> > [c00000074603ba20] d0000000193527ac .ipoib_neigh_flush+0x6c/0x350 [ib_ipoib]
> > [c00000074603bb10] d000000019356dac .ipoib_mcast_free+0x74/0x2a0 [ib_ipoib]
> > [c00000074603bbe0] d000000019358558 .ipoib_mcast_restart_task+0x3d0/0x560 [ib_ipoib]
> > [c00000074603bd40] c0000000000c6fe4 .run_workqueue+0xf4/0x1e0
> > [c00000074603be00] c0000000000c7190 .worker_thread+0xc0/0x180
> > [c00000074603bed0] c0000000000ccf4c .kthread+0xb4/0xc0
> > [c00000074603bf90] c0000000000309fc .kernel_thread+0x54/0x70
> > 9:mon> e
> > cpu 0x9: Vector: 300 (Data Access) at [c00000074603b720]
> > pc: c0000000005ac390: ._spin_lock+0x20/0xc8
> > lr: d0000000193527ac: .ipoib_neigh_flush+0x6c/0x350 [ib_ipoib]
> > sp: c00000074603b9a0
> > msr: 8000000000009032
> > dar: 3a0
> > dsisr: 40000000
> > current = 0xc000000756ce8b00
> > paca = 0xc000000000f63800
> > pid = 18095, comm = ipoib
> > 9:mon>
>
> Recreating the crash has been tricky. I have tried several several hundred times today
> to unload and reload IPoIB while there is traffic and no crashes happened. I took
> a closer look at the IPoIB CM code and I see a few things that look suspicious.
>
> In the ipoib_cm_send() path no priv->lock is held, whereas the priv->lock is held before
> calling ipoib_cm_destroy_tx(). This is true with and without Ralph's patch (fix dangling pointer).
> Is this a potential race?
>
> In Roland's git tree I do see a test_and_clear_bit(IPOIB_FLAG_INITIALIZED, &tx->flags) in
> ipoib_cm_destroy_tx() which seems to be missing in Ralph's patch. In Ralph's patch) there is a
> clear_bit(IPOIB_FLAG_OPER_UP, &tx->flags) called before calling ipoib_cm_destroy_tx() only in
> select cases. Was that intended?
>
> Thanks
> Pradeep
>
--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
^ permalink raw reply [flat|nested] 15+ messages in thread
* Re: IB/ipoib: fix dangling pointer reference to ipoib_neigh and ipoib_path -when will it go upstream?
[not found] ` <4C3EF754.4060502-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8@public.gmane.org>
2010-07-15 19:59 ` Ralph Campbell
@ 2010-07-16 1:29 ` Ralph Campbell
[not found] ` <1279243768.31421.48.camel-/vjeY7uYZjrPXfVEPVhPGq6RkeBMCJyt@public.gmane.org>
1 sibling, 1 reply; 15+ messages in thread
From: Ralph Campbell @ 2010-07-16 1:29 UTC (permalink / raw)
To: Pradeep Satyanarayana; +Cc: Roland Dreier, linux-rdma
On Thu, 2010-07-15 at 04:56 -0700, Pradeep Satyanarayana wrote:
> Pradeep Satyanarayana wrote:
> > Pradeep Satyanarayana wrote:
> >> Roland Dreier wrote:
> >>> > I guess I came to a premature conclusion. One set of tests ran fine and I made that
> >>> > conclusion. Another set of tests caused the following crash:
> >>>
> >>> I don't really know how to interpret this. Is this crash new, or is it
> >>> the same crash you were hoping this patch fixed?
> >> This is a new crash.
> >
> > I see other manifestations resulting in different crashes :
> >
> > :mon> t
> > [c00000074603ba20] d0000000193527ac .ipoib_neigh_flush+0x6c/0x350 [ib_ipoib]
> > [c00000074603bb10] d000000019356dac .ipoib_mcast_free+0x74/0x2a0 [ib_ipoib]
> > [c00000074603bbe0] d000000019358558 .ipoib_mcast_restart_task+0x3d0/0x560 [ib_ipoib]
> > [c00000074603bd40] c0000000000c6fe4 .run_workqueue+0xf4/0x1e0
> > [c00000074603be00] c0000000000c7190 .worker_thread+0xc0/0x180
> > [c00000074603bed0] c0000000000ccf4c .kthread+0xb4/0xc0
> > [c00000074603bf90] c0000000000309fc .kernel_thread+0x54/0x70
> > 9:mon> e
> > cpu 0x9: Vector: 300 (Data Access) at [c00000074603b720]
> > pc: c0000000005ac390: ._spin_lock+0x20/0xc8
> > lr: d0000000193527ac: .ipoib_neigh_flush+0x6c/0x350 [ib_ipoib]
> > sp: c00000074603b9a0
> > msr: 8000000000009032
> > dar: 3a0
> > dsisr: 40000000
> > current = 0xc000000756ce8b00
> > paca = 0xc000000000f63800
> > pid = 18095, comm = ipoib
> > 9:mon>
>
> Recreating the crash has been tricky. I have tried several several hundred times today
> to unload and reload IPoIB while there is traffic and no crashes happened. I took
> a closer look at the IPoIB CM code and I see a few things that look suspicious.
>
> In the ipoib_cm_send() path no priv->lock is held, whereas the priv->lock is held before
> calling ipoib_cm_destroy_tx(). This is true with and without Ralph's patch (fix dangling pointer).
> Is this a potential race?
ipoib_cm_send() is only called by ipoib_start_xmit() so it is protected
by netif_tx_lock(dev) or stopping the ipoib network device.
It all depends on what pointer or data structure you think is being
accessed while free or being modified without the proper protection.
> In Roland's git tree I do see a test_and_clear_bit(IPOIB_FLAG_INITIALIZED, &tx->flags) in
> ipoib_cm_destroy_tx() which seems to be missing in Ralph's patch. In Ralph's patch) there is a
> clear_bit(IPOIB_FLAG_OPER_UP, &tx->flags) called before calling ipoib_cm_destroy_tx() only in
> select cases. Was that intended?
The v4 patch comments explain the changes:
http://www.spinics.net/lists/linux-rdma/msg03733.html
Basically, IPOIB_FLAG_INITIALIZED now means that the struct ipoib_cm_tx
has completed the RC QP creation process via the CM instead of simply
when ipoib_cm_create_tx() allocates the structure.
The test and clear was used to indicate the struct ipoib_cm_tx
had been put on the destroy list and the reaper thread woken up.
Now ipoib_cm_destroy_tx() uses the tx->neigh pointer != NULL to
indicate that ipoib_cm_destroy_tx() has started the destroy process.
ipoib_cm_destroy_tx() is only called when netif_tx_lock() and priv->lock
are held to protect tx->neigh.
> Thanks
> Pradeep
The longer write up on locking is turning out to be very complex.
I will keep working on it but I think it will be just as hard
to understand as slogging through the code.
--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
^ permalink raw reply [flat|nested] 15+ messages in thread
* Re: IB/ipoib: fix dangling pointer reference to ipoib_neigh and ipoib_path -when will it go upstream?
[not found] ` <1279243768.31421.48.camel-/vjeY7uYZjrPXfVEPVhPGq6RkeBMCJyt@public.gmane.org>
@ 2010-07-16 9:13 ` Pradeep Satyanarayana
[not found] ` <4C4022BC.3030401-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8@public.gmane.org>
0 siblings, 1 reply; 15+ messages in thread
From: Pradeep Satyanarayana @ 2010-07-16 9:13 UTC (permalink / raw)
To: Ralph Campbell; +Cc: Roland Dreier, linux-rdma
Ralph Campbell wrote:
> On Thu, 2010-07-15 at 04:56 -0700, Pradeep Satyanarayana wrote:
>> Pradeep Satyanarayana wrote:
>>> Pradeep Satyanarayana wrote:
>>>> Roland Dreier wrote:
>>>>> > I guess I came to a premature conclusion. One set of tests ran fine and I made that
>>>>> > conclusion. Another set of tests caused the following crash:
>>>>>
>>>>> I don't really know how to interpret this. Is this crash new, or is it
>>>>> the same crash you were hoping this patch fixed?
>>>> This is a new crash.
>>> I see other manifestations resulting in different crashes :
>>>
>>> :mon> t
>>> [c00000074603ba20] d0000000193527ac .ipoib_neigh_flush+0x6c/0x350 [ib_ipoib]
>>> [c00000074603bb10] d000000019356dac .ipoib_mcast_free+0x74/0x2a0 [ib_ipoib]
>>> [c00000074603bbe0] d000000019358558 .ipoib_mcast_restart_task+0x3d0/0x560 [ib_ipoib]
>>> [c00000074603bd40] c0000000000c6fe4 .run_workqueue+0xf4/0x1e0
>>> [c00000074603be00] c0000000000c7190 .worker_thread+0xc0/0x180
>>> [c00000074603bed0] c0000000000ccf4c .kthread+0xb4/0xc0
>>> [c00000074603bf90] c0000000000309fc .kernel_thread+0x54/0x70
>>> 9:mon> e
>>> cpu 0x9: Vector: 300 (Data Access) at [c00000074603b720]
>>> pc: c0000000005ac390: ._spin_lock+0x20/0xc8
>>> lr: d0000000193527ac: .ipoib_neigh_flush+0x6c/0x350 [ib_ipoib]
>>> sp: c00000074603b9a0
>>> msr: 8000000000009032
>>> dar: 3a0
>>> dsisr: 40000000
>>> current = 0xc000000756ce8b00
>>> paca = 0xc000000000f63800
>>> pid = 18095, comm = ipoib
>>> 9:mon>
>> Recreating the crash has been tricky. I have tried several several hundred times today
>> to unload and reload IPoIB while there is traffic and no crashes happened. I took
>> a closer look at the IPoIB CM code and I see a few things that look suspicious.
>>
>> In the ipoib_cm_send() path no priv->lock is held, whereas the priv->lock is held before
>> calling ipoib_cm_destroy_tx(). This is true with and without Ralph's patch (fix dangling pointer).
>> Is this a potential race?
>
> ipoib_cm_send() is only called by ipoib_start_xmit() so it is protected
> by netif_tx_lock(dev) or stopping the ipoib network device.
I still see one case in ipoib_neigh_cleanup() wherein ipoib_cm_destroy_tx() appears to be called
without netif_tx_lock(dev) held. Is that correct?
Thanks
Pradeep
--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
^ permalink raw reply [flat|nested] 15+ messages in thread
* Re: IB/ipoib: fix dangling pointer reference to ipoib_neigh and ipoib_path -when will it go upstream?
[not found] ` <4C4022BC.3030401-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8@public.gmane.org>
@ 2010-07-16 17:43 ` Ralph Campbell
0 siblings, 0 replies; 15+ messages in thread
From: Ralph Campbell @ 2010-07-16 17:43 UTC (permalink / raw)
To: Pradeep Satyanarayana; +Cc: Roland Dreier, linux-rdma
On Fri, 2010-07-16 at 02:13 -0700, Pradeep Satyanarayana wrote:
> Ralph Campbell wrote:
> > On Thu, 2010-07-15 at 04:56 -0700, Pradeep Satyanarayana wrote:
> >> Pradeep Satyanarayana wrote:
> >>> Pradeep Satyanarayana wrote:
> >>>> Roland Dreier wrote:
> >>>>> > I guess I came to a premature conclusion. One set of tests ran fine and I made that
> >>>>> > conclusion. Another set of tests caused the following crash:
> >>>>>
> >>>>> I don't really know how to interpret this. Is this crash new, or is it
> >>>>> the same crash you were hoping this patch fixed?
> >>>> This is a new crash.
> >>> I see other manifestations resulting in different crashes :
> >>>
> >>> :mon> t
> >>> [c00000074603ba20] d0000000193527ac .ipoib_neigh_flush+0x6c/0x350 [ib_ipoib]
> >>> [c00000074603bb10] d000000019356dac .ipoib_mcast_free+0x74/0x2a0 [ib_ipoib]
> >>> [c00000074603bbe0] d000000019358558 .ipoib_mcast_restart_task+0x3d0/0x560 [ib_ipoib]
> >>> [c00000074603bd40] c0000000000c6fe4 .run_workqueue+0xf4/0x1e0
> >>> [c00000074603be00] c0000000000c7190 .worker_thread+0xc0/0x180
> >>> [c00000074603bed0] c0000000000ccf4c .kthread+0xb4/0xc0
> >>> [c00000074603bf90] c0000000000309fc .kernel_thread+0x54/0x70
> >>> 9:mon> e
> >>> cpu 0x9: Vector: 300 (Data Access) at [c00000074603b720]
> >>> pc: c0000000005ac390: ._spin_lock+0x20/0xc8
> >>> lr: d0000000193527ac: .ipoib_neigh_flush+0x6c/0x350 [ib_ipoib]
> >>> sp: c00000074603b9a0
> >>> msr: 8000000000009032
> >>> dar: 3a0
> >>> dsisr: 40000000
> >>> current = 0xc000000756ce8b00
> >>> paca = 0xc000000000f63800
> >>> pid = 18095, comm = ipoib
> >>> 9:mon>
> >> Recreating the crash has been tricky. I have tried several several hundred times today
> >> to unload and reload IPoIB while there is traffic and no crashes happened. I took
> >> a closer look at the IPoIB CM code and I see a few things that look suspicious.
> >>
> >> In the ipoib_cm_send() path no priv->lock is held, whereas the priv->lock is held before
> >> calling ipoib_cm_destroy_tx(). This is true with and without Ralph's patch (fix dangling pointer).
> >> Is this a potential race?
> >
> > ipoib_cm_send() is only called by ipoib_start_xmit() so it is protected
> > by netif_tx_lock(dev) or stopping the ipoib network device.
>
> I still see one case in ipoib_neigh_cleanup() wherein ipoib_cm_destroy_tx() appears to be called
> without netif_tx_lock(dev) held. Is that correct?
>
> Thanks
> Pradeep
ipoib_neigh_cleanup() is called by neigh_cleanup_and_release() when
freeing a struct neighbour. I assume the Linux network stack is
not going to call into the IPoIB driver to send sk_buffs in that
case but I could be wrong. If it can, then you are correct that
the netif_tx_lock(dev) should be acquired.
--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
^ permalink raw reply [flat|nested] 15+ messages in thread
end of thread, other threads:[~2010-07-16 17:43 UTC | newest]
Thread overview: 15+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2010-07-12 4:57 IB/ipoib: fix dangling pointer reference to ipoib_neigh and ipoib_path -when will it go upstream? Pradeep Satyanarayana
[not found] ` <4C3AA0A2.3090406-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8@public.gmane.org>
2010-07-12 10:07 ` Bart Van Assche
[not found] ` <AANLkTilkXY0vDA4dtq9cMIO2shX-kw0ZlwadVjm4QMui-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
2010-07-12 10:20 ` Pradeep Satyanarayana
[not found] ` <4C3AEC65.1090304-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8@public.gmane.org>
2010-07-12 10:35 ` Bart Van Assche
[not found] ` <AANLkTil7Vbpg4n_TsR71F1mK5Sq6kzI0blgVjJ7_lebB-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
2010-07-13 19:23 ` Ralph Campbell
2010-07-12 21:21 ` Roland Dreier
[not found] ` <adapqysxup6.fsf-BjVyx320WGW9gfZ95n9DRSW4+XlvGpQz@public.gmane.org>
2010-07-13 6:07 ` Pradeep Satyanarayana
[not found] ` <4C3C02B2.9040408-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8@public.gmane.org>
2010-07-13 14:45 ` Roland Dreier
[not found] ` <adaeif7xwwz.fsf-BjVyx320WGW9gfZ95n9DRSW4+XlvGpQz@public.gmane.org>
2010-07-14 2:54 ` Pradeep Satyanarayana
[not found] ` <4C3D26D0.3090508-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8@public.gmane.org>
2010-07-14 16:25 ` Pradeep Satyanarayana
[not found] ` <4C3DE512.3020903-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8@public.gmane.org>
2010-07-15 11:56 ` Pradeep Satyanarayana
[not found] ` <4C3EF754.4060502-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8@public.gmane.org>
2010-07-15 19:59 ` Ralph Campbell
2010-07-16 1:29 ` Ralph Campbell
[not found] ` <1279243768.31421.48.camel-/vjeY7uYZjrPXfVEPVhPGq6RkeBMCJyt@public.gmane.org>
2010-07-16 9:13 ` Pradeep Satyanarayana
[not found] ` <4C4022BC.3030401-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8@public.gmane.org>
2010-07-16 17:43 ` Ralph Campbell
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox