* IB/ipoib: fix dangling pointer reference to ipoib_neigh and ipoib_path -when will it go upstream?
@ 2010-07-12 4:57 Pradeep Satyanarayana
[not found] ` <4C3AA0A2.3090406-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8@public.gmane.org>
0 siblings, 1 reply; 15+ messages in thread
From: Pradeep Satyanarayana @ 2010-07-12 4:57 UTC (permalink / raw)
To: Roland Dreier; +Cc: linux-rdma
Roland,
I realize that the following patch:
https://patchwork.kernel.org/patch/97243/
is queued in your backlog of patches and unlikely that it will go into 2.6.35.
What are the chances that it will make it into 2.6.36? This patch has fixed a
a rarely seen crash and we would like it to go upstream ASAP.
Thanks
Pradeep
--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
^ permalink raw reply [flat|nested] 15+ messages in thread[parent not found: <4C3AA0A2.3090406-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8@public.gmane.org>]
* Re: IB/ipoib: fix dangling pointer reference to ipoib_neigh and ipoib_path -when will it go upstream? [not found] ` <4C3AA0A2.3090406-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8@public.gmane.org> @ 2010-07-12 10:07 ` Bart Van Assche [not found] ` <AANLkTilkXY0vDA4dtq9cMIO2shX-kw0ZlwadVjm4QMui-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org> 2010-07-12 21:21 ` Roland Dreier 1 sibling, 1 reply; 15+ messages in thread From: Bart Van Assche @ 2010-07-12 10:07 UTC (permalink / raw) To: Pradeep Satyanarayana; +Cc: Roland Dreier, linux-rdma, Ralph Campbell On Mon, Jul 12, 2010 at 6:57 AM, Pradeep Satyanarayana <pradeeps-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8@public.gmane.org> wrote: > I realize that the following patch: > > https://patchwork.kernel.org/patch/97243/ > > is queued in your backlog of patches and unlikely that it will go into 2.6.35. > What are the chances that it will make it into 2.6.36? This patch has fixed a > a rarely seen crash and we would like it to go upstream ASAP. The following comment was made on that patch by Ralph Campbell (see also http://www.mail-archive.com/linux-rdma-u79uwXL29TY76Z2rM5mHXA@public.gmane.org/msg03125.html): "Quite right. I should also use list_for_each_entry_safe(). I will fix this." This makes me wonder whether version three of this patch can go in unmodified ? Bart. -- To unsubscribe from this list: send the line "unsubscribe linux-rdma" in the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org More majordomo info at http://vger.kernel.org/majordomo-info.html ^ permalink raw reply [flat|nested] 15+ messages in thread
[parent not found: <AANLkTilkXY0vDA4dtq9cMIO2shX-kw0ZlwadVjm4QMui-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>]
* Re: IB/ipoib: fix dangling pointer reference to ipoib_neigh and ipoib_path -when will it go upstream? [not found] ` <AANLkTilkXY0vDA4dtq9cMIO2shX-kw0ZlwadVjm4QMui-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org> @ 2010-07-12 10:20 ` Pradeep Satyanarayana [not found] ` <4C3AEC65.1090304-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8@public.gmane.org> 0 siblings, 1 reply; 15+ messages in thread From: Pradeep Satyanarayana @ 2010-07-12 10:20 UTC (permalink / raw) To: Bart Van Assche; +Cc: Roland Dreier, linux-rdma, Ralph Campbell Bart Van Assche wrote: > On Mon, Jul 12, 2010 at 6:57 AM, Pradeep Satyanarayana > <pradeeps-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8@public.gmane.org> wrote: >> I realize that the following patch: >> >> https://patchwork.kernel.org/patch/97243/ >> >> is queued in your backlog of patches and unlikely that it will go into 2.6.35. >> What are the chances that it will make it into 2.6.36? This patch has fixed a >> a rarely seen crash and we would like it to go upstream ASAP. > > The following comment was made on that patch by Ralph Campbell (see > also http://www.mail-archive.com/linux-rdma-u79uwXL29TY76Z2rM5mHXA@public.gmane.org/msg03125.html): > > "Quite right. I should also use list_for_each_entry_safe(). I will fix this." > > This makes me wonder whether version three of this patch can go in unmodified ? There was a version 4 that followed. That was what I was referring to. Thanks Pradeep -- To unsubscribe from this list: send the line "unsubscribe linux-rdma" in the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org More majordomo info at http://vger.kernel.org/majordomo-info.html ^ permalink raw reply [flat|nested] 15+ messages in thread
[parent not found: <4C3AEC65.1090304-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8@public.gmane.org>]
* Re: IB/ipoib: fix dangling pointer reference to ipoib_neigh and ipoib_path -when will it go upstream? [not found] ` <4C3AEC65.1090304-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8@public.gmane.org> @ 2010-07-12 10:35 ` Bart Van Assche [not found] ` <AANLkTil7Vbpg4n_TsR71F1mK5Sq6kzI0blgVjJ7_lebB-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org> 0 siblings, 1 reply; 15+ messages in thread From: Bart Van Assche @ 2010-07-12 10:35 UTC (permalink / raw) To: Pradeep Satyanarayana; +Cc: Roland Dreier, linux-rdma, Ralph Campbell On Mon, Jul 12, 2010 at 12:20 PM, Pradeep Satyanarayana <pradeeps-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8@public.gmane.org> wrote: > Bart Van Assche wrote: >> On Mon, Jul 12, 2010 at 6:57 AM, Pradeep Satyanarayana >> <pradeeps-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8@public.gmane.org> wrote: >>> I realize that the following patch: >>> >>> https://patchwork.kernel.org/patch/97243/ >>> >>> is queued in your backlog of patches and unlikely that it will go into 2.6.35. >>> What are the chances that it will make it into 2.6.36? This patch has fixed a >>> a rarely seen crash and we would like it to go upstream ASAP. >> >> The following comment was made on that patch by Ralph Campbell (see >> also http://www.mail-archive.com/linux-rdma-u79uwXL29TY76Z2rM5mHXA@public.gmane.org/msg03125.html): >> >> "Quite right. I should also use list_for_each_entry_safe(). I will fix this." >> >> This makes me wonder whether version three of this patch can go in unmodified ? > > There was a version 4 that followed. That was what I was referring to. Thanks for the info -- I had missed version four of that patch. Now that I had a look at it, why does the comment above ipoib_cm_flush_path() say that it removes all entries while the loop inside that function is stopped after the first entry has been found and removed ? Why does that function use list_for_each_entry_safe() while only a single entry is removed ? Bart. -- To unsubscribe from this list: send the line "unsubscribe linux-rdma" in the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org More majordomo info at http://vger.kernel.org/majordomo-info.html ^ permalink raw reply [flat|nested] 15+ messages in thread
[parent not found: <AANLkTil7Vbpg4n_TsR71F1mK5Sq6kzI0blgVjJ7_lebB-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>]
* Re: IB/ipoib: fix dangling pointer reference to ipoib_neigh and ipoib_path -when will it go upstream? [not found] ` <AANLkTil7Vbpg4n_TsR71F1mK5Sq6kzI0blgVjJ7_lebB-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org> @ 2010-07-13 19:23 ` Ralph Campbell 0 siblings, 0 replies; 15+ messages in thread From: Ralph Campbell @ 2010-07-13 19:23 UTC (permalink / raw) To: Bart Van Assche; +Cc: Pradeep Satyanarayana, Roland Dreier, linux-rdma On Mon, 2010-07-12 at 03:35 -0700, Bart Van Assche wrote: > On Mon, Jul 12, 2010 at 12:20 PM, Pradeep Satyanarayana > <pradeeps-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8@public.gmane.org> wrote: > > Bart Van Assche wrote: > >> On Mon, Jul 12, 2010 at 6:57 AM, Pradeep Satyanarayana > >> <pradeeps-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8@public.gmane.org> wrote: > >>> I realize that the following patch: > >>> > >>> https://patchwork.kernel.org/patch/97243/ > >>> > >>> is queued in your backlog of patches and unlikely that it will go into 2.6.35. > >>> What are the chances that it will make it into 2.6.36? This patch has fixed a > >>> a rarely seen crash and we would like it to go upstream ASAP. > >> > >> The following comment was made on that patch by Ralph Campbell (see > >> also http://www.mail-archive.com/linux-rdma-u79uwXL29TY76Z2rM5mHXA@public.gmane.org/msg03125.html): > >> > >> "Quite right. I should also use list_for_each_entry_safe(). I will fix this." > >> > >> This makes me wonder whether version three of this patch can go in unmodified ? > > > > There was a version 4 that followed. That was what I was referring to. > > Thanks for the info -- I had missed version four of that patch. Now > that I had a look at it, why does the comment above > ipoib_cm_flush_path() say that it removes all entries while the loop > inside that function is stopped after the first entry has been found > and removed ? Why does that function use list_for_each_entry_safe() > while only a single entry is removed ? > > Bart. There is only one matching entry on the list at any given time so I guess list_for_each_entry_safe() isn't required. -- To unsubscribe from this list: send the line "unsubscribe linux-rdma" in the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org More majordomo info at http://vger.kernel.org/majordomo-info.html ^ permalink raw reply [flat|nested] 15+ messages in thread
* Re: IB/ipoib: fix dangling pointer reference to ipoib_neigh and ipoib_path -when will it go upstream? [not found] ` <4C3AA0A2.3090406-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8@public.gmane.org> 2010-07-12 10:07 ` Bart Van Assche @ 2010-07-12 21:21 ` Roland Dreier [not found] ` <adapqysxup6.fsf-BjVyx320WGW9gfZ95n9DRSW4+XlvGpQz@public.gmane.org> 1 sibling, 1 reply; 15+ messages in thread From: Roland Dreier @ 2010-07-12 21:21 UTC (permalink / raw) To: Pradeep Satyanarayana; +Cc: linux-rdma > I realize that the following patch: > > https://patchwork.kernel.org/patch/97243/ > > is queued in your backlog of patches and unlikely that it will go into 2.6.35. > What are the chances that it will make it into 2.6.36? This patch has fixed a > a rarely seen crash and we would like it to go upstream ASAP. How much testing and/or review have you done for this patch? What is the impact of the bug in the cases you have seen? - R. -- Roland Dreier <rolandd-FYB4Gu1CFyUAvxtiuMwx3w@public.gmane.org> || For corporate legal information go to: http://www.cisco.com/web/about/doing_business/legal/cri/index.html -- To unsubscribe from this list: send the line "unsubscribe linux-rdma" in the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org More majordomo info at http://vger.kernel.org/majordomo-info.html ^ permalink raw reply [flat|nested] 15+ messages in thread
[parent not found: <adapqysxup6.fsf-BjVyx320WGW9gfZ95n9DRSW4+XlvGpQz@public.gmane.org>]
* Re: IB/ipoib: fix dangling pointer reference to ipoib_neigh and ipoib_path -when will it go upstream? [not found] ` <adapqysxup6.fsf-BjVyx320WGW9gfZ95n9DRSW4+XlvGpQz@public.gmane.org> @ 2010-07-13 6:07 ` Pradeep Satyanarayana [not found] ` <4C3C02B2.9040408-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8@public.gmane.org> 0 siblings, 1 reply; 15+ messages in thread From: Pradeep Satyanarayana @ 2010-07-13 6:07 UTC (permalink / raw) To: Roland Dreier; +Cc: linux-rdma Roland Dreier wrote: > > I realize that the following patch: > > > > https://patchwork.kernel.org/patch/97243/ > > > > is queued in your backlog of patches and unlikely that it will go into 2.6.35. > > What are the chances that it will make it into 2.6.36? This patch has fixed a > > a rarely seen crash and we would like it to go upstream ASAP. > > How much testing and/or review have you done for this patch? What is > the impact of the bug in the cases you have seen? I guess I came to a premature conclusion. One set of tests ran fine and I made that conclusion. Another set of tests caused the following crash: 12:mon> t [link register ] d00000000ec950a8 .cm_destroy_id+0x3d8/0x520 [ib_cm] [c0000007544d2d40] d00000000ec95098 .cm_destroy_id+0x3c8/0x520 [ib_cm] (unreliable) [c0000007544d2e10] d00000000ef9b590 .ipoib_cm_free_rx_reap_list+0xc8/0x180 [ib_ipoib] [c0000007544d2ed0] d00000000ef9e1cc .ipoib_cm_dev_stop+0x23c/0x360 [ib_ipoib] [c0000007544d2f90] d00000000ef94d94 .ipoib_ib_dev_stop+0xe4/0x4b0 [ib_ipoib] [c0000007544d30f0] d00000000ef91d68 .ipoib_stop+0x88/0x178 [ib_ipoib] [c0000007544d3180] c0000000004ea1cc .dev_close+0xdc/0x148 [c0000007544d3210] c0000000004e9790 .dev_change_flags+0x1f0/0x288 [c0000007544d32b0] c0000000004f6a0c .do_setlink+0x2d4/0x498 [c0000007544d3390] c0000000004f8b20 .rtnl_newlink+0x520/0x5f0 [c0000007544d35a0] c0000000004f853c .rtnetlink_rcv_msg+0x24c/0x310 [c0000007544d3650] c000000000513a38 .netlink_rcv_skb+0xf0/0x128 [c0000007544d36e0] c0000000004f82cc .rtnetlink_rcv+0x34/0x58 [c0000007544d3770] c0000000005134c0 .netlink_unicast+0x498/0x4b0 [c0000007544d3840] c000000000514218 .netlink_sendmsg+0x288/0x390 [c0000007544d3920] c0000000004cf538 .sock_sendmsg+0x118/0x158 [c0000007544d3b30] c0000000004cf720 .SyS_sendmsg+0x1a8/0x348 [c0000007544d3d70] c0000000004cdbdc .SyS_socketcall+0x174/0x338 [c0000007544d3e30] c0000000000085b4 syscall_exit+0x0/0x40 --- Exception: c01 (System Call) at 00000fffa25c35d0 SP (fffc5a12fe0) is in userspace 12:mon> e cpu 0x12: Vector: 300 (Data Access) at [c0000007544d2ac0] pc: c0000000002ef854: .rb_erase+0x64/0x180 lr: d00000000ec950a8: .cm_destroy_id+0x3d8/0x520 [ib_cm] sp: c0000007544d2d40 msr: 8000000000001432 dar: 10000000100 dsisr: 40000000 current = 0xc0000006948dc240 paca = 0xc000000000f64a00 pid = 21593, comm = ip 12:mon> Thanks Pradeep -- To unsubscribe from this list: send the line "unsubscribe linux-rdma" in the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org More majordomo info at http://vger.kernel.org/majordomo-info.html ^ permalink raw reply [flat|nested] 15+ messages in thread
[parent not found: <4C3C02B2.9040408-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8@public.gmane.org>]
* Re: IB/ipoib: fix dangling pointer reference to ipoib_neigh and ipoib_path -when will it go upstream? [not found] ` <4C3C02B2.9040408-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8@public.gmane.org> @ 2010-07-13 14:45 ` Roland Dreier [not found] ` <adaeif7xwwz.fsf-BjVyx320WGW9gfZ95n9DRSW4+XlvGpQz@public.gmane.org> 0 siblings, 1 reply; 15+ messages in thread From: Roland Dreier @ 2010-07-13 14:45 UTC (permalink / raw) To: Pradeep Satyanarayana; +Cc: linux-rdma > I guess I came to a premature conclusion. One set of tests ran fine and I made that > conclusion. Another set of tests caused the following crash: I don't really know how to interpret this. Is this crash new, or is it the same crash you were hoping this patch fixed? - R. -- Roland Dreier <rolandd-FYB4Gu1CFyUAvxtiuMwx3w@public.gmane.org> || For corporate legal information go to: http://www.cisco.com/web/about/doing_business/legal/cri/index.html -- To unsubscribe from this list: send the line "unsubscribe linux-rdma" in the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org More majordomo info at http://vger.kernel.org/majordomo-info.html ^ permalink raw reply [flat|nested] 15+ messages in thread
[parent not found: <adaeif7xwwz.fsf-BjVyx320WGW9gfZ95n9DRSW4+XlvGpQz@public.gmane.org>]
* Re: IB/ipoib: fix dangling pointer reference to ipoib_neigh and ipoib_path -when will it go upstream? [not found] ` <adaeif7xwwz.fsf-BjVyx320WGW9gfZ95n9DRSW4+XlvGpQz@public.gmane.org> @ 2010-07-14 2:54 ` Pradeep Satyanarayana [not found] ` <4C3D26D0.3090508-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8@public.gmane.org> 0 siblings, 1 reply; 15+ messages in thread From: Pradeep Satyanarayana @ 2010-07-14 2:54 UTC (permalink / raw) To: Roland Dreier; +Cc: linux-rdma Roland Dreier wrote: > > I guess I came to a premature conclusion. One set of tests ran fine and I made that > > conclusion. Another set of tests caused the following crash: > > I don't really know how to interpret this. Is this crash new, or is it > the same crash you were hoping this patch fixed? This is a new crash. Thanks Pradeep -- To unsubscribe from this list: send the line "unsubscribe linux-rdma" in the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org More majordomo info at http://vger.kernel.org/majordomo-info.html ^ permalink raw reply [flat|nested] 15+ messages in thread
[parent not found: <4C3D26D0.3090508-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8@public.gmane.org>]
* Re: IB/ipoib: fix dangling pointer reference to ipoib_neigh and ipoib_path -when will it go upstream? [not found] ` <4C3D26D0.3090508-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8@public.gmane.org> @ 2010-07-14 16:25 ` Pradeep Satyanarayana [not found] ` <4C3DE512.3020903-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8@public.gmane.org> 0 siblings, 1 reply; 15+ messages in thread From: Pradeep Satyanarayana @ 2010-07-14 16:25 UTC (permalink / raw) To: Roland Dreier; +Cc: linux-rdma Pradeep Satyanarayana wrote: > Roland Dreier wrote: >> > I guess I came to a premature conclusion. One set of tests ran fine and I made that >> > conclusion. Another set of tests caused the following crash: >> >> I don't really know how to interpret this. Is this crash new, or is it >> the same crash you were hoping this patch fixed? > > This is a new crash. I see other manifestations resulting in different crashes : :mon> t [c00000074603ba20] d0000000193527ac .ipoib_neigh_flush+0x6c/0x350 [ib_ipoib] [c00000074603bb10] d000000019356dac .ipoib_mcast_free+0x74/0x2a0 [ib_ipoib] [c00000074603bbe0] d000000019358558 .ipoib_mcast_restart_task+0x3d0/0x560 [ib_ipoib] [c00000074603bd40] c0000000000c6fe4 .run_workqueue+0xf4/0x1e0 [c00000074603be00] c0000000000c7190 .worker_thread+0xc0/0x180 [c00000074603bed0] c0000000000ccf4c .kthread+0xb4/0xc0 [c00000074603bf90] c0000000000309fc .kernel_thread+0x54/0x70 9:mon> e cpu 0x9: Vector: 300 (Data Access) at [c00000074603b720] pc: c0000000005ac390: ._spin_lock+0x20/0xc8 lr: d0000000193527ac: .ipoib_neigh_flush+0x6c/0x350 [ib_ipoib] sp: c00000074603b9a0 msr: 8000000000009032 dar: 3a0 dsisr: 40000000 current = 0xc000000756ce8b00 paca = 0xc000000000f63800 pid = 18095, comm = ipoib 9:mon> Thanks Pradeep -- To unsubscribe from this list: send the line "unsubscribe linux-rdma" in the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org More majordomo info at http://vger.kernel.org/majordomo-info.html ^ permalink raw reply [flat|nested] 15+ messages in thread
[parent not found: <4C3DE512.3020903-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8@public.gmane.org>]
* Re: IB/ipoib: fix dangling pointer reference to ipoib_neigh and ipoib_path -when will it go upstream? [not found] ` <4C3DE512.3020903-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8@public.gmane.org> @ 2010-07-15 11:56 ` Pradeep Satyanarayana [not found] ` <4C3EF754.4060502-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8@public.gmane.org> 0 siblings, 1 reply; 15+ messages in thread From: Pradeep Satyanarayana @ 2010-07-15 11:56 UTC (permalink / raw) To: Ralph Campbell, Roland Dreier; +Cc: linux-rdma Pradeep Satyanarayana wrote: > Pradeep Satyanarayana wrote: >> Roland Dreier wrote: >>> > I guess I came to a premature conclusion. One set of tests ran fine and I made that >>> > conclusion. Another set of tests caused the following crash: >>> >>> I don't really know how to interpret this. Is this crash new, or is it >>> the same crash you were hoping this patch fixed? >> This is a new crash. > > I see other manifestations resulting in different crashes : > > :mon> t > [c00000074603ba20] d0000000193527ac .ipoib_neigh_flush+0x6c/0x350 [ib_ipoib] > [c00000074603bb10] d000000019356dac .ipoib_mcast_free+0x74/0x2a0 [ib_ipoib] > [c00000074603bbe0] d000000019358558 .ipoib_mcast_restart_task+0x3d0/0x560 [ib_ipoib] > [c00000074603bd40] c0000000000c6fe4 .run_workqueue+0xf4/0x1e0 > [c00000074603be00] c0000000000c7190 .worker_thread+0xc0/0x180 > [c00000074603bed0] c0000000000ccf4c .kthread+0xb4/0xc0 > [c00000074603bf90] c0000000000309fc .kernel_thread+0x54/0x70 > 9:mon> e > cpu 0x9: Vector: 300 (Data Access) at [c00000074603b720] > pc: c0000000005ac390: ._spin_lock+0x20/0xc8 > lr: d0000000193527ac: .ipoib_neigh_flush+0x6c/0x350 [ib_ipoib] > sp: c00000074603b9a0 > msr: 8000000000009032 > dar: 3a0 > dsisr: 40000000 > current = 0xc000000756ce8b00 > paca = 0xc000000000f63800 > pid = 18095, comm = ipoib > 9:mon> Recreating the crash has been tricky. I have tried several several hundred times today to unload and reload IPoIB while there is traffic and no crashes happened. I took a closer look at the IPoIB CM code and I see a few things that look suspicious. In the ipoib_cm_send() path no priv->lock is held, whereas the priv->lock is held before calling ipoib_cm_destroy_tx(). This is true with and without Ralph's patch (fix dangling pointer). Is this a potential race? In Roland's git tree I do see a test_and_clear_bit(IPOIB_FLAG_INITIALIZED, &tx->flags) in ipoib_cm_destroy_tx() which seems to be missing in Ralph's patch. In Ralph's patch) there is a clear_bit(IPOIB_FLAG_OPER_UP, &tx->flags) called before calling ipoib_cm_destroy_tx() only in select cases. Was that intended? Thanks Pradeep -- To unsubscribe from this list: send the line "unsubscribe linux-rdma" in the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org More majordomo info at http://vger.kernel.org/majordomo-info.html ^ permalink raw reply [flat|nested] 15+ messages in thread
[parent not found: <4C3EF754.4060502-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8@public.gmane.org>]
* Re: IB/ipoib: fix dangling pointer reference to ipoib_neigh and ipoib_path -when will it go upstream? [not found] ` <4C3EF754.4060502-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8@public.gmane.org> @ 2010-07-15 19:59 ` Ralph Campbell 2010-07-16 1:29 ` Ralph Campbell 1 sibling, 0 replies; 15+ messages in thread From: Ralph Campbell @ 2010-07-15 19:59 UTC (permalink / raw) To: Pradeep Satyanarayana; +Cc: Roland Dreier, linux-rdma I will write up a description of the locking as I understand it and the changes I made. Give me a day or two to write it up and check it. On Thu, 2010-07-15 at 04:56 -0700, Pradeep Satyanarayana wrote: > Pradeep Satyanarayana wrote: > > Pradeep Satyanarayana wrote: > >> Roland Dreier wrote: > >>> > I guess I came to a premature conclusion. One set of tests ran fine and I made that > >>> > conclusion. Another set of tests caused the following crash: > >>> > >>> I don't really know how to interpret this. Is this crash new, or is it > >>> the same crash you were hoping this patch fixed? > >> This is a new crash. > > > > I see other manifestations resulting in different crashes : > > > > :mon> t > > [c00000074603ba20] d0000000193527ac .ipoib_neigh_flush+0x6c/0x350 [ib_ipoib] > > [c00000074603bb10] d000000019356dac .ipoib_mcast_free+0x74/0x2a0 [ib_ipoib] > > [c00000074603bbe0] d000000019358558 .ipoib_mcast_restart_task+0x3d0/0x560 [ib_ipoib] > > [c00000074603bd40] c0000000000c6fe4 .run_workqueue+0xf4/0x1e0 > > [c00000074603be00] c0000000000c7190 .worker_thread+0xc0/0x180 > > [c00000074603bed0] c0000000000ccf4c .kthread+0xb4/0xc0 > > [c00000074603bf90] c0000000000309fc .kernel_thread+0x54/0x70 > > 9:mon> e > > cpu 0x9: Vector: 300 (Data Access) at [c00000074603b720] > > pc: c0000000005ac390: ._spin_lock+0x20/0xc8 > > lr: d0000000193527ac: .ipoib_neigh_flush+0x6c/0x350 [ib_ipoib] > > sp: c00000074603b9a0 > > msr: 8000000000009032 > > dar: 3a0 > > dsisr: 40000000 > > current = 0xc000000756ce8b00 > > paca = 0xc000000000f63800 > > pid = 18095, comm = ipoib > > 9:mon> > > Recreating the crash has been tricky. I have tried several several hundred times today > to unload and reload IPoIB while there is traffic and no crashes happened. I took > a closer look at the IPoIB CM code and I see a few things that look suspicious. > > In the ipoib_cm_send() path no priv->lock is held, whereas the priv->lock is held before > calling ipoib_cm_destroy_tx(). This is true with and without Ralph's patch (fix dangling pointer). > Is this a potential race? > > In Roland's git tree I do see a test_and_clear_bit(IPOIB_FLAG_INITIALIZED, &tx->flags) in > ipoib_cm_destroy_tx() which seems to be missing in Ralph's patch. In Ralph's patch) there is a > clear_bit(IPOIB_FLAG_OPER_UP, &tx->flags) called before calling ipoib_cm_destroy_tx() only in > select cases. Was that intended? > > Thanks > Pradeep > -- To unsubscribe from this list: send the line "unsubscribe linux-rdma" in the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org More majordomo info at http://vger.kernel.org/majordomo-info.html ^ permalink raw reply [flat|nested] 15+ messages in thread
* Re: IB/ipoib: fix dangling pointer reference to ipoib_neigh and ipoib_path -when will it go upstream? [not found] ` <4C3EF754.4060502-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8@public.gmane.org> 2010-07-15 19:59 ` Ralph Campbell @ 2010-07-16 1:29 ` Ralph Campbell [not found] ` <1279243768.31421.48.camel-/vjeY7uYZjrPXfVEPVhPGq6RkeBMCJyt@public.gmane.org> 1 sibling, 1 reply; 15+ messages in thread From: Ralph Campbell @ 2010-07-16 1:29 UTC (permalink / raw) To: Pradeep Satyanarayana; +Cc: Roland Dreier, linux-rdma On Thu, 2010-07-15 at 04:56 -0700, Pradeep Satyanarayana wrote: > Pradeep Satyanarayana wrote: > > Pradeep Satyanarayana wrote: > >> Roland Dreier wrote: > >>> > I guess I came to a premature conclusion. One set of tests ran fine and I made that > >>> > conclusion. Another set of tests caused the following crash: > >>> > >>> I don't really know how to interpret this. Is this crash new, or is it > >>> the same crash you were hoping this patch fixed? > >> This is a new crash. > > > > I see other manifestations resulting in different crashes : > > > > :mon> t > > [c00000074603ba20] d0000000193527ac .ipoib_neigh_flush+0x6c/0x350 [ib_ipoib] > > [c00000074603bb10] d000000019356dac .ipoib_mcast_free+0x74/0x2a0 [ib_ipoib] > > [c00000074603bbe0] d000000019358558 .ipoib_mcast_restart_task+0x3d0/0x560 [ib_ipoib] > > [c00000074603bd40] c0000000000c6fe4 .run_workqueue+0xf4/0x1e0 > > [c00000074603be00] c0000000000c7190 .worker_thread+0xc0/0x180 > > [c00000074603bed0] c0000000000ccf4c .kthread+0xb4/0xc0 > > [c00000074603bf90] c0000000000309fc .kernel_thread+0x54/0x70 > > 9:mon> e > > cpu 0x9: Vector: 300 (Data Access) at [c00000074603b720] > > pc: c0000000005ac390: ._spin_lock+0x20/0xc8 > > lr: d0000000193527ac: .ipoib_neigh_flush+0x6c/0x350 [ib_ipoib] > > sp: c00000074603b9a0 > > msr: 8000000000009032 > > dar: 3a0 > > dsisr: 40000000 > > current = 0xc000000756ce8b00 > > paca = 0xc000000000f63800 > > pid = 18095, comm = ipoib > > 9:mon> > > Recreating the crash has been tricky. I have tried several several hundred times today > to unload and reload IPoIB while there is traffic and no crashes happened. I took > a closer look at the IPoIB CM code and I see a few things that look suspicious. > > In the ipoib_cm_send() path no priv->lock is held, whereas the priv->lock is held before > calling ipoib_cm_destroy_tx(). This is true with and without Ralph's patch (fix dangling pointer). > Is this a potential race? ipoib_cm_send() is only called by ipoib_start_xmit() so it is protected by netif_tx_lock(dev) or stopping the ipoib network device. It all depends on what pointer or data structure you think is being accessed while free or being modified without the proper protection. > In Roland's git tree I do see a test_and_clear_bit(IPOIB_FLAG_INITIALIZED, &tx->flags) in > ipoib_cm_destroy_tx() which seems to be missing in Ralph's patch. In Ralph's patch) there is a > clear_bit(IPOIB_FLAG_OPER_UP, &tx->flags) called before calling ipoib_cm_destroy_tx() only in > select cases. Was that intended? The v4 patch comments explain the changes: http://www.spinics.net/lists/linux-rdma/msg03733.html Basically, IPOIB_FLAG_INITIALIZED now means that the struct ipoib_cm_tx has completed the RC QP creation process via the CM instead of simply when ipoib_cm_create_tx() allocates the structure. The test and clear was used to indicate the struct ipoib_cm_tx had been put on the destroy list and the reaper thread woken up. Now ipoib_cm_destroy_tx() uses the tx->neigh pointer != NULL to indicate that ipoib_cm_destroy_tx() has started the destroy process. ipoib_cm_destroy_tx() is only called when netif_tx_lock() and priv->lock are held to protect tx->neigh. > Thanks > Pradeep The longer write up on locking is turning out to be very complex. I will keep working on it but I think it will be just as hard to understand as slogging through the code. -- To unsubscribe from this list: send the line "unsubscribe linux-rdma" in the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org More majordomo info at http://vger.kernel.org/majordomo-info.html ^ permalink raw reply [flat|nested] 15+ messages in thread
[parent not found: <1279243768.31421.48.camel-/vjeY7uYZjrPXfVEPVhPGq6RkeBMCJyt@public.gmane.org>]
* Re: IB/ipoib: fix dangling pointer reference to ipoib_neigh and ipoib_path -when will it go upstream? [not found] ` <1279243768.31421.48.camel-/vjeY7uYZjrPXfVEPVhPGq6RkeBMCJyt@public.gmane.org> @ 2010-07-16 9:13 ` Pradeep Satyanarayana [not found] ` <4C4022BC.3030401-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8@public.gmane.org> 0 siblings, 1 reply; 15+ messages in thread From: Pradeep Satyanarayana @ 2010-07-16 9:13 UTC (permalink / raw) To: Ralph Campbell; +Cc: Roland Dreier, linux-rdma Ralph Campbell wrote: > On Thu, 2010-07-15 at 04:56 -0700, Pradeep Satyanarayana wrote: >> Pradeep Satyanarayana wrote: >>> Pradeep Satyanarayana wrote: >>>> Roland Dreier wrote: >>>>> > I guess I came to a premature conclusion. One set of tests ran fine and I made that >>>>> > conclusion. Another set of tests caused the following crash: >>>>> >>>>> I don't really know how to interpret this. Is this crash new, or is it >>>>> the same crash you were hoping this patch fixed? >>>> This is a new crash. >>> I see other manifestations resulting in different crashes : >>> >>> :mon> t >>> [c00000074603ba20] d0000000193527ac .ipoib_neigh_flush+0x6c/0x350 [ib_ipoib] >>> [c00000074603bb10] d000000019356dac .ipoib_mcast_free+0x74/0x2a0 [ib_ipoib] >>> [c00000074603bbe0] d000000019358558 .ipoib_mcast_restart_task+0x3d0/0x560 [ib_ipoib] >>> [c00000074603bd40] c0000000000c6fe4 .run_workqueue+0xf4/0x1e0 >>> [c00000074603be00] c0000000000c7190 .worker_thread+0xc0/0x180 >>> [c00000074603bed0] c0000000000ccf4c .kthread+0xb4/0xc0 >>> [c00000074603bf90] c0000000000309fc .kernel_thread+0x54/0x70 >>> 9:mon> e >>> cpu 0x9: Vector: 300 (Data Access) at [c00000074603b720] >>> pc: c0000000005ac390: ._spin_lock+0x20/0xc8 >>> lr: d0000000193527ac: .ipoib_neigh_flush+0x6c/0x350 [ib_ipoib] >>> sp: c00000074603b9a0 >>> msr: 8000000000009032 >>> dar: 3a0 >>> dsisr: 40000000 >>> current = 0xc000000756ce8b00 >>> paca = 0xc000000000f63800 >>> pid = 18095, comm = ipoib >>> 9:mon> >> Recreating the crash has been tricky. I have tried several several hundred times today >> to unload and reload IPoIB while there is traffic and no crashes happened. I took >> a closer look at the IPoIB CM code and I see a few things that look suspicious. >> >> In the ipoib_cm_send() path no priv->lock is held, whereas the priv->lock is held before >> calling ipoib_cm_destroy_tx(). This is true with and without Ralph's patch (fix dangling pointer). >> Is this a potential race? > > ipoib_cm_send() is only called by ipoib_start_xmit() so it is protected > by netif_tx_lock(dev) or stopping the ipoib network device. I still see one case in ipoib_neigh_cleanup() wherein ipoib_cm_destroy_tx() appears to be called without netif_tx_lock(dev) held. Is that correct? Thanks Pradeep -- To unsubscribe from this list: send the line "unsubscribe linux-rdma" in the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org More majordomo info at http://vger.kernel.org/majordomo-info.html ^ permalink raw reply [flat|nested] 15+ messages in thread
[parent not found: <4C4022BC.3030401-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8@public.gmane.org>]
* Re: IB/ipoib: fix dangling pointer reference to ipoib_neigh and ipoib_path -when will it go upstream? [not found] ` <4C4022BC.3030401-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8@public.gmane.org> @ 2010-07-16 17:43 ` Ralph Campbell 0 siblings, 0 replies; 15+ messages in thread From: Ralph Campbell @ 2010-07-16 17:43 UTC (permalink / raw) To: Pradeep Satyanarayana; +Cc: Roland Dreier, linux-rdma On Fri, 2010-07-16 at 02:13 -0700, Pradeep Satyanarayana wrote: > Ralph Campbell wrote: > > On Thu, 2010-07-15 at 04:56 -0700, Pradeep Satyanarayana wrote: > >> Pradeep Satyanarayana wrote: > >>> Pradeep Satyanarayana wrote: > >>>> Roland Dreier wrote: > >>>>> > I guess I came to a premature conclusion. One set of tests ran fine and I made that > >>>>> > conclusion. Another set of tests caused the following crash: > >>>>> > >>>>> I don't really know how to interpret this. Is this crash new, or is it > >>>>> the same crash you were hoping this patch fixed? > >>>> This is a new crash. > >>> I see other manifestations resulting in different crashes : > >>> > >>> :mon> t > >>> [c00000074603ba20] d0000000193527ac .ipoib_neigh_flush+0x6c/0x350 [ib_ipoib] > >>> [c00000074603bb10] d000000019356dac .ipoib_mcast_free+0x74/0x2a0 [ib_ipoib] > >>> [c00000074603bbe0] d000000019358558 .ipoib_mcast_restart_task+0x3d0/0x560 [ib_ipoib] > >>> [c00000074603bd40] c0000000000c6fe4 .run_workqueue+0xf4/0x1e0 > >>> [c00000074603be00] c0000000000c7190 .worker_thread+0xc0/0x180 > >>> [c00000074603bed0] c0000000000ccf4c .kthread+0xb4/0xc0 > >>> [c00000074603bf90] c0000000000309fc .kernel_thread+0x54/0x70 > >>> 9:mon> e > >>> cpu 0x9: Vector: 300 (Data Access) at [c00000074603b720] > >>> pc: c0000000005ac390: ._spin_lock+0x20/0xc8 > >>> lr: d0000000193527ac: .ipoib_neigh_flush+0x6c/0x350 [ib_ipoib] > >>> sp: c00000074603b9a0 > >>> msr: 8000000000009032 > >>> dar: 3a0 > >>> dsisr: 40000000 > >>> current = 0xc000000756ce8b00 > >>> paca = 0xc000000000f63800 > >>> pid = 18095, comm = ipoib > >>> 9:mon> > >> Recreating the crash has been tricky. I have tried several several hundred times today > >> to unload and reload IPoIB while there is traffic and no crashes happened. I took > >> a closer look at the IPoIB CM code and I see a few things that look suspicious. > >> > >> In the ipoib_cm_send() path no priv->lock is held, whereas the priv->lock is held before > >> calling ipoib_cm_destroy_tx(). This is true with and without Ralph's patch (fix dangling pointer). > >> Is this a potential race? > > > > ipoib_cm_send() is only called by ipoib_start_xmit() so it is protected > > by netif_tx_lock(dev) or stopping the ipoib network device. > > I still see one case in ipoib_neigh_cleanup() wherein ipoib_cm_destroy_tx() appears to be called > without netif_tx_lock(dev) held. Is that correct? > > Thanks > Pradeep ipoib_neigh_cleanup() is called by neigh_cleanup_and_release() when freeing a struct neighbour. I assume the Linux network stack is not going to call into the IPoIB driver to send sk_buffs in that case but I could be wrong. If it can, then you are correct that the netif_tx_lock(dev) should be acquired. -- To unsubscribe from this list: send the line "unsubscribe linux-rdma" in the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org More majordomo info at http://vger.kernel.org/majordomo-info.html ^ permalink raw reply [flat|nested] 15+ messages in thread
end of thread, other threads:[~2010-07-16 17:43 UTC | newest]
Thread overview: 15+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2010-07-12 4:57 IB/ipoib: fix dangling pointer reference to ipoib_neigh and ipoib_path -when will it go upstream? Pradeep Satyanarayana
[not found] ` <4C3AA0A2.3090406-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8@public.gmane.org>
2010-07-12 10:07 ` Bart Van Assche
[not found] ` <AANLkTilkXY0vDA4dtq9cMIO2shX-kw0ZlwadVjm4QMui-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
2010-07-12 10:20 ` Pradeep Satyanarayana
[not found] ` <4C3AEC65.1090304-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8@public.gmane.org>
2010-07-12 10:35 ` Bart Van Assche
[not found] ` <AANLkTil7Vbpg4n_TsR71F1mK5Sq6kzI0blgVjJ7_lebB-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
2010-07-13 19:23 ` Ralph Campbell
2010-07-12 21:21 ` Roland Dreier
[not found] ` <adapqysxup6.fsf-BjVyx320WGW9gfZ95n9DRSW4+XlvGpQz@public.gmane.org>
2010-07-13 6:07 ` Pradeep Satyanarayana
[not found] ` <4C3C02B2.9040408-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8@public.gmane.org>
2010-07-13 14:45 ` Roland Dreier
[not found] ` <adaeif7xwwz.fsf-BjVyx320WGW9gfZ95n9DRSW4+XlvGpQz@public.gmane.org>
2010-07-14 2:54 ` Pradeep Satyanarayana
[not found] ` <4C3D26D0.3090508-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8@public.gmane.org>
2010-07-14 16:25 ` Pradeep Satyanarayana
[not found] ` <4C3DE512.3020903-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8@public.gmane.org>
2010-07-15 11:56 ` Pradeep Satyanarayana
[not found] ` <4C3EF754.4060502-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8@public.gmane.org>
2010-07-15 19:59 ` Ralph Campbell
2010-07-16 1:29 ` Ralph Campbell
[not found] ` <1279243768.31421.48.camel-/vjeY7uYZjrPXfVEPVhPGq6RkeBMCJyt@public.gmane.org>
2010-07-16 9:13 ` Pradeep Satyanarayana
[not found] ` <4C4022BC.3030401-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8@public.gmane.org>
2010-07-16 17:43 ` Ralph Campbell
This is a public inbox, see mirroring instructions for how to clone and mirror all data and code used for this inbox