Mlx4: BUG: unable to handle kernel at ffffffffa02be210

public inbox for linux-rdma@vger.kernel.org

* Mlx4: BUG: unable to handle kernel at ffffffffa02be210
@ 2015-07-08  9:42 Jack Wang
       [not found] ` <CAD+HZHXi2bB59eWLYaGiXj5-b5w3V1NhwUJbSjx5NfdmhEaRhA-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
  0 siblings, 1 reply; 10+ messages in thread
From: Jack Wang @ 2015-07-08  9:42 UTC (permalink / raw)
  To: linux-rdma-u79uwXL29TY76Z2rM5mHXA@public.gmane.org, Or Gerlitz,
	Jack Morgenstein, Moni Shoua

Hello Or, Jack and Moni,

We hit the bug below intermittently; our test triggers it in roughly 1 of 5 runs.
We're using MLNX OFED 2.4-1.0.4 on top of kernel 3.18.14.

HCA 'mlx4_0'
CA type: MT26428
Number of ports: 2
Firmware version: 2.9.1000
Hardware version: b0

Could you offer some insight? Could this be an old bug that is already fixed?
If so, could you point me to the fix so I can port it to our kernel? Thanks.

[  657.723842] BUG: unable to handle kernel  at ffffffffa02be210
[  657.724245] IP: [<ffffffffa02be210>] 0xffffffffa02be210
[  657.724539] PGD 1c15067
[  657.725162] Oops: 0010 [#1]
[  657.725657] Modules linked in: ib_ipoib ib_uverbs ib_umad mlx4_ib rdma_cm iw_cm ib_cm ib_sa ib_mad ib_core ib_addr ipv6 null_blk loop amd64_edac_mod k10temp fam15h_power edac_core button microcode hid_generic usbhid hid igb hwmon i2c_algo_bit i2c_core dca ahci ptp libahci ohci_pci pps_core mlx4_core ohci_hcd libata [last unloaded: ibtrs_server]
[  657.731897] CPU: 0 PID: 337 Comm: kworker/u128:1 Tainted: G           O   3.18.14-1-ibnbd-debug #1
[  657.732049] Hardware name: Supermicro BHQGE/BHQGE, BIOS 3.00       10/24/2012
[  657.732199] Workqueue: ib_mad1 ib_mad_complete_send_wr [ib_mad]
[  657.732464] task: ffff880415bea1f0 ti: ffff880415420000 task.ti: ffff880415420000
[  657.732610] RIP: 0010:[<ffffffffa02be210>]  [<ffffffffa02be210>] 0xffffffffa02be210
[  657.732959] RSP: 0018:ffff880417c03d00  EFLAGS: 00010006
[  657.733193] RAX: ffff8803bc5fc4d8 RBX: ffff8803bc5fc4d8 RCX: 0000000000000000
[  657.733416] RDX: ffff880415bea9e0 RSI: ffff8803d8dcd388 RDI: ffff8803bc5fc4a8
[  657.736094] RBP: ffff880417c03d08 R08: 0000000000000000 R09: ffff880415bea9b8
[  657.736317] R10: 0000000000000000 R11: 0000000000000000 R12: ffff8800d3b00000
[  657.736543] R13: 00000000000000c5 R14: 0000000000000000 R15: 0000000000000020
[  657.736800] FS:  00007f2f05b5f700(0000) GS:ffff880417c00000(0000) knlGS:0000000000000000
[  657.737109] CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
[  657.737330] CR2: ffffffffa02be210 CR3: 000000180d76f000 CR4: 00000000000407f0
[  657.737555] Stack:
[  657.737758]  ffffffffa01d84b7 ffff880417c03d48 ffffffffa004a486 ffffffffa004a3f5
[  657.738546]  ffffffff81c194e0 0000000000000000 00000000c5000000 ffff8804136001c0
[  657.739360]  ffff8800d3b00000 ffff880417c03e18 ffffffffa004c0ea 0000000000000002
[  657.740149] Call Trace:
[  657.740385]  <IRQ>
[  657.740514]  [<ffffffffa01d84b7>] ? mlx4_ib_destroy_ah+0x37/0x360 [mlx4_ib]
[  657.741093]  [<ffffffffa004a486>] mlx4_cq_completion+0x96/0xe0 [mlx4_core]
[  657.741330]  [<ffffffffa004a3f5>] ? mlx4_cq_completion+0x5/0xe0 [mlx4_core]
[  657.741594]  [<ffffffffa004c0ea>] mlx4_test_interrupts+0x84a/0x1100 [mlx4_core]
[  657.741908]  [<ffffffff8109f37a>] ? __lock_acquire.isra.28+0x3aa/0xcb0
[  657.742142]  [<ffffffffa004c904>] mlx4_test_interrupts+0x1064/0x1100 [mlx4_core]
[  657.742457]  [<ffffffff810aa678>] handle_irq_event_percpu+0x78/0x2b0
[  657.742685]  [<ffffffff810aa8f8>] handle_irq_event+0x48/0x70
[  657.742934]  [<ffffffff810adf58>] handle_edge_irq+0xc8/0x160
[  657.743160]  [<ffffffff8100515e>] handle_irq+0x14e/0x200
[  657.743384]  [<ffffffff815fea3e>] do_IRQ+0x5e/0x110
[  657.743603]  [<ffffffff815fcf6a>] common_interrupt+0x6a/0x6a
[  657.743826]  <EOI>
[  657.743957]  [<ffffffff81197295>] ? __slab_alloc+0x615/0x710
[  657.744513]  [<ffffffffa01d80de>] ? mlx4_ib_create_ah+0x2e/0x2a0 [mlx4_ib]
[  657.744738]  [<ffffffffa0195603>] ? ib_create_send_mad+0xf3/0x330 [ib_mad]
[  657.744968]  [<ffffffff81198f12>] __kmalloc+0x162/0x2e0
[  657.745191]  [<ffffffffa0195603>] ? ib_create_send_mad+0xf3/0x330 [ib_mad]
[  657.745420]  [<ffffffffa01d8100>] ? mlx4_ib_create_ah+0x50/0x2a0 [mlx4_ib]
[  657.745650]  [<ffffffffa0195603>] ib_create_send_mad+0xf3/0x330 [ib_mad]
[  657.745875]  [<ffffffffa019985b>] agent_send_response+0xbb/0x270 [ib_mad]
[  657.746103]  [<ffffffffa0198bf4>] ? ib_mad_complete_send_wr+0x844/0xfa0 [ib_mad]
[  657.746413]  [<ffffffffa0198f96>] ib_mad_complete_send_wr+0xbe6/0xfa0 [ib_mad]
[  657.746729]  [<ffffffff8109f37a>] ? __lock_acquire.isra.28+0x3aa/0xcb0
[  657.746959]  [<ffffffff8106c82d>] process_one_work+0x33d/0x6d0
[  657.747181]  [<ffffffff8106c7a4>] ? process_one_work+0x2b4/0x6d0
[  657.747434]  [<ffffffff8106d015>] worker_thread+0x55/0x6d0
[  657.751224]  [<ffffffff8106cfc0>] ? rescuer_thread+0x3c0/0x3c0
[  657.751482]  [<ffffffff81073e84>] kthread+0xe4/0x100
[  657.751705]  [<ffffffff810792b4>] ? finish_task_switch+0x84/0x140
[  657.751935]  [<ffffffff81073da0>] ? kthread_create_on_node+0x280/0x280
[  657.752165]  [<ffffffff815fc3c8>] ret_from_fork+0x58/0x90
[  657.752391]  [<ffffffff81073da0>] ? kthread_create_on_node+0x280/0x280
[  657.752640] Code:  Bad RIP value.
[  657.753095] RIP  [<ffffffffa02be210>] 0xffffffffa02be210
[  657.753434]  RSP <ffff880417c03d00>
[  657.753645] CR2: ffffffffa02be210
[  657.753878] ---[ end trace 9c9225f5e490f806 ]---
[  657.765754] Kernel panic - not syncing: Fatal exception in interrupt
[  657.766089] Kernel Offset: 0x0 from 0xffffffff81000000 (relocation range: 0xffffffff80000000-0xffffffff9fffffff)
[  657.778084] ---[ end Kernel panic - not syncing: Fatal exception in interrupt
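
The "Oops: 0010" value in the trace above is the x86 page-fault error code, printed in hex. As a rough reading aid, the sketch below decodes its architectural bits and checks whether the faulting address (CR2) equals the instruction pointer (RIP). The helper names pf_decode and pf_is_jump_to_unmapped are made up for this illustration and are not kernel APIs. For this trace the result is an instruction fetch from a not-present page in kernel mode with RIP == CR2, which would be consistent with a call through a pointer into memory that is no longer mapped; the trace alone does not say which pointer.

```c
#include <stdbool.h>
#include <stdint.h>

/* x86 page-fault error code bits, as defined by the architecture. */
#define PF_PROT   (1u << 0) /* 0: page not present, 1: protection violation */
#define PF_WRITE  (1u << 1) /* 0: read access, 1: write access */
#define PF_USER   (1u << 2) /* 0: kernel mode, 1: user mode */
#define PF_RSVD   (1u << 3) /* reserved bit set in a page-table entry */
#define PF_INSTR  (1u << 4) /* fault happened on an instruction fetch */

/* Hypothetical helper type for this illustration only. */
struct pf_decoded {
	bool prot;
	bool write;
	bool user;
	bool rsvd;
	bool instr;
};

/* Split the error code into its individual flags. */
struct pf_decoded pf_decode(uint32_t code)
{
	struct pf_decoded d = {
		.prot  = (code & PF_PROT)  != 0,
		.write = (code & PF_WRITE) != 0,
		.user  = (code & PF_USER)  != 0,
		.rsvd  = (code & PF_RSVD)  != 0,
		.instr = (code & PF_INSTR) != 0,
	};
	return d;
}

/* True when the CPU faulted while fetching code at an unmapped address. */
bool pf_is_jump_to_unmapped(uint32_t code, uint64_t rip, uint64_t cr2)
{
	return (code & PF_INSTR) && !(code & PF_PROT) && rip == cr2;
}
```

With the values from the trace, pf_decode(0x0010) sets only the instruction-fetch flag, and pf_is_jump_to_unmapped(0x0010, 0xffffffffa02be210, 0xffffffffa02be210) is true.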

Best regards,
Jack Wang
--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


* Re: Mlx4: BUG: unable to handle kernel at ffffffffa02be210
       [not found] ` <CAD+HZHXi2bB59eWLYaGiXj5-b5w3V1NhwUJbSjx5NfdmhEaRhA-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
@ 2015-07-08 12:19   ` Or Gerlitz
       [not found]     ` <559D1562.2070309-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>
  0 siblings, 1 reply; 10+ messages in thread
From: Or Gerlitz @ 2015-07-08 12:19 UTC (permalink / raw)
  To: Jack Wang
  Cc: linux-rdma-u79uwXL29TY76Z2rM5mHXA@public.gmane.org,
	Jack Morgenstein, Moni Shoua

On 7/8/2015 12:42 PM, Jack Wang wrote:

> We're using MLNX OFED 2.4-1.0.4 on top of kernel 3.18.14.

This list is for upstream issues. Still, let's see.


> We hit the bug below intermittently; our test triggers it in roughly 1 of 5 runs.

And what is your test, if I may ask?


> [...]
> [  657.740514]  [<ffffffffa01d84b7>] ? mlx4_ib_destroy_ah+0x37/0x360 [mlx4_ib]
> [  657.741093]  [<ffffffffa004a486>] mlx4_cq_completion+0x96/0xe0 [mlx4_core]
> [  657.741330]  [<ffffffffa004a3f5>] ? mlx4_cq_completion+0x5/0xe0 [mlx4_core]
> [  657.741594]  [<ffffffffa004c0ea>] mlx4_test_interrupts+0x84a/0x1100
> [mlx4_core]

mlx4_test_interrupts is called from the mlx4_en ethtool selftest handler. So are
you running the selftest while something else (what?) is done in parallel?





* Re: Mlx4: BUG: unable to handle kernel at ffffffffa02be210
       [not found]     ` <559D1562.2070309-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>
@ 2015-07-08 12:47       ` Jack Wang
       [not found]         ` <CAD+HZHVCMa97zEQ1SB=JXCKHOGgSO93BPyLha2PDrOOTsHTUCw-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
  0 siblings, 1 reply; 10+ messages in thread
From: Jack Wang @ 2015-07-08 12:47 UTC (permalink / raw)
  To: Or Gerlitz
  Cc: linux-rdma-u79uwXL29TY76Z2rM5mHXA@public.gmane.org,
	Jack Morgenstein, Moni Shoua

Hi Or,

We're testing our RDMA kernel module. The test loads the module, creates an
RDMA connection, runs some traffic, and unloads the module.
mlx4_en is not involved; in fact we disable mlx4_en in the kernel build
because we don't need it.

I did some debugging with gdb:

(gdb) list *mlx4_test_interrupts+0x84a
0xb0ea is in mlx4_eq_int (drivers/net/ethernet/mellanox/mlx4/eq.c:517).
512 in drivers/net/ethernet/mellanox/mlx4/eq.c
 513                 switch (eqe->type) {
 514                 case MLX4_EVENT_TYPE_COMP:
 515                         cqn = be32_to_cpu(eqe->event.comp.cqn) & 0xffffff;
 516                         mlx4_cq_completion(dev, cqn);
 517                         break;

(gdb) list *mlx4_cq_completion+0x96
0x9486 is in mlx4_cq_completion (drivers/net/ethernet/mellanox/mlx4/cq.c:117).

(gdb) list *mlx4_ib_destroy_ah+0x37
0x4e7 is in mlx4_ib_cq_comp (drivers/infiniband/hw/mlx4/cq.c:50).

static void mlx4_ib_cq_comp(struct mlx4_cq *cq)
47 {
48 struct ib_cq *ibcq = &to_mibcq(cq)->ibcq;
49 ibcq->comp_handler(ibcq, ibcq->cq_context);
50 }
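
The gdb addresses above can be cross-checked against the oops by plain subtraction. The sketch below assumes that the address gdb prints (for example 0xb0ea) is an offset from the start of the module text as loaded, so that the runtime address minus the gdb address gives a load base. The function names load_base and symbol_start are hypothetical and exist only for this illustration.

```c
#include <stdint.h>

/*
 * Base at which the module text appears to be loaded:
 * runtime address from the trace minus the address gdb reports.
 */
uint64_t load_base(uint64_t runtime_addr, uint64_t gdb_addr)
{
	return runtime_addr - gdb_addr;
}

/*
 * Runtime start of a symbol, given a "symbol+0xoff" trace entry:
 * the entry's address minus the offset inside the symbol.
 */
uint64_t symbol_start(uint64_t runtime_addr, uint64_t offset_in_symbol)
{
	return runtime_addr - offset_in_symbol;
}
```

For the two mlx4_core frames, load_base(0xffffffffa004c0ea, 0xb0ea) and load_base(0xffffffffa004a486, 0x9486) give the same value, and the two mlx4_cq_completion entries (+0x96 and +0x5) resolve to the same symbol start. That agreement suggests, but does not prove, that gdb was pointed at the same mlx4_core build that produced the trace.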

Looks like cq use-after-free? I have no idea where.

Regards
Jack


* Re: Mlx4: BUG: unable to handle kernel at ffffffffa02be210
       [not found]         ` <CAD+HZHVCMa97zEQ1SB=JXCKHOGgSO93BPyLha2PDrOOTsHTUCw-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
@ 2015-07-08 13:49           ` Or Gerlitz
       [not found]             ` <559D2A80.4040909-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>
  0 siblings, 1 reply; 10+ messages in thread
From: Or Gerlitz @ 2015-07-08 13:49 UTC (permalink / raw)
  To: Jack Wang
  Cc: linux-rdma-u79uwXL29TY76Z2rM5mHXA@public.gmane.org,
	Jack Morgenstein, Moni Shoua

On 7/8/2015 3:47 PM, Jack Wang wrote:
> static void mlx4_ib_cq_comp(struct mlx4_cq *cq)
> 47 {
> 48 struct ib_cq *ibcq = &to_mibcq(cq)->ibcq;
> 49 ibcq->comp_handler(ibcq, ibcq->cq_context);
> 50 }
>
> Looks like cq use-after-free? I have no idea where.

Check whether the code base you're using (why not the stock 3.18.14 driver,
BTW?) has all the synchronize_irq() calls we have in the latest upstream driver:

drivers/net/ethernet/mellanox/mlx4/cq.c:371: synchronize_irq(priv->eq_table.eq[MLX4_CQ_TO_EQ_VECTOR(cq->vector)].irq);
drivers/net/ethernet/mellanox/mlx4/cq.c:374: synchronize_irq(priv->eq_table.eq[MLX4_EQ_ASYNC].irq);
drivers/net/ethernet/mellanox/mlx4/eq.c:1088: synchronize_irq(eq->irq);


* Re: Mlx4: BUG: unable to handle kernel at ffffffffa02be210
       [not found]             ` <559D2A80.4040909-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>
@ 2015-07-08 14:07               ` Jack Wang
       [not found]                 ` <CAD+HZHWBn-KZCsskSGPKLtntj-LjDRodda9jngr+qcKSxLhkGQ-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
  0 siblings, 1 reply; 10+ messages in thread
From: Jack Wang @ 2015-07-08 14:07 UTC (permalink / raw)
  To: Or Gerlitz
  Cc: linux-rdma-u79uwXL29TY76Z2rM5mHXA@public.gmane.org,
	Jack Morgenstein, Moni Shoua

Thanks for your time.

It looks like the last one is missing in the OFED 2.4 driver. I just checked
the mainline history:

commit bf1bac5b7882daa41249f85fbc97828f0597de5c
Author: Eli Cohen <eli-LDSdmyG8hGV8YrgS2mwiifqBs+8SCbDb@public.gmane.org>
Date:   Thu Oct 23 15:57:27 2014 +0300

    net/mlx4_core: Call synchronize_irq() before freeing EQ buffer

    After moving the EQ ownership to software effectively destroying it, call
    synchronize_irq() to ensure that any handler routines running on other CPU
    cores finish execution. Only then free the EQ buffer.
    The same thing is done when we destroy a CQ which is one of the sources
    generating interrupts. In the case of CQ we want to avoid completion handlers
    on a CQ that was destroyed. In the case we do the same to avoid receiving
    asynchronous events after the EQ has been destroyed and its buffers freed.

    Signed-off-by: Eli Cohen <eli-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>
    Signed-off-by: David S. Miller <davem-fT/PcQaiUtIeIZ0/mPfg9Q@public.gmane.org>

This fix looks like it fits the bug we're hitting. Yes, we plan to update to
OFED 3.0 soon, and the fix is included there.
I will report back if the bug is still there.

Thanks again.
Jack


* Re: Mlx4: BUG: unable to handle kernel at ffffffffa02be210
       [not found]                 ` <CAD+HZHWBn-KZCsskSGPKLtntj-LjDRodda9jngr+qcKSxLhkGQ-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
@ 2015-07-08 18:22                   ` Or Gerlitz
  2015-07-09 11:14                   ` Jack Wang
  1 sibling, 0 replies; 10+ messages in thread
From: Or Gerlitz @ 2015-07-08 18:22 UTC (permalink / raw)
  To: Jack Wang
  Cc: Or Gerlitz, linux-rdma-u79uwXL29TY76Z2rM5mHXA@public.gmane.org,
	Jack Morgenstein, Moni Shoua

On Wed, Jul 8, 2015 at 5:07 PM, Jack Wang <xjtuwjp-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org> wrote:
> It looks like the last one is missing in the OFED 2.4 driver. I just checked
> the mainline history:
>
> commit bf1bac5b7882daa41249f85fbc97828f0597de5c
> Author: Eli Cohen <eli-LDSdmyG8hGV8YrgS2mwiifqBs+8SCbDb@public.gmane.org>
> Date:   Thu Oct 23 15:57:27 2014 +0300
>
>   net/mlx4_core: Call synchronize_irq() before freeing EQ buffer

[...]

> This fix looks like it fits the bug we're hitting. Yes, we plan to update to
> OFED 3.0 soon, and the fix is included there.
> I will report back if the bug is still there.

Again... could you comment on why you aren't using the stock 3.18.14 mlx4
driver? What feature is missing there for your needs?

Or.

* Re: Mlx4: BUG: unable to handle kernel at ffffffffa02be210
       [not found]                 ` <CAD+HZHWBn-KZCsskSGPKLtntj-LjDRodda9jngr+qcKSxLhkGQ-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
  2015-07-08 18:22                   ` Or Gerlitz
@ 2015-07-09 11:14                   ` Jack Wang
       [not found]                     ` <CAD+HZHXcrejwu=dAhmL7vZ=tkAPswm2LiCgwK42kEe5XDvBvhQ-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
  1 sibling, 1 reply; 10+ messages in thread
From: Jack Wang @ 2015-07-09 11:14 UTC (permalink / raw)
  To: Or Gerlitz
  Cc: linux-rdma-u79uwXL29TY76Z2rM5mHXA@public.gmane.org,
	Jack Morgenstein, Moni Shoua

Hi Or,

I updated the kernel to OFED 3.0 to verify the fix, but I can still
reproduce the bug. Maybe some synchronize_irq() calls are still
missing?

Thanks
Jack


* Re: Mlx4: BUG: unable to handle kernel at ffffffffa02be210
       [not found]                     ` <CAD+HZHXcrejwu=dAhmL7vZ=tkAPswm2LiCgwK42kEe5XDvBvhQ-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
@ 2015-07-09 11:21                       ` Or Gerlitz
       [not found]                         ` <559E592D.5000201-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>
  0 siblings, 1 reply; 10+ messages in thread
From: Or Gerlitz @ 2015-07-09 11:21 UTC (permalink / raw)
  To: Jack Wang
  Cc: linux-rdma-u79uwXL29TY76Z2rM5mHXA@public.gmane.org,
	Jack Morgenstein, Moni Shoua

On 7/9/2015 2:14 PM, Jack Wang wrote:
> I updated the kernel to OFED 3.0 to verify the fix, but I can still
> reproduce the bug. Maybe some synchronize_irq() calls are still
> missing?

Again, even if you don't use the upstream kernel in production, I suggest you
try to reproduce the bug there. If it exists, we'll try to solve it upstream
and later port the fix to MLNX OFED. Makes sense? You can start with just the
installed 3.18.14.

Or.

* Re: Mlx4: BUG: unable to handle kernel at ffffffffa02be210
       [not found]                         ` <559E592D.5000201-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>
@ 2015-07-09 13:35                           ` Jack Wang
  2015-07-09 13:57                             ` Or Gerlitz
  0 siblings, 1 reply; 10+ messages in thread
From: Jack Wang @ 2015-07-09 13:35 UTC (permalink / raw)
  To: Or Gerlitz
  Cc: linux-rdma-u79uwXL29TY76Z2rM5mHXA@public.gmane.org,
	Jack Morgenstein, Moni Shoua

2015-07-09 13:21 GMT+02:00 Or Gerlitz <ogerlitz-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>:
> On 7/9/2015 2:14 PM, Jack Wang wrote:
>>
>> I updated the kernel to OFED 3.0 to verify the fix, but I can still
>> reproduce the bug. Maybe some synchronize_irq() calls are still
>> missing?
>
>
> Again, even if you don't use the upstream kernel in production, I suggest you
> try to reproduce the bug there. If it exists, we'll try to solve it upstream
> and later port the fix to MLNX OFED. Makes sense? You can start with just the
> installed 3.18.14.
>
> Or.
Hello Or,

We also have other kernel modules and the autotest infrastructure built
around this setup, so it's not that easy to install a 3.18.14 kernel.

I looked into the code a little. I think the bug may be related to the
radix_tree usage in mlx4_cq_free(): the OFED code calls radix_tree_delete()
before synchronize_irq(), but the mainline code calls radix_tree_delete()
after synchronize_irq(). Does this matter? I'm building a new kernel with
this small change:

--- a/drivers/net/ethernet/mellanox/mlx4/cq.c
+++ b/drivers/net/ethernet/mellanox/mlx4/cq.c
@@ -393,16 +393,16 @@ void mlx4_cq_free(struct mlx4_dev *dev, struct mlx4_cq *cq)
 	if (err)
 		mlx4_warn(dev, "HW2SW_CQ failed (%d) for CQN %06x\n", err, cq->cqn);

-	spin_lock(&cq_table->lock);
-	radix_tree_delete(&cq_table->tree, cq->cqn);
-	spin_unlock(&cq_table->lock);
-
 	synchronize_irq(priv->eq_table.eq[MLX4_CQ_TO_EQ_VECTOR(cq->vector)].irq);
 	/* synchronize ASYNC irq */
 	if (priv->eq_table.eq[MLX4_CQ_TO_EQ_VECTOR(cq->vector)].irq !=
 	    priv->eq_table.eq[MLX4_EQ_ASYNC].irq)
 		synchronize_irq(priv->eq_table.eq[MLX4_EQ_ASYNC].irq);

+	spin_lock(&cq_table->lock);
+	radix_tree_delete(&cq_table->tree, cq->cqn);
+	spin_unlock(&cq_table->lock);
+
 	if (atomic_dec_and_test(&cq->refcount))
 		complete(&cq->free);
 	wait_for_completion(&cq->free);
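
To make the ordering question concrete, here is a minimal userspace
sketch. It is not mlx4 code: struct sim and every function below are
hypothetical names for a toy model. It assumes one interrupt handler is
already in flight when teardown starts and that no new handler starts
afterwards. It only counts whether the table entry is deleted while that
handler could still be looking it up; whether this is the actual cause
of the oops is not confirmed.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Toy model, NOT driver code: all names here are hypothetical. */
struct sim {
	bool handler_in_flight; /* a handler has started but not finished */
	bool entry_present;     /* the CQ is still in the lookup table */
	int unsafe_overlaps;    /* deletes that ran while a handler was in flight */
};

/* Stand-in for synchronize_irq(): let the in-flight handler finish. */
void sim_wait_for_handler(struct sim *s)
{
	assert(s != NULL);
	s->handler_in_flight = false;
}

/* Stand-in for radix_tree_delete(): note if a handler could still be looking. */
void sim_delete_entry(struct sim *s)
{
	assert(s != NULL);
	if (s->handler_in_flight)
		s->unsafe_overlaps++;
	s->entry_present = false;
}

/* Assumed starting state: entry present, one handler already running. */
struct sim sim_start(void)
{
	struct sim s = { .handler_in_flight = true,
			 .entry_present = true,
			 .unsafe_overlaps = 0 };
	return s;
}

/* Order before the change: delete the entry, then wait for the handler. */
int overlaps_delete_first(void)
{
	struct sim s = sim_start();

	sim_delete_entry(&s);
	sim_wait_for_handler(&s);
	return s.unsafe_overlaps;
}

/* Order after the change: wait for the handler, then delete the entry. */
int overlaps_wait_first(void)
{
	struct sim s = sim_start();

	sim_wait_for_handler(&s);
	sim_delete_entry(&s);
	return s.unsafe_overlaps;
}
```

In this model overlaps_delete_first() returns 1 and
overlaps_wait_first() returns 0, which is the difference the change
above is meant to test.
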
Thanks,
Jack

* Re: Mlx4: BUG: unable to handle kernel at ffffffffa02be210
From: Or Gerlitz @ 2015-07-09 13:57 UTC (permalink / raw)
  To: Jack Wang
  Cc: linux-rdma-u79uwXL29TY76Z2rM5mHXA@public.gmane.org,
	Jack Morgenstein, Moni Shoua

On 7/9/2015 4:35 PM, Jack Wang wrote:
> We have other kernel modules together also the autotest
> infrastructure. It's not that easy to install a 3.18.14 kernel.

You said you are running 3.18.14 and just replaced its stock RDMA
stack with MLNX OFED.

>
> I look into the code a little bit. I think the bug may relate
> radix_tree usage in mlx4_cq_free , OFED code in radix_tree_delete
> before synchronize_irq, but mainline code call radix_tree_delete after
> synchronize_irq,  does this matter?

Possibly yes; as in life, location && timing matter.

> I'm building a new kernel with
> this small change:

