From: Or Gerlitz <ogerlitz-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>
To: Jack Wang <xjtuwjp-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org>
Cc: "linux-rdma-u79uwXL29TY76Z2rM5mHXA@public.gmane.org"
	<linux-rdma-u79uwXL29TY76Z2rM5mHXA@public.gmane.org>,
	Jack Morgenstein
	<jackm-LDSdmyG8hGV8YrgS2mwiifqBs+8SCbDb@public.gmane.org>,
	Moni Shoua <monis-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>
Subject: Re: Mlx4: BUG: unable to handle kernel at ffffffffa02be210
Date: Wed, 8 Jul 2015 15:19:46 +0300
Message-ID: <559D1562.2070309@mellanox.com>
In-Reply-To: <CAD+HZHXi2bB59eWLYaGiXj5-b5w3V1NhwUJbSjx5NfdmhEaRhA-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>

On 7/8/2015 12:42 PM, Jack Wang wrote:

> We're using MLX OFED 2.4-1.0.4 on top of 3.18.14.

This list is for upstream things, but still, let's see.


> We hit the bug below spontaneously; our test triggers it around 1 in 5 times.

And what is your test, if I may ask?


> HCA 'mlx4_0'
> CA type: MT26428
> Number of ports: 2
> Firmware version: 2.9.1000
> Hardware version: b0
>
> Could you offer some insight? Could this be an old bug that is already fixed?
> If so, could you point me to the fix so I can port it to our kernel? Thanks.
>
> [  657.723842] BUG: unable to handle kernel  at ffffffffa02be210
> [  657.724245] IP: [<ffffffffa02be210>] 0xffffffffa02be210
> [  657.724539] PGD 1c15067
> [  657.725162] Oops: 0010 [#1]
> [  657.725657] Modules linked in: ib_ipoib ib_uverbs ib_umad mlx4_ib rdma_cm iw_cm ib_cm ib_sa ib_mad ib_core ib_addr ipv6 null_blk loop amd64_edac_mod k10temp fam15h_power edac_core button microcode hid_generic usbhid hid igb hwmon i2c_algo_bit i2c_core dca ahci ptp libahci ohci_pci pps_core mlx4_core ohci_hcd libata [last unloaded: ibtrs_server]
> [  657.731897] CPU: 0 PID: 337 Comm: kworker/u128:1 Tainted: G
>    O   3.18.14-1-ibnbd-debug #1
> [  657.732049] Hardware name: Supermicro BHQGE/BHQGE, BIOS 3.00       10/24/2012
> [  657.732199] Workqueue: ib_mad1 ib_mad_complete_send_wr [ib_mad]
> [  657.732464] task: ffff880415bea1f0 ti: ffff880415420000 task.ti: ffff880415420000
> [  657.732610] RIP: 0010:[<ffffffffa02be210>]  [<ffffffffa02be210>] 0xffffffffa02be210
> [  657.732959] RSP: 0018:ffff880417c03d00  EFLAGS: 00010006
> [  657.733193] RAX: ffff8803bc5fc4d8 RBX: ffff8803bc5fc4d8 RCX: 0000000000000000
> [  657.733416] RDX: ffff880415bea9e0 RSI: ffff8803d8dcd388 RDI: ffff8803bc5fc4a8
> [  657.736094] RBP: ffff880417c03d08 R08: 0000000000000000 R09: ffff880415bea9b8
> [  657.736317] R10: 0000000000000000 R11: 0000000000000000 R12: ffff8800d3b00000
> [  657.736543] R13: 00000000000000c5 R14: 0000000000000000 R15: 0000000000000020
> [  657.736800] FS:  00007f2f05b5f700(0000) GS:ffff880417c00000(0000) knlGS:0000000000000000
> [  657.737109] CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
> [  657.737330] CR2: ffffffffa02be210 CR3: 000000180d76f000 CR4: 00000000000407f0
> [  657.737555] Stack:
> [  657.737758]  ffffffffa01d84b7 ffff880417c03d48 ffffffffa004a486 ffffffffa004a3f5
> [  657.738546]  ffffffff81c194e0 0000000000000000 00000000c5000000 ffff8804136001c0
> [  657.739360]  ffff8800d3b00000 ffff880417c03e18 ffffffffa004c0ea 0000000000000002
> [  657.740149] Call Trace:
> [  657.740385]  <IRQ>
> [  657.740514]  [<ffffffffa01d84b7>] ? mlx4_ib_destroy_ah+0x37/0x360 [mlx4_ib]
> [  657.741093]  [<ffffffffa004a486>] mlx4_cq_completion+0x96/0xe0 [mlx4_core]
> [  657.741330]  [<ffffffffa004a3f5>] ? mlx4_cq_completion+0x5/0xe0 [mlx4_core]
> [  657.741594]  [<ffffffffa004c0ea>] mlx4_test_interrupts+0x84a/0x1100 [mlx4_core]

mlx4_test_interrupts is called from the mlx4_en ethtool selftest handler, so
are you running the selftest while something else (what?) is going on in parallel?




> [  657.741908]  [<ffffffff8109f37a>] ? __lock_acquire.isra.28+0x3aa/0xcb0
> [  657.742142]  [<ffffffffa004c904>] mlx4_test_interrupts+0x1064/0x1100 [mlx4_core]
> [  657.742457]  [<ffffffff810aa678>] handle_irq_event_percpu+0x78/0x2b0
> [  657.742685]  [<ffffffff810aa8f8>] handle_irq_event+0x48/0x70
> [  657.742934]  [<ffffffff810adf58>] handle_edge_irq+0xc8/0x160
> [  657.743160]  [<ffffffff8100515e>] handle_irq+0x14e/0x200
> [  657.743384]  [<ffffffff815fea3e>] do_IRQ+0x5e/0x110
> [  657.743603]  [<ffffffff815fcf6a>] common_interrupt+0x6a/0x6a
> [  657.743826]  <EOI>
> [  657.743957]  [<ffffffff81197295>] ? __slab_alloc+0x615/0x710
> [  657.744513]  [<ffffffffa01d80de>] ? mlx4_ib_create_ah+0x2e/0x2a0 [mlx4_ib]
> [  657.744738]  [<ffffffffa0195603>] ? ib_create_send_mad+0xf3/0x330 [ib_mad]
> [  657.744968]  [<ffffffff81198f12>] __kmalloc+0x162/0x2e0
> [  657.745191]  [<ffffffffa0195603>] ? ib_create_send_mad+0xf3/0x330 [ib_mad]
> [  657.745420]  [<ffffffffa01d8100>] ? mlx4_ib_create_ah+0x50/0x2a0 [mlx4_ib]
> [  657.745650]  [<ffffffffa0195603>] ib_create_send_mad+0xf3/0x330 [ib_mad]
> [  657.745875]  [<ffffffffa019985b>] agent_send_response+0xbb/0x270 [ib_mad]
> [  657.746103]  [<ffffffffa0198bf4>] ? ib_mad_complete_send_wr+0x844/0xfa0 [ib_mad]
> [  657.746413]  [<ffffffffa0198f96>] ib_mad_complete_send_wr+0xbe6/0xfa0 [ib_mad]
> [  657.746729]  [<ffffffff8109f37a>] ? __lock_acquire.isra.28+0x3aa/0xcb0
> [  657.746959]  [<ffffffff8106c82d>] process_one_work+0x33d/0x6d0
> [  657.747181]  [<ffffffff8106c7a4>] ? process_one_work+0x2b4/0x6d0
> [  657.747434]  [<ffffffff8106d015>] worker_thread+0x55/0x6d0
> [  657.751224]  [<ffffffff8106cfc0>] ? rescuer_thread+0x3c0/0x3c0
> [  657.751482]  [<ffffffff81073e84>] kthread+0xe4/0x100
> [  657.751705]  [<ffffffff810792b4>] ? finish_task_switch+0x84/0x140
> [  657.751935]  [<ffffffff81073da0>] ? kthread_create_on_node+0x280/0x280
> [  657.752165]  [<ffffffff815fc3c8>] ret_from_fork+0x58/0x90
> [  657.752391]  [<ffffffff81073da0>] ? kthread_create_on_node+0x280/0x280
> [  657.752640] Code:  Bad RIP value.
> [  657.753095] RIP  [<ffffffffa02be210>] 0xffffffffa02be210
> [  657.753434]  RSP <ffff880417c03d00>
> [  657.753645] CR2: ffffffffa02be210
> [  657.753878] ---[ end trace 9c9225f5e490f806 ]---
> [  657.765754] Kernel panic - not syncing: Fatal exception in interrupt
> [  657.766089] Kernel Offset: 0x0 from 0xffffffff81000000 (relocation range: 0xffffffff80000000-0xffffffff9fffffff)
> [  657.778084] ---[ end Kernel panic - not syncing: Fatal exception in interrupt
>
> Best regards,
> Jack Wang
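
When reading the trace above, keep in mind that entries marked with "?" are
addresses that were found on the stack but could not be confirmed as part of
the call chain, so they may be stale. A minimal sketch in Python, assuming the
bracketed frame format seen in this log, that separates the two kinds of
frames. The name split_frames and the "vmlinux" label for frames without a
module tag are made up for illustration:

```python
import re

# Matches frames such as
#   [<ffffffffa004a486>] mlx4_cq_completion+0x96/0xe0 [mlx4_core]
#   [<ffffffff81197295>] ? __slab_alloc+0x615/0x710
# The leading "[  657.741093]" timestamp is skipped because it lacks "<".
FRAME_RE = re.compile(
    r"\[<(?P<addr>[0-9a-f]+)>\]\s+(?P<guess>\?\s+)?"
    r"(?P<sym>[\w.]+)\+0x(?P<off>[0-9a-f]+)/0x(?P<size>[0-9a-f]+)"
    r"(?:\s+\[(?P<mod>\w+)\])?"
)


def split_frames(lines):
    """Return (reliable, unreliable) lists of (symbol, module) tuples."""
    reliable, unreliable = [], []
    for line in lines:
        m = FRAME_RE.search(line)
        if m is None:
            continue  # not a stack frame line
        # Assumption: a frame without a [module] tag belongs to the core kernel.
        frame = (m.group("sym"), m.group("mod") or "vmlinux")
        (unreliable if m.group("guess") else reliable).append(frame)
    return reliable, unreliable


sample = [
    "[  657.740514]  [<ffffffffa01d84b7>] ? mlx4_ib_destroy_ah+0x37/0x360 [mlx4_ib]",
    "[  657.741093]  [<ffffffffa004a486>] mlx4_cq_completion+0x96/0xe0 [mlx4_core]",
    "[  657.742457]  [<ffffffff810aa678>] handle_irq_event_percpu+0x78/0x2b0",
]
reliable, unreliable = split_frames(sample)
print("reliable:  ", reliable)
print("unreliable:", unreliable)
```

Strip the "> " quote prefix and rejoin any lines the mail client wrapped before
feeding the full trace to it; a frame split across two lines will not match.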
