* BUG: NULL pointer dereference at ib_uverbs_comp_handler+0x20
@ 2017-07-28 17:38 Logan Gunthorpe
  2017-08-01 11:08 ` Matan Barak
  0 siblings, 1 reply; 7+ messages in thread
From: Logan Gunthorpe @ 2017-07-28 17:38 UTC
  To: Matan Barak, Yishai Hadas, Doug Ledford, linux-rdma@vger.kernel.org
  Cc: Sean Hefty, Hal Rosenstock, Jason Gunthorpe, Stephen Bates,
      linux-kernel@vger.kernel.org

[-- Attachment #1: Type: text/plain, Size: 4042 bytes --]

Hi,

My system has been failing with recent kernels (4.12.x and 4.13-rc2)
with a NULL pointer dereference at the stack trace given at the end of
this email. This happens when simply running 'ib_write_bw -R <server>'
with a Chelsio T6 (cxgb4). I've bisected (log attached) to find the
offending commit to be:

commit 1e7710f3f6563940bb6bbc94aa8eadfd344a86af
Author: Matan Barak <matanb@mellanox.com>
IB/core: Change completion channel to use the reworked objects schema

Reverting this commit (and the dependent commits db1b5ddd53365 and
e0fcc61113c that also fix other bugs with this commit) from v4.12.3
fixes the issue.

I did the bisect with the userspace libraries in Debian Stretch but I
also had this bug with rdma-core v14. I was pretty sure v4.12 kernels
worked for me in the past but likely only before I upgraded from Jessie
to Stretch.

Thanks,

Logan

PS. As a side rant, this bug was found after a very *frustrating* day of
what was supposed to be the 20 minute task of getting my RDMA cards
plugged in again. I tried with both CX4s and the T6s (and I'm still not
sure if my CX4s work yet). Instead, it turns out there's a whole mess of
bugs in the kernel I had to go up against. I went back and forth between
different versions of the userspace libraries because I was sure 4.11
worked -- but it turned out 4.11.10+, 4.12.x and who knows what other
stable kernels are currently broken by the bug fixed in [1]. And there
was a whole other bug that broke things that was fixed in the 4.12-rc
series that I had to carefully bisect around to find the one reported
above. So frustrating!!

[1] 5a7a88f1b488e4ee49eb3d5b82612d4d9ffdf2c3

--

[   53.320439] iwpm_register_pid: Unable to send a nlmsg (client = 2)
[   54.738579] BUG: unable to handle kernel NULL pointer dereference at 0000000000000058
[   54.747439] IP: _raw_spin_lock_irqsave+0x10/0x30
[   54.752719] PGD 0
[   54.752721] P4D 0
[   54.755049]
[   54.759109] Oops: 0002 [#1] SMP
[   54.762699] Modules linked in:
[   54.766195] CPU: 0 PID: 5 Comm: kworker/u16:0 Not tainted 4.13.0-rc2.direct #708
[   54.774536] Hardware name: Supermicro SYS-7047GR-TRF/X9DRG-QF, BIOS 3.0a 12/05/2013
[   54.783182] Workqueue: iw_cxgb4 process_work
[   54.788036] task: ffff880276a5ee80 task.stack: ffffc900000c4000
[   54.794728] RIP: 0010:_raw_spin_lock_irqsave+0x10/0x30
[   54.800552] RSP: 0018:ffffc900000c7c70 EFLAGS: 00010046
[   54.806473] RAX: 0000000000000000 RBX: 0000000000000002 RCX: 0000000000000000
[   54.814524] RDX: 0000000000000001 RSI: 0000000000000058 RDI: 0000000000000058
[   54.822583] RBP: ffff880470484600 R08: 0000000000000001 R09: 0000000000000001
[   54.830663] R10: 0000000000000040 R11: ffff88047420b400 R12: 0000000000000282
[   54.838744] R13: ffffc900000c7dc0 R14: 0000000000000001 R15: ffff880470484600
[   54.846825] FS:  0000000000000000(0000) GS:ffff880277c00000(0000) knlGS:0000000000000000
[   54.855997] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[   54.862522] CR2: 0000000000000058 CR3: 0000000001e0a000 CR4: 00000000000406f0
[   54.870602] Call Trace:
[   54.873442]  ? ib_uverbs_comp_handler+0x20/0xe0
[   54.878610]  ? flush_qp+0x6e/0x2b0
[   54.882514]  ? c4iw_modify_qp+0x11c2/0x1870
[   54.887295]  ? close_con_rpl+0xe7/0x170
[   54.891686]  ? kfree_skb+0x33/0x90
[   54.895592]  ? skb_dequeue+0x52/0x60
[   54.899690]  ? process_work+0x4a/0x60
[   54.903887]  ? process_one_work+0x1c2/0x3e0
[   54.908664]  ? worker_thread+0x47/0x3d0
[   54.913056]  ? kthread+0xfc/0x130
[   54.916864]  ? create_worker+0x180/0x180
[   54.921353]  ? kthread_create_on_node+0x40/0x40
[   54.926521]  ? ret_from_fork+0x22/0x30
[   54.930811] Code: c0 74 05 e8 b3 1c 73 ff 48 89 d8 5b c3 0f 1f 40 00 66 2e 0f 1f 84 00 00 00 00 00 66 66 66 66 90 53 9c 5b fa 31 c0 ba 01 00 00 00 <f0> 0f b1 17 85 c0 75 05 48 89 d8 5b c3 89 c6 e8 9c 09 73 ff 48
[   54.952099] RIP: _raw_spin_lock_irqsave+0x10/0x30 RSP: ffffc900000c7c70
[   54.959598] CR2: 0000000000000058
[   54.963405] ---[ end trace 896cfe0234c949d2 ]---
[  102.633421] random: crng init done

[-- Warning: decoded text below may be mangled, UTF-8 assumed --]
[-- Attachment #2: bisect.log --]
[-- Type: text/x-log; name="bisect.log", Size: 2825 bytes --]

git bisect start
# good: [a351e9b9fc24e982ec2f0e76379a49826036da12] Linux 4.11
git bisect good a351e9b9fc24e982ec2f0e76379a49826036da12
# bad: [2ea659a9ef488125eb46da6eb571de5eae5c43f6] Linux 4.12-rc1
git bisect bad 2ea659a9ef488125eb46da6eb571de5eae5c43f6
# good: [221656e7c4ce342b99c31eca96c1cbb6d1dce45f] Merge tag 'sound-4.12-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/tiwai/sound
git bisect good 221656e7c4ce342b99c31eca96c1cbb6d1dce45f
# bad: [c6a677c6f37bb7abc85ba7e3465e82b9f7eb1d91] Merge tag 'staging-4.12-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/staging
git bisect bad c6a677c6f37bb7abc85ba7e3465e82b9f7eb1d91
# bad: [e579dde654fc2c6b0d3e4b77a9a4b2d2405c510e] Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace
git bisect bad e579dde654fc2c6b0d3e4b77a9a4b2d2405c510e
# bad: [a96480723c287c502b02659f4b347aecaa651ea1] Merge tag 'for-linus-4.12b-rc0b-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/xen/tip
git bisect bad a96480723c287c502b02659f4b347aecaa651ea1
# good: [16a12fa9aed176444fc795b09e796be41902bb08] Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/dtor/input
git bisect good 16a12fa9aed176444fc795b09e796be41902bb08
# bad: [1684096b1ed813f621fb6cbd06e72235c1c2a0ca] Merge tag 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/dledford/rdma
git bisect bad 1684096b1ed813f621fb6cbd06e72235c1c2a0ca
# bad: [e821303c428eedcc20746224d590b11c7000a7e5] iw_cxgb4: Use dsgl by default
git bisect bad e821303c428eedcc20746224d590b11c7000a7e5
# bad: [515ed4f3aab4e8a0855d0cdfd9753a419ccfb297] IB/IPoIB: Separate control and data related initializations
git bisect bad 515ed4f3aab4e8a0855d0cdfd9753a419ccfb297
# bad: [f7b42633720deb5ca8f4bcb175c7dc2933057e7f] IB/hfi1: Ensure VL index is within bounds
git bisect bad f7b42633720deb5ca8f4bcb175c7dc2933057e7f
# bad: [8688426ba6464f7079649f52cf9108856c419415] IB/hfi1: Cache registers during state change
git bisect bad 8688426ba6464f7079649f52cf9108856c419415
# good: [cf8966b3477d5e6545393bb4499f2051ea554c62] IB/core: Add support for fd objects
git bisect good cf8966b3477d5e6545393bb4499f2051ea554c62
# bad: [771a52584096c45e4565e8aabb596eece9d73d61] IB/IPoIB: ibX: failed to create mcg debug file
git bisect bad 771a52584096c45e4565e8aabb596eece9d73d61
# bad: [cd6ce4a5737829052abc4ffc8befd0adfff8998d] IB/hns: Explicitly include linux/of.h
git bisect bad cd6ce4a5737829052abc4ffc8befd0adfff8998d
# bad: [1e7710f3f6563940bb6bbc94aa8eadfd344a86af] IB/core: Change completion channel to use the reworked objects schema
git bisect bad 1e7710f3f6563940bb6bbc94aa8eadfd344a86af
# first bad commit: [1e7710f3f6563940bb6bbc94aa8eadfd344a86af] IB/core: Change completion channel to use the reworked objects schema

* Re: BUG: NULL pointer dereference at ib_uverbs_comp_handler+0x20 2017-07-28 17:38 BUG: NULL pointer dereference at ib_uverbs_comp_handler+0x20 Logan Gunthorpe @ 2017-08-01 11:08 ` Matan Barak 2017-08-01 12:30 ` Potnuri Bharat Teja 2017-08-01 18:32 ` Logan Gunthorpe 0 siblings, 2 replies; 7+ messages in thread From: Matan Barak @ 2017-08-01 11:08 UTC (permalink / raw) To: Logan Gunthorpe Cc: Matan Barak, Yishai Hadas, Doug Ledford, linux-rdma@vger.kernel.org, Sean Hefty, Hal Rosenstock, Jason Gunthorpe, Stephen Bates, linux-kernel@vger.kernel.org On Fri, Jul 28, 2017 at 8:38 PM, Logan Gunthorpe <logang@deltatee.com> wrote: > Hi, > > My system has been failing with recent kernels (4.12.x and 4.13-rc2) > with a NULL pointer dereference at the stack trace given at the end of > this email. This happens when simply running 'ib_write_bw -R <server>' > with a Chelsio T6 (cxgb4). I've bisected (log attached) to find the > offending commit to be: > > commit 1e7710f3f6563940bb6bbc94aa8eadfd344a86af > Author: Matan Barak <matanb@mellanox.com> > IB/core: Change completion channel to use the reworked objects schema > > Reverting this commit (and the dependent commits db1b5ddd53365 and > e0fcc61113c that also fix other bugs with this commit) from v4.12.3 > fixes the issue. > > I did the bisect with the userspace libraries in Debian Stretch but I > also had this bug with rdma-core v14. I was pretty sure v4.12 kernels > worked for me in the past but likely only before I upgraded from Jessie > to Stretch. > > Thanks, > > Logan > Hi Logan, I've tried to reproduce this in my setup (ConnectX 4, RoCE mode) using 1e7710f3f6563940bb6bbc94aa8eadfd344a86af as the kernel's head. I've used d779dd9a9e8f as rdma-core user-space and the latest perftest bits. I couldn't reproduce this problem. I'll try to review this commit again, but please provide more information. For example, do you see the iwpm_register_pid error when these commits are reverted? 
Does this also happen when using the plain rdma-cm examples (ucmatose, rping)? Does it happen in a plain verbs application (ibv_rc_pingpong)? I assume you use iWarp, right? Did you test other modes? Did you reproduce this issue with your ConnectX 4 as well? Could you please reproduce it with KASAN as well? PS, e0fcc61113c isn't a bug fix, it's just a simple refactor. Regards, Matan > > PS. As a side rant, this bug was found after a very *frustrating* day of > what was supposed to be the 20 minute task of getting my RDMA cards > plugged in again. I tried with both CX4s and the T6s (and I'm still not > sure if my CX4s work yet). Instead, it turns out there's a whole mess of > bugs in the kernel I had to go up against. I went back and forth between > different versions of the userspace libraries because I was sure 4.11 > worked -- but it turned out 4.11.10+, 4.12.x and who knows what other > stable kernels are currently broken by the bug fixed in [1]. And there > was a whole other bug that broke things that was fixed in the 4.12-rc > series that I had to carefully bisect around to find the one reported > above. So frustrating!! 
> > [1] 5a7a88f1b488e4ee49eb3d5b82612d4d9ffdf2c3 > > -- > > [ 53.320439] iwpm_register_pid: Unable to send a nlmsg (client = 2) > [ 54.738579] BUG: unable to handle kernel NULL pointer dereference at > 0000000000000058 > [ 54.747439] IP: _raw_spin_lock_irqsave+0x10/0x30 > [ 54.752719] PGD 0 > [ 54.752721] P4D 0 > [ 54.755049] > [ 54.759109] Oops: 0002 [#1] SMP > [ 54.762699] Modules linked in: > [ 54.766195] CPU: 0 PID: 5 Comm: kworker/u16:0 Not tainted > 4.13.0-rc2.direct #708 > [ 54.774536] Hardware name: Supermicro SYS-7047GR-TRF/X9DRG-QF, BIOS > 3.0a 12/05/2013 > [ 54.783182] Workqueue: iw_cxgb4 process_work > [ 54.788036] task: ffff880276a5ee80 task.stack: ffffc900000c4000 > [ 54.794728] RIP: 0010:_raw_spin_lock_irqsave+0x10/0x30 > [ 54.800552] RSP: 0018:ffffc900000c7c70 EFLAGS: 00010046 > [ 54.806473] RAX: 0000000000000000 RBX: 0000000000000002 RCX: > 0000000000000000 > [ 54.814524] RDX: 0000000000000001 RSI: 0000000000000058 RDI: > 0000000000000058 > [ 54.822583] RBP: ffff880470484600 R08: 0000000000000001 R09: > 0000000000000001 > [ 54.830663] R10: 0000000000000040 R11: ffff88047420b400 R12: > 0000000000000282 > [ 54.838744] R13: ffffc900000c7dc0 R14: 0000000000000001 R15: > ffff880470484600 > [ 54.846825] FS: 0000000000000000(0000) GS:ffff880277c00000(0000) > knlGS:0000000000000000 > [ 54.855997] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 > [ 54.862522] CR2: 0000000000000058 CR3: 0000000001e0a000 CR4: > 00000000000406f0 > [ 54.870602] Call Trace: > [ 54.873442] ? ib_uverbs_comp_handler+0x20/0xe0 > [ 54.878610] ? flush_qp+0x6e/0x2b0 > [ 54.882514] ? c4iw_modify_qp+0x11c2/0x1870 > [ 54.887295] ? close_con_rpl+0xe7/0x170 > [ 54.891686] ? kfree_skb+0x33/0x90 > [ 54.895592] ? skb_dequeue+0x52/0x60 > [ 54.899690] ? process_work+0x4a/0x60 > [ 54.903887] ? process_one_work+0x1c2/0x3e0 > [ 54.908664] ? worker_thread+0x47/0x3d0 > [ 54.913056] ? kthread+0xfc/0x130 > [ 54.916864] ? create_worker+0x180/0x180 > [ 54.921353] ? 
kthread_create_on_node+0x40/0x40 > [ 54.926521] ? ret_from_fork+0x22/0x30 > [ 54.930811] Code: c0 74 05 e8 b3 1c 73 ff 48 89 d8 5b c3 0f 1f 40 00 > 66 2e 0f 1f 84 00 00 00 00 00 66 66 66 66 90 53 9c 5b fa 31 c0 ba 01 00 > 00 00 <f0> 0f b1 17 85 c0 75 05 48 89 d8 5b c3 89 c6 e8 9c 09 73 ff 48 > [ 54.952099] RIP: _raw_spin_lock_irqsave+0x10/0x30 RSP: ffffc900000c7c70 > [ 54.959598] CR2: 0000000000000058 > [ 54.963405] ---[ end trace 896cfe0234c949d2 ]--- > [ 102.633421] random: crng init done > ^ permalink raw reply [flat|nested] 7+ messages in thread
* Re: BUG: NULL pointer dereference at ib_uverbs_comp_handler+0x20 2017-08-01 11:08 ` Matan Barak @ 2017-08-01 12:30 ` Potnuri Bharat Teja 2017-08-01 18:35 ` Logan Gunthorpe 2017-08-01 18:32 ` Logan Gunthorpe 1 sibling, 1 reply; 7+ messages in thread From: Potnuri Bharat Teja @ 2017-08-01 12:30 UTC (permalink / raw) To: Matan Barak Cc: Logan Gunthorpe, Matan Barak, Yishai Hadas, Doug Ledford, linux-rdma@vger.kernel.org, Sean Hefty, Hal Rosenstock, Jason Gunthorpe, Stephen Bates, linux-kernel@vger.kernel.org On Tuesday, August 08/01/17, 2017 at 16:38:08 +0530, Matan Barak wrote: > On Fri, Jul 28, 2017 at 8:38 PM, Logan Gunthorpe <logang@deltatee.com> wrote: > > Hi, > > > > My system has been failing with recent kernels (4.12.x and 4.13-rc2) > > with a NULL pointer dereference at the stack trace given at the end of > > this email. This happens when simply running 'ib_write_bw -R <server>' > > with a Chelsio T6 (cxgb4). I've bisected (log attached) to find the > > offending commit to be: > > > > commit 1e7710f3f6563940bb6bbc94aa8eadfd344a86af > > Author: Matan Barak <matanb@mellanox.com> > > IB/core: Change completion channel to use the reworked objects schema > > > > Reverting this commit (and the dependent commits db1b5ddd53365 and > > e0fcc61113c that also fix other bugs with this commit) from v4.12.3 > > fixes the issue. > > > > I did the bisect with the userspace libraries in Debian Stretch but I > > also had this bug with rdma-core v14. I was pretty sure v4.12 kernels > > worked for me in the past but likely only before I upgraded from Jessie > > to Stretch. > > > > Thanks, > > > > Logan > > Hi Logan, Today I sent out a patch to address the issue. Please try it. "[PATCH 1/1] RDMA/uverbs: Initialize cq_context appropriately" > > Hi Logan, > > I've tried to reproduce this in my setup (ConnectX 4, RoCE mode) using > 1e7710f3f6563940bb6bbc94aa8eadfd344a86af as the kernel's head. > I've used d779dd9a9e8f as rdma-core user-space and the latest perftest bits. 
> I couldn't reproduce this problem. > I'll try to review this commit again, but please provide more information. > For example, do you see the iwpm_register_pid error when these commits > are reverted? Hi Matan, Issue is seen with applications not creating a completion channel. It is not seen with rping or similar applications which do create completion channel. Today I sent out a patch to address the issue. Please review it. "[PATCH 1/1] RDMA/uverbs: Initialize cq_context appropriately" Thanks, Bharat. > Does this also happen when using the plain rdma-cm examples (ucmatose, > rping)? Does it happen in a plain verbs application (ibv_rc_pingpong)? > I assume you use iWarp, right? Did you test other modes? > Did you reproduce this issue with your ConnectX 4 as well? > Could you please reproduce it with KASAN as well? > > PS, e0fcc61113c isn't a bug fix, it's just a simple refactor. > > Regards, > Matan > > > > > PS. As a side rant, this bug was found after a very *frustrating* day of > > what was supposed to be the 20 minute task of getting my RDMA cards > > plugged in again. I tried with both CX4s and the T6s (and I'm still not > > sure if my CX4s work yet). Instead, it turns out there's a whole mess of > > bugs in the kernel I had to go up against. I went back and forth between > > different versions of the userspace libraries because I was sure 4.11 > > worked -- but it turned out 4.11.10+, 4.12.x and who knows what other > > stable kernels are currently broken by the bug fixed in [1]. And there > > was a whole other bug that broke things that was fixed in the 4.12-rc > > series that I had to carefully bisect around to find the one reported > > above. So frustrating!! 
> > > > [1] 5a7a88f1b488e4ee49eb3d5b82612d4d9ffdf2c3 > > > > -- > > > > [ 53.320439] iwpm_register_pid: Unable to send a nlmsg (client = 2) > > [ 54.738579] BUG: unable to handle kernel NULL pointer dereference at > > 0000000000000058 > > [ 54.747439] IP: _raw_spin_lock_irqsave+0x10/0x30 > > [ 54.752719] PGD 0 > > [ 54.752721] P4D 0 > > [ 54.755049] > > [ 54.759109] Oops: 0002 [#1] SMP > > [ 54.762699] Modules linked in: > > [ 54.766195] CPU: 0 PID: 5 Comm: kworker/u16:0 Not tainted > > 4.13.0-rc2.direct #708 > > [ 54.774536] Hardware name: Supermicro SYS-7047GR-TRF/X9DRG-QF, BIOS > > 3.0a 12/05/2013 > > [ 54.783182] Workqueue: iw_cxgb4 process_work > > [ 54.788036] task: ffff880276a5ee80 task.stack: ffffc900000c4000 > > [ 54.794728] RIP: 0010:_raw_spin_lock_irqsave+0x10/0x30 > > [ 54.800552] RSP: 0018:ffffc900000c7c70 EFLAGS: 00010046 > > [ 54.806473] RAX: 0000000000000000 RBX: 0000000000000002 RCX: > > 0000000000000000 > > [ 54.814524] RDX: 0000000000000001 RSI: 0000000000000058 RDI: > > 0000000000000058 > > [ 54.822583] RBP: ffff880470484600 R08: 0000000000000001 R09: > > 0000000000000001 > > [ 54.830663] R10: 0000000000000040 R11: ffff88047420b400 R12: > > 0000000000000282 > > [ 54.838744] R13: ffffc900000c7dc0 R14: 0000000000000001 R15: > > ffff880470484600 > > [ 54.846825] FS: 0000000000000000(0000) GS:ffff880277c00000(0000) > > knlGS:0000000000000000 > > [ 54.855997] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 > > [ 54.862522] CR2: 0000000000000058 CR3: 0000000001e0a000 CR4: > > 00000000000406f0 > > [ 54.870602] Call Trace: > > [ 54.873442] ? ib_uverbs_comp_handler+0x20/0xe0 > > [ 54.878610] ? flush_qp+0x6e/0x2b0 > > [ 54.882514] ? c4iw_modify_qp+0x11c2/0x1870 > > [ 54.887295] ? close_con_rpl+0xe7/0x170 > > [ 54.891686] ? kfree_skb+0x33/0x90 > > [ 54.895592] ? skb_dequeue+0x52/0x60 > > [ 54.899690] ? process_work+0x4a/0x60 > > [ 54.903887] ? process_one_work+0x1c2/0x3e0 > > [ 54.908664] ? worker_thread+0x47/0x3d0 > > [ 54.913056] ? 
kthread+0xfc/0x130 > > [ 54.916864] ? create_worker+0x180/0x180 > > [ 54.921353] ? kthread_create_on_node+0x40/0x40 > > [ 54.926521] ? ret_from_fork+0x22/0x30 > > [ 54.930811] Code: c0 74 05 e8 b3 1c 73 ff 48 89 d8 5b c3 0f 1f 40 00 > > 66 2e 0f 1f 84 00 00 00 00 00 66 66 66 66 90 53 9c 5b fa 31 c0 ba 01 00 > > 00 00 <f0> 0f b1 17 85 c0 75 05 48 89 d8 5b c3 89 c6 e8 9c 09 73 ff 48 > > [ 54.952099] RIP: _raw_spin_lock_irqsave+0x10/0x30 RSP: ffffc900000c7c70 > > [ 54.959598] CR2: 0000000000000058 > > [ 54.963405] ---[ end trace 896cfe0234c949d2 ]--- > > [ 102.633421] random: crng init done > >
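
Bharat's diagnosis in the message above (the crash only shows up when an application does not create a completion channel) can be illustrated with a small userspace sketch in C. This is not the kernel's code and not the actual patch: the struct names, the 0x58 bytes of padding, and both helper functions are hypothetical, chosen only so that the member offset matches the faulting address 0x58 in the oops. The sketch assumes the CQ context is derived from a channel pointer that may be NULL. Under that assumption, a NULL base plus the member offset gives a small non-NULL address that a plain NULL test in the handler would not catch, so the context has to be set to NULL explicitly when there is no channel.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-ins for illustration; NOT the kernel's real layout. */
struct event_queue {
    int lock;       /* stands in for a spinlock */
    int pending;    /* number of queued completion events */
};

struct comp_channel {
    char other_state[0x58];         /* padding chosen to put ev_queue at 0x58 */
    struct event_queue ev_queue;
};

/*
 * Pick the context to hand to the completion handler. Without the NULL
 * test, &ch->ev_queue on a NULL ch would in practice come out as the
 * bogus address 0x58 instead of NULL.
 */
struct event_queue *cq_context_for(struct comp_channel *ch)
{
    return ch ? &ch->ev_queue : NULL;
}

/* Returns 1 if an event was queued, 0 if there is no channel to notify. */
int comp_handler(void *cq_context)
{
    struct event_queue *q = cq_context;

    if (!q)
        return 0;       /* no completion channel: nothing to do */

    q->lock = 1;        /* a real handler would take a spinlock here */
    q->pending++;
    q->lock = 0;
    return 1;
}
```

With a channel, comp_handler(cq_context_for(&ch)) queues one event. With cq_context_for(NULL), the handler returns immediately instead of locking something at address 0x58.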
* Re: BUG: NULL pointer dereference at ib_uverbs_comp_handler+0x20
  2017-08-01 12:30 ` Potnuri Bharat Teja
@ 2017-08-01 18:35 ` Logan Gunthorpe
  0 siblings, 0 replies; 7+ messages in thread
From: Logan Gunthorpe @ 2017-08-01 18:35 UTC
  To: Potnuri Bharat Teja, Matan Barak
  Cc: Matan Barak, Yishai Hadas, Doug Ledford, linux-rdma@vger.kernel.org,
      Sean Hefty, Hal Rosenstock, Jason Gunthorpe, Stephen Bates,
      linux-kernel@vger.kernel.org

Hey,

On 01/08/17 06:30 AM, Potnuri Bharat Teja wrote:
> Hi Logan,
> Today I sent out a patch to address the issue. Please try it.
> "[PATCH 1/1] RDMA/uverbs: Initialize cq_context appropriately"

Thanks, as I mentioned in my other email this fixes the kernel panic on
the T6 but doesn't solve all my problems. You can add a

Tested-by: Logan Gunthorpe <logang@deltatee.com>

Logan

* Re: BUG: NULL pointer dereference at ib_uverbs_comp_handler+0x20
  2017-08-01 11:08 ` Matan Barak
  2017-08-01 12:30 ` Potnuri Bharat Teja
@ 2017-08-01 18:32 ` Logan Gunthorpe
  2017-08-01 19:29 ` Jason Gunthorpe
  1 sibling, 1 reply; 7+ messages in thread
From: Logan Gunthorpe @ 2017-08-01 18:32 UTC
  To: Matan Barak
  Cc: Matan Barak, Yishai Hadas, Doug Ledford, linux-rdma@vger.kernel.org,
      Sean Hefty, Hal Rosenstock, Jason Gunthorpe, Stephen Bates,
      linux-kernel@vger.kernel.org, Potnuri Bharat Teja

[-- Attachment #1: Type: text/plain, Size: 1242 bytes --]

Hey,

The patch Bharat provided fixes the kernel panic but RDMA in userspace
still does not work at all. Reverting the commits I mentioned still
fixes everything.

To answer your questions:

* I see the iwpm_register_pid message even when things are working so I
don't think it's related.

* All clients I've tried fail. I've attached a log of all the error
messages I see with various clients. (This was with Bharat's patch so
there was no kernel panic and I saw no dmesgs during these runs). The
same runs with the commits I mentioned reverted work fine.

* I retested everything with the CX4 cards as well and they have a
similar problem but produce different error messages. I've attached a
log of client runs as well. The CX4 also works once I revert those
patches. However, by memory, I don't think the CX4s ever suffered from
the kernel panic, and I guess it was just luck that the patches I
reverted caused all these problems.

On 01/08/17 05:08 AM, Matan Barak wrote:
> PS, e0fcc61113c isn't a bug fix, it's just a simple refactor.

If it's not a bug fix I don't think it should have a fixes tag. It
probably didn't matter in this case but you don't want refactor commits
to accidentally reach a stable kernel.

Thanks,

Logan

[-- Attachment #2: cxgb4-client-errors.txt --]
[-- Type: text/plain, Size: 1502 bytes --]

gunthorp@cgy1-donard:~$ ib_write_bw -R

************************************
* Waiting for client to connect... *
************************************
Couldn't create rdma QP - Invalid argument
Unable to create QP.
Failed to create QP.
 Unable to create the resources needed by comm struct
Unable to perform rdma_server function
Unable to init the socket connection

gunthorp@cgy1-donard:~$ ib_write_bw -R flash-cxgb
Couldn't create rdma QP - Invalid argument
Unable to create QP.
Failed to create QP.
 Unable to create the resources needed by comm struct
Unable to perform rdma_client function
Unable to init the socket connection

gunthorp@cgy1-donard:~$ rping -s
rdma_create_qp: Invalid argument
setup_qp failed: -1

gunthorp@cgy1-donard:~$ rping -c -a flash-cxgb -v -C5
rdma_create_qp: Invalid argument
setup_qp failed: -1

gunthorp@cgy1-donard:~$ ucmatose
cmatose: starting server
cmatose: unable to create QP: Invalid argument
cmatose: failing connection request
test complete
return status -1

gunthorp@cgy1-donard:~$ ucmatose -s flash-cxgb
cmatose: starting client
cmatose: connecting
cmatose: unable to create QP: Invalid argument
test complete
return status -1

gunthorp@cgy1-donard:~$ ibv_rc_pingpong
  local address:  LID 0x0000, QPN 0x000430, PSN 0xef9f8e, GID ::
Failed to modify QP to RTR
Couldn't connect to remote QP

gunthorp@cgy1-donard:~$ ibv_rc_pingpong flash-cxgb
  local address:  LID 0x0000, QPN 0x000438, PSN 0x9e00c8, GID ::
client read: Success
Couldn't read remote address

[-- Attachment #3: mlx5-client-errors.txt --]
[-- Type: text/plain, Size: 1373 bytes --]

gunthorp@cgy1-donard:~$ ib_write_bw -R -d mlx5_0 flash-rdma
Unexpected CM event bl blka 6
Unable to perform rdma_client function
Unable to init the socket connection

gunthorp@cgy1-donard:~$ ib_write_bw -R -d mlx5_0

************************************
* Waiting for client to connect... *
************************************
Function rdma_accept failed
Unable to perform rdma_server function
Unable to init the socket connection

gunthorp@cgy1-donard:~$ rping -s
rdma_accept: Invalid argument
connect error -1

gunthorp@cgy1-donard:~$ rping -c -a flash-rdma
cma event RDMA_CM_EVENT_CONNECT_ERROR, error -1
wait for CONNECTED state 4
connect error -1

gunthorp@cgy1-donard:~$ ucmatose
cmatose: starting server
cmatose: failure accepting: Invalid argument
cmatose: failing connection request
test complete
return status -1

gunthorp@cgy1-donard:~$ ucmatose -s flash-rdma
cmatose: starting client
cmatose: connecting
cmatose: event: RDMA_CM_EVENT_CONNECT_ERROR, error: -1
test complete
return status -1

gunthorp@cgy1-donard:~$ ibv_rc_pingpong
  local address:  LID 0x0003, QPN 0x000847, PSN 0x2a678f, GID ::
Failed to modify QP to RTR
Couldn't connect to remote QP

gunthorp@cgy1-donard:~$ ibv_rc_pingpong flash-rdma
  local address:  LID 0x0003, QPN 0x000848, PSN 0xe014bd, GID ::
  remote address: LID 0x0002, QPN 0x000849, PSN 0x2bd346, GID ::
Failed to modify QP to RTR

* Re: BUG: NULL pointer dereference at ib_uverbs_comp_handler+0x20
  2017-08-01 18:32 ` Logan Gunthorpe
@ 2017-08-01 19:29 ` Jason Gunthorpe
  2017-08-01 19:39 ` Logan Gunthorpe
  0 siblings, 1 reply; 7+ messages in thread
From: Jason Gunthorpe @ 2017-08-01 19:29 UTC
  To: Logan Gunthorpe
  Cc: Matan Barak, Matan Barak, Yishai Hadas, Doug Ledford,
      linux-rdma@vger.kernel.org, Sean Hefty, Hal Rosenstock,
      Stephen Bates, linux-kernel@vger.kernel.org, Potnuri Bharat Teja

On Tue, Aug 01, 2017 at 12:32:57PM -0600, Logan Gunthorpe wrote:

> Couldn't create rdma QP - Invalid argument
> Unable to create QP.
> Failed to create QP.

Failing to create a QP makes me wonder if you have this patch?

Subject: [PATCH v2 1/2] RDMA/uverbs: Fix the check for port number

The port number is only valid if IB_QP_PORT is set in the mask.
So only check port number if it is valid to prevent modify_qp from
failing due to an invalid port number.

Fixes: 5ecce4c9b17b("Check port number supplied by user verbs cmds")
Cc: <stable@vger.kernel.org> # v2.6.14+
Reviewed-by: Steve Wise <swise@opengridcomputing.com>
Signed-off-by: Mustafa Ismail <mustafa.ismail@intel.com>

Jason

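
The rule in the patch description quoted above (validate the port number only when IB_QP_PORT is set in the attribute mask) can be sketched as a standalone C function. This is a hedged illustration, not the patch itself: the function name, the QP_ATTR_PORT bit value, and the first_port/last_port parameters are all hypothetical and stand in for whatever the kernel actually uses.

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>

/* Hypothetical mask bit standing in for IB_QP_PORT; the value is assumed. */
#define QP_ATTR_PORT (1u << 5)

/*
 * Validate the port number of a modify-QP request.
 * The port field is only meaningful when the caller set QP_ATTR_PORT in
 * attr_mask, so requests without that bit are accepted regardless of
 * what happens to be in port_num.
 */
int check_modify_qp_port(uint32_t attr_mask, uint8_t port_num,
                         uint8_t first_port, uint8_t last_port)
{
    if (!(attr_mask & QP_ATTR_PORT))
        return 0;               /* port not supplied: nothing to check */

    if (port_num < first_port || port_num > last_port)
        return -EINVAL;         /* port supplied but out of range */

    return 0;
}
```

An unconditional range check would reject a request whose mask leaves out the port bit and whose port field is therefore zero or uninitialized, which matches the "Invalid argument" failures described in the thread. Gating the check on the mask bit avoids that.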
* Re: BUG: NULL pointer dereference at ib_uverbs_comp_handler+0x20
  2017-08-01 19:29 ` Jason Gunthorpe
@ 2017-08-01 19:39 ` Logan Gunthorpe
  0 siblings, 0 replies; 7+ messages in thread
From: Logan Gunthorpe @ 2017-08-01 19:39 UTC
  To: Jason Gunthorpe
  Cc: Matan Barak, Matan Barak, Yishai Hadas, Doug Ledford,
      linux-rdma@vger.kernel.org, Sean Hefty, Hal Rosenstock,
      Stephen Bates, linux-kernel@vger.kernel.org, Potnuri Bharat Teja

On 01/08/17 01:29 PM, Jason Gunthorpe wrote:
> On Tue, Aug 01, 2017 at 12:32:57PM -0600, Logan Gunthorpe wrote:
>> Couldn't create rdma QP - Invalid argument
>> Unable to create QP.
>> Failed to create QP.
>
> Failing to create a QP makes me wonder if you have this patch?
>
> Subject: [PATCH v2 1/2] RDMA/uverbs: Fix the check for port number
>
> The port number is only valid if IB_QP_PORT is set in the mask.
> So only check port number if it is valid to prevent modify_qp from
> failing due to an invalid port number.
>
> Fixes: 5ecce4c9b17b("Check port number supplied by user verbs cmds")
> Cc: <stable@vger.kernel.org> # v2.6.14+
> Reviewed-by: Steve Wise <swise@opengridcomputing.com>
> Signed-off-by: Mustafa Ismail <mustafa.ismail@intel.com>

Oh, oops, I forgot about that. I mentioned the fix for that in my
original email and it seems I wasn't testing apples to apples for my
testing today. During my testing today, the branch with the reverted
commits had the fix for that commit while the branch with Bharat's patch
didn't.

I just did a test with both Bharat's patch and 5a7a88f1b4, and
everything is working correctly again.

So that's great, we just need these patches to be picked up by the
stable kernels.

Thanks,

Logan