public inbox for linux-kernel@vger.kernel.org
* BUG: NULL pointer dereference at ib_uverbs_comp_handler+0x20
@ 2017-07-28 17:38 Logan Gunthorpe
  2017-08-01 11:08 ` Matan Barak
  0 siblings, 1 reply; 7+ messages in thread
From: Logan Gunthorpe @ 2017-07-28 17:38 UTC (permalink / raw)
  To: Matan Barak, Yishai Hadas, Doug Ledford,
	linux-rdma@vger.kernel.org
  Cc: Sean Hefty, Hal Rosenstock, Jason Gunthorpe, Stephen Bates,
	linux-kernel@vger.kernel.org

[-- Attachment #1: Type: text/plain, Size: 4042 bytes --]

Hi,

My system has been failing with recent kernels (4.12.x and 4.13-rc2)
with a NULL pointer dereference at the stack trace given at the end of
this email. This happens when simply running 'ib_write_bw -R <server>'
with a Chelsio T6 (cxgb4). I've bisected (log attached) to find the
offending commit to be:

commit 1e7710f3f6563940bb6bbc94aa8eadfd344a86af
Author: Matan Barak <matanb@mellanox.com>
  IB/core: Change completion channel to use the reworked objects schema

Reverting this commit (and the dependent commits db1b5ddd53365 and
e0fcc61113c that also fix other bugs with this commit) from v4.12.3
fixes the issue.

I did the bisect with the userspace libraries in Debian Stretch but I
also had this bug with rdma-core v14. I was pretty sure v4.12 kernels
worked for me in the past but likely only before I upgraded from Jessie
to Stretch.

Thanks,

Logan


PS. As a side rant, this bug was found after a very *frustrating* day of
what was supposed to be the 20 minute task of getting my RDMA cards
plugged in again. I tried with both CX4s and the T6s (and I'm still not
sure if my CX4s work yet). Instead, it turns out there's a whole mess of
bugs in the kernel I had to go up against. I went back and forth between
different versions of the userspace libraries because I was sure 4.11
worked -- but it turned out 4.11.10+, 4.12.x and who knows what other
stable kernels are currently broken by the bug fixed in [1]. And there
was a whole other bug that broke things that was fixed in the 4.12-rc
series that I had to carefully bisect around to find the one reported
above. So frustrating!!

[1] 5a7a88f1b488e4ee49eb3d5b82612d4d9ffdf2c3

--

[   53.320439] iwpm_register_pid: Unable to send a nlmsg (client = 2)
[   54.738579] BUG: unable to handle kernel NULL pointer dereference at 0000000000000058
[   54.747439] IP: _raw_spin_lock_irqsave+0x10/0x30
[   54.752719] PGD 0
[   54.752721] P4D 0
[   54.755049]
[   54.759109] Oops: 0002 [#1] SMP
[   54.762699] Modules linked in:
[   54.766195] CPU: 0 PID: 5 Comm: kworker/u16:0 Not tainted 4.13.0-rc2.direct #708
[   54.774536] Hardware name: Supermicro SYS-7047GR-TRF/X9DRG-QF, BIOS 3.0a 12/05/2013
[   54.783182] Workqueue: iw_cxgb4 process_work
[   54.788036] task: ffff880276a5ee80 task.stack: ffffc900000c4000
[   54.794728] RIP: 0010:_raw_spin_lock_irqsave+0x10/0x30
[   54.800552] RSP: 0018:ffffc900000c7c70 EFLAGS: 00010046
[   54.806473] RAX: 0000000000000000 RBX: 0000000000000002 RCX: 0000000000000000
[   54.814524] RDX: 0000000000000001 RSI: 0000000000000058 RDI: 0000000000000058
[   54.822583] RBP: ffff880470484600 R08: 0000000000000001 R09: 0000000000000001
[   54.830663] R10: 0000000000000040 R11: ffff88047420b400 R12: 0000000000000282
[   54.838744] R13: ffffc900000c7dc0 R14: 0000000000000001 R15: ffff880470484600
[   54.846825] FS:  0000000000000000(0000) GS:ffff880277c00000(0000) knlGS:0000000000000000
[   54.855997] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[   54.862522] CR2: 0000000000000058 CR3: 0000000001e0a000 CR4: 00000000000406f0
[   54.870602] Call Trace:
[   54.873442]  ? ib_uverbs_comp_handler+0x20/0xe0
[   54.878610]  ? flush_qp+0x6e/0x2b0
[   54.882514]  ? c4iw_modify_qp+0x11c2/0x1870
[   54.887295]  ? close_con_rpl+0xe7/0x170
[   54.891686]  ? kfree_skb+0x33/0x90
[   54.895592]  ? skb_dequeue+0x52/0x60
[   54.899690]  ? process_work+0x4a/0x60
[   54.903887]  ? process_one_work+0x1c2/0x3e0
[   54.908664]  ? worker_thread+0x47/0x3d0
[   54.913056]  ? kthread+0xfc/0x130
[   54.916864]  ? create_worker+0x180/0x180
[   54.921353]  ? kthread_create_on_node+0x40/0x40
[   54.926521]  ? ret_from_fork+0x22/0x30
[   54.930811] Code: c0 74 05 e8 b3 1c 73 ff 48 89 d8 5b c3 0f 1f 40 00 66 2e 0f 1f 84 00 00 00 00 00 66 66 66 66 90 53 9c 5b fa 31 c0 ba 01 00 00 00 <f0> 0f b1 17 85 c0 75 05 48 89 d8 5b c3 89 c6 e8 9c 09 73 ff 48
[   54.952099] RIP: _raw_spin_lock_irqsave+0x10/0x30 RSP: ffffc900000c7c70
[   54.959598] CR2: 0000000000000058
[   54.963405] ---[ end trace 896cfe0234c949d2 ]---
[  102.633421] random: crng init done


[-- Attachment #2: bisect.log --]
[-- Type: text/x-log; name="bisect.log", Size: 2825 bytes --]

git bisect start
# good: [a351e9b9fc24e982ec2f0e76379a49826036da12] Linux 4.11
git bisect good a351e9b9fc24e982ec2f0e76379a49826036da12
# bad: [2ea659a9ef488125eb46da6eb571de5eae5c43f6] Linux 4.12-rc1
git bisect bad 2ea659a9ef488125eb46da6eb571de5eae5c43f6
# good: [221656e7c4ce342b99c31eca96c1cbb6d1dce45f] Merge tag 'sound-4.12-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/tiwai/sound
git bisect good 221656e7c4ce342b99c31eca96c1cbb6d1dce45f
# bad: [c6a677c6f37bb7abc85ba7e3465e82b9f7eb1d91] Merge tag 'staging-4.12-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/staging
git bisect bad c6a677c6f37bb7abc85ba7e3465e82b9f7eb1d91
# bad: [e579dde654fc2c6b0d3e4b77a9a4b2d2405c510e] Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace
git bisect bad e579dde654fc2c6b0d3e4b77a9a4b2d2405c510e
# bad: [a96480723c287c502b02659f4b347aecaa651ea1] Merge tag 'for-linus-4.12b-rc0b-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/xen/tip
git bisect bad a96480723c287c502b02659f4b347aecaa651ea1
# good: [16a12fa9aed176444fc795b09e796be41902bb08] Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/dtor/input
git bisect good 16a12fa9aed176444fc795b09e796be41902bb08
# bad: [1684096b1ed813f621fb6cbd06e72235c1c2a0ca] Merge tag 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/dledford/rdma
git bisect bad 1684096b1ed813f621fb6cbd06e72235c1c2a0ca
# bad: [e821303c428eedcc20746224d590b11c7000a7e5] iw_cxgb4: Use dsgl by default
git bisect bad e821303c428eedcc20746224d590b11c7000a7e5
# bad: [515ed4f3aab4e8a0855d0cdfd9753a419ccfb297] IB/IPoIB: Separate control and data related initializations
git bisect bad 515ed4f3aab4e8a0855d0cdfd9753a419ccfb297
# bad: [f7b42633720deb5ca8f4bcb175c7dc2933057e7f] IB/hfi1: Ensure VL index is within bounds
git bisect bad f7b42633720deb5ca8f4bcb175c7dc2933057e7f
# bad: [8688426ba6464f7079649f52cf9108856c419415] IB/hfi1: Cache registers during state change
git bisect bad 8688426ba6464f7079649f52cf9108856c419415
# good: [cf8966b3477d5e6545393bb4499f2051ea554c62] IB/core: Add support for fd objects
git bisect good cf8966b3477d5e6545393bb4499f2051ea554c62
# bad: [771a52584096c45e4565e8aabb596eece9d73d61] IB/IPoIB: ibX: failed to create mcg debug file
git bisect bad 771a52584096c45e4565e8aabb596eece9d73d61
# bad: [cd6ce4a5737829052abc4ffc8befd0adfff8998d] IB/hns: Explicitly include linux/of.h
git bisect bad cd6ce4a5737829052abc4ffc8befd0adfff8998d
# bad: [1e7710f3f6563940bb6bbc94aa8eadfd344a86af] IB/core: Change completion channel to use the reworked objects schema
git bisect bad 1e7710f3f6563940bb6bbc94aa8eadfd344a86af
# first bad commit: [1e7710f3f6563940bb6bbc94aa8eadfd344a86af] IB/core: Change completion channel to use the reworked objects schema


* Re: BUG: NULL pointer dereference at ib_uverbs_comp_handler+0x20
  2017-07-28 17:38 BUG: NULL pointer dereference at ib_uverbs_comp_handler+0x20 Logan Gunthorpe
@ 2017-08-01 11:08 ` Matan Barak
  2017-08-01 12:30   ` Potnuri Bharat Teja
  2017-08-01 18:32   ` Logan Gunthorpe
  0 siblings, 2 replies; 7+ messages in thread
From: Matan Barak @ 2017-08-01 11:08 UTC (permalink / raw)
  To: Logan Gunthorpe
  Cc: Matan Barak, Yishai Hadas, Doug Ledford,
	linux-rdma@vger.kernel.org, Sean Hefty, Hal Rosenstock,
	Jason Gunthorpe, Stephen Bates, linux-kernel@vger.kernel.org

On Fri, Jul 28, 2017 at 8:38 PM, Logan Gunthorpe <logang@deltatee.com> wrote:
> Hi,
>
> My system has been failing with recent kernels (4.12.x and 4.13-rc2)
> with a NULL pointer dereference at the stack trace given at the end of
> this email. This happens when simply running 'ib_write_bw -R <server>'
> with a Chelsio T6 (cxgb4). I've bisected (log attached) to find the
> offending commit to be:
>
> commit 1e7710f3f6563940bb6bbc94aa8eadfd344a86af
> Author: Matan Barak <matanb@mellanox.com>
>   IB/core: Change completion channel to use the reworked objects schema
>
> Reverting this commit (and the dependent commits db1b5ddd53365 and
> e0fcc61113c that also fix other bugs with this commit) from v4.12.3
> fixes the issue.
>
> I did the bisect with the userspace libraries in Debian Stretch but I
> also had this bug with rdma-core v14. I was pretty sure v4.12 kernels
> worked for me in the past but likely only before I upgraded from Jessie
> to Stretch.
>
> Thanks,
>
> Logan
>

Hi Logan,

I've tried to reproduce this in my setup (ConnectX 4, RoCE mode) using
1e7710f3f6563940bb6bbc94aa8eadfd344a86af as the kernel's head.
I've used d779dd9a9e8f as rdma-core user-space and the latest perftest bits.
I couldn't reproduce this problem.
I'll try to review this commit again, but please provide more information.
For example, do you see the iwpm_register_pid error when these commits
are reverted?
Does this also happen when using the plain rdma-cm examples (ucmatose,
rping)? Does it happen in a plain verbs application (ibv_rc_pingpong)?
I assume you use iWarp, right? Did you test other modes?
Did you reproduce this issue with your ConnectX 4 as well?
Could you please reproduce it with KASAN as well?

PS, e0fcc61113c isn't a bug fix, it's just a simple refactor.

Regards,
Matan


* Re: BUG: NULL pointer dereference at ib_uverbs_comp_handler+0x20
  2017-08-01 11:08 ` Matan Barak
@ 2017-08-01 12:30   ` Potnuri Bharat Teja
  2017-08-01 18:35     ` Logan Gunthorpe
  2017-08-01 18:32   ` Logan Gunthorpe
  1 sibling, 1 reply; 7+ messages in thread
From: Potnuri Bharat Teja @ 2017-08-01 12:30 UTC (permalink / raw)
  To: Matan Barak
  Cc: Logan Gunthorpe, Matan Barak, Yishai Hadas, Doug Ledford,
	linux-rdma@vger.kernel.org, Sean Hefty, Hal Rosenstock,
	Jason Gunthorpe, Stephen Bates, linux-kernel@vger.kernel.org

On Tuesday, August 01, 2017 at 16:38:08 +0530, Matan Barak wrote:
> On Fri, Jul 28, 2017 at 8:38 PM, Logan Gunthorpe <logang@deltatee.com> wrote:
> > Hi,
> >
> > My system has been failing with recent kernels (4.12.x and 4.13-rc2)
> > with a NULL pointer dereference at the stack trace given at the end of
> > this email. This happens when simply running 'ib_write_bw -R <server>'
> > with a Chelsio T6 (cxgb4). I've bisected (log attached) to find the
> > offending commit to be:
> >
> > commit 1e7710f3f6563940bb6bbc94aa8eadfd344a86af
> > Author: Matan Barak <matanb@mellanox.com>
> >   IB/core: Change completion channel to use the reworked objects schema
> >
> > Reverting this commit (and the dependent commits db1b5ddd53365 and
> > e0fcc61113c that also fix other bugs with this commit) from v4.12.3
> > fixes the issue.
> >
> > I did the bisect with the userspace libraries in Debian Stretch but I
> > also had this bug with rdma-core v14. I was pretty sure v4.12 kernels
> > worked for me in the past but likely only before I upgraded from Jessie
> > to Stretch.
> >
> > Thanks,
> >
> > Logan
> >
Hi Logan,
Today I sent out a patch to address the issue. Please try it.
"[PATCH 1/1] RDMA/uverbs: Initialize cq_context appropriately"

> 
> Hi Logan,
> 
> I've tried to reproduce this in my setup (ConnectX 4, RoCE mode) using
> 1e7710f3f6563940bb6bbc94aa8eadfd344a86af as the kernel's head.
> I've used d779dd9a9e8f as rdma-core user-space and the latest perftest bits.
> I couldn't reproduce this problem.
> I'll try to review this commit again, but please provide more information.
> For example, do you see the iwpm_register_pid error when these commits
> are reverted?
Hi Matan,
The issue is seen with applications that do not create a completion channel.
It is not seen with rping or similar applications, which do create a
completion channel.

Today I sent out a patch to address the issue. Please review it.
"[PATCH 1/1] RDMA/uverbs: Initialize cq_context appropriately"
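
The distinction above (crash only when no completion channel exists) can be illustrated with a minimal user-space sketch. It is not the patch itself, which this thread does not show; the struct and function names below are hypothetical stand-ins, and the sketch only assumes that the handler locks a queue reached through the CQ context, so the context must be NULL rather than derived from a missing object when there is no channel.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-ins; these are NOT the kernel's real definitions. */
struct event_queue {
    int lock;      /* stands in for a spinlock */
    int pending;   /* number of queued completion events */
};

struct event_file {
    int fd;
    struct event_queue queue;
};

/* Choose the CQ context: NULL when no completion channel was created. */
struct event_queue *cq_context_for(struct event_file *file)
{
    return file ? &file->queue : NULL;
}

/* Completion handler: returns 1 if an event was queued, 0 if skipped. */
int comp_handler(struct event_queue *ctx)
{
    if (ctx == NULL)
        return 0;       /* no channel, nothing to notify */
    ctx->lock = 1;      /* "take" the lock */
    ctx->pending++;
    ctx->lock = 0;      /* "release" the lock */
    return 1;
}
```

With this shape, an application that never creates a channel reaches comp_handler() with a NULL context and returns early, while an application that does create one (as rping does) has its event queued as usual.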

Thanks,
Bharat.
> Does this also happen when using the plain rdma-cm examples (ucmatose,
> rping)? Does it happen in a plain verbs application (ibv_rc_pingpong)?
> I assume you use iWarp, right? Did you test other modes?
> Did you reproduce this issue with your ConnectX 4 as well?
> Could you please reproduce it with KASAN as well?

> 
> PS, e0fcc61113c isn't a bug fix, it's just a simple refactor.
> 
> Regards,
> Matan

* Re: BUG: NULL pointer dereference at ib_uverbs_comp_handler+0x20
  2017-08-01 11:08 ` Matan Barak
  2017-08-01 12:30   ` Potnuri Bharat Teja
@ 2017-08-01 18:32   ` Logan Gunthorpe
  2017-08-01 19:29     ` Jason Gunthorpe
  1 sibling, 1 reply; 7+ messages in thread
From: Logan Gunthorpe @ 2017-08-01 18:32 UTC (permalink / raw)
  To: Matan Barak
  Cc: Matan Barak, Yishai Hadas, Doug Ledford,
	linux-rdma@vger.kernel.org, Sean Hefty, Hal Rosenstock,
	Jason Gunthorpe, Stephen Bates, linux-kernel@vger.kernel.org,
	Potnuri Bharat Teja

[-- Attachment #1: Type: text/plain, Size: 1242 bytes --]

Hey,

The patch Bharat provided fixes the kernel panic but RDMA in userspace
still does not work at all. Reverting the commits I mentioned still
fixes everything.

To answer your questions:

* I see the iwpm_register_pid message even when things are working so I
don't think it's related.

* All clients I've tried fail. I've attached a log of all the error
messages I see with various clients. (This was with Bharat's patch so
there was no kernel panic and I saw no dmesgs during these runs). The
same runs with the commits I mentioned reverted work fine.

* I retested everything with the CX4 cards as well and they have a
similar problem but produce different error messages. I've attached a
log of those client runs as well. The CX4 also works once I revert those
patches. However, from memory, I don't think the CX4s ever suffered from
the kernel panic, and I guess it was just luck that the patches I
reverted caused all these problems.


On 01/08/17 05:08 AM, Matan Barak wrote:
> PS, e0fcc61113c isn't a bug fix, it's just a simple refactor.

If it's not a bug fix I don't think it should have a fixes tag. It
probably didn't matter in this case but you don't want refactor commits
to accidentally reach a stable kernel.

Thanks,

Logan

[-- Attachment #2: cxgb4-client-errors.txt --]
[-- Type: text/plain, Size: 1502 bytes --]

gunthorp@cgy1-donard:~$ ib_write_bw -R

************************************
* Waiting for client to connect... *
************************************
 Couldn't create rdma QP - Invalid argument
Unable to create QP.
Failed to create QP.
 Unable to create the resources needed by comm struct
 Unable to perform rdma_server function
 Unable to init the socket connection
gunthorp@cgy1-donard:~$ ib_write_bw -R flash-cxgb
 Couldn't create rdma QP - Invalid argument
Unable to create QP.
Failed to create QP.
 Unable to create the resources needed by comm struct
 Unable to perform rdma_client function
 Unable to init the socket connection
gunthorp@cgy1-donard:~$ rping -s
rdma_create_qp: Invalid argument
setup_qp failed: -1
gunthorp@cgy1-donard:~$ rping -c -a flash-cxgb -v -C5
rdma_create_qp: Invalid argument
setup_qp failed: -1
gunthorp@cgy1-donard:~$ ucmatose 
cmatose: starting server
cmatose: unable to create QP: Invalid argument
cmatose: failing connection request
test complete
return status -1
gunthorp@cgy1-donard:~$ ucmatose -s flash-cxgb
cmatose: starting client
cmatose: connecting
cmatose: unable to create QP: Invalid argument
test complete
return status -1
gunthorp@cgy1-donard:~$ ibv_rc_pingpong 
  local address:  LID 0x0000, QPN 0x000430, PSN 0xef9f8e, GID ::
Failed to modify QP to RTR
Couldn't connect to remote QP
gunthorp@cgy1-donard:~$ ibv_rc_pingpong flash-cxgb
  local address:  LID 0x0000, QPN 0x000438, PSN 0x9e00c8, GID ::
client read: Success
Couldn't read remote address

[-- Attachment #3: mlx5-client-errors.txt --]
[-- Type: text/plain, Size: 1373 bytes --]

gunthorp@cgy1-donard:~$ ib_write_bw -R -d mlx5_0 flash-rdma
Unexpected CM event bl blka 6
 Unable to perform rdma_client function
 Unable to init the socket connection
gunthorp@cgy1-donard:~$ ib_write_bw -R -d mlx5_0 

************************************
* Waiting for client to connect... *
************************************
Function rdma_accept failed
 Unable to perform rdma_server function
 Unable to init the socket connection
gunthorp@cgy1-donard:~$ rping -s
rdma_accept: Invalid argument
connect error -1
gunthorp@cgy1-donard:~$ rping -c -a flash-rdma
cma event RDMA_CM_EVENT_CONNECT_ERROR, error -1
wait for CONNECTED state 4
connect error -1
gunthorp@cgy1-donard:~$ ucmatose
cmatose: starting server
cmatose: failure accepting: Invalid argument
cmatose: failing connection request
test complete
return status -1
gunthorp@cgy1-donard:~$ ucmatose -s flash-rdma
cmatose: starting client
cmatose: connecting
cmatose: event: RDMA_CM_EVENT_CONNECT_ERROR, error: -1
test complete
return status -1
gunthorp@cgy1-donard:~$ ibv_rc_pingpong
  local address:  LID 0x0003, QPN 0x000847, PSN 0x2a678f, GID ::
Failed to modify QP to RTR
Couldn't connect to remote QP
gunthorp@cgy1-donard:~$ ibv_rc_pingpong flash-rdma
  local address:  LID 0x0003, QPN 0x000848, PSN 0xe014bd, GID ::
  remote address: LID 0x0002, QPN 0x000849, PSN 0x2bd346, GID ::
Failed to modify QP to RTR


* Re: BUG: NULL pointer dereference at ib_uverbs_comp_handler+0x20
  2017-08-01 12:30   ` Potnuri Bharat Teja
@ 2017-08-01 18:35     ` Logan Gunthorpe
  0 siblings, 0 replies; 7+ messages in thread
From: Logan Gunthorpe @ 2017-08-01 18:35 UTC (permalink / raw)
  To: Potnuri Bharat Teja, Matan Barak
  Cc: Matan Barak, Yishai Hadas, Doug Ledford,
	linux-rdma@vger.kernel.org, Sean Hefty, Hal Rosenstock,
	Jason Gunthorpe, Stephen Bates, linux-kernel@vger.kernel.org

Hey,

On 01/08/17 06:30 AM, Potnuri Bharat Teja wrote:
> Hi Logan,
> Today I sent out a patch to address the issue. Please try it.
> "[PATCH 1/1] RDMA/uverbs: Initialize cq_context appropriately"

Thanks. As I mentioned in my other email, this fixes the kernel panic on
the T6 but doesn't solve all my problems. You can add a

Tested-by: Logan Gunthorpe <logang@deltatee.com>

Logan


* Re: BUG: NULL pointer dereference at ib_uverbs_comp_handler+0x20
  2017-08-01 18:32   ` Logan Gunthorpe
@ 2017-08-01 19:29     ` Jason Gunthorpe
  2017-08-01 19:39       ` Logan Gunthorpe
  0 siblings, 1 reply; 7+ messages in thread
From: Jason Gunthorpe @ 2017-08-01 19:29 UTC (permalink / raw)
  To: Logan Gunthorpe
  Cc: Matan Barak, Matan Barak, Yishai Hadas, Doug Ledford,
	linux-rdma@vger.kernel.org, Sean Hefty, Hal Rosenstock,
	Stephen Bates, linux-kernel@vger.kernel.org, Potnuri Bharat Teja

On Tue, Aug 01, 2017 at 12:32:57PM -0600, Logan Gunthorpe wrote:
>  Couldn't create rdma QP - Invalid argument
> Unable to create QP.
> Failed to create QP.

Failing to create a QP makes me wonder if you have this patch?

 Subject: [PATCH v2 1/2] RDMA/uverbs: Fix the check for port number

 The port number is only valid if IB_QP_PORT is set in the mask.
 So only check port number if it is valid to prevent modify_qp from
 failing due to an invalid port number.

 Fixes: 5ecce4c9b17b("Check port number supplied by user verbs cmds")
 Cc: <stable@vger.kernel.org> # v2.6.14+
 Reviewed-by: Steve Wise <swise@opengridcomputing.com>
 Signed-off-by: Mustafa Ismail <mustafa.ismail@intel.com>
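
 The rule in that description (validate the port number only when the
 mask says it is valid) can be sketched as below. This is an
 illustration under stated assumptions, not the patch: ATTR_PORT is a
 hypothetical flag standing in for IB_QP_PORT, and struct qp_attr and
 check_modify_qp() are made-up names.

```c
#include <assert.h>

/* Hypothetical flag for illustration; IB_QP_PORT has its own value. */
enum { ATTR_PORT = 1 << 5 };

/* Hypothetical attribute struct holding only the field of interest. */
struct qp_attr {
    unsigned int port_num;
};

/*
 * Returns 0 if the request is acceptable, -1 (standing in for -EINVAL)
 * otherwise. The port number is range-checked only when the caller's
 * mask marks it as valid; otherwise its value is ignored.
 */
int check_modify_qp(const struct qp_attr *attr, unsigned int mask,
                    unsigned int num_ports)
{
    if ((mask & ATTR_PORT) &&
        (attr->port_num < 1 || attr->port_num > num_ports))
        return -1;
    return 0;
}
```

 An unconditional range check would reject a request whose port_num
 was simply left at zero because the caller never meant to set it,
 which matches the "Invalid argument" failures quoted above.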

Jason


* Re: BUG: NULL pointer dereference at ib_uverbs_comp_handler+0x20
  2017-08-01 19:29     ` Jason Gunthorpe
@ 2017-08-01 19:39       ` Logan Gunthorpe
  0 siblings, 0 replies; 7+ messages in thread
From: Logan Gunthorpe @ 2017-08-01 19:39 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Matan Barak, Matan Barak, Yishai Hadas, Doug Ledford,
	linux-rdma@vger.kernel.org, Sean Hefty, Hal Rosenstock,
	Stephen Bates, linux-kernel@vger.kernel.org, Potnuri Bharat Teja


On 01/08/17 01:29 PM, Jason Gunthorpe wrote:
> On Tue, Aug 01, 2017 at 12:32:57PM -0600, Logan Gunthorpe wrote:
>>  Couldn't create rdma QP - Invalid argument
>> Unable to create QP.
>> Failed to create QP.
> 
> Failing to create a QP makes me wonder if you have this patch?
> 
>  Subject: [PATCH v2 1/2] RDMA/uverbs: Fix the check for port number
> 
>  The port number is only valid if IB_QP_PORT is set in the mask.
>  So only check port number if it is valid to prevent modify_qp from
>  failing due to an invalid port number.
> 
>  Fixes: 5ecce4c9b17b("Check port number supplied by user verbs cmds")
>  Cc: <stable@vger.kernel.org> # v2.6.14+
>  Reviewed-by: Steve Wise <swise@opengridcomputing.com>
>  Signed-off-by: Mustafa Ismail <mustafa.ismail@intel.com>

Oh, oops, I forgot about that. I mentioned the fix for that in my
original email, but it seems I wasn't comparing apples to apples in my
testing today: the branch with the reverted commits had the fix for
that commit while the branch with Bharat's patch didn't.

I just did a test with both Bharat's patch and 5a7a88f1b4, and
everything is working correctly again.

So that's great, we just need these patches to be picked up by the
stable kernels.

Thanks,

Logan
