From: Laurence Oberman <loberman-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
To: Bart Van Assche <Bart.VanAssche-Sjgp3cTcYWE@public.gmane.org>,
	"jgg-uk2M96/98Pc@public.gmane.org"
	<jgg-uk2M96/98Pc@public.gmane.org>
Cc: "linux-rdma-u79uwXL29TY76Z2rM5mHXA@public.gmane.org"
	<linux-rdma-u79uwXL29TY76Z2rM5mHXA@public.gmane.org>,
	"ddutile-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org"
	<ddutile-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
Subject: Re: Kernel v4.16 / v4.17 SRP and SRPT patches
Date: Wed, 10 Jan 2018 13:59:10 -0500
Message-ID: <1515610750.10153.1.camel@redhat.com>
In-Reply-To: <1515609623.2745.20.camel-Sjgp3cTcYWE@public.gmane.org>

On Wed, 2018-01-10 at 18:40 +0000, Bart Van Assche wrote:
> On Wed, 2018-01-10 at 11:26 -0700, Jason Gunthorpe wrote:
> > On Wed, Jan 10, 2018 at 08:42:03AM -0500, Laurence Oberman wrote:
> > 
> > > [  946.647514] kernel tried to execute NX-protected page -
> > > exploit
> > > attempt? (uid: 0)
> > > [  946.691954] BUG: unable to handle kernel paging request at
> > > 00000000a2129b93
> > > [  947.889552] Call Trace:
> > > [  947.903724]  ? __ib_process_cq+0x55/0xa0 [ib_core]
> > > [  947.931179]  ? ib_cq_poll_work+0x1b/0x60 [ib_core]
> > > [  947.958153]  ? process_one_work+0x141/0x340
> > > [  947.981362]  ? worker_thread+0x47/0x3e0
> > > [  948.002102]  ? kthread+0xf5/0x130
> > > [  948.020538]  ? rescuer_thread+0x380/0x380
> > > [  948.043180]  ? kthread_associate_blkcg+0x90/0x90
> > > [  948.070184]  ? ret_from_fork+0x1f/0x30
> > 
> > These oops's you have are very suggestive that ib_wc->wr_cqe
> > is garbage..
> > 
> > Did SRP free its wr_cqe data before completion somehow?
> > 
> > Turn on slab poisoning to confirm?
> 
> Hello Jason,
> 
> It's easy to see in drivers/infiniband/core/cq.c that polling is
> stopped
> before a completion queue is destroyed (see also the
> cancel_work_sync(&cq->work)
> and the cq->device->destroy_cq(cq) calls in ib_free_cq()).
> 
> BTW, I run all my tests with SLAB poisoning enabled. My SRP tests
> pass if I run
> the SRP initiator and target drivers on top of the mlx4 and rdma_rxe
> drivers.
> 
> Bart.

Hi Jason

Yep, this seems specific to mlx5 and IB.
The problem, though, is that Linus's tree (4.15-rc7) already has enough
of the RDMA updates to show issues.

With his tree I don't panic, but I see this:

[ 1360.511682] mlx5_core 0000:08:00.1: Shutdown was called
[ 1360.550531] mlx5_core 0000:08:00.1: mlx5_enter_error_state:121:(pid
15149): start
[ 1360.593520] ------------[ cut here ]------------
[ 1360.619930] got unsolicited completion for CQ 0x0000000068694acd
[ 1360.654434] WARNING: CPU: 15 PID: 15149 at
drivers/infiniband/core/cq.c:80 ib_cq_completion_direct+0x28/0x30
[ib_core]
[ 1360.716099] Modules linked in: xt_CHECKSUM iptable_mangle
ipt_MASQUERADE nf_nat_masquerade_ipv4 iptable_nat nf_nat_ipv4 nf_nat
nf_conntrack_ipv4 nf_defrag_ipv4 xt_conntrack nf_conntrack ipt_REJECT
nf_reject_ipv4 tun bridge stp llc ebtable_filter ebtables
ip6table_filter ip6_tables iptable_filter rpcrdma ib_isert
iscsi_target_mod target_core_mod ib_iser libiscsi scsi_transport_iscsi
ib_srp scsi_transport_srp ib_ipoib rdma_ucm ib_ucm ib_uverbs ib_umad
rdma_cm ib_cm iw_cm mlx5_ib ib_core intel_powerclamp coretemp kvm_intel
kvm irqbypass crct10dif_pclmul crc32_pclmul ipmi_ssif
ghash_clmulni_intel pcbc joydev aesni_intel dm_service_time ipmi_si
crypto_simd glue_helper sg hpilo cryptd hpwdt ipmi_devintf iTCO_wdt
gpio_ich acpi_power_meter iTCO_vendor_support ipmi_msghandler shpchp
pcspkr i7core_edac lpc_ich
[ 1361.120851]  pcc_cpufreq nfsd auth_rpcgss nfs_acl lockd grace
dm_multipath sunrpc ip_tables xfs libcrc32c radeon i2c_algo_bit
drm_kms_helper syscopyarea sysfillrect sysimgblt fb_sys_fops ttm sd_mod
drm mlx5_core mlxfw ptp serio_raw crc32c_intel i2c_core hpsa pps_core
bnx2 devlink scsi_transport_sas dm_mirror dm_region_hash dm_log dm_mod
[ 1361.288913] CPU: 15 PID: 15149 Comm: reboot Tainted:
G          I      4.15.0-rc7 #1
[ 1361.333577] Hardware name: HP ProLiant DL380 G7, BIOS P67 08/16/2015
[ 1361.369976] RIP: 0010:ib_cq_completion_direct+0x28/0x30 [ib_core]
[ 1361.404971] RSP: 0018:ffffa08c8747fc60 EFLAGS: 00010086
[ 1361.435007] RAX: 0000000000000000 RBX: ffff8d37a6f8b468 RCX:
ffffffffae662928
[ 1361.474397] RDX: 0000000000000001 RSI: 0000000000000082 RDI:
0000000000000046
[ 1361.515097] RBP: ffff8d2bb07e0000 R08: 0000000000000000 R09:
0000000000000717
[ 1361.555054] R10: 0000000000000000 R11: ffffa08c8747f9c8 R12:
ffff8d2ed1edc264
[ 1361.595593] R13: ffff8d37a6f8b400 R14: ffffa08c8747fca8 R15:
0000000000000083
[ 1361.635133] FS:  00007fc09956a880(0000) GS:ffff8d37b33c0000(0000)
knlGS:0000000000000000
[ 1361.681800] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 1361.714217] CR2: 0000000001034f80 CR3: 0000000ba0f9e005 CR4:
00000000000206e0
[ 1361.754794] Call Trace:
[ 1361.768980]  mlx5_ib_event+0x335/0x410 [mlx5_ib]
[ 1361.795303]  mlx5_core_event+0x7b/0x1a0 [mlx5_core]
[ 1361.823438]  ? synchronize_irq+0x35/0xa0
[ 1361.845962]  mlx5_enter_error_state+0xe4/0x1c0 [mlx5_core]
[ 1361.877382]  shutdown+0x127/0x170 [mlx5_core]
[ 1361.902688]  pci_device_shutdown+0x31/0x60
[ 1361.925924]  device_shutdown+0x101/0x1d0
[ 1361.948642]  kernel_restart+0xe/0x60
[ 1361.968517]  SYSC_reboot+0x1e8/0x210
[ 1361.988062]  ? __audit_syscall_entry+0xaf/0x100
[ 1362.013500]  ? syscall_trace_enter+0x1cc/0x2b0
[ 1362.038483]  ? __audit_syscall_exit+0x1ff/0x280
[ 1362.064598]  do_syscall_64+0x61/0x1a0
[ 1362.084635]  entry_SYSCALL64_slow_path+0x25/0x25
[ 1362.111113] RIP: 0033:0x7fc098377a56
[ 1362.131668] RSP: 002b:00007ffd4b3377e8 EFLAGS: 00000206 ORIG_RAX:
00000000000000a9
[ 1362.174578] RAX: ffffffffffffffda RBX: 0000000000000004 RCX:
00007fc098377a56
[ 1362.213620] RDX: 0000000001234567 RSI: 0000000028121969 RDI:
fffffffffee1dead
[ 1362.255259] RBP: 0000000000000000 R08: 000056141a7642a0 R09:
00007ffd4b336eb0
[ 1362.296293] R10: 0000000000000024 R11: 0000000000000206 R12:
0000000000000000
[ 1362.338341] R13: 00007ffd4b337ab0 R14: 0000000000000000 R15:
0000000000000000
[ 1362.378518] Code: 00 00 00 66 66 66 66 90 80 3d 65 e1 02 00 00 74 02
f3 c3 48 89 fe 31 c0 48 c7 c7 68 58 92 c0 c6 05 4e e1 02 00 01 e8 a8 23
d8 ec <0f> ff c3 0f 1f 44 00 00 66 66 66 66 90 41 55 45 89 c5 41 54 49 
[ 1362.483962] ---[ end trace 528ee06930a5763f ]---
[ 1362.509435] mlx5_1:mlx5_ib_event:2992:(pid 15149): warning: event on
port 0
[ 1362.548716] scsi host2: ib_srp: failed RECV status WR flushed (5)
for CQE 0000000023e53497
[ 1362.595980] mlx5_core 0000:08:00.1: mlx5_enter_error_state:128:(pid
15149): end
[ 1362.637630] mlx5_core 0000:08:00.0: Shutdown was called
[ 1362.677523] mlx5_core 0000:08:00.0: mlx5_enter_error_state:121:(pid
15149): start
[ 1362.720734] mlx5_0:mlx5_ib_event:2992:(pid 15149): warning: event on
port 0
[ 1362.760795] scsi host1: ib_srp: failed RECV status WR flushed (5)
for CQE 000000009ad07e27
[ 1362.806977] mlx5_core 0000:08:00.0: mlx5_enter_error_state:128:(pid
15149): end
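
Because the log above was wrapped by the mail client, the call trace is
awkward to read. The following is only a rough sketch of how the wrapped
lines could be rejoined and the trace symbols pulled out. The helper names
(unwrap, call_trace) are hypothetical, and it assumes every real kernel log
line starts with a "[ seconds.micros]" timestamp and that frames have the
form symbol+0xOFF/0xSIZE. Any non-log text that directly follows a log line
would be glued onto it, so feed it only the log excerpt.

```python
import re

# Assumption: a real kernel log line starts with "[ seconds.micros]";
# anything else is a continuation created by mail-client wrapping.
STAMP = re.compile(r"^\[\s*\d+\.\d+\]")
# Assumption: trace frames look like "symbol+0xOFF/0xSIZE [module]",
# optionally prefixed with "? " for frames the unwinder could not verify.
FRAME = re.compile(r"^\[\s*\d+\.\d+\]\s+(\? )?([\w.]+)\+0x[0-9a-f]+/0x[0-9a-f]+")


def unwrap(lines):
    """Rejoin log lines that were wrapped onto following lines."""
    joined = []
    for line in lines:
        line = line.rstrip()
        if STAMP.match(line) or not joined:
            joined.append(line)
        elif line.strip():
            # Continuation of the previous timestamped line.
            joined[-1] += " " + line.strip()
    return joined


def call_trace(lines):
    """Return (symbol, reliable) pairs for the first 'Call Trace:' block."""
    frames, in_trace = [], False
    for line in unwrap(lines):
        if line.endswith("Call Trace:"):
            in_trace = True
            continue
        if in_trace:
            m = FRAME.match(line)
            if not m:
                break  # first non-frame line ends the trace
            frames.append((m.group(2), m.group(1) is None))
    return frames
```

Frames marked with "?" come back with reliable set to False. In the trace
above, the frames without "?" run from mlx5_ib_event down through the
mlx5_core shutdown path to the reboot syscall.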

With the latest RDMA tree additions I panic every time on shutdown.
That kernel is built against 4.15.0-rc2 plus whatever other patches are
in the RDMA tree.

I was testing Bart's tree when I panicked, and we now know we have an
issue in mlx5/ib.

I am waiting to see what Leon and the RDMA folks want to do so I can
avoid another bisect, but if I have to instrument and/or bisect I will
do it.

Regards
Laurence

