* mlx5 core/en oops in 4.6-rc6+
@ 2016-05-05 16:00 Doug Ledford
2016-05-05 16:42 ` Saeed Mahameed
0 siblings, 1 reply; 7+ messages in thread
From: Doug Ledford @ 2016-05-05 16:00 UTC (permalink / raw)
To: Linux Netdev List
Just had this pop up during testing, happened very soon after bootup:
[ 47.235925] BUG: unable to handle kernel NULL pointer dereference at
00000000000001e8
[ 47.245057] IP: [<ffffffffc0328b9c>] mlx5e_sq_xmit+0x1c/0xd80 [mlx5_core]
[ 47.252822] PGD 0
[ 47.255218] Oops: 0000 [#1] SMP
[ 47.259070] Modules linked in: sch_mqprio bridge 8021q garp mrp stp
llc ib_iser libiscsi scsi_transport_iscsi ib_srp scsi_transport_srp
ib_ipoib rdma_ucm ib_ucm ib_uverbs ib_umad rdma_cm ib_cm iw_cm ib_sa
ib_mad x86_pkg_temp_thermal coretd
[ 47.352984] CPU: 18 PID: 1358 Comm: NetworkManager Not tainted
4.6.0-rc6-00004-g7199787 #102
[ 47.362460] Hardware name: Dell Inc. PowerEdge R430/03XKDV, BIOS
1.6.2 01/08/2016
[ 47.370869] task: ffff88103369d000 ti: ffff88103751c000 task.ti:
ffff88103751c000
[ 47.379263] RIP: 0010:[<ffffffffc0328b9c>] [<ffffffffc0328b9c>]
mlx5e_sq_xmit+0x1c/0xd80 [mlx5_core]
[ 47.389627] RSP: 0018:ffff88103751f7d0 EFLAGS: 00010282
[ 47.395574] RAX: ffff880fe6f51d00 RBX: 0000000000000000 RCX:
0000000000000081
[ 47.403571] RDX: ffff880ff1dc3000 RSI: ffff880fe6f51d00 RDI:
0000000000000000
[ 47.411561] RBP: ffff88103751f828 R08: 0000000000020c80 R09:
ffffffff81871e04
[ 47.419563] R10: ffffea003f9bd400 R11: ffff88100116de00 R12:
000000000000003e
[ 47.427566] R13: ffff880fe6f51d00 R14: ffff8810240d0090 R15:
ffff8810240d0068
[ 47.435557] FS: 00007fd79b882dc0(0000) GS:ffff88103ee40000(0000)
knlGS:0000000000000000
[ 47.444625] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 47.451062] CR2: 00000000000001e8 CR3: 0000001cf86c5000 CR4:
00000000001406e0
[ 47.459053] Stack:
[ 47.461306] ffffffff81875480 ffff880fe6f50c00 ffff881d02f9b800
ffff88103751f838
[ 47.469647] ffffffff81a08415 ffff88103751f818 ffff880fe6f51d00
000000000000003e
[ 47.477964] ffff881d02f9bd00 ffff8810240d0090 ffff8810240d0068
ffff88103751f838
[ 47.486279] Call Trace:
[ 47.489019] [<ffffffff81875480>] ? consume_skb+0x80/0x150
[ 47.495178] [<ffffffff81a08415>] ? packet_rcv+0x65/0x6d0
[ 47.501244] [<ffffffffc03299ae>] mlx5e_xmit+0x2e/0x40 [mlx5_core]
[ 47.508169] [<ffffffff818959d4>] dev_hard_start_xmit+0x384/0x650
[ 47.515007] [<ffffffff818951bb>] ? validate_xmit_skb.isra.80+0x4b/0x4e0
[ 47.522516] [<ffffffff818d036f>] sch_direct_xmit+0x19f/0x360
[ 47.528963] [<ffffffff81896565>] __dev_queue_xmit+0x6e5/0xaa0
[ 47.535502] [<ffffffff81875480>] ? consume_skb+0x80/0x150
[ 47.542723] [<ffffffff81896958>] dev_queue_xmit+0x18/0x30
[ 47.549856] [<ffffffffc08d1d54>]
vlan_dev_hard_start_xmit+0x104/0x210 [8021q]
[ 47.558933] [<ffffffff818959d4>] dev_hard_start_xmit+0x384/0x650
[ 47.566738] [<ffffffff8189675a>] __dev_queue_xmit+0x8da/0xaa0
[ 47.574246] [<ffffffff81896958>] dev_queue_xmit+0x18/0x30
[ 47.581349] [<ffffffff818a2d07>] neigh_connected_output+0x107/0x170
[ 47.589433] [<ffffffff819a3e9f>] ip6_finish_output2+0x23f/0x720
[ 47.597128] [<ffffffff81430f32>] ? selinux_ipv6_postroute+0x22/0x30
[ 47.605207] [<ffffffff819a666b>] ip6_finish_output+0x13b/0x1e0
[ 47.612809] [<ffffffff819a6777>] ip6_output+0x67/0x1c0
[ 47.619619] [<ffffffff819a6530>] ? ip6_fragment+0xd80/0xd80
[ 47.626903] [<ffffffff819fb80d>] ip6_local_out+0x4d/0x60
[ 47.633884] [<ffffffff819a703b>] ip6_send_skb+0x2b/0xb0
[ 47.640773] [<ffffffff819a713d>] ip6_push_pending_frames+0x7d/0x90
[ 47.648710] [<ffffffff819d533d>] rawv6_sendmsg+0xd2d/0x1210
[ 47.655938] [<ffffffff8128f70a>] ? do_wp_page+0x3ba/0x910
[ 47.662944] [<ffffffff8142a970>] ? sock_has_perm+0x80/0xb0
[ 47.670020] [<ffffffff8194f2c7>] inet_sendmsg+0x97/0xf0
[ 47.676778] [<ffffffff818673f8>] sock_sendmsg+0x58/0x90
[ 47.683505] [<ffffffff81868148>] SYSC_sendto+0x138/0x1b0
[ 47.690302] [<ffffffff8109d5a8>] ? __do_page_fault+0x338/0x9d0
[ 47.697656] [<ffffffff8116b131>] ? ktime_get_with_offset+0x71/0x130
[ 47.705481] [<ffffffff81163ee7>] ? posix_get_boottime+0x37/0x60
[ 47.712904] [<ffffffff81868b36>] SyS_sendto+0x16/0x20
[ 47.719346] [<ffffffff81a336b2>] entry_SYSCALL_64_fastpath+0x1a/0xa4
[ 47.727230] Code: 05 a9 9f 03 00 01 66 31 47 48 5d c3 0f 1f 00 0f 1f
44 00 00 55 48 89 e5 41 57 41 56 41 55 49 89 f5 41 54 53 48 89 fb 48 83
ec 30 <0f> b7 87 e8 01 00 00 0f b6 8f ea 01 00 00 45 8b 95 80 00 00 00
[ 47.750336] RIP [<ffffffffc0328b9c>] mlx5e_sq_xmit+0x1c/0xd80
[mlx5_core]
[ 47.758755] RSP <ffff88103751f7d0>
[ 47.763368] CR2: 00000000000001e8
[ 47.767779] ---[ end trace 35565b04ca44e521 ]---
It appears to be intermittent as this machine has booted this kernel
multiple times without hitting this. Network setup includes both vlan
and non-vlan interfaces. If you need more info from me, please include
me on the Cc: as I don't follow netdev@
--
Doug Ledford <dledford@redhat.com>
GPG KeyID: 0E572FDD
^ permalink raw reply [flat|nested] 7+ messages in thread
* Re: mlx5 core/en oops in 4.6-rc6+
2016-05-05 16:00 mlx5 core/en oops in 4.6-rc6+ Doug Ledford
@ 2016-05-05 16:42 ` Saeed Mahameed
2016-05-05 17:16 ` Doug Ledford
0 siblings, 1 reply; 7+ messages in thread
From: Saeed Mahameed @ 2016-05-05 16:42 UTC (permalink / raw)
To: Doug Ledford; +Cc: Linux Netdev List
On Thu, May 5, 2016 at 7:00 PM, Doug Ledford <dledford@redhat.com> wrote:
> Just had this pop up during testing, happened very soon after bootup:
>
[ snip oops ]
>
> It appears to be intermittent as this machine has booted this kernel
> multiple times without hitting this. Network setup includes both vlan
> and non-vlan interfaces. If you need more info from me, please include
> me on the Cc: as I don't follow netdev@
>
Hi Doug,

Did you by chance configure TC queues for the netdev, i.e. dev->num_tc > 1?
If not, I would be happy to get more info on your network configuration.
* Re: mlx5 core/en oops in 4.6-rc6+
2016-05-05 16:42 ` Saeed Mahameed
@ 2016-05-05 17:16 ` Doug Ledford
2016-05-05 20:51 ` Saeed Mahameed
0 siblings, 1 reply; 7+ messages in thread
From: Doug Ledford @ 2016-05-05 17:16 UTC (permalink / raw)
To: Saeed Mahameed; +Cc: Linux Netdev List
On 05/05/2016 12:42 PM, Saeed Mahameed wrote:
> On Thu, May 5, 2016 at 7:00 PM, Doug Ledford <dledford@redhat.com> wrote:
>> Just had this pop up during testing, happened very soon after bootup:
>>
[ snip oops ]
> Hi Doug,
>
> Did you by chance configure TC queues for the netdev, i.e. dev->num_tc > 1?
> If not, I would be happy to get more info on your network configuration.

That depends on which interface actually generated the oops. If it was
the base interface, then I don't manually set any special params on it.
If it's one of the vlan interfaces, then there is a NetworkManager
dispatcher script that is intended to set the tc count on interface up:

[root@rdma-virt-03 ~]$ more /etc/NetworkManager/dispatcher.d/98-mlx5_roce.4*
::::::::::::::
/etc/NetworkManager/dispatcher.d/98-mlx5_roce.43-egress.conf
::::::::::::::
#!/bin/sh
interface=$1
status=$2
[ "$interface" = mlx5_roce.43 ] || exit 0
case $status in
up)
	tc qdisc add dev mlx5_roce root mqprio num_tc 8 map 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5
	# tc_wrap.py -i mlx5_roce -u 5,5,5,5,5,5,5,5,5,5,5,5,5,5,5,5
	;;
esac
::::::::::::::
/etc/NetworkManager/dispatcher.d/98-mlx5_roce.45-egress.conf
::::::::::::::
#!/bin/sh
interface=$1
status=$2
[ "$interface" = mlx5_roce.45 ] || exit 0
case $status in
up)
	tc qdisc add dev mlx5_roce root mqprio num_tc 8 map 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5
	# tc_wrap.py -i mlx5_roce -u 5,5,5,5,5,5,5,5,5,5,5,5,5,5,5,5
	;;
esac
[root@rdma-virt-03 ~]$

However, I should note that this usage of tc is a bit out of date last I
checked and doesn't even work any more. Let me double check...

[root@rdma-virt-02 vlan]$ cd /proc/net/vlan/
[root@rdma-virt-02 vlan]$ ls
config  mlx5_roce.43  mlx5_roce.45
[root@rdma-virt-02 vlan]$
[root@rdma-virt-02 vlan]$ for i in *; do echo "$i:"; cat $i; echo; done
config:
VLAN Dev name | VLAN ID
Name-Type: VLAN_NAME_TYPE_RAW_PLUS_VID_NO_PAD
mlx5_roce.45 | 45 | mlx5_roce
mlx5_roce.43 | 43 | mlx5_roce

mlx5_roce.43:
mlx5_roce.43 VID: 43 REORDER_HDR: 1 dev->priv_flags: 1001
total frames received 57
total bytes received 5010
Broadcast/Multicast Rcvd 0

total frames transmitted 20
total bytes transmitted 2525
Device: mlx5_roce
INGRESS priority mappings: 0:0 1:0 2:0 3:0 4:0 5:0 6:0 7:0
EGRESS priority mappings: 0:3 1:3 2:3 3:3 4:3 5:3 6:3 7:3

mlx5_roce.45:
mlx5_roce.45 VID: 45 REORDER_HDR: 1 dev->priv_flags: 1001
total frames received 57
total bytes received 5010
Broadcast/Multicast Rcvd 0

total frames transmitted 21
total bytes transmitted 2603
Device: mlx5_roce
INGRESS priority mappings: 0:0 1:0 2:0 3:0 4:0 5:0 6:0 7:0
EGRESS priority mappings: 0:5 1:5 2:5 3:5 4:5 5:5 6:5 7:5

OK, so the vlans have egress mappings, but they don't match what the
mlx5_roce.43 egress.conf file should have enabled. Digging a little
further on this machine:

[root@rdma-virt-03 vlan]$ more /etc/sysconfig/network-scripts/ifcfg-mlx5_roce.4?
::::::::::::::
/etc/sysconfig/network-scripts/ifcfg-mlx5_roce.43
::::::::::::::
DEVICE=mlx5_roce.43
VLAN=yes
VLAN_ID=43
VLAN_EGRESS_PRIORITY_MAP=0:3,1:3,2:3,3:3,4:3,5:3,6:3,7:3
TYPE=Vlan
ONBOOT=yes
BOOTPROTO=dhcp
DEFROUTE=no
PEERDNS=no
PEERROUTES=yes
IPV4_FAILURE_FATAL=yes
IPV6INIT=yes
IPV6_AUTOCONF=yes
IPV6_DEFROUTE=no
IPV6_PEERDNS=no
IPV6_PEERROUTES=yes
IPV6_FAILURE_FATAL=no
NAME=mlx5_roce.43
::::::::::::::
/etc/sysconfig/network-scripts/ifcfg-mlx5_roce.45
::::::::::::::
DEVICE=mlx5_roce.45
VLAN=yes
VLAN_ID=45
VLAN_EGRESS_PRIORITY_MAP=0:5,1:5,2:5,3:5,4:5,5:5,6:5,7:5
TYPE=Vlan
ONBOOT=yes
BOOTPROTO=dhcp
DEFROUTE=no
PEERDNS=no
PEERROUTES=yes
IPV4_FAILURE_FATAL=yes
IPV6INIT=yes
IPV6_AUTOCONF=yes
IPV6_DEFROUTE=no
IPV6_PEERDNS=no
IPV6_PEERROUTES=yes
IPV6_FAILURE_FATAL=no
NAME=mlx5_roce.45
[root@rdma-virt-03 vlan]$

This is a Fedora rawhide machine, using NetworkManager to handle the
network interfaces. So, the egress priority mappings are being set by
NM. I don't know if they are overriding the egress mapping dispatchers
or if the egress mapping dispatchers are failing to work/run properly.
It might be the latter. Let me double check the command...

OK, re-reading the egress dispatchers above, they work on the base
interface, not on the vlan interface that triggers them. That's why
they both use the same command (mapping to egress 5) instead of being
like the ifcfg files, which map the 43 vlan to egress priority 3, and
the 45 vlan to egress priority 5. Running tc qdisc | grep mlx5_roce
shows that the egress mapping is being applied (although I'm not sure it
should be...I made that mapping many kernels ago when that was the right
thing to do, the modern mlx5 ethernet drivers create their own mappings
that are drastically different).

So, to answer your question, yes, num_tc > 1, num_tc == 8, and I
probably need to reconfigure that egress dispatcher to do what I want it
to do on modern kernels. What I want is merely to make sure that all
packets from specific interfaces are tagged with specific vlan
priorities so per-priority flow control between the card and switch
works properly: the base interface is supposed to have no priority tag,
the 43 vlan is supposed to have priority tag 3, and vlan 45 is supposed
to have priority tag 5.

--
Doug Ledford <dledford@redhat.com>
GPG KeyID: 0E572FDD
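To make the tagging goal above concrete, here is a minimal user-space sketch in C of how a per-VLAN egress map (skb priority to VLAN priority) ends up in the 802.1Q tag. It is an illustration only, not kernel code: the function names vlan_tci and egress_tci are made up for this example, and treating out-of-range skb priorities as priority 0 is a simplifying assumption. The TCI layout itself (3-bit PCP, 1-bit DEI, 12-bit VID) comes from 802.1Q.

```c
#include <assert.h>
#include <stdint.h>

/* The /proc/net/vlan output above lists mappings for skb priorities 0..7. */
#define SKB_PRIO_COUNT 8

/* Build a 16-bit 802.1Q TCI: PCP (3 bits) | DEI (1 bit, left 0) | VID (12 bits). */
static uint16_t vlan_tci(uint8_t pcp, uint16_t vid)
{
    return (uint16_t)(((pcp & 0x7u) << 13) | (vid & 0x0FFFu));
}

/*
 * Hypothetical helper: look up the VLAN priority for an skb priority in a
 * per-VLAN egress map and return the resulting TCI. Priorities outside the
 * table are mapped to 0 here (an assumption made for this sketch).
 */
static uint16_t egress_tci(const uint8_t egress_map[SKB_PRIO_COUNT],
                           uint32_t skb_prio, uint16_t vid)
{
    uint8_t pcp = (skb_prio < SKB_PRIO_COUNT) ? egress_map[skb_prio] : 0;

    assert(vid < 4096); /* VID is a 12-bit field */
    return vlan_tci(pcp, vid);
}
```

With an all-3 map for VLAN 43 and an all-5 map for VLAN 45, as in the ifcfg files, every frame on those VLANs carries priority 3 or 5 respectively, whatever skb priority the sender used.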
* Re: mlx5 core/en oops in 4.6-rc6+
2016-05-05 17:16 ` Doug Ledford
@ 2016-05-05 20:51 ` Saeed Mahameed
2016-05-12 17:28 ` Doug Ledford
0 siblings, 1 reply; 7+ messages in thread
From: Saeed Mahameed @ 2016-05-05 20:51 UTC (permalink / raw)
To: Doug Ledford; +Cc: Linux Netdev List
On Thu, May 5, 2016 at 8:16 PM, Doug Ledford <dledford@redhat.com> wrote:
>
> That depends on which interface actually generated the oops. If it was
> the base interface, then I don't manually set any special params on it.
> If it's one of the vlan interfaces, then there is a NetworkManager
> dispatcher script that is intended to set the tc count on interface up:
>
> [root@rdma-virt-03 ~]$ more /etc/NetworkManager/dispatcher.d/98-mlx5_roce.4*
> ::::::::::::::
> /etc/NetworkManager/dispatcher.d/98-mlx5_roce.43-egress.conf
> ::::::::::::::
> #!/bin/sh
> interface=$1
> status=$2
> [ "$interface" = mlx5_roce.43 ] || exit 0
> case $status in
> up)
> tc qdisc add dev mlx5_roce root mqprio num_tc 8 map 5 5 5 5 5 5 5 5 5 5
> 5 5 5 5 5 5

Well, here you are configuring 8 TCs on the base mlx5 interface, so
the answer to my question is yes.

It appears that we have a bug in mlx5e_select_queue:

	int channel_ix = fallback(dev, skb);
	return priv->channeltc_to_txq_map[channel_ix][tc];

When num_tc > 1, the fallback can return any value between
[0 .. num_channels * num_tc], while channeltc_to_txq_map is an array
of size num_channels.

So there is a good chance that channel_ix exceeds the array bounds,
resulting in an oops.

> # tc_wrap.py -i mlx5_roce -u 5,5,5,5,5,5,5,5,5,5,5,5,5,5,5,5
> ;;
> esac
> ::::::::::::::
> /etc/NetworkManager/dispatcher.d/98-mlx5_roce.45-egress.conf
> ::::::::::::::
> #!/bin/sh
> interface=$1
> status=$2
> [ "$interface" = mlx5_roce.45 ] || exit 0
> case $status in
> up)
> tc qdisc add dev mlx5_roce root mqprio num_tc 8 map 5 5 5 5 5 5 5 5 5 5
> 5 5 5 5 5 5

Well, here you map all user skb prios (skb->priority) to HW TC 5.
BTW, the skprio or user prio in this example is never the vlan prio;
it is the IPv4 ToS.

Please see http://lartc.org/manpages/tc-prio.html

So to achieve a vlan prio to HW TC mapping, you will need to map the
skprios to vlan prios using the vlan egress mapping, which I see you
already do down below.

But our select queue implementation will extract the vlan priority
and use the corresponding TC from our own
priv->channeltc_to_txq_map[channel_ix][up] mapping, where up is the
vlan user priority. This only applies to kernel traffic, though; I
don't see why it is needed for RoCE.

Currently this code is buggy and I will need to dig more into how to
provide a full working solution that fits our hardware requirements and
complies with the kernel QoS APIs.

[...]

> [root@rdma-virt-02 vlan]$ for i in *; do echo "$i:"; cat $i; echo; done
> config:
> VLAN Dev name | VLAN ID
> Name-Type: VLAN_NAME_TYPE_RAW_PLUS_VID_NO_PAD
> mlx5_roce.45 | 45 | mlx5_roce
> mlx5_roce.43 | 43 | mlx5_roce
>
> mlx5_roce.43:
> mlx5_roce.43 VID: 43 REORDER_HDR: 1 dev->priv_flags: 1001
> total frames received 57
> total bytes received 5010
> Broadcast/Multicast Rcvd 0
>
> total frames transmitted 20
> total bytes transmitted 2525
> Device: mlx5_roce
> INGRESS priority mappings: 0:0 1:0 2:0 3:0 4:0 5:0 6:0 7:0
> EGRESS priority mappings: 0:3 1:3 2:3 3:3 4:3 5:3 6:3 7:3
>

Here you map every skb prio (0..7) to vlan priority 3.

>
> mlx5_roce.45:
> mlx5_roce.45 VID: 45 REORDER_HDR: 1 dev->priv_flags: 1001
> total frames received 57
> total bytes received 5010
> Broadcast/Multicast Rcvd 0
>
> total frames transmitted 21
> total bytes transmitted 2603
> Device: mlx5_roce
> INGRESS priority mappings: 0:0 1:0 2:0 3:0 4:0 5:0 6:0 7:0
> EGRESS priority mappings: 0:5 1:5 2:5 3:5 4:5 5:5 6:5 7:5
>
> OK, so the vlans have egress mappings, but they don't match what the
> mlx5_roce.43 egress.conf file should have enabled. Digging a little
> further on this machine:
>
> [root@rdma-virt-03 vlan]$ more
> /etc/sysconfig/network-scripts/ifcfg-mlx5_roce.4?
> ::::::::::::::
> /etc/sysconfig/network-scripts/ifcfg-mlx5_roce.43
> ::::::::::::::
> DEVICE=mlx5_roce.43
> VLAN=yes
> VLAN_ID=43
> VLAN_EGRESS_PRIORITY_MAP=0:3,1:3,2:3,3:3,4:3,5:3,6:3,7:3
> TYPE=Vlan
> ONBOOT=yes
> BOOTPROTO=dhcp
> DEFROUTE=no
> PEERDNS=no
> PEERROUTES=yes
> IPV4_FAILURE_FATAL=yes
> IPV6INIT=yes
> IPV6_AUTOCONF=yes
> IPV6_DEFROUTE=no
> IPV6_PEERDNS=no
> IPV6_PEERROUTES=yes
> IPV6_FAILURE_FATAL=no
> NAME=mlx5_roce.43
> ::::::::::::::
> /etc/sysconfig/network-scripts/ifcfg-mlx5_roce.45
> ::::::::::::::
> DEVICE=mlx5_roce.45
> VLAN=yes
> VLAN_ID=45
> VLAN_EGRESS_PRIORITY_MAP=0:5,1:5,2:5,3:5,4:5,5:5,6:5,7:5
> TYPE=Vlan
> ONBOOT=yes
> BOOTPROTO=dhcp
> DEFROUTE=no
> PEERDNS=no
> PEERROUTES=yes
> IPV4_FAILURE_FATAL=yes
> IPV6INIT=yes
> IPV6_AUTOCONF=yes
> IPV6_DEFROUTE=no
> IPV6_PEERDNS=no
> IPV6_PEERROUTES=yes
> IPV6_FAILURE_FATAL=no
> NAME=mlx5_roce.45
> [root@rdma-virt-03 vlan]$
>
> This is a Fedora rawhide machine, using NetworkManager to handle the
> network interfaces. So, the egress priority mappings are being set by
> NM. I don't know if they are overriding the egress mapping dispatchers
> or if the egress mapping dispatchers are failing to work/run properly.
> It might be the latter. Let me double check the command...
>
> OK, re-reading the egress dispatchers above, they work on the base
> interface, not on the vlan interface that triggers them. That's why
> they both use the same command (mapping to egress 5) instead of being
> like the ifcfg files, which map the 43 vlan to egress priority 3, and
> the 45 vlan to egress priority 5. Running tc qdisc | grep mlx5_roce
> shows that the egress mapping is being applied (although I'm not sure it
> should be...I made that mapping many kernels ago when that was the right
> thing to do, the modern mlx5 ethernet drivers create their own mappings
> that are drastically different).
>
> So, to answer your question, yes, num_tc > 1, num_tc == 8, and I
> probably need to reconfigure that egress dispatcher to do what I want it
> to do (which is merely to make sure that all packets from specific
> interfaces are tagged with specific vlan priorities so per-priority flow
> control between the card and switch works properly, the base interface
> is supposed to have no priority tag, the 43 vlan is supposed to have
> priority tag 3, and vlan 45 is supposed to have priority tag 5) on
> modern kernels.
>

As I said above, configuring any num_tc > 1 might cause the panic you saw.

Regarding the proper mapping for 45 => priority 5 and 43 => priority 3:
the egress mappings you already did above should be sufficient. The
question is, do you need the vlan priorities to be mapped to specific
HW TC dispatchers?

If not, then you don't need to configure "tc qdisc add dev mlx5_roce
root ..." at all.
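The indexing problem Saeed describes can be modelled in user space. The C sketch below is only an illustration under stated assumptions: NUM_CHANNELS, the table layout in init_map, and both function names are invented for this example, and folding the index with a modulo is one possible guard, not necessarily what the driver fix does. The unchecked variant returns -1 at the point where the pattern quoted above would read past the end of the table.

```c
#include <assert.h>

#define NUM_CHANNELS 4 /* hypothetical channel count */
#define NUM_TC       8 /* matches "num_tc 8" in the dispatcher script */

/* Stand-in for the per-channel, per-TC lookup table described in the thread. */
static int channeltc_to_txq_map[NUM_CHANNELS][NUM_TC];

/* Fill the table with a made-up layout: txq = tc * NUM_CHANNELS + channel. */
static void init_map(void)
{
    for (int ch = 0; ch < NUM_CHANNELS; ch++)
        for (int tc = 0; tc < NUM_TC; tc++)
            channeltc_to_txq_map[ch][tc] = tc * NUM_CHANNELS + ch;
}

/*
 * Pattern from the thread: the fallback value is used directly as the row
 * index. With num_tc > 1 it can be >= NUM_CHANNELS; this model reports -1
 * where the unguarded lookup would index outside the table.
 */
static int select_txq_unchecked(unsigned int fallback_ix, unsigned int tc)
{
    assert(tc < NUM_TC);
    if (fallback_ix >= NUM_CHANNELS)
        return -1;
    return channeltc_to_txq_map[fallback_ix][tc];
}

/* One possible guard (an assumption): fold the index back into range. */
static int select_txq_folded(unsigned int fallback_ix, unsigned int tc)
{
    assert(tc < NUM_TC);
    return channeltc_to_txq_map[fallback_ix % NUM_CHANNELS][tc];
}
```

With 4 channels and 8 TCs the fallback can name up to 32 queues, so most of its possible return values fall outside the 4-row table, which fits the intermittent nature of the oops.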
* Re: mlx5 core/en oops in 4.6-rc6+
2016-05-05 20:51 ` Saeed Mahameed
@ 2016-05-12 17:28 ` Doug Ledford
2016-05-19 17:13 ` Eran Ben Elisha
0 siblings, 1 reply; 7+ messages in thread
From: Doug Ledford @ 2016-05-12 17:28 UTC (permalink / raw)
To: Saeed Mahameed; +Cc: Linux Netdev List
On 05/05/2016 04:51 PM, Saeed Mahameed wrote:
> On Thu, May 5, 2016 at 8:16 PM, Doug Ledford <dledford@redhat.com> wrote:
>>
>> That depends on which interface actually generated the oops. If it was
>> the base interface, then I don't manually set any special params on it.
>> If it's one of the vlan interfaces, then there is a NetworkManager
>> dispatcher script that is intended to set the tc count on interface up:
>>
>> [root@rdma-virt-03 ~]$ more /etc/NetworkManager/dispatcher.d/98-mlx5_roce.4*
>> ::::::::::::::
>> /etc/NetworkManager/dispatcher.d/98-mlx5_roce.43-egress.conf
>> ::::::::::::::
>> #!/bin/sh
>> interface=$1
>> status=$2
>> [ "$interface" = mlx5_roce.43 ] || exit 0
>> case $status in
>> up)
>> tc qdisc add dev mlx5_roce root mqprio num_tc 8 map 5 5 5 5 5 5 5 5 5 5
>> 5 5 5 5 5 5
>
> Well, here you are configuring 8 TCs on the base mlx5 interface, so
> the answer to my question is yes.

Correct. I mentioned that at the end of my email ;-)

> It appears that we have a bug in mlx5e_select_queue:
>
> 	int channel_ix = fallback(dev, skb);
> 	return priv->channeltc_to_txq_map[channel_ix][tc];
>
> When num_tc > 1, the fallback can return any value between
> [0 .. num_channels * num_tc], while channeltc_to_txq_map is an array
> of size num_channels.
>
> So there is a good chance that channel_ix exceeds the array bounds,
> resulting in an oops.
>
>> # tc_wrap.py -i mlx5_roce -u 5,5,5,5,5,5,5,5,5,5,5,5,5,5,5,5
>> ;;
>> esac
>> ::::::::::::::
>> /etc/NetworkManager/dispatcher.d/98-mlx5_roce.45-egress.conf
>> ::::::::::::::
>> #!/bin/sh
>> interface=$1
>> status=$2
>> [ "$interface" = mlx5_roce.45 ] || exit 0
>> case $status in
>> up)
>> tc qdisc add dev mlx5_roce root mqprio num_tc 8 map 5 5 5 5 5 5 5 5 5 5
>> 5 5 5 5 5 5
>
> Well, here you map all user skb prios (skb->priority) to HW TC 5.
> BTW, the skprio or user prio in this example is never the vlan prio;
> it is the IPv4 ToS.
>
> Please see http://lartc.org/manpages/tc-prio.html

Ok.

> So to achieve a vlan prio to HW TC mapping, you will need to map the
> skprios to vlan prios using the vlan egress mapping, which I see you
> already do down below.

I do, and this is all related to trying to get PFC working for RoCE on
these cards. For the most part, the things you see here are documented
in the Mellanox guides related to RoCE setup, or they are things I
pulled from the tc_wrap.py program that you guys distribute for setting
this stuff up.

> But our select queue implementation will extract the vlan priority
> and use the corresponding TC from our own
> priv->channeltc_to_txq_map[channel_ix][up] mapping, where up is the
> vlan user priority. This only applies to kernel traffic, though; I
> don't see why it is needed for RoCE.

Read your own guides ;-).

I'm using this one for your switches:
https://community.mellanox.com/docs/DOC-1417

And these to try and get the linux machines configured properly:
https://community.mellanox.com/docs/DOC-1414
https://community.mellanox.com/docs/DOC-1415
https://community.mellanox.com/docs/DOC-2311
https://community.mellanox.com/docs/DOC-2474
http://www.mellanox.com/related-docs/prod_software/RoCE_with_Priority_Flow_Control_Application_Guide.pdf

The guides are helpful if your setup allows you to follow their exact
example. But they are short on information about how to modify the
examples to your specific situation. For instance, I have to use vlan
priority 5 as my no-drop priority for RoCE traffic. I can't reliably
tell in which portions of the guide I must switch the 3s to 5s in order
to get the new priority, and which uses of 3s in the guides relate to
other things that could be mapped to 5. On a separate note, it's unclear
to me if your switches and cards support more than one no-drop priority
(other vendors' RoCE cards I'm using here don't; they only allow one
no-drop priority for RoCE traffic and it must be 5). If they do support
more than one, I'd actually like both 3 and 5 to be no-drop and for one
vlan to use 3 and another to use 5.

> As I said above, configuring any num_tc > 1 might cause the panic you saw.
>
> Regarding the proper mapping for 45 => priority 5 and 43 => priority 3:
> the egress mappings you already did above should be sufficient. The
> question is, do you need the vlan priorities to be mapped to specific
> HW TC dispatchers?

You'd have to tell me. The switch docs make it clear that it's best if
no-drop priorities are mapped to TC1 or TC2 (which is not necessarily
the same as the TC mapping you refer to here as far as I know, but it
might be similar). The doc on setting up ConnectX-4 cards talks about
the same basic TC dispatchers on the card, but instead of 4 like the
switches have, there are 8. So, does the card's built-in
firmware/silicon have a preference for where no-drop traffic is queued
via TC dispatchers like the switches do?

>
> If not, then you don't need to configure "tc qdisc add dev mlx5_roce
> root ..." at all.

That appears to be a question for Mellanox to answer. I can't say.

--
Doug Ledford <dledford@redhat.com>
GPG KeyID: 0E572FDD
* Re: mlx5 core/en oops in 4.6-rc6+
2016-05-12 17:28 ` Doug Ledford
@ 2016-05-19 17:13 ` Eran Ben Elisha
2016-06-08 12:48 ` Doug Ledford
0 siblings, 1 reply; 7+ messages in thread
From: Eran Ben Elisha @ 2016-05-19 17:13 UTC (permalink / raw)
To: Doug Ledford; +Cc: Saeed Mahameed, Linux Netdev List, ophirm, Eran Ben Elisha
Hi Doug,
Attaching here a response from Ophir Maor (from the Mellanox community).

>
> Read your own guides ;-).
>
> I'm using this one for your switches:
> https://community.mellanox.com/docs/DOC-1417
>
> And these to try and get the linux machines configured properly:
> https://community.mellanox.com/docs/DOC-1414
> https://community.mellanox.com/docs/DOC-1415
> https://community.mellanox.com/docs/DOC-2311
> https://community.mellanox.com/docs/DOC-2474
> http://www.mellanox.com/related-docs/prod_software/RoCE_with_Priority_Flow_Control_Application_Guide.pdf
>
> The guides are helpful if your setup allows you to follow their exact
> example. But they are short on information about how to modify the
> examples to your specific situation. For instance, I have to use vlan
> priority 5 as my no-drop priority for RoCE traffic. I can't reliably
> tell in which portions of the guide I must switch the 3s to 5s in order
> to get the new priority, and which uses of 3s in the guides relate to
> other things that could be mapped to 5. On a separate note, it's unclear
> to me if your switches and cards support more than one no-drop priority
> (other vendors' RoCE cards I'm using here don't; they only allow one
> no-drop priority for RoCE traffic and it must be 5). If they do support
> more than one, I'd actually like both 3 and 5 to be no-drop and for one
> vlan to use 3 and another to use 5.

There are two kinds of flows to configure egress mapping for:

- Flows that pass via the kernel. For these you need to use kernel
commands (e.g. vconfig set_egress_map, or other commands) to make the
kernel set the egress priority.

- Flows that bypass the kernel, such as RoCE. For these you need to use
tc_wrap to set the egress mapping.

This post explains it very nicely for ConnectX-4.

https://community.mellanox.com/docs/DOC-2474

Thanks,
Ophir.
* Re: mlx5 core/en oops in 4.6-rc6+
2016-05-19 17:13 ` Eran Ben Elisha
@ 2016-06-08 12:48 ` Doug Ledford
0 siblings, 0 replies; 7+ messages in thread
From: Doug Ledford @ 2016-06-08 12:48 UTC (permalink / raw)
To: Eran Ben Elisha
Cc: Saeed Mahameed, Linux Netdev List, ophirm, Eran Ben Elisha
On 5/19/2016 1:13 PM, Eran Ben Elisha wrote:
> Hi Doug,
> Attaching here a response from Ophir Maor (from the Mellanox community).

This conversation is a low priority, spare time thread for me, so it can
take a while to respond to sometimes ;-)

>>
>> Read your own guides ;-).
>>
>> I'm using this one for your switches:
>> https://community.mellanox.com/docs/DOC-1417
>>
>> And these to try and get the linux machines configured properly:
>> https://community.mellanox.com/docs/DOC-1414
>> https://community.mellanox.com/docs/DOC-1415
>> https://community.mellanox.com/docs/DOC-2311
>> https://community.mellanox.com/docs/DOC-2474
>> http://www.mellanox.com/related-docs/prod_software/RoCE_with_Priority_Flow_Control_Application_Guide.pdf
>>
>> The guides are helpful if your setup allows you to follow their exact
>> example. But they are short on information about how to modify the
>> examples to your specific situation. For instance, I have to use vlan
>> priority 5 as my no-drop priority for RoCE traffic. I can't reliably
>> tell in which portions of the guide I must switch the 3s to 5s in order
>> to get the new priority, and which uses of 3s in the guides relate to
>> other things that could be mapped to 5. On a separate note, it's unclear
>> to me if your switches and cards support more than one no-drop priority
>> (other vendors' RoCE cards I'm using here don't; they only allow one
>> no-drop priority for RoCE traffic and it must be 5). If they do support
>> more than one, I'd actually like both 3 and 5 to be no-drop and for one
>> vlan to use 3 and another to use 5.
>
> There are two kinds of flows to configure egress mapping for:
>
> - Flows that pass via the kernel. For these you need to use kernel
> commands (e.g. vconfig set_egress_map, or other commands) to make the
> kernel set the egress priority.

Yes. Done. Which actually has nothing to do with RoCE (I don't think
even kernel RoCE flows go through this since they don't use the kernel
net stack but use the card's firmware and RoCE work requests to send
data) and is just part of the Mellanox recommended "put all traffic on
this vlan on this priority even if it isn't all RoCE". I'm not sure I
agree with it, and explanations that specifically exclude it to make
things clearer would be nice.

> - Flows that bypass the kernel, such as RoCE. For these you need to use
> tc_wrap to set the egress mapping.

tc_wrap is not an explanation, nor really a suitable answer to "how do I
do this" as it's out of date for the current upstream kernels last I
checked...

> This post explains it very nicely for ConnectX-4.
>
> https://community.mellanox.com/docs/DOC-2474

Yes, I read this post, and I downloaded tc_wrap from Mellanox, and I
dissected tc_wrap to figure out it was doing what I added to my
dispatcher file, namely this:

tc qdisc add dev mlx4_roce root mqprio num_tc 8 map 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 queues 32@0 32@32 32@64 32@96 32@128 32@160 32@192 32@224

But even though I was able to pull that out of tc_wrap, the explanation
is completely missing of how setting what appears to be a kernel queue
discipline causes packets that the kernel never sees, and that are
handled entirely by the card, to be sent with a specific priority. What
is the chain here? Does setting the queue discipline here translate to a
setting on the card and there is some magic in that setting that
triggers the firmware to do the right thing on RoCE packets? Does the
driver read the queue disc when setting up address handles to use on the
work requests and get the information that way? How is this information
actually making it to the packet generation engine in the firmware? And
given how recent upstream kernels have changed the default queue
discipline on these cards, it is unclear how this command might need to
be modified to keep working.
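For reference, the kernel-side meaning of the mqprio arguments in the command above can be written out as two small tables. This C sketch covers only the documented "map" (skb priority to traffic class, 16 entries) and "queues count@offset" (one TX queue range per traffic class) semantics; the struct and function names are invented for illustration, and it says nothing about how, or whether, the setting reaches RoCE traffic, which is the open question in the message above.

```c
#include <assert.h>

#define NUM_TC     8  /* "num_tc 8" */
#define PRIO_COUNT 16 /* mqprio's map has 16 skb priority slots */

/* One "count@offset" pair from the "queues" argument. */
struct tc_queue_range {
    unsigned int count;
    unsigned int offset;
};

/* "map 5 5 5 ... 5": every skb priority is assigned to traffic class 5. */
static const unsigned int prio_tc_map[PRIO_COUNT] = {
    5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5
};

/* "queues 32@0 32@32 ... 32@224": 32 TX queues per traffic class. */
static const struct tc_queue_range tc_queues[NUM_TC] = {
    {32, 0},   {32, 32},  {32, 64},  {32, 96},
    {32, 128}, {32, 160}, {32, 192}, {32, 224}
};

/* Hypothetical helper: TX queue range used for a given skb priority. */
static struct tc_queue_range queues_for_prio(unsigned int skb_prio)
{
    /* Only the low 4 bits of the priority index the map. */
    unsigned int tc = prio_tc_map[skb_prio & (PRIO_COUNT - 1)];

    assert(tc < NUM_TC);
    return tc_queues[tc];
}
```

Read this way, the command steers all kernel-originated traffic on the base interface to the 32 queues starting at offset 160 (traffic class 5) and leaves the other seven classes unused.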