Netdev List
 help / color / mirror / Atom feed
* Re: Crash in netlink/sk_filter_trim_cap on ARMv7 on 4.18rc1
From: Eric Dumazet @ 2018-06-22 12:55 UTC (permalink / raw)
  To: Peter Robinson, netdev, linux-arm-kernel; +Cc: labbott
In-Reply-To: <CALeDE9PP__kPHX_aW24kwzGf9BgA0gQOQJSY+Qw0yFMOLn4Pcw@mail.gmail.com>



On 06/22/2018 04:19 AM, Peter Robinson wrote:
> Hi All,
> 
> I'm seeing this netlink/sk_filter_trim_cap crash on ARMv7 across quite
> a few ARMv7 platforms on Fedora with 4.18rc1. I've tested RPi2/RPi3
> (doesn't happen on aarch64), AllWinner H3, BeagleBone and a few
> others, both LPAE/normal kernels.
> 
> I'm a bit out of my depth in this part of the kernel but I'm wondering
> if it's known, I couldn't find anything that looked obvious on a few
> mailing lists.
> 
> Peter

Hi Peter

Could you provide symbolic information ?

Thanks !

^ permalink raw reply

* Re: bnx2x: kernel panic in the bnx2x driver
From: Vishwanath Pai @ 2018-06-22 13:22 UTC (permalink / raw)
  To: Kalluru, Sudarsana, Elior, Ariel, Dept-Eng Everest Linux L2
  Cc: davem@davemloft.net, netdev@vger.kernel.org, dbanerje@akamai.com,
	pai.vishwain@gmail.com
In-Reply-To: <MW2PR07MB4139C0F5622B56A2294A80AC8A750@MW2PR07MB4139.namprd07.prod.outlook.com>

Hi Sudarsana,

Thanks for taking a look at my email. The fix you suggested would
definitely fix the kernel panic, but at the same time wouldn't it also
silently ignore the request by ethtool to set rx-flow-hash?

Thanks,
Vishwanath

On 06/22/2018 06:20 AM, Kalluru, Sudarsana wrote:
> Hi Vishwanath,
>     Thanks for your mail, and the analysis.
> The fix would be to invoke bnx2x_rss() only when the device is opened,
> 	if (bp->state == BNX2X_STATE_OPEN)
> 		return bnx2x_rss(bp, &bp->rss_conf_obj, false, true);
> 	else
> 		return 0;
> Ariel,
>    Could you please review the path (bnx2x_set_rss_flags()--> bnx2x_rss()) and confirm/correct on the above.
> 
> Thanks,
> Sudarsana
> 
> -----Original Message-----
> From: Vishwanath Pai [mailto:vpai@akamai.com] 
> Sent: 22 June 2018 10:37
> To: Elior, Ariel <Ariel.Elior@cavium.com>; Dept-Eng Everest Linux L2 <Dept-EngEverestLinuxL2@cavium.com>
> Cc: davem@davemloft.net; netdev@vger.kernel.org; dbanerje@akamai.com; pai.vishwain@gmail.com
> Subject: bnx2x: kernel panic in the bnx2x driver
> 
> External Email
> 
> Hi,
> 
> We recently noticed a kernel panic in the bnx2x driver when trying to set rx-flow-hash parameters via ethtool during if-pre-up.d. I am running kernel
> v4.17.2 from ubuntu-mainline-ppa. I have added the stack trace below:
> 
> [   18.280209] BUG: unable to handle kernel NULL pointer dereference at 0000000000000000
> [   18.280212] PGD 8000000407a79067 P4D 8000000407a79067 PUD 40ce8a067 PMD 0
> [   18.280214] Oops: 0010 [#1] SMP PTI
> [   18.280215] Modules linked in: intel_rapl x86_pkg_temp_thermal intel_powerclamp kvm_intel joydev input_led kvm irqbypass crct10dif_pclmul crc32_pclmul ghash_clmulni_intel pcbc hid_eneric aesni_intel gpio_ich aes_x86_64 usbhid lpc_ich crpto_simd ie31200_edac cryptd glue_helper intel_cstate mac_hid intel_rapl_perf bnx2x mdio tcp_bbr netconsole ipmi_devintf ipmi_msghandler i2c_i801 coretemp autofs4 raid10 raid456 libcrc32c async_raid6_recov async_memcpy async_pq async_xor xor async_tx raid6_pq raid1 raid0 multipath linear sha26_mb mcryptd sha256_ssse3 hid ast i2c_algo_bit ttm drm_kms_helper syscopyarea sysfillrect sysimgblt mpt3sas fb_sys_fops drm raid_class scsi_transport_sas ahci libahci shpchp video
> [   18.280241] CPU: 6 PID: 1081 Comm: ethtool Not tainted 4.17.2-041702-generic #201806160433
> [   18.280242] Hardware name: Foxconn CangJie/CangJie, BIOS CC1F108D 02/26/2014
> [   18.280243] RIP: 0010:          (null)
> [   18.280243] RSP: 0018:ffffb84bc260b9c0 EFLAGS: 00010246
> [   18.280244] RAX: 0000000000000000 RBX: ffff92f987f020f0 RCX: 0000000000000000
> [   18.280245] RDX: 0000000000000000 RSI: ffffb84bc260b9f8 RDI: ffff92f987f020f0
> [   18.280245] RBP: ffffb8bc260b9e8 R08: 0000000000000001 R09: 0000000000000000
> [   18.280246] R10: ffffb84bc260bd20 R11: 0000000000000000 R12: ffffb84bc260b9f8
> [   18.280246] R13: ffff92f987f008c0 R14: 00007ffdb75bec40 R15: 0000000000000000
> [   18.280247] FS:  00007fc0e8798700(0000) GS:ffff92f99fd80000(0000) knlGS:0000000000000000
> [   18.280248] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> [   18.280249] CR2: 0000000000000000 CR3: 0000000409b4c003 CR4: 00000000001606e0
> [   18.280249] Call Trace:
> [   18.280263]  ? bnx2x_config_rss+0x2f/0xd0 [bnx2x]
> [   18.280270]  bnx2x_rss+0x1d9/0x210 [bnx2x]
> [   18.280276]  bnx2x_set_rxnfc+0x17d/0x380 [bnx2x]
> [   18.280279]  ethtool_set_rxnfc+0x9b/0x110
> [   18.280281]  ? __do_page_cache_readahead+0x1da/0x2c0
> [   18.280283]  ? security_capable+0x3c/0x60
> [   18.280284]  dev_ethtool+0350/0x2610
> [   18.280286]  ? page_cache_async_readahead+0x71/0x80
> [   18.280288]  ? page_add_file_rmap+0x5d/0x220
> [   18.280290]  ? inet_ioctl+0x182/0x1a0
> [   18.280291]  dev_ioctl+0x203/0x3f0
> [   18.280293]  ? dev_ioctl+0x203/0x3f0
> [   18.280294]  sock_do_ioctl+0xae/0x150
> [   18.280296]  sock_ioctl+0x1e2/0x330
> [   18.280296]  ? sock_ioctl+0x1e2/0x330
> [   18.280299]  do_vfs_ioctl+0xa8/0x620
> [   18.280300]  ? dlci_ioctl_set+0x30/0x30
> [   18.280301]  ? do_vfs_ioctl+0xa8/0x620
> [   18.280302]  ? handle_mm_fault+0xe3/0x220
> [   18.280304]  ksys_ioctl+0x75/0x80
> [   18.280305]  __x64_sys_ioctl+0x1a/0x20
> [   18.280307]  do_syscall_64+0x5a/0x120
> [   18.280309]  entry_SYSCALL_64_aftr_hwframe+0x44/0xa9
> [   18.280310] RIP: 0033:0x7fc0e7fba107
> [   18.280311] RSP: 002b:00007ffdb75beb78 EFLAGS: 00000206 ORIG_RAX: 0000000000000010
> [   18.280312] RAX: ffffffffffffffda RBX: 0000000000000000 RCX: 00007fc0e7fba107
> [   18.280312] RDX: 00007ffdb75bed60 RSI: 0000000000008946 RDI: 0000000000000003
> [   18.280313] RBP: 00007ffdb75bed50 R08: 00007ffdb75bed60 R09: 0000000000000001
> [   18.280313] R10: 0000000000000541 R11: 0000000000000206 R12: 00007ffdb75beed0
> [   18.280314] R13: 0000000000421020 R14: 000000000041fe28 R15: 0000000000000003
> [   18.280315] Code:  Bad RIP value.
> [   18.280317] RIP:           (null) RSP: ffffb84bc260b9c0
> [  18.280318] CR2: 0000000000000000
> [   18.280319] ---[ end trace 5f361db3fb9059f1 ]---
> 
> To reproduce this I created a bash script in "/etc/network/if-pre-up.d/" with these two lines:
> ethtool -N $IFACE rx-flow-hash udp4 "sdfn"
> ethtool -N $IFACE rx-flow-hash udp6 "sdfn"
> 
> The problem here is that rss_obj in bnx2x struct for the device hasn't been initialized yet, which causes an exception in bnx2x_config_rss() when calling "r->set_pending(r)" because r->set_pending is NULL. It looks like a lot many things haven't been initialized at this point, most of that happens in this
> function: "bnx2x_init_bp_objs()" which isn't called until ifup. Any thoughts on how this can be fixed? Would it be possible to safely move bnx2x_init_bp_objs() to maybe bnx2x_init_one() which runs much before ifup?
> 
> Thanks,
> Vishwanath
> 

^ permalink raw reply

* [PATCH net 0/2] net: dccp: fixes around rx_tstamp_last_feedback
From: Eric Dumazet @ 2018-06-22 13:44 UTC (permalink / raw)
  To: David S . Miller; +Cc: netdev, Gerrit Renker, dccp, Eric Dumazet, Eric Dumazet

This patch series fix some issues with rx_tstamp_last_feedback.

- Switch to monotonic clock.
- Avoid potential overflows on fast hosts/networks.

Eric Dumazet (2):
  net: dccp: avoid crash in ccid3_hc_rx_send_feedback()
  net: dccp: switch rx_tstamp_last_feedback to monotonic clock

 net/dccp/ccids/ccid3.c | 16 +++++++++-------
 1 file changed, 9 insertions(+), 7 deletions(-)

-- 
2.18.0.rc2.346.g013aa6912e-goog

^ permalink raw reply

* [PATCH net 1/2] net: dccp: avoid crash in ccid3_hc_rx_send_feedback()
From: Eric Dumazet @ 2018-06-22 13:44 UTC (permalink / raw)
  To: David S . Miller; +Cc: netdev, Gerrit Renker, dccp, Eric Dumazet, Eric Dumazet
In-Reply-To: <20180622134415.104266-1-edumazet@google.com>

On fast hosts or malicious bots, we trigger a DCCP_BUG() which
seems excessive.

syzbot reported :

BUG: delta (-6195) <= 0 at net/dccp/ccids/ccid3.c:628/ccid3_hc_rx_send_feedback()
CPU: 1 PID: 18 Comm: ksoftirqd/1 Not tainted 4.18.0-rc1+ #112
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
Call Trace:
 __dump_stack lib/dump_stack.c:77 [inline]
 dump_stack+0x1c9/0x2b4 lib/dump_stack.c:113
 ccid3_hc_rx_send_feedback net/dccp/ccids/ccid3.c:628 [inline]
 ccid3_hc_rx_packet_recv.cold.16+0x38/0x71 net/dccp/ccids/ccid3.c:793
 ccid_hc_rx_packet_recv net/dccp/ccid.h:185 [inline]
 dccp_deliver_input_to_ccids+0xf0/0x280 net/dccp/input.c:180
 dccp_rcv_established+0x87/0xb0 net/dccp/input.c:378
 dccp_v4_do_rcv+0x153/0x180 net/dccp/ipv4.c:654
 sk_backlog_rcv include/net/sock.h:914 [inline]
 __sk_receive_skb+0x3ba/0xd80 net/core/sock.c:517
 dccp_v4_rcv+0x10f9/0x1f58 net/dccp/ipv4.c:875
 ip_local_deliver_finish+0x2eb/0xda0 net/ipv4/ip_input.c:215
 NF_HOOK include/linux/netfilter.h:287 [inline]
 ip_local_deliver+0x1e9/0x750 net/ipv4/ip_input.c:256
 dst_input include/net/dst.h:450 [inline]
 ip_rcv_finish+0x823/0x2220 net/ipv4/ip_input.c:396
 NF_HOOK include/linux/netfilter.h:287 [inline]
 ip_rcv+0xa18/0x1284 net/ipv4/ip_input.c:492
 __netif_receive_skb_core+0x2488/0x3680 net/core/dev.c:4628
 __netif_receive_skb+0x2c/0x1e0 net/core/dev.c:4693
 process_backlog+0x219/0x760 net/core/dev.c:5373
 napi_poll net/core/dev.c:5771 [inline]
 net_rx_action+0x7da/0x1980 net/core/dev.c:5837
 __do_softirq+0x2e8/0xb17 kernel/softirq.c:284
 run_ksoftirqd+0x86/0x100 kernel/softirq.c:645
 smpboot_thread_fn+0x417/0x870 kernel/smpboot.c:164
 kthread+0x345/0x410 kernel/kthread.c:240
 ret_from_fork+0x3a/0x50 arch/x86/entry/entry_64.S:412

Signed-off-by: Eric Dumazet <edumazet@google.com>
Reported-by: syzbot <syzkaller@googlegroups.com>
Cc: Gerrit Renker <gerrit@erg.abdn.ac.uk>
Cc: dccp@vger.kernel.org
---
 net/dccp/ccids/ccid3.c | 5 ++---
 1 file changed, 2 insertions(+), 3 deletions(-)

diff --git a/net/dccp/ccids/ccid3.c b/net/dccp/ccids/ccid3.c
index 8b5ba6dffac7ebc88fd21075793dc3db43a74a43..d57a2be1e2e09aee89347e286aca538303b7dee1 100644
--- a/net/dccp/ccids/ccid3.c
+++ b/net/dccp/ccids/ccid3.c
@@ -625,9 +625,8 @@ static void ccid3_hc_rx_send_feedback(struct sock *sk,
 	case CCID3_FBACK_PERIODIC:
 		delta = ktime_us_delta(now, hc->rx_tstamp_last_feedback);
 		if (delta <= 0)
-			DCCP_BUG("delta (%ld) <= 0", (long)delta);
-		else
-			hc->rx_x_recv = scaled_div32(hc->rx_bytes_recv, delta);
+			delta = 1;
+		hc->rx_x_recv = scaled_div32(hc->rx_bytes_recv, delta);
 		break;
 	default:
 		return;
-- 
2.18.0.rc2.346.g013aa6912e-goog

^ permalink raw reply related

* [PATCH net 2/2] net: dccp: switch rx_tstamp_last_feedback to monotonic clock
From: Eric Dumazet @ 2018-06-22 13:44 UTC (permalink / raw)
  To: David S . Miller; +Cc: netdev, Gerrit Renker, dccp, Eric Dumazet, Eric Dumazet
In-Reply-To: <20180622134415.104266-1-edumazet@google.com>

To compute delays, better not use time of the day which can
be changed by admins or malicious programs.

Also change ccid3_first_li() to use s64 type for delta variable
to avoid potential overflows.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Gerrit Renker <gerrit@erg.abdn.ac.uk>
Cc: dccp@vger.kernel.org
---
 net/dccp/ccids/ccid3.c | 11 +++++++----
 1 file changed, 7 insertions(+), 4 deletions(-)

diff --git a/net/dccp/ccids/ccid3.c b/net/dccp/ccids/ccid3.c
index d57a2be1e2e09aee89347e286aca538303b7dee1..12877a1514e7b8e873cd26529e58f7ebaae99c1a 100644
--- a/net/dccp/ccids/ccid3.c
+++ b/net/dccp/ccids/ccid3.c
@@ -600,7 +600,7 @@ static void ccid3_hc_rx_send_feedback(struct sock *sk,
 {
 	struct ccid3_hc_rx_sock *hc = ccid3_hc_rx_sk(sk);
 	struct dccp_sock *dp = dccp_sk(sk);
-	ktime_t now = ktime_get_real();
+	ktime_t now = ktime_get();
 	s64 delta = 0;
 
 	switch (fbtype) {
@@ -632,7 +632,7 @@ static void ccid3_hc_rx_send_feedback(struct sock *sk,
 		return;
 	}
 
-	ccid3_pr_debug("Interval %ldusec, X_recv=%u, 1/p=%u\n", (long)delta,
+	ccid3_pr_debug("Interval %lldusec, X_recv=%u, 1/p=%u\n", delta,
 		       hc->rx_x_recv, hc->rx_pinv);
 
 	hc->rx_tstamp_last_feedback = now;
@@ -679,7 +679,8 @@ static int ccid3_hc_rx_insert_options(struct sock *sk, struct sk_buff *skb)
 static u32 ccid3_first_li(struct sock *sk)
 {
 	struct ccid3_hc_rx_sock *hc = ccid3_hc_rx_sk(sk);
-	u32 x_recv, p, delta;
+	u32 x_recv, p;
+	s64 delta;
 	u64 fval;
 
 	if (hc->rx_rtt == 0) {
@@ -687,7 +688,9 @@ static u32 ccid3_first_li(struct sock *sk)
 		hc->rx_rtt = DCCP_FALLBACK_RTT;
 	}
 
-	delta  = ktime_to_us(net_timedelta(hc->rx_tstamp_last_feedback));
+	delta = ktime_us_delta(ktime_get(), hc->rx_tstamp_last_feedback);
+	if (delta <= 0)
+		delta = 1;
 	x_recv = scaled_div32(hc->rx_bytes_recv, delta);
 	if (x_recv == 0) {		/* would also trigger divide-by-zero */
 		DCCP_WARN("X_recv==0\n");
-- 
2.18.0.rc2.346.g013aa6912e-goog

^ permalink raw reply related

* [PATCH net V3 1/1] net/smc: coordinate wait queues for nonblocking connect
From: Ursula Braun @ 2018-06-22 14:01 UTC (permalink / raw)
  To: davem
  Cc: netdev, linux-s390, schwidefsky, heiko.carstens, raspl, ubraun,
	xiyou.wangcong, hch

The recent poll change may lead to stalls for non-blocking connecting
SMC sockets, since sock_poll_wait is no longer performed on the
internal CLC socket, but on the outer SMC socket.  kernel_connect() on
the internal CLC socket returns with -EINPROGRESS, but the wake up
logic does not work in all cases. If the internal CLC socket is still
in state TCP_SYN_SENT when polled, sock_poll_wait() from sock_poll()
does not sleep. It is supposed to sleep till the state of the internal
CLC socket switches to TCP_ESTABLISHED.

This patch temporarily propagates the wait queue from the internal
CLC sock to the SMC sock, till the non-blocking connect() is
finished.

In addition locking is reduced due to the removed poll waits.

Fixes: c0129a061442 ("smc: convert to ->poll_mask")
Signed-off-by: Ursula Braun <ubraun@linux.ibm.com>
---
 net/smc/af_smc.c | 13 +++++++++----
 net/smc/smc.h    |  1 +
 2 files changed, 10 insertions(+), 4 deletions(-)

diff --git a/net/smc/af_smc.c b/net/smc/af_smc.c
index da7f02edcd37..7966e7ddb563 100644
--- a/net/smc/af_smc.c
+++ b/net/smc/af_smc.c
@@ -23,6 +23,7 @@
 #include <linux/workqueue.h>
 #include <linux/in.h>
 #include <linux/sched/signal.h>
+#include <linux/rcupdate.h>
 
 #include <net/sock.h>
 #include <net/tcp.h>
@@ -605,6 +606,11 @@ static int smc_connect(struct socket *sock, struct sockaddr *addr,
 
 	smc_copy_sock_settings_to_clc(smc);
 	tcp_sk(smc->clcsock->sk)->syn_smc = 1;
+	if (flags & O_NONBLOCK) {
+		smc->smcwq = rcu_access_pointer(sk->sk_wq);
+		rcu_assign_pointer(sock->sk->sk_wq,
+				   rcu_access_pointer(smc->clcsock->sk->sk_wq));
+	}
 	rc = kernel_connect(smc->clcsock, addr, alen, flags);
 	if (rc)
 		goto out;
@@ -1285,12 +1291,9 @@ static __poll_t smc_poll_mask(struct socket *sock, __poll_t events)
 
 	smc = smc_sk(sock->sk);
 	sock_hold(sk);
-	lock_sock(sk);
 	if ((sk->sk_state == SMC_INIT) || smc->use_fallback) {
 		/* delegate to CLC child sock */
-		release_sock(sk);
 		mask = smc->clcsock->ops->poll_mask(smc->clcsock, events);
-		lock_sock(sk);
 		sk->sk_err = smc->clcsock->sk->sk_err;
 		if (sk->sk_err) {
 			mask |= EPOLLERR;
@@ -1299,7 +1302,10 @@ static __poll_t smc_poll_mask(struct socket *sock, __poll_t events)
 			if (sk->sk_state == SMC_INIT &&
 			    mask & EPOLLOUT &&
 			    smc->clcsock->sk->sk_state != TCP_CLOSE) {
+				lock_sock(sk);
+				rcu_assign_pointer(sock->sk->sk_wq, smc->smcwq);
 				rc = __smc_connect(smc);
+				release_sock(sk);
 				if (rc < 0)
 					mask |= EPOLLERR;
 				/* success cases including fallback */
@@ -1334,7 +1340,6 @@ static __poll_t smc_poll_mask(struct socket *sock, __poll_t events)
 			mask |= EPOLLPRI;
 
 	}
-	release_sock(sk);
 	sock_put(sk);
 
 	return mask;
diff --git a/net/smc/smc.h b/net/smc/smc.h
index 51ae1f10d81a..89d6d7ef973f 100644
--- a/net/smc/smc.h
+++ b/net/smc/smc.h
@@ -190,6 +190,7 @@ struct smc_connection {
 struct smc_sock {				/* smc sock container */
 	struct sock		sk;
 	struct socket		*clcsock;	/* internal tcp socket */
+	struct socket_wq	*smcwq;		/* original smcsock wq */
 	struct smc_connection	conn;		/* smc connection */
 	struct smc_sock		*listen_smc;	/* listen parent */
 	struct work_struct	tcp_listen_work;/* handle tcp socket accepts */
-- 
2.16.4

^ permalink raw reply related

* RE: bnx2x: kernel panic in the bnx2x driver
From: Kalluru, Sudarsana @ 2018-06-22 14:21 UTC (permalink / raw)
  To: Vishwanath Pai, Elior, Ariel, Dept-Eng Everest Linux L2
  Cc: davem@davemloft.net, netdev@vger.kernel.org, dbanerje@akamai.com,
	pai.vishwain@gmail.com
In-Reply-To: <c6283d44-d2c4-5252-9128-f905c2068973@akamai.com>

Hi Vishwanath,
    The config will be cached in the device structure (bp->rss_conf_obj.udp_rss_v4) in this scenario, and will be applied in the load path (bnx2x_nic_load() --> bnx2x_init_rss()). Have unit tested the change on my setup.

Thanks,
Sudarsana

-----Original Message-----
From: Vishwanath Pai [mailto:vpai@akamai.com] 
Sent: 22 June 2018 18:52
To: Kalluru, Sudarsana <Sudarsana.Kalluru@cavium.com>; Elior, Ariel <Ariel.Elior@cavium.com>; Dept-Eng Everest Linux L2 <Dept-EngEverestLinuxL2@cavium.com>
Cc: davem@davemloft.net; netdev@vger.kernel.org; dbanerje@akamai.com; pai.vishwain@gmail.com
Subject: Re: bnx2x: kernel panic in the bnx2x driver

Hi Sudarsana,

Thanks for taking a look at my email. The fix you suggested would definitely fix the kernel panic, but at the same time wouldn't it also silently ignore the request by ethtool to set rx-flow-hash?

Thanks,
Vishwanath

On 06/22/2018 06:20 AM, Kalluru, Sudarsana wrote:
> Hi Vishwanath,
>     Thanks for your mail, and the analysis.
> The fix would be to invoke bnx2x_rss() only when the device is opened,
>       if (bp->state == BNX2X_STATE_OPEN)
>               return bnx2x_rss(bp, &bp->rss_conf_obj, false, true);
>       else
>               return 0;
> Ariel,
>    Could you please review the path (bnx2x_set_rss_flags()--> bnx2x_rss()) and confirm/correct on the above.
>
> Thanks,
> Sudarsana
>
> -----Original Message-----
> From: Vishwanath Pai [mailto:vpai@akamai.com]
> Sent: 22 June 2018 10:37
> To: Elior, Ariel <Ariel.Elior@cavium.com>; Dept-Eng Everest Linux L2 
> <Dept-EngEverestLinuxL2@cavium.com>
> Cc: davem@davemloft.net; netdev@vger.kernel.org; dbanerje@akamai.com; 
> pai.vishwain@gmail.com
> Subject: bnx2x: kernel panic in the bnx2x driver
>
> External Email
>
> Hi,
>
> We recently noticed a kernel panic in the bnx2x driver when trying to 
> set rx-flow-hash parameters via ethtool during if-pre-up.d. I am 
> running kernel
> v4.17.2 from ubuntu-mainline-ppa. I have added the stack trace below:
>
> [   18.280209] BUG: unable to handle kernel NULL pointer dereference at 0000000000000000
> [   18.280212] PGD 8000000407a79067 P4D 8000000407a79067 PUD 40ce8a067 PMD 0
> [   18.280214] Oops: 0010 [#1] SMP PTI
> [   18.280215] Modules linked in: intel_rapl x86_pkg_temp_thermal intel_powerclamp kvm_intel joydev input_led kvm irqbypass crct10dif_pclmul crc32_pclmul ghash_clmulni_intel pcbc hid_eneric aesni_intel gpio_ich aes_x86_64 usbhid lpc_ich crpto_simd ie31200_edac cryptd glue_helper intel_cstate mac_hid intel_rapl_perf bnx2x mdio tcp_bbr netconsole ipmi_devintf ipmi_msghandler i2c_i801 coretemp autofs4 raid10 raid456 libcrc32c async_raid6_recov async_memcpy async_pq async_xor xor async_tx raid6_pq raid1 raid0 multipath linear sha26_mb mcryptd sha256_ssse3 hid ast i2c_algo_bit ttm drm_kms_helper syscopyarea sysfillrect sysimgblt mpt3sas fb_sys_fops drm raid_class scsi_transport_sas ahci libahci shpchp video
> [   18.280241] CPU: 6 PID: 1081 Comm: ethtool Not tainted 4.17.2-041702-generic #201806160433
> [   18.280242] Hardware name: Foxconn CangJie/CangJie, BIOS CC1F108D 02/26/2014
> [   18.280243] RIP: 0010:          (null)
> [   18.280243] RSP: 0018:ffffb84bc260b9c0 EFLAGS: 00010246
> [   18.280244] RAX: 0000000000000000 RBX: ffff92f987f020f0 RCX: 0000000000000000
> [   18.280245] RDX: 0000000000000000 RSI: ffffb84bc260b9f8 RDI: ffff92f987f020f0
> [   18.280245] RBP: ffffb8bc260b9e8 R08: 0000000000000001 R09: 0000000000000000
> [   18.280246] R10: ffffb84bc260bd20 R11: 0000000000000000 R12: ffffb84bc260b9f8
> [   18.280246] R13: ffff92f987f008c0 R14: 00007ffdb75bec40 R15: 0000000000000000
> [   18.280247] FS:  00007fc0e8798700(0000) GS:ffff92f99fd80000(0000) knlGS:0000000000000000
> [   18.280248] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> [   18.280249] CR2: 0000000000000000 CR3: 0000000409b4c003 CR4: 00000000001606e0
> [   18.280249] Call Trace:
> [   18.280263]  ? bnx2x_config_rss+0x2f/0xd0 [bnx2x]
> [   18.280270]  bnx2x_rss+0x1d9/0x210 [bnx2x]
> [   18.280276]  bnx2x_set_rxnfc+0x17d/0x380 [bnx2x]
> [   18.280279]  ethtool_set_rxnfc+0x9b/0x110
> [   18.280281]  ? __do_page_cache_readahead+0x1da/0x2c0
> [   18.280283]  ? security_capable+0x3c/0x60
> [   18.280284]  dev_ethtool+0350/0x2610
> [   18.280286]  ? page_cache_async_readahead+0x71/0x80
> [   18.280288]  ? page_add_file_rmap+0x5d/0x220
> [   18.280290]  ? inet_ioctl+0x182/0x1a0
> [   18.280291]  dev_ioctl+0x203/0x3f0
> [   18.280293]  ? dev_ioctl+0x203/0x3f0
> [   18.280294]  sock_do_ioctl+0xae/0x150
> [   18.280296]  sock_ioctl+0x1e2/0x330
> [   18.280296]  ? sock_ioctl+0x1e2/0x330
> [   18.280299]  do_vfs_ioctl+0xa8/0x620
> [   18.280300]  ? dlci_ioctl_set+0x30/0x30
> [   18.280301]  ? do_vfs_ioctl+0xa8/0x620
> [   18.280302]  ? handle_mm_fault+0xe3/0x220
> [   18.280304]  ksys_ioctl+0x75/0x80
> [   18.280305]  __x64_sys_ioctl+0x1a/0x20
> [   18.280307]  do_syscall_64+0x5a/0x120
> [   18.280309]  entry_SYSCALL_64_aftr_hwframe+0x44/0xa9
> [   18.280310] RIP: 0033:0x7fc0e7fba107
> [   18.280311] RSP: 002b:00007ffdb75beb78 EFLAGS: 00000206 ORIG_RAX: 0000000000000010
> [   18.280312] RAX: ffffffffffffffda RBX: 0000000000000000 RCX: 00007fc0e7fba107
> [   18.280312] RDX: 00007ffdb75bed60 RSI: 0000000000008946 RDI: 0000000000000003
> [   18.280313] RBP: 00007ffdb75bed50 R08: 00007ffdb75bed60 R09: 0000000000000001
> [   18.280313] R10: 0000000000000541 R11: 0000000000000206 R12: 00007ffdb75beed0
> [   18.280314] R13: 0000000000421020 R14: 000000000041fe28 R15: 0000000000000003
> [   18.280315] Code:  Bad RIP value.
> [   18.280317] RIP:           (null) RSP: ffffb84bc260b9c0
> [  18.280318] CR2: 0000000000000000
> [   18.280319] ---[ end trace 5f361db3fb9059f1 ]---
>
> To reproduce this I created a bash script in "/etc/network/if-pre-up.d/" with these two lines:
> ethtool -N $IFACE rx-flow-hash udp4 "sdfn"
> ethtool -N $IFACE rx-flow-hash udp6 "sdfn"
>
> The problem here is that rss_obj in bnx2x struct for the device hasn't 
> been initialized yet, which causes an exception in bnx2x_config_rss() 
> when calling "r->set_pending(r)" because r->set_pending is NULL. It 
> looks like a lot many things haven't been initialized at this point, 
> most of that happens in this
> function: "bnx2x_init_bp_objs()" which isn't called until ifup. Any thoughts on how this can be fixed? Would it be possible to safely move bnx2x_init_bp_objs() to maybe bnx2x_init_one() which runs much before ifup?
>
> Thanks,
> Vishwanath
>


^ permalink raw reply

* Re: [PATCH rdma-next 0/2] RoCE ICRC counter
From: Jason Gunthorpe @ 2018-06-22 15:03 UTC (permalink / raw)
  To: Leon Romanovsky
  Cc: Doug Ledford, Leon Romanovsky, RDMA mailing list, Mark Bloch,
	Talat Batheesh, Saeed Mahameed, linux-netdev
In-Reply-To: <20180621123756.32645-1-leon@kernel.org>

On Thu, Jun 21, 2018 at 03:37:54PM +0300, Leon Romanovsky wrote:
> From: Leon Romanovsky <leonro@mellanox.com>
> 
> Hi,
> 
> This series exposes RoCE ICRC counter through existing RDMA hw_counters
> sysfs interface.
> 
> First patch has all HW definitions in mlx5_ifc.h file and second patch is
> actual counter implementation.
> 
> Thanks
> 
> Talat Batheesh (2):
>   net/mlx5: Add RoCE RX ICRC encapsulated counter
>   IB/mlx5: Support RoCE ICRC encapsulated error counter
> 
>  drivers/infiniband/hw/mlx5/cmd.c     | 12 +++++++
>  drivers/infiniband/hw/mlx5/cmd.h     |  1 +
>  drivers/infiniband/hw/mlx5/main.c    | 62 ++++++++++++++++++++++++++++++++++--
>  drivers/infiniband/hw/mlx5/mlx5_ib.h |  1 +
>  include/linux/mlx5/mlx5_ifc.h        | 11 +++++--
>  5 files changed, 81 insertions(+), 6 deletions(-)

Applied to rdma for-next with the mellanox/mlx5-next branch

Thanks,
Jason

^ permalink raw reply

* Re: [virtio-dev] Re: [Qemu-devel] [PATCH] qemu: Introduce VIRTIO_NET_F_STANDBY feature bit to virtio_net
From: Cornelia Huck @ 2018-06-22 15:09 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: Siwei Liu, Samudrala, Sridhar, Alexander Duyck, virtio-dev,
	aaron.f.brown, Jiri Pirko, Jakub Kicinski, Netdev, qemu-devel,
	virtualization, konrad.wilk, boris.ostrovsky, Joao Martins,
	Venu Busireddy, vijay.balakrishna
In-Reply-To: <20180621211712-mutt-send-email-mst@kernel.org>

On Thu, 21 Jun 2018 21:20:13 +0300
"Michael S. Tsirkin" <mst@redhat.com> wrote:

> On Thu, Jun 21, 2018 at 04:59:13PM +0200, Cornelia Huck wrote:
> > OK, so what about the following:
> > 
> > - introduce a new feature bit, VIRTIO_NET_F_STANDBY_UUID that indicates
> >   that we have a new uuid field in the virtio-net config space
> > - in QEMU, add a property for virtio-net that allows to specify a uuid,
> >   offer VIRTIO_NET_F_STANDBY_UUID if set
> > - when configuring, set the property to the group UUID of the vfio-pci
> >   device
> > - in the guest, use the uuid from the virtio-net device's config space
> >   if applicable; else, fall back to matching by MAC as done today
> > 
> > That should work for all virtio transports.  
> 
> True. I'm a bit unhappy that it's virtio net specific though
> since down the road I expect we'll have a very similar feature
> for scsi (and maybe others).
> 
> But we do not have a way to have fields that are portable
> both across devices and transports, and I think it would
> be a useful addition. How would this work though? Any idea?

Can we introduce some kind of device-independent config space area?
Pushing back the device-specific config space by a certain value if the
appropriate feature is negotiated and use that for things like the uuid?

But regardless of that, I'm not sure whether extending this approach to
other device types is the way to go. Tying together two different
devices is creating complicated situations at least in the hypervisor
(even if it's fairly straightforward in the guest). [I have not come
around again to look at the "how to handle visibility in QEMU"
questions due to lack of cycles, sorry about that.]

So, what's the goal of this approach? Only to allow migration with
vfio-pci, or also to plug in a faster device and use it instead of an
already attached paravirtualized device?

What about migration of vfio devices that are not easily replaced by a
paravirtualized device? I'm thinking of vfio-ccw, where our main (and
currently only) supported device is dasd (disks) -- which can do a lot
of specialized things that virtio-blk does not support (and should not
or even cannot support). Would it be more helpful to focus on generic
migration support for vfio instead of going about it device by device?

This network device approach already seems far along, so it makes sense
to continue with it. But I'm not sure whether we want to spend time and
energy on that for other device types rather than working on a general
solution for vfio migration.

^ permalink raw reply

* [bpf PATCH v3 0/4] BPF fixes for sockhash
From: John Fastabend @ 2018-06-22 15:21 UTC (permalink / raw)
  To: john.fastabend, ast, daniel, kafai; +Cc: netdev

This addresses two syzbot issues that lead to identifying (by Eric and
Wei) a class of bugs where we don't correctly check for IPv4/v6
sockets and their associated state. The second issue was a locking
omission in sockhash.

The first patch addresses IPv6 socks and fixing an error where
sockhash would overwrite the prot pointer with IPv4 prot. To fix
this build similar solution to TLS ULP. Although we continue to
allow socks in all states not just ESTABLISH in this patch set
because as Martin points out there should be no issue with this
on the sockmap ULP because we don't use the ctx in this code.

The other issue syzbot found that the tcp_close() handler missed
locking the hash bucket lock which could result in corrupting the
sockhash bucket list if delete and close ran at the same time. 
And also the smap_list_remove() routine was not working correctly
at all. This was not caught in my testing because in general my
tests (to date at least lets add some more robust selftest in
bpf-next) do things in the "expected" order, create map, add socks,
delete socks, then tear down maps. The tests we have that do the
ops out of this order where only working on single maps not multi-
maps so we never saw the issue. Thanks syzbot. The fix is to
restructure the tcp_close() lock handling. And fix the obvious
bug in smap_list_remove().

Finally, during review I noticed the release handler was omitted
from the upstream code (patch 4) due to an incorrect merge conflict
fix when I ported the code to latest bpf-next before submitting.

v3: rework patches, dropping ESTABLISH check and adding rcu
    annotation along with the smap_list_remove fix

Also big thanks to Martin for thorough review he caught at least
one case where I missed a rcu_call().

---

John Fastabend (4):
      bpf: sockmap, fix crash when ipv6 sock is added
      bpf: sockmap, fix smap_list_map_remove when psock is in many maps
      bpf: sockhash fix omitted bucket lock in sock_close
      bpf: sockhash, add release routine


 kernel/bpf/sockmap.c |  210 ++++++++++++++++++++++++++++++++++++--------------
 1 file changed, 153 insertions(+), 57 deletions(-)

^ permalink raw reply

* [bpf PATCH v3 1/4] bpf: sockmap, fix crash when ipv6 sock is added
From: John Fastabend @ 2018-06-22 15:21 UTC (permalink / raw)
  To: john.fastabend, ast, daniel, kafai; +Cc: netdev
In-Reply-To: <20180622151123.24502.56029.stgit@john-Precision-Tower-5810>

This fixes a crash where we assign tcp_prot to IPv6 sockets instead
of tcpv6_prot.

Previously we overwrote the sk->prot field with tcp_prot even in the
AF_INET6 case. This patch ensures the correct tcp_prot and tcpv6_prot
are used.

Tested with 'netserver -6' and 'netperf -H [IPv6]' as well as
'netperf -H [IPv4]'. The ESTABLISHED check resolves the previously
crashing case here.

Fixes: 174a79ff9515 ("bpf: sockmap with sk redirect support")
Reported-by: syzbot+5c063698bdbfac19f363@syzkaller.appspotmail.com
Signed-off-by: John Fastabend <john.fastabend@gmail.com>
Signed-off-by: Wei Wang <weiwan@google.com>
---
 kernel/bpf/sockmap.c |   58 +++++++++++++++++++++++++++++++++++++++++---------
 1 file changed, 48 insertions(+), 10 deletions(-)

diff --git a/kernel/bpf/sockmap.c b/kernel/bpf/sockmap.c
index 52a91d8..d7fd17a 100644
--- a/kernel/bpf/sockmap.c
+++ b/kernel/bpf/sockmap.c
@@ -140,6 +140,7 @@ static int bpf_tcp_recvmsg(struct sock *sk, struct msghdr *msg, size_t len,
 static int bpf_tcp_sendmsg(struct sock *sk, struct msghdr *msg, size_t size);
 static int bpf_tcp_sendpage(struct sock *sk, struct page *page,
 			    int offset, size_t size, int flags);
+static void bpf_tcp_close(struct sock *sk, long timeout);
 
 static inline struct smap_psock *smap_psock_sk(const struct sock *sk)
 {
@@ -161,7 +162,42 @@ static bool bpf_tcp_stream_read(const struct sock *sk)
 	return !empty;
 }
 
-static struct proto tcp_bpf_proto;
+enum {
+	SOCKMAP_IPV4,
+	SOCKMAP_IPV6,
+	SOCKMAP_NUM_PROTS,
+};
+
+enum {
+	SOCKMAP_BASE,
+	SOCKMAP_TX,
+	SOCKMAP_NUM_CONFIGS,
+};
+
+static struct proto *saved_tcpv6_prot __read_mostly;
+static DEFINE_SPINLOCK(tcpv6_prot_lock);
+static struct proto bpf_tcp_prots[SOCKMAP_NUM_PROTS][SOCKMAP_NUM_CONFIGS];
+static void build_protos(struct proto prot[SOCKMAP_NUM_CONFIGS],
+			 struct proto *base)
+{
+	prot[SOCKMAP_BASE]			= *base;
+	prot[SOCKMAP_BASE].close		= bpf_tcp_close;
+	prot[SOCKMAP_BASE].recvmsg		= bpf_tcp_recvmsg;
+	prot[SOCKMAP_BASE].stream_memory_read	= bpf_tcp_stream_read;
+
+	prot[SOCKMAP_TX]			= prot[SOCKMAP_BASE];
+	prot[SOCKMAP_TX].sendmsg		= bpf_tcp_sendmsg;
+	prot[SOCKMAP_TX].sendpage		= bpf_tcp_sendpage;
+}
+
+static void update_sk_prot(struct sock *sk, struct smap_psock *psock)
+{
+	int family = sk->sk_family == AF_INET6 ? SOCKMAP_IPV6 : SOCKMAP_IPV4;
+	int conf = psock->bpf_tx_msg ? SOCKMAP_TX : SOCKMAP_BASE;
+
+	sk->sk_prot = &bpf_tcp_prots[family][conf];
+}
+
 static int bpf_tcp_init(struct sock *sk)
 {
 	struct smap_psock *psock;
@@ -181,14 +217,17 @@ static int bpf_tcp_init(struct sock *sk)
 	psock->save_close = sk->sk_prot->close;
 	psock->sk_proto = sk->sk_prot;
 
-	if (psock->bpf_tx_msg) {
-		tcp_bpf_proto.sendmsg = bpf_tcp_sendmsg;
-		tcp_bpf_proto.sendpage = bpf_tcp_sendpage;
-		tcp_bpf_proto.recvmsg = bpf_tcp_recvmsg;
-		tcp_bpf_proto.stream_memory_read = bpf_tcp_stream_read;
+	/* Build IPv6 sockmap whenever the address of tcpv6_prot changes */
+	if (sk->sk_family == AF_INET6 &&
+	    unlikely(sk->sk_prot != smp_load_acquire(&saved_tcpv6_prot))) {
+		spin_lock_bh(&tcpv6_prot_lock);
+		if (likely(sk->sk_prot != saved_tcpv6_prot)) {
+			build_protos(bpf_tcp_prots[SOCKMAP_IPV6], sk->sk_prot);
+			smp_store_release(&saved_tcpv6_prot, sk->sk_prot);
+		}
+		spin_unlock_bh(&tcpv6_prot_lock);
 	}
-
-	sk->sk_prot = &tcp_bpf_proto;
+	update_sk_prot(sk, psock);
 	rcu_read_unlock();
 	return 0;
 }
@@ -1111,8 +1150,7 @@ static void bpf_tcp_msg_add(struct smap_psock *psock,
 
 static int bpf_tcp_ulp_register(void)
 {
-	tcp_bpf_proto = tcp_prot;
-	tcp_bpf_proto.close = bpf_tcp_close;
+	build_protos(bpf_tcp_prots[SOCKMAP_IPV4], &tcp_prot);
 	/* Once BPF TX ULP is registered it is never unregistered. It
 	 * will be in the ULP list for the lifetime of the system. Doing
 	 * duplicate registers is not a problem.

^ permalink raw reply related

* [bpf PATCH v3 2/4] bpf: sockmap, fix smap_list_map_remove when psock is in many maps
From: John Fastabend @ 2018-06-22 15:21 UTC (permalink / raw)
  To: john.fastabend, ast, daniel, kafai; +Cc: netdev
In-Reply-To: <20180622151123.24502.56029.stgit@john-Precision-Tower-5810>

If a hashmap is free'd with open socks it removes the reference to
the hash entry from the psock. If that is the last reference to the
psock then it will also be free'd by the reference counting logic.
However the current logic that removes the hash reference from the
list of references is broken. In map_list_map_remove() we first check
if the sockmap entry matches and then check if the hashmap entry
matches. But, the sockmap entry sill always match because its NULL in
this case which causes the first entry to be removed from the list.
If this is always the "right" entry (because the user adds/removes
entries in order) then everything is OK but otherwise a subsequent
bpf_tcp_close() may reference a free'd object.

To fix this create two list handlers one for sockmap and one for
sockhash.

Reported-by: syzbot+0ce137753c78f7b6acc1@syzkaller.appspotmail.com
Fixes: 81110384441a ("bpf: sockmap, add hash map support")
Signed-off-by: John Fastabend <john.fastabend@gmail.com>
---
 kernel/bpf/sockmap.c |   33 +++++++++++++++++++++------------
 1 file changed, 21 insertions(+), 12 deletions(-)

diff --git a/kernel/bpf/sockmap.c b/kernel/bpf/sockmap.c
index d7fd17a..69b26af 100644
--- a/kernel/bpf/sockmap.c
+++ b/kernel/bpf/sockmap.c
@@ -1602,17 +1602,26 @@ static struct bpf_map *sock_map_alloc(union bpf_attr *attr)
 	return ERR_PTR(err);
 }
 
-static void smap_list_remove(struct smap_psock *psock,
-			     struct sock **entry,
-			     struct htab_elem *hash_link)
+static void smap_list_map_remove(struct smap_psock *psock,
+				 struct sock **entry)
 {
 	struct smap_psock_map_entry *e, *tmp;
 
 	list_for_each_entry_safe(e, tmp, &psock->maps, list) {
-		if (e->entry == entry || e->hash_link == hash_link) {
+		if (e->entry == entry)
+			list_del(&e->list);
+	}
+}
+static void smap_list_hash_remove(struct smap_psock *psock,
+				  struct htab_elem *hash_link)
+{
+	struct smap_psock_map_entry *e, *tmp;
+
+	list_for_each_entry_safe(e, tmp, &psock->maps, list) {
+		struct htab_elem *c = e->hash_link;
+
+		if (c == hash_link)
 			list_del(&e->list);
-			break;
-		}
 	}
 }
 
@@ -1647,7 +1656,7 @@ static void sock_map_free(struct bpf_map *map)
 		 * to be null and queued for garbage collection.
 		 */
 		if (likely(psock)) {
-			smap_list_remove(psock, &stab->sock_map[i], NULL);
+			smap_list_map_remove(psock, &stab->sock_map[i]);
 			smap_release_sock(psock, sock);
 		}
 		write_unlock_bh(&sock->sk_callback_lock);
@@ -1706,7 +1715,7 @@ static int sock_map_delete_elem(struct bpf_map *map, void *key)
 
 	if (psock->bpf_parse)
 		smap_stop_sock(psock, sock);
-	smap_list_remove(psock, &stab->sock_map[k], NULL);
+	smap_list_map_remove(psock, &stab->sock_map[k]);
 	smap_release_sock(psock, sock);
 out:
 	write_unlock_bh(&sock->sk_callback_lock);
@@ -1908,7 +1917,7 @@ static int sock_map_ctx_update_elem(struct bpf_sock_ops_kern *skops,
 		struct smap_psock *opsock = smap_psock_sk(osock);
 
 		write_lock_bh(&osock->sk_callback_lock);
-		smap_list_remove(opsock, &stab->sock_map[i], NULL);
+		smap_list_map_remove(opsock, &stab->sock_map[i]);
 		smap_release_sock(opsock, osock);
 		write_unlock_bh(&osock->sk_callback_lock);
 	}
@@ -2124,7 +2133,7 @@ static void sock_hash_free(struct bpf_map *map)
 			 * (psock) to be null and queued for garbage collection.
 			 */
 			if (likely(psock)) {
-				smap_list_remove(psock, NULL, l);
+				smap_list_hash_remove(psock, l);
 				smap_release_sock(psock, sock);
 			}
 			write_unlock_bh(&sock->sk_callback_lock);
@@ -2304,7 +2313,7 @@ static int sock_hash_ctx_update_elem(struct bpf_sock_ops_kern *skops,
 		psock = smap_psock_sk(l_old->sk);
 
 		hlist_del_rcu(&l_old->hash_node);
-		smap_list_remove(psock, NULL, l_old);
+		smap_list_hash_remove(psock, l_old);
 		smap_release_sock(psock, l_old->sk);
 		free_htab_elem(htab, l_old);
 	}
@@ -2372,7 +2381,7 @@ static int sock_hash_delete_elem(struct bpf_map *map, void *key)
 		 * to be null and queued for garbage collection.
 		 */
 		if (likely(psock)) {
-			smap_list_remove(psock, NULL, l);
+			smap_list_hash_remove(psock, l);
 			smap_release_sock(psock, sock);
 		}
 		write_unlock_bh(&sock->sk_callback_lock);

^ permalink raw reply related

* [bpf PATCH v3 3/4] bpf: sockhash fix omitted bucket lock in sock_close
From: John Fastabend @ 2018-06-22 15:21 UTC (permalink / raw)
  To: john.fastabend, ast, daniel, kafai; +Cc: netdev
In-Reply-To: <20180622151123.24502.56029.stgit@john-Precision-Tower-5810>

First in tcp_close, reduce scope of sk_callback_lock() the lock is
only needed for protecting maps list the ingress and cork
lists are protected by sock lock. Having the lock in wider scope is
harmless but may confuse the reader who may infer it is in fact
needed.

Next, in sock_hash_delete_elem() the pattern is as follows,

  sock_hash_delete_elem()
     [...]
     spin_lock(bucket_lock)
     l = lookup_elem_raw()
     if (l)
        hlist_del_rcu()
        write_lock(sk_callback_lock)
         .... destroy psock ...
        write_unlock(sk_callback_lock)
     spin_unlock(bucket_lock)

The ordering is necessary because we only know the {p}sock after
dereferencing the hash table which we can't do unless we have the
bucket lock held. Once we have the bucket lock and the psock element
it is deleted from the hashmap to ensure any other path doing a lookup
will fail. Finally, the refcnt is decremented and if zero the psock
is destroyed.

In parallel with the above (or free'ing the map) a tcp close event
may trigger tcp_close(). Which at the moment omits the bucket lock
altogether (oops!) where the flow looks like this,

  bpf_tcp_close()
     [...]
     write_lock(sk_callback_lock)
     for each psock->maps // list of maps this sock is part of
         hlist_del_rcu(ref_hash_node);
         .... destroy psock ...
     write_unlock(sk_callback_lock)

Obviously, and demonstrated by syzbot, this is broken because
we can have multiple threads deleting entries via hlist_del_rcu().

To fix this we might be tempted to wrap the hlist operation in a
bucket lock but that would create a lock inversion problem. In
summary to follow locking rules the psocks maps list needs the
sk_callback_lock but we need the bucket lock to do the hlist_del_rcu.
To resolve the lock inversion problem pop the head of the maps list
repeatedly and remove the reference until no more are left. If a
delete happens in parallel from the BPF API that is OK as well because
it will do a similar action, lookup the lock in the map/hash, delete
it from the map/hash, and dec the refcnt. We check for this case
before doing a destroy on the psock to ensure we don't have two
threads tearing down a psock. The new logic is as follows,

  bpf_tcp_close()
  e = psock_map_pop(psock->maps) // done with sk_callback_lock
  bucket_lock() // lock hash list bucket
  l = lookup_elem_raw(head, hash, key, key_size);
  if (l) {
     //only get here if elmnt was not already removed
     hlist_del_rcu()
     ... destroy psock...
  }
  bucket_unlock()

And finally for all the above to work add missing sk_callback_lock
around smap_list_remove in sock_hash_ctx_update_elem(). Otherwise
delete and update may corrupt maps list. Then add RCU annotations and
use rcu_dereference/rcu_assign_pointer to manage values relying on
RCU so that the object is not free'd from sock_hash_free() while it
is being referenced in bpf_tcp_close().

(As an aside the sk_callback_lock serves two purposes. The
 first, is to update the sock callbacks sk_data_ready, sk_write_space,
 etc. The second is to protect the psock 'maps' list. The 'maps' list
 is used to (as shown above) to delete all map/hash references to a
 sock when the sock is closed)

Reported-by: syzbot+0ce137753c78f7b6acc1@syzkaller.appspotmail.com
Fixes: 81110384441a ("bpf: sockmap, add hash map support")
Signed-off-by: John Fastabend <john.fastabend@gmail.com>
---
 kernel/bpf/sockmap.c |  120 +++++++++++++++++++++++++++++++++++---------------
 1 file changed, 84 insertions(+), 36 deletions(-)

diff --git a/kernel/bpf/sockmap.c b/kernel/bpf/sockmap.c
index 69b26af..333427b 100644
--- a/kernel/bpf/sockmap.c
+++ b/kernel/bpf/sockmap.c
@@ -72,6 +72,7 @@ struct bpf_htab {
 	u32 n_buckets;
 	u32 elem_size;
 	struct bpf_sock_progs progs;
+	struct rcu_head rcu;
 };
 
 struct htab_elem {
@@ -89,8 +90,8 @@ enum smap_psock_state {
 struct smap_psock_map_entry {
 	struct list_head list;
 	struct sock **entry;
-	struct htab_elem *hash_link;
-	struct bpf_htab *htab;
+	struct htab_elem __rcu *hash_link;
+	struct bpf_htab __rcu *htab;
 };
 
 struct smap_psock {
@@ -258,16 +259,54 @@ static void bpf_tcp_release(struct sock *sk)
 	rcu_read_unlock();
 }
 
+static struct htab_elem *lookup_elem_raw(struct hlist_head *head,
+					 u32 hash, void *key, u32 key_size)
+{
+	struct htab_elem *l;
+
+	hlist_for_each_entry_rcu(l, head, hash_node) {
+		if (l->hash == hash && !memcmp(&l->key, key, key_size))
+			return l;
+	}
+
+	return NULL;
+}
+
+static inline struct bucket *__select_bucket(struct bpf_htab *htab, u32 hash)
+{
+	return &htab->buckets[hash & (htab->n_buckets - 1)];
+}
+
+static inline struct hlist_head *select_bucket(struct bpf_htab *htab, u32 hash)
+{
+	return &__select_bucket(htab, hash)->head;
+}
+
 static void free_htab_elem(struct bpf_htab *htab, struct htab_elem *l)
 {
 	atomic_dec(&htab->count);
 	kfree_rcu(l, rcu);
 }
 
+static struct smap_psock_map_entry *psock_map_pop(struct sock *sk,
+						  struct smap_psock *psock)
+{
+	struct smap_psock_map_entry *e;
+
+	write_lock_bh(&sk->sk_callback_lock);
+	e = list_first_entry_or_null(&psock->maps,
+				     struct smap_psock_map_entry,
+				     list);
+	if (e)
+		list_del(&e->list);
+	write_unlock_bh(&sk->sk_callback_lock);
+	return e;
+}
+
 static void bpf_tcp_close(struct sock *sk, long timeout)
 {
 	void (*close_fun)(struct sock *sk, long timeout);
-	struct smap_psock_map_entry *e, *tmp;
+	struct smap_psock_map_entry *e;
 	struct sk_msg_buff *md, *mtmp;
 	struct smap_psock *psock;
 	struct sock *osk;
@@ -286,7 +325,6 @@ static void bpf_tcp_close(struct sock *sk, long timeout)
 	 */
 	close_fun = psock->save_close;
 
-	write_lock_bh(&sk->sk_callback_lock);
 	if (psock->cork) {
 		free_start_sg(psock->sock, psock->cork);
 		kfree(psock->cork);
@@ -299,20 +337,38 @@ static void bpf_tcp_close(struct sock *sk, long timeout)
 		kfree(md);
 	}
 
-	list_for_each_entry_safe(e, tmp, &psock->maps, list) {
+	e = psock_map_pop(sk, psock);
+	while (e) {
 		if (e->entry) {
 			osk = cmpxchg(e->entry, sk, NULL);
 			if (osk == sk) {
-				list_del(&e->list);
 				smap_release_sock(psock, sk);
 			}
 		} else {
-			hlist_del_rcu(&e->hash_link->hash_node);
-			smap_release_sock(psock, e->hash_link->sk);
-			free_htab_elem(e->htab, e->hash_link);
+			struct htab_elem *link = rcu_dereference(e->hash_link);
+			struct bpf_htab *htab = rcu_dereference(e->htab);
+			struct hlist_head *head;
+			struct htab_elem *l;
+			struct bucket *b;
+
+			b = __select_bucket(htab, link->hash);
+			head = &b->head;
+			raw_spin_lock_bh(&b->lock);
+			l = lookup_elem_raw(head,
+					    link->hash, link->key,
+					    htab->map.key_size);
+			/* If another thread deleted this object skip deletion.
+			 * The refcnt on psock may or may not be zero.
+			 */
+			if (l) {
+				hlist_del_rcu(&link->hash_node);
+				smap_release_sock(psock, link->sk);
+				free_htab_elem(htab, link);
+			}
+			raw_spin_unlock_bh(&b->lock);
 		}
+		e = psock_map_pop(sk, psock);
 	}
-	write_unlock_bh(&sk->sk_callback_lock);
 	rcu_read_unlock();
 	close_fun(sk, timeout);
 }
@@ -1618,7 +1674,7 @@ static void smap_list_hash_remove(struct smap_psock *psock,
 	struct smap_psock_map_entry *e, *tmp;
 
 	list_for_each_entry_safe(e, tmp, &psock->maps, list) {
-		struct htab_elem *c = e->hash_link;
+		struct htab_elem *c = rcu_dereference(e->hash_link);
 
 		if (c == hash_link)
 			list_del(&e->list);
@@ -2090,14 +2146,13 @@ static struct bpf_map *sock_hash_alloc(union bpf_attr *attr)
 	return ERR_PTR(err);
 }
 
-static inline struct bucket *__select_bucket(struct bpf_htab *htab, u32 hash)
+static void __bpf_htab_free(struct rcu_head *rcu)
 {
-	return &htab->buckets[hash & (htab->n_buckets - 1)];
-}
+	struct bpf_htab *htab;
 
-static inline struct hlist_head *select_bucket(struct bpf_htab *htab, u32 hash)
-{
-	return &__select_bucket(htab, hash)->head;
+	htab = container_of(rcu, struct bpf_htab, rcu);
+	bpf_map_area_free(htab->buckets);
+	kfree(htab);
 }
 
 static void sock_hash_free(struct bpf_map *map)
@@ -2116,10 +2171,13 @@ static void sock_hash_free(struct bpf_map *map)
 	 */
 	rcu_read_lock();
 	for (i = 0; i < htab->n_buckets; i++) {
-		struct hlist_head *head = select_bucket(htab, i);
+		struct bucket *b = __select_bucket(htab, i);
+		struct hlist_head *head;
 		struct hlist_node *n;
 		struct htab_elem *l;
 
+		raw_spin_lock_bh(&b->lock);
+		head = &b->head;
 		hlist_for_each_entry_safe(l, n, head, hash_node) {
 			struct sock *sock = l->sk;
 			struct smap_psock *psock;
@@ -2137,12 +2195,12 @@ static void sock_hash_free(struct bpf_map *map)
 				smap_release_sock(psock, sock);
 			}
 			write_unlock_bh(&sock->sk_callback_lock);
-			kfree(l);
+			free_htab_elem(htab, l);
 		}
+		raw_spin_unlock_bh(&b->lock);
 	}
 	rcu_read_unlock();
-	bpf_map_area_free(htab->buckets);
-	kfree(htab);
+	call_rcu(&htab->rcu, __bpf_htab_free);
 }
 
 static struct htab_elem *alloc_sock_hash_elem(struct bpf_htab *htab,
@@ -2169,19 +2227,6 @@ static struct htab_elem *alloc_sock_hash_elem(struct bpf_htab *htab,
 	return l_new;
 }
 
-static struct htab_elem *lookup_elem_raw(struct hlist_head *head,
-					 u32 hash, void *key, u32 key_size)
-{
-	struct htab_elem *l;
-
-	hlist_for_each_entry_rcu(l, head, hash_node) {
-		if (l->hash == hash && !memcmp(&l->key, key, key_size))
-			return l;
-	}
-
-	return NULL;
-}
-
 static inline u32 htab_map_hash(const void *key, u32 key_len)
 {
 	return jhash(key, key_len, 0);
@@ -2301,8 +2346,9 @@ static int sock_hash_ctx_update_elem(struct bpf_sock_ops_kern *skops,
 		goto bucket_err;
 	}
 
-	e->hash_link = l_new;
-	e->htab = container_of(map, struct bpf_htab, map);
+	rcu_assign_pointer(e->hash_link, l_new);
+	rcu_assign_pointer(e->htab,
+			   container_of(map, struct bpf_htab, map));
 	list_add_tail(&e->list, &psock->maps);
 
 	/* add new element to the head of the list, so that
@@ -2313,8 +2359,10 @@ static int sock_hash_ctx_update_elem(struct bpf_sock_ops_kern *skops,
 		psock = smap_psock_sk(l_old->sk);
 
 		hlist_del_rcu(&l_old->hash_node);
+		write_lock_bh(&l_old->sk->sk_callback_lock);
 		smap_list_hash_remove(psock, l_old);
 		smap_release_sock(psock, l_old->sk);
+		write_unlock_bh(&l_old->sk->sk_callback_lock);
 		free_htab_elem(htab, l_old);
 	}
 	raw_spin_unlock_bh(&b->lock);

^ permalink raw reply related

* [bpf PATCH v3 4/4] bpf: sockhash, add release routine
From: John Fastabend @ 2018-06-22 15:21 UTC (permalink / raw)
  To: john.fastabend, ast, daniel, kafai; +Cc: netdev
In-Reply-To: <20180622151123.24502.56029.stgit@john-Precision-Tower-5810>

Add map_release_uref pointer to hashmap ops. This was dropped when
original sockhash code was ported into bpf-next before initial
commit.

Fixes: 81110384441a ("bpf: sockmap, add hash map support")
Acked-by: Martin KaFai Lau <kafai@fb.com>
Signed-off-by: John Fastabend <john.fastabend@gmail.com>
---
 kernel/bpf/sockmap.c |    1 +
 1 file changed, 1 insertion(+)

diff --git a/kernel/bpf/sockmap.c b/kernel/bpf/sockmap.c
index 333427b..2456986 100644
--- a/kernel/bpf/sockmap.c
+++ b/kernel/bpf/sockmap.c
@@ -2478,6 +2478,7 @@ struct sock  *__sock_hash_lookup_elem(struct bpf_map *map, void *key)
 	.map_get_next_key = sock_hash_get_next_key,
 	.map_update_elem = sock_hash_update_elem,
 	.map_delete_elem = sock_hash_delete_elem,
+	.map_release_uref = sock_map_release,
 };
 
 BPF_CALL_4(bpf_sock_map_update, struct bpf_sock_ops_kern *, bpf_sock,

^ permalink raw reply related

* Re: bnx2x: kernel panic in the bnx2x driver
From: Vishwanath Pai @ 2018-06-22 14:57 UTC (permalink / raw)
  To: Kalluru, Sudarsana, Elior, Ariel, Dept-Eng Everest Linux L2
  Cc: davem@davemloft.net, netdev@vger.kernel.org, dbanerje@akamai.com,
	pai.vishwain@gmail.com
In-Reply-To: <MW2PR07MB4139CC84215EDFEDC980CF098A750@MW2PR07MB4139.namprd07.prod.outlook.com>

Ah, that is great! I will test it out on my machine and let you know.

Thanks,
Vishwanath

On 06/22/2018 10:21 AM, Kalluru, Sudarsana wrote:
> Hi Vishwanath,
>     The config will be cached in the device structure (bp->rss_conf_obj.udp_rss_v4) in this scenario, and will be applied in the load path (bnx2x_nic_load() --> bnx2x_init_rss()). Have unit tested the change on my setup.
>
> Thanks,
> Sudarsana
>
> -----Original Message-----
> From: Vishwanath Pai [mailto:vpai@akamai.com] 
> Sent: 22 June 2018 18:52
> To: Kalluru, Sudarsana <Sudarsana.Kalluru@cavium.com>; Elior, Ariel <Ariel.Elior@cavium.com>; Dept-Eng Everest Linux L2 <Dept-EngEverestLinuxL2@cavium.com>
> Cc: davem@davemloft.net; netdev@vger.kernel.org; dbanerje@akamai.com; pai.vishwain@gmail.com
> Subject: Re: bnx2x: kernel panic in the bnx2x driver
>
> Hi Sudarsana,
>
> Thanks for taking a look at my email. The fix you suggested would definitely fix the kernel panic, but at the same time wouldn't it also silently ignore the request by ethtool to set rx-flow-hash?
>
> Thanks,
> Vishwanath
>
> On 06/22/2018 06:20 AM, Kalluru, Sudarsana wrote:
>> Hi Vishwanath,
>>     Thanks for your mail, and the analysis.
>> The fix would be to invoke bnx2x_rss() only when the device is opened,
>>       if (bp->state == BNX2X_STATE_OPEN)
>>               return bnx2x_rss(bp, &bp->rss_conf_obj, false, true);
>>       else
>>               return 0;
>> Ariel,
>>    Could you please review the path (bnx2x_set_rss_flags()--> bnx2x_rss()) and confirm/correct on the above.
>>
>> Thanks,
>> Sudarsana
>>
>> -----Original Message-----
>> From: Vishwanath Pai [mailto:vpai@akamai.com]
>> Sent: 22 June 2018 10:37
>> To: Elior, Ariel <Ariel.Elior@cavium.com>; Dept-Eng Everest Linux L2 
>> <Dept-EngEverestLinuxL2@cavium.com>
>> Cc: davem@davemloft.net; netdev@vger.kernel.org; dbanerje@akamai.com; 
>> pai.vishwain@gmail.com
>> Subject: bnx2x: kernel panic in the bnx2x driver
>>
>> External Email
>>
>> Hi,
>>
>> We recently noticed a kernel panic in the bnx2x driver when trying to 
>> set rx-flow-hash parameters via ethtool during if-pre-up.d. I am 
>> running kernel
>> v4.17.2 from ubuntu-mainline-ppa. I have added the stack trace below:
>>
>> [   18.280209] BUG: unable to handle kernel NULL pointer dereference at 0000000000000000
>> [   18.280212] PGD 8000000407a79067 P4D 8000000407a79067 PUD 40ce8a067 PMD 0
>> [   18.280214] Oops: 0010 [#1] SMP PTI
>> [   18.280215] Modules linked in: intel_rapl x86_pkg_temp_thermal intel_powerclamp kvm_intel joydev input_led kvm irqbypass crct10dif_pclmul crc32_pclmul ghash_clmulni_intel pcbc hid_eneric aesni_intel gpio_ich aes_x86_64 usbhid lpc_ich crpto_simd ie31200_edac cryptd glue_helper intel_cstate mac_hid intel_rapl_perf bnx2x mdio tcp_bbr netconsole ipmi_devintf ipmi_msghandler i2c_i801 coretemp autofs4 raid10 raid456 libcrc32c async_raid6_recov async_memcpy async_pq async_xor xor async_tx raid6_pq raid1 raid0 multipath linear sha26_mb mcryptd sha256_ssse3 hid ast i2c_algo_bit ttm drm_kms_helper syscopyarea sysfillrect sysimgblt mpt3sas fb_sys_fops drm raid_class scsi_transport_sas ahci libahci shpchp video
>> [   18.280241] CPU: 6 PID: 1081 Comm: ethtool Not tainted 4.17.2-041702-generic #201806160433
>> [   18.280242] Hardware name: Foxconn CangJie/CangJie, BIOS CC1F108D 02/26/2014
>> [   18.280243] RIP: 0010:          (null)
>> [   18.280243] RSP: 0018:ffffb84bc260b9c0 EFLAGS: 00010246
>> [   18.280244] RAX: 0000000000000000 RBX: ffff92f987f020f0 RCX: 0000000000000000
>> [   18.280245] RDX: 0000000000000000 RSI: ffffb84bc260b9f8 RDI: ffff92f987f020f0
>> [   18.280245] RBP: ffffb8bc260b9e8 R08: 0000000000000001 R09: 0000000000000000
>> [   18.280246] R10: ffffb84bc260bd20 R11: 0000000000000000 R12: ffffb84bc260b9f8
>> [   18.280246] R13: ffff92f987f008c0 R14: 00007ffdb75bec40 R15: 0000000000000000
>> [   18.280247] FS:  00007fc0e8798700(0000) GS:ffff92f99fd80000(0000) knlGS:0000000000000000
>> [   18.280248] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
>> [   18.280249] CR2: 0000000000000000 CR3: 0000000409b4c003 CR4: 00000000001606e0
>> [   18.280249] Call Trace:
>> [   18.280263]  ? bnx2x_config_rss+0x2f/0xd0 [bnx2x]
>> [   18.280270]  bnx2x_rss+0x1d9/0x210 [bnx2x]
>> [   18.280276]  bnx2x_set_rxnfc+0x17d/0x380 [bnx2x]
>> [   18.280279]  ethtool_set_rxnfc+0x9b/0x110
>> [   18.280281]  ? __do_page_cache_readahead+0x1da/0x2c0
>> [   18.280283]  ? security_capable+0x3c/0x60
>> [   18.280284]  dev_ethtool+0350/0x2610
>> [   18.280286]  ? page_cache_async_readahead+0x71/0x80
>> [   18.280288]  ? page_add_file_rmap+0x5d/0x220
>> [   18.280290]  ? inet_ioctl+0x182/0x1a0
>> [   18.280291]  dev_ioctl+0x203/0x3f0
>> [   18.280293]  ? dev_ioctl+0x203/0x3f0
>> [   18.280294]  sock_do_ioctl+0xae/0x150
>> [   18.280296]  sock_ioctl+0x1e2/0x330
>> [   18.280296]  ? sock_ioctl+0x1e2/0x330
>> [   18.280299]  do_vfs_ioctl+0xa8/0x620
>> [   18.280300]  ? dlci_ioctl_set+0x30/0x30
>> [   18.280301]  ? do_vfs_ioctl+0xa8/0x620
>> [   18.280302]  ? handle_mm_fault+0xe3/0x220
>> [   18.280304]  ksys_ioctl+0x75/0x80
>> [   18.280305]  __x64_sys_ioctl+0x1a/0x20
>> [   18.280307]  do_syscall_64+0x5a/0x120
>> [   18.280309]  entry_SYSCALL_64_aftr_hwframe+0x44/0xa9
>> [   18.280310] RIP: 0033:0x7fc0e7fba107
>> [   18.280311] RSP: 002b:00007ffdb75beb78 EFLAGS: 00000206 ORIG_RAX: 0000000000000010
>> [   18.280312] RAX: ffffffffffffffda RBX: 0000000000000000 RCX: 00007fc0e7fba107
>> [   18.280312] RDX: 00007ffdb75bed60 RSI: 0000000000008946 RDI: 0000000000000003
>> [   18.280313] RBP: 00007ffdb75bed50 R08: 00007ffdb75bed60 R09: 0000000000000001
>> [   18.280313] R10: 0000000000000541 R11: 0000000000000206 R12: 00007ffdb75beed0
>> [   18.280314] R13: 0000000000421020 R14: 000000000041fe28 R15: 0000000000000003
>> [   18.280315] Code:  Bad RIP value.
>> [   18.280317] RIP:           (null) RSP: ffffb84bc260b9c0
>> [  18.280318] CR2: 0000000000000000
>> [   18.280319] ---[ end trace 5f361db3fb9059f1 ]---
>>
>> To reproduce this I created a bash script in "/etc/network/if-pre-up.d/" with these two lines:
>> ethtool -N $IFACE rx-flow-hash udp4 "sdfn"
>> ethtool -N $IFACE rx-flow-hash udp6 "sdfn"
>>
>> The problem here is that rss_obj in bnx2x struct for the device hasn't 
>> been initialized yet, which causes an exception in bnx2x_config_rss() 
>> when calling "r->set_pending(r)" because r->set_pending is NULL. It 
>> looks like a lot many things haven't been initialized at this point, 
>> most of that happens in this
>> function: "bnx2x_init_bp_objs()" which isn't called until ifup. Any thoughts on how this can be fixed? Would it be possible to safely move bnx2x_init_bp_objs() to maybe bnx2x_init_one() which runs much before ifup?
>>
>> Thanks,
>> Vishwanath
>>

^ permalink raw reply

* Re: [PATCH v2 bpf-net] bpf: Change bpf_fib_lookup to return lookup status
From: Jesper Dangaard Brouer @ 2018-06-22 15:49 UTC (permalink / raw)
  To: Martin KaFai Lau
  Cc: brouer, dsahern, netdev, borkmann, ast, davem, David Ahern
In-Reply-To: <20180621170936.2tobn5lu24l6xuo7@kafai-mbp.dhcp.thefacebook.com>

On Thu, 21 Jun 2018 10:09:36 -0700
Martin KaFai Lau <kafai@fb.com> wrote:

> On Wed, Jun 20, 2018 at 08:00:11PM -0700, dsahern@kernel.org wrote:
> > From: David Ahern <dsahern@gmail.com>
> > 
> > For ACLs implemented using either FIB rules or FIB entries, the BPF
> > program needs the FIB lookup status to be able to drop the packet.
> > Since the bpf_fib_lookup API has not reached a released kernel yet,
> > change the return code to contain an encoding of the FIB lookup
> > result and return the nexthop device index in the params struct.
> > 
> > In addition, inform the BPF program of any post FIB lookup reason as
> > to why the packet needs to go up the stack.
> > 
> > The fib result for unicast routes must have an egress device, so remove
> > the check that it is non-NULL.  
> Acked-by: Martin KaFai Lau <kafai@fb.com>

Acked-by: Jesper Dangaard Brouer <brouer@redhat.com>

-- 
Best regards,
  Jesper Dangaard Brouer
  MSc.CS, Principal Kernel Engineer at Red Hat
  LinkedIn: http://www.linkedin.com/in/brouer

^ permalink raw reply

* Re: [PATCH net V3 1/1] net/smc: coordinate wait queues for nonblocking connect
From: Eric Dumazet @ 2018-06-22 15:49 UTC (permalink / raw)
  To: Ursula Braun, davem
  Cc: netdev, linux-s390, schwidefsky, heiko.carstens, raspl,
	xiyou.wangcong, hch
In-Reply-To: <20180622140139.22981-1-ubraun@linux.ibm.com>



On 06/22/2018 07:01 AM, Ursula Braun wrote:
> The recent poll change may lead to stalls for non-blocking connecting
> SMC sockets, since sock_poll_wait is no longer performed on the
> internal CLC socket, but on the outer SMC socket.  kernel_connect() on
> the internal CLC socket returns with -EINPROGRESS, but the wake up
> logic does not work in all cases. If the internal CLC socket is still
> in state TCP_SYN_SENT when polled, sock_poll_wait() from sock_poll()
> does not sleep. It is supposed to sleep till the state of the internal
> CLC socket switches to TCP_ESTABLISHED.
> 
> This patch temporarily propagates the wait queue from the internal
> CLC sock to the SMC sock, till the non-blocking connect() is
> finished.
> 
> In addition locking is reduced due to the removed poll waits.
> 
> Fixes: c0129a061442 ("smc: convert to ->poll_mask")
> Signed-off-by: Ursula Braun <ubraun@linux.ibm.com>
> ---
>  net/smc/af_smc.c | 13 +++++++++----
>  net/smc/smc.h    |  1 +
>  2 files changed, 10 insertions(+), 4 deletions(-)
> 
> diff --git a/net/smc/af_smc.c b/net/smc/af_smc.c
> index da7f02edcd37..7966e7ddb563 100644
> --- a/net/smc/af_smc.c
> +++ b/net/smc/af_smc.c
> @@ -23,6 +23,7 @@
>  #include <linux/workqueue.h>
>  #include <linux/in.h>
>  #include <linux/sched/signal.h>
> +#include <linux/rcupdate.h>
>  
>  #include <net/sock.h>
>  #include <net/tcp.h>
> @@ -605,6 +606,11 @@ static int smc_connect(struct socket *sock, struct sockaddr *addr,
>  
>  	smc_copy_sock_settings_to_clc(smc);
>  	tcp_sk(smc->clcsock->sk)->syn_smc = 1;
> +	if (flags & O_NONBLOCK) {
> +		smc->smcwq = rcu_access_pointer(sk->sk_wq);
> +		rcu_assign_pointer(sock->sk->sk_wq,
> +				   rcu_access_pointer(smc->clcsock->sk->sk_wq));

That is obfuscation.

The following is much easier to read.

	sock->sk->sk_wq = smc->clcsock->sk->sk_wq;

But, this looks very suspect to me.

Nowhere in the stack we divert sk->sk_wq to something else.

What about rcu users of sock->sk->sk_wq ?

		

> +	}
>  	rc = kernel_connect(smc->clcsock, addr, alen, flags);
>  	if (rc)
>  		goto out;
> @@ -1285,12 +1291,9 @@ static __poll_t smc_poll_mask(struct socket *sock, __poll_t events)
>  
>  	smc = smc_sk(sock->sk);
>  	sock_hold(sk);
> -	lock_sock(sk);
>  	if ((sk->sk_state == SMC_INIT) || smc->use_fallback) {
>  		/* delegate to CLC child sock */
> -		release_sock(sk);
>  		mask = smc->clcsock->ops->poll_mask(smc->clcsock, events);
> -		lock_sock(sk);
>  		sk->sk_err = smc->clcsock->sk->sk_err;
>  		if (sk->sk_err) {
>  			mask |= EPOLLERR;
> @@ -1299,7 +1302,10 @@ static __poll_t smc_poll_mask(struct socket *sock, __poll_t events)
>  			if (sk->sk_state == SMC_INIT &&
>  			    mask & EPOLLOUT &&
>  			    smc->clcsock->sk->sk_state != TCP_CLOSE) {
> +				lock_sock(sk);
> +				rcu_assign_pointer(sock->sk->sk_wq, smc->smcwq);
>  				rc = __smc_connect(smc);
> +				release_sock(sk);
>  				if (rc < 0)
>  					mask |= EPOLLERR;
>  				/* success cases including fallback */
> @@ -1334,7 +1340,6 @@ static __poll_t smc_poll_mask(struct socket *sock, __poll_t events)
>  			mask |= EPOLLPRI;
>  
>  	}
> -	release_sock(sk);
>  	sock_put(sk);
>  
>  	return mask;
> diff --git a/net/smc/smc.h b/net/smc/smc.h
> index 51ae1f10d81a..89d6d7ef973f 100644
> --- a/net/smc/smc.h
> +++ b/net/smc/smc.h
> @@ -190,6 +190,7 @@ struct smc_connection {
>  struct smc_sock {				/* smc sock container */
>  	struct sock		sk;
>  	struct socket		*clcsock;	/* internal tcp socket */
> +	struct socket_wq	*smcwq;		/* original smcsock wq */
>  	struct smc_connection	conn;		/* smc connection */
>  	struct smc_sock		*listen_smc;	/* listen parent */
>  	struct work_struct	tcp_listen_work;/* handle tcp socket accepts */
> 


No refcounting when ->smcwq is set ?

This looks quite risky to me.

^ permalink raw reply

* Re: [PATCH 0/4] docs: e100[0] fix build errors
From: Jeff Kirsher @ 2018-06-22 16:44 UTC (permalink / raw)
  To: Tobin C. Harding, Jonathan Corbet
  Cc: David S. Miller, linux-doc, netdev, linux-kernel
In-Reply-To: <20180622003708.31848-1-me@tobin.cc>

[-- Attachment #1: Type: text/plain, Size: 629 bytes --]

On Fri, 2018-06-22 at 10:37 +1000, Tobin C. Harding wrote:
> I may be a little confused here, I'm getting docs build failure on
> Linus' mainline, linux-next, and your docs-next but different errors.
> There seems to be patches to the first two trees that are not in your
> docs-next tree?
> 
> Do networking docs typically go through your tree?  Looks like
> networking has done some conversion to rst that hasn't gone through
> your
> tree.  Or else my git skills are failing.

Networking documentation changes went through David Miller's networking
tree because he maintains changes under Documentation/networking/

[-- Attachment #2: This is a digitally signed message part --]
[-- Type: application/pgp-signature, Size: 833 bytes --]

^ permalink raw reply

* Re: [PATCH 1/4] Documentation: e100: Use correct heading adornment
From: Jeff Kirsher @ 2018-06-22 16:44 UTC (permalink / raw)
  To: Tobin C. Harding, Jonathan Corbet
  Cc: David S. Miller, linux-doc, netdev, linux-kernel
In-Reply-To: <20180622003708.31848-2-me@tobin.cc>

[-- Attachment #1: Type: text/plain, Size: 732 bytes --]

On Fri, 2018-06-22 at 10:37 +1000, Tobin C. Harding wrote:
> Recently documentation file was converted to rst.  The document title
> has the incorrect heading adornment.  From kernel docs:
> 
>         * Please stick to this order of heading adornments:
> 
>           1. ``=`` with overline for document title::
> 
>                ==============
>                Document title
>                ==============
> 
> Add  overline heading adornment to document title.
> 
> Fixes commit (85d63445f411 Documentation: e100: Update the Intel
> 10/100 driver doc)
> 
> CC: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
> Signed-off-by: Tobin C. Harding <me@tobin.cc>

Acked-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>

[-- Attachment #2: This is a digitally signed message part --]
[-- Type: application/pgp-signature, Size: 833 bytes --]

^ permalink raw reply

* Re: [PATCH 2/4] Documentation: e1000: Use correct heading adornment
From: Jeff Kirsher @ 2018-06-22 16:45 UTC (permalink / raw)
  To: Tobin C. Harding, Jonathan Corbet
  Cc: David S. Miller, linux-doc, netdev, linux-kernel
In-Reply-To: <20180622003708.31848-3-me@tobin.cc>

[-- Attachment #1: Type: text/plain, Size: 726 bytes --]

On Fri, 2018-06-22 at 10:37 +1000, Tobin C. Harding wrote:
> Recently documentation file was converted to rst.  The document title
> has the incorrect heading adornment.  From kernel docs:
> 
>         * Please stick to this order of heading adornments:
> 
>           1. ``=`` with overline for document title::
> 
>                ==============
>                Document title
>                ==============
> 
> Add  overline heading adornment to document title.
> 
> Fixes commit (228046e76189 Documentation: e1000: Update kernel
> documentation)
> 
> CC: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
> Signed-off-by: Tobin C. Harding <me@tobin.cc>

Acked-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>

[-- Attachment #2: This is a digitally signed message part --]
[-- Type: application/pgp-signature, Size: 833 bytes --]

^ permalink raw reply

* Re: [PATCH 3/4] Documentation: e100: Fix docs build error
From: Jeff Kirsher @ 2018-06-22 16:46 UTC (permalink / raw)
  To: Tobin C. Harding, Jonathan Corbet
  Cc: David S. Miller, linux-doc, netdev, linux-kernel
In-Reply-To: <20180622003708.31848-4-me@tobin.cc>

[-- Attachment #1: Type: text/plain, Size: 970 bytes --]

On Fri, 2018-06-22 at 10:37 +1000, Tobin C. Harding wrote:
> Recent patch updated e100 docs to rst format.  Docs build (`make
> htmldocs`) is currently failing due to this file with error:
> 
>         (SEVERE/4) Unexpected section title.
> 
> This is because a section of the file is indented 2 spaces.  Build
> error
> can be cleared by aligning the text with column 0.  While we are
> changing
> these lines we can make sure line length does not exceed 72, that
> newlines following headings are uniform, and that full stops are
> followed by two spaces.
> 
> Align text with column 0, limit line length to 72, ensure two spaces
> follow all full stops, ensure uniform use of newlines after heading.
> 
> Fixes commit (85d63445f411 Documentation: e100: Update the Intel
> 10/100 driver doc)
> 
> CC: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
> Signed-off-by: Tobin C. Harding <me@tobin.cc>

Acked-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>

[-- Attachment #2: This is a digitally signed message part --]
[-- Type: application/pgp-signature, Size: 833 bytes --]

^ permalink raw reply

* Re: [PATCH 4/4] Documentation: e1000: Fix docs build error
From: Jeff Kirsher @ 2018-06-22 16:47 UTC (permalink / raw)
  To: Tobin C. Harding, Jonathan Corbet
  Cc: David S. Miller, linux-doc, netdev, linux-kernel
In-Reply-To: <20180622003708.31848-5-me@tobin.cc>

[-- Attachment #1: Type: text/plain, Size: 965 bytes --]

On Fri, 2018-06-22 at 10:37 +1000, Tobin C. Harding wrote:
> Recent patch updated e1000 docs to rst format.  Docs build (`make
> htmldocs`) is currently failing due to this file with error:
> 
>         (SEVERE/4) Unexpected section title.
> 
> This is because a section of the file is indented 2 spaces.  Build
> error
> can be cleared by aligning the text with column 0.  While we are
> changing
> these lines we can make sure line length does not exceed 72, that
> newlines following headings are uniform, and that full stops are
> followed by two spaces.
> 
> Align text with column 0, limit line length to 72, ensure two spaces
> follow all full stops, ensure uniform use of newlines after heading.
> 
> Fixes commit (228046e76189 Documentation: e1000: Update kernel
> documentation)
> 
> CC: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
> Signed-off-by: Tobin C. Harding <me@tobin.cc>

Acked-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>

[-- Attachment #2: This is a digitally signed message part --]
[-- Type: application/pgp-signature, Size: 833 bytes --]

^ permalink raw reply

* Fw: [Bug 200191] New: UBSAN: Undefined behaviour in ./include/net/xfrm.h:894
From: Stephen Hemminger @ 2018-06-22 16:54 UTC (permalink / raw)
  To: steffen.klassert, herbert; +Cc: netdev



Begin forwarded message:

Date: Fri, 22 Jun 2018 15:20:06 +0000
From: bugzilla-daemon@bugzilla.kernel.org
To: stephen@networkplumber.org
Subject: [Bug 200191] New: UBSAN: Undefined behaviour in ./include/net/xfrm.h:894


https://bugzilla.kernel.org/show_bug.cgi?id=200191

            Bug ID: 200191
           Summary: UBSAN: Undefined behaviour in ./include/net/xfrm.h:894
           Product: Networking
           Version: 2.5
    Kernel Version: v4.18-rc2
          Hardware: All
                OS: Linux
              Tree: Mainline
            Status: NEW
          Severity: normal
          Priority: P1
         Component: Other
          Assignee: stephen@networkplumber.org
          Reporter: icytxw@gmail.com
        Regression: No

static inline bool addr4_match(__be32 a1, __be32 a2, u8 prefixlen)
{
        /* C99 6.5.7 (3): u32 << 32 is undefined behaviour */
        if (sizeof(long) == 4 && prefixlen == 0)
                return true;
        return !((a1 ^ a2) & htonl(~0UL << (32 - prefixlen)));
}


$ cat report0 
================================================================================
UBSAN: Undefined behaviour in ./include/net/xfrm.h:894:23
shift exponent -128 is negative
CPU: 0 PID: 6190 Comm: syz-executor1 Not tainted 4.18.0-rc1 #2
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS
rel-1.10.2-0-g5f4c7b1-prebuilt.qemu-project.org 04/01/2014
Call Trace:
 __dump_stack lib/dump_stack.c:77 [inline]
 dump_stack+0x122/0x1c8 lib/dump_stack.c:113
 ubsan_epilogue+0x12/0x86 lib/ubsan.c:159
 __ubsan_handle_shift_out_of_bounds+0x29a/0x2ff lib/ubsan.c:425
 addr4_match include/net/xfrm.h:894 [inline]
 __xfrm4_selector_match net/xfrm/xfrm_policy.c:77 [inline]
 xfrm_selector_match+0xde9/0x11e0 net/xfrm/xfrm_policy.c:102
 xfrm_sk_policy_lookup+0x179/0x460 net/xfrm/xfrm_policy.c:1178
 xfrm_lookup+0x20e/0x1be0 net/xfrm/xfrm_policy.c:2149
 xfrm_lookup_route+0x42/0x1f0 net/xfrm/xfrm_policy.c:2282
 ip_route_output_flow+0x86/0xc0 net/ipv4/route.c:2588
 udp_sendmsg+0x15c1/0x2180 net/ipv4/udp.c:1086
 inet_sendmsg+0x103/0x490 net/ipv4/af_inet.c:798
 sock_sendmsg_nosec net/socket.c:645 [inline]
 sock_sendmsg+0xf9/0x180 net/socket.c:655
 __sys_sendto+0x239/0x3c0 net/socket.c:1833
 __do_sys_sendto net/socket.c:1845 [inline]
 __se_sys_sendto net/socket.c:1841 [inline]
 __x64_sys_sendto+0xef/0x1c0 net/socket.c:1841
 do_syscall_64+0xb8/0x3a0 arch/x86/entry/common.c:290
 entry_SYSCALL_64_after_hwframe+0x49/0xbe
RIP: 0033:0x455a09
Code: 1d ba fb ff c3 66 2e 0f 1f 84 00 00 00 00 00 66 90 48 89 f8 48 89 f7 48
89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff ff 0f 83
eb b9 fb ff c3 66 2e 0f 1f 84 00 00 00 00 
RSP: 002b:00007f0b710bdc68 EFLAGS: 00000246 ORIG_RAX: 000000000000002c
RAX: ffffffffffffffda RBX: 00007f0b710be6d4 RCX: 0000000000455a09
RDX: 0000000000000000 RSI: 00000000200014c0 RDI: 0000000000000013
RBP: 000000000072bea0 R08: 0000000020001540 R09: 0000000000000010
R10: 0000000000000000 R11: 0000000000000246 R12: 00000000ffffffff
R13: 00000000000005d7 R14: 00000000006fdcc8 R15: 0000000000000000
================================================================================
sr 1:0:0:0: [sr0] unaligned transfer
sr 1:0:0:0: [sr0] unaligned transfer
sr 1:0:0:0: [sr0] unaligned transfer
sr 1:0:0:0: [sr0] unaligned transfer
sr 1:0:0:0: [sr0] unaligned transfer
sr 1:0:0:0: [sr0] unaligned transfer
sr 1:0:0:0: [sr0] unaligned transfer
EXT4-fs (sda): re-mounted. Opts: jqfmt=vfsold,
sr 1:0:0:0: [sr0] unaligned transfer
sr 1:0:0:0: [sr0] unaligned transfer
audit: type=1326 audit(1529680282.002:2): auid=4294967295 uid=0 gid=0
ses=4294967295 subj=kernel pid=6558 comm="syz-executor0" exe="/syz-executor0"
sig=31 arch=c000003e syscall=202 compat=0 ip=0x455a09 code=0x0
sr 1:0:0:0: [sr0] tag#0 FAILED Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE
sr 1:0:0:0: [sr0] tag#0 Sense Key : Not Ready [current] 
sr 1:0:0:0: [sr0] tag#0 Add. Sense: Medium not present
sr 1:0:0:0: [sr0] tag#0 CDB: Read(10) 28 00 00 00 00 00 00 00 40 00
print_req_error: 25 callbacks suppressed
print_req_error: I/O error, dev sr0, sector 0
sr 1:0:0:0: [sr0] tag#0 FAILED Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE
sr 1:0:0:0: [sr0] tag#0 Sense Key : Not Ready [current] 
sr 1:0:0:0: [sr0] tag#0 Add. Sense: Medium not present
sr 1:0:0:0: [sr0] tag#0 CDB: Read(10) 28 00 00 00 00 00 00 00 40 00
print_req_error: I/O error, dev sr0, sector 0
sr 1:0:0:0: [sr0] tag#0 FAILED Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE
sr 1:0:0:0: [sr0] tag#0 Sense Key : Not Ready [current] 
sr 1:0:0:0: [sr0] tag#0 Add. Sense: Medium not present
sr 1:0:0:0: [sr0] tag#0 CDB: Read(10) 28 00 00 00 00 00 00 00 40 00
print_req_error: I/O error, dev sr0, sector 0
sr 1:0:0:0: [sr0] tag#0 FAILED Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE
sr 1:0:0:0: [sr0] tag#0 Sense Key : Not Ready [current] 
sr 1:0:0:0: [sr0] tag#0 Add. Sense: Medium not present
sr 1:0:0:0: [sr0] tag#0 CDB: Read(10) 28 00 00 00 00 00 00 00 40 00
print_req_error: I/O error, dev sr0, sector 0
cgroup: cgroup2: unknown option ""
cgroup: cgroup2: unknown option ""

-- 
You are receiving this mail because:
You are the assignee for the bug.

^ permalink raw reply

* [PATCH] net: drivers/net: Convert random_ether_addr to eth_random_addr
From: Joe Perches @ 2018-06-22 17:51 UTC (permalink / raw)
  To: Derek Chickles, Satanand Burla, Felix Manlunas, Raghu Vatsavayi,
	Hans Ulli Kroll, Linus Walleij, Yisen Zhuang, Salil Mehta,
	Jeff Kirsher, Bryan Whitehead, Microchip Linux Driver Support,
	Harish Patil, Manish Chopra, Dept-GELinuxNICDev,
	Solarflare linux maintainers, Edward Cree, Bert Kenward,
	Grygorii Strashko, Wingman Kwok, Murali 
  Cc: b.a.t.m.a.n, netdev, linux-usb, linux-wireless, linux-kernel,
	intel-wired-lan, Kalle Valo, linux-ntb, linux-omap,
	linux-arm-kernel

random_ether_addr is a #define for eth_random_addr which is
generally preferred in kernel code by ~3:1

Convert the uses of random_ether_addr to enable removing the #define

Miscellanea:

o Convert &vfmac[0] to equivalent vfmac and avoid unnecessary line wrap

Signed-off-by: Joe Perches <joe@perches.com>
---
 drivers/net/ethernet/cavium/liquidio/lio_main.c          | 5 ++---
 drivers/net/ethernet/cortina/gemini.c                    | 2 +-
 drivers/net/ethernet/hisilicon/hip04_eth.c               | 2 +-
 drivers/net/ethernet/intel/i40e/i40e_main.c              | 2 +-
 drivers/net/ethernet/microchip/lan743x_main.c            | 2 +-
 drivers/net/ethernet/qlogic/qlcnic/qlcnic_sriov_common.c | 2 +-
 drivers/net/ethernet/qualcomm/rmnet/rmnet_vnd.c          | 2 +-
 drivers/net/ethernet/sfc/ef10_sriov.c                    | 2 +-
 drivers/net/ethernet/ti/cpsw.c                           | 2 +-
 drivers/net/ethernet/ti/netcp_core.c                     | 4 ++--
 drivers/net/ntb_netdev.c                                 | 2 +-
 drivers/net/usb/lan78xx.c                                | 2 +-
 drivers/net/wireless/ath/ath9k/hw.c                      | 2 +-
 net/batman-adv/bridge_loop_avoidance.c                   | 2 +-
 14 files changed, 16 insertions(+), 17 deletions(-)

diff --git a/drivers/net/ethernet/cavium/liquidio/lio_main.c b/drivers/net/ethernet/cavium/liquidio/lio_main.c
index 8a815bb57177..7cb4e753829b 100644
--- a/drivers/net/ethernet/cavium/liquidio/lio_main.c
+++ b/drivers/net/ethernet/cavium/liquidio/lio_main.c
@@ -3569,9 +3569,8 @@ static int setup_nic_devices(struct octeon_device *octeon_dev)
 		for (j = 0; j < octeon_dev->sriov_info.max_vfs; j++) {
 			u8 vfmac[ETH_ALEN];
 
-			random_ether_addr(&vfmac[0]);
-			if (__liquidio_set_vf_mac(netdev, j,
-						  &vfmac[0], false)) {
+			eth_random_addr(vfmac);
+			if (__liquidio_set_vf_mac(netdev, j, vfmac, false)) {
 				dev_err(&octeon_dev->pci_dev->dev,
 					"Error setting VF%d MAC address\n",
 					j);
diff --git a/drivers/net/ethernet/cortina/gemini.c b/drivers/net/ethernet/cortina/gemini.c
index 6d7404f66f84..ce1f04fdbf70 100644
--- a/drivers/net/ethernet/cortina/gemini.c
+++ b/drivers/net/ethernet/cortina/gemini.c
@@ -2435,7 +2435,7 @@ static int gemini_ethernet_port_probe(struct platform_device *pdev)
 			port->mac_addr[0], port->mac_addr[1],
 			port->mac_addr[2]);
 		dev_info(dev, "using a random ethernet address\n");
-		random_ether_addr(netdev->dev_addr);
+		eth_random_addr(netdev->dev_addr);
 	}
 	gmac_write_mac_address(netdev);
 
diff --git a/drivers/net/ethernet/hisilicon/hip04_eth.c b/drivers/net/ethernet/hisilicon/hip04_eth.c
index 340e28211135..14374a856d30 100644
--- a/drivers/net/ethernet/hisilicon/hip04_eth.c
+++ b/drivers/net/ethernet/hisilicon/hip04_eth.c
@@ -904,7 +904,7 @@ static int hip04_mac_probe(struct platform_device *pdev)
 		hip04_config_port(ndev, SPEED_100, DUPLEX_FULL);
 
 	hip04_config_fifo(priv);
-	random_ether_addr(ndev->dev_addr);
+	eth_random_addr(ndev->dev_addr);
 	hip04_update_mac_address(ndev);
 
 	ret = hip04_alloc_ring(ndev, d);
diff --git a/drivers/net/ethernet/intel/i40e/i40e_main.c b/drivers/net/ethernet/intel/i40e/i40e_main.c
index c944bd10b03d..95e9dfbe9839 100644
--- a/drivers/net/ethernet/intel/i40e/i40e_main.c
+++ b/drivers/net/ethernet/intel/i40e/i40e_main.c
@@ -11978,7 +11978,7 @@ static int i40e_config_netdev(struct i40e_vsi *vsi)
 		snprintf(netdev->name, IFNAMSIZ, "%.*sv%%d",
 			 IFNAMSIZ - 4,
 			 pf->vsi[pf->lan_vsi]->netdev->name);
-		random_ether_addr(mac_addr);
+		eth_random_addr(mac_addr);
 
 		spin_lock_bh(&vsi->mac_filter_hash_lock);
 		i40e_add_mac_filter(vsi, mac_addr);
diff --git a/drivers/net/ethernet/microchip/lan743x_main.c b/drivers/net/ethernet/microchip/lan743x_main.c
index dd947e4dd3ce..e1747a490066 100644
--- a/drivers/net/ethernet/microchip/lan743x_main.c
+++ b/drivers/net/ethernet/microchip/lan743x_main.c
@@ -828,7 +828,7 @@ static int lan743x_mac_init(struct lan743x_adapter *adapter)
 	}
 
 	if (!mac_address_valid)
-		random_ether_addr(adapter->mac_address);
+		eth_random_addr(adapter->mac_address);
 	lan743x_mac_set_address(adapter, adapter->mac_address);
 	ether_addr_copy(netdev->dev_addr, adapter->mac_address);
 	return 0;
diff --git a/drivers/net/ethernet/qlogic/qlcnic/qlcnic_sriov_common.c b/drivers/net/ethernet/qlogic/qlcnic/qlcnic_sriov_common.c
index 0c744b9c6e0a..77e386ebff09 100644
--- a/drivers/net/ethernet/qlogic/qlcnic/qlcnic_sriov_common.c
+++ b/drivers/net/ethernet/qlogic/qlcnic/qlcnic_sriov_common.c
@@ -212,7 +212,7 @@ int qlcnic_sriov_init(struct qlcnic_adapter *adapter, int num_vfs)
 			vp->max_tx_bw = MAX_BW;
 			vp->min_tx_bw = MIN_BW;
 			vp->spoofchk = false;
-			random_ether_addr(vp->mac);
+			eth_random_addr(vp->mac);
 			dev_info(&adapter->pdev->dev,
 				 "MAC Address %pM is configured for VF %d\n",
 				 vp->mac, i);
diff --git a/drivers/net/ethernet/qualcomm/rmnet/rmnet_vnd.c b/drivers/net/ethernet/qualcomm/rmnet/rmnet_vnd.c
index b9a7548ec6a0..0afc3d335d56 100644
--- a/drivers/net/ethernet/qualcomm/rmnet/rmnet_vnd.c
+++ b/drivers/net/ethernet/qualcomm/rmnet/rmnet_vnd.c
@@ -210,7 +210,7 @@ void rmnet_vnd_setup(struct net_device *rmnet_dev)
 	rmnet_dev->netdev_ops = &rmnet_vnd_ops;
 	rmnet_dev->mtu = RMNET_DFLT_PACKET_SIZE;
 	rmnet_dev->needed_headroom = RMNET_NEEDED_HEADROOM;
-	random_ether_addr(rmnet_dev->dev_addr);
+	eth_random_addr(rmnet_dev->dev_addr);
 	rmnet_dev->tx_queue_len = RMNET_TX_QUEUE_LEN;
 
 	/* Raw IP mode */
diff --git a/drivers/net/ethernet/sfc/ef10_sriov.c b/drivers/net/ethernet/sfc/ef10_sriov.c
index 019cef1d3cf7..8820be83ce85 100644
--- a/drivers/net/ethernet/sfc/ef10_sriov.c
+++ b/drivers/net/ethernet/sfc/ef10_sriov.c
@@ -199,7 +199,7 @@ static int efx_ef10_sriov_alloc_vf_vswitching(struct efx_nic *efx)
 		return -ENOMEM;
 
 	for (i = 0; i < efx->vf_count; i++) {
-		random_ether_addr(nic_data->vf[i].mac);
+		eth_random_addr(nic_data->vf[i].mac);
 		nic_data->vf[i].efx = NULL;
 		nic_data->vf[i].vlan = EFX_EF10_NO_VLAN;
 
diff --git a/drivers/net/ethernet/ti/cpsw.c b/drivers/net/ethernet/ti/cpsw.c
index 358edab9e72e..093998124149 100644
--- a/drivers/net/ethernet/ti/cpsw.c
+++ b/drivers/net/ethernet/ti/cpsw.c
@@ -2927,7 +2927,7 @@ static int cpsw_probe_dual_emac(struct cpsw_priv *priv)
 		dev_info(cpsw->dev, "cpsw: Detected MACID = %pM\n",
 			 priv_sl2->mac_addr);
 	} else {
-		random_ether_addr(priv_sl2->mac_addr);
+		eth_random_addr(priv_sl2->mac_addr);
 		dev_info(cpsw->dev, "cpsw: Random MACID = %pM\n",
 			 priv_sl2->mac_addr);
 	}
diff --git a/drivers/net/ethernet/ti/netcp_core.c b/drivers/net/ethernet/ti/netcp_core.c
index e40aa3e31af2..6ebf110cd594 100644
--- a/drivers/net/ethernet/ti/netcp_core.c
+++ b/drivers/net/ethernet/ti/netcp_core.c
@@ -2052,7 +2052,7 @@ static int netcp_create_interface(struct netcp_device *netcp_device,
 		if (is_valid_ether_addr(efuse_mac_addr))
 			ether_addr_copy(ndev->dev_addr, efuse_mac_addr);
 		else
-			random_ether_addr(ndev->dev_addr);
+			eth_random_addr(ndev->dev_addr);
 
 		devm_iounmap(dev, efuse);
 		devm_release_mem_region(dev, res.start, size);
@@ -2061,7 +2061,7 @@ static int netcp_create_interface(struct netcp_device *netcp_device,
 		if (mac_addr)
 			ether_addr_copy(ndev->dev_addr, mac_addr);
 		else
-			random_ether_addr(ndev->dev_addr);
+			eth_random_addr(ndev->dev_addr);
 	}
 
 	ret = of_property_read_string(node_interface, "rx-channel",
diff --git a/drivers/net/ntb_netdev.c b/drivers/net/ntb_netdev.c
index 306a662eba94..df8d49ad48c3 100644
--- a/drivers/net/ntb_netdev.c
+++ b/drivers/net/ntb_netdev.c
@@ -430,7 +430,7 @@ static int ntb_netdev_probe(struct device *client_dev)
 	ndev->hw_features = ndev->features;
 	ndev->watchdog_timeo = msecs_to_jiffies(NTB_TX_TIMEOUT_MS);
 
-	random_ether_addr(ndev->perm_addr);
+	eth_random_addr(ndev->perm_addr);
 	memcpy(ndev->dev_addr, ndev->perm_addr, ndev->addr_len);
 
 	ndev->netdev_ops = &ntb_netdev_ops;
diff --git a/drivers/net/usb/lan78xx.c b/drivers/net/usb/lan78xx.c
index 8dff87ec6d99..a89570f34937 100644
--- a/drivers/net/usb/lan78xx.c
+++ b/drivers/net/usb/lan78xx.c
@@ -1720,7 +1720,7 @@ static void lan78xx_init_mac_address(struct lan78xx_net *dev)
 				  "MAC address read from EEPROM");
 		} else {
 			/* generate random MAC */
-			random_ether_addr(addr);
+			eth_random_addr(addr);
 			netif_dbg(dev, ifup, dev->net,
 				  "MAC address set to random addr");
 		}
diff --git a/drivers/net/wireless/ath/ath9k/hw.c b/drivers/net/wireless/ath/ath9k/hw.c
index e60bea4604e4..1665066f4e24 100644
--- a/drivers/net/wireless/ath/ath9k/hw.c
+++ b/drivers/net/wireless/ath/ath9k/hw.c
@@ -496,7 +496,7 @@ static void ath9k_hw_init_macaddr(struct ath_hw *ah)
 	ath_err(common, "eeprom contains invalid mac address: %pM\n",
 		common->macaddr);
 
-	random_ether_addr(common->macaddr);
+	eth_random_addr(common->macaddr);
 	ath_err(common, "random mac address will be used: %pM\n",
 		common->macaddr);
 
diff --git a/net/batman-adv/bridge_loop_avoidance.c b/net/batman-adv/bridge_loop_avoidance.c
index a2de5a44bd41..ff9659af6b91 100644
--- a/net/batman-adv/bridge_loop_avoidance.c
+++ b/net/batman-adv/bridge_loop_avoidance.c
@@ -1449,7 +1449,7 @@ static void batadv_bla_periodic_work(struct work_struct *work)
 		 * detection frames. Set the locally administered bit to avoid
 		 * collisions with users mac addresses.
 		 */
-		random_ether_addr(bat_priv->bla.loopdetect_addr);
+		eth_random_addr(bat_priv->bla.loopdetect_addr);
 		bat_priv->bla.loopdetect_addr[0] = 0xba;
 		bat_priv->bla.loopdetect_addr[1] = 0xbe;
 		bat_priv->bla.loopdetect_lasttime = jiffies;
-- 
2.15.0

^ permalink raw reply related

* Re: bnx2x: kernel panic in the bnx2x driver
From: Vishwanath Pai @ 2018-06-22 18:27 UTC (permalink / raw)
  To: Sudarsana.Kalluru, Ariel.Elior, Dept-EngEverestLinuxL2
  Cc: davem, netdev, dbanerje, pai.vishwain

The patch below worked for me (on 4.14.51 LTS kernel):

diff --git a/drivers/net/ethernet/broadcom/bnx2x/bnx2x_ethtool.c b/drivers/net/ethernet/broadcom/bnx2x/bnx2x_ethtool.c
index 1e33abd..2b3863a 100644
--- a/drivers/net/ethernet/broadcom/bnx2x/bnx2x_ethtool.c
+++ b/drivers/net/ethernet/broadcom/bnx2x/bnx2x_ethtool.c
@@ -3387,14 +3387,20 @@ static int bnx2x_set_rss_flags(struct bnx2x *bp, struct ethtool_rxnfc *info)
 			DP(BNX2X_MSG_ETHTOOL,
 			   "rss re-configured, UDP 4-tupple %s\n",
 			   udp_rss_requested ? "enabled" : "disabled");
-			return bnx2x_rss(bp, &bp->rss_conf_obj, false, true);
+			if (bp->state == BNX2X_STATE_OPEN)
+				return bnx2x_rss(bp, &bp->rss_conf_obj, false, true);
+			else
+				return 0;
 		} else if ((info->flow_type == UDP_V6_FLOW) &&
 			   (bp->rss_conf_obj.udp_rss_v6 != udp_rss_requested)) {
 			bp->rss_conf_obj.udp_rss_v6 = udp_rss_requested;
 			DP(BNX2X_MSG_ETHTOOL,
 			   "rss re-configured, UDP 4-tupple %s\n",
 			   udp_rss_requested ? "enabled" : "disabled");
-			return bnx2x_rss(bp, &bp->rss_conf_obj, false, true);
+			if (bp->state == BNX2X_STATE_OPEN)
+				return bnx2x_rss(bp, &bp->rss_conf_obj, false, true);
+			else
+				return 0;
 		}
 		return 0;
 
Although I think there might be another place where we may need to fix this as well:
bnx2x_config_rss_eth()

Thanks,
Vishwanath

On 06/22/2018 10:57 AM, Vishwanath Pai wrote:
> Ah, that is great! I will test it out on my machine and let you know.
>
> Thanks,
> Vishwanath
>
> On 06/22/2018 10:21 AM, Kalluru, Sudarsana wrote:
>> Hi Vishwanath,
>>     The config will be cached in the device structure (bp->rss_conf_obj.udp_rss_v4) in this scenario, and will be applied in the load path (bnx2x_nic_load() --> bnx2x_init_rss()). Have unit tested the change on my setup.
>>
>> Thanks,
>> Sudarsana
>>
>> -----Original Message-----
>> From: Vishwanath Pai [mailto:vpai@akamai.com] 
>> Sent: 22 June 2018 18:52
>> To: Kalluru, Sudarsana <Sudarsana.Kalluru@cavium.com>; Elior, Ariel <Ariel.Elior@cavium.com>; Dept-Eng Everest Linux L2 <Dept-EngEverestLinuxL2@cavium.com>
>> Cc: davem@davemloft.net; netdev@vger.kernel.org; dbanerje@akamai.com; pai.vishwain@gmail.com
>> Subject: Re: bnx2x: kernel panic in the bnx2x driver
>>
>> Hi Sudarsana,
>>
>> Thanks for taking a look at my email. The fix you suggested would definitely fix the kernel panic, but at the same time wouldn't it also silently ignore the request by ethtool to set rx-flow-hash?
>>
>> Thanks,
>> Vishwanath
>>
>> On 06/22/2018 06:20 AM, Kalluru, Sudarsana wrote:
>>> Hi Vishwanath,
>>>     Thanks for your mail, and the analysis.
>>> The fix would be to invoke bnx2x_rss() only when the device is opened,
>>>       if (bp->state == BNX2X_STATE_OPEN)
>>>               return bnx2x_rss(bp, &bp->rss_conf_obj, false, true);
>>>       else
>>>               return 0;
>>> Ariel,
>>>    Could you please review the path (bnx2x_set_rss_flags()--> bnx2x_rss()) and confirm/correct on the above.
>>>
>>> Thanks,
>>> Sudarsana
>>>
>>> -----Original Message-----
>>> From: Vishwanath Pai [mailto:vpai@akamai.com]
>>> Sent: 22 June 2018 10:37
>>> To: Elior, Ariel <Ariel.Elior@cavium.com>; Dept-Eng Everest Linux L2 
>>> <Dept-EngEverestLinuxL2@cavium.com>
>>> Cc: davem@davemloft.net; netdev@vger.kernel.org; dbanerje@akamai.com; 
>>> pai.vishwain@gmail.com
>>> Subject: bnx2x: kernel panic in the bnx2x driver
>>>
>>> External Email
>>>
>>> Hi,
>>>
>>> We recently noticed a kernel panic in the bnx2x driver when trying to 
>>> set rx-flow-hash parameters via ethtool during if-pre-up.d. I am 
>>> running kernel
>>> v4.17.2 from ubuntu-mainline-ppa. I have added the stack trace below:
>>>
>>> [   18.280209] BUG: unable to handle kernel NULL pointer dereference at 0000000000000000
>>> [   18.280212] PGD 8000000407a79067 P4D 8000000407a79067 PUD 40ce8a067 PMD 0
>>> [   18.280214] Oops: 0010 [#1] SMP PTI
>>> [   18.280215] Modules linked in: intel_rapl x86_pkg_temp_thermal intel_powerclamp kvm_intel joydev input_led kvm irqbypass crct10dif_pclmul crc32_pclmul ghash_clmulni_intel pcbc hid_eneric aesni_intel gpio_ich aes_x86_64 usbhid lpc_ich crpto_simd ie31200_edac cryptd glue_helper intel_cstate mac_hid intel_rapl_perf bnx2x mdio tcp_bbr netconsole ipmi_devintf ipmi_msghandler i2c_i801 coretemp autofs4 raid10 raid456 libcrc32c async_raid6_recov async_memcpy async_pq async_xor xor async_tx raid6_pq raid1 raid0 multipath linear sha26_mb mcryptd sha256_ssse3 hid ast i2c_algo_bit ttm drm_kms_helper syscopyarea sysfillrect sysimgblt mpt3sas fb_sys_fops drm raid_class scsi_transport_sas ahci libahci shpchp video
>>> [   18.280241] CPU: 6 PID: 1081 Comm: ethtool Not tainted 4.17.2-041702-generic #201806160433
>>> [   18.280242] Hardware name: Foxconn CangJie/CangJie, BIOS CC1F108D 02/26/2014
>>> [   18.280243] RIP: 0010:          (null)
>>> [   18.280243] RSP: 0018:ffffb84bc260b9c0 EFLAGS: 00010246
>>> [   18.280244] RAX: 0000000000000000 RBX: ffff92f987f020f0 RCX: 0000000000000000
>>> [   18.280245] RDX: 0000000000000000 RSI: ffffb84bc260b9f8 RDI: ffff92f987f020f0
>>> [   18.280245] RBP: ffffb8bc260b9e8 R08: 0000000000000001 R09: 0000000000000000
>>> [   18.280246] R10: ffffb84bc260bd20 R11: 0000000000000000 R12: ffffb84bc260b9f8
>>> [   18.280246] R13: ffff92f987f008c0 R14: 00007ffdb75bec40 R15: 0000000000000000
>>> [   18.280247] FS:  00007fc0e8798700(0000) GS:ffff92f99fd80000(0000) knlGS:0000000000000000
>>> [   18.280248] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
>>> [   18.280249] CR2: 0000000000000000 CR3: 0000000409b4c003 CR4: 00000000001606e0
>>> [   18.280249] Call Trace:
>>> [   18.280263]  ? bnx2x_config_rss+0x2f/0xd0 [bnx2x]
>>> [   18.280270]  bnx2x_rss+0x1d9/0x210 [bnx2x]
>>> [   18.280276]  bnx2x_set_rxnfc+0x17d/0x380 [bnx2x]
>>> [   18.280279]  ethtool_set_rxnfc+0x9b/0x110
>>> [   18.280281]  ? __do_page_cache_readahead+0x1da/0x2c0
>>> [   18.280283]  ? security_capable+0x3c/0x60
>>> [   18.280284]  dev_ethtool+0350/0x2610
>>> [   18.280286]  ? page_cache_async_readahead+0x71/0x80
>>> [   18.280288]  ? page_add_file_rmap+0x5d/0x220
>>> [   18.280290]  ? inet_ioctl+0x182/0x1a0
>>> [   18.280291]  dev_ioctl+0x203/0x3f0
>>> [   18.280293]  ? dev_ioctl+0x203/0x3f0
>>> [   18.280294]  sock_do_ioctl+0xae/0x150
>>> [   18.280296]  sock_ioctl+0x1e2/0x330
>>> [   18.280296]  ? sock_ioctl+0x1e2/0x330
>>> [   18.280299]  do_vfs_ioctl+0xa8/0x620
>>> [   18.280300]  ? dlci_ioctl_set+0x30/0x30
>>> [   18.280301]  ? do_vfs_ioctl+0xa8/0x620
>>> [   18.280302]  ? handle_mm_fault+0xe3/0x220
>>> [   18.280304]  ksys_ioctl+0x75/0x80
>>> [   18.280305]  __x64_sys_ioctl+0x1a/0x20
>>> [   18.280307]  do_syscall_64+0x5a/0x120
>>> [   18.280309]  entry_SYSCALL_64_aftr_hwframe+0x44/0xa9
>>> [   18.280310] RIP: 0033:0x7fc0e7fba107
>>> [   18.280311] RSP: 002b:00007ffdb75beb78 EFLAGS: 00000206 ORIG_RAX: 0000000000000010
>>> [   18.280312] RAX: ffffffffffffffda RBX: 0000000000000000 RCX: 00007fc0e7fba107
>>> [   18.280312] RDX: 00007ffdb75bed60 RSI: 0000000000008946 RDI: 0000000000000003
>>> [   18.280313] RBP: 00007ffdb75bed50 R08: 00007ffdb75bed60 R09: 0000000000000001
>>> [   18.280313] R10: 0000000000000541 R11: 0000000000000206 R12: 00007ffdb75beed0
>>> [   18.280314] R13: 0000000000421020 R14: 000000000041fe28 R15: 0000000000000003
>>> [   18.280315] Code:  Bad RIP value.
>>> [   18.280317] RIP:           (null) RSP: ffffb84bc260b9c0
>>> [  18.280318] CR2: 0000000000000000
>>> [   18.280319] ---[ end trace 5f361db3fb9059f1 ]---
>>>
>>> To reproduce this I created a bash script in "/etc/network/if-pre-up.d/" with these two lines:
>>> ethtool -N $IFACE rx-flow-hash udp4 "sdfn"
>>> ethtool -N $IFACE rx-flow-hash udp6 "sdfn"
>>>
>>> The problem here is that rss_obj in bnx2x struct for the device hasn't 
>>> been initialized yet, which causes an exception in bnx2x_config_rss() 
>>> when calling "r->set_pending(r)" because r->set_pending is NULL. It 
>>> looks like a lot many things haven't been initialized at this point, 
>>> most of that happens in this
>>> function: "bnx2x_init_bp_objs()" which isn't called until ifup. Any thoughts on how this can be fixed? Would it be possible to safely move bnx2x_init_bp_objs() to maybe bnx2x_init_one() which runs much before ifup?
>>>
>>> Thanks,
>>> Vishwanath
>>>
>

^ permalink raw reply related


This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox