Netdev List
* [PATCH net] net: af_packet: fix race in PACKET_{R|T}X_RING
From: Eric Dumazet @ 2018-04-16  0:52 UTC (permalink / raw)
  To: David S . Miller; +Cc: netdev, Eric Dumazet, Eric Dumazet

In order to remove the race caught by syzbot [1], we need
to lock the socket before using po->tp_version as this could
change under us otherwise.

This means lock_sock() and release_sock() must be done by
packet_set_ring() callers.

[1] :
BUG: KMSAN: uninit-value in packet_set_ring+0x1254/0x3870 net/packet/af_packet.c:4249
CPU: 0 PID: 20195 Comm: syzkaller707632 Not tainted 4.16.0+ #83
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
Call Trace:
 __dump_stack lib/dump_stack.c:17 [inline]
 dump_stack+0x185/0x1d0 lib/dump_stack.c:53
 kmsan_report+0x142/0x240 mm/kmsan/kmsan.c:1067
 __msan_warning_32+0x6c/0xb0 mm/kmsan/kmsan_instr.c:676
 packet_set_ring+0x1254/0x3870 net/packet/af_packet.c:4249
 packet_setsockopt+0x12c6/0x5a90 net/packet/af_packet.c:3662
 SYSC_setsockopt+0x4b8/0x570 net/socket.c:1849
 SyS_setsockopt+0x76/0xa0 net/socket.c:1828
 do_syscall_64+0x309/0x430 arch/x86/entry/common.c:287
 entry_SYSCALL_64_after_hwframe+0x3d/0xa2
RIP: 0033:0x449099
RSP: 002b:00007f42b5307ce8 EFLAGS: 00000246 ORIG_RAX: 0000000000000036
RAX: ffffffffffffffda RBX: 000000000070003c RCX: 0000000000449099
RDX: 0000000000000005 RSI: 0000000000000107 RDI: 0000000000000003
RBP: 0000000000700038 R08: 000000000000001c R09: 0000000000000000
R10: 00000000200000c0 R11: 0000000000000246 R12: 0000000000000000
R13: 000000000080eecf R14: 00007f42b53089c0 R15: 0000000000000001

Local variable description: ----req_u@packet_setsockopt
Variable was created at:
 packet_setsockopt+0x13f/0x5a90 net/packet/af_packet.c:3612
 SYSC_setsockopt+0x4b8/0x570 net/socket.c:1849

Fixes: f6fb8f100b80 ("af-packet: TPACKET_V3 flexible buffer implementation.")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reported-by: syzbot <syzkaller@googlegroups.com>
---
 net/packet/af_packet.c | 23 ++++++++++++++---------
 1 file changed, 14 insertions(+), 9 deletions(-)

diff --git a/net/packet/af_packet.c b/net/packet/af_packet.c
index 616cb9c18f88edd759dfb461051670c225978afa..c31b0687396a6ef45413f06efcc7c3f923e91d01 100644
--- a/net/packet/af_packet.c
+++ b/net/packet/af_packet.c
@@ -3008,6 +3008,7 @@ static int packet_release(struct socket *sock)
 
 	packet_flush_mclist(sk);
 
+	lock_sock(sk);
 	if (po->rx_ring.pg_vec) {
 		memset(&req_u, 0, sizeof(req_u));
 		packet_set_ring(sk, &req_u, 1, 0);
@@ -3017,6 +3018,7 @@ static int packet_release(struct socket *sock)
 		memset(&req_u, 0, sizeof(req_u));
 		packet_set_ring(sk, &req_u, 1, 1);
 	}
+	release_sock(sk);
 
 	f = fanout_release(sk);
 
@@ -3643,6 +3645,7 @@ packet_setsockopt(struct socket *sock, int level, int optname, char __user *optv
 		union tpacket_req_u req_u;
 		int len;
 
+		lock_sock(sk);
 		switch (po->tp_version) {
 		case TPACKET_V1:
 		case TPACKET_V2:
@@ -3653,12 +3656,17 @@ packet_setsockopt(struct socket *sock, int level, int optname, char __user *optv
 			len = sizeof(req_u.req3);
 			break;
 		}
-		if (optlen < len)
-			return -EINVAL;
-		if (copy_from_user(&req_u.req, optval, len))
-			return -EFAULT;
-		return packet_set_ring(sk, &req_u, 0,
-			optname == PACKET_TX_RING);
+		if (optlen < len) {
+			ret = -EINVAL;
+		} else {
+			if (copy_from_user(&req_u.req, optval, len))
+				ret = -EFAULT;
+			else
+				ret = packet_set_ring(sk, &req_u, 0,
+						    optname == PACKET_TX_RING);
+		}
+		release_sock(sk);
+		return ret;
 	}
 	case PACKET_COPY_THRESH:
 	{
@@ -4208,8 +4216,6 @@ static int packet_set_ring(struct sock *sk, union tpacket_req_u *req_u,
 	/* Added to avoid minimal code churn */
 	struct tpacket_req *req = &req_u->req;
 
-	lock_sock(sk);
-
 	rb = tx_ring ? &po->tx_ring : &po->rx_ring;
 	rb_queue = tx_ring ? &sk->sk_write_queue : &sk->sk_receive_queue;
 
@@ -4347,7 +4353,6 @@ static int packet_set_ring(struct sock *sk, union tpacket_req_u *req_u,
 	if (pg_vec)
 		free_pg_vec(pg_vec, order, req->tp_block_nr);
 out:
-	release_sock(sk);
 	return err;
 }
 
-- 
2.17.0.484.g0c8726318c-goog

^ permalink raw reply related
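
The locking change in the patch above can be sketched outside the kernel. What follows is a toy, single-threaded model, not the af_packet code: every toy_* name is hypothetical, the boolean "locked" flag stands in for lock_sock()/release_sock(), and the two request sizes are illustrative. It shows the two points the commit message makes: the callee only requires that the lock is held, and the caller reads tp_version and sets the ring within one lock hold, releasing the lock on every exit path.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

/* Toy stand-ins for the kernel objects; all names here are hypothetical. */
struct toy_sock {
	bool locked;     /* models socket lock ownership */
	int tp_version;  /* models po->tp_version */
	int ring_len;    /* models the accepted ring request size */
};

static void toy_lock_sock(struct toy_sock *sk)
{
	assert(!sk->locked);
	sk->locked = true;
}

static void toy_release_sock(struct toy_sock *sk)
{
	assert(sk->locked);
	sk->locked = false;
}

/* The callee no longer locks; it only requires that the caller did. */
static int toy_set_ring(struct toy_sock *sk, int len)
{
	assert(sk->locked);
	sk->ring_len = len;
	return 0;
}

/* The caller picks the request size from tp_version and sets the ring
 * under a single lock hold, with one unlock on every exit path. */
static int toy_setsockopt_ring(struct toy_sock *sk, int optlen)
{
	int len, ret;

	toy_lock_sock(sk);
	len = (sk->tp_version == 3) ? 28 : 16;  /* illustrative sizes */
	if (optlen < len)
		ret = -EINVAL;
	else
		ret = toy_set_ring(sk, len);
	toy_release_sock(sk);
	return ret;
}
```

Because the version check and the ring setup share one critical section in this model, a concurrent version change (not modelled here) could not slip in between them, which is the property the patch is after.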

* Re: [PATCH] filter.txt: update 'tools/net/' to 'tools/bpf/'
From: David Miller @ 2018-04-16  0:45 UTC (permalink / raw)
  To: shhuiw; +Cc: ast, daniel, corbet, netdev, linux-doc
In-Reply-To: <20180415080712.2213-1-shhuiw@foxmail.com>

From: Wang Sheng-Hui <shhuiw@foxmail.com>
Date: Sun, 15 Apr 2018 16:07:12 +0800

> The tools are located at tools/bpf/ instead of tools/net/.
> Update the filter.txt doc.
> 
> Signed-off-by: Wang Sheng-Hui <shhuiw@foxmail.com>

Applied, thank you.

^ permalink raw reply

* Re: [PATCH iproute2-next 1/1] tc: jsonify ife action
From: David Ahern @ 2018-04-16  0:24 UTC (permalink / raw)
  To: Roman Mashak; +Cc: stephen, netdev, kernel, jhs, xiyou.wangcong, jiri
In-Reply-To: <1523655605-20765-1-git-send-email-mrv@mojatatu.com>

On 4/13/18 3:40 PM, Roman Mashak wrote:
> Signed-off-by: Roman Mashak <mrv@mojatatu.com>
> ---
>  tc/m_ife.c | 54 ++++++++++++++++++++++++++++++++----------------------
>  1 file changed, 32 insertions(+), 22 deletions(-)
> 

applied to iproute2-next

^ permalink raw reply

* Re: [PATCH v2 iproute2-next 1/1] tc: jsonify skbedit action
From: David Ahern @ 2018-04-16  0:11 UTC (permalink / raw)
  To: Roman Mashak; +Cc: stephen, netdev, kernel, jhs, xiyou.wangcong, jiri
In-Reply-To: <1523383469-26207-1-git-send-email-mrv@mojatatu.com>

On 4/10/18 12:04 PM, Roman Mashak wrote:
> v2:
>    Fixed strings format in print_string()
> 
> Signed-off-by: Roman Mashak <mrv@mojatatu.com>
> ---
>  tc/m_skbedit.c | 53 +++++++++++++++++++++++++++++------------------------
>  1 file changed, 29 insertions(+), 24 deletions(-)
> 

applied to iproute2-next

^ permalink raw reply

* [PATCH] ibmvnic: Clear pending interrupt after device reset
From: Thomas Falcon @ 2018-04-15 23:53 UTC (permalink / raw)
  To: netdev; +Cc: linuxppc-dev, jallen, nfont, benh, Thomas Falcon

Due to a firmware bug, the hypervisor can send an interrupt to a
transmit or receive queue just prior to a partition migration, not
allowing the device enough time to handle it and send an EOI. When
the partition migrates, the interrupt is lost but an "EOI-pending"
flag for the interrupt line is still set in firmware. No further
interrupts will be sent until that flag is cleared, effectively
freezing that queue. To work around this, the driver will disable the
hardware interrupt and send an H_EOI signal prior to re-enabling it.
This will flush the pending EOI and allow the driver to continue
operation.

Signed-off-by: Thomas Falcon <tlfalcon@linux.vnet.ibm.com>
---
 drivers/net/ethernet/ibm/ibmvnic.c | 15 +++++++++++----
 1 file changed, 11 insertions(+), 4 deletions(-)

diff --git a/drivers/net/ethernet/ibm/ibmvnic.c b/drivers/net/ethernet/ibm/ibmvnic.c
index f84a920..ef7995fc 100644
--- a/drivers/net/ethernet/ibm/ibmvnic.c
+++ b/drivers/net/ethernet/ibm/ibmvnic.c
@@ -1034,16 +1034,14 @@ static int __ibmvnic_open(struct net_device *netdev)
 		netdev_dbg(netdev, "Enabling rx_scrq[%d] irq\n", i);
 		if (prev_state == VNIC_CLOSED)
 			enable_irq(adapter->rx_scrq[i]->irq);
-		else
-			enable_scrq_irq(adapter, adapter->rx_scrq[i]);
+		enable_scrq_irq(adapter, adapter->rx_scrq[i]);
 	}
 
 	for (i = 0; i < adapter->req_tx_queues; i++) {
 		netdev_dbg(netdev, "Enabling tx_scrq[%d] irq\n", i);
 		if (prev_state == VNIC_CLOSED)
 			enable_irq(adapter->tx_scrq[i]->irq);
-		else
-			enable_scrq_irq(adapter, adapter->tx_scrq[i]);
+		enable_scrq_irq(adapter, adapter->tx_scrq[i]);
 	}
 
 	rc = set_link_state(adapter, IBMVNIC_LOGICAL_LNK_UP);
@@ -1184,6 +1182,7 @@ static void ibmvnic_disable_irqs(struct ibmvnic_adapter *adapter)
 			if (adapter->tx_scrq[i]->irq) {
 				netdev_dbg(netdev,
 					   "Disabling tx_scrq[%d] irq\n", i);
+				disable_scrq_irq(adapter, adapter->tx_scrq[i]);
 				disable_irq(adapter->tx_scrq[i]->irq);
 			}
 	}
@@ -1193,6 +1192,7 @@ static void ibmvnic_disable_irqs(struct ibmvnic_adapter *adapter)
 			if (adapter->rx_scrq[i]->irq) {
 				netdev_dbg(netdev,
 					   "Disabling rx_scrq[%d] irq\n", i);
+				disable_scrq_irq(adapter, adapter->rx_scrq[i]);
 				disable_irq(adapter->rx_scrq[i]->irq);
 			}
 		}
@@ -2601,12 +2601,19 @@ static int enable_scrq_irq(struct ibmvnic_adapter *adapter,
 {
 	struct device *dev = &adapter->vdev->dev;
 	unsigned long rc;
+	u64 val;
 
 	if (scrq->hw_irq > 0x100000000ULL) {
 		dev_err(dev, "bad hw_irq = %lx\n", scrq->hw_irq);
 		return 1;
 	}
 
+	val = (0xff000000) | scrq->hw_irq;
+	rc = plpar_hcall_norets(H_EOI, val);
+	if (rc)
+		dev_err(dev, "H_EOI FAILED irq 0x%llx. rc=%ld\n",
+			val, rc);
+
 	rc = plpar_hcall_norets(H_VIOCTL, adapter->vdev->unit_address,
 				H_ENABLE_VIO_INTERRUPT, scrq->hw_irq, 0, 0);
 	if (rc)
-- 
1.8.3.1

^ permalink raw reply related
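
The failure mode described in the commit message can be illustrated with a toy model. This is only a sketch under stated assumptions: struct toy_irq_line and the toy_* functions are hypothetical and do not correspond to any ibmvnic or hypervisor interface. The one detail taken from the patch is that the H_EOI argument is formed by OR-ing 0xff000000 into scrq->hw_irq. The model shows that a line with a stale "EOI-pending" flag delivers nothing until an EOI clears the flag.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical model of one interrupt line as firmware sees it. */
struct toy_irq_line {
	bool enabled;
	bool eoi_pending;  /* set on delivery, cleared by an EOI */
	int delivered;
};

/* Try to deliver an interrupt; returns true if it reached the driver. */
static bool toy_raise(struct toy_irq_line *line)
{
	if (!line->enabled || line->eoi_pending)
		return false;
	line->eoi_pending = true;
	line->delivered++;
	return true;
}

/* Build the EOI argument the way the patch does. */
static uint64_t toy_eoi_value(uint64_t hw_irq)
{
	return 0xff000000ULL | hw_irq;
}

/* Workaround order from the commit message: disable the line,
 * flush any stale EOI-pending state, then re-enable it. */
static void toy_reset_line(struct toy_irq_line *line)
{
	line->enabled = false;
	line->eoi_pending = false;  /* models the effect of H_EOI */
	line->enabled = true;
}
```

In this model, a line left with eoi_pending set after a migration rejects every toy_raise() call until toy_reset_line() runs, which matches the frozen-queue symptom described above.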

* Re: [PATCH] ibmvnic: Clear pending interrupt after device reset
From: Thomas Falcon @ 2018-04-15 23:46 UTC (permalink / raw)
  To: netdev; +Cc: linuxppc-dev, jallen, nfont, benh
In-Reply-To: <1523834853-15448-1-git-send-email-tlfalcon@linux.vnet.ibm.com>

On 04/15/2018 06:27 PM, Thomas Falcon wrote:
> Due to a firmware bug, the hypervisor can send an interrupt to a
> transmit or receive queue just prior to a partition migration, not
> allowing the device enough time to handle it and send an EOI. When
> the partition migrates, the interrupt is lost but an "EOI-pending"
> flag for the interrupt line is still set in firmware. No further
> interrupts will be sent until that flag is cleared, effectively
> freezing that queue. To work around this, the driver will disable the
> hardware interrupt and send an H_EOI signal prior to re-enabling it.
> This will flush the pending EOI and allow the driver to continue
> operation.

Excuse me, I misspelled the linuxppc-dev email address.

Tom

> Signed-off-by: Thomas Falcon <tlfalcon@linux.vnet.ibm.com>
> ---
>  drivers/net/ethernet/ibm/ibmvnic.c | 15 +++++++++++----
>  1 file changed, 11 insertions(+), 4 deletions(-)
>
> diff --git a/drivers/net/ethernet/ibm/ibmvnic.c b/drivers/net/ethernet/ibm/ibmvnic.c
> index f84a920..ef7995fc 100644
> --- a/drivers/net/ethernet/ibm/ibmvnic.c
> +++ b/drivers/net/ethernet/ibm/ibmvnic.c
> @@ -1034,16 +1034,14 @@ static int __ibmvnic_open(struct net_device *netdev)
>  		netdev_dbg(netdev, "Enabling rx_scrq[%d] irq\n", i);
>  		if (prev_state == VNIC_CLOSED)
>  			enable_irq(adapter->rx_scrq[i]->irq);
> -		else
> -			enable_scrq_irq(adapter, adapter->rx_scrq[i]);
> +		enable_scrq_irq(adapter, adapter->rx_scrq[i]);
>  	}
>
>  	for (i = 0; i < adapter->req_tx_queues; i++) {
>  		netdev_dbg(netdev, "Enabling tx_scrq[%d] irq\n", i);
>  		if (prev_state == VNIC_CLOSED)
>  			enable_irq(adapter->tx_scrq[i]->irq);
> -		else
> -			enable_scrq_irq(adapter, adapter->tx_scrq[i]);
> +		enable_scrq_irq(adapter, adapter->tx_scrq[i]);
>  	}
>
>  	rc = set_link_state(adapter, IBMVNIC_LOGICAL_LNK_UP);
> @@ -1184,6 +1182,7 @@ static void ibmvnic_disable_irqs(struct ibmvnic_adapter *adapter)
>  			if (adapter->tx_scrq[i]->irq) {
>  				netdev_dbg(netdev,
>  					   "Disabling tx_scrq[%d] irq\n", i);
> +				disable_scrq_irq(adapter, adapter->tx_scrq[i]);
>  				disable_irq(adapter->tx_scrq[i]->irq);
>  			}
>  	}
> @@ -1193,6 +1192,7 @@ static void ibmvnic_disable_irqs(struct ibmvnic_adapter *adapter)
>  			if (adapter->rx_scrq[i]->irq) {
>  				netdev_dbg(netdev,
>  					   "Disabling rx_scrq[%d] irq\n", i);
> +				disable_scrq_irq(adapter, adapter->rx_scrq[i]);
>  				disable_irq(adapter->rx_scrq[i]->irq);
>  			}
>  		}
> @@ -2601,12 +2601,19 @@ static int enable_scrq_irq(struct ibmvnic_adapter *adapter,
>  {
>  	struct device *dev = &adapter->vdev->dev;
>  	unsigned long rc;
> +	u64 val;
>
>  	if (scrq->hw_irq > 0x100000000ULL) {
>  		dev_err(dev, "bad hw_irq = %lx\n", scrq->hw_irq);
>  		return 1;
>  	}
>
> +	val = (0xff000000) | scrq->hw_irq;
> +	rc = plpar_hcall_norets(H_EOI, val);
> +	if (rc)
> +		dev_err(dev, "H_EOI FAILED irq 0x%llx. rc=%ld\n",
> +			val, rc);
> +
>  	rc = plpar_hcall_norets(H_VIOCTL, adapter->vdev->unit_address,
>  				H_ENABLE_VIO_INTERRUPT, scrq->hw_irq, 0, 0);
>  	if (rc)

^ permalink raw reply

* [PATCH] ibmvnic: Clear pending interrupt after device reset
From: Thomas Falcon @ 2018-04-15 23:27 UTC (permalink / raw)
  To: netdev; +Cc: linuxppc-dev, jallen, nfont, benh, Thomas Falcon

Due to a firmware bug, the hypervisor can send an interrupt to a
transmit or receive queue just prior to a partition migration, not
allowing the device enough time to handle it and send an EOI. When
the partition migrates, the interrupt is lost but an "EOI-pending"
flag for the interrupt line is still set in firmware. No further
interrupts will be sent until that flag is cleared, effectively
freezing that queue. To work around this, the driver will disable the
hardware interrupt and send an H_EOI signal prior to re-enabling it.
This will flush the pending EOI and allow the driver to continue
operation.

Signed-off-by: Thomas Falcon <tlfalcon@linux.vnet.ibm.com>
---
 drivers/net/ethernet/ibm/ibmvnic.c | 15 +++++++++++----
 1 file changed, 11 insertions(+), 4 deletions(-)

diff --git a/drivers/net/ethernet/ibm/ibmvnic.c b/drivers/net/ethernet/ibm/ibmvnic.c
index f84a920..ef7995fc 100644
--- a/drivers/net/ethernet/ibm/ibmvnic.c
+++ b/drivers/net/ethernet/ibm/ibmvnic.c
@@ -1034,16 +1034,14 @@ static int __ibmvnic_open(struct net_device *netdev)
 		netdev_dbg(netdev, "Enabling rx_scrq[%d] irq\n", i);
 		if (prev_state == VNIC_CLOSED)
 			enable_irq(adapter->rx_scrq[i]->irq);
-		else
-			enable_scrq_irq(adapter, adapter->rx_scrq[i]);
+		enable_scrq_irq(adapter, adapter->rx_scrq[i]);
 	}
 
 	for (i = 0; i < adapter->req_tx_queues; i++) {
 		netdev_dbg(netdev, "Enabling tx_scrq[%d] irq\n", i);
 		if (prev_state == VNIC_CLOSED)
 			enable_irq(adapter->tx_scrq[i]->irq);
-		else
-			enable_scrq_irq(adapter, adapter->tx_scrq[i]);
+		enable_scrq_irq(adapter, adapter->tx_scrq[i]);
 	}
 
 	rc = set_link_state(adapter, IBMVNIC_LOGICAL_LNK_UP);
@@ -1184,6 +1182,7 @@ static void ibmvnic_disable_irqs(struct ibmvnic_adapter *adapter)
 			if (adapter->tx_scrq[i]->irq) {
 				netdev_dbg(netdev,
 					   "Disabling tx_scrq[%d] irq\n", i);
+				disable_scrq_irq(adapter, adapter->tx_scrq[i]);
 				disable_irq(adapter->tx_scrq[i]->irq);
 			}
 	}
@@ -1193,6 +1192,7 @@ static void ibmvnic_disable_irqs(struct ibmvnic_adapter *adapter)
 			if (adapter->rx_scrq[i]->irq) {
 				netdev_dbg(netdev,
 					   "Disabling rx_scrq[%d] irq\n", i);
+				disable_scrq_irq(adapter, adapter->rx_scrq[i]);
 				disable_irq(adapter->rx_scrq[i]->irq);
 			}
 		}
@@ -2601,12 +2601,19 @@ static int enable_scrq_irq(struct ibmvnic_adapter *adapter,
 {
 	struct device *dev = &adapter->vdev->dev;
 	unsigned long rc;
+	u64 val;
 
 	if (scrq->hw_irq > 0x100000000ULL) {
 		dev_err(dev, "bad hw_irq = %lx\n", scrq->hw_irq);
 		return 1;
 	}
 
+	val = (0xff000000) | scrq->hw_irq;
+	rc = plpar_hcall_norets(H_EOI, val);
+	if (rc)
+		dev_err(dev, "H_EOI FAILED irq 0x%llx. rc=%ld\n",
+			val, rc);
+
 	rc = plpar_hcall_norets(H_VIOCTL, adapter->vdev->unit_address,
 				H_ENABLE_VIO_INTERRUPT, scrq->hw_irq, 0, 0);
 	if (rc)
-- 
1.8.3.1

^ permalink raw reply related

* Re: [PATCH net] team: avoid adding twice the same option to the event list
From: Paolo Abeni @ 2018-04-15 19:53 UTC (permalink / raw)
  To: David Miller; +Cc: netdev, jiri
In-Reply-To: <20180413.140705.1693433489799741559.davem@davemloft.net>

On Fri, 2018-04-13 at 14:07 -0400, David Miller wrote:
> From: Paolo Abeni <pabeni@redhat.com>
> Date: Fri, 13 Apr 2018 13:59:25 +0200
> 
> > When parsing the options provided by the user space,
> > team_nl_cmd_options_set() inserts them in a temporary list to send
> > multiple events with a single message.
> > While each option's attribute is correctly validated, the code does
> > not check for duplicate entries before inserting into the event
> > list.
> > 
> > Exploiting the above, the syzbot was able to trigger the following
> > splat:
>  ...
> > This changeset addresses the above by avoiding list_add() if the current
> > option is already present in the event list.
> > 
> > Reported-and-tested-by: syzbot+4d4af685432dc0e56c91@syzkaller.appspotmail.com
> > Signed-off-by: Paolo Abeni <pabeni@redhat.com>
> > Fixes: 2fcdb2c9e659 ("team: allow to send multiple set events in one message")
> 
> Looks good to me.
> 
> It's too bad that the tmp list entries don't get marked as they are
> added, or get unlinked by the list processor.  Either scheme would
> make the "already added" test a lot simpler.

Yes, I considered both changes, but then opted for this solution,
believing it would be less invasive and more suitable for -net.

Cheers,

Paolo

^ permalink raw reply

* Re: [PATCH iproute2] utils: Do not reset family for default, any, all addresses
From: Thomas Deutschmann @ 2018-04-15 19:14 UTC (permalink / raw)
  To: David Ahern, stephen; +Cc: netdev, Serhey Popovych
In-Reply-To: <20180413163633.1844-1-dsahern@gmail.com>

Hi,

I can confirm that this patch solves the issue for us and restores
previous behavior.

Thank you.


-- 
Regards,
Thomas Deutschmann / Gentoo Linux Developer
C4DD 695F A713 8F24 2AA1 5638 5849 7EE5 1D5D 74A5

^ permalink raw reply

* Re: linux-next on x60: network manager often complains "network is disabled" after resume
From: Pavel Machek @ 2018-04-15 16:16 UTC (permalink / raw)
  To: Dan Williams
  Cc: Woody Suwalski, Rafael J. Wysocki, kernel list,
	Linux-pm mailing list, Netdev list
In-Reply-To: <95efbba35c3389015d4919a59f8d01bc2d375a19.camel@redhat.com>


On Mon 2018-03-26 10:33:55, Dan Williams wrote:
> On Sun, 2018-03-25 at 08:19 +0200, Pavel Machek wrote:
> > > > > Ok, what does 'nmcli dev' and 'nmcli radio' show?
> > > > 
> > > > Broken state.
> > > > 
> > > > pavel@amd:~$ nmcli dev
> > > > DEVICE  TYPE      STATE        CONNECTION
> > > > eth1    ethernet  unavailable  --
> > > > lo      loopback  unmanaged    --
> > > > wlan0   wifi      unmanaged    --
> > > 
> > > If the state is "unmanaged" on resume, that would indicate a
> > > problem
> > > with sleep/wake and likely not a kernel network device issue.
> > > 
> > > We should probably move this discussion to the NM lists to debug
> > > further.  Before you suspend, run "nmcli gen log level trace" to
> > > turn
> > > on full debug logging, then reproduce the issue, and send a pointer
> > > to
> > > those logs (scrubbed for anything you consider sensitive) to the NM
> > > mailing list.
> > 
> > Hmm :-)
> > 
> > root@amd:/data/pavel# nmcli gen log level trace
> > Error: Unknown log level 'trace'
> 
> What NM version?  'trace' is pretty old (since 1.0 from December 2014)
> so unless you're using a really, really old version of Debian I'd
> expect you'd have it.  Anyway, debug would do.

Hmm.

pavel@duo:~$ /usr/sbin/NetworkManager --version
You must be root to run NetworkManager!
pavel@duo:~$ sudo /usr/sbin/NetworkManager --version
0.9.10.0

So I set the log level, but I still don't see much in the log:

Apr 14 18:14:29 duo dbus[3009]: [system] Successfully activated
service 'org.freedesktop.nm_dispatcher'
Apr 14 18:14:29 duo nm-dispatcher: Dispatching action 'down' for wlan1
Apr 14 18:14:29 duo systemd[1]: Started Network Manager Script
Dispatcher Service.
Apr 14 18:14:29 duo systemd-sleep[6853]: Suspending system...
Apr 14 21:27:53 duo systemd[1]: systemd-journald.service watchdog
timeout (limit 1min)!
pavel@duo:~$ date
Sun Apr 15 12:26:32 CEST 2018
pavel@duo:~$

Is it possible that time handling across suspend changed in v4.17?

I get some weird effects. With display backlight...

> > Where do I get the logs? I don't see much in the syslog...
> 
> > And.. It seems that it is "every other suspend". One resume results
> > in
> > broken network, one in working one, one in broken one...
> 
> Does your distro use pm-utils, upower, or systemd for suspend/resume
> handling?

upower, I guess:

pavel@duo:/data/l/linux$ ps aux | grep upower
root      3820  0.0  0.1  42848  7984 ?        Ssl  Apr14   0:01
/usr/lib/upower/upowerd

									Pavel
-- 
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html


^ permalink raw reply

* Re: linux-next on x60: network manager often complains "network is disabled" after resume
From: Pavel Machek @ 2018-04-15 16:15 UTC (permalink / raw)
  To: Woody Suwalski
  Cc: Rafael J. Wysocki, kernel list, Linux-pm mailing list,
	Netdev list
In-Reply-To: <c7d96582-e2e6-d9c8-1140-3f1dab836132@gmail.com>


On Tue 2018-03-20 21:11:54, Woody Suwalski wrote:
> Woody Suwalski wrote:
> >Pavel Machek wrote:
> >>On Mon 2018-03-19 05:17:45, Woody Suwalski wrote:
> >>>Pavel Machek wrote:
> >>>>Hi!
> >>>>
> >>>>With recent linux-next, after resume networkmanager often claims that
> >>>>"network is disabled". Sometimes suspend/resume clears that.
> >>>>
> >>>>Any ideas? Does it work for you?
> >>>>                                    Pavel
> >>>Tried the 4.16-rc6 with nm 1.4.4 - I do not see the issue.
> >>Thanks for testing... but yes, 4.16 should be ok. If not fixed,
> >>problem will appear in 4.17-rc1.
> >>
> >Works here OK. Tried ~10 suspends, all restarted OK.
> >kernel next-20180320
> >nmcli shows that Wifi always connects OK
> >
> >Woody
> >
> On the contrary, it just happened to me on a 64-bit build 4.16-rc5 on T440.
> I think that Dan's suspicion is correct - it is a snafu in the PM: trying to
> hibernate results in a message:
> Failed to hibernate system via logind: There's already a shutdown or sleep
> operation in progress.
> 
> And ps shows "Ds /lib/systemd/systemd-sleep suspend"...

Problem now seems to be in the mainline.

But no, I don't see systemd-sleep in my process list :-(.

I guess you can't reproduce it easily? I tried bisecting, but while it
happens often enough to make v4.17 hard to use, it does not permit
reliable bisect.

These should be bad according to my notes

b04240a33b99b32cf6fbdf5c943c04e505a0cb07 
 ed80dc19e4dd395c951f745acd1484d61c4cfb20
 52113a0d3889d6e2738cf09bf79bc9cac7b5e1c6
 4fc97ef94bbfa185d16b3e44199b7559d0668747
 14ebdb2c814f508936fe178a2abc906a16a3ab48
 639adbeef5ae1bb8eeebbb0cde0b885397bde192

bisection claimed

c16add24522547bf52c189b3c0d1ab6f5c2b4375

is first bad commit, but I'm not sure if I trust that.
									Pavel
-- 
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html


^ permalink raw reply

* One question about __tcp_select_window()
From: Wang Jian @ 2018-04-15 12:50 UTC (permalink / raw)
  To: netdev

Hi all,

While reading the __tcp_select_window() code, I found that it may return a
smaller window.
Below is one scenario I thought of; it may not be right.
In __tcp_select_window(), assume:
full_space is 6 MSS, free_space is 2 MSS, and tp->rcv_wnd is 3 MSS.
Assume also that window scaling is disabled. Then
window = tp->rcv_wnd, so window > free_space,
and the function rounds free_space down and returns it.

Is this the expected behavior? The comment also says
"Get the largest window that is a nice multiple of mss."

Should we do something like the change below? Or am I missing something?
I don't know how to verify it at the moment.

--- a/net/ipv4/tcp_output.c
+++ b/net/ipv4/tcp_output.c
@@ -2680,9 +2680,9 @@ u32 __tcp_select_window(struct sock *sk)
                 * We also don't do any window rounding when the free space
                 * is too small.
                 */
-               if (window <= free_space - mss || window > free_space)
+               if (window <= free_space - mss)
                        window = rounddown(free_space, mss);
-               else if (mss == full_space &&
+               else if (window <= free_space && mss == full_space &&
                         free_space > window + (full_space >> 1))
                        window = free_space;
        }

Thanks.

^ permalink raw reply
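
The scenario in the question can be checked with a small stand-alone sketch of just the branch quoted in the diff. It is not the kernel function: the toy_* names are hypothetical, the earlier parts of __tcp_select_window() are left out, and mss = 1000 is an arbitrary value chosen to make the 6/2/3 MSS example concrete. With window = 3000, free_space = 2000 and full_space = 6000, the current condition yields 2000, while the proposed condition leaves the window at 3000.

```c
/* Round x down to a multiple of y (y > 0), like the kernel's rounddown(). */
static int toy_rounddown(int x, int y)
{
	return x - (x % y);
}

/* The branch as it stands before the proposed change. */
static int toy_window_current(int window, int free_space,
			      int full_space, int mss)
{
	if (window <= free_space - mss || window > free_space)
		window = toy_rounddown(free_space, mss);
	else if (mss == full_space &&
		 free_space > window + (full_space >> 1))
		window = free_space;
	return window;
}

/* The branch with the change proposed in the diff above. */
static int toy_window_proposed(int window, int free_space,
			       int full_space, int mss)
{
	if (window <= free_space - mss)
		window = toy_rounddown(free_space, mss);
	else if (window <= free_space && mss == full_space &&
		 free_space > window + (full_space >> 1))
		window = free_space;
	return window;
}
```

The sketch only confirms the arithmetic of the scenario. It does not settle whether returning the smaller value is intended, since that depends on how the caller uses the result.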

* Re: [Patch net] llc: properly handle dev_queue_xmit() return value
From: Noam Rathaus @ 2018-04-15 10:08 UTC (permalink / raw)
  To: David Miller; +Cc: Cong Wang, netdev
In-Reply-To: <CAHqykcRxO2SSQXbpg_tNs49TNxpLZzDsYePokJSusdkdfTyp8g@mail.gmail.com>

Hi,

Is there any update?

On Fri, Apr 13, 2018 at 7:49 PM, Noam Rathaus <noamr@beyondsecurity.com> wrote:
> Hi
>
> Any update?
>
> On Thu, 29 Mar 2018 at 14:11, Noam Rathaus <noamr@beyondsecurity.com> wrote:
>>
>> Hi,
>>
>> Will you notify me when it's been accepted? If not, how can I check
>> myself whether it was accepted?
>>
>> On Tue, Mar 27, 2018 at 8:13 PM, David Miller <davem@davemloft.net> wrote:
>> > From: Noam Rathaus <noamr@beyondsecurity.com>
>> > Date: Tue, 27 Mar 2018 16:27:49 +0000
>> >
>> >> Guys please fill me in on the next step?
>> >>
>> >> If it’s applied it means it’s part of the official code of the kernel
>> >> now?
>> >
>> > It means it is in my networking GIT tree and will make its way to Linus
>> > in the not so distant future.
>>
>>
>>
>> --
>>
>> Thanks,
>> Noam Rathaus
>> Beyond Security
>>
>> PGP Key ID: 7EF920D3C045D63F (Exp 2019-03)
>
> --
> Thanks,
> Noam Rathaus



-- 

Thanks,
Noam Rathaus
Beyond Security

PGP Key ID: 7EF920D3C045D63F (Exp 2019-03)

^ permalink raw reply

* [PATCH] filter.txt: update 'tools/net/' to 'tools/bpf/'
From: Wang Sheng-Hui @ 2018-04-15  8:07 UTC (permalink / raw)
  To: ast, daniel, corbet, netdev, linux-doc

The tools are located at tools/bpf/ instead of tools/net/.
Update the filter.txt doc.

Signed-off-by: Wang Sheng-Hui <shhuiw@foxmail.com>
---
 Documentation/networking/filter.txt | 6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/Documentation/networking/filter.txt b/Documentation/networking/filter.txt
index a4508ec1816b..fd55c7de9991 100644
--- a/Documentation/networking/filter.txt
+++ b/Documentation/networking/filter.txt
@@ -169,7 +169,7 @@ access to BPF code as well.
 BPF engine and instruction set
 ------------------------------
 
-Under tools/net/ there's a small helper tool called bpf_asm which can
+Under tools/bpf/ there's a small helper tool called bpf_asm which can
 be used to write low-level filters for example scenarios mentioned in the
 previous section. Asm-like syntax mentioned here has been implemented in
 bpf_asm and will be used for further explanations (instead of dealing with
@@ -359,7 +359,7 @@ $ ./bpf_asm -c foo
 In particular, as usage with xt_bpf or cls_bpf can result in more complex BPF
 filters that might not be obvious at first, it's good to test filters before
 attaching to a live system. For that purpose, there's a small tool called
-bpf_dbg under tools/net/ in the kernel source directory. This debugger allows
+bpf_dbg under tools/bpf/ in the kernel source directory. This debugger allows
 for testing BPF filters against given pcap files, single stepping through the
 BPF code on the pcap's packets and to do BPF machine register dumps.
 
@@ -483,7 +483,7 @@ Example output from dmesg:
 [ 3389.935851] JIT code: 00000030: 00 e8 28 94 ff e0 83 f8 01 75 07 b8 ff ff 00 00
 [ 3389.935852] JIT code: 00000040: eb 02 31 c0 c9 c3
 
-In the kernel source tree under tools/net/, there's bpf_jit_disasm for
+In the kernel source tree under tools/bpf/, there's bpf_jit_disasm for
 generating disassembly out of the kernel log's hexdump:
 
 # ./bpf_jit_disasm
-- 
2.11.0

^ permalink raw reply related

* Re: SRIOV switchdev mode BoF minutes
From: Or Gerlitz @ 2018-04-15  6:01 UTC (permalink / raw)
  To: Samudrala, Sridhar
  Cc: David Miller, Anjali Singhai Jain, Andy Gospodarek, Michael Chan,
	Simon Horman, Jakub Kicinski, John Fastabend, Saeed Mahameed,
	Jiri Pirko, Rony Efraim, Linux Netdev List
In-Reply-To: <e93e22c3-6c2e-00c9-10c6-163c4aacff14@intel.com>

On Sat, Apr 14, 2018 at 2:03 AM, Samudrala, Sridhar
<sridhar.samudrala@intel.com> wrote:

> I meant between PFs on 2 compute nodes.

If the PF serves as uplink rep, it functions as  a switch port -- applications
don't run on switch ports. One way to get apps to run on the host in switchdev
mode is probe one of the VFs there.


[...]

> By smartnic env, i guess you are referring to OVS control plane also running
> on the NIC.

correct

> I will look forward to your patches.

FWIW, note that my patches don't bring any news for you. I am aligning
mlx5 with what was agreed on netdev, e.g. nfp does it (uplink rep and
such) already.

^ permalink raw reply

* [PATCH linux-stable-4.14] tcp: clear tp->packets_out when purging write queue
From: Soheil Hassas Yeganeh @ 2018-04-15  0:45 UTC (permalink / raw)
  To: davem, netdev
  Cc: ycheng, ncardwell, subashab, hvtaifwkbgefbaei,
	Soheil Hassas Yeganeh, Eric Dumazet

From: Soheil Hassas Yeganeh <soheil@google.com>

Clear tp->packets_out when purging the write queue, otherwise
tcp_rearm_rto() mistakenly assumes TCP write queue is not empty.
This results in NULL pointer dereference.

Also, remove the redundant `tp->packets_out = 0` from
tcp_disconnect(), since tcp_disconnect() calls
tcp_write_queue_purge().

Fixes: a27fd7a8ed38 (tcp: purge write queue upon RST)
Reported-by: Subash Abhinov Kasiviswanathan <subashab@codeaurora.org>
Reported-by: Sami Farin <hvtaifwkbgefbaei@gmail.com>
Tested-by: Sami Farin <hvtaifwkbgefbaei@gmail.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Soheil Hassas Yeganeh <soheil@google.com>
Acked-by: Yuchung Cheng <ycheng@google.com>
Acked-by: Neal Cardwell <ncardwell@google.com>
---
 include/net/tcp.h | 1 +
 net/ipv4/tcp.c    | 1 -
 2 files changed, 1 insertion(+), 1 deletion(-)

diff --git a/include/net/tcp.h b/include/net/tcp.h
index d323d4fa742ca..fb653736f3353 100644
--- a/include/net/tcp.h
+++ b/include/net/tcp.h
@@ -1616,6 +1616,7 @@ static inline void tcp_write_queue_purge(struct sock *sk)
 	sk_mem_reclaim(sk);
 	tcp_clear_all_retrans_hints(tcp_sk(sk));
 	tcp_init_send_head(sk);
+	tcp_sk(sk)->packets_out = 0;
 }
 
 static inline struct sk_buff *tcp_write_queue_head(const struct sock *sk)
diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c
index 38b9a6276a9de..4dda8d301802e 100644
--- a/net/ipv4/tcp.c
+++ b/net/ipv4/tcp.c
@@ -2354,7 +2354,6 @@ int tcp_disconnect(struct sock *sk, int flags)
 	icsk->icsk_backoff = 0;
 	tp->snd_cwnd = 2;
 	icsk->icsk_probes_out = 0;
-	tp->packets_out = 0;
 	tp->snd_ssthresh = TCP_INFINITE_SSTHRESH;
 	tp->snd_cwnd_cnt = 0;
 	tp->window_clamp = 0;
-- 
2.17.0.484.g0c8726318c-goog

^ permalink raw reply related

* [PATCH net] tcp: clear tp->packets_out when purging write queue
From: Soheil Hassas Yeganeh @ 2018-04-15  0:44 UTC (permalink / raw)
  To: davem, netdev
  Cc: ycheng, ncardwell, subashab, hvtaifwkbgefbaei,
	Soheil Hassas Yeganeh, Eric Dumazet

From: Soheil Hassas Yeganeh <soheil@google.com>

Clear tp->packets_out when purging the write queue, otherwise
tcp_rearm_rto() mistakenly assumes TCP write queue is not empty.
This results in NULL pointer dereference.

Also, remove the redundant `tp->packets_out = 0` from
tcp_disconnect(), since tcp_disconnect() calls
tcp_write_queue_purge().

Fixes: a27fd7a8ed38 (tcp: purge write queue upon RST)
Reported-by: Subash Abhinov Kasiviswanathan <subashab@codeaurora.org>
Reported-by: Sami Farin <hvtaifwkbgefbaei@gmail.com>
Tested-by: Sami Farin <hvtaifwkbgefbaei@gmail.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Soheil Hassas Yeganeh <soheil@google.com>
Acked-by: Yuchung Cheng <ycheng@google.com>
Acked-by: Neal Cardwell <ncardwell@google.com>
---
 net/ipv4/tcp.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c
index 4fa3f812b9ff8..9ce1c726185eb 100644
--- a/net/ipv4/tcp.c
+++ b/net/ipv4/tcp.c
@@ -2368,6 +2368,7 @@ void tcp_write_queue_purge(struct sock *sk)
 	INIT_LIST_HEAD(&tcp_sk(sk)->tsorted_sent_queue);
 	sk_mem_reclaim(sk);
 	tcp_clear_all_retrans_hints(tcp_sk(sk));
+	tcp_sk(sk)->packets_out = 0;
 }
 
 int tcp_disconnect(struct sock *sk, int flags)
@@ -2417,7 +2418,6 @@ int tcp_disconnect(struct sock *sk, int flags)
 	icsk->icsk_backoff = 0;
 	tp->snd_cwnd = 2;
 	icsk->icsk_probes_out = 0;
-	tp->packets_out = 0;
 	tp->snd_ssthresh = TCP_INFINITE_SSTHRESH;
 	tp->snd_cwnd_cnt = 0;
 	tp->window_clamp = 0;
-- 
2.17.0.484.g0c8726318c-goog

^ permalink raw reply related

* Re: Cavium Octeon III network driver.
From: Florian Fainelli @ 2018-04-15  0:08 UTC (permalink / raw)
  To: Steven J. Hill, netdev
In-Reply-To: <c269ed89-75ac-895a-984f-badc0b4d9a05@cavium.com>

Hi Steven,

On 04/13/2018 03:43 PM, Steven J. Hill wrote:
> Patches for Cavium's Octeon III network driver were submitted by
> David Daney back on 20180222. David has since left the company and
> I am now responsible for the upstreaming effort. When looking at
> <patchwork.ozlabs.org> they are marked as "Not Applicable". What
> steps do I take next? Thanks.

The net-next tree is currently closed, but once it opens back up, you
will likely want to resubmit those patches. Last I remember, they were
ready to go.
-- 
Florian

^ permalink raw reply

* Re: Regression with 5dcd8400884c ("macsec: missing dev_put() on error in macsec_newlink()")
From: Sabrina Dubroca @ 2018-04-14 22:31 UTC (permalink / raw)
  To: Laura Abbott
  Cc: Dan Carpenter, David S. Miller, Linux Kernel Mailing List, netdev
In-Reply-To: <9a3a84ff-1fd1-c063-0c50-a297d29a692b@redhat.com>

Hello Laura,

2018-04-14, 10:56:55 -0700, Laura Abbott wrote:
> Hi,
> 
> Fedora got a bug report of a regression when trying to remove the
> macsec module (https://bugzilla.redhat.com/show_bug.cgi?id=1566410).
> I did a bisect and found
> 
> commit 5dcd8400884cc4a043a6d4617e042489e5d566a9
> Author: Dan Carpenter <dan.carpenter@oracle.com>
> Date:   Wed Mar 21 11:09:01 2018 +0300
> 
>     macsec: missing dev_put() on error in macsec_newlink()
>     We moved the dev_hold(real_dev); call earlier in the function but forgot
>     to update the error paths.
>     Fixes: 0759e552bce7 ("macsec: fix negative refcnt on parent link")
>     Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
>     Signed-off-by: David S. Miller <davem@davemloft.net>
> 
> The script I used for testing based on the reporter is attached. It
> looks like modprobe is stuck in the D state. Any idea?

I don't think that reference was actually leaked. It gets released in
macsec_free_netdev() when the device is deleted.

modprobe getting stuck is just a side-effect of the refcount going
negative on the parent device, since removing the module needs to take
the lock that is held by device deletion.

I'll send a revert tomorrow.

Thanks for the report,

-- 
Sabrina

^ permalink raw reply

* Re: v6/sit tunnels and VRFs
From: Jeff Barnhill @ 2018-04-14 22:07 UTC (permalink / raw)
  To: David Ahern; +Cc: netdev
In-Reply-To: <e19f2fb3-319c-e8ea-5fc3-5072ddb69c5b@gmail.com>

I didn't see an easy way to achieve this behavior without affecting
the non-VRF routing lookups (such as deleting non-VRF rules).  We have
some automated tests that were looking for specific responses, but, of
course, those can be changed.  Among a few of my colleagues, this
became a discussion about maintaining consistent behavior between VRF
and non-VRF, such that a ping or some other tool wouldn't respond
differently.  That's the main reason I asked the question here - to
see how important this was in general use. It sounds like in your
experience, the specific error message/code hasn't been an issue.

Thanks,
Jeff


On Fri, Apr 13, 2018 at 4:31 PM, David Ahern <dsahern@gmail.com> wrote:
> On 4/13/18 2:23 PM, Jeff Barnhill wrote:
>> It seems that the ENETUNREACH response is still desirable in the VRF
>> case since the only difference (when using VRF vs. not) is that the
>> lookup should be restrained to a specific VRF.
>
> VRF is just policy routing to a table. If the table wants the lookup to
> stop, then it needs a default route. What you are referring to is the
> lookup goes through all tables and does not find an answer so it fails
> with -ENETUNREACH. I do not know of any way to make that happen with the
> existing default route options and in the past 2+ years we have not hit
> any s/w that discriminates -ENETUNREACH from -EHOSTUNREACH.
>
> I take it this is code from your internal code base. Why does it care
> between those two failures?
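
For illustration only: software that does not discriminate between the
two codes might fold them into a single category, roughly as sketched
here. The enum and the function name are hypothetical and are not taken
from any code base mentioned in this thread.

```c
#include <assert.h>
#include <errno.h>

/* Hypothetical classification used only for this illustration. */
enum reach_status {
	REACH_OK,
	REACH_UNREACHABLE, /* covers both ENETUNREACH and EHOSTUNREACH */
	REACH_OTHER_ERROR,
};

/* Map a non-negative errno value (e.g. from connect() or sendto())
 * to a category. Both "unreachable" codes land in the same bucket.
 */
static enum reach_status classify_reach_error(int err)
{
	assert(err >= 0);
	switch (err) {
	case 0:
		return REACH_OK;
	case ENETUNREACH:
	case EHOSTUNREACH:
		return REACH_UNREACHABLE;
	default:
		return REACH_OTHER_ERROR;
	}
}
```

A tool written this way would behave identically with and without a
VRF default route; only code that compares against one specific errno
value would notice the difference discussed above.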

^ permalink raw reply

* [PATCH 3/3] net: macb: Receive Side Coalescing (RSC) feature added.
From: Rafal Ozieblo @ 2018-04-14 20:55 UTC (permalink / raw)
  To: Nicolas Ferre, netdev, linux-kernel; +Cc: Rafal Ozieblo
In-Reply-To: <1523739187-20077-1-git-send-email-rafalo@cadence.com>

This is basically the same as Large Receive Offload (LRO)
in the Linux framework.

Signed-off-by: Rafal Ozieblo <rafalo@cadence.com>
---
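Note (not part of the commit): a small stand-alone sketch of how the
per-queue enable bits are packed into the RSC control field, based on
the offsets added in this patch (field at bit 1, 15 bits wide). The
TOY_* macros and toy_* helpers are hypothetical and only imitate the
driver's GEM_BFINS()/GEM_BFEXT() usage. The sketch asserts that queue
numbers start at 1, which the "queue - 1" shift in the patch appears to
assume.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define TOY_RSCCTRL_OFFSET 1
#define TOY_RSCCTRL_SIZE   15
#define TOY_RSCCTRL_MASK   ((1u << TOY_RSCCTRL_SIZE) - 1)

/* Build the enable mask: queue N sets bit (N - 1). */
static uint32_t toy_rsc_queue_mask(const uint32_t *queues, size_t n)
{
	uint32_t mask = 0;
	size_t i;

	for (i = 0; i < n; i++) {
		/* Queue 0 would mean a shift by -1, so reject it here. */
		assert(queues[i] >= 1 && queues[i] <= TOY_RSCCTRL_SIZE);
		mask |= 1u << (queues[i] - 1);
	}
	return mask;
}

/* Replace the RSC control field in reg, leaving other bits untouched. */
static uint32_t toy_rscctrl_insert(uint32_t reg, uint32_t mask)
{
	reg &= ~(TOY_RSCCTRL_MASK << TOY_RSCCTRL_OFFSET);
	return reg | ((mask & TOY_RSCCTRL_MASK) << TOY_RSCCTRL_OFFSET);
}

/* Read the RSC control field back out of reg. */
static uint32_t toy_rscctrl_extract(uint32_t reg)
{
	return (reg >> TOY_RSCCTRL_OFFSET) & TOY_RSCCTRL_MASK;
}
```

For example, flow filters steering to queues 1 and 3 give a mask of 0x5,
which lands in the register as 0xA while the clear-mask bit (bit 16) is
preserved.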
 drivers/net/ethernet/cadence/macb.h      |  6 +++
 drivers/net/ethernet/cadence/macb_main.c | 70 +++++++++++++++++++++++++++++++-
 2 files changed, 75 insertions(+), 1 deletion(-)

diff --git a/drivers/net/ethernet/cadence/macb.h b/drivers/net/ethernet/cadence/macb.h
index a2cb805..9ebdde7 100644
--- a/drivers/net/ethernet/cadence/macb.h
+++ b/drivers/net/ethernet/cadence/macb.h
@@ -83,6 +83,7 @@
 #define GEM_USRIO		0x000c /* User IO */
 #define GEM_DMACFG		0x0010 /* DMA Configuration */
 #define GEM_JML			0x0048 /* Jumbo Max Length */
+#define GEM_RSC			0x0058 /* RSC Control */
 #define GEM_HRB			0x0080 /* Hash Bottom */
 #define GEM_HRT			0x0084 /* Hash Top */
 #define GEM_SA1B		0x0088 /* Specific1 Bottom */
@@ -318,6 +319,11 @@
 #define GEM_ADDR64_OFFSET	30 /* Address bus width - 64b or 32b */
 #define GEM_ADDR64_SIZE		1
 
+/* Bitfields in RSC control */
+#define GEM_RSCCTRL_OFFSET	1 /* RSC control */
+#define GEM_RSCCTRL_SIZE	15
+#define GEM_CLRMSK_OFFSET	16 /* RSC clear mask */
+#define GEM_CLRMSK_SIZE		1
 
 /* Bitfields in NSR */
 #define MACB_NSR_LINK_OFFSET	0 /* pcs_link_state */
diff --git a/drivers/net/ethernet/cadence/macb_main.c b/drivers/net/ethernet/cadence/macb_main.c
index 27c406c..92bdcf1 100644
--- a/drivers/net/ethernet/cadence/macb_main.c
+++ b/drivers/net/ethernet/cadence/macb_main.c
@@ -2377,6 +2377,8 @@ static int macb_open(struct net_device *dev)
 
 	if (!(bp->dev->hw_features & NETIF_F_LRO))
 		bufsz += NET_IP_ALIGN;
+	else
+		bufsz = 0xFF * 64; // For RSC Buffer Sizes must be set to 16K.
 
 	/* RX buffers initialization */
 	macb_init_rx_buffer_size(bp, bufsz);
@@ -2801,6 +2803,62 @@ static int macb_get_ts_info(struct net_device *netdev,
 	return ethtool_op_get_ts_info(netdev, info);
 }
 
+static void gem_enable_hdr_data_split(struct macb *bp, bool enable)
+{
+	u32 dmacfg;
+
+	dmacfg = gem_readl(bp, DMACFG);
+	if (enable)
+		dmacfg |= GEM_BIT(HDRS);
+	else
+		dmacfg &= ~GEM_BIT(HDRS);
+	gem_writel(bp, DMACFG, dmacfg);
+}
+
+static void gem_update_rsc_state(struct macb *bp, netdev_features_t feature)
+{
+	u32 rsc_control, rsc_control_new, queue, rsc;
+	bool enable, jumbo, any_enabled = false;
+	struct ethtool_rx_fs_item *item;
+	unsigned long flags;
+	u32 ncfgr;
+
+	enable = (!!(feature & NETIF_F_NTUPLE) && !!(feature & NETIF_F_LRO));
+	rsc = gem_readl(bp, RSC);
+	rsc_control = GEM_BFEXT(RSCCTRL, rsc);
+	rsc_control_new = 0;
+	if (enable) {
+		list_for_each_entry(item, &bp->rx_fs_list.list, list) {
+			queue = item->fs.ring_cookie;
+			rsc_control_new |= (1 << (queue - 1));
+			any_enabled = true;
+			netdev_dbg(bp->dev, "RSC %sabled for queue %u\n",
+				   enable ? "en" : "dis", queue);
+		}
+	}
+	if (rsc_control_new != rsc_control) {
+		rsc = GEM_BFINS(RSCCTRL, rsc_control_new, rsc);
+		gem_writel(bp, RSC, rsc);
+	}
+	if (bp->caps & MACB_CAPS_JUMBO) {
+		/* Don't enable jumbo mode for RSC:
+		 * disable unless not RSC and large MTU
+		 */
+		ncfgr = gem_readl(bp, NCFGR);
+		enable = !any_enabled;
+		jumbo = !!MACB_BFEXT(JFRAME, ncfgr);
+		/* and don't touch if already in the state we want */
+		if ((jumbo && !enable) || (!jumbo && enable)) {
+			ncfgr = MACB_BFINS(JFRAME, enable, ncfgr);
+			spin_lock_irqsave(&bp->lock, flags);
+			gem_writel(bp, NCFGR, ncfgr);
+			spin_unlock_irqrestore(&bp->lock, flags);
+		}
+	}
+	/* Need to enable header-data splitting also */
+	gem_enable_hdr_data_split(bp, any_enabled);
+}
+
 static void gem_enable_flow_filters(struct macb *bp, bool enable)
 {
 	struct ethtool_rx_fs_item *item;
@@ -2969,6 +3027,8 @@ static int gem_add_flow_filter(struct net_device *netdev,
 	if (netdev->features & NETIF_F_NTUPLE)
 		gem_enable_flow_filters(bp, 1);
 
+	/* enable RSC if LRO & NTUPLE on */
+	gem_update_rsc_state(bp, netdev->features);
 	spin_unlock_irqrestore(&bp->rx_fs_lock, flags);
 	return 0;
 
@@ -3009,6 +3069,7 @@ static int gem_del_flow_filter(struct net_device *netdev,
 			return 0;
 		}
 	}
+	gem_update_rsc_state(bp, netdev->features);
 
 	spin_unlock_irqrestore(&bp->rx_fs_lock, flags);
 	return -EINVAL;
@@ -3191,7 +3252,12 @@ static int macb_set_features(struct net_device *netdev,
 		bool turn_on = features & NETIF_F_NTUPLE;
 
 		gem_enable_flow_filters(bp, turn_on);
+		gem_update_rsc_state(bp, features);
 	}
+
+	/* LRO (Large Receive Offload) aka RSC (Receive Side Coalescing) */
+	if ((changed & NETIF_F_LRO) && macb_is_gem(bp))
+		gem_update_rsc_state(bp, features);
 	return 0;
 }
 
@@ -3449,8 +3515,10 @@ static int macb_init(struct platform_device *pdev)
 		dev->hw_features |= MACB_NETIF_LSO;
 
 	/* Check RSC capability */
-	if (GEM_BFEXT(PBUF_RSC, gem_readl(bp, DCFG6)))
+	if (GEM_BFEXT(PBUF_RSC, gem_readl(bp, DCFG6))) {
 		dev->hw_features |= NETIF_F_LRO;
+		gem_writel(bp, RSC, GEM_BIT(CLRMSK));
+	}
 
 	/* Checksum offload is only available on gem with packet buffer */
 	if (macb_is_gem(bp) && !(bp->caps & MACB_CAPS_FIFO_MODE))
-- 
2.4.5

^ permalink raw reply related

* [PATCH 2/3] net: macb: Add support for header data splitting
From: Rafal Ozieblo @ 2018-04-14 20:54 UTC (permalink / raw)
  To: Nicolas Ferre, netdev, linux-kernel; +Cc: Rafal Ozieblo
In-Reply-To: <1523739187-20077-1-git-send-email-rafalo@cadence.com>

This patch adds support for frames split across
multiple rx buffers. Header data splitting can be used,
as can buffers shorter than the maximum frame length.
The only limitation is that the frame header itself
cannot be split across buffers.

Signed-off-by: Rafal Ozieblo <rafalo@cadence.com>
---
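Note (not part of the commit): a stand-alone sketch of the per-buffer
length accounting that gem_rx() performs in the diff below. The toy_*
names are hypothetical. The interpretation of the descriptor length
field (0 means a full buffer; otherwise a running total at end of header
or end of frame) is taken from the comment in the patch, not from the
hardware documentation.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical per-queue reassembly state. */
struct toy_rx_state {
	uint32_t frag_len; /* bytes already received for the current frame */
};

/* Return the number of bytes carried by this buffer.
 * desc_len is the length field from the descriptor: 0 means the buffer
 * is completely filled, otherwise it is a running total (header size at
 * end of header, whole frame size at end of frame).
 */
static uint32_t toy_buffer_len(struct toy_rx_state *st, uint32_t desc_len,
			       uint32_t rx_buffer_size, int eof)
{
	uint32_t len;

	if (!desc_len) {
		len = rx_buffer_size;
	} else {
		/* A running total can never be smaller than what we hold. */
		assert(desc_len >= st->frag_len);
		len = desc_len - st->frag_len;
	}

	if (eof)
		st->frag_len = 0;   /* frame complete, start over */
	else
		st->frag_len += len;
	return len;
}
```

For a 300-byte frame with a 54-byte header and 128-byte buffers, the
three buffers carry 54, 128 and 118 bytes respectively.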
 drivers/net/ethernet/cadence/macb.h      |  13 +++
 drivers/net/ethernet/cadence/macb_main.c | 137 +++++++++++++++++++++++--------
 2 files changed, 118 insertions(+), 32 deletions(-)

diff --git a/drivers/net/ethernet/cadence/macb.h b/drivers/net/ethernet/cadence/macb.h
index 33c9a48..a2cb805 100644
--- a/drivers/net/ethernet/cadence/macb.h
+++ b/drivers/net/ethernet/cadence/macb.h
@@ -295,6 +295,8 @@
 /* Bitfields in DMACFG. */
 #define GEM_FBLDO_OFFSET	0 /* fixed burst length for DMA */
 #define GEM_FBLDO_SIZE		5
+#define GEM_HDRS_OFFSET		5 /* Header Data Splitting */
+#define GEM_HDRS_SIZE		1
 #define GEM_ENDIA_DESC_OFFSET	6 /* endian swap mode for management descriptor access */
 #define GEM_ENDIA_DESC_SIZE	1
 #define GEM_ENDIA_PKT_OFFSET	7 /* endian swap mode for packet data access */
@@ -755,8 +757,12 @@ struct gem_tx_ts {
 #define MACB_RX_SOF_SIZE			1
 #define MACB_RX_EOF_OFFSET			15
 #define MACB_RX_EOF_SIZE			1
+#define MACB_RX_HDR_OFFSET			16
+#define MACB_RX_HDR_SIZE			1
 #define MACB_RX_CFI_OFFSET			16
 #define MACB_RX_CFI_SIZE			1
+#define MACB_RX_EOH_OFFSET			17
+#define MACB_RX_EOH_SIZE			1
 #define MACB_RX_VLAN_PRI_OFFSET			17
 #define MACB_RX_VLAN_PRI_SIZE			3
 #define MACB_RX_PRI_TAG_OFFSET			20
@@ -1086,6 +1092,11 @@ struct tsu_incr {
 	u32 ns;
 };
 
+struct rx_frag_list {
+	struct sk_buff		*skb_head;
+	struct sk_buff		*skb_tail;
+};
+
 struct macb_queue {
 	struct macb		*bp;
 	int			irq;
@@ -1121,6 +1132,8 @@ struct macb_queue {
 	unsigned int		tx_ts_head, tx_ts_tail;
 	struct gem_tx_ts	tx_timestamps[PTP_TS_BUFFER_SIZE];
 #endif
+	struct rx_frag_list	rx_frag;
+	u32			rx_frag_len;
 };
 
 struct ethtool_rx_fs_item {
diff --git a/drivers/net/ethernet/cadence/macb_main.c b/drivers/net/ethernet/cadence/macb_main.c
index 43201a8..27c406c 100644
--- a/drivers/net/ethernet/cadence/macb_main.c
+++ b/drivers/net/ethernet/cadence/macb_main.c
@@ -967,6 +967,13 @@ static void discard_partial_frame(struct macb_queue *queue, unsigned int begin,
 	 */
 }
 
+void gem_reset_rx_state(struct macb_queue *queue)
+{
+	queue->rx_frag.skb_head = NULL;
+	queue->rx_frag.skb_tail = NULL;
+	queue->rx_frag_len = 0;
+}
+
 static int gem_rx(struct macb_queue *queue, int budget)
 {
 	struct macb *bp = queue->bp;
@@ -977,6 +984,9 @@ static int gem_rx(struct macb_queue *queue, int budget)
 	int			count = 0;
 
 	while (count < budget) {
+		struct sk_buff *skb_head, *skb_tail;
+		bool eoh = false, header = false;
+		bool sof, eof;
 		u32 ctrl;
 		dma_addr_t addr;
 		bool rxused;
@@ -995,57 +1005,118 @@ static int gem_rx(struct macb_queue *queue, int budget)
 			break;
 
 		queue->rx_tail++;
-		count++;
-
-		if (!(ctrl & MACB_BIT(RX_SOF) && ctrl & MACB_BIT(RX_EOF))) {
+		skb = queue->rx_skbuff[entry];
+		if (unlikely(!skb)) {
 			netdev_err(bp->dev,
-				   "not whole frame pointed by descriptor\n");
+				   "inconsistent Rx descriptor chain\n");
 			bp->dev->stats.rx_dropped++;
 			queue->stats.rx_dropped++;
 			break;
 		}
-		skb = queue->rx_skbuff[entry];
-		if (unlikely(!skb)) {
+		skb_head = queue->rx_frag.skb_head;
+		skb_tail = queue->rx_frag.skb_tail;
+		sof = !!(ctrl & MACB_BIT(RX_SOF));
+		eof = !!(ctrl & MACB_BIT(RX_EOF));
+		if (GEM_BFEXT(HDRS, gem_readl(bp, DMACFG))) {
+			eoh = !!(ctrl & MACB_BIT(RX_EOH));
+			if (!eof)
+				header = !!(ctrl & MACB_BIT(RX_HDR));
+		}
+
+		queue->rx_skbuff[entry] = NULL;
+		/* Discard if out-of-sequence or header split across buffers */
+		if ((!skb_head /* first frame buffer */
+		&& (!sof /* without start of frame */
+		|| (header && !eoh))) /* or without whole header */
+		|| (skb_head && sof)) { /* or new start before EOF */
+			struct sk_buff *tmp_skb;
+
 			netdev_err(bp->dev,
-				   "inconsistent Rx descriptor chain\n");
+				   "Incomplete frame received! (skb_head=%p sof=%u hdr=%u eoh=%u)\n",
+				   skb_head, (u32)sof, (u32)header, (u32)eoh);
+			dev_kfree_skb(skb);
+			if (skb_head) {
+				skb = skb_shinfo(skb_head)->frag_list;
+				dev_kfree_skb(skb_head);
+				while (skb) {
+					tmp_skb = skb;
+					skb = skb->next;
+					dev_kfree_skb(tmp_skb);
+				}
+			}
 			bp->dev->stats.rx_dropped++;
 			queue->stats.rx_dropped++;
+			gem_reset_rx_state(queue);
 			break;
 		}
+
 		/* now everything is ready for receiving packet */
-		queue->rx_skbuff[entry] = NULL;
 		len = ctrl & bp->rx_frm_len_mask;
 
+		/* Buffer lengths in the descriptor:
+		 * eoh: len = header size,
+		 * eof: len = frame size (including header),
+		 * else: len = 0, length equals bp->rx_buffer_size
+		 */
+		if (!len)
+			len = bp->rx_buffer_size;
+		else
+			/* If EOF or EOH reduce the size of the packet
+			 * by already received bytes
+			 */
+			len -= queue->rx_frag_len;
+
 		netdev_vdbg(bp->dev, "gem_rx %u (len %u)\n", entry, len);
 
+		gem_ptp_do_rxstamp(bp, skb, desc);
+
 		skb_put(skb, len);
 		dma_unmap_single(&bp->pdev->dev, addr,
 				 bp->rx_buffer_size, DMA_FROM_DEVICE);
 
-		skb->protocol = eth_type_trans(skb, bp->dev);
-		skb_checksum_none_assert(skb);
-		if (bp->dev->features & NETIF_F_RXCSUM &&
-		    !(bp->dev->flags & IFF_PROMISC) &&
-		    GEM_BFEXT(RX_CSUM, ctrl) & GEM_RX_CSUM_CHECKED_MASK)
-			skb->ip_summed = CHECKSUM_UNNECESSARY;
-
-		bp->dev->stats.rx_packets++;
-		queue->stats.rx_packets++;
-		bp->dev->stats.rx_bytes += skb->len;
-		queue->stats.rx_bytes += skb->len;
-
-		gem_ptp_do_rxstamp(bp, skb, desc);
-
-#if defined(DEBUG) && defined(VERBOSE_DEBUG)
-		netdev_vdbg(bp->dev, "received skb of length %u, csum: %08x\n",
-			    skb->len, skb->csum);
-		print_hex_dump(KERN_DEBUG, " mac: ", DUMP_PREFIX_ADDRESS, 16, 1,
-			       skb_mac_header(skb), 16, true);
-		print_hex_dump(KERN_DEBUG, "data: ", DUMP_PREFIX_ADDRESS, 16, 1,
-			       skb->data, 32, true);
-#endif
-
-		netif_receive_skb(skb);
+		if (!skb_head) {
+			/* first buffer in frame */
+			skb->protocol = eth_type_trans(skb, bp->dev);
+			skb_checksum_none_assert(skb);
+			if (bp->dev->features & NETIF_F_RXCSUM &&
+			    !(bp->dev->flags & IFF_PROMISC) &&
+			    GEM_BFEXT(RX_CSUM, ctrl) & GEM_RX_CSUM_CHECKED_MASK)
+				skb->ip_summed = CHECKSUM_UNNECESSARY;
+			queue->rx_frag.skb_head = skb;
+			queue->rx_frag.skb_tail = skb;
+			skb_head = skb;
+		} else {
+			/* not first buffer in frame */
+			if (!skb_shinfo(skb_head)->frag_list)
+				skb_shinfo(skb_head)->frag_list = skb;
+			else
+				skb_tail->next = skb;
+			queue->rx_frag.skb_tail = skb;
+			skb_head->len += len;
+			skb_head->data_len += len;
+			skb_head->truesize += len;
+		}
+		if (eof) {
+			bp->dev->stats.rx_packets++;
+			queue->stats.rx_packets++;
+			bp->dev->stats.rx_bytes += skb->len;
+			queue->stats.rx_bytes += skb->len;
+
+	#if defined(DEBUG) && defined(VERBOSE_DEBUG)
+			netdev_vdbg(bp->dev, "received skb of length %u, csum: %08x\n",
+				    skb->len, skb->csum);
+			print_hex_dump(KERN_DEBUG, " mac: ", DUMP_PREFIX_ADDRESS, 16, 1,
+				       skb_mac_header(skb), 16, true);
+			print_hex_dump(KERN_DEBUG, "data: ", DUMP_PREFIX_ADDRESS, 16, 1,
+				       skb->data, 32, true);
+	#endif
+
+			netif_receive_skb(skb_head);
+			gem_reset_rx_state(queue);
+			count++;
+		} else {
+			queue->rx_frag_len += len;
+		}
 	}
 
 	gem_rx_refill(queue);
@@ -1905,6 +1976,8 @@ static int macb_alloc_consistent(struct macb *bp)
 		netdev_dbg(bp->dev,
 			   "Allocated RX ring of %d bytes at %08lx (mapped %p)\n",
 			   size, (unsigned long)queue->rx_ring_dma, queue->rx_ring);
+
+		gem_reset_rx_state(queue);
 	}
 	if (bp->macbgem_ops.mog_alloc_rx_buffers(bp))
 		goto out_err;
-- 
2.4.5

^ permalink raw reply related

* [PATCH 1/3] net: macb: Add support for rsc capable hardware
From: Rafal Ozieblo @ 2018-04-14 20:53 UTC (permalink / raw)
  To: Nicolas Ferre, netdev, linux-kernel; +Cc: Rafal Ozieblo
In-Reply-To: <1523739187-20077-1-git-send-email-rafalo@cadence.com>

When pbuf_rsc has been enabled in hardware,
the receive buffer offset for incoming packets
cannot be changed in the network configuration register
(even when RSC is not used at all).

Signed-off-by: Rafal Ozieblo <rafalo@cadence.com>
---
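Note (not part of the commit): a stand-alone sketch of the rx buffer
size calculation after this change. The TOY_* constants restate values
commonly found in the kernel headers and are assumptions here; in
particular NET_IP_ALIGN is 2 by default but some architectures define it
as 0. The function name is hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Assumed values, restated so the sketch is self-contained. */
#define TOY_ETH_HLEN      14 /* Ethernet header */
#define TOY_ETH_FCS_LEN    4 /* frame check sequence */
#define TOY_NET_IP_ALIGN   2 /* pad so the IP header is 4-byte aligned */

/* Buffer size needed for one frame. When the hardware has RSC, the
 * receive buffer offset cannot be programmed, so no alignment pad is
 * reserved.
 */
static size_t toy_rx_bufsz(size_t mtu, int hw_has_rsc)
{
	size_t bufsz;

	assert(mtu > 0);
	bufsz = mtu + TOY_ETH_HLEN + TOY_ETH_FCS_LEN;
	if (!hw_has_rsc)
		bufsz += TOY_NET_IP_ALIGN;
	return bufsz;
}
```

With an MTU of 1500 this gives 1520 bytes without RSC and 1518 bytes
with it, under the assumed constants.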
 drivers/net/ethernet/cadence/macb.h      |  2 ++
 drivers/net/ethernet/cadence/macb_main.c | 22 ++++++++++++++++++----
 2 files changed, 20 insertions(+), 4 deletions(-)

diff --git a/drivers/net/ethernet/cadence/macb.h b/drivers/net/ethernet/cadence/macb.h
index 8665982..33c9a48 100644
--- a/drivers/net/ethernet/cadence/macb.h
+++ b/drivers/net/ethernet/cadence/macb.h
@@ -477,6 +477,8 @@
 /* Bitfields in DCFG6. */
 #define GEM_PBUF_LSO_OFFSET			27
 #define GEM_PBUF_LSO_SIZE			1
+#define GEM_PBUF_RSC_OFFSET			26
+#define GEM_PBUF_RSC_SIZE			1
 #define GEM_DAW64_OFFSET			23
 #define GEM_DAW64_SIZE				1
 
diff --git a/drivers/net/ethernet/cadence/macb_main.c b/drivers/net/ethernet/cadence/macb_main.c
index b4c9268..43201a8 100644
--- a/drivers/net/ethernet/cadence/macb_main.c
+++ b/drivers/net/ethernet/cadence/macb_main.c
@@ -930,8 +930,9 @@ static void gem_rx_refill(struct macb_queue *queue)
 			macb_set_addr(bp, desc, paddr);
 			desc->ctrl = 0;
 
-			/* properly align Ethernet header */
-			skb_reserve(skb, NET_IP_ALIGN);
+			if (!(bp->dev->hw_features & NETIF_F_LRO))
+				/* properly align Ethernet header */
+				skb_reserve(skb, NET_IP_ALIGN);
 		} else {
 			desc->addr &= ~MACB_BIT(RX_USED);
 			desc->ctrl = 0;
@@ -2110,7 +2111,13 @@ static void macb_init_hw(struct macb *bp)
 	config = macb_mdc_clk_div(bp);
 	if (bp->phy_interface == PHY_INTERFACE_MODE_SGMII)
 		config |= GEM_BIT(SGMIIEN) | GEM_BIT(PCSSEL);
-	config |= MACB_BF(RBOF, NET_IP_ALIGN);	/* Make eth data aligned */
+	/* When the pbuf_rsc has been enabled in hardware the receive buffer
+	 * offset cannot be changed in the network configuration register.
+	 */
+	if (!(bp->dev->hw_features &  NETIF_F_LRO))
+		/* Make eth data aligned */
+		config |= MACB_BF(RBOF, NET_IP_ALIGN);
+
 	config |= MACB_BIT(PAE);		/* PAuse Enable */
 	config |= MACB_BIT(DRFCS);		/* Discard Rx FCS */
 	if (bp->caps & MACB_CAPS_JUMBO)
@@ -2281,7 +2288,7 @@ static void macb_set_rx_mode(struct net_device *dev)
 static int macb_open(struct net_device *dev)
 {
 	struct macb *bp = netdev_priv(dev);
-	size_t bufsz = dev->mtu + ETH_HLEN + ETH_FCS_LEN + NET_IP_ALIGN;
+	size_t bufsz = dev->mtu + ETH_HLEN + ETH_FCS_LEN;
 	struct macb_queue *queue;
 	unsigned int q;
 	int err;
@@ -2295,6 +2302,9 @@ static int macb_open(struct net_device *dev)
 	if (!dev->phydev)
 		return -EAGAIN;
 
+	if (!(bp->dev->hw_features & NETIF_F_LRO))
+		bufsz += NET_IP_ALIGN;
+
 	/* RX buffers initialization */
 	macb_init_rx_buffer_size(bp, bufsz);
 
@@ -3365,6 +3375,10 @@ static int macb_init(struct platform_device *pdev)
 	if (GEM_BFEXT(PBUF_LSO, gem_readl(bp, DCFG6)))
 		dev->hw_features |= MACB_NETIF_LSO;
 
+	/* Check RSC capability */
+	if (GEM_BFEXT(PBUF_RSC, gem_readl(bp, DCFG6)))
+		dev->hw_features |= NETIF_F_LRO;
+
 	/* Checksum offload is only available on gem with packet buffer */
 	if (macb_is_gem(bp) && !(bp->caps & MACB_CAPS_FIFO_MODE))
 		dev->hw_features |= NETIF_F_HW_CSUM | NETIF_F_RXCSUM;
-- 
2.4.5

^ permalink raw reply related

* [PATCH 0/3] Receive Side Coalescing for macb driver
From: Rafal Ozieblo @ 2018-04-14 20:53 UTC (permalink / raw)
  To: Nicolas Ferre, netdev, linux-kernel; +Cc: Rafal Ozieblo

This patch series adds support for receive side coalescing (RSC)
to the Cadence GEM driver. Receive side coalescing
is a mechanism to reduce CPU overhead. It works by
coalescing received TCP segments into
a single large message. Once the message
is complete, the CPU only has to process a single header
and act upon one data payload.
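
To illustrate the idea only, here is a software sketch of merging
in-order, contiguous TCP segments. It is not how the hardware works
internally; the struct and function names are hypothetical, and a real
implementation would also have to check that the segments belong to the
same flow and carry compatible flags.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical description of one received TCP segment. */
struct toy_segment {
	uint32_t seq; /* sequence number of the first payload byte */
	uint32_t len; /* payload length in bytes */
};

/* Merge adjacent segments whose sequence ranges are contiguous.
 * Works in place and returns the number of segments that remain.
 */
static size_t toy_coalesce(struct toy_segment *segs, size_t n)
{
	size_t out = 0;
	size_t i;

	assert(segs != NULL || n == 0);
	if (n == 0)
		return 0;

	for (i = 1; i < n; i++) {
		if (segs[out].seq + segs[out].len == segs[i].seq) {
			/* Next segment continues the current one. */
			segs[out].len += segs[i].len;
		} else {
			/* Gap or reordering: start a new message. */
			out++;
			segs[out] = segs[i];
		}
	}
	return out + 1;
}
```

Three back-to-back segments followed by one unrelated segment collapse
into two entries, so the stack sees two headers instead of four.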

Rafal Ozieblo (3):
  net: macb: Add support for rsc capable hardware
  net: macb: Add support for header data splitting
  net: macb: Receive Side Coalescing (RSC) feature added.

 drivers/net/ethernet/cadence/macb.h      |  21 +++
 drivers/net/ethernet/cadence/macb_main.c | 227 ++++++++++++++++++++++++++-----
 2 files changed, 212 insertions(+), 36 deletions(-)

-- 
2.4.5

^ permalink raw reply

* Re: [PATCH] x86/cpufeature: guard asm_volatile_goto usage with CC_HAVE_ASM_GOTO
From: Yonghong Song @ 2018-04-14 20:30 UTC (permalink / raw)
  To: Peter Zijlstra, Alexei Starovoitov
  Cc: mingo, daniel, linux-kernel, x86, kernel-team, Thomas Gleixner,
	netdev, Jesper Dangaard Brouer
In-Reply-To: <20180414101112.GX4064@hirez.programming.kicks-ass.net>



On 4/14/18 3:11 AM, Peter Zijlstra wrote:
> On Fri, Apr 13, 2018 at 01:42:14PM -0700, Alexei Starovoitov wrote:
>> On 4/13/18 11:19 AM, Peter Zijlstra wrote:
>>> On Tue, Apr 10, 2018 at 02:28:04PM -0700, Alexei Starovoitov wrote:
>>>> Instead of
>>>> #ifdef CC_HAVE_ASM_GOTO
>>>> we can replace it with
>>>> #ifndef __BPF__
>>>> or some other name,
>>>
>>> I would prefer the BPF specific hack; otherwise we might be encouraging
>>> people to build the kernel proper without asm-goto.
>>>
>>
>> I don't understand this concern.
> 
> The thing is; this will be a (temporary) BPF specific hack. Hiding it
> behind something that looks 'normal' (CC_HAVE_ASM_GOTO) is just not
> right.

This is a fair concern. I will use a different macro and send v2 soon.
Thanks.

^ permalink raw reply

