Netdev List
 help / color / mirror / Atom feed
* Re: INFO: task hung in ip6gre_exit_batch_net
From: Dmitry Vyukov @ 2018-06-04 15:22 UTC (permalink / raw)
  To: syzbot
  Cc: Christian Brauner, David Miller, David Ahern, Florian Westphal,
	Jiri Benc, Kirill Tkhai, LKML, Xin Long, mschiffer, netdev,
	syzkaller-bugs, Vladislav Yasevich
In-Reply-To: <0000000000006e4595056dd23c16@google.com>

On Mon, Jun 4, 2018 at 5:03 PM, syzbot
<syzbot+bf78a74f82c1cf19069e@syzkaller.appspotmail.com> wrote:
> Hello,
>
> syzbot found the following crash on:
>
> HEAD commit:    bc2dbc5420e8 Merge branch 'akpm' (patches from Andrew)
> git tree:       upstream
> console output: https://syzkaller.appspot.com/x/log.txt?x=164e42b7800000
> kernel config:  https://syzkaller.appspot.com/x/.config?x=982e2df1b9e60b02
> dashboard link: https://syzkaller.appspot.com/bug?extid=bf78a74f82c1cf19069e
> compiler:       gcc (GCC) 8.0.1 20180413 (experimental)
>
> Unfortunately, I don't have any reproducer for this crash yet.
>
> IMPORTANT: if you fix the bug, please add the following tag to the commit:
> Reported-by: syzbot+bf78a74f82c1cf19069e@syzkaller.appspotmail.com

Another hang on rtnl lock:

#syz dup: INFO: task hung in netdev_run_todo

May be related to "unregister_netdevice: waiting for DEV to become free":
https://syzkaller.appspot.com/bug?id=1a97a5bd119fd97995f752819fd87840ab9479a9

Any other explanations for massive hangs on rtnl lock for minutes?


> INFO: task kworker/u4:1:22 blocked for more than 120 seconds.
>       Not tainted 4.17.0-rc6+ #68
> "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
> kworker/u4:1    D13944    22      2 0x80000000
> Workqueue: netns cleanup_net
> Call Trace:
>  context_switch kernel/sched/core.c:2859 [inline]
>  __schedule+0x801/0x1e30 kernel/sched/core.c:3501
>  schedule+0xef/0x430 kernel/sched/core.c:3545
>  schedule_preempt_disabled+0x10/0x20 kernel/sched/core.c:3603
>  __mutex_lock_common kernel/locking/mutex.c:833 [inline]
>  __mutex_lock+0xe38/0x17f0 kernel/locking/mutex.c:893
>  mutex_lock_nested+0x16/0x20 kernel/locking/mutex.c:908
>  rtnl_lock+0x17/0x20 net/core/rtnetlink.c:74
>  ip6gre_exit_batch_net+0xd5/0x7d0 net/ipv6/ip6_gre.c:1585
>  ops_exit_list.isra.7+0x105/0x160 net/core/net_namespace.c:155
>  cleanup_net+0x51d/0xb20 net/core/net_namespace.c:523
>  process_one_work+0xc1e/0x1b50 kernel/workqueue.c:2145
>  worker_thread+0x1cc/0x1440 kernel/workqueue.c:2279
>  kthread+0x345/0x410 kernel/kthread.c:240
>  ret_from_fork+0x3a/0x50 arch/x86/entry/entry_64.S:412
>
> Showing all locks held in the system:
> 4 locks held by kworker/u4:1/22:
>  #0: 0000000049a7b590 ((wq_completion)"%s""netns"){+.+.}, at:
> __write_once_size include/linux/compiler.h:215 [inline]
>  #0: 0000000049a7b590 ((wq_completion)"%s""netns"){+.+.}, at:
> arch_atomic64_set arch/x86/include/asm/atomic64_64.h:34 [inline]
>  #0: 0000000049a7b590 ((wq_completion)"%s""netns"){+.+.}, at: atomic64_set
> include/asm-generic/atomic-instrumented.h:40 [inline]
>  #0: 0000000049a7b590 ((wq_completion)"%s""netns"){+.+.}, at:
> atomic_long_set include/asm-generic/atomic-long.h:57 [inline]
>  #0: 0000000049a7b590 ((wq_completion)"%s""netns"){+.+.}, at: set_work_data
> kernel/workqueue.c:617 [inline]
>  #0: 0000000049a7b590 ((wq_completion)"%s""netns"){+.+.}, at:
> set_work_pool_and_clear_pending kernel/workqueue.c:644 [inline]
>  #0: 0000000049a7b590 ((wq_completion)"%s""netns"){+.+.}, at:
> process_one_work+0xaef/0x1b50 kernel/workqueue.c:2116
>  #1: 0000000030a00b6b (net_cleanup_work){+.+.}, at:
> process_one_work+0xb46/0x1b50 kernel/workqueue.c:2120
>  #2: 000000007eb35e65 (pernet_ops_rwsem){++++}, at: cleanup_net+0x11a/0xb20
> net/core/net_namespace.c:490
>  #3: 000000007eb32c75 (rtnl_mutex){+.+.}, at: rtnl_lock+0x17/0x20
> net/core/rtnetlink.c:74
> 3 locks held by kworker/1:1/24:
>  #0: 000000001c9e6580 ((wq_completion)"%s"("ipv6_addrconf")){+.+.}, at:
> __write_once_size include/linux/compiler.h:215 [inline]
>  #0: 000000001c9e6580 ((wq_completion)"%s"("ipv6_addrconf")){+.+.}, at:
> arch_atomic64_set arch/x86/include/asm/atomic64_64.h:34 [inline]
>  #0: 000000001c9e6580 ((wq_completion)"%s"("ipv6_addrconf")){+.+.}, at:
> atomic64_set include/asm-generic/atomic-instrumented.h:40 [inline]
>  #0: 000000001c9e6580 ((wq_completion)"%s"("ipv6_addrconf")){+.+.}, at:
> atomic_long_set include/asm-generic/atomic-long.h:57 [inline]
>  #0: 000000001c9e6580 ((wq_completion)"%s"("ipv6_addrconf")){+.+.}, at:
> set_work_data kernel/workqueue.c:617 [inline]
>  #0: 000000001c9e6580 ((wq_completion)"%s"("ipv6_addrconf")){+.+.}, at:
> set_work_pool_and_clear_pending kernel/workqueue.c:644 [inline]
>  #0: 000000001c9e6580 ((wq_completion)"%s"("ipv6_addrconf")){+.+.}, at:
> process_one_work+0xaef/0x1b50 kernel/workqueue.c:2116
>  #1: 000000009edcfbe7 ((addr_chk_work).work){+.+.}, at:
> process_one_work+0xb46/0x1b50 kernel/workqueue.c:2120
>  #2: 000000007eb32c75 (rtnl_mutex){+.+.}, at: rtnl_lock+0x17/0x20
> net/core/rtnetlink.c:74
> 2 locks held by khungtaskd/893:
>  #0: 000000007eeb621a (rcu_read_lock){....}, at:
> check_hung_uninterruptible_tasks kernel/hung_task.c:175 [inline]
>  #0: 000000007eeb621a (rcu_read_lock){....}, at: watchdog+0x1ff/0xf60
> kernel/hung_task.c:249
>  #1: 00000000239f1b5e (tasklist_lock){.+.+}, at:
> debug_show_all_locks+0xde/0x34a kernel/locking/lockdep.c:4470
> 2 locks held by getty/4481:
>  #0: 00000000cc114025 (&tty->ldisc_sem){++++}, at: ldsem_down_read+0x37/0x40
> drivers/tty/tty_ldsem.c:365
>  #1: 000000006ad1f3fc (&ldata->atomic_read_lock){+.+.}, at:
> n_tty_read+0x321/0x1cc0 drivers/tty/n_tty.c:2131
> 2 locks held by getty/4482:
>  #0: 00000000226a16cc (&tty->ldisc_sem){++++}, at: ldsem_down_read+0x37/0x40
> drivers/tty/tty_ldsem.c:365
>  #1: 000000008cee8cdc (&ldata->atomic_read_lock){+.+.}, at:
> n_tty_read+0x321/0x1cc0 drivers/tty/n_tty.c:2131
> 2 locks held by getty/4483:
>  #0: 0000000067bd3c39 (&tty->ldisc_sem){++++}, at: ldsem_down_read+0x37/0x40
> drivers/tty/tty_ldsem.c:365
>  #1: 000000005d8bc81d (&ldata->atomic_read_lock){+.+.}, at:
> n_tty_read+0x321/0x1cc0 drivers/tty/n_tty.c:2131
> 2 locks held by getty/4484:
>  #0: 00000000f0f8d839 (&tty->ldisc_sem){++++}, at: ldsem_down_read+0x37/0x40
> drivers/tty/tty_ldsem.c:365
>  #1: 00000000a9d5f091 (&ldata->atomic_read_lock){+.+.}, at:
> n_tty_read+0x321/0x1cc0 drivers/tty/n_tty.c:2131
> 2 locks held by getty/4485:
>  #0: 000000002c96ee9a (&tty->ldisc_sem){++++}, at: ldsem_down_read+0x37/0x40
> drivers/tty/tty_ldsem.c:365
>  #1: 0000000033338ac7 (&ldata->atomic_read_lock){+.+.}, at:
> n_tty_read+0x321/0x1cc0 drivers/tty/n_tty.c:2131
> 2 locks held by getty/4486:
>  #0: 00000000f6db39b5 (&tty->ldisc_sem){++++}, at: ldsem_down_read+0x37/0x40
> drivers/tty/tty_ldsem.c:365
>  #1: 00000000bb7c6099 (&ldata->atomic_read_lock){+.+.}, at:
> n_tty_read+0x321/0x1cc0 drivers/tty/n_tty.c:2131
> 2 locks held by getty/4487:
>  #0: 000000006be9659d (&tty->ldisc_sem){++++}, at: ldsem_down_read+0x37/0x40
> drivers/tty/tty_ldsem.c:365
>  #1: 00000000e2edd3d0 (&ldata->atomic_read_lock){+.+.}, at:
> n_tty_read+0x321/0x1cc0 drivers/tty/n_tty.c:2131
> 3 locks held by kworker/1:3/27147:
>  #0: 00000000c208aa7f ((wq_completion)"events"){+.+.}, at: __write_once_size
> include/linux/compiler.h:215 [inline]
>  #0: 00000000c208aa7f ((wq_completion)"events"){+.+.}, at: arch_atomic64_set
> arch/x86/include/asm/atomic64_64.h:34 [inline]
>  #0: 00000000c208aa7f ((wq_completion)"events"){+.+.}, at: atomic64_set
> include/asm-generic/atomic-instrumented.h:40 [inline]
>  #0: 00000000c208aa7f ((wq_completion)"events"){+.+.}, at: atomic_long_set
> include/asm-generic/atomic-long.h:57 [inline]
>  #0: 00000000c208aa7f ((wq_completion)"events"){+.+.}, at: set_work_data
> kernel/workqueue.c:617 [inline]
>  #0: 00000000c208aa7f ((wq_completion)"events"){+.+.}, at:
> set_work_pool_and_clear_pending kernel/workqueue.c:644 [inline]
>  #0: 00000000c208aa7f ((wq_completion)"events"){+.+.}, at:
> process_one_work+0xaef/0x1b50 kernel/workqueue.c:2116
>  #1: 00000000fa87e61f (deferred_process_work){+.+.}, at:
> process_one_work+0xb46/0x1b50 kernel/workqueue.c:2120
>  #2: 000000007eb32c75 (rtnl_mutex){+.+.}, at: rtnl_lock+0x17/0x20
> net/core/rtnetlink.c:74
> 1 lock held by syz-executor7/29665:
>  #0: 000000006e20d618 (sk_lock-AF_INET6){+.+.}, at: lock_sock
> include/net/sock.h:1469 [inline]
>  #0: 000000006e20d618 (sk_lock-AF_INET6){+.+.}, at:
> tls_sw_sendmsg+0x1b9/0x12e0 net/tls/tls_sw.c:384
> 1 lock held by syz-executor7/29689:
>  #0: 000000007eb32c75 (rtnl_mutex){+.+.}, at: rtnl_lock+0x17/0x20
> net/core/rtnetlink.c:74
>
> =============================================
>
> NMI backtrace for cpu 0
> CPU: 0 PID: 893 Comm: khungtaskd Not tainted 4.17.0-rc6+ #68
> Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS
> Google 01/01/2011
> Call Trace:
>  __dump_stack lib/dump_stack.c:77 [inline]
>  dump_stack+0x1b9/0x294 lib/dump_stack.c:113
>  nmi_cpu_backtrace.cold.4+0x19/0xce lib/nmi_backtrace.c:103
>  nmi_trigger_cpumask_backtrace+0x151/0x192 lib/nmi_backtrace.c:62
>  arch_trigger_cpumask_backtrace+0x14/0x20 arch/x86/kernel/apic/hw_nmi.c:38
>  trigger_all_cpu_backtrace include/linux/nmi.h:138 [inline]
>  check_hung_task kernel/hung_task.c:132 [inline]
>  check_hung_uninterruptible_tasks kernel/hung_task.c:190 [inline]
>  watchdog+0xc10/0xf60 kernel/hung_task.c:249
>  kthread+0x345/0x410 kernel/kthread.c:240
>  ret_from_fork+0x3a/0x50 arch/x86/entry/entry_64.S:412
> Sending NMI from CPU 0 to CPUs 1:
> NMI backtrace for cpu 1 skipped: idling at native_safe_halt+0x6/0x10
> arch/x86/include/asm/irqflags.h:54
>
>
> ---
> This bug is generated by a bot. It may contain errors.
> See https://goo.gl/tpsmEJ for more information about syzbot.
> syzbot engineers can be reached at syzkaller@googlegroups.com.
>
> syzbot will keep track of this bug report. See:
> https://goo.gl/tpsmEJ#bug-status-tracking for how to communicate with
> syzbot.
>
> --
> You received this message because you are subscribed to the Google Groups
> "syzkaller-bugs" group.
> To unsubscribe from this group and stop receiving emails from it, send an
> email to syzkaller-bugs+unsubscribe@googlegroups.com.
> To view this discussion on the web visit
> https://groups.google.com/d/msgid/syzkaller-bugs/0000000000006e4595056dd23c16%40google.com.
> For more options, visit https://groups.google.com/d/optout.

^ permalink raw reply

* [RFC PATCH 0/4] Add support for DCB and XPS with MACVLAN offload
From: Alexander Duyck @ 2018-06-04 15:17 UTC (permalink / raw)
  To: netdev, intel-wired-lan

This patch set is meant to address a number of shortcomings in the current
implementation of MACVLAN offload. The main issue being the fact that the
current Tx queue implementation for ixgbe required the use of
ndo_select_queue which doesn't necessarily play well with DCB or XPS.

I started this patch set to address that and thought I would submit the
start of the series as an RFC just to make sure I am headed in the right
direction and to see if there is anything I am overlooking. With these
patches applied we now start seeing what I am calling "subordinate
channels" which show themselves via a "-x" suffix where the "-x" can be
anything from "-1" to "-32767". So for example traffic class 0 with a
subordinate class of 5 will be called out as "0-5" if you dump the traffic
class via the sysfs value for the queue.

The main pieces that still need to be handled are the updating of the
ndo_select_queue function and the fallback call to allow support for
passing the accel_priv which I am changing to a netdev. I figure those
patches will be mostly noise so I didn't see the need to include them in
this RFC.

---

Alexander Duyck (4):
      net-sysfs: Drop support for XPS and traffic_class on single queue device
      net: Add support for subordinate device traffic classes
      ixgbe: Add code to populate and use macvlan tc to Tx queue map
      net: Add support for subordinate traffic classes to netdev_pick_tx


 drivers/net/ethernet/intel/ixgbe/ixgbe_main.c |   55 +++++++---
 drivers/net/macvlan.c                         |   10 --
 include/linux/netdevice.h                     |   20 +++
 net/core/dev.c                                |  144 +++++++++++++++++++++----
 net/core/net-sysfs.c                          |   36 ++++++
 5 files changed, 215 insertions(+), 50 deletions(-)

^ permalink raw reply

* [RFC PATCH 1/4] net-sysfs: Drop support for XPS and traffic_class on single queue device
From: Alexander Duyck @ 2018-06-04 15:17 UTC (permalink / raw)
  To: netdev, intel-wired-lan
In-Reply-To: <20180604150847.74754.8585.stgit@ahduyck-green-test.jf.intel.com>

This patch makes it so that we do not report the traffic class or allow XPS
configuration on single queue devices. This is mostly to avoid unnecessary
complexity with changes I have planned that will allow us to reuse
the unused tc_to_txq and XPS configuration on a single queue device to
allow it to make use of a subset of queues on an underlying device.

Signed-off-by: Alexander Duyck <alexander.h.duyck@intel.com>
---
 net/core/net-sysfs.c |   15 +++++++++++++--
 1 file changed, 13 insertions(+), 2 deletions(-)

diff --git a/net/core/net-sysfs.c b/net/core/net-sysfs.c
index bb7e80f..335c6a4 100644
--- a/net/core/net-sysfs.c
+++ b/net/core/net-sysfs.c
@@ -1047,9 +1047,14 @@ static ssize_t traffic_class_show(struct netdev_queue *queue,
 				  char *buf)
 {
 	struct net_device *dev = queue->dev;
-	int index = get_netdev_queue_index(queue);
-	int tc = netdev_txq_to_tc(dev, index);
+	int index;
+	int tc;
 
+	if (!netif_is_multiqueue(dev))
+		return -ENOENT;
+
+	index = get_netdev_queue_index(queue);
+	tc = netdev_txq_to_tc(dev, index);
 	if (tc < 0)
 		return -EINVAL;
 
@@ -1214,6 +1219,9 @@ static ssize_t xps_cpus_show(struct netdev_queue *queue,
 	cpumask_var_t mask;
 	unsigned long index;
 
+	if (!netif_is_multiqueue(dev))
+		return -ENOENT;
+
 	index = get_netdev_queue_index(queue);
 
 	if (dev->num_tc) {
@@ -1260,6 +1268,9 @@ static ssize_t xps_cpus_store(struct netdev_queue *queue,
 	cpumask_var_t mask;
 	int err;
 
+	if (!netif_is_multiqueue(dev))
+		return -ENOENT;
+
 	if (!capable(CAP_NET_ADMIN))
 		return -EPERM;
 

^ permalink raw reply related

* [RFC PATCH 2/4] net: Add support for subordinate device traffic classes
From: Alexander Duyck @ 2018-06-04 15:17 UTC (permalink / raw)
  To: netdev, intel-wired-lan
In-Reply-To: <20180604150847.74754.8585.stgit@ahduyck-green-test.jf.intel.com>

This patch is meant to provide the basic tools needed to allow us to create
subordinate device traffic classes. The general idea here is to allow
subdividing the queues of a device into queue groups accessible through an
upper device such as a macvlan.

The idea here is to enforce the idea that an upper device has to be a
single queue device, ideally with IFF_NO_QUQUE set. With that being the
case we can pretty much guarantee that the tc_to_txq mappings and XPS maps
for the upper device are unused. As such we could reuse those in order to
support subdividing the lower device and distributing those queues between
the subordinate devices.

In order to distinguish between a regular set of traffic classes and if a
device is carrying subordinate traffic classes I changed num_tc from a u8
to a s16 value and use the negative values to represent the suboordinate
pool values. So starting at -1 and running to -32768 we can encode those as
pool values, and the existing values of 0 to 15 can be maintained.

Signed-off-by: Alexander Duyck <alexander.h.duyck@intel.com>
---
 include/linux/netdevice.h |   16 ++++++++
 net/core/dev.c            |   89 +++++++++++++++++++++++++++++++++++++++++++++
 net/core/net-sysfs.c      |   21 ++++++++++-
 3 files changed, 124 insertions(+), 2 deletions(-)

diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
index 03ed492..bbc8045 100644
--- a/include/linux/netdevice.h
+++ b/include/linux/netdevice.h
@@ -569,6 +569,9 @@ struct netdev_queue {
 	 * (/sys/class/net/DEV/Q/trans_timeout)
 	 */
 	unsigned long		trans_timeout;
+
+	/* Suboordinate device that the queue has been assigned to */
+	struct net_device	*sb_dev;
 /*
  * write-mostly part
  */
@@ -1960,7 +1963,7 @@ struct net_device {
 #ifdef CONFIG_DCB
 	const struct dcbnl_rtnl_ops *dcbnl_ops;
 #endif
-	u8			num_tc;
+	s16			num_tc;
 	struct netdev_tc_txq	tc_to_txq[TC_MAX_QUEUE];
 	u8			prio_tc_map[TC_BITMASK + 1];
 
@@ -2014,6 +2017,17 @@ int netdev_get_num_tc(struct net_device *dev)
 	return dev->num_tc;
 }
 
+void netdev_unbind_sb_channel(struct net_device *dev,
+			      struct net_device *sb_dev);
+int netdev_bind_sb_channel_queue(struct net_device *dev,
+				 struct net_device *sb_dev,
+				 u8 tc, u16 count, u16 offset);
+int netdev_set_sb_channel(struct net_device *dev, u16 channel);
+static inline int netdev_get_sb_channel(struct net_device *dev)
+{
+	return max_t(int, -dev->num_tc, 0);
+}
+
 static inline
 struct netdev_queue *netdev_get_tx_queue(const struct net_device *dev,
 					 unsigned int index)
diff --git a/net/core/dev.c b/net/core/dev.c
index 034c2a9..eba255d 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -2068,11 +2068,13 @@ int netdev_txq_to_tc(struct net_device *dev, unsigned int txq)
 		struct netdev_tc_txq *tc = &dev->tc_to_txq[0];
 		int i;
 
+		/* walk through the TCs and see if it falls into any of them */
 		for (i = 0; i < TC_MAX_QUEUE; i++, tc++) {
 			if ((txq - tc->offset) < tc->count)
 				return i;
 		}
 
+		/* didn't find it, just return -1 to indicate no match */
 		return -1;
 	}
 
@@ -2215,7 +2217,14 @@ int netif_set_xps_queue(struct net_device *dev, const struct cpumask *mask,
 	bool active = false;
 
 	if (dev->num_tc) {
+		/* Do not allow XPS on subordinate device directly */
 		num_tc = dev->num_tc;
+		if (num_tc < 0)
+			return -EINVAL;
+
+		/* If queue belongs to subordinate dev use its map */
+		dev = netdev_get_tx_queue(dev, index)->sb_dev ? : dev;
+
 		tc = netdev_txq_to_tc(dev, index);
 		if (tc < 0)
 			return -EINVAL;
@@ -2366,11 +2375,25 @@ int netif_set_xps_queue(struct net_device *dev, const struct cpumask *mask,
 EXPORT_SYMBOL(netif_set_xps_queue);
 
 #endif
+static void netdev_unbind_all_sb_channels(struct net_device *dev)
+{
+	struct netdev_queue *txq = &dev->_tx[dev->num_tx_queues];
+
+	/* Unbind any subordinate channels */
+	while (txq-- != &dev->_tx[0]) {
+		if (txq->sb_dev)
+			netdev_unbind_sb_channel(dev, txq->sb_dev);
+	}
+}
+
 void netdev_reset_tc(struct net_device *dev)
 {
 #ifdef CONFIG_XPS
 	netif_reset_xps_queues_gt(dev, 0);
 #endif
+	netdev_unbind_all_sb_channels(dev);
+
+	/* Reset TC configuration of device */
 	dev->num_tc = 0;
 	memset(dev->tc_to_txq, 0, sizeof(dev->tc_to_txq));
 	memset(dev->prio_tc_map, 0, sizeof(dev->prio_tc_map));
@@ -2399,11 +2422,77 @@ int netdev_set_num_tc(struct net_device *dev, u8 num_tc)
 #ifdef CONFIG_XPS
 	netif_reset_xps_queues_gt(dev, 0);
 #endif
+	netdev_unbind_all_sb_channels(dev);
+
 	dev->num_tc = num_tc;
 	return 0;
 }
 EXPORT_SYMBOL(netdev_set_num_tc);
 
+void netdev_unbind_sb_channel(struct net_device *dev,
+			      struct net_device *sb_dev)
+{
+	struct netdev_queue *txq = &dev->_tx[dev->num_tx_queues];
+
+#ifdef CONFIG_XPS
+	netif_reset_xps_queues_gt(sb_dev, 0);
+#endif
+	memset(sb_dev->tc_to_txq, 0, sizeof(sb_dev->tc_to_txq));
+	memset(sb_dev->prio_tc_map, 0, sizeof(sb_dev->prio_tc_map));
+
+	while (txq-- != &dev->_tx[0]) {
+		if (txq->sb_dev == sb_dev)
+			txq->sb_dev = NULL;
+	}
+}
+EXPORT_SYMBOL(netdev_unbind_sb_channel);
+
+int netdev_bind_sb_channel_queue(struct net_device *dev,
+				 struct net_device *sb_dev,
+				 u8 tc, u16 count, u16 offset)
+{
+	/* Make certain the sb_dev and dev are already configured */
+	if (sb_dev->num_tc >= 0 || tc >= dev->num_tc)
+		return -EINVAL;
+
+	/* We cannot hand out queues we don't have */
+	if ((offset + count) > dev->real_num_tx_queues)
+		return -EINVAL;
+
+	/* Record the mapping */
+	sb_dev->tc_to_txq[tc].count = count;
+	sb_dev->tc_to_txq[tc].offset = offset;
+
+	/* Provide a way for Tx queue to find the tc_to_txq map or
+	 * XPS map for itself.
+	 */
+	while (count--)
+		netdev_get_tx_queue(dev, count + offset)->sb_dev = sb_dev;
+
+	return 0;
+}
+EXPORT_SYMBOL(netdev_bind_sb_channel_queue);
+
+int netdev_set_sb_channel(struct net_device *dev, u16 channel)
+{
+	/* Do not use a multiqueue device to represent a subordinate channel */
+	if (netif_is_multiqueue(dev))
+		return -ENODEV;
+
+	/* We allow channels 1 - 32767 to be used for subordinate channels.
+	 * Channel 0 is meant to be "native" mode and used only to represent
+	 * the main root device. We allow writing 0 to reset the device back
+	 * to normal mode after being used as a subordinate channel.
+	 */
+	if (channel > S16_MAX)
+		return -EINVAL;
+
+	dev->num_tc = -channel;
+
+	return 0;
+}
+EXPORT_SYMBOL(netdev_set_sb_channel);
+
 /*
  * Routine to help set real_num_tx_queues. To avoid skbs mapped to queues
  * greater than real_num_tx_queues stale skbs on the qdisc must be flushed.
diff --git a/net/core/net-sysfs.c b/net/core/net-sysfs.c
index 335c6a4..bd067b1 100644
--- a/net/core/net-sysfs.c
+++ b/net/core/net-sysfs.c
@@ -1054,11 +1054,23 @@ static ssize_t traffic_class_show(struct netdev_queue *queue,
 		return -ENOENT;
 
 	index = get_netdev_queue_index(queue);
+
+	/* If queue belongs to subordinate dev use its tc mapping */
+	dev = netdev_get_tx_queue(dev, index)->sb_dev ? : dev;
+
 	tc = netdev_txq_to_tc(dev, index);
 	if (tc < 0)
 		return -EINVAL;
 
-	return sprintf(buf, "%u\n", tc);
+	/* We can report the traffic class one of two ways:
+	 * Subordinate device traffic classes are reported with the traffic
+	 * class first, and then the subordinate class so for example TC0 on
+	 * subordinate device 2 will be reported as "0-2". If the queue
+	 * belongs to the root device it will be reported with just the
+	 * traffic class, so just "0" for TC 0 for example.
+	 */
+	return dev->num_tc < 0 ? sprintf(buf, "%u%d\n", tc, dev->num_tc) :
+				 sprintf(buf, "%u\n", tc);
 }
 
 #ifdef CONFIG_XPS
@@ -1225,7 +1237,14 @@ static ssize_t xps_cpus_show(struct netdev_queue *queue,
 	index = get_netdev_queue_index(queue);
 
 	if (dev->num_tc) {
+		/* Do not allow XPS on subordinate device directly */
 		num_tc = dev->num_tc;
+		if (num_tc < 0)
+			return -EINVAL;
+
+		/* If queue belongs to subordinate dev use its map */
+		dev = netdev_get_tx_queue(dev, index)->sb_dev ? : dev;
+
 		tc = netdev_txq_to_tc(dev, index);
 		if (tc < 0)
 			return -EINVAL;

^ permalink raw reply related

* [RFC PATCH 3/4] ixgbe: Add code to populate and use macvlan tc to Tx queue map
From: Alexander Duyck @ 2018-06-04 15:17 UTC (permalink / raw)
  To: netdev, intel-wired-lan
In-Reply-To: <20180604150847.74754.8585.stgit@ahduyck-green-test.jf.intel.com>

This patch makes it so that we use the tc_to_txq mapping in the macvlan
device in order to select the Tx queue for outgoing packets.

The idea here is to try and move away from using ixgbe_select_queue and to
come up with a generic way to make this work for devices going forward. By
encoding this information in the netdev this can become something that can
be used generically as a solution for similar setups going forward.

Signed-off-by: Alexander Duyck <alexander.h.duyck@intel.com>
---
 drivers/net/ethernet/intel/ixgbe/ixgbe_main.c |   44 ++++++++++++++++++++++---
 1 file changed, 38 insertions(+), 6 deletions(-)

diff --git a/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c b/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c
index f460c16..05d6d48 100644
--- a/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c
+++ b/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c
@@ -5271,6 +5271,8 @@ static void ixgbe_clean_rx_ring(struct ixgbe_ring *rx_ring)
 static int ixgbe_fwd_ring_up(struct ixgbe_adapter *adapter,
 			     struct ixgbe_fwd_adapter *accel)
 {
+	u16 rss_i = adapter->ring_feature[RING_F_RSS].indices;
+	int num_tc = netdev_get_num_tc(adapter->netdev);
 	struct net_device *vdev = accel->netdev;
 	int i, baseq, err;
 
@@ -5282,6 +5284,11 @@ static int ixgbe_fwd_ring_up(struct ixgbe_adapter *adapter,
 	accel->rx_base_queue = baseq;
 	accel->tx_base_queue = baseq;
 
+	/* record configuration for macvlan interface in vdev */
+	for (i = 0; i < num_tc; i++)
+		netdev_bind_sb_channel_queue(adapter->netdev, vdev,
+					     i, rss_i, baseq + (rss_i * i));
+
 	for (i = 0; i < adapter->num_rx_queues_per_pool; i++)
 		adapter->rx_ring[baseq + i]->netdev = vdev;
 
@@ -5306,6 +5313,10 @@ static int ixgbe_fwd_ring_up(struct ixgbe_adapter *adapter,
 
 	netdev_err(vdev, "L2FW offload disabled due to L2 filter error\n");
 
+	/* unbind the queues and drop the subordinate channel config */
+	netdev_unbind_sb_channel(adapter->netdev, vdev);
+	netdev_set_sb_channel(vdev, 0);
+
 	clear_bit(accel->pool, adapter->fwd_bitmask);
 	kfree(accel);
 
@@ -8211,18 +8222,22 @@ static u16 ixgbe_select_queue(struct net_device *dev, struct sk_buff *skb,
 			      void *accel_priv, select_queue_fallback_t fallback)
 {
 	struct ixgbe_fwd_adapter *fwd_adapter = accel_priv;
-	struct ixgbe_adapter *adapter;
-	int txq;
 #ifdef IXGBE_FCOE
+	struct ixgbe_adapter *adapter;
 	struct ixgbe_ring_feature *f;
 #endif
+	int txq;
 
 	if (fwd_adapter) {
-		adapter = netdev_priv(dev);
-		txq = reciprocal_scale(skb_get_hash(skb),
-				       adapter->num_rx_queues_per_pool);
+		u8 tc = netdev_get_num_tc(dev) ?
+			netdev_get_prio_tc_map(dev, skb->priority) : 0;
+		struct net_device *vdev = fwd_adapter->netdev;
+
+		txq = vdev->tc_to_txq[tc].offset;
+		txq += reciprocal_scale(skb_get_hash(skb),
+					vdev->tc_to_txq[tc].count);
 
-		return txq + fwd_adapter->tx_base_queue;
+		return txq;
 	}
 
 #ifdef IXGBE_FCOE
@@ -8776,6 +8791,11 @@ static int ixgbe_reassign_macvlan_pool(struct net_device *vdev, void *data)
 	/* if we cannot find a free pool then disable the offload */
 	netdev_err(vdev, "L2FW offload disabled due to lack of queue resources\n");
 	macvlan_release_l2fw_offload(vdev);
+
+	/* unbind the queues and drop the subordinate channel config */
+	netdev_unbind_sb_channel(adapter->netdev, vdev);
+	netdev_set_sb_channel(vdev, 0);
+
 	kfree(accel);
 
 	return 0;
@@ -9780,6 +9800,13 @@ static void *ixgbe_fwd_add(struct net_device *pdev, struct net_device *vdev)
 	if (!macvlan_supports_dest_filter(vdev))
 		return ERR_PTR(-EMEDIUMTYPE);
 
+	/* We need to lock down the macvlan to be a single queue device so that
+	 * we can reuse the tc_to_txq field in the macvlan netdev to represent
+	 * the queue mapping to our netdev.
+	 */
+	if (netif_is_multiqueue(vdev))
+		return ERR_PTR(-ERANGE);
+
 	pool = find_first_zero_bit(adapter->fwd_bitmask, adapter->num_rx_pools);
 	if (pool == adapter->num_rx_pools) {
 		u16 used_pools = adapter->num_vfs + adapter->num_rx_pools;
@@ -9836,6 +9863,7 @@ static void *ixgbe_fwd_add(struct net_device *pdev, struct net_device *vdev)
 		return ERR_PTR(-ENOMEM);
 
 	set_bit(pool, adapter->fwd_bitmask);
+	netdev_set_sb_channel(vdev, pool);
 	accel->pool = pool;
 	accel->netdev = vdev;
 
@@ -9877,6 +9905,10 @@ static void ixgbe_fwd_del(struct net_device *pdev, void *priv)
 		ring->netdev = NULL;
 	}
 
+	/* unbind the queues and drop the subordinate channel config */
+	netdev_unbind_sb_channel(pdev, accel->netdev);
+	netdev_set_sb_channel(accel->netdev, 0);
+
 	clear_bit(accel->pool, adapter->fwd_bitmask);
 	kfree(accel);
 }

^ permalink raw reply related

* [RFC PATCH 4/4] net: Add support for subordinate traffic classes to netdev_pick_tx
From: Alexander Duyck @ 2018-06-04 15:17 UTC (permalink / raw)
  To: netdev, intel-wired-lan
In-Reply-To: <20180604150847.74754.8585.stgit@ahduyck-green-test.jf.intel.com>

This change makes it so that we can support the concept of subordinate
device traffic classes to the core networking code. In doing this we can
start pulling out the driver specific bits needed to support selecting a
queue based on an upper device.

The solution at is currently stands is only partially implemented. I have
the start of some XPS bits in here, but I would still need to allow for
configuration of the XPS maps on the queues reserved for the subordinate
devices. For now I am using the reference to the sb_dev XPS map as just a
way to skip the lookup of the lower device XPS map for now as that would
result in the wrong queue being picked.

Signed-off-by: Alexander Duyck <alexander.h.duyck@intel.com>
---
 drivers/net/ethernet/intel/ixgbe/ixgbe_main.c |   15 ++-----
 drivers/net/macvlan.c                         |   10 +----
 include/linux/netdevice.h                     |    4 +-
 net/core/dev.c                                |   55 +++++++++++++++----------
 4 files changed, 42 insertions(+), 42 deletions(-)

diff --git a/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c b/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c
index 05d6d48..c42498d 100644
--- a/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c
+++ b/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c
@@ -8218,20 +8218,18 @@ static void ixgbe_atr(struct ixgbe_ring *ring,
 					      input, common, ring->queue_index);
 }
 
+#ifdef IXGBE_FCOE
 static u16 ixgbe_select_queue(struct net_device *dev, struct sk_buff *skb,
 			      void *accel_priv, select_queue_fallback_t fallback)
 {
-	struct ixgbe_fwd_adapter *fwd_adapter = accel_priv;
-#ifdef IXGBE_FCOE
 	struct ixgbe_adapter *adapter;
 	struct ixgbe_ring_feature *f;
-#endif
 	int txq;
 
-	if (fwd_adapter) {
+	if (accel_priv) {
 		u8 tc = netdev_get_num_tc(dev) ?
 			netdev_get_prio_tc_map(dev, skb->priority) : 0;
-		struct net_device *vdev = fwd_adapter->netdev;
+		struct net_device *vdev = accel_priv;
 
 		txq = vdev->tc_to_txq[tc].offset;
 		txq += reciprocal_scale(skb_get_hash(skb),
@@ -8240,7 +8238,6 @@ static u16 ixgbe_select_queue(struct net_device *dev, struct sk_buff *skb,
 		return txq;
 	}
 
-#ifdef IXGBE_FCOE
 
 	/*
 	 * only execute the code below if protocol is FCoE
@@ -8267,11 +8264,9 @@ static u16 ixgbe_select_queue(struct net_device *dev, struct sk_buff *skb,
 		txq -= f->indices;
 
 	return txq + f->offset;
-#else
-	return fallback(dev, skb);
-#endif
 }
 
+#endif
 static int ixgbe_xmit_xdp_ring(struct ixgbe_adapter *adapter,
 			       struct xdp_frame *xdpf)
 {
@@ -10071,7 +10066,6 @@ static void ixgbe_xdp_flush(struct net_device *dev)
 	.ndo_open		= ixgbe_open,
 	.ndo_stop		= ixgbe_close,
 	.ndo_start_xmit		= ixgbe_xmit_frame,
-	.ndo_select_queue	= ixgbe_select_queue,
 	.ndo_set_rx_mode	= ixgbe_set_rx_mode,
 	.ndo_validate_addr	= eth_validate_addr,
 	.ndo_set_mac_address	= ixgbe_set_mac,
@@ -10094,6 +10088,7 @@ static void ixgbe_xdp_flush(struct net_device *dev)
 	.ndo_poll_controller	= ixgbe_netpoll,
 #endif
 #ifdef IXGBE_FCOE
+	.ndo_select_queue	= ixgbe_select_queue,
 	.ndo_fcoe_ddp_setup = ixgbe_fcoe_ddp_get,
 	.ndo_fcoe_ddp_target = ixgbe_fcoe_ddp_target,
 	.ndo_fcoe_ddp_done = ixgbe_fcoe_ddp_put,
diff --git a/drivers/net/macvlan.c b/drivers/net/macvlan.c
index adde8fc..401e1d1 100644
--- a/drivers/net/macvlan.c
+++ b/drivers/net/macvlan.c
@@ -514,7 +514,6 @@ static int macvlan_queue_xmit(struct sk_buff *skb, struct net_device *dev)
 	const struct macvlan_dev *vlan = netdev_priv(dev);
 	const struct macvlan_port *port = vlan->port;
 	const struct macvlan_dev *dest;
-	void *accel_priv = NULL;
 
 	if (vlan->mode == MACVLAN_MODE_BRIDGE) {
 		const struct ethhdr *eth = (void *)skb->data;
@@ -533,15 +532,10 @@ static int macvlan_queue_xmit(struct sk_buff *skb, struct net_device *dev)
 			return NET_XMIT_SUCCESS;
 		}
 	}
-
-	/* For packets that are non-multicast and not bridged we will pass
-	 * the necessary information so that the lowerdev can distinguish
-	 * the source of the packets via the accel_priv value.
-	 */
-	accel_priv = vlan->accel_priv;
 xmit_world:
 	skb->dev = vlan->lowerdev;
-	return dev_queue_xmit_accel(skb, accel_priv);
+	return dev_queue_xmit_accel(skb,
+				    netdev_get_sb_channel(dev) ? dev : NULL);
 }
 
 static inline netdev_tx_t macvlan_netpoll_send_skb(struct macvlan_dev *vlan, struct sk_buff *skb)
diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
index bbc8045..df5a582 100644
--- a/include/linux/netdevice.h
+++ b/include/linux/netdevice.h
@@ -2072,7 +2072,7 @@ static inline void netdev_for_each_tx_queue(struct net_device *dev,
 
 struct netdev_queue *netdev_pick_tx(struct net_device *dev,
 				    struct sk_buff *skb,
-				    void *accel_priv);
+				    struct net_device *sb_dev);
 
 /* returns the headroom that the master device needs to take in account
  * when forwarding to this dev
@@ -2523,7 +2523,7 @@ struct net_device *__dev_get_by_flags(struct net *net, unsigned short flags,
 void dev_disable_lro(struct net_device *dev);
 int dev_loopback_xmit(struct net *net, struct sock *sk, struct sk_buff *newskb);
 int dev_queue_xmit(struct sk_buff *skb);
-int dev_queue_xmit_accel(struct sk_buff *skb, void *accel_priv);
+int dev_queue_xmit_accel(struct sk_buff *skb, struct net_device *sb_dev);
 int dev_direct_xmit(struct sk_buff *skb, u16 queue_id);
 int register_netdevice(struct net_device *dev);
 void unregister_netdevice_queue(struct net_device *dev, struct list_head *head);
diff --git a/net/core/dev.c b/net/core/dev.c
index eba255d..507e606 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -2704,24 +2704,26 @@ void netif_device_attach(struct net_device *dev)
  * Returns a Tx hash based on the given packet descriptor a Tx queues' number
  * to be used as a distribution range.
  */
-static u16 skb_tx_hash(const struct net_device *dev, struct sk_buff *skb)
+static u16 skb_tx_hash(const struct net_device *dev,
+		       const struct net_device *sb_dev,
+		       struct sk_buff *skb)
 {
 	u32 hash;
 	u16 qoffset = 0;
 	u16 qcount = dev->real_num_tx_queues;
 
+	if (dev->num_tc) {
+		u8 tc = netdev_get_prio_tc_map(dev, skb->priority);
+
+		qoffset = sb_dev->tc_to_txq[tc].offset;
+		qcount = sb_dev->tc_to_txq[tc].count;
+	}
+
 	if (skb_rx_queue_recorded(skb)) {
 		hash = skb_get_rx_queue(skb);
 		while (unlikely(hash >= qcount))
 			hash -= qcount;
-		return hash;
-	}
-
-	if (dev->num_tc) {
-		u8 tc = netdev_get_prio_tc_map(dev, skb->priority);
-
-		qoffset = dev->tc_to_txq[tc].offset;
-		qcount = dev->tc_to_txq[tc].count;
+		return hash + qoffset;
 	}
 
 	return (u16) reciprocal_scale(skb_get_hash(skb), qcount) + qoffset;
@@ -3483,7 +3485,9 @@ int dev_loopback_xmit(struct net *net, struct sock *sk, struct sk_buff *skb)
 }
 #endif /* CONFIG_NET_EGRESS */
 
-static inline int get_xps_queue(struct net_device *dev, struct sk_buff *skb)
+static inline int get_xps_queue(struct net_device *dev,
+				struct net_device *sb_dev,
+				struct sk_buff *skb)
 {
 #ifdef CONFIG_XPS
 	struct xps_dev_maps *dev_maps;
@@ -3491,7 +3495,7 @@ static inline int get_xps_queue(struct net_device *dev, struct sk_buff *skb)
 	int queue_index = -1;
 
 	rcu_read_lock();
-	dev_maps = rcu_dereference(dev->xps_maps);
+	dev_maps = rcu_dereference(sb_dev->xps_maps);
 	if (dev_maps) {
 		unsigned int tci = skb->sender_cpu - 1;
 
@@ -3519,17 +3523,18 @@ static inline int get_xps_queue(struct net_device *dev, struct sk_buff *skb)
 #endif
 }
 
-static u16 __netdev_pick_tx(struct net_device *dev, struct sk_buff *skb)
+static u16 ___netdev_pick_tx(struct net_device *dev, struct sk_buff *skb,
+			     struct net_device *sb_dev)
 {
 	struct sock *sk = skb->sk;
 	int queue_index = sk_tx_queue_get(sk);
 
 	if (queue_index < 0 || skb->ooo_okay ||
 	    queue_index >= dev->real_num_tx_queues) {
-		int new_index = get_xps_queue(dev, skb);
+		int new_index = get_xps_queue(dev, sb_dev, skb);
 
 		if (new_index < 0)
-			new_index = skb_tx_hash(dev, skb);
+			new_index = skb_tx_hash(dev, sb_dev, skb);
 
 		if (queue_index != new_index && sk &&
 		    sk_fullsock(sk) &&
@@ -3542,9 +3547,14 @@ static u16 __netdev_pick_tx(struct net_device *dev, struct sk_buff *skb)
 	return queue_index;
 }
 
+static u16 __netdev_pick_tx(struct net_device *dev,
+			    struct sk_buff *skb)
+{
+	return ___netdev_pick_tx(dev, skb, NULL);
+}
 struct netdev_queue *netdev_pick_tx(struct net_device *dev,
 				    struct sk_buff *skb,
-				    void *accel_priv)
+				    struct net_device *sb_dev)
 {
 	int queue_index = 0;
 
@@ -3559,10 +3569,11 @@ struct netdev_queue *netdev_pick_tx(struct net_device *dev,
 		const struct net_device_ops *ops = dev->netdev_ops;
 
 		if (ops->ndo_select_queue)
-			queue_index = ops->ndo_select_queue(dev, skb, accel_priv,
+			queue_index = ops->ndo_select_queue(dev, skb, sb_dev,
 							    __netdev_pick_tx);
 		else
-			queue_index = __netdev_pick_tx(dev, skb);
+			queue_index = ___netdev_pick_tx(dev, skb,
+							sb_dev ? : dev);
 
 		queue_index = netdev_cap_txqueue(dev, queue_index);
 	}
@@ -3574,7 +3585,7 @@ struct netdev_queue *netdev_pick_tx(struct net_device *dev,
 /**
  *	__dev_queue_xmit - transmit a buffer
  *	@skb: buffer to transmit
- *	@accel_priv: private data used for L2 forwarding offload
+ *	@sb_dev: suboordinate device used for L2 forwarding offload
  *
  *	Queue a buffer for transmission to a network device. The caller must
  *	have set the device and priority and built the buffer before calling
@@ -3597,7 +3608,7 @@ struct netdev_queue *netdev_pick_tx(struct net_device *dev,
  *      the BH enable code must have IRQs enabled so that it will not deadlock.
  *          --BLG
  */
-static int __dev_queue_xmit(struct sk_buff *skb, void *accel_priv)
+static int __dev_queue_xmit(struct sk_buff *skb, struct net_device *sb_dev)
 {
 	struct net_device *dev = skb->dev;
 	struct netdev_queue *txq;
@@ -3636,7 +3647,7 @@ static int __dev_queue_xmit(struct sk_buff *skb, void *accel_priv)
 	else
 		skb_dst_force(skb);
 
-	txq = netdev_pick_tx(dev, skb, accel_priv);
+	txq = netdev_pick_tx(dev, skb, sb_dev);
 	q = rcu_dereference_bh(txq->qdisc);
 
 	trace_net_dev_queue(skb);
@@ -3710,9 +3721,9 @@ int dev_queue_xmit(struct sk_buff *skb)
 }
 EXPORT_SYMBOL(dev_queue_xmit);
 
-int dev_queue_xmit_accel(struct sk_buff *skb, void *accel_priv)
+int dev_queue_xmit_accel(struct sk_buff *skb, struct net_device *sb_dev)
 {
-	return __dev_queue_xmit(skb, accel_priv);
+	return __dev_queue_xmit(skb, sb_dev);
 }
 EXPORT_SYMBOL(dev_queue_xmit_accel);
 

^ permalink raw reply related

* Re: [PATCH net-next v5 00/11] Modify action API for implementing lockless actions
From: David Miller @ 2018-06-04 15:28 UTC (permalink / raw)
  To: vladbu
  Cc: netdev, jhs, xiyou.wangcong, jiri, pablo, kadlec, fw, ast, daniel,
	edumazet, keescook, marcelo.leitner, kliteyn
In-Reply-To: <1527866118-21312-1-git-send-email-vladbu@mellanox.com>

From: Vlad Buslov <vladbu@mellanox.com>
Date: Fri,  1 Jun 2018 18:15:07 +0300

> Currently, all netlink protocol handlers for updating rules, actions and
> qdiscs are protected with single global rtnl lock which removes any
> possibility for parallelism. This patch set is a first step to remove
> rtnl lock dependency from TC rules update path.

Hello Vlad.

I've read over this patch series a few times, and even though I can't
find any issues yet, I think this series is complex and invasive
enough that it should receive more time for review by other experts.

Therefore I'm deferring this to the next merge window in order to
accomodate that.

Thank you.

^ permalink raw reply

* Re: [PATCH] net: aquantia: make function aq_fw2x_get_mac_permanent static
From: David Miller @ 2018-06-04 15:31 UTC (permalink / raw)
  To: colin.king; +Cc: igor.russkikh, netdev, kernel-janitors, linux-kernel
In-Reply-To: <20180601152834.17958-1-colin.king@canonical.com>

From: Colin King <colin.king@canonical.com>
Date: Fri,  1 Jun 2018 16:28:34 +0100

> From: Colin Ian King <colin.king@canonical.com>
> 
> The function aq_fw2x_get_mac_permanent is local to the source and does
> not need to be in global scope, so make it static.
> 
> Cleans up sparse warning:
> warning: symbol 'aq_fw2x_get_mac_permanent' was not declared. Should it
> be static?
> 
> Signed-off-by: Colin Ian King <colin.king@canonical.com>

Applied to net-next, thank you.

^ permalink raw reply

* Re: [PATCH net] net/packet: refine check for priv area size
From: David Miller @ 2018-06-04 15:32 UTC (permalink / raw)
  To: edumazet; +Cc: netdev, eric.dumazet
In-Reply-To: <20180601162302.32717-1-edumazet@google.com>

From: Eric Dumazet <edumazet@google.com>
Date: Fri,  1 Jun 2018 09:23:02 -0700

> syzbot was able to trick af_packet again [1]

:-(

> Various commits tried to address the problem in the past,
> but failed to take into account V3 header size.
> 
> [1]
 ...
> Fixes: 2b6867c2ce76 ("net/packet: fix overflow in check for priv area size")
> Fixes: dc808110bb62 ("packet: handle too big packets for PACKET_V3")
> Fixes: f6fb8f100b80 ("af-packet: TPACKET_V3 flexible buffer implementation.")
> Signed-off-by: Eric Dumazet <edumazet@google.com>
> Reported-by: syzbot <syzkaller@googlegroups.com>

Applied and queued up for -stable, thanks Eric.

^ permalink raw reply

* Re: [PATCH] dt-bindings: net: ravb: Add support for r8a77990 SoC
From: Rob Herring @ 2018-06-04 15:32 UTC (permalink / raw)
  To: Sergei Shtylyov, Shimoda, Yoshihiro
  Cc: Simon Horman, David Miller, netdev,
	open list:MEDIA DRIVERS FOR RENESAS - FCP, Mark Rutland,
	devicetree
In-Reply-To: <19e43b97-9235-a752-7cf9-56a5e3bb281a@cogentembedded.com>

On Tue, May 15, 2018 at 3:43 AM, Sergei Shtylyov
<sergei.shtylyov@cogentembedded.com> wrote:
> On 5/13/2018 10:58 AM, Simon Horman wrote:
>
>>>> Add documentation for r8a77990 compatible string to renesas ravb device
>>>> tree bindings documentation.
>>>>
>>>> Signed-off-by: Yoshihiro Shimoda <yoshihiro.shimoda.uh@renesas.com>
>>>
>>>
>>> I'm assuming this isn't targetted at one of my trees.  Just FYI.
>>
>>
>> Hi Dave,
>>
>> I think this is appropriate for net-next but if not I can take it.
>
>
>    There's also Rob, and IIRC he has taken alike patch recently...

Applied.

Rob

^ permalink raw reply

* [PATCH net-next] net: sched: return error code when tcf proto is not found
From: Vlad Buslov @ 2018-06-04 15:32 UTC (permalink / raw)
  To: netdev; +Cc: davem, jhs, xiyou.wangcong, jiri, dan.carpenter, Vlad Buslov

If requested tcf proto is not found, get and del filter netlink protocol
handlers output error message to extack, but do not return actual error
code. Add check to return ENOENT when result of tp find function is NULL
pointer.

Fixes: c431f89b18a2 ("net: sched: split tc_ctl_tfilter into three
handlers")
Reported-by: Dan Carpenter <dan.carpenter@oracle.com>
Signed-off-by: Vlad Buslov <vladbu@mellanox.com>

---
 net/sched/cls_api.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/net/sched/cls_api.c b/net/sched/cls_api.c
index c06585fb2dc6..cdc3c87c53e6 100644
--- a/net/sched/cls_api.c
+++ b/net/sched/cls_api.c
@@ -1274,7 +1274,7 @@ static int tc_del_tfilter(struct sk_buff *skb, struct nlmsghdr *n,
 			       prio, false);
 	if (!tp || IS_ERR(tp)) {
 		NL_SET_ERR_MSG(extack, "Filter with specified priority/protocol not found");
-		err = PTR_ERR(tp);
+		err = tp ? PTR_ERR(tp) : -ENOENT;
 		goto errout;
 	} else if (tca[TCA_KIND] && nla_strcmp(tca[TCA_KIND], tp->ops->kind)) {
 		NL_SET_ERR_MSG(extack, "Specified filter kind does not match existing one");
@@ -1374,7 +1374,7 @@ static int tc_get_tfilter(struct sk_buff *skb, struct nlmsghdr *n,
 			       prio, false);
 	if (!tp || IS_ERR(tp)) {
 		NL_SET_ERR_MSG(extack, "Filter with specified priority/protocol not found");
-		err = PTR_ERR(tp);
+		err = tp ? PTR_ERR(tp) : -ENOENT;
 		goto errout;
 	} else if (tca[TCA_KIND] && nla_strcmp(tca[TCA_KIND], tp->ops->kind)) {
 		NL_SET_ERR_MSG(extack, "Specified filter kind does not match existing one");
-- 
2.7.5

^ permalink raw reply related

* Re: [PATCH net] net: hns: Fix the process of adding broadcast addresses to tcam
From: David Miller @ 2018-06-04 15:35 UTC (permalink / raw)
  To: salil.mehta
  Cc: yisen.zhuang, lipeng321, mehta.salil, netdev, linux-kernel,
	linuxarm, wangxi11
In-Reply-To: <20180601170011.22700-1-salil.mehta@huawei.com>

From: Salil Mehta <salil.mehta@huawei.com>
Date: Fri, 1 Jun 2018 18:00:11 +0100

> +		hns_dsaf_setup_mc_mask(dsaf_dev, mac_entry->in_port_num,
> +			mc_mask, mac_entry->addr);

Please indent things properly when arguments span multiple lines.

The second line here should be precisely after the opennning parenthesis
of the previous line.

> +		hns_dsaf_setup_mc_mask(dsaf_dev, mac_entry->in_port_num,
> +			mc_mask, mac_entry->addr);

Likewise.

^ permalink raw reply

* Re: [RFC PATCH net-next] net: mvpp2: mvpp2_percpu_read_relaxed() can be static
From: David Miller @ 2018-06-04 15:36 UTC (permalink / raw)
  To: fengguang.wu
  Cc: maxime.chevallier, kbuild-all, netdev, antoine.tenart,
	thomas.petazzoni, ymarkman, stefanc, linux-kernel
In-Reply-To: <20180601194613.GA44770@xian>

From: kbuild test robot <fengguang.wu@intel.com>
Date: Sat, 2 Jun 2018 03:46:13 +0800

> 
> Fixes: db9d7d36eecc ("net: mvpp2: Split the PPv2 driver to a dedicated directory")
> Signed-off-by: kbuild test robot <fengguang.wu@intel.com>

Applied to net-next, thank you.

^ permalink raw reply

* RE: [PATCH net] net: hns: Fix the process of adding broadcast addresses to tcam
From: Salil Mehta @ 2018-06-04 15:39 UTC (permalink / raw)
  To: David Miller
  Cc: Zhuangyuzeng (Yisen), lipeng (Y), mehta.salil@opnsrc.net,
	netdev@vger.kernel.org, linux-kernel@vger.kernel.org, Linuxarm,
	wangxi (M)
In-Reply-To: <20180604.113552.1136710379930167373.davem@davemloft.net>

Hi Dave,
Thanks for pointing out. Will fix them.

Best regards
Salil

> -----Original Message-----
> From: David Miller [mailto:davem@davemloft.net]
> Sent: Monday, June 04, 2018 4:36 PM
> To: Salil Mehta <salil.mehta@huawei.com>
> Cc: Zhuangyuzeng (Yisen) <yisen.zhuang@huawei.com>; lipeng (Y)
> <lipeng321@huawei.com>; mehta.salil@opnsrc.net; netdev@vger.kernel.org;
> linux-kernel@vger.kernel.org; Linuxarm <linuxarm@huawei.com>; wangxi
> (M) <wangxi11@huawei.com>
> Subject: Re: [PATCH net] net: hns: Fix the process of adding broadcast
> addresses to tcam
> 
> From: Salil Mehta <salil.mehta@huawei.com>
> Date: Fri, 1 Jun 2018 18:00:11 +0100
> 
> > +		hns_dsaf_setup_mc_mask(dsaf_dev, mac_entry->in_port_num,
> > +			mc_mask, mac_entry->addr);
> 
> Please indent things properly when arguments span multiple lines.
> 
> The second line here should be precisely after the opennning
> parenthesis
> of the previous line.
> 
> > +		hns_dsaf_setup_mc_mask(dsaf_dev, mac_entry->in_port_num,
> > +			mc_mask, mac_entry->addr);
> 
> Likewise.

^ permalink raw reply

* Re: [RFC feedback] AF_XDP and non-Intel hardware
From: Björn Töpel @ 2018-06-04 15:43 UTC (permalink / raw)
  To: MykytaI Iziumtsev
  Cc: Netdev, Björn Töpel, Karlsson, Magnus, Zhang, Qi Z,
	Francois Ozog, Ilias Apalodimas, Brian Brooks,
	Jesper Dangaard Brouer, Andy Gospodarek, michael.chan,
	Luke Gorrie
In-Reply-To: <CAPvmOuEeag+R9sa_yGRy2ZVpmcofoa7Ey6y1To3qi=QV4BV6ww@mail.gmail.com>

Den mån 4 juni 2018 kl 11:44 skrev Mykyta Iziumtsev
<mykyta.iziumtsev@linaro.org>:
>
> Hi Björn and Magnus,
>
> Here is a short update on the proposed changes to AF_XDP.
>
> During implementation phase I figured out that 'end-of-page' in RX
> descriptor's flags won't work, unfortunately. This is because the
> kernel driver can't know if the frame will be last on the page when RX
> descriptor is being filled in. Imagine that the driver is processing a
> frame on a page, and there is still 1000 bytes of non utilized memory
> beyond this frame. If the next frame (which haven't yet arrived) is
> less than that -- the NIC can place it on the same page, otherwise the
> NIC will skip remaining 1000 bytes and take next page. Of course, the
> driver can update RX descriptor retrospectively, or use 'empty' RX
> descriptors just to indicate page completion, but it's too complex or
> inefficient and as such doesn't look good.
>

I think both "update RX descriptor retrospectively" and "zero-length
end-of-chunk descriptor" are OK (as long as the zero-length descs much
less common that the non-zero one). Out of curiosity, does Chelsio (or
any other type-writer based) NIC has a public data sheet? It would be
interesting to see how this is solved in practice.

> What I could propose instead is to make the 'filling' and 'page
> completion' symmetric. That is, UMEM will have 1 queue to pass pages
> from application to the kernel driver (fring in your terminology) and
> one to indicate complete pages back to the application.
>
> That 'page completion' queue can be added as a third queue to UMEM, or
> alternatively TX completions can be decoupled from UMEM to make it an
> simple 'memory area' object with two directional queues which work
> with RX pages only. TX completions can be handled in a simpler manner
> as part of TX queues: just make sure consumer index is advanced only
> when the frame is finally sent out. The TX completion processing can
> be then done by observing consume index advance from application. That
> would  not only be more in line with best NIC ring design practices
> but will improve cache locality and eliminate critical section  if TX
> queues are assigned to different CPUs.
>
...and this is also a valid option. Adding an additional umem ring for
end-of-chunk completions, might make sense, but let's the zero-copy
implementations drive that requirement.

As you probably noted Magnus and I just sent out a patch set with an
updated descriptor layout. A next step would be adding support for
type-writer styled NICs.

Thanks for the information, and hopefully we'll see an additional
AF_XDP zero-copy implementation on list (nudge, nudge). ;-)

Björn

> With best regards,
> Mykyta
>
> On 22 May 2018 at 14:09, Björn Töpel <bjorn.topel@gmail.com> wrote:
> > 2018-05-22 9:45 GMT+02:00 Mykyta Iziumtsev <mykyta.iziumtsev@linaro.org>:
> >> On 21 May 2018 at 20:55, Björn Töpel <bjorn.topel@gmail.com> wrote:
> >>> 2018-05-21 14:34 GMT+02:00 Mykyta Iziumtsev <mykyta.iziumtsev@linaro.org>:
> >>>> Hi Björn and Magnus,
> >>>>
> >>>> (This thread is a follow up to private dialogue. The intention is to
> >>>> let community know that AF_XDP can be enhanced further to make it
> >>>> compatible with wider range of NIC vendors).
> >>>>
> >>>
> >>> Mykyta, thanks for doing the write-up and sending it to the netdev
> >>> list! The timing could not be better -- we need to settle on an uapi
> >>> that works for all vendors prior enabling it in the kernel.
> >>>
> >>>> There are two NIC variations which don't fit well with current AF_XDP proposal.
> >>>>
> >>>> The first variation is represented by some NXP and Cavium NICs. AF_XDP
> >>>> expects NIC to put incoming frames into slots defined by UMEM area.
> >>>> Here slot size is set in XDP_UMEM_REG xdp_umem.reg.frame_size and
> >>>> slots available to NIC are communicated to the kernel via UMEM fill
> >>>> queue. While Intel NICs support only one slot size, NXP and Cavium
> >>>> support multiple slot sizes to optimize memory usage (e.g. { 128, 512,
> >>>> 2048 }, please refer to [1] for rationale). On frame reception the NIC
> >>>> pulls a slot from best fit pool based on frame size.
> >>>>
> >>>> The second variation is represented by e.g. Chelsio T5/T6 and Netcope
> >>>> NICs. As shown above, AF_XDP expects NIC to put incoming frames at
> >>>> predefined addresses. This is not the case here as the NIC is in
> >>>> charge of placing frames in memory based on it's own algorithm. For
> >>>> example, Chelsio T5/T6 is expecting to get whole pages from the driver
> >>>> and puts incoming frames on the page in a 'tape recorder' fashion.
> >>>> Assuming 'base' is page address and flen[N] is an array of frame
> >>>> lengths, the frame placement in memory will look like that:
> >>>>   base + 0
> >>>>   base + frame_len[0]
> >>>>   base + frame_len[0] + frame_len[1]
> >>>>   base + frame_len[0] + frame_len[1] + frame_len[2]
> >>>>   ...
> >>>>
> >>>> To better support these two NIC variations I suggest to abandon 'frame
> >>>> size' structuring in UMEM and stick with 'pages' instead. The NIC
> >>>> kernel driver is then responsible for splitting provided pages into
> >>>> slots expected by underlying HW (or not splitting at all in case of
> >>>> Chelsio/Netcope).
> >>>>
> >>>
> >>> Let's call the first variation "multi-pool" and the second
> >>> "type-writer" for simplicity. The type-writer model is very
> >>> interesting, and makes a lot of sense when the PCIe bus is a
> >>> constraint.
> >>>
> >>>> On XDP_UMEM_REG the application needs to specify page_size. Then the
> >>>> application can pass empty pages to the kernel driver using UMEM
> >>>> 'fill' queue by specifying page offset within the UMEM area. xdp_desc
> >>>> format needs to be changed as well: frame location will be defined by
> >>>> offset from the beginning of UMEM area instead of frame index. As
> >>>> payload headroom can vary with AF_XDP we'll need to specify it in
> >>>> xdp_desc as well. Beside that it could be worth to consider changing
> >>>> payload length to u16 as 64k+ frames aren't very common in networking.
> >>>> The resulting xdp_desc would look like that:
> >>>>
> >>>> struct xdp_desc {
> >>>>         __u64 offset;
> >>>>         __u16 headroom;
> >>>>         __u16 len;
> >>>>         __u8 flags;
> >>>>         __u8 padding[3];
> >>>> };
> >>>>
> >>>
> >>> Let's expand a bit here; Instead of passing indicies to fixed sized
> >>> frames to the fill ring, we now pass a memory region. For simplicity,
> >>> let's say a page. The fill ring descriptor requires offset and
> >>> len. The offset is a global offset from an UMEM perspective, and len
> >>> is the size of the region.
> >>>
> >>
> >> I would rather stick with region equal to page (regular or huge page,
> >> defined by application). The page size can be extracted from
> >> vm_area_struct in XDP_UMEM_REG (preferred) or configured by
> >> application.
> >>
> >
> > Ok, thinking more about it I prefer this as well. This means that we
> > only need to grow the UMEM fring/cring descriptors to u64, and not
> > care about length. As you state below, this makes the validation
> > simple.
> >
> > We might consider exposing a "page size hint", that the user can set.
> > For 4G hugepage scenario, it might make sense to have a chunk *less*
> > than 4G to avoid HW Rx memory running low when the end of a chunk is
> > approaching.
> >
> > As for THP, I need to think about proper behavior here.
> >
> >>> On the Rx ring the descriptor, as you wrote, must be changed as well
> >>> to your suggestion above. Note, that headroom is still needed, since
> >>> XDP can change the size of a packet, so the fixed headroom stated in
> >>> UMEM registration is not sufficient.
> >>>
> >>> This model is obviously more flexible, but then also a bit more
> >>> complex. E.g. a fixed-frame NIC (like ixgbe), what should the
> >>> behaviour be? Should the fill ring entry be used only for *one* frame,
> >>> or chopped up to multiple receive frames? Should it be configurable?
> >>> Driver specific?
> >>
> >> I think driver-specific makes most sense here. In case of fixed-frame
> >> NIC the driver shall chop the ring entry into multiple receive frames.
> >>
> >
> > Let's start there, keeping the configuration space small.
> >
> >>>
> >>> Also, validating the entries in the fill queue require more work
> >>> (compared to the current index variant). Currently, we only skip
> >>> invalid indicies. What should we do when say, you pass a memory window
> >>> that is too narrow (say 128B) but correct from a UMEM
> >>> perspective. Going this path, we need to add pretty hard constraints
> >>> so we don't end up it too complex code -- because then it'll be too
> >>> slow.
> >>
> >> If we stick with pages -- the only possible erroneous input will be
> >> 'page out of UMEM boundaries'. The validation will be essentially:
> >>
> >> if ((offset > umem->size) || (offset & (umem->page_size - 1))
> >>     fail
> >>
> >> The question is what shall be done if validation fails ? Would
> >> SEGFAULT be reasonable ? This is more or less equivalent to
> >> dereferencing invalid pointer.
> >>
> >
> > The current scheme is simply dropping that kernel skips the invalid
> > fill ring entry. SIGSEGV is an interesting idea!
> >
> >>>
> >>>> In current proposal you have a notion of 'frame ownership': 'owned by
> >>>> kernel' or 'owned by application'. The ownership is transferred by
> >>>> means of enqueueing frame index in UMEM 'fill' queue (from application
> >>>> to kernel) or in UMEM 'tx completion' queue (from kernel to
> >>>> application). If you decide to adopt 'page' approach this notion needs
> >>>> to be changed a bit. This is because in case of packet forwarding one
> >>>> and the same page can be used for RX (parts of it enqueued in HW 'free
> >>>> lists') and TX (forwarding of previously RXed packets).
> >>>>
> >>>> I propose to define 'ownership' as a right to manipulate the
> >>>> partitioning of the page into frames. Whenever application passes a
> >>>> page to the kernel via UMEM 'fill' queue -- the ownership is
> >>>> transferred to the kernel. The application can't allocate packets on
> >>>> this page until kernel is done with it, but it can change payload of
> >>>> RXed packets before forwarding them. The kernel can pass ownership
> >>>> back by means of 'end-of-page' in xdp_desc.flags.
> >>>>
> >>>
> >>> I like the end-of-page mechanism.
> >>>
> >>>> The pages are definitely supposed to be recycled sooner or later. Even
> >>>> if it's not part of kernel API and the housekeeping implementation
> >>>> resided completely in application I still would like to propose
> >>>> possible (hopefully, cost efficient) solution to that. The recycling
> >>>> could be achieved by keeping refcount on pages and recycling the page
> >>>> only when it's owned by application and refcount reaches 0.
> >>>>
> >>>> Whenever application transfers page ownership to the kernel the
> >>>> refcount shall be initialized to 0. With each incoming RX xdp_desc the
> >>>> corresponding page needs to be identified (xdp_desc.offset >>
> >>>> PAGE_SHIFT) and refcount incremented. When the packet gets freed the
> >>>> refcount shall be decremented. If packet is forwarded in TX xdp_desc
> >>>> -- the refcount gets decremented only on TX completion (again,
> >>>> tx_completion.desc >> PAGE_SHIFT). For packets originating from the
> >>>> application itself the payload buffers needs to be allocated from
> >>>> empty page owned by the application and refcount needs to be
> >>>> incremented as well.
> >>>>
> >>>
> >>> Strictly speaking, we're not enforcing correctness in the current
> >>> solution. If the userspace application passes index 1 mulitple times
> >>> to the fill ring, and at the same time send index 1, things will
> >>> break. So, with the existing solution the userspace application
> >>> *still* needs to track the frames. With this new model, the
> >>> tracking/refcounting will be more complex. That might be a concern.
> >>>
> >>> For the multi-pool NICs I think we can still just have one UMEM, and
> >>> let the driver decide where in which pool to place this certain chunk
> >>> of memory. Thoughts?
> >>
> >> Definitely agree with that. This is HW specific and exposing it to the
> >> application would only harm portability.
> >>
> >
> > Good stuff, we're on the same page then.
> >
> >>>
> >>> Now, how do we go forward? I think this is very important, and I will
> >>> hack a copy-mode version for this new API. I'm a bit worried that the
> >>> complexity/configuration space will grow and impact performance, but
> >>> let's see.
> >>>
> >>> To prove that the API works for the NICs you mentioned, we need an
> >>> actual zero-copy implementation for them. Do you think Linaro could
> >>> work on a zero-copy variant for any of the NICs above?
> >>>
> >>
> >> Linaro will definitely contribute zero-copy implementation for some
> >> ARM-based NICs with 'multi-pool' variation.
> >
> > Very nice!
> >
> >> Implementation of
> >> 'type-writer' variation is up to Chelsio/Netcope, we only try to come
> >> up with API which (most probably) will fit them as well.
> >>
> >
> > Let's hope we get an implementation from these vendors as well! :-)
> >
> >
> > Björn
> >
> >>>
> >>> Again thanks for bringing this up!
> >>> Björn
> >>>
> >>>
> >>>
> >>>> [1] https://en.wikipedia.org/wiki/Internet_Mix
> >>>>
> >>>> With best regards,
> >>>> Mykyta

^ permalink raw reply

* [PATCH bpf-next] bpf: guard bpf_get_current_cgroup_id() with CONFIG_CGROUPS
From: Yonghong Song @ 2018-06-04 15:53 UTC (permalink / raw)
  To: ast, daniel, netdev; +Cc: kernel-team

Commit bf6fa2c893c5 ("bpf: implement bpf_get_current_cgroup_id()
helper") introduced a new helper bpf_get_current_cgroup_id().
The helper has a dependency on CONFIG_CGROUPS.

When CONFIG_CGROUPS is not defined, using the helper will result
the following verifier error:
  kernel subsystem misconfigured func bpf_get_current_cgroup_id#80
which is hard for users to interpret.
Guarding the reference to bpf_get_current_cgroup_id_proto with
CONFIG_CGROUPS will result in below better message:
  unknown func bpf_get_current_cgroup_id#80

Fixes: bf6fa2c893c5 ("bpf: implement bpf_get_current_cgroup_id() helper")
Suggested-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: Yonghong Song <yhs@fb.com>
---
 kernel/trace/bpf_trace.c | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/kernel/trace/bpf_trace.c b/kernel/trace/bpf_trace.c
index e2ab5b7..0ae6829 100644
--- a/kernel/trace/bpf_trace.c
+++ b/kernel/trace/bpf_trace.c
@@ -564,8 +564,10 @@ tracing_func_proto(enum bpf_func_id func_id, const struct bpf_prog *prog)
 		return &bpf_get_prandom_u32_proto;
 	case BPF_FUNC_probe_read_str:
 		return &bpf_probe_read_str_proto;
+#ifdef CONFIG_CGROUPS
 	case BPF_FUNC_get_current_cgroup_id:
 		return &bpf_get_current_cgroup_id_proto;
+#endif
 	default:
 		return NULL;
 	}
-- 
2.9.5

^ permalink raw reply related

* Re: [PATCH net] VSOCK: check sk state before receive
From: Jorgen S. Hansen @ 2018-06-04 16:02 UTC (permalink / raw)
  To: Stefan Hajnoczi; +Cc: Hangbin Liu, netdev@vger.kernel.org, David S. Miller
In-Reply-To: <20180530091727.GF14623@stefanha-x1.localdomain>


> On May 30, 2018, at 11:17 AM, Stefan Hajnoczi <stefanha@redhat.com> wrote:
> 
> On Sun, May 27, 2018 at 11:29:45PM +0800, Hangbin Liu wrote:
>> Hmm...Although I won't reproduce this bug with my reproducer after
>> apply my patch. I could still get a similiar issue with syzkaller sock vnet test.
>> 
>> It looks this patch is not complete. Here is the KASAN call trace with my patch.
>> I can also reproduce it without my patch.
> 
> Seems like a race between vmci_datagram_destroy_handle() and the
> delayed callback, vmci_transport_recv_dgram_cb().
> 
> I don't know the VMCI transport well so I'll leave this to Jorgen.

Yes, it looks like we are calling the delayed callback after we return from vmci_datagram_destroy_handle(). I’ll take a closer look at the VMCI side here - the refcounting of VMCI datagram endpoints should guard against this, since the delayed callback does a get on the datagram resource, so this could a VMCI driver issue, and not a problem in the VMCI transport for AF_VSOCK.

> 
>> ==================================================================
>> BUG: KASAN: use-after-free in vmci_transport_allow_dgram.part.7+0x155/0x1a0 [vmw_vsock_vmci_transport]
>> Read of size 4 at addr ffff880026a3a914 by task kworker/0:2/96
>> 
>> CPU: 0 PID: 96 Comm: kworker/0:2 Not tainted 4.17.0-rc6.vsock+ #28
>> Hardware name: Red Hat KVM, BIOS 0.5.1 01/01/2011
>> Workqueue: events dg_delayed_dispatch [vmw_vmci]
>> Call Trace:
>> __dump_stack lib/dump_stack.c:77 [inline]
>> dump_stack+0xdd/0x18e lib/dump_stack.c:113
>> print_address_description+0x7a/0x3e0 mm/kasan/report.c:256
>> kasan_report_error mm/kasan/report.c:354 [inline]
>> kasan_report+0x1dd/0x460 mm/kasan/report.c:412
>> vmci_transport_allow_dgram.part.7+0x155/0x1a0 [vmw_vsock_vmci_transport]
>> vmci_transport_recv_dgram_cb+0x5d/0x200 [vmw_vsock_vmci_transport]
>> dg_delayed_dispatch+0x99/0x1b0 [vmw_vmci]
>> process_one_work+0xa4e/0x1720 kernel/workqueue.c:2145
>> worker_thread+0x1df/0x1400 kernel/workqueue.c:2279
>> kthread+0x343/0x4b0 kernel/kthread.c:240
>> ret_from_fork+0x35/0x40 arch/x86/entry/entry_64.S:412
>> 
>> Allocated by task 2684:
>> set_track mm/kasan/kasan.c:460 [inline]
>> kasan_kmalloc+0xa0/0xd0 mm/kasan/kasan.c:553
>> slab_post_alloc_hook mm/slab.h:444 [inline]
>> slab_alloc_node mm/slub.c:2741 [inline]
>> slab_alloc mm/slub.c:2749 [inline]
>> kmem_cache_alloc+0x105/0x330 mm/slub.c:2754
>> sk_prot_alloc+0x6a/0x2c0 net/core/sock.c:1468
>> sk_alloc+0xc9/0xbb0 net/core/sock.c:1528
>> __vsock_create+0xc8/0x9b0 [vsock]
>> vsock_create+0xfd/0x1a0 [vsock]
>> __sock_create+0x310/0x690 net/socket.c:1285
>> sock_create net/socket.c:1325 [inline]
>> __sys_socket+0x101/0x240 net/socket.c:1355
>> __do_sys_socket net/socket.c:1364 [inline]
>> __se_sys_socket net/socket.c:1362 [inline]
>> __x64_sys_socket+0x7d/0xd0 net/socket.c:1362
>> do_syscall_64+0x175/0x630 arch/x86/entry/common.c:287
>> entry_SYSCALL_64_after_hwframe+0x44/0xa9
>> 
>> Freed by task 2684:
>> set_track mm/kasan/kasan.c:460 [inline]
>> __kasan_slab_free+0x130/0x180 mm/kasan/kasan.c:521
>> slab_free_hook mm/slub.c:1388 [inline]
>> slab_free_freelist_hook mm/slub.c:1415 [inline]
>> slab_free mm/slub.c:2988 [inline]
>> kmem_cache_free+0xce/0x410 mm/slub.c:3004
>> sk_prot_free net/core/sock.c:1509 [inline]
>> __sk_destruct+0x629/0x940 net/core/sock.c:1593
>> sk_destruct+0x4e/0x90 net/core/sock.c:1601
>> __sk_free+0xd3/0x320 net/core/sock.c:1612
>> sk_free+0x2a/0x30 net/core/sock.c:1623
>> __vsock_release+0x431/0x610 [vsock]
>> vsock_release+0x3c/0xc0 [vsock]
>> sock_release+0x91/0x200 net/socket.c:594
>> sock_close+0x17/0x20 net/socket.c:1149
>> __fput+0x368/0xa20 fs/file_table.c:209
>> task_work_run+0x1c5/0x2a0 kernel/task_work.c:113
>> exit_task_work include/linux/task_work.h:22 [inline]
>> do_exit+0x1876/0x26c0 kernel/exit.c:865
>> do_group_exit+0x159/0x3e0 kernel/exit.c:968
>> get_signal+0x65a/0x1780 kernel/signal.c:2482
>> do_signal+0xa4/0x1fe0 arch/x86/kernel/signal.c:810
>> exit_to_usermode_loop+0x1b8/0x260 arch/x86/entry/common.c:162
>> prepare_exit_to_usermode arch/x86/entry/common.c:196 [inline]
>> syscall_return_slowpath arch/x86/entry/common.c:265 [inline]
>> do_syscall_64+0x505/0x630 arch/x86/entry/common.c:290
>> entry_SYSCALL_64_after_hwframe+0x44/0xa9
>> 
>> The buggy address belongs to the object at ffff880026a3a600
>> which belongs to the cache AF_VSOCK of size 1056
>> The buggy address is located 788 bytes inside of
>> 1056-byte region [ffff880026a3a600, ffff880026a3aa20)
>> The buggy address belongs to the page:
>> page:ffffea00009a8e00 count:1 mapcount:0 mapping:0000000000000000 index:0x0 compound_mapcount: 0
>> flags: 0xfffffc0008100(slab|head)
>> raw: 000fffffc0008100 0000000000000000 0000000000000000 00000001000d000d
>> raw: dead000000000100 dead000000000200 ffff880034471a40 0000000000000000
>> page dumped because: kasan: bad access detected
>> 
>> Memory state around the buggy address:
>> ffff880026a3a800: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
>> ffff880026a3a880: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
>>> ffff880026a3a900: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
>>                         ^
>> ffff880026a3a980: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
>> ffff880026a3aa00: fb fb fb fb fc fc fc fc fc fc fc fc fc fc fc fc
>> ==================================================================


^ permalink raw reply

* Re: [PATCH bpf-next 0/5] AF_XDP: bug fixes and descriptor changes
From: Alexei Starovoitov @ 2018-06-04 16:24 UTC (permalink / raw)
  To: Björn Töpel
  Cc: magnus.karlsson, magnus.karlsson, alexander.h.duyck,
	alexander.duyck, ast, brouer, daniel, netdev, mykyta.iziumtsev,
	Björn Töpel, john.fastabend, willemdebruijn.kernel, mst,
	michael.lundkvist, jesse.brandeburg, anjali.singhai, qi.z.zhang,
	francois.ozog, ilias.apalodimas, brian.brooks, andy, michael.chan
In-Reply-To: <20180604115715.17895-1-bjorn.topel@gmail.com>

On Mon, Jun 04, 2018 at 01:57:10PM +0200, Björn Töpel wrote:
> From: Björn Töpel <bjorn.topel@intel.com>
> 
> An issue with the current AF_XDP uapi raised by Mykyta Iziumtsev (see
> https://www.spinics.net/lists/netdev/msg503664.html) is that it does
> not support NICs that have a "type-writer" model in an efficient
> way. In this model, a memory window is passed to the hardware and
> multiple frames might be filled into that window, instead of just one
> that we have in the current fixed frame-size model.
> 
> This patch set fixes two bugs in the current implementation and then
> changes the uapi so that the type-writer model can be supported
> efficiently by a possible future extension of AF_XDP.
> 
> These are the uapi changes in this patch:
> 
> * Change the "u32 idx" in the descriptors to "u64 addr". The current
>   idx based format does NOT work for the type-writer model (as packets
>   can start anywhere within a frame) but that a relative address
>   pointer (the u64 addr) works well for both models in the prototype
>   code we have that supports both models. We increased it from u32 to
>   u64 to support umems larger than 4G. We have also removed the u16
>   offset when having a "u64 addr" since that information is already
>   carried in the least significant bits of the address.
> 
> * We want to use "u8 padding[5]" for something useful in the future
>   (since we are not allowed to change its name), so we now call it
>   just options so it can be extended for various purposes in the
>   future. It is an u32 as that it what is left of the 16 byte
>   descriptor.
> 
> * We changed the name of frame_size in the UMEM_REG setsockopt to
>   chunk_size since this naming also makes sense to the type-writer
>   model.
> 
> With these changes to the uapi, we believe the type-writer model can
> be supported without having to resort to a new descriptor format. The
> type-writer model could then be supported, from the uapi point of
> view, by setting a flag at bind time and providing a new flag bit in
> the options field of the descriptor that signals to user space that
> all packets have been written in a chunk. Or with a new chunk
> completion queue as suggested by Mykyta in his latest feedback mail on
> the list.

for the set:
Acked-by: Alexei Starovoitov <ast@kernel.org>
Thank you for these fixes.
According to unofficial feedback from brcm and netronome folks
the descriptor format should work for these nics too.
At some point we may consider second format, but I think SW
should drive HW requirements and not the other way around.

^ permalink raw reply

* [PATCH bpf-next v2 1/2] trace_helpers.c: Add helpers to poll multiple perf FDs for events
From: Toke Høiland-Jørgensen @ 2018-06-04 16:33 UTC (permalink / raw)
  To: netdev

This adds two new helper functions to trace_helpers that supports polling
multiple perf file descriptors for events. These are used to the XDP
perf_event_output example, which needs to work with one perf fd per CPU.

Signed-off-by: Toke Høiland-Jørgensen <toke@toke.dk>
---
 tools/testing/selftests/bpf/trace_helpers.c |   47 ++++++++++++++++++++++++++-
 tools/testing/selftests/bpf/trace_helpers.h |    4 ++
 2 files changed, 49 insertions(+), 2 deletions(-)

diff --git a/tools/testing/selftests/bpf/trace_helpers.c b/tools/testing/selftests/bpf/trace_helpers.c
index 3868dcb63420..1e62d89f34cf 100644
--- a/tools/testing/selftests/bpf/trace_helpers.c
+++ b/tools/testing/selftests/bpf/trace_helpers.c
@@ -88,7 +88,7 @@ static int page_size;
 static int page_cnt = 8;
 static struct perf_event_mmap_page *header;
 
-int perf_event_mmap(int fd)
+int perf_event_mmap_header(int fd, struct perf_event_mmap_page **header)
 {
 	void *base;
 	int mmap_size;
@@ -102,10 +102,15 @@ int perf_event_mmap(int fd)
 		return -1;
 	}
 
-	header = base;
+	*header = base;
 	return 0;
 }
 
+int perf_event_mmap(int fd)
+{
+	return perf_event_mmap_header(fd, &header);
+}
+
 static int perf_event_poll(int fd)
 {
 	struct pollfd pfd = { .fd = fd, .events = POLLIN };
@@ -163,3 +168,41 @@ int perf_event_poller(int fd, perf_event_print_fn output_fn)
 
 	return ret;
 }
+
+int perf_event_poller_multi(int *fds, struct perf_event_mmap_page **headers,
+			    int num_fds, perf_event_print_fn output_fn)
+{
+	enum bpf_perf_event_ret ret;
+	struct pollfd *pfds;
+	void *buf = NULL;
+	size_t len = 0;
+	int i;
+
+	pfds = malloc(sizeof(*pfds) * num_fds);
+	if (!pfds)
+		return -1;
+
+	memset(pfds, 0, sizeof(*pfds) * num_fds);
+	for (i = 0; i < num_fds; i++) {
+		pfds[i].fd = fds[i];
+		pfds[i].events = POLLIN;
+	}
+
+	for (;;) {
+		poll(pfds, num_fds, 1000);
+		for (i = 0; i < num_fds; i++) {
+			if (pfds[i].revents) {
+				ret = bpf_perf_event_read_simple(headers[i], page_cnt * page_size,
+								page_size, &buf, &len,
+								bpf_perf_event_print,
+								output_fn);
+				if (ret != LIBBPF_PERF_EVENT_CONT)
+					break;
+			}
+		}
+	}
+	free(buf);
+	free(pfds);
+
+	return ret;
+}
diff --git a/tools/testing/selftests/bpf/trace_helpers.h b/tools/testing/selftests/bpf/trace_helpers.h
index 3b4bcf7f5084..18924f23db1b 100644
--- a/tools/testing/selftests/bpf/trace_helpers.h
+++ b/tools/testing/selftests/bpf/trace_helpers.h
@@ -3,6 +3,7 @@
 #define __TRACE_HELPER_H
 
 #include <libbpf.h>
+#include <linux/perf_event.h>
 
 struct ksym {
 	long addr;
@@ -16,6 +17,9 @@ long ksym_get_addr(const char *name);
 typedef enum bpf_perf_event_ret (*perf_event_print_fn)(void *data, int size);
 
 int perf_event_mmap(int fd);
+int perf_event_mmap_header(int fd, struct perf_event_mmap_page **header);
 /* return LIBBPF_PERF_EVENT_DONE or LIBBPF_PERF_EVENT_ERROR */
 int perf_event_poller(int fd, perf_event_print_fn output_fn);
+int perf_event_poller_multi(int *fds, struct perf_event_mmap_page **headers,
+			    int num_fds, perf_event_print_fn output_fn);
 #endif

^ permalink raw reply related

* [PATCH bpf-next v2 2/2] samples/bpf: Add xdp_sample_pkts example
From: Toke Høiland-Jørgensen @ 2018-06-04 16:33 UTC (permalink / raw)
  To: netdev
In-Reply-To: <152813003609.3465.618891361534945522.stgit@alrua-kau>

This adds an example program showing how to sample packets from XDP using
the perf event buffer. The example userspace program just prints the
ethernet header for every packet sampled.

The example sets up a perf file descriptor per CPU, allowing the XDP
program to pass BPF_F_CURRENT_CPU and work no matter which CPU handles the
packet.

Signed-off-by: Toke Høiland-Jørgensen <toke@toke.dk>
---
 samples/bpf/Makefile               |    4 +
 samples/bpf/xdp_sample_pkts_kern.c |   62 ++++++++++++++
 samples/bpf/xdp_sample_pkts_user.c |  162 ++++++++++++++++++++++++++++++++++++
 3 files changed, 228 insertions(+)
 create mode 100644 samples/bpf/xdp_sample_pkts_kern.c
 create mode 100644 samples/bpf/xdp_sample_pkts_user.c

diff --git a/samples/bpf/Makefile b/samples/bpf/Makefile
index 1303af10e54d..6f0c6d276a86 100644
--- a/samples/bpf/Makefile
+++ b/samples/bpf/Makefile
@@ -52,6 +52,7 @@ hostprogs-y += xdp_adjust_tail
 hostprogs-y += xdpsock
 hostprogs-y += xdp_fwd
 hostprogs-y += task_fd_query
+hostprogs-y += xdp_sample_pkts
 
 # Libbpf dependencies
 LIBBPF = $(TOOLS_PATH)/lib/bpf/libbpf.a
@@ -107,6 +108,7 @@ xdp_adjust_tail-objs := xdp_adjust_tail_user.o
 xdpsock-objs := bpf_load.o xdpsock_user.o
 xdp_fwd-objs := bpf_load.o xdp_fwd_user.o
 task_fd_query-objs := bpf_load.o task_fd_query_user.o $(TRACE_HELPERS)
+xdp_sample_pkts-objs := bpf_load.o xdp_sample_pkts_user.o $(TRACE_HELPERS)
 
 # Tell kbuild to always build the programs
 always := $(hostprogs-y)
@@ -163,6 +165,7 @@ always += xdp_adjust_tail_kern.o
 always += xdpsock_kern.o
 always += xdp_fwd_kern.o
 always += task_fd_query_kern.o
+always += xdp_sample_pkts_kern.o
 
 HOSTCFLAGS += -I$(objtree)/usr/include
 HOSTCFLAGS += -I$(srctree)/tools/lib/
@@ -179,6 +182,7 @@ HOSTCFLAGS_spintest_user.o += -I$(srctree)/tools/lib/bpf/
 HOSTCFLAGS_trace_event_user.o += -I$(srctree)/tools/lib/bpf/
 HOSTCFLAGS_sampleip_user.o += -I$(srctree)/tools/lib/bpf/
 HOSTCFLAGS_task_fd_query_user.o += -I$(srctree)/tools/lib/bpf/
+HOSTCFLAGS_xdp_sample_pkts_user.o += -I$(srctree)/tools/lib/bpf/
 
 HOST_LOADLIBES		+= $(LIBBPF) -lelf
 HOSTLOADLIBES_tracex4		+= -lrt
diff --git a/samples/bpf/xdp_sample_pkts_kern.c b/samples/bpf/xdp_sample_pkts_kern.c
new file mode 100644
index 000000000000..4560522ca015
--- /dev/null
+++ b/samples/bpf/xdp_sample_pkts_kern.c
@@ -0,0 +1,62 @@
+#include <linux/ptrace.h>
+#include <linux/version.h>
+#include <uapi/linux/bpf.h>
+#include "bpf_helpers.h"
+
+#define SAMPLE_SIZE 64ul
+#define MAX_CPUS 24
+
+#define bpf_printk(fmt, ...)					\
+({								\
+	       char ____fmt[] = fmt;				\
+	       bpf_trace_printk(____fmt, sizeof(____fmt),	\
+				##__VA_ARGS__);			\
+})
+
+struct bpf_map_def SEC("maps") my_map = {
+	.type = BPF_MAP_TYPE_PERF_EVENT_ARRAY,
+	.key_size = sizeof(int),
+	.value_size = sizeof(u32),
+	.max_entries = MAX_CPUS,
+};
+
+SEC("xdp_sample")
+int xdp_sample_prog(struct xdp_md *ctx)
+{
+	void *data_end = (void *)(long)ctx->data_end;
+	void *data = (void *)(long)ctx->data;
+
+        /* Metadata will be in the perf event before the packet data. */
+	struct S {
+		u16 cookie;
+		u16 pkt_len;
+	} __attribute__((packed)) metadata;
+
+	if (data + SAMPLE_SIZE < data_end) {
+		/* The XDP perf_event_output handler will use the upper 32 bits
+		 * of the flags argument as a number of bytes to include of the
+		 * packet payload in the event data. If the size is too big, the
+		 * call to bpf_perf_event_output will fail and return -EFAULT.
+		 *
+		 * See bpf_xdp_event_output in net/core/filter.c.
+		 *
+		 * The BPF_F_CURRENT_CPU flag means that the event output fd
+		 * will be indexed by the CPU number in the event map.
+		 */
+		u64 flags = (SAMPLE_SIZE << 32) | BPF_F_CURRENT_CPU;
+		int ret;
+
+		metadata.cookie = 0xdead;
+		metadata.pkt_len = (u16)(data_end - data);
+
+		ret = bpf_perf_event_output(ctx, &my_map, flags,
+				      &metadata, sizeof(metadata));
+		if(ret)
+			bpf_printk("perf_event_output failed: %d\n", ret);
+	}
+
+	return XDP_PASS;
+}
+
+char _license[] SEC("license") = "GPL";
+u32 _version SEC("version") = LINUX_VERSION_CODE;
diff --git a/samples/bpf/xdp_sample_pkts_user.c b/samples/bpf/xdp_sample_pkts_user.c
new file mode 100644
index 000000000000..35c5dd953f48
--- /dev/null
+++ b/samples/bpf/xdp_sample_pkts_user.c
@@ -0,0 +1,162 @@
+/* This program is free software; you can redistribute it and/or
+ * modify it under the terms of version 2 of the GNU General Public
+ * License as published by the Free Software Foundation.
+ */
+#include <stdio.h>
+#include <unistd.h>
+#include <stdlib.h>
+#include <stdbool.h>
+#include <string.h>
+#include <fcntl.h>
+#include <poll.h>
+#include <linux/perf_event.h>
+#include <linux/bpf.h>
+#include <net/if.h>
+#include <errno.h>
+#include <assert.h>
+#include <sys/sysinfo.h>
+#include <sys/syscall.h>
+#include <sys/ioctl.h>
+#include <sys/mman.h>
+#include <time.h>
+#include <signal.h>
+#include <libbpf.h>
+#include "bpf_load.h"
+#include "bpf_util.h"
+#include <bpf/bpf.h>
+
+#include "perf-sys.h"
+#include "trace_helpers.h"
+
+#define MAX_CPUS 24
+static int pmu_fds[MAX_CPUS], if_idx = 0;
+static struct perf_event_mmap_page *headers[MAX_CPUS];
+static char *if_name;
+
+static int do_attach(int idx, int fd, const char *name)
+{
+	int err;
+
+	err = bpf_set_link_xdp_fd(idx, fd, 0);
+	if (err < 0)
+		printf("ERROR: failed to attach program to %s\n", name);
+
+	return err;
+}
+
+static int do_detach(int idx, const char *name)
+{
+	int err;
+
+	err = bpf_set_link_xdp_fd(idx, -1, 0);
+	if (err < 0)
+		printf("ERROR: failed to detach program from %s\n", name);
+
+	return err;
+}
+
+#define SAMPLE_SIZE 64
+
+static int print_bpf_output(void *data, int size)
+{
+	struct {
+		__u16 cookie;
+		__u16 pkt_len;
+		__u8  pkt_data[SAMPLE_SIZE];
+	} __attribute__((packed)) *e = data;
+	int i;
+
+	if (e->cookie != 0xdead) {
+		printf("BUG cookie %x sized %d\n",
+		       e->cookie, size);
+		return LIBBPF_PERF_EVENT_ERROR;
+	}
+
+	printf("Pkt len: %-5d bytes. Ethernet hdr: ", e->pkt_len);
+	for (i = 0; i < 14 && i < e->pkt_len; i++)
+		printf("%02x ", e->pkt_data[i]);
+	printf("\n");
+
+	return LIBBPF_PERF_EVENT_CONT;
+}
+
+static void test_bpf_perf_event(int num)
+{
+	struct perf_event_attr attr = {
+		.sample_type = PERF_SAMPLE_RAW,
+		.type = PERF_TYPE_SOFTWARE,
+		.config = PERF_COUNT_SW_BPF_OUTPUT,
+		.wakeup_events = 1, /* get an fd notification for every event */
+	};
+	int i;
+
+	for (i = 0; i < num; i++) {
+		int key = i;
+
+		pmu_fds[i] = sys_perf_event_open(&attr, -1/*pid*/, i/*cpu*/, -1/*group_fd*/, 0);
+
+		assert(pmu_fds[i] >= 0);
+		assert(bpf_map_update_elem(map_fd[0], &key, &pmu_fds[i], BPF_ANY) == 0);
+		ioctl(pmu_fds[i], PERF_EVENT_IOC_ENABLE, 0);
+	}
+}
+
+static void sig_handler(int signo)
+{
+	do_detach(if_idx, if_name);
+	exit(0);
+}
+
+int main(int argc, char **argv)
+{
+	char filename[256];
+	int ret, err;
+	int numcpus;
+	int i;
+
+	if (argc < 2) {
+		printf("Usage: %s <ifname>\n", argv[0]);
+		return 1;
+	}
+
+	numcpus = get_nprocs();
+	if (numcpus > MAX_CPUS)
+		numcpus = MAX_CPUS;
+
+	snprintf(filename, sizeof(filename), "%s_kern.o", argv[0]);
+
+	if (load_bpf_file(filename)) {
+		printf("%s", bpf_log_buf);
+		return 1;
+	}
+
+	if_idx = if_nametoindex(argv[1]);
+	if (!if_idx)
+		if_idx = strtoul(argv[1], NULL, 0);
+
+	if (!if_idx) {
+		fprintf(stderr, "Invalid ifname\n");
+		return 1;
+	}
+	if_name = argv[1];
+	err = do_attach(if_idx, prog_fd[0], argv[1]);
+	if (err)
+		return err;
+
+	if (signal(SIGINT, sig_handler) ||
+	    signal(SIGHUP, sig_handler) ||
+	    signal(SIGTERM, sig_handler)) {
+		perror("signal");
+		return 1;
+	}
+
+	test_bpf_perf_event(numcpus);
+
+	for (i = 0; i < numcpus; i++)
+		if (perf_event_mmap_header(pmu_fds[i], &headers[i]) < 0)
+			return 1;
+
+	ret = perf_event_poller_multi(pmu_fds, headers, numcpus, print_bpf_output);
+	kill(0, SIGINT);
+	return ret;
+}

^ permalink raw reply related

* Re: [PATCH net-next 0/2] cls_flower: Various fixes
From: Cong Wang @ 2018-06-04 16:45 UTC (permalink / raw)
  To: Roi Dayan
  Cc: Jiri Pirko, Paul Blakey, Jiri Pirko, Jamal Hadi Salim,
	David Miller, Linux Kernel Network Developers, Yevgeny Kliteynik,
	Shahar Klein, Mark Bloch, Or Gerlitz
In-Reply-To: <695d484a-b981-054a-1da0-e63719bc9100@mellanox.com>

On Mon, Jun 4, 2018 at 12:35 AM, Roi Dayan <roid@mellanox.com> wrote:
>
>
> On 03/06/2018 22:39, Jiri Pirko wrote:
>>
>> Sun, Jun 03, 2018 at 08:33:25PM CEST, xiyou.wangcong@gmail.com wrote:
>>>
>>> On Wed, May 30, 2018 at 1:17 AM, Paul Blakey <paulb@mellanox.com> wrote:
>>>>
>>>> Two of the fixes are for my multiple mask patch
>>>>
>>>> Paul Blakey (2):
>>>>    cls_flower: Fix missing free of rhashtable
>>>>    cls_flower: Fix comparing of old filter mask with new filter
>>>
>>>
>>> Both are bug fixes and one-line fixes, so definitely should go
>>> to -net tree and -stable tree.
>>
>>
>> I agree.
>>
>
> it's because the commit they fix doesn't exists in net yet.
>

Oh, sorry, my bad, I thought Apr 30 is in a previous release...

Anyway, v2 should be applied cleanly to -net-next or -net after
net-next is merged into it.

^ permalink raw reply

* Re: [PATCH bpf-next 00/11] AF_XDP: introducing zero-copy support
From: Alexei Starovoitov @ 2018-06-04 16:38 UTC (permalink / raw)
  To: Björn Töpel
  Cc: magnus.karlsson, magnus.karlsson, alexander.h.duyck,
	alexander.duyck, ast, brouer, daniel, netdev, mykyta.iziumtsev,
	Björn Töpel, john.fastabend, willemdebruijn.kernel, mst,
	michael.lundkvist, jesse.brandeburg, anjali.singhai, qi.z.zhang,
	francois.ozog, ilias.apalodimas, brian.brooks, andy, michael.chan,
	intel-wired-lan
In-Reply-To: <20180604120601.18123-1-bjorn.topel@gmail.com>

On Mon, Jun 04, 2018 at 02:05:50PM +0200, Björn Töpel wrote:
> From: Björn Töpel <bjorn.topel@intel.com>
> 
> This patch serie introduces zerocopy (ZC) support for
> AF_XDP. Programs using AF_XDP sockets will now receive RX packets
> without any copies and can also transmit packets without incurring any
> copies. No modifications to the application are needed, but the NIC
> driver needs to be modified to support ZC. If ZC is not supported by
> the driver, the modes introduced in the AF_XDP patch will be
> used. Using ZC in our micro benchmarks results in significantly
> improved performance as can be seen in the performance section later
> in this cover letter.
> 
> Note that for an untrusted application, HW packet steering to a
> specific queue pair (the one associated with the application) is a
> requirement when using ZC, as the application would otherwise be able
> to see other user space processes' packets. If the HW cannot support
> the required packet steering you need to use the XDP_SKB mode or the
> XDP_DRV mode without ZC turned on. The XSKMAP introduced in the AF_XDP
> patch set can be used to do load balancing in that case.
> 
> For benchmarking, you can use the xdpsock application from the AF_XDP
> patch set without any modifications. Say that you would like your UDP
> traffic from port 4242 to end up in queue 16, that we will enable
> AF_XDP on. Here, we use ethtool for this:
> 
>       ethtool -N p3p2 rx-flow-hash udp4 fn
>       ethtool -N p3p2 flow-type udp4 src-port 4242 dst-port 4242 \
>           action 16
> 
> Running the rxdrop benchmark in XDP_DRV mode with zerocopy can then be
> done using:
> 
>       samples/bpf/xdpsock -i p3p2 -q 16 -r -N
> 
> We have run some benchmarks on a dual socket system with two Broadwell
> E5 2660 @ 2.0 GHz with hyperthreading turned off. Each socket has 14
> cores which gives a total of 28, but only two cores are used in these
> experiments. One for TR/RX and one for the user space application. The
> memory is DDR4 @ 2133 MT/s (1067 MHz) and the size of each DIMM is
> 8192MB and with 8 of those DIMMs in the system we have 64 GB of total
> memory. The compiler used is gcc (Ubuntu 7.3.0-16ubuntu3) 7.3.0. The
> NIC is Intel I40E 40Gbit/s using the i40e driver.
> 
> Below are the results in Mpps of the I40E NIC benchmark runs for 64
> and 1500 byte packets, generated by a commercial packet generator HW
> outputing packets at full 40 Gbit/s line rate. The results are without
> retpoline so that we can compare against previous numbers. 
> 
> AF_XDP performance 64 byte packets. Results from the AF_XDP V3 patch
> set are also reported for ease of reference. The numbers within
> parantheses are from the RFC V1 ZC patch set.
> Benchmark   XDP_SKB    XDP_DRV    XDP_DRV with zerocopy
> rxdrop       2.9*       9.6*       21.1(21.5)
> txpush       2.6*       -          22.0(21.6)
> l2fwd        1.9*       2.5*       15.3(15.0)
> 
> AF_XDP performance 1500 byte packets:
> Benchmark   XDP_SKB   XDP_DRV     XDP_DRV with zerocopy
> rxdrop       2.1*       3.3*       3.3(3.3)
> l2fwd        1.4*       1.8*       3.1(3.1)
> 
> * From AF_XDP V3 patch set and cover letter.
> 
> So why do we not get higher values for RX similar to the 34 Mpps we
> had in AF_PACKET V4? We made an experiment running the rxdrop
> benchmark without using the xdp_do_redirect/flush infrastructure nor
> using an XDP program (all traffic on a queue goes to one
> socket). Instead the driver acts directly on the AF_XDP socket. With
> this we got 36.9 Mpps, a significant improvement without any change to
> the uapi. So not forcing users to have an XDP program if they do not
> need it, might be a good idea. This measurement is actually higher
> than what we got with AF_PACKET V4.
> 
> XDP performance on our system as a base line:
> 
> 64 byte packets:
> XDP stats       CPU     pps         issue-pps
> XDP-RX CPU      16      32.3M  0
> 
> 1500 byte packets:
> XDP stats       CPU     pps         issue-pps
> XDP-RX CPU      16      3.3M    0
> 
> The structure of the patch set is as follows:
> 
> Patches 1-3: Plumbing for AF_XDP ZC support
> Patches 4-5: AF_XDP ZC for RX
> Patches 6-7: AF_XDP ZC for TX

Acked-by: Alexei Starovoitov <ast@kernel.org>
for above patches

> Patch 8-10: ZC support for i40e.

these also look good to me.
would be great if i40e experts take a look at them asap.

If there are no major objections we'd like to merge all of it
for this merge window.

Thanks!

^ permalink raw reply

* [PATCH V2 net] net: hns: Fix the process of adding broadcast addresses to tcam
From: Salil Mehta @ 2018-06-04 16:50 UTC (permalink / raw)
  To: davem
  Cc: salil.mehta, yisen.zhuang, lipeng321, mehta.salil, netdev,
	linux-kernel, linuxarm, Xi Wang

From: Xi Wang <wangxi11@huawei.com>

If the multicast mask value in device tree is configured not all
0xff, the broadcast mac will be lost from tcam table after the
execution of command 'ifconfig up'. The address is appended by
hns_ae_start, but will be clear later by hns_nic_set_rx_mode
called in dev_open process.

This patch fixed it by not use the multicast mask when add a
broadcast address.

Fixes: b5996f11ea54 ("net: add Hisilicon Network Subsystem basic ethernet support")
Signed-off-by: Xi Wang <wangxi11@huawei.com>
Signed-off-by: Peng Li <lipeng321@huawei.com>
Signed-off-by: Salil Mehta <salil.mehta@huawei.com>
---
Patch V2: Fixed comments by David Miller
   Link: https://lkml.org/lkml/2018/6/4/671
Patch V1: Initial Submit
---
 drivers/net/ethernet/hisilicon/hns/hns_dsaf_main.c | 23 +++++++++++++++++-----
 1 file changed, 18 insertions(+), 5 deletions(-)

diff --git a/drivers/net/ethernet/hisilicon/hns/hns_dsaf_main.c b/drivers/net/ethernet/hisilicon/hns/hns_dsaf_main.c
index e0bc79e..85e1d14 100644
--- a/drivers/net/ethernet/hisilicon/hns/hns_dsaf_main.c
+++ b/drivers/net/ethernet/hisilicon/hns/hns_dsaf_main.c
@@ -1648,6 +1648,15 @@ int hns_dsaf_rm_mac_addr(
 				      mac_entry->addr);
 }
 
+static void hns_dsaf_setup_mc_mask(struct dsaf_device *dsaf_dev,
+				   u8 port_num, u8 *mask, u8 *addr)
+{
+	if (MAC_IS_BROADCAST(addr))
+		memset(mask, 0xff, ETH_ALEN);
+	else
+		memcpy(mask, dsaf_dev->mac_cb[port_num]->mc_mask, ETH_ALEN);
+}
+
 static void hns_dsaf_mc_mask_bit_clear(char *dst, const char *src)
 {
 	u16 *a = (u16 *)dst;
@@ -1676,7 +1685,6 @@ int hns_dsaf_add_mac_mc_port(struct dsaf_device *dsaf_dev,
 	struct dsaf_drv_tbl_tcam_key tmp_mac_key;
 	struct dsaf_tbl_tcam_data tcam_data;
 	u8 mc_addr[ETH_ALEN];
-	u8 *mc_mask;
 	int mskid;
 
 	/*chechk mac addr */
@@ -1687,9 +1695,12 @@ int hns_dsaf_add_mac_mc_port(struct dsaf_device *dsaf_dev,
 	}
 
 	ether_addr_copy(mc_addr, mac_entry->addr);
-	mc_mask = dsaf_dev->mac_cb[mac_entry->in_port_num]->mc_mask;
 	if (!AE_IS_VER1(dsaf_dev->dsaf_ver)) {
+		u8 mc_mask[ETH_ALEN];
+
 		/* prepare for key data setting */
+		hns_dsaf_setup_mc_mask(dsaf_dev, mac_entry->in_port_num,
+				       mc_mask, mac_entry->addr);
 		hns_dsaf_mc_mask_bit_clear(mc_addr, mc_mask);
 
 		/* config key mask */
@@ -1844,7 +1855,6 @@ int hns_dsaf_del_mac_mc_port(struct dsaf_device *dsaf_dev,
 	struct dsaf_drv_tbl_tcam_key mask_key, tmp_mac_key;
 	struct dsaf_tbl_tcam_data *pmask_key = NULL;
 	u8 mc_addr[ETH_ALEN];
-	u8 *mc_mask;
 
 	if (!(void *)mac_entry) {
 		dev_err(dsaf_dev->dev,
@@ -1861,14 +1871,17 @@ int hns_dsaf_del_mac_mc_port(struct dsaf_device *dsaf_dev,
 
 	/* always mask vlan_id field */
 	ether_addr_copy(mc_addr, mac_entry->addr);
-	mc_mask = dsaf_dev->mac_cb[mac_entry->in_port_num]->mc_mask;
 
 	if (!AE_IS_VER1(dsaf_dev->dsaf_ver)) {
+		u8 mc_mask[ETH_ALEN];
+
 		/* prepare for key data setting */
+		hns_dsaf_setup_mc_mask(dsaf_dev, mac_entry->in_port_num,
+				       mc_mask, mac_entry->addr);
 		hns_dsaf_mc_mask_bit_clear(mc_addr, mc_mask);
 
 		/* config key mask */
-		hns_dsaf_set_mac_key(dsaf_dev, &mask_key, 0x00, 0xff, mc_addr);
+		hns_dsaf_set_mac_key(dsaf_dev, &mask_key, 0x00, 0xff, mc_mask);
 
 		mask_key.high.val = le32_to_cpu(mask_key.high.val);
 		mask_key.low.val = le32_to_cpu(mask_key.low.val);
-- 
2.7.4

^ permalink raw reply related

* [PATCH net] l2tp: fix refcount leakage on PPPoL2TP sockets
From: Guillaume Nault @ 2018-06-04 16:52 UTC (permalink / raw)
  To: netdev; +Cc: James Chapman

Commit d02ba2a6110c ("l2tp: fix race in pppol2tp_release with session
object destroy") tried to fix a race condition where a PPPoL2TP socket
would disappear while the L2TP session was still using it. However, it
missed the root issue which is that an L2TP session may accept to be
reconnected if its associated socket has entered the release process.

The tentative fix makes the session hold the socket it is connected to.
That saves the kernel from crashing, but introduces refcount leakage,
preventing the socket from completing the release process. Once stalled,
everything the socket depends on can't be released anymore, including
the L2TP session and the l2tp_ppp module.

The root issue is that, when releasing a connected PPPoL2TP socket, the
session's ->sk pointer (RCU-protected) is reset to NULL and we have to
wait for a grace period before destroying the socket. The socket drops
the session in its ->sk_destruct callback function, so the session
will exist until the last reference on the socket is dropped.
Therefore, there is a time frame where pppol2tp_connect() may accept
reconnecting a session, as it only checks ->sk to figure out if the
session is connected. This time frame is shortened by the fact that
pppol2tp_release() calls l2tp_session_delete(), making the session
unreachable before resetting ->sk. However, pppol2tp_connect() may
grab the session before it gets unhashed by l2tp_session_delete(), but
it may test ->sk after the later got reset. The race is not so hard to
trigger and syzbot found a pretty reliable reproducer:
https://syzkaller.appspot.com/bug?id=418578d2a4389074524e04d641eacb091961b2cf

Before d02ba2a6110c, another race could let pppol2tp_release()
overwrite the ->__sk pointer of an L2TP session, thus tricking
pppol2tp_put_sk() into calling sock_put() on a socket that is different
than the one for which pppol2tp_release() was originally called. To get
there, we had to trigger the race described above, therefore having one
PPPoL2TP socket being released, while the session it is connected to is
reconnecting to a different PPPoL2TP socket. When releasing this new
socket fast enough, pppol2tp_release() overwrites the session's
->__sk pointer with the address of the new socket, before the first
pppol2tp_put_sk() call gets scheduled. Then the pppol2tp_put_sk() call
invoked by the original socket will sock_put() the new socket,
potentially dropping its last reference. When the second
pppol2tp_put_sk() finally runs, its socket has already been freed.

With d02ba2a6110c, the session takes a reference on both sockets.
Furthermore, the session's ->sk pointer is reset in the
pppol2tp_session_close() callback function rather than in
pppol2tp_release(). Therefore, ->__sk can't be overwritten and
pppol2tp_put_sk() is called only once (l2tp_session_delete() will only
run pppol2tp_session_close() once, to protect the session against
concurrent deletion requests). Now pppol2tp_put_sk() will properly
sock_put() the original socket, but the new socket will remain, as
l2tp_session_delete() prevented the release process from completing.
Here, we don't depend on the ->__sk race to trigger the bug. Getting
into the pppol2tp_connect() race is enough to leak the reference, no
matter when new socket is released.

So it all boils down to pppol2tp_connect() failing to realise that the
session has already been connected. This patch drops the unneeded extra
reference counting (mostly reverting d02ba2a6110c) and checks that
neither ->sk nor ->__sk is set before allowing a session to be
connected.

Fixes: d02ba2a6110c ("l2tp: fix race in pppol2tp_release with session object destroy")
Signed-off-by: Guillaume Nault <g.nault@alphalink.fr>
---
 net/l2tp/l2tp_ppp.c | 35 +++++++++++++++++------------------
 1 file changed, 17 insertions(+), 18 deletions(-)

diff --git a/net/l2tp/l2tp_ppp.c b/net/l2tp/l2tp_ppp.c
index 1fd9e145076a..466f17646625 100644
--- a/net/l2tp/l2tp_ppp.c
+++ b/net/l2tp/l2tp_ppp.c
@@ -428,16 +428,6 @@ static void pppol2tp_put_sk(struct rcu_head *head)
  */
 static void pppol2tp_session_close(struct l2tp_session *session)
 {
-	struct pppol2tp_session *ps;
-
-	ps = l2tp_session_priv(session);
-	mutex_lock(&ps->sk_lock);
-	ps->__sk = rcu_dereference_protected(ps->sk,
-					     lockdep_is_held(&ps->sk_lock));
-	RCU_INIT_POINTER(ps->sk, NULL);
-	if (ps->__sk)
-		call_rcu(&ps->rcu, pppol2tp_put_sk);
-	mutex_unlock(&ps->sk_lock);
 }
 
 /* Really kill the session socket. (Called from sock_put() if
@@ -480,15 +470,24 @@ static int pppol2tp_release(struct socket *sock)
 	sock_orphan(sk);
 	sock->sk = NULL;
 
-	/* If the socket is associated with a session,
-	 * l2tp_session_delete will call pppol2tp_session_close which
-	 * will drop the session's ref on the socket.
-	 */
 	session = pppol2tp_sock_to_session(sk);
 	if (session) {
+		struct pppol2tp_session *ps;
+
 		l2tp_session_delete(session);
-		/* drop the ref obtained by pppol2tp_sock_to_session */
-		sock_put(sk);
+
+		ps = l2tp_session_priv(session);
+		mutex_lock(&ps->sk_lock);
+		ps->__sk = rcu_dereference_protected(ps->sk,
+						     lockdep_is_held(&ps->sk_lock));
+		RCU_INIT_POINTER(ps->sk, NULL);
+		mutex_unlock(&ps->sk_lock);
+		call_rcu(&ps->rcu, pppol2tp_put_sk);
+
+		/* Rely on the sock_put() call at the end of the function for
+		 * dropping the reference held by pppol2tp_sock_to_session().
+		 * The last reference will be dropped by pppol2tp_put_sk().
+		 */
 	}
 
 	release_sock(sk);
@@ -742,7 +741,8 @@ static int pppol2tp_connect(struct socket *sock, struct sockaddr *uservaddr,
 		 */
 		mutex_lock(&ps->sk_lock);
 		if (rcu_dereference_protected(ps->sk,
-					      lockdep_is_held(&ps->sk_lock))) {
+					      lockdep_is_held(&ps->sk_lock)) ||
+		    ps->__sk) {
 			mutex_unlock(&ps->sk_lock);
 			error = -EEXIST;
 			goto end;
@@ -803,7 +803,6 @@ static int pppol2tp_connect(struct socket *sock, struct sockaddr *uservaddr,
 
 out_no_ppp:
 	/* This is how we get the session context from the socket. */
-	sock_hold(sk);
 	sk->sk_user_data = session;
 	rcu_assign_pointer(ps->sk, sk);
 	mutex_unlock(&ps->sk_lock);
-- 
2.17.1

^ permalink raw reply related

* Re: [PATCH] net-tcp: extend tcp_tw_reuse sysctl to enable loopback only optimization
From: Eric Dumazet @ 2018-06-04 17:18 UTC (permalink / raw)
  To: Maciej Żenczykowski, Maciej Żenczykowski,
	David S . Miller, Eric Dumazet
  Cc: netdev, Neal Cardwell, Yuchung Cheng, Wei Wang
In-Reply-To: <20180603174117.48539-1-zenczykowski@gmail.com>



On 06/03/2018 10:41 AM, Maciej Żenczykowski wrote:
> From: Maciej Żenczykowski <maze@google.com>
> 
> This changes the /proc/sys/net/ipv4/tcp_tw_reuse from a boolean
> to an integer.
> 
> It now takes the values 0, 1 and 2, where 0 and 1 behave as before,
> while 2 enables timewait socket reuse only for sockets that we can
> prove are loopback connections:
>   ie. bound to 'lo' interface or where one of source or destination
>   IPs is 127.0.0.0/8, ::ffff:127.0.0.0/104 or ::1.
> 
> This enables quicker reuse of ephemeral ports for loopback connections
> - where tcp_tw_reuse is 100% safe from a protocol perspective
> (this assumes no artificially induced packet loss on 'lo').
> 
> This also makes estblishing many loopback connections *much* faster
> (allocating ports out of the first half of the ephemeral port range
> is significantly faster, then allocating from the second half)
> 
> Without this change in a 32K ephemeral port space my sample program
> (it just establishes and closes [::1]:ephemeral -> [::1]:server_port
> connections in a tight loop) fails after 32765 connections in 24 seconds.
> With it enabled 50000 connections only take 4.7 seconds.
> 
> This is particularly problematic for IPv6 where we only have one local
> address and cannot play tricks with varying source IP from 127.0.0.0/8
> pool.
> 
> Signed-off-by: Maciej Żenczykowski <maze@google.com>
> Cc: Eric Dumazet <edumazet@google.com>
> Cc: Neal Cardwell <ncardwell@google.com>
> Cc: Yuchung Cheng <ycheng@google.com>
> Cc: Wei Wang <weiwan@google.com>

This seems fine, thanks Maciej

Signed-off-by: Eric Dumazet <edumazet@google.com>

^ permalink raw reply


This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox