Netdev List

Netdev List
 help / color / mirror / Atom feed

* [PATCH net-next v3 6/7] net: dsa: qca8k: Replace GPL boilerplate by SPDX
From: Michal Vokáč @ 2018-05-23  6:20 UTC (permalink / raw)
  To: netdev
  Cc: linux-kernel, devicetree, f.fainelli, vivien.didelot, andrew,
	mark.rutland, robh+dt, davem, michal.vokac
In-Reply-To: <1527056424-14528-1-git-send-email-michal.vokac@ysoft.com>

Replace the GPLv2 license boilerplate with the SPDX license identifier.

Signed-off-by: Michal Vokáč <michal.vokac@ysoft.com>
Reviewed-by: Andrew Lunn <andrew@lunn.ch>
Reviewed-by: Florian Fainelli <f.fainelli@gmail.com>
---
Changes in v3:
 - none

Changes in v2:
 - Add commit message.
 - Add "Reviewed-by" tags from Andrew and Florian.

 drivers/net/dsa/qca8k.c | 10 +---------
 1 file changed, 1 insertion(+), 9 deletions(-)

diff --git a/drivers/net/dsa/qca8k.c b/drivers/net/dsa/qca8k.c
index 7eba987..c834893 100644
--- a/drivers/net/dsa/qca8k.c
+++ b/drivers/net/dsa/qca8k.c
@@ -1,17 +1,9 @@
+// SPDX-License-Identifier: GPL-2.0
 /*
  * Copyright (C) 2009 Felix Fietkau <nbd@nbd.name>
  * Copyright (C) 2011-2012 Gabor Juhos <juhosg@openwrt.org>
  * Copyright (c) 2015, The Linux Foundation. All rights reserved.
  * Copyright (c) 2016 John Crispin <john@phrozen.org>
- *
- * This program is free software; you can redistribute it and/or modify
- * it under the terms of the GNU General Public License version 2 and
- * only version 2 as published by the Free Software Foundation.
- *
- * This program is distributed in the hope that it will be useful,
- * but WITHOUT ANY WARRANTY; without even the implied warranty of
- * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
- * GNU General Public License for more details.
  */
 
 #define DEBUG
-- 
2.7.4

^ permalink raw reply related

* [PATCH net-next v3 7/7] net: dsa: qca8k: Remove redundant parentheses
From: Michal Vokáč @ 2018-05-23  6:20 UTC (permalink / raw)
  To: netdev
  Cc: linux-kernel, devicetree, f.fainelli, vivien.didelot, andrew,
	mark.rutland, robh+dt, davem, michal.vokac
In-Reply-To: <1527056424-14528-1-git-send-email-michal.vokac@ysoft.com>

Fix warning reported by checkpatch.

Signed-off-by: Michal Vokáč <michal.vokac@ysoft.com>
Reviewed-by: Andrew Lunn <andrew@lunn.ch>
Reviewed-by: Florian Fainelli <f.fainelli@gmail.com>
---
Changes in v3:
 - none

Changes in v2:
 - Fix typo in subject.
 - Add "Reviewed-by" tags from Andrew and Florian.

 drivers/net/dsa/qca8k.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/net/dsa/qca8k.c b/drivers/net/dsa/qca8k.c
index c834893..c0da402 100644
--- a/drivers/net/dsa/qca8k.c
+++ b/drivers/net/dsa/qca8k.c
@@ -513,7 +513,7 @@ qca8k_port_set_status(struct qca8k_priv *priv, int port, int enable)
 	pr_debug("qca: port %i set status %i\n", port, enable);
 
 	/* Port 0 and 6 have no internal PHY */
-	if ((port > 0) && (port < 6))
+	if (port > 0 && port < 6)
 		mask |= QCA8K_PORT_STATUS_LINK_AUTO;
 
 	if (enable)
-- 
2.7.4

^ permalink raw reply related

* Re: [net-next] i40iw/i40e: Remove link dependency on i40e
From: Christoph Hellwig @ 2018-05-23  6:20 UTC (permalink / raw)
  To: Jeff Kirsher
  Cc: Jason Gunthorpe, davem, dledford, Sindhu Devale, netdev,
	linux-rdma, nhorman, sassmann, jogreene, Shiraz Saleem
In-Reply-To: <2fcc22b088fd04dafbdc1582d03c4bca21eefade.camel@intel.com>

On Tue, May 22, 2018 at 02:50:32PM -0700, Jeff Kirsher wrote:
> The ABI rarely changes, if at all.  The issue OSV's are seeing is that
> upgrading i40e, requires that i40iw be recompiled even though there
> were no updates/changes to the ABI.

So fscking what.  If you upgrade one part of the kernel you have to
rebuild anbything.  If people try to get away without that that is
their problem, an certainly not something in any way supported by
Linux.

^ permalink raw reply

* [PATCH net-next v3 4/7] net: dsa: qca8k: Force CPU port to its highest bandwidth
From: Michal Vokáč @ 2018-05-23  6:20 UTC (permalink / raw)
  To: netdev
  Cc: linux-kernel, devicetree, f.fainelli, vivien.didelot, andrew,
	mark.rutland, robh+dt, davem, michal.vokac
In-Reply-To: <1527056424-14528-1-git-send-email-michal.vokac@ysoft.com>

By default autonegotiation is enabled to configure MAC on all ports.
For the CPU port autonegotiation can not be used so we need to set
some sensible defaults manually.

This patch forces the default setting of the CPU port to 1000Mbps/full
duplex which is the chip maximum capability.

Also correct size of the bit field used to configure link speed.

Fixes: 6b93fb46480a ("net-next: dsa: add new driver for qca8xxx family")
Signed-off-by: Michal Vokáč <michal.vokac@ysoft.com>
Reviewed-by: Andrew Lunn <andrew@lunn.ch>
Reviewed-by: Florian Fainelli <f.fainelli@gmail.com>
---
Changes in v3:
 - none

Changes in v2:
 - Add "Fixes" tag as pointed out by Florian.
 - Add "Reviewed-by" tags from Andrew and Florian.

 drivers/net/dsa/qca8k.c | 6 +++++-
 drivers/net/dsa/qca8k.h | 6 ++++--
 2 files changed, 9 insertions(+), 3 deletions(-)

diff --git a/drivers/net/dsa/qca8k.c b/drivers/net/dsa/qca8k.c
index 0d224f3..14a108b38 100644
--- a/drivers/net/dsa/qca8k.c
+++ b/drivers/net/dsa/qca8k.c
@@ -537,6 +537,7 @@ qca8k_setup(struct dsa_switch *ds)
 {
 	struct qca8k_priv *priv = (struct qca8k_priv *)ds->priv;
 	int ret, i, phy_mode = -1;
+	u32 mask;
 
 	pr_debug("qca: setup\n");
 
@@ -564,7 +565,10 @@ qca8k_setup(struct dsa_switch *ds)
 	if (ret < 0)
 		return ret;
 
-	/* Enable CPU Port */
+	/* Enable CPU Port, force it to maximum bandwidth and full-duplex */
+	mask = QCA8K_PORT_STATUS_SPEED_1000 | QCA8K_PORT_STATUS_TXFLOW |
+	       QCA8K_PORT_STATUS_RXFLOW | QCA8K_PORT_STATUS_DUPLEX;
+	qca8k_write(priv, QCA8K_REG_PORT_STATUS(QCA8K_CPU_PORT), mask);
 	qca8k_reg_set(priv, QCA8K_REG_GLOBAL_FW_CTRL0,
 		      QCA8K_GLOBAL_FW_CTRL0_CPU_PORT_EN);
 	qca8k_port_set_status(priv, QCA8K_CPU_PORT, 1);
diff --git a/drivers/net/dsa/qca8k.h b/drivers/net/dsa/qca8k.h
index 1cf8a92..5bda165 100644
--- a/drivers/net/dsa/qca8k.h
+++ b/drivers/net/dsa/qca8k.h
@@ -51,8 +51,10 @@
 #define QCA8K_GOL_MAC_ADDR0				0x60
 #define QCA8K_GOL_MAC_ADDR1				0x64
 #define QCA8K_REG_PORT_STATUS(_i)			(0x07c + (_i) * 4)
-#define   QCA8K_PORT_STATUS_SPEED			GENMASK(2, 0)
-#define   QCA8K_PORT_STATUS_SPEED_S			0
+#define   QCA8K_PORT_STATUS_SPEED			GENMASK(1, 0)
+#define   QCA8K_PORT_STATUS_SPEED_10			0
+#define   QCA8K_PORT_STATUS_SPEED_100			0x1
+#define   QCA8K_PORT_STATUS_SPEED_1000			0x2
 #define   QCA8K_PORT_STATUS_TXMAC			BIT(2)
 #define   QCA8K_PORT_STATUS_RXMAC			BIT(3)
 #define   QCA8K_PORT_STATUS_TXFLOW			BIT(4)
-- 
2.7.4

^ permalink raw reply related

* Re: [PATCH net-next v11 2/5] netvsc: refactor notifier/event handling code to use the failover framework
From: Jiri Pirko @ 2018-05-23  6:27 UTC (permalink / raw)
  To: Samudrala, Sridhar
  Cc: Michael S. Tsirkin, stephen, davem, netdev, virtualization,
	virtio-dev, jesse.brandeburg, alexander.h.duyck, kubakici,
	jasowang, loseweigh, aaron.f.brown, anjali.singhai
In-Reply-To: <8f611f3b-88d7-4d41-fd47-d07f11d0f25a@intel.com>

Tue, May 22, 2018 at 10:54:29PM CEST, sridhar.samudrala@intel.com wrote:
>
>
>On 5/22/2018 9:12 AM, Jiri Pirko wrote:
>> Fixing the subj, sorry about that.
>> 
>> Tue, May 22, 2018 at 05:46:21PM CEST, mst@redhat.com wrote:
>> > On Tue, May 22, 2018 at 05:36:14PM +0200, Jiri Pirko wrote:
>> > > Tue, May 22, 2018 at 05:28:42PM CEST, sridhar.samudrala@intel.com wrote:
>> > > > On 5/22/2018 2:08 AM, Jiri Pirko wrote:
>> > > > > Tue, May 22, 2018 at 11:06:37AM CEST, jiri@resnulli.us wrote:
>> > > > > > Tue, May 22, 2018 at 04:06:18AM CEST, sridhar.samudrala@intel.com wrote:
>> > > > > > > Use the registration/notification framework supported by the generic
>> > > > > > > failover infrastructure.
>> > > > > > > 
>> > > > > > > Signed-off-by: Sridhar Samudrala <sridhar.samudrala@intel.com>
>> > > > > > In previous patchset versions, the common code did
>> > > > > > netdev_rx_handler_register() and netdev_upper_dev_link() etc
>> > > > > > (netvsc_vf_join()). Now, this is still done in netvsc. Why?
>> > > > > > 
>> > > > > > This should be part of the common "failover" code.
>> > > > Based on Stephen's feedback on earlier patches, i tried to minimize the changes to
>> > > > netvsc and only commonize the notifier and the main event handler routine.
>> > > > Another complication is that netvsc does part of registration in a delayed workqueue.
>> > > :( This kind of degrades the whole efford of having single solution
>> > > in "failover" module. I think that common parts, as
>> > > netdev_rx_handler_register() and others certainly is should be inside
>> > > the common module. This is not a good time to minimize changes. Let's do
>> > > the thing properly and fix the netvsc mess now.
>> > > 
>> > > 
>> > > > It should be possible to move some of the code from net_failover.c to generic
>> > > > failover.c in future if Stephen is ok with it.
>> > > > 
>> > > > 
>> > > > > Also note that in the current patchset you use IFF_FAILOVER flag for
>> > > > > master, yet for the slave you use IFF_SLAVE. That is wrong.
>> > > > > IFF_FAILOVER_SLAVE should be used.
>> > > > Not sure which code you are referring to.  I only set IFF_FAILOVER_SLAVE
>> > > > in patch 3.
>> > > The existing netvsc driver.
>> > We really can't change netvsc's flags now, even if it's interface is
>> > messy, it's being used in the field. We can add a flag that makes netvsc
>> > behave differently, and if this flag also allows enhanced functionality
>> > userspace will gradually switch.
>> Okay, although in this case, it really does not make much sense, so be
>> it. Leave the netvsc set the ->priv flag to IFF_SLAVE as it is doing
>> now. (This once-wrong-forever-wrong policy is flustrating me).
>> 
>> But since this patchset introduces private flag IFF_FAILOVER and
>> IFF_FAILOVER_SLAVE, and we set IFF_FAILOVER to the netvsc netdev
>> instance, we should also set IFF_FAILOVER_SLAVE to the enslaved VF
>> netdevice to get at least some consistency between virtio_net and
>> netvsc.
>
>OK. I can make this change to set/unset IFF_FAILOVER_SLAVE in the netvsc
>register/unregister routines so that it is consistent with virtio_net.
>
>Based on your discussion with mst, i think we can even remove IFF_SLAVE
>setting on netvsc as it should not impact userspace.  If Stephen is OK
>we can make this change too.
>
>Do you see any other items that need to be resolved for this series to go in
>this merge window?

As I wrote previously, the common code including rx_handler registration
and setting of flags and master link should be done in a common code,
moved away from netvsc code.

Thanks.


>
>
>
>> 
>> > Anything breaking userspace I fully expect Stephen to nack and
>> > IMO with good reason.
>> > 
>> > -- 
>> > MST
>

^ permalink raw reply

* Re: [PATCH net-next 2/3] net/ipv6: Udate fib6_table_lookup tracepoint
From: kbuild test robot @ 2018-05-23  6:46 UTC (permalink / raw)
  To: dsahern; +Cc: kbuild-all, netdev, David Ahern
In-Reply-To: <20180521212443.23612-3-dsahern@kernel.org>

[-- Attachment #1: Type: text/plain, Size: 6415 bytes --]

Hi David,

I love your patch! Yet something to improve:

[auto build test ERROR on net-next/master]

url:    https://github.com/0day-ci/linux/commits/dsahern-kernel-org/net-ipv4-Udate-fib_table_lookup-tracepoint/20180523-083238
config: x86_64-randconfig-s4-05231222 (attached as .config)
compiler: gcc-7 (Debian 7.3.0-16) 7.3.0
reproduce:
        # save the attached .config to linux build tree
        make ARCH=x86_64 

All errors (new ones prefixed by >>):

   net/core/net-traces.o: In function `perf_trace_fib6_table_lookup':
>> include/trace/events/fib6.h:13: undefined reference to `ip6_rt_type_to_error'
   net/core/net-traces.o: In function `trace_event_raw_event_fib6_table_lookup':
>> include/trace/events/fib6.h:13: undefined reference to `ip6_rt_type_to_error'

vim +13 include/trace/events/fib6.h

b811580d David Ahern     2015-11-19  12  
b811580d David Ahern     2015-11-19 @13  TRACE_EVENT(fib6_table_lookup,
b811580d David Ahern     2015-11-19  14  
d4bea421 David Ahern     2018-05-09  15  	TP_PROTO(const struct net *net, const struct fib6_info *f6i,
b65f164d Paolo Abeni     2017-10-19  16  		 struct fib6_table *table, const struct flowi6 *flp),
b811580d David Ahern     2015-11-19  17  
d4bea421 David Ahern     2018-05-09  18  	TP_ARGS(net, f6i, table, flp),
b811580d David Ahern     2015-11-19  19  
b811580d David Ahern     2015-11-19  20  	TP_STRUCT__entry(
b811580d David Ahern     2015-11-19  21  		__field(	u32,	tb_id		)
6d233bf4 David Ahern     2018-05-21  22  		__field(	int,	err		)
b811580d David Ahern     2015-11-19  23  		__field(	int,	oif		)
b811580d David Ahern     2015-11-19  24  		__field(	int,	iif		)
b811580d David Ahern     2015-11-19  25  		__field(	__u8,	tos		)
b811580d David Ahern     2015-11-19  26  		__field(	__u8,	scope		)
b811580d David Ahern     2015-11-19  27  		__field(	__u8,	flags		)
b811580d David Ahern     2015-11-19  28  		__array(	__u8,	src,	16	)
b811580d David Ahern     2015-11-19  29  		__array(	__u8,	dst,	16	)
6d233bf4 David Ahern     2018-05-21  30  		__field(        u16,	sport		)
6d233bf4 David Ahern     2018-05-21  31  		__field(        u16,	dport		)
6d233bf4 David Ahern     2018-05-21  32  		__field(        u8,	proto		)
6d233bf4 David Ahern     2018-05-21  33  		__field(        u8,	rt_type		)
b811580d David Ahern     2015-11-19  34  		__dynamic_array(	char,	name,	IFNAMSIZ )
b811580d David Ahern     2015-11-19  35  		__array(		__u8,	gw,	16	 )
b811580d David Ahern     2015-11-19  36  	),
b811580d David Ahern     2015-11-19  37  
b811580d David Ahern     2015-11-19  38  	TP_fast_assign(
b811580d David Ahern     2015-11-19  39  		struct in6_addr *in6;
b811580d David Ahern     2015-11-19  40  
b65f164d Paolo Abeni     2017-10-19  41  		__entry->tb_id = table->tb6_id;
6d233bf4 David Ahern     2018-05-21  42  		__entry->err = ip6_rt_type_to_error(f6i->fib6_type);
b811580d David Ahern     2015-11-19  43  		__entry->oif = flp->flowi6_oif;
b811580d David Ahern     2015-11-19  44  		__entry->iif = flp->flowi6_iif;
69716a2b Daniel Borkmann 2016-03-18  45  		__entry->tos = ip6_tclass(flp->flowlabel);
b811580d David Ahern     2015-11-19  46  		__entry->scope = flp->flowi6_scope;
b811580d David Ahern     2015-11-19  47  		__entry->flags = flp->flowi6_flags;
b811580d David Ahern     2015-11-19  48  
b811580d David Ahern     2015-11-19  49  		in6 = (struct in6_addr *)__entry->src;
b811580d David Ahern     2015-11-19  50  		*in6 = flp->saddr;
b811580d David Ahern     2015-11-19  51  
b811580d David Ahern     2015-11-19  52  		in6 = (struct in6_addr *)__entry->dst;
b811580d David Ahern     2015-11-19  53  		*in6 = flp->daddr;
b811580d David Ahern     2015-11-19  54  
6d233bf4 David Ahern     2018-05-21  55  		__entry->proto = flp->flowi6_proto;
6d233bf4 David Ahern     2018-05-21  56  		if (__entry->proto == IPPROTO_TCP ||
6d233bf4 David Ahern     2018-05-21  57  		    __entry->proto == IPPROTO_UDP) {
6d233bf4 David Ahern     2018-05-21  58  			__entry->sport = ntohs(flp->fl6_sport);
6d233bf4 David Ahern     2018-05-21  59  			__entry->dport = ntohs(flp->fl6_dport);
6d233bf4 David Ahern     2018-05-21  60  		} else {
6d233bf4 David Ahern     2018-05-21  61  			__entry->sport = 0;
6d233bf4 David Ahern     2018-05-21  62  			__entry->dport = 0;
6d233bf4 David Ahern     2018-05-21  63  		}
6d233bf4 David Ahern     2018-05-21  64  
d4bea421 David Ahern     2018-05-09  65  		if (f6i->fib6_nh.nh_dev) {
d4bea421 David Ahern     2018-05-09  66  			__assign_str(name, f6i->fib6_nh.nh_dev);
b811580d David Ahern     2015-11-19  67  		} else {
6d233bf4 David Ahern     2018-05-21  68  			__assign_str(name, "-");
b811580d David Ahern     2015-11-19  69  		}
d4bea421 David Ahern     2018-05-09  70  		if (f6i == net->ipv6.fib6_null_entry) {
b811580d David Ahern     2015-11-19  71  			struct in6_addr in6_zero = {};
b811580d David Ahern     2015-11-19  72  
b811580d David Ahern     2015-11-19  73  			in6 = (struct in6_addr *)__entry->gw;
b811580d David Ahern     2015-11-19  74  			*in6 = in6_zero;
b811580d David Ahern     2015-11-19  75  
d4bea421 David Ahern     2018-05-09  76  		} else if (f6i) {
b811580d David Ahern     2015-11-19  77  			in6 = (struct in6_addr *)__entry->gw;
d4bea421 David Ahern     2018-05-09  78  			*in6 = f6i->fib6_nh.nh_gw;
b811580d David Ahern     2015-11-19  79  		}
b811580d David Ahern     2015-11-19  80  	),
b811580d David Ahern     2015-11-19  81  
6d233bf4 David Ahern     2018-05-21  82  	TP_printk("table %3u oif %d iif %d proto %u %pI6c/%u -> %pI6c/%u tos %d scope %d flags %x ==> dev %s gw %pI6c err %d",
6d233bf4 David Ahern     2018-05-21  83  		  __entry->tb_id, __entry->oif, __entry->iif, __entry->proto,
6d233bf4 David Ahern     2018-05-21  84  		  __entry->src, __entry->sport, __entry->dst, __entry->dport,
6d233bf4 David Ahern     2018-05-21  85  		  __entry->tos, __entry->scope, __entry->flags,
6d233bf4 David Ahern     2018-05-21  86  		  __get_str(name), __entry->gw, __entry->err)
b811580d David Ahern     2015-11-19  87  );
b811580d David Ahern     2015-11-19  88  

:::::: The code at line 13 was first introduced by commit
:::::: b811580d91e9c0945b0a923dcec3e10cce04ac30 net: IPv6 fib lookup tracepoint

:::::: TO: David Ahern <dsa@cumulusnetworks.com>
:::::: CC: David S. Miller <davem@davemloft.net>

---
0-DAY kernel test infrastructure                Open Source Technology Center
https://lists.01.org/pipermail/kbuild-all                   Intel Corporation

[-- Attachment #2: .config.gz --]
[-- Type: application/gzip, Size: 30959 bytes --]

^ permalink raw reply

* Re: [PATCH net] sctp: fix the issue that flags are ignored when using kernel_connect
From: Xin Long @ 2018-05-23  6:55 UTC (permalink / raw)
  To: David Miller
  Cc: network dev, linux-sctp, Marcelo Ricardo Leitner, Neil Horman,
	Michal Kubecek
In-Reply-To: <20180522.134014.1038689773683362410.davem@davemloft.net>

On Wed, May 23, 2018 at 1:40 AM, David Miller <davem@davemloft.net> wrote:
> From: Xin Long <lucien.xin@gmail.com>
> Date: Sun, 20 May 2018 16:39:10 +0800
>
>> Now sctp uses inet_dgram_connect as its proto_ops .connect, and the flags
>> param can't be passed into its proto .connect where this flags is really
>> needed.
>>
>> sctp works around it by getting flags from socket file in __sctp_connect.
>> It works for connecting from userspace, as inherently the user sock has
>> socket file and it passes f_flags as the flags param into the proto_ops
>> .connect.
>>
>> However, the sock created by sock_create_kern doesn't have a socket file,
>> and it passes the flags (like O_NONBLOCK) by using the flags param in
>> kernel_connect, which calls proto_ops .connect later.
>>
>> So to fix it, this patch defines a new proto_ops .connect for sctp,
>> sctp_inet_connect, which calls __sctp_connect() directly with this
>> flags param. After this, the sctp's proto .connect can be removed.
>>
>> Note that sctp_inet_connect doesn't need to do some checks that are not
>> needed for sctp, which makes thing better than with inet_dgram_connect.
>>
>> Suggested-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com>
>> Signed-off-by: Xin Long <lucien.xin@gmail.com>
>
> Applied, thank you.
>
> I don't see a Fixes: tag, please give me some guidance me wrt. -stable.
The problem is there since the beginning, I think there's no need
to go to -stable.

^ permalink raw reply

* Re: [PATCH net-next v2] net: sched: don't disable bh when accessing action idr
From: Vlad Buslov @ 2018-05-23  6:57 UTC (permalink / raw)
  To: Cong Wang
  Cc: David Miller, Linux Kernel Network Developers, Jamal Hadi Salim,
	Jiri Pirko, LKML
In-Reply-To: <CAM_iQpVMFSxnN6dEi0U4=aw7s7TcJ88ShiNTDXcoRUVL82jTZw@mail.gmail.com>


On Wed 23 May 2018 at 01:10, Cong Wang <xiyou.wangcong@gmail.com> wrote:
> On Mon, May 21, 2018 at 1:03 PM, Vlad Buslov <vladbu@mellanox.com> wrote:
>> Initial net_device implementation used ingress_lock spinlock to synchronize
>> ingress path of device. This lock was used in both process and bh context.
>> In some code paths action map lock was obtained while holding ingress_lock.
>> Commit e1e992e52faa ("[NET_SCHED] protect action config/dump from irqs")
>> modified actions to always disable bh, while using action map lock, in
>> order to prevent deadlock on ingress_lock in softirq. This lock was removed
>> from net_device, so disabling bh, while accessing action map, is no longer
>> necessary.
>>
>> Replace all action idr spinlock usage with regular calls that do not
>> disable bh.
>
> While your patch is probably fine, the above justification seems not.

Sorry if I missed something. My justification is based on commit
description that added bh disable in subject code.

>
> In the past, tc actions could be released in BH context because tc
> filters use call_rcu(). However, I moved them to a workqueue recently.
> So before my change I don't think you can remove the BH protection,
> otherwise race with idr_remove()...

Found commit series that you described. Will modify commit message
accordingly.

Thanks,
Vlad

^ permalink raw reply

* [PATCH] netfilter: uapi: includes linux/types.h
From: YueHaibing @ 2018-05-23  7:03 UTC (permalink / raw)
  To: pablo, kadlec; +Cc: linux-kernel, netdev, coreteam, YueHaibing

gcc-7.3.0 report following warning:
./usr/include/linux/netfilter/nf_osf.h:27: found __[us]{8,16,32,64} type without #include <linux/types.h>

includes linux/types.h to fix it.

Signed-off-by: YueHaibing <yuehaibing@huawei.com>
---
 include/uapi/linux/netfilter/nf_osf.h | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/include/uapi/linux/netfilter/nf_osf.h b/include/uapi/linux/netfilter/nf_osf.h
index 45376ea..d1dbe00 100644
--- a/include/uapi/linux/netfilter/nf_osf.h
+++ b/include/uapi/linux/netfilter/nf_osf.h
@@ -1,6 +1,8 @@
 #ifndef _NF_OSF_H
 #define _NF_OSF_H
 
+#include <linux/types.h>
+
 #define MAXGENRELEN	32
 
 #define NF_OSF_GENRE	(1 << 0)
-- 
2.7.0

^ permalink raw reply related

* Re: [PATCH net-next] sctp: add support for SCTP_REUSE_PORT sockopt
From: Xin Long @ 2018-05-23  7:04 UTC (permalink / raw)
  To: Neil Horman
  Cc: Michael Tuexen, Marcelo Ricardo Leitner, network dev, linux-sctp,
	davem
In-Reply-To: <20180522115151.GA28740@hmswarspite.think-freely.org>

On Tue, May 22, 2018 at 7:51 PM, Neil Horman <nhorman@tuxdriver.com> wrote:
> On Tue, May 22, 2018 at 03:07:57PM +0800, Xin Long wrote:
>> On Mon, May 21, 2018 at 9:48 PM, Neil Horman <nhorman@tuxdriver.com> wrote:
>> > On Mon, May 21, 2018 at 02:16:56PM +0200, Michael Tuexen wrote:
>> >> > On 21. May 2018, at 13:39, Neil Horman <nhorman@tuxdriver.com> wrote:
>> >> >
>> >> > On Sun, May 20, 2018 at 10:54:04PM -0300, Marcelo Ricardo Leitner wrote:
>> >> >> On Sun, May 20, 2018 at 08:50:59PM -0400, Neil Horman wrote:
>> >> >>> On Sat, May 19, 2018 at 03:44:40PM +0800, Xin Long wrote:
>> >> >>>> This feature is actually already supported by sk->sk_reuse which can be
>> >> >>>> set by SO_REUSEADDR. But it's not working exactly as RFC6458 demands in
>> >> >>>> section 8.1.27, like:
>> >> >>>>
>> >> >>>>  - This option only supports one-to-one style SCTP sockets
>> >> >>>>  - This socket option must not be used after calling bind()
>> >> >>>>    or sctp_bindx().
>> >> >>>>
>> >> >>>> Besides, SCTP_REUSE_PORT sockopt should be provided for user's programs.
>> >> >>>> Otherwise, the programs with SCTP_REUSE_PORT from other systems will not
>> >> >>>> work in linux.
>> >> >>>>
>> >> >>>> This patch reuses sk->sk_reuse and works pretty much as SO_REUSEADDR,
>> >> >>>> just with some extra setup limitations that are neeeded when it is being
>> >> >>>> enabled.
>> >> >>>>
>> >> >>>> "It should be noted that the behavior of the socket-level socket option
>> >> >>>> to reuse ports and/or addresses for SCTP sockets is unspecified", so it
>> >> >>>> leaves SO_REUSEADDR as is for the compatibility.
>> >> >>>>
>> >> >>>> Signed-off-by: Xin Long <lucien.xin@gmail.com>
>> >> >>>> ---
>> >> >>>> include/uapi/linux/sctp.h |  1 +
>> >> >>>> net/sctp/socket.c         | 48 +++++++++++++++++++++++++++++++++++++++++++++++
>> >> >>>> 2 files changed, 49 insertions(+)
>> >> >>>>
>> >> >>> A few things:
>> >> >>>
>> >> >>> 1) I agree with Tom, this feature is a complete duplication of the SK_REUSEPORT
>> >> >>> socket option.  I understand that this is an implementation of the option in the
>> >> >>> RFC, but its definately a duplication of a feature, which makes several things
>> >> >>> really messy.
>> >> >>>
>> >> >>> 2) The overloading of the sk_reuse opeion is a bad idea, for several reasons.
>> >> >>> Chief among them is the behavioral interference between this patch and the
>> >> >>> SO_REUSEADDR socket level option, that also sets this feature.  If you set
>> >> >>> sk_reuse via SO_REUSEADDR, you will set the SCTP port reuse feature regardless
>> >> >>> of the bind or 1:1/1:m state of the socket.  Vice versa, if you set this socket
>> >> >>> option via the SCTP_PORT_REUSE option you will inadvertently turn on address
>> >> >>> reuse for the socket.  We can't do that.
>> >> >>
>> >> >> Given your comments, going a bit further here, one other big
>> >> >> implication is that a port would never be able to be considered to
>> >> >> fully meet SCTP standards regarding reuse because a rogue application
>> >> >> may always abuse of the socket level opt to gain access to the port.
>> >> >>
>> >> >> IOW, the patch allows the application to use such restrictions against
>> >> >> itself and nothing else, which undermines the patch idea.
>> >> >>
>> >> > Agreed.
>> >> >
>> >> >> I lack the knowledge on why the SCTP option was proposed in the RFC. I
>> >> >> guess they had a good reason to add the restriction on 1:1/1:m style.
>> >> >> Does the usage of the current imply in any risk to SCTP sockets? If
>> >> >> yes, that would give some grounds for going forward with the SCTP
>> >> >> option.
>> >> >>
>> >> > I'm also not privy to why the sctp option was proposed, though I expect that the
>> >> > lack of standardization of SO_REUSEPORT probably had something to do with it.
>> >> > As for the reasoning behind restriction to only 1:1 sockets, if I had to guess,
>> >> > I would say it likely because it creates ordering difficulty at the application
>> >> > level.
>> >> >
>> >> > CC-ing Michael Tuxen, who I believe had some input on this RFC.  Hopefully he
>> >> > can shed some light on this.
>> >> Dear all,
>> >>
>> >> the reason this was added is to have a specified way to allow a system to
>> >> behave like a client and server making use of the INIT collision.
>> >>
>> >> For 1-to-many style sockets you can do this by creating a socket, binding it,
>> >> calling listen on it and trying to connect to the peer.
>> >>
>> >> For 1-to-1 style sockets you need two sockets for it. One listener and one
>> >> you use to connect (and close it in case of failure, open a new one...).
>> >>
>> >> It was not clear if one can achieve this with SO_REUSEPORT and/or SO_REUSEADDR
>> >> on all platforms. We left that unspecified.
>> >>
>> >> I hope this makes the intention clearer.
>> >>
>> > I think it makes the intention clearer yes, but it unfortunately does nothing in
>> > my mind to clarify how the implementation should best handle the potential
>> > overlap in functionality.  What I see here is that we have two functional paths
>> > (the SO_REUSEPORT path and the SCTP_PORT_REUSE path), which may or may not
>> > (depending on the OS implementation achieve the same functional goal (allowing
>> > multiple sockets to share a port while allowing one socket to listen and the
>> > other connect to a remote peer).  If both implementations do the same thing on a
>> > given platform, we can either just alias one to another and be done, but if they
>> > don't then we either have to implement both paths, and ensure that the
>> > SO_REUSEPORT path is a no-op/error return for SCTP sockets, or that each path
>> > implements a distinct feature set that is cleaarly documented.
>> >
>> > That said, I think we may be in luck.  Looking at the connect and listen paths,
>> > it appears to me that:
>> >
>> > 1) Sockets ignore SO_REUSEPORT in the connect and listen paths (save for any
>> > autobinding) so it would appear that the intent of the SCTP rfc can be honored
>> > via SO_REUSEPORT on linux.
>> >
>> > 2) SO_REUSEPORT prevents changing state after a bind has occured, so we can honr
>> > that part of the SCTP RFC.
>> >
>> > The only missing part is the restriction that SCTP_REUSE_PORT has which is
>> > unaccounted for is that 1:M sockets aren't allowed to enable port reuse.
>> > However, I think the implication from Michaels description above is that port
>> > reuse on a 1:M socket is implicit because a single socket can connect and listen
>> > in that use case, rather than there being a danger to doing so.
>> >
>> > As such, I would propose that we implement this socket option by simply setting
>> > the sk->sk_reuseport field in the sock structure, and document the fact that
>> > linux does not restrict port reuse from 1:M sockets.
>> Note that, sk->sk_reuseport is not affecting linux SCTP socket at all now.
>> linux SCTP socket doesn't really have SO_REUSEADDR (sk->sk_reuse)
>> support, but use sk->sk_reuse as REUSE_PORT, (yes, it is confusing).
>> Pls refer to sctp_get_port_local().
>>
> No, its not used now, but if you do use it to do something specific to SCTP (via
> the SCTP_REUSE_PORT socket option), you risk aliasing SO_REUSEPORT behavior to
> it, and if it doesn't match what the RFC behavior mandates, thats a problem.
>
>> So I'm not sure using sk->sk_reuseport here means we will drop sk->sk_reuse
>> use in linux SCTP but use sk->sk_reuseport instead, or we will think that socket
>> enables 'port reuse' when either of them is set.
>>
> I don't think we would drop the behavior of sk_reuse here, why would we?  As far
> as I can see, the behavior of SO_REUSEADDR (not SO_REUSEPORT), isn't in
> question, is it?
>
>> Note some users may be already using SO_REUSEADDR to enable the 'port
>> reuse' in linux sctp socket. If we're changing to sk->sk_reuseport, we may face
>> a compatibility problem.
>>
> I don't see how the behavior of SO_REUSEADDR is in question here.  All I'm
> suggesting is that you simplify this patch so that the SCTP_REUSE_PORT socket
> option set sk_reuseport, as that option to my eyes conforms to the sctp rfc
> requirements.  Or am I' missing something?
No, I am :)
sk_reuseport seems more complicated than sk_reuse. I kind of mixed them.
I need to check more beofore continuing. Thanks.

>
> Neil
>
>>
>> >
>> > Thoughts?
>> > Neil
>> >
>>

^ permalink raw reply

* Re: [RFC V4 PATCH 7/8] vhost: packed ring support
From: Wei Xu @ 2018-05-23  7:17 UTC (permalink / raw)
  To: Jason Wang
  Cc: mst, kvm, virtualization, netdev, linux-kernel, jfreimann,
	tiwei.bie
In-Reply-To: <e12d4055-6ae6-4d00-ae8b-1acd88633f48@redhat.com>

On Wed, May 23, 2018 at 09:39:28AM +0800, Jason Wang wrote:
> 
> 
> On 2018年05月23日 00:54, Wei Xu wrote:
> >On Wed, May 16, 2018 at 08:32:20PM +0800, Jason Wang wrote:
> >>Signed-off-by: Jason Wang <jasowang@redhat.com>
> >>---
> >>  drivers/vhost/net.c   |   3 +-
> >>  drivers/vhost/vhost.c | 539 ++++++++++++++++++++++++++++++++++++++++++++++----
> >>  drivers/vhost/vhost.h |   8 +-
> >>  3 files changed, 513 insertions(+), 37 deletions(-)
> >>
> >>diff --git a/drivers/vhost/vhost.c b/drivers/vhost/vhost.c
> >>index 8304c30..f2a0f5b 100644
> >>--- a/drivers/vhost/vhost.c
> >>+++ b/drivers/vhost/vhost.c
> >>@@ -1358,6 +1382,8 @@ long vhost_vring_ioctl(struct vhost_dev *d, unsigned int ioctl, void __user *arg
> >>  			break;
> >>  		}
> >>  		vq->last_avail_idx = s.num;
> >>+		if (vhost_has_feature(vq, VIRTIO_F_RING_PACKED))
> >>+			vq->avail_wrap_counter = s.num >> 31;
> >>  		/* Forget the cached index value. */
> >>  		vq->avail_idx = vq->last_avail_idx;
> >>  		break;
> >>@@ -1366,6 +1392,8 @@ long vhost_vring_ioctl(struct vhost_dev *d, unsigned int ioctl, void __user *arg
> >>  		s.num = vq->last_avail_idx;
> >>  		if (copy_to_user(argp, &s, sizeof s))
> >>  			r = -EFAULT;
> >>+		if (vhost_has_feature(vq, VIRTIO_F_RING_PACKED))
> >>+			s.num |= vq->avail_wrap_counter << 31;
> >>  		break;
> >>  	case VHOST_SET_VRING_ADDR:
> >>  		if (copy_from_user(&a, argp, sizeof a)) {
> >'last_used_idx' also needs to be saved/restored here.
> >
> >I have figured out the root cause of broken device after reloading
> >'virtio-net' module, all indices have been reset for a reloading but
> >'last_used_idx' is not properly reset in this case. This confuses
> >handle_rx()/tx().
> >
> >Wei
> >
> 
> Good catch, so we probably need a new ioctl to sync between qemu and vhost.
> 
> Something like VHOST_SET/GET_USED_BASE.

Sure, or can we expand 'vhost_vring_state' to keep them done in a bunch?

> 
> Thanks
> 

^ permalink raw reply

* Re: [PATCH][V2] net/mlx4: fix spelling mistake: "Inrerface" -> "Interface" and rephrase message
From: Tariq Toukan @ 2018-05-23  7:26 UTC (permalink / raw)
  To: Colin King, Tariq Toukan, David S . Miller, netdev, linux-rdma
  Cc: kernel-janitors, linux-kernel
In-Reply-To: <20180522154251.16789-1-colin.king@canonical.com>



On 22/05/2018 6:42 PM, Colin King wrote:
> From: Colin Ian King <colin.king@canonical.com>
> 
> Trivial fix to spelling mistake in mlx4_dbg debug message and also
> change the phrasing of the message so that is is more readable
> 
> Signed-off-by: Colin Ian King <colin.king@canonical.com>
> 
> ---
> V2: rephrase message, as helpfully suggested by Tariq Toukan
> ---
>   drivers/net/ethernet/mellanox/mlx4/intf.c | 2 +-
>   1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/drivers/net/ethernet/mellanox/mlx4/intf.c b/drivers/net/ethernet/mellanox/mlx4/intf.c
> index 2edcce98ab2d..65482f004e50 100644
> --- a/drivers/net/ethernet/mellanox/mlx4/intf.c
> +++ b/drivers/net/ethernet/mellanox/mlx4/intf.c
> @@ -172,7 +172,7 @@ int mlx4_do_bond(struct mlx4_dev *dev, bool enable)
>   		list_add_tail(&dev_ctx->list, &priv->ctx_list);
>   		spin_unlock_irqrestore(&priv->ctx_lock, flags);
>   
> -		mlx4_dbg(dev, "Inrerface for protocol %d restarted with when bonded mode is %s\n",
> +		mlx4_dbg(dev, "Interface for protocol %d restarted with bonded mode %s\n",
>   			 dev_ctx->intf->protocol, enable ?
>   			 "enabled" : "disabled");
>   	}
> 

Thanks Colin.

Reviewed-by: Tariq Toukan <tariqt@mellanox.com>

^ permalink raw reply

* Re: [PATCH net-next v1] netfilter: provide input interface for route lookup for rpfilter
From: Pablo Neira Ayuso @ 2018-05-23  7:26 UTC (permalink / raw)
  To: Vincent Bernat
  Cc: Jozsef Kadlecsik, Florian Westphal, David S. Miller,
	netfilter-devel, netdev
In-Reply-To: <20180520110338.9376-1-vincent@bernat.im>

On Sun, May 20, 2018 at 01:03:38PM +0200, Vincent Bernat wrote:
> In commit 47b7e7f82802, this bit was removed at the same time the
> RT6_LOOKUP_F_IFACE flag was removed. However, it is needed when
> link-local addresses are used, which is a very common case: when
> packets are routed, neighbor solicitations are done using link-local
> addresses. For example, the following neighbor solicitation is not
> matched by "-m rpfilter":
> 
>     IP6 fe80::5254:33ff:fe00:1 > ff02::1:ff00:3: ICMP6, neighbor
>     solicitation, who has 2001:db8::5254:33ff:fe00:3, length 32
> 
> Commit 47b7e7f82802 doesn't quite explain why we shouldn't use
> RT6_LOOKUP_F_IFACE in the rpfilter case. I suppose the interface check
> later in the function would make it redundant. However, the remaining
> of the routing code is using RT6_LOOKUP_F_IFACE when there is no
> source address (which matches rpfilter's case with a non-unicast
> destination, like with neighbor solicitation).

Applied, thanks Vincent.

^ permalink raw reply

* Re: [PATCH net-next v2] net: sched: don't disable bh when accessing action idr
From: Jiri Pirko @ 2018-05-23  7:32 UTC (permalink / raw)
  To: Vlad Buslov; +Cc: davem, netdev, jhs, xiyou.wangcong, linux-kernel
In-Reply-To: <1526932984-11544-1-git-send-email-vladbu@mellanox.com>

Mon, May 21, 2018 at 10:03:04PM CEST, vladbu@mellanox.com wrote:
>Initial net_device implementation used ingress_lock spinlock to synchronize
>ingress path of device. This lock was used in both process and bh context.
>In some code paths action map lock was obtained while holding ingress_lock.
>Commit e1e992e52faa ("[NET_SCHED] protect action config/dump from irqs")
>modified actions to always disable bh, while using action map lock, in
>order to prevent deadlock on ingress_lock in softirq. This lock was removed
>from net_device, so disabling bh, while accessing action map, is no longer
>necessary.
>
>Replace all action idr spinlock usage with regular calls that do not
>disable bh.
>
>Signed-off-by: Vlad Buslov <vladbu@mellanox.com>

Please add my tag to v3, with the description changes requested by Cong.
Acked-by: Jiri Pirko <jiri@mellanox.com>

Thanks!

^ permalink raw reply

* [PATCH net] net/mlx4: Fix irq-unsafe spinlock usage
From: Tariq Toukan @ 2018-05-23  7:41 UTC (permalink / raw)
  To: David S. Miller; +Cc: netdev, Eran Ben Elisha, Jack Morgenstein, Tariq Toukan

From: Jack Morgenstein <jackm@dev.mellanox.co.il>

spin_lock/unlock was used instead of spin_un/lock_irq
in a procedure used in process space, on a spinlock
which can be grabbed in an interrupt.

This caused the stack trace below to be displayed (on kernel
4.17.0-rc1 compiled with Lock Debugging enabled):

[  154.661474] WARNING: SOFTIRQ-safe -> SOFTIRQ-unsafe lock order detected
[  154.668909] 4.17.0-rc1-rdma_rc_mlx+ #3 Tainted: G          I
[  154.675856] -----------------------------------------------------
[  154.682706] modprobe/10159 [HC0[0]:SC0[0]:HE0:SE1] is trying to acquire:
[  154.690254] 00000000f3b0e495 (&(&qp_table->lock)->rlock){+.+.}, at: mlx4_qp_remove+0x20/0x50 [mlx4_core]
[  154.700927]
and this task is already holding:
[  154.707461] 0000000094373b5d (&(&cq->lock)->rlock/1){....}, at: destroy_qp_common+0x111/0x560 [mlx4_ib]
[  154.718028] which would create a new lock dependency:
[  154.723705]  (&(&cq->lock)->rlock/1){....} -> (&(&qp_table->lock)->rlock){+.+.}
[  154.731922]
but this new dependency connects a SOFTIRQ-irq-safe lock:
[  154.740798]  (&(&cq->lock)->rlock){..-.}
[  154.740800]
... which became SOFTIRQ-irq-safe at:
[  154.752163]   _raw_spin_lock_irqsave+0x3e/0x50
[  154.757163]   mlx4_ib_poll_cq+0x36/0x900 [mlx4_ib]
[  154.762554]   ipoib_tx_poll+0x4a/0xf0 [ib_ipoib]
...
to a SOFTIRQ-irq-unsafe lock:
[  154.815603]  (&(&qp_table->lock)->rlock){+.+.}
[  154.815604]
... which became SOFTIRQ-irq-unsafe at:
[  154.827718] ...
[  154.827720]   _raw_spin_lock+0x35/0x50
[  154.833912]   mlx4_qp_lookup+0x1e/0x50 [mlx4_core]
[  154.839302]   mlx4_flow_attach+0x3f/0x3d0 [mlx4_core]

Since mlx4_qp_lookup() is called only in process space, we can
simply replace the spin_un/lock calls with spin_un/lock_irq calls.

Fixes: 6dc06c08bef1 ("net/mlx4: Fix the check in attaching steering rules")
Signed-off-by: Jack Morgenstein <jackm@dev.mellanox.co.il>
Signed-off-by: Tariq Toukan <tariqt@mellanox.com>
---

Hi Dave, please queue for -stable >= 4.12.

---
 drivers/net/ethernet/mellanox/mlx4/qp.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/drivers/net/ethernet/mellanox/mlx4/qp.c b/drivers/net/ethernet/mellanox/mlx4/qp.c
index 3aaf4bad6c5a..427e7a31862c 100644
--- a/drivers/net/ethernet/mellanox/mlx4/qp.c
+++ b/drivers/net/ethernet/mellanox/mlx4/qp.c
@@ -393,11 +393,11 @@ struct mlx4_qp *mlx4_qp_lookup(struct mlx4_dev *dev, u32 qpn)
 	struct mlx4_qp_table *qp_table = &mlx4_priv(dev)->qp_table;
 	struct mlx4_qp *qp;
 
-	spin_lock(&qp_table->lock);
+	spin_lock_irq(&qp_table->lock);
 
 	qp = __mlx4_qp_lookup(dev, qpn);
 
-	spin_unlock(&qp_table->lock);
+	spin_unlock_irq(&qp_table->lock);
 	return qp;
 }
 
-- 
1.8.3.1

^ permalink raw reply related

* Re: [PATCH] netfilter: uapi: includes linux/types.h
From: Pablo Neira Ayuso @ 2018-05-23  7:49 UTC (permalink / raw)
  To: YueHaibing; +Cc: kadlec, linux-kernel, netdev, coreteam
In-Reply-To: <20180523070326.15968-1-yuehaibing@huawei.com>

On Wed, May 23, 2018 at 03:03:26PM +0800, YueHaibing wrote:
> gcc-7.3.0 report following warning:
> ./usr/include/linux/netfilter/nf_osf.h:27: found __[us]{8,16,32,64} type without #include <linux/types.h>
> 
> includes linux/types.h to fix it.

Thanks.

There's already a fix for this in the nf-next queue.

commit 01cd267bff52619a53fa05c930ea5ed53493d21a
Author: Florian Westphal <fw@strlen.de>
Date:   Tue May 8 10:05:38 2018 +0200

    netfilter: fix fallout from xt/nf osf separation

^ permalink raw reply

* [PATCH net-next v3] net: sched: don't disable bh when accessing action idr
From: Vlad Buslov @ 2018-05-23  8:52 UTC (permalink / raw)
  To: jiri; +Cc: netdev, jhs, xiyou.wangcong, davem, linux-kernel, Vlad Buslov
In-Reply-To: <20180523073239.GC3155@nanopsycho>

Initial net_device implementation used ingress_lock spinlock to synchronize
ingress path of device. This lock was used in both process and bh context.
In some code paths action map lock was obtained while holding ingress_lock.
Commit e1e992e52faa ("[NET_SCHED] protect action config/dump from irqs")
modified actions to always disable bh, while using action map lock, in
order to prevent deadlock on ingress_lock in softirq. This lock was removed
in commit 555353cfa1ae ("netdev: The ingress_lock member is no longer
needed.").

Another reason to disable bh was filters delete code, that released actions
in rcu callback. This code was changed to release actions from workqueue
context in patch set "net_sched: close the race between call_rcu() and
cleanup_net()".

With these changes it is no longer necessary to continue disable bh while
accessing action map.

Replace all action idr spinlock usage with regular calls that do not
disable bh.

Acked-by: Jiri Pirko <jiri@mellanox.com>
Acked-by: Jamal Hadi Salim <jhs@mojatatu.com>
Signed-off-by: Vlad Buslov <vladbu@mellanox.com>
---
Changes from V2 to V3:
 - Expanded commit message.

Changes from V1 to V2:
 - Expanded commit message.

 net/sched/act_api.c | 20 ++++++++++----------
 1 file changed, 10 insertions(+), 10 deletions(-)

diff --git a/net/sched/act_api.c b/net/sched/act_api.c
index 72251241665a..3f4cf930f809 100644
--- a/net/sched/act_api.c
+++ b/net/sched/act_api.c
@@ -77,9 +77,9 @@ static void free_tcf(struct tc_action *p)
 
 static void tcf_idr_remove(struct tcf_idrinfo *idrinfo, struct tc_action *p)
 {
-	spin_lock_bh(&idrinfo->lock);
+	spin_lock(&idrinfo->lock);
 	idr_remove(&idrinfo->action_idr, p->tcfa_index);
-	spin_unlock_bh(&idrinfo->lock);
+	spin_unlock(&idrinfo->lock);
 	gen_kill_estimator(&p->tcfa_rate_est);
 	free_tcf(p);
 }
@@ -156,7 +156,7 @@ static int tcf_dump_walker(struct tcf_idrinfo *idrinfo, struct sk_buff *skb,
 	struct tc_action *p;
 	unsigned long id = 1;
 
-	spin_lock_bh(&idrinfo->lock);
+	spin_lock(&idrinfo->lock);
 
 	s_i = cb->args[0];
 
@@ -191,7 +191,7 @@ static int tcf_dump_walker(struct tcf_idrinfo *idrinfo, struct sk_buff *skb,
 	if (index >= 0)
 		cb->args[0] = index + 1;
 
-	spin_unlock_bh(&idrinfo->lock);
+	spin_unlock(&idrinfo->lock);
 	if (n_i) {
 		if (act_flags & TCA_FLAG_LARGE_DUMP_ON)
 			cb->args[1] = n_i;
@@ -261,9 +261,9 @@ static struct tc_action *tcf_idr_lookup(u32 index, struct tcf_idrinfo *idrinfo)
 {
 	struct tc_action *p = NULL;
 
-	spin_lock_bh(&idrinfo->lock);
+	spin_lock(&idrinfo->lock);
 	p = idr_find(&idrinfo->action_idr, index);
-	spin_unlock_bh(&idrinfo->lock);
+	spin_unlock(&idrinfo->lock);
 
 	return p;
 }
@@ -323,7 +323,7 @@ int tcf_idr_create(struct tc_action_net *tn, u32 index, struct nlattr *est,
 	}
 	spin_lock_init(&p->tcfa_lock);
 	idr_preload(GFP_KERNEL);
-	spin_lock_bh(&idrinfo->lock);
+	spin_lock(&idrinfo->lock);
 	/* user doesn't specify an index */
 	if (!index) {
 		index = 1;
@@ -331,7 +331,7 @@ int tcf_idr_create(struct tc_action_net *tn, u32 index, struct nlattr *est,
 	} else {
 		err = idr_alloc_u32(idr, NULL, &index, index, GFP_ATOMIC);
 	}
-	spin_unlock_bh(&idrinfo->lock);
+	spin_unlock(&idrinfo->lock);
 	idr_preload_end();
 	if (err)
 		goto err3;
@@ -369,9 +369,9 @@ void tcf_idr_insert(struct tc_action_net *tn, struct tc_action *a)
 {
 	struct tcf_idrinfo *idrinfo = tn->idrinfo;
 
-	spin_lock_bh(&idrinfo->lock);
+	spin_lock(&idrinfo->lock);
 	idr_replace(&idrinfo->action_idr, a, a->tcfa_index);
-	spin_unlock_bh(&idrinfo->lock);
+	spin_unlock(&idrinfo->lock);
 }
 EXPORT_SYMBOL(tcf_idr_insert);
 
-- 
2.7.5

^ permalink raw reply related

* Re: [PATCH] [RFC] bpf: tracing: new helper bpf_get_current_cgroup_ino
From: Daniel Borkmann @ 2018-05-23  8:57 UTC (permalink / raw)
  To: Y Song, Alexei Starovoitov
  Cc: Alban Crequy, netdev, linux-kernel, containers, cgroups,
	Alban Crequy, tj
In-Reply-To: <CAH3MdRVwmKd84ePvNX+NuAj3TfA_28BObEmzBqGXv=P5_A=8fQ@mail.gmail.com>

On 05/23/2018 06:31 AM, Y Song wrote:
> On Tue, May 22, 2018 at 8:35 PM, Alexei Starovoitov
> <alexei.starovoitov@gmail.com> wrote:
>> On Tue, May 22, 2018 at 08:33:24PM -0700, Y Song wrote:
>>> +       struct cgroup *cgrp = task_dfl_cgroup(current);
>>> +       if (!cgrp)
>>> +               return -EINVAL;
>>
>> why this check is needed?
> 
> No reason :-) Originally I am concerned whether it is possible cgrp
> could be NULL.
> By looking at the code, it SEEMS to me that it could not be NULL, but I am not
> 100% sure (as I am not a cgroup expert). Since you are asking,
> probably means it cannot be NULL, so will remove it in formal upstream patch.

Aside from this the cgrp->kn->id.id is also u64, so overlapping this with
error codes we'll get into a similar issue as with bpf_perf_event_read().

^ permalink raw reply

* Re: [RFC V4 PATCH 7/8] vhost: packed ring support
From: Jason Wang @ 2018-05-23  8:57 UTC (permalink / raw)
  To: Wei Xu; +Cc: mst, kvm, virtualization, netdev, linux-kernel, jfreimann,
	tiwei.bie
In-Reply-To: <20180523071727.GA13373@wei-ubt>



On 2018年05月23日 15:17, Wei Xu wrote:
> On Wed, May 23, 2018 at 09:39:28AM +0800, Jason Wang wrote:
>>
>> On 2018年05月23日 00:54, Wei Xu wrote:
>>> On Wed, May 16, 2018 at 08:32:20PM +0800, Jason Wang wrote:
>>>> Signed-off-by: Jason Wang <jasowang@redhat.com>
>>>> ---
>>>>   drivers/vhost/net.c   |   3 +-
>>>>   drivers/vhost/vhost.c | 539 ++++++++++++++++++++++++++++++++++++++++++++++----
>>>>   drivers/vhost/vhost.h |   8 +-
>>>>   3 files changed, 513 insertions(+), 37 deletions(-)
>>>>
>>>> diff --git a/drivers/vhost/vhost.c b/drivers/vhost/vhost.c
>>>> index 8304c30..f2a0f5b 100644
>>>> --- a/drivers/vhost/vhost.c
>>>> +++ b/drivers/vhost/vhost.c
>>>> @@ -1358,6 +1382,8 @@ long vhost_vring_ioctl(struct vhost_dev *d, unsigned int ioctl, void __user *arg
>>>>   			break;
>>>>   		}
>>>>   		vq->last_avail_idx = s.num;
>>>> +		if (vhost_has_feature(vq, VIRTIO_F_RING_PACKED))
>>>> +			vq->avail_wrap_counter = s.num >> 31;
>>>>   		/* Forget the cached index value. */
>>>>   		vq->avail_idx = vq->last_avail_idx;
>>>>   		break;
>>>> @@ -1366,6 +1392,8 @@ long vhost_vring_ioctl(struct vhost_dev *d, unsigned int ioctl, void __user *arg
>>>>   		s.num = vq->last_avail_idx;
>>>>   		if (copy_to_user(argp, &s, sizeof s))
>>>>   			r = -EFAULT;
>>>> +		if (vhost_has_feature(vq, VIRTIO_F_RING_PACKED))
>>>> +			s.num |= vq->avail_wrap_counter << 31;
>>>>   		break;
>>>>   	case VHOST_SET_VRING_ADDR:
>>>>   		if (copy_from_user(&a, argp, sizeof a)) {
>>> 'last_used_idx' also needs to be saved/restored here.
>>>
>>> I have figured out the root cause of broken device after reloading
>>> 'virtio-net' module, all indices have been reset for a reloading but
>>> 'last_used_idx' is not properly reset in this case. This confuses
>>> handle_rx()/tx().
>>>
>>> Wei
>>>
>> Good catch, so we probably need a new ioctl to sync between qemu and vhost.
>>
>> Something like VHOST_SET/GET_USED_BASE.
> Sure, or can we expand 'vhost_vring_state' to keep them done in a bunch?

It's port of uapi, so we can't.

Thanks

>
>> Thanks
>>

^ permalink raw reply

* Re: [PATCH bpf-next v3 10/10] tools: bpftool: add delimiters to multi-function JITed dumps
From: Daniel Borkmann @ 2018-05-23  9:08 UTC (permalink / raw)
  To: Jakub Kicinski, Sandipan Das
  Cc: ast, netdev, linuxppc-dev, mpe, naveen.n.rao, Quentin Monnet
In-Reply-To: <20180522125544.541c68c8@cakuba>

On 05/22/2018 09:55 PM, Jakub Kicinski wrote:
> On Tue, 22 May 2018 22:46:13 +0530, Sandipan Das wrote:
>> +		if (info.nr_jited_func_lens && info.jited_func_lens) {
>> +			struct kernel_sym *sym = NULL;
>> +			unsigned char *img = buf;
>> +			__u64 *ksyms = NULL;
>> +			__u32 *lens;
>> +			__u32 i;
>> +
>> +			if (info.nr_jited_ksyms) {
>> +				kernel_syms_load(&dd);
>> +				ksyms = (__u64 *) info.jited_ksyms;
>> +			}
>> +
>> +			lens = (__u32 *) info.jited_func_lens;
>> +			for (i = 0; i < info.nr_jited_func_lens; i++) {
>> +				if (ksyms) {
>> +					sym = kernel_syms_search(&dd, ksyms[i]);
>> +					if (sym)
>> +						printf("%s:\n", sym->name);
>> +					else
>> +						printf("%016llx:\n", ksyms[i]);
>> +				}
>> +
>> +				disasm_print_insn(img, lens[i], opcodes, name);
>> +				img += lens[i];
>> +				printf("\n");
>> +			}
>> +		} else {
> 
> The output doesn't seem to be JSON-compatible :(  We try to make sure
> all bpftool command can produce valid JSON when run with -j (or -p)
> switch.
> 
> Would it be possible to make each function a separate JSON object with
> "name" and "insn" array?  Would that work?

Sandipan, could you take a look at this? Given there's json output today we
should definitely try not to break it; presumably this would be one final
respin of your series with this fixed.

Thanks,
Daniel

^ permalink raw reply

* Re: [net-next 1/6] net/dcb: Add dcbnl buffer attribute
From: Jakub Kicinski @ 2018-05-23  9:23 UTC (permalink / raw)
  To: Huy Nguyen
  Cc: Saeed Mahameed, David S. Miller, netdev, Jiri Pirko, Or Gerlitz,
	Parav Pandit, Ido Schimmel
In-Reply-To: <1576e986-6e04-63f3-3569-a105493929a6@mellanox.com>

On Tue, 22 May 2018 20:01:21 -0500, Huy Nguyen wrote:
> On 5/22/2018 1:32 PM, Jakub Kicinski wrote:
> > On Tue, 22 May 2018 10:36:17 -0500, Huy Nguyen wrote:  
> >> On 5/22/2018 12:20 AM, Jakub Kicinski wrote:  
> >>> On Mon, 21 May 2018 14:04:57 -0700, Saeed Mahameed wrote:  
> >>>> From: Huy Nguyen <huyn@mellanox.com>
> >>>>
> >>>> In this patch, we add dcbnl buffer attribute to allow user
> >>>> change the NIC's buffer configuration such as priority
> >>>> to buffer mapping and buffer size of individual buffer.
> >>>>
> >>>> This attribute combined with pfc attribute allows advance user to
> >>>> fine tune the qos setting for specific priority queue. For example,
> >>>> user can give dedicated buffer for one or more prirorities or user
> >>>> can give large buffer to certain priorities.
> >>>>
> >>>> We present an use case scenario where dcbnl buffer attribute configured
> >>>> by advance user helps reduce the latency of messages of different sizes.
> >>>>
> >>>> Scenarios description:
> >>>> On ConnectX-5, we run latency sensitive traffic with
> >>>> small/medium message sizes ranging from 64B to 256KB and bandwidth sensitive
> >>>> traffic with large messages sizes 512KB and 1MB. We group small, medium,
> >>>> and large message sizes to their own pfc enables priorities as follow.
> >>>>     Priorities 1 & 2 (64B, 256B and 1KB)
> >>>>     Priorities 3 & 4 (4KB, 8KB, 16KB, 64KB, 128KB and 256KB)
> >>>>     Priorities 5 & 6 (512KB and 1MB)
> >>>>
> >>>> By default, ConnectX-5 maps all pfc enabled priorities to a single
> >>>> lossless fixed buffer size of 50% of total available buffer space. The
> >>>> other 50% is assigned to lossy buffer. Using dcbnl buffer attribute,
> >>>> we create three equal size lossless buffers. Each buffer has 25% of total
> >>>> available buffer space. Thus, the lossy buffer size reduces to 25%. Priority
> >>>> to lossless  buffer mappings are set as follow.
> >>>>     Priorities 1 & 2 on lossless buffer #1
> >>>>     Priorities 3 & 4 on lossless buffer #2
> >>>>     Priorities 5 & 6 on lossless buffer #3
> >>>>
> >>>> We observe improvements in latency for small and medium message sizes
> >>>> as follows. Please note that the large message sizes bandwidth performance is
> >>>> reduced but the total bandwidth remains the same.
> >>>>     256B message size (42 % latency reduction)
> >>>>     4K message size (21% latency reduction)
> >>>>     64K message size (16% latency reduction)
> >>>>
> >>>> Signed-off-by: Huy Nguyen <huyn@mellanox.com>
> >>>> Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>  
> >>> On a cursory look this bares a lot of resemblance to devlink shared
> >>> buffer configuration ABI.  Did you look into using that?
> >>>
> >>> Just to be clear devlink shared buffer ABIs don't require representors
> >>> and "switchdev mode".
> >>> .  
> >> [HQN] Dear Jakub, there are several reasons that devlink shared buffer
> >> ABI cannot be used:
> >> 1. The devlink shared buffer ABI is written based on the switch cli
> >> which you can find out more
> >> from this link https://community.mellanox.com/docs/DOC-2558.  
> > Devlink API accommodates requirements of simpler (SwitchX2?) and more
> > advanced schemes (present in Spectrum).  The simpler/basic static
> > threshold configurations is exactly what you are doing here, AFAIU.  
> [HQN] Devlink API is tailored specifically for switch.

I hope that is not true, since we (Netronome) are trying to use it for
NIC configuration, too.  We should generalize the API if need be.

> We don't configure threshold configuration explicitly. It is done via
> PFC. Once PFC is enabled on priority, threshold is setup based on our
> proprietary formula that were tested rigorously for performance.

Are you referring to XOFF/XON thresholds?  I don't think the "threshold
type" in devlink API implies we are setting XON/XOFF thresholds
directly :S  If PFC is enabled we may be setting them indirectly,
obviously.

My understanding is that for static threshold type the size parameter
specifies the max amount of memory given pool can consume.

> >> 2. The dcbnl interfaces have been used for QoS settings.  
> > QoS settings != shared buffer configuration.  
> [HQN] I think we have different definition about "shared buffer".
> Please refer to this below switch cli link.
> It explained in detail what is the "shared buffer" in switch means.
> Our NIC does not have "shared buffer" supported.
> https://community.mellanox.com/docs/DOC-2591

Yes, we must have a different definitions of "shared buffer" :)  That
link, however, didn't clarify much for me...  In mlx5 you seem to have a
buffer which is shared between priorities, even if it's not what would
be referred to as shared buffer in switch context.

> >> In NIC, the  buffer configuration are tied to priority (ETS PFC).  
> > Some customers use DCB, a lot (most?) of them don't.  I don't think
> > the "this is a logical extension of a commonly used API" really
> > stands here.  
> [HQN] DCBNL are being actively used. The whole point of this patch
> is to tie buffer configuration with IEEE's priority and is IEEE's PFC 
> configuration.
>
> Ambitious future is to have the switch configure the NIC's buffer
> size and buffer mapping
> via TLV packet and this DCBNL interface. But we won't go too far here.

I think I can understand the motivation, and I think it's a nice thing
to expose!  The only questions are: does it really belong to DCBNL and
can existing API be used?
 
From patch description it seems like your default setup is shared
buffer split 50% (lossy)/50% (all prios) and the example you give
changes that to 25% (lossy)/25%x3 prio groups.

With existing devlink API could this be modelled by three ingress pools
with 2 TCs bound each?

> >> The buffer configuration are not tied to port like switch.  
> > It's tied to a port and TCs, you just have one port but still have 8
> > TCs exactly like a switch...  
> [HQN] No. Our buffer ties to priority not to TCs.

Right, that is a valid point.  Although TCs can be mapped to
priorities.  Some switches may tie buffers to priorities, too.  So
perhaps it's worth extending devlink?

> >> 3. Shared buffer, alpha, threshold are switch specific terms.  
> > IDK how talking about alpha is relevant, it's just one threshold
> > type the API supports.  As far as shared buffer and threshold I
> > don't know if these are switch terms (or how "switch" differs from
> > "NIC" at that level) - I personally find carving shared buffer into
> > pools very intuitive.  
> [HQN] Yes, I understand your point too. The NIC's buffer shares some 
> characteristics with the switch's buffer settings. 

Yes, and if it's not a perfect match we can extend it.

> But this DCB buffer setting is to improve the performance and work
> together with the PFC setting. We would like to keep all the qos
> setting under DCB Netlink as they are designed to be this way.

DCBNL seems to carry standard-based information, which this is not.
mlxsw supports DCBNL, will it also support this buffer configuration
mechanism?

> > Could you give examples of commands/configs one can use with your
> > new ABI?  
> [HQN] The plan is to add the support in lldptool once the kernel code
> is accepted. To test the kernel code,
> I am using small python scripts that works on top of the netlink
> library. It will be like this format which is similar to other
> options in lldptool priority2buffer: 0,2,5,7,1,2,3,6 maps priorities
> 0,1,2,3,4,5,6,7 to buffer 0,2,5,7,1,2,3,6
>      buffer_size: 87296,87296,0,87296,0,0,0,0 set receive buffer size 
> for buffer 0,1,2,3,4,5,6,7 respectively
> >    How does one query the total size of the buffer to be carved?  
> [HQN] This is not necessary. If the total size is too big, error will
> be return via DCB netlink interface.

Right, I'm not saying it's a bug :)  It's just nice when user can be
told the total size without having to probe for it :)

^ permalink raw reply

* Re: [bpf-next V4 PATCH 0/8] xdp: introduce bulking for ndo_xdp_xmit API
From: Daniel Borkmann @ 2018-05-23  9:24 UTC (permalink / raw)
  To: Jesper Dangaard Brouer, netdev, Daniel Borkmann,
	Alexei Starovoitov
  Cc: Christoph Hellwig, BjörnTöpel, Magnus Karlsson,
	makita.toshiaki
In-Reply-To: <152665044141.21055.1276346542020340263.stgit@firesoul>

On 05/18/2018 03:34 PM, Jesper Dangaard Brouer wrote:
> This patchset change ndo_xdp_xmit API to take a bulk of xdp frames.
> 
> In this V4 patchset, I've split-out the patches from 4 to 8 patches.
> I cannot split the driver changes from the NDO change, but I've tried
> to isolated the NDO change together with the driver change as much as
> possible.
> 
> When kernel is compiled with CONFIG_RETPOLINE, every indirect function
> pointer (branch) call hurts performance. For XDP this have a huge
> negative performance impact.
> 
> This patchset reduce the needed (indirect) calls to ndo_xdp_xmit, but
> also prepares for further optimizations.  The DMA APIs use of indirect
> function pointer calls is the primary source the regression.  It is
> left for a followup patchset, to use bulking calls towards the DMA API
> (via the scatter-gatter calls).
> 
> The other advantage of this API change is that drivers can easier
> amortize the cost of any sync/locking scheme, over the bulk of
> packets.  The assumption of the current API is that the driver
> implemementing the NDO will also allocate a dedicated XDP TX queue for
> every CPU in the system.  Which is not always possible or practical to
> configure. E.g. ixgbe cannot load an XDP program on a machine with
> more than 96 CPUs, due to limited hardware TX queues.  E.g. virtio_net
> is hard to configure as it requires manually increasing the
> queues. E.g. tun driver chooses to use a per XDP frame producer lock
> modulo smp_processor_id over avail queues.
> 
> I'm considered adding 'flags' to ndo_xdp_xmit, but it's not part of
> this patchset.  This will be a followup patchset, once we know if this
> will be needed (e.g. for non-map xdp_redirect flush-flag, and if
> AF_XDP chooses to use ndo_xdp_xmit for TX).
> 
> ---
> 
> Jesper Dangaard Brouer (8):
>       bpf: devmap introduce dev_map_enqueue
>       bpf: devmap prepare xdp frames for bulking
>       xdp: add tracepoint for devmap like cpumap have
>       samples/bpf: xdp_monitor use tracepoint xdp:xdp_devmap_xmit
>       xdp: introduce xdp_return_frame_rx_napi
>       xdp: change ndo_xdp_xmit API to support bulking
>       xdp/trace: extend tracepoint in devmap with an err
>       samples/bpf: xdp_monitor use err code from tracepoint xdp:xdp_devmap_xmit

Series applied to bpf-next, thanks Jesper. (Some minor comments in the patches.)

^ permalink raw reply

* Re: [PATCH V3 8/8] dt-bindings: stm32: add compatible for syscon
From: Christophe ROULLIER @ 2018-05-23  9:32 UTC (permalink / raw)
  To: Rob Herring
  Cc: mark.rutland@arm.com, mcoquelin.stm32@gmail.com, Alexandre TORGUE,
	Peppe CAVALLARO, devicetree@vger.kernel.org,
	linux-arm-kernel@lists.infradead.org, netdev@vger.kernel.org,
	andrew@lunn.ch
In-Reply-To: <20180522172238.GA26145@rob-hp-laptop>

On 05/22/2018 07:22 PM, Rob Herring wrote:
> On Mon, May 21, 2018 at 10:07:26AM +0200, Christophe Roullier wrote:
>> This patch describes syscon DT bindings.
>>
>> Signed-off-by: Christophe Roullier <christophe.roullier@st.com>
>> ---
>>   Documentation/devicetree/bindings/arm/stm32.txt | 5 +++++
>>   1 file changed, 5 insertions(+)
>>
>> diff --git a/Documentation/devicetree/bindings/arm/stm32.txt b/Documentation/devicetree/bindings/arm/stm32.txt
>> index 6808ed9..e46ebad 100644
>> --- a/Documentation/devicetree/bindings/arm/stm32.txt
>> +++ b/Documentation/devicetree/bindings/arm/stm32.txt
>> @@ -8,3 +8,8 @@ using one of the following compatible strings:
>>     st,stm32f746
>>     st,stm32h743
>>     st,stm32mp157
>> +
>> +Required nodes:
>> +- syscon: the soc bus node must have a system controller node pointing to the
>> +  global control registers, with the compatible string
>> +  "st,stm32mp157-syscfg", "syscon";
> 
> Please don't mix soc/board bindings with other nodes. So perhaps
> stm32-syscon.txt.
> 
> Rob
> 

Hi Rob,

Is it ok for you with this tree file:

Documentation/devicetree/bindings/arm/stm32/stm32.txt
Documentation/devicetree/bindings/arm/stm32/stm32-syscon.txt
	With stm32-syscon.txt:

---------------------------------------------------
STMicroelectronics STM32 Platforms System Controller

Properties:
  - compatible : should contain two values. First value must be :
		- " st,stm32mp157-syscfg " - for stm32mp157 based SoCs,
		second value must be always "syscon".
  - reg : offset and length of the register set.

Example:
	syscfg: system-config@50020000 {
		compatible = "st,stm32mp157-syscfg", "syscon";
		reg = <0x50020000 0x400>;
	};
---------------------------------------------------

Do we need to update also all MCU family (stm32f4, stm32h7, stm32f7) 
property to be coherent ?

Thanks for your feedback.

Christophe.


^ permalink raw reply

* Re: [net-next 1/6] net/dcb: Add dcbnl buffer attribute
From: Jiri Pirko @ 2018-05-23  9:33 UTC (permalink / raw)
  To: Jakub Kicinski
  Cc: Huy Nguyen, Saeed Mahameed, David S. Miller, netdev, Or Gerlitz,
	Parav Pandit, Ido Schimmel
In-Reply-To: <20180523022314.783e47fa@cakuba>

Wed, May 23, 2018 at 11:23:14AM CEST, jakub.kicinski@netronome.com wrote:
>On Tue, 22 May 2018 20:01:21 -0500, Huy Nguyen wrote:
>> On 5/22/2018 1:32 PM, Jakub Kicinski wrote:
>> > On Tue, 22 May 2018 10:36:17 -0500, Huy Nguyen wrote:  
>> >> On 5/22/2018 12:20 AM, Jakub Kicinski wrote:  
>> >>> On Mon, 21 May 2018 14:04:57 -0700, Saeed Mahameed wrote:  
>> >>>> From: Huy Nguyen <huyn@mellanox.com>
>> >>>>
>> >>>> In this patch, we add dcbnl buffer attribute to allow user
>> >>>> change the NIC's buffer configuration such as priority
>> >>>> to buffer mapping and buffer size of individual buffer.
>> >>>>
>> >>>> This attribute combined with pfc attribute allows advance user to
>> >>>> fine tune the qos setting for specific priority queue. For example,
>> >>>> user can give dedicated buffer for one or more prirorities or user
>> >>>> can give large buffer to certain priorities.
>> >>>>
>> >>>> We present an use case scenario where dcbnl buffer attribute configured
>> >>>> by advance user helps reduce the latency of messages of different sizes.
>> >>>>
>> >>>> Scenarios description:
>> >>>> On ConnectX-5, we run latency sensitive traffic with
>> >>>> small/medium message sizes ranging from 64B to 256KB and bandwidth sensitive
>> >>>> traffic with large messages sizes 512KB and 1MB. We group small, medium,
>> >>>> and large message sizes to their own pfc enables priorities as follow.
>> >>>>     Priorities 1 & 2 (64B, 256B and 1KB)
>> >>>>     Priorities 3 & 4 (4KB, 8KB, 16KB, 64KB, 128KB and 256KB)
>> >>>>     Priorities 5 & 6 (512KB and 1MB)
>> >>>>
>> >>>> By default, ConnectX-5 maps all pfc enabled priorities to a single
>> >>>> lossless fixed buffer size of 50% of total available buffer space. The
>> >>>> other 50% is assigned to lossy buffer. Using dcbnl buffer attribute,
>> >>>> we create three equal size lossless buffers. Each buffer has 25% of total
>> >>>> available buffer space. Thus, the lossy buffer size reduces to 25%. Priority
>> >>>> to lossless  buffer mappings are set as follow.
>> >>>>     Priorities 1 & 2 on lossless buffer #1
>> >>>>     Priorities 3 & 4 on lossless buffer #2
>> >>>>     Priorities 5 & 6 on lossless buffer #3
>> >>>>
>> >>>> We observe improvements in latency for small and medium message sizes
>> >>>> as follows. Please note that the large message sizes bandwidth performance is
>> >>>> reduced but the total bandwidth remains the same.
>> >>>>     256B message size (42 % latency reduction)
>> >>>>     4K message size (21% latency reduction)
>> >>>>     64K message size (16% latency reduction)
>> >>>>
>> >>>> Signed-off-by: Huy Nguyen <huyn@mellanox.com>
>> >>>> Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>  
>> >>> On a cursory look this bares a lot of resemblance to devlink shared
>> >>> buffer configuration ABI.  Did you look into using that?
>> >>>
>> >>> Just to be clear devlink shared buffer ABIs don't require representors
>> >>> and "switchdev mode".
>> >>> .  
>> >> [HQN] Dear Jakub, there are several reasons that devlink shared buffer
>> >> ABI cannot be used:
>> >> 1. The devlink shared buffer ABI is written based on the switch cli
>> >> which you can find out more
>> >> from this link https://community.mellanox.com/docs/DOC-2558.  
>> > Devlink API accommodates requirements of simpler (SwitchX2?) and more
>> > advanced schemes (present in Spectrum).  The simpler/basic static
>> > threshold configurations is exactly what you are doing here, AFAIU.  
>> [HQN] Devlink API is tailored specifically for switch.
>
>I hope that is not true, since we (Netronome) are trying to use it for
>NIC configuration, too.  We should generalize the API if need be.

Sure it is not true. I have no clue why anyone thinks so :/


>
>> We don't configure threshold configuration explicitly. It is done via
>> PFC. Once PFC is enabled on priority, threshold is setup based on our
>> proprietary formula that were tested rigorously for performance.
>
>Are you referring to XOFF/XON thresholds?  I don't think the "threshold
>type" in devlink API implies we are setting XON/XOFF thresholds
>directly :S  If PFC is enabled we may be setting them indirectly,
>obviously.
>
>My understanding is that for static threshold type the size parameter
>specifies the max amount of memory given pool can consume.
>
>> >> 2. The dcbnl interfaces have been used for QoS settings.  
>> > QoS settings != shared buffer configuration.  
>> [HQN] I think we have different definition about "shared buffer".
>> Please refer to this below switch cli link.
>> It explained in detail what is the "shared buffer" in switch means.
>> Our NIC does not have "shared buffer" supported.
>> https://community.mellanox.com/docs/DOC-2591
>
>Yes, we must have a different definitions of "shared buffer" :)  That
>link, however, didn't clarify much for me...  In mlx5 you seem to have a
>buffer which is shared between priorities, even if it's not what would
>be referred to as shared buffer in switch context.

We introduced "shared buffer" in a devlink with "devlink handle" because
the buffer is shared among the whole ASIC, between multiple
ports/netdevs.


>
>> >> In NIC, the  buffer configuration are tied to priority (ETS PFC).  
>> > Some customers use DCB, a lot (most?) of them don't.  I don't think
>> > the "this is a logical extension of a commonly used API" really
>> > stands here.  
>> [HQN] DCBNL are being actively used. The whole point of this patch
>> is to tie buffer configuration with IEEE's priority and is IEEE's PFC 
>> configuration.
>>
>> Ambitious future is to have the switch configure the NIC's buffer
>> size and buffer mapping
>> via TLV packet and this DCBNL interface. But we won't go too far here.
>
>I think I can understand the motivation, and I think it's a nice thing
>to expose!  The only questions are: does it really belong to DCBNL and
>can existing API be used?
> 
>From patch description it seems like your default setup is shared
>buffer split 50% (lossy)/50% (all prios) and the example you give
>changes that to 25% (lossy)/25%x3 prio groups.
>
>With existing devlink API could this be modelled by three ingress pools
>with 2 TCs bound each?
>
>> >> The buffer configuration are not tied to port like switch.  
>> > It's tied to a port and TCs, you just have one port but still have 8
>> > TCs exactly like a switch...  
>> [HQN] No. Our buffer ties to priority not to TCs.
>
>Right, that is a valid point.  Although TCs can be mapped to
>priorities.  Some switches may tie buffers to priorities, too.  So
>perhaps it's worth extending devlink?
>
>> >> 3. Shared buffer, alpha, threshold are switch specific terms.  
>> > IDK how talking about alpha is relevant, it's just one threshold
>> > type the API supports.  As far as shared buffer and threshold I
>> > don't know if these are switch terms (or how "switch" differs from
>> > "NIC" at that level) - I personally find carving shared buffer into
>> > pools very intuitive.  
>> [HQN] Yes, I understand your point too. The NIC's buffer shares some 
>> characteristics with the switch's buffer settings. 
>
>Yes, and if it's not a perfect match we can extend it.
>
>> But this DCB buffer setting is to improve the performance and work
>> together with the PFC setting. We would like to keep all the qos
>> setting under DCB Netlink as they are designed to be this way.
>
>DCBNL seems to carry standard-based information, which this is not.
>mlxsw supports DCBNL, will it also support this buffer configuration
>mechanism?

Ido would provide you more and accurate info. Basically, in mlxsw we use
dcbnl for the things in can cover and was designed for. And for those
things, the netdev is a handle. Config is specific to the netdev. On the
other hand, devlink shared buffer is used for buffer shared between all
netdevs.



>
>> > Could you give examples of commands/configs one can use with your
>> > new ABI?  
>> [HQN] The plan is to add the support in lldptool once the kernel code
>> is accepted. To test the kernel code,
>> I am using small python scripts that works on top of the netlink
>> library. It will be like this format which is similar to other
>> options in lldptool priority2buffer: 0,2,5,7,1,2,3,6 maps priorities
>> 0,1,2,3,4,5,6,7 to buffer 0,2,5,7,1,2,3,6
>>      buffer_size: 87296,87296,0,87296,0,0,0,0 set receive buffer size 
>> for buffer 0,1,2,3,4,5,6,7 respectively
>> >    How does one query the total size of the buffer to be carved?  
>> [HQN] This is not necessary. If the total size is too big, error will
>> be return via DCB netlink interface.
>
>Right, I'm not saying it's a bug :)  It's just nice when user can be
>told the total size without having to probe for it :)

^ permalink raw reply

* Re: [bpf-next V4 PATCH 1/8] bpf: devmap introduce dev_map_enqueue
From: Daniel Borkmann @ 2018-05-23  9:34 UTC (permalink / raw)
  To: Jesper Dangaard Brouer, netdev, Daniel Borkmann,
	Alexei Starovoitov
  Cc: Christoph Hellwig, BjörnTöpel, Magnus Karlsson,
	makita.toshiaki
In-Reply-To: <152665047666.21055.16395048652428831814.stgit@firesoul>

On 05/18/2018 03:34 PM, Jesper Dangaard Brouer wrote:
> Functionality is the same, but the ndo_xdp_xmit call is now
> simply invoked from inside the devmap.c code.
> 
> V2: Fix compile issue reported by kbuild test robot <lkp@intel.com>
> 
> Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
> ---
>  include/linux/bpf.h        |   14 +++++++++++---
>  include/trace/events/xdp.h |    9 ++++++++-
>  kernel/bpf/devmap.c        |   37 +++++++++++++++++++++++++++++++------
>  net/core/filter.c          |   15 ++-------------
>  4 files changed, 52 insertions(+), 23 deletions(-)
> 
> diff --git a/include/linux/bpf.h b/include/linux/bpf.h
> index ed0122b45b63..fc1459bdcafc 100644
> --- a/include/linux/bpf.h
> +++ b/include/linux/bpf.h
> @@ -485,14 +485,15 @@ int bpf_check(struct bpf_prog **fp, union bpf_attr *attr);
>  void bpf_patch_call_args(struct bpf_insn *insn, u32 stack_depth);
>  
>  /* Map specifics */
> -struct net_device  *__dev_map_lookup_elem(struct bpf_map *map, u32 key);
> +struct xdp_buff;
> +struct bpf_dtab_netdev *__dev_map_lookup_elem(struct bpf_map *map, u32 key);

When you have some follow-up patches, would be great if you could clean this
up a bit. At least a newline after the struct declaration would make it a bit
more readable.

>  void __dev_map_insert_ctx(struct bpf_map *map, u32 index);
>  void __dev_map_flush(struct bpf_map *map);
> +int dev_map_enqueue(struct bpf_dtab_netdev *dst, struct xdp_buff *xdp);
>  
>  struct bpf_cpu_map_entry *__cpu_map_lookup_elem(struct bpf_map *map, u32 key);
>  void __cpu_map_insert_ctx(struct bpf_map *map, u32 index);
>  void __cpu_map_flush(struct bpf_map *map);
> -struct xdp_buff;
>  int cpu_map_enqueue(struct bpf_cpu_map_entry *rcpu, struct xdp_buff *xdp,
>  		    struct net_device *dev_rx);
>  
> @@ -571,6 +572,14 @@ static inline void __dev_map_flush(struct bpf_map *map)
>  {
>  }
>  
> +struct xdp_buff;
> +struct bpf_dtab_netdev;
> +static inline

In particular here as well.

> +int dev_map_enqueue(struct bpf_dtab_netdev *dst, struct xdp_buff *xdp)
> +{
> +	return 0;
> +}
> +
>  static inline
>  struct bpf_cpu_map_entry *__cpu_map_lookup_elem(struct bpf_map *map, u32 key)
>  {
> @@ -585,7 +594,6 @@ static inline void __cpu_map_flush(struct bpf_map *map)
>  {
>  }
>  
> -struct xdp_buff;
>  static inline int cpu_map_enqueue(struct bpf_cpu_map_entry *rcpu,
>  				  struct xdp_buff *xdp,
>  				  struct net_device *dev_rx)
> diff --git a/include/trace/events/xdp.h b/include/trace/events/xdp.h
> index 8989a92c571a..96104610d40e 100644
> --- a/include/trace/events/xdp.h
> +++ b/include/trace/events/xdp.h
> @@ -138,11 +138,18 @@ DEFINE_EVENT_PRINT(xdp_redirect_template, xdp_redirect_map_err,
>  		  __entry->map_id, __entry->map_index)
>  );
>  
> +#ifndef __DEVMAP_OBJ_TYPE
> +#define __DEVMAP_OBJ_TYPE
> +struct _bpf_dtab_netdev {
> +	struct net_device *dev;
> +};
> +#endif /* __DEVMAP_OBJ_TYPE */

The __DEVMAP_OBJ_TYPE is not used anywhere, what's its purpose?

Also if you define struct _bpf_dtab_netdev this is rather fragile when mapping
to struct bpf_dtab_netdev. Best way of guarding this is to make a BUILD_BUG_ON()
where you compare both offsets in the struct and bail out compilation whenever
this changes.

>  #define devmap_ifindex(fwd, map)				\
>  	(!fwd ? 0 :						\
>  	 (!map ? 0 :						\
>  	  ((map->map_type == BPF_MAP_TYPE_DEVMAP) ?		\
> -	   ((struct net_device *)fwd)->ifindex : 0)))
> +	   ((struct _bpf_dtab_netdev *)fwd)->dev->ifindex : 0)))
>  
>  #define _trace_xdp_redirect_map(dev, xdp, fwd, map, idx)		\
>  	 trace_xdp_redirect_map(dev, xdp, devmap_ifindex(fwd, map),	\
> diff --git a/kernel/bpf/devmap.c b/kernel/bpf/devmap.c
> index 565f9ece9115..808808bf2bf2 100644
> --- a/kernel/bpf/devmap.c
> +++ b/kernel/bpf/devmap.c
> @@ -48,18 +48,21 @@
>   * calls will fail at this point.
>   */
>  #include <linux/bpf.h>
> +#include <net/xdp.h>
>  #include <linux/filter.h>
>  
>  #define DEV_CREATE_FLAG_MASK \
>  	(BPF_F_NUMA_NODE | BPF_F_RDONLY | BPF_F_WRONLY)
>  
> +/* objects in the map */

This comment is unnecessary.

>  struct bpf_dtab_netdev {
> -	struct net_device *dev;
> +	struct net_device *dev; /* must be first member, due to tracepoint */
>  	struct bpf_dtab *dtab;
>  	unsigned int bit;
>  	struct rcu_head rcu;
>  };
>  
> +/* bpf map container */

Ditto. Why add it? If it's unclear from the code, then it would probably be
better to clean up the code a bit to make it more obvious. Comments should
explain *why* we do certain things, not *what* the code is doing. Latter is
just a sign that the code itself should be improved potentially. :)

>  struct bpf_dtab {
>  	struct bpf_map map;
>  	struct bpf_dtab_netdev **netdev_map;
> @@ -240,21 +243,43 @@ void __dev_map_flush(struct bpf_map *map)
>   * update happens in parallel here a dev_put wont happen until after reading the
>   * ifindex.
>   */
> -struct net_device  *__dev_map_lookup_elem(struct bpf_map *map, u32 key)
> +struct bpf_dtab_netdev *__dev_map_lookup_elem(struct bpf_map *map, u32 key)
>  {
>  	struct bpf_dtab *dtab = container_of(map, struct bpf_dtab, map);
> -	struct bpf_dtab_netdev *dev;
> +	struct bpf_dtab_netdev *obj;
>  
>  	if (key >= map->max_entries)
>  		return NULL;
>  
> -	dev = READ_ONCE(dtab->netdev_map[key]);
> -	return dev ? dev->dev : NULL;
> +	obj = READ_ONCE(dtab->netdev_map[key]);
> +	return obj;
> +}
> +
> +int dev_map_enqueue(struct bpf_dtab_netdev *dst, struct xdp_buff *xdp)
> +{
> +	struct net_device *dev = dst->dev;
> +	struct xdp_frame *xdpf;
> +	int err;
> +
> +	if (!dev->netdev_ops->ndo_xdp_xmit)
> +		return -EOPNOTSUPP;
> +
> +	xdpf = convert_to_xdp_frame(xdp);
> +	if (unlikely(!xdpf))
> +		return -EOVERFLOW;
> +
> +	/* TODO: implement a bulking/enqueue step later */
> +	err = dev->netdev_ops->ndo_xdp_xmit(dev, xdpf);
> +	if (err)
> +		return err;
> +
> +	return 0;

The 'err' is just unnecessary, lets just do:

  return dev->netdev_ops->ndo_xdp_xmit(dev, xdpf);

Later after the other patches this becomes:

  return bq_enqueue(dst, xdpf, dev_rx);

>  }
>  
>  static void *dev_map_lookup_elem(struct bpf_map *map, void *key)
>  {
> -	struct net_device *dev = __dev_map_lookup_elem(map, *(u32 *)key);
> +	struct bpf_dtab_netdev *obj = __dev_map_lookup_elem(map, *(u32 *)key);
> +	struct net_device *dev = dev = obj ? obj->dev : NULL;
>  
>  	return dev ? &dev->ifindex : NULL;
>  }
> diff --git a/net/core/filter.c b/net/core/filter.c
> index 6d0d1560bd70..1447ec94ef74 100644
> --- a/net/core/filter.c
> +++ b/net/core/filter.c
> @@ -3061,20 +3061,9 @@ static int __bpf_tx_xdp_map(struct net_device *dev_rx, void *fwd,
>  
>  	switch (map->map_type) {
>  	case BPF_MAP_TYPE_DEVMAP: {
> -		struct net_device *dev = fwd;
> -		struct xdp_frame *xdpf;
> +		struct bpf_dtab_netdev *dst = fwd;
>  
> -		if (!dev->netdev_ops->ndo_xdp_xmit)
> -			return -EOPNOTSUPP;
> -
> -		xdpf = convert_to_xdp_frame(xdp);
> -		if (unlikely(!xdpf))
> -			return -EOVERFLOW;
> -
> -		/* TODO: move to inside map code instead, for bulk support
> -		 * err = dev_map_enqueue(dev, xdp);
> -		 */
> -		err = dev->netdev_ops->ndo_xdp_xmit(dev, xdpf);
> +		err = dev_map_enqueue(dst, xdp);
>  		if (err)
>  			return err;
>  		__dev_map_insert_ctx(map, index);
> 

^ permalink raw reply

page: next (older) | prev (newer) | latest
- recent:[subjects (threaded)|topics (new)|topics (active)]

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox