Netdev List

Netdev List
 help / color / mirror / Atom feed

* Re: [PATCH iproute2/net-next] ss: initialise variables outside of for loop
From: Stephen Hemminger @ 2016-12-02 22:18 UTC (permalink / raw)
  To: Simon Horman; +Cc: netdev
In-Reply-To: <1480679765-9014-1-git-send-email-simon.horman@netronome.com>

On Fri,  2 Dec 2016 12:56:05 +0100
Simon Horman <simon.horman@netronome.com> wrote:

> Initialise for loops outside of for loops. GCC flags this as being
> out of spec unless C99 or C11 mode is used.
> 
> With this change the entire tree appears to compile cleanly with -Wall.
> 
> $ gcc --version
> gcc (Debian 4.9.2-10) 4.9.2
> ...
> $ make
> ...
> ss.c: In function ‘unix_show_sock’:
> ss.c:3128:4: error: ‘for’ loop initial declarations are only allowed in C99 or C11 mode
> ...
> 
> Signed-off-by: Simon Horman <simon.horman@netronome.com>

Applied.
Note, I used to have -Wall in Makefile but old GCC were broken and would give
aliasing warnings.

^ permalink raw reply

* Re: [PATCH iproute2 V5 0/3] tc: Support for ip tunnel metadata set/unset/classify
From: Stephen Hemminger @ 2016-12-02 22:16 UTC (permalink / raw)
  To: Amir Vadai; +Cc: netdev, David S. Miller, Or Gerlitz, Hadar Har-Zion, Roi Dayan
In-Reply-To: <20161202112515.11705-1-amir@vadai.me>

On Fri,  2 Dec 2016 13:25:12 +0200
Amir Vadai <amir@vadai.me> wrote:

> Hi,
> 
> This short series adds support for matching and setting metadata for ip tunnel
> shared device using the TC system, introduced in kernel 4.9 [1].
> 
> Applied and tested on top of commit b6c7fc61faab ("ss: print new tcp_info
> fields: busy, rwnd-limited, sndbuf-limited times")
> 
> 
> Example usage:
> 
> $ tc filter add dev vxlan0 protocol ip parent ffff: \
>     flower \
>       enc_src_ip 11.11.0.2 \
>       enc_dst_ip 11.11.0.1 \
>       enc_key_id 11 \
>       dst_ip 11.11.11.1 \
>     action mirred egress redirect dev vnet0
> 
> $ tc filter add dev net0 protocol ip parent ffff: \
>     flower \
>       ip_proto 1 \
>       dst_ip 11.11.11.2 \
>     action tunnel_key set \
>       src_ip 11.11.0.1 \
>       dst_ip 11.11.0.2 \
>       id 11 \
>     action mirred egress redirect dev vxlan0
> 
> [1] - d1ba24feb466 ("Merge branch 'act_tunnel_key'")
> 
> Thanks,
> Amir
> 
> Changes from V4:
> - Fix rebase conflicts for net-next
> 
> Changes from V3:
> - Fix bad wording in the man page about the use of the 'unset' operation
> 
> Changes from V2:
> - Use const where needed
> - Don't lose return value
> - Introduce rta_getattr_be16() and rta_getattr_be32()
> 
> Changes from V1:
> - Updated Patch 2/2 ("tc/act_tunnel: Introduce ip tunnel action") commit log
> 	and the man page tc-tunnel_key to reflect the fact that 'unset' operation is
> 	no mandatory.
> 	And describe when it might be needed.
> - Rename the 'release' operation to 'unset'
> 
> Amir Vadai (3):
>   libnetlink: Introduce rta_getattr_be*()
>   tc/cls_flower: Classify packet in ip tunnels
>   tc/act_tunnel: Introduce ip tunnel action
> 
> Amir Vadai (3):
>   libnetlink: Introduce rta_getattr_be*()
>   tc/cls_flower: Classify packet in ip tunnels
>   tc/act_tunnel: Introduce ip tunnel action
> 
>  bridge/fdb.c                         |   4 +-
>  include/libnetlink.h                 |   9 ++
>  include/linux/tc_act/tc_tunnel_key.h |  42 ++++++
>  ip/iplink_geneve.c                   |   2 +-
>  ip/iplink_vxlan.c                    |   2 +-
>  man/man8/tc-flower.8                 |  17 ++-
>  man/man8/tc-tunnel_key.8             | 112 +++++++++++++++
>  tc/Makefile                          |   1 +
>  tc/f_flower.c                        |  84 +++++++++++-
>  tc/m_tunnel_key.c                    | 258 +++++++++++++++++++++++++++++++++++
>  10 files changed, 522 insertions(+), 9 deletions(-)
>  create mode 100644 include/linux/tc_act/tc_tunnel_key.h
>  create mode 100644 man/man8/tc-tunnel_key.8
>  create mode 100644 tc/m_tunnel_key.c
> 
 Series applied

^ permalink raw reply

* Re: [PATCH net-next 3/4] tcp: tsq: add shortcut in tcp_tasklet_func()
From: Eric Dumazet @ 2016-12-02 22:12 UTC (permalink / raw)
  To: Eric Dumazet; +Cc: David S . Miller, netdev
In-Reply-To: <1480703159-2327-4-git-send-email-edumazet@google.com>

On Fri, 2016-12-02 at 10:25 -0800, Eric Dumazet wrote:
> Under high stress, I've seen tcp_tasklet_func() consuming
> ~700 usec, handling ~150 tcp sockets.
> 
> By setting TCP_TSQ_DEFERRED in tcp_wfree(), we give a chance
> for other cpus/threads entering tcp_write_xmit() to grab it,
> allowing tcp_tasklet_func() to skip sockets that already did
> an xmit cycle.
> 
> Signed-off-by: Eric Dumazet <edumazet@google.com>

...

> @@ -884,7 +884,7 @@ void tcp_wfree(struct sk_buff *skb)
>  		if (!(oval & TSQF_THROTTLED) || (oval & TSQF_QUEUED))
>  			goto out;
>  
> -		nval = (oval & ~TSQF_THROTTLED) | TSQF_QUEUED;
> +		nval = (oval & ~TSQF_THROTTLED) | TSQF_QUEUED | TCP_TSQ_DEFERRED;

Typo here...

Should be :
		nval = (oval & ~TSQF_THROTTLED) | TSQF_QUEUED | TCPF_TSQ_DEFERRED;

>  		nval = cmpxchg(&tp->tsq_flags, oval, nval);
>  		if (nval != oval)
>  			continue;

^ permalink raw reply

* Re: [iproute PATCH v2 00/18] ss: Minor code review
From: Stephen Hemminger @ 2016-12-02 22:09 UTC (permalink / raw)
  To: Phil Sutter; +Cc: netdev
In-Reply-To: <20161202104002.17310-1-phil@nwl.cc>

On Fri,  2 Dec 2016 11:39:44 +0100
Phil Sutter <phil@nwl.cc> wrote:

> This is a series of misc changes to ss code which happened as fall-out
> when working on a unified output formatter (still unfinished).
> 
> Changes since v1:
> - Rebased onto current upstream, resolved conflicts in patch 4 generated
>   by previously added SCTP socket support.
> 
> Phil Sutter (18):
>   ss: Mark fall through in arg parsing switch()
>   ss: Drop empty lines in UDP output
>   ss: Add missing tab when printing UNIX details
>   ss: Use sockstat->type in all socket types
>   ss: introduce proc_ctx_print()
>   ss: Drop list traversal from unix_stats_print()
>   ss: Eliminate unix_use_proc()
>   ss: Turn generic_proc_open() wrappers into macros
>   ss: Make tmr_name local to tcp_timer_print()
>   ss: Make user_ent_hash_build_init local to user_ent_hash_build()
>   ss: Make some variables function-local
>   ss: Make slabstat_ids local to get_slabstat()
>   ss: Get rid of useless goto in handle_follow_request()
>   ss: Get rid of single-fielded struct snmpstat
>   ss: Make unix_state_map local to unix_show()
>   ss: Make sstate_name local to sock_state_print()
>   ss: Make sstate_namel local to scan_state()
>   ss: unix_show: No need to initialize members of calloc'ed structs
> 
>  misc/ss.c | 532 ++++++++++++++++++++++++++------------------------------------
>  1 file changed, 224 insertions(+), 308 deletions(-)
> 

Applied, thanks

^ permalink raw reply

* Re: [PATCH iproute2 1/1] tc: updated man page to reflect handle-id use in filter GET command.
From: Stephen Hemminger @ 2016-12-02 22:06 UTC (permalink / raw)
  To: Roman Mashak; +Cc: netdev, sathya.perla
In-Reply-To: <1480623644-27533-1-git-send-email-mrv@mojatatu.com>

On Thu,  1 Dec 2016 15:20:44 -0500
Roman Mashak <mrv@mojatatu.com> wrote:

> Signed-off-by: Roman Mashak <mrv@mojatatu.com>
> ---
>  man/man8/tc.8 | 6 ++++--
>  1 file changed, 4 insertions(+), 2 deletions(-)
> 
> diff --git a/man/man8/tc.8 b/man/man8/tc.8
> index 8a47a2b..d957ffa 100644
> --- a/man/man8/tc.8
> +++ b/man/man8/tc.8
> @@ -32,7 +32,9 @@ class-id ] qdisc
>  DEV
>  .B [ parent
>  qdisc-id
> -.B | root ] protocol
> +.B | root ] [ handle
> +handle-id ]
> +.B protocol
>  protocol
>  .B prio
>  priority filtertype
> @@ -577,7 +579,7 @@ it is created.
>  
>  .TP
>  get
> -Displays a single filter given the interface, parent ID, priority, protocol and handle ID.
> +Displays a single filter given the interface, qdisc-id, priority, protocol and handle-id.
>  
>  .TP
>  show

The proper syntax for man page usage section is to put keywords in bold and any value
that is variable in italic.

I know this whole man page doesn't do this correctly. But that doesn't mean that new
additions should continue with the mistake.

Please revise and resubmit. Extra bonus points for fixing the other bits.

^ permalink raw reply

* Re: [PATCHv2 net-next 4/4] net: dsa: mv88e6xxx: Refactor CPU and DSA port setup
From: Vivien Didelot @ 2016-12-02 22:03 UTC (permalink / raw)
  To: Andrew Lunn; +Cc: David Miller, netdev
In-Reply-To: <20161202211806.GF30716@lunn.ch>

Hi Andrew,

Andrew Lunn <andrew@lunn.ch> writes:

>> The port's EgressMode, FrameMode and EtherType are really tied together
>> to compose the mode of the port.
>
> Setting the EtherType is somewhat separate. It is only needed on ports
> using EDSA. And that can only happen on a CPU port. Humm, actually, i
> set it when i should not. But putting this in a wrapper actually hides
> this.

Wrong. The datasheet says:

    > This Ether Type is used for many features depending upon the mode
    > of the port (as defined by the port’s EgressMode and FrameMode
    > bits – in Port Control, port offset 0x04).

It says that in Normal Network mode, this register can be used to trap,
mirror, etc. Also used in Provider and EDSA modes.

That is why it would be better to wrap them together to ensure correct
values when configuring a port's mode.

>
>> Could you add an helper in chip.c like:
>> 
>>     static int mv88e6xxx_set_port_mode(struct mv88e6xxx_chip *chip, int port,
>>                                        enum mv88e6xxx_frame_mode frame_mode,
>>                                        u16 egress_mode, bool egress_unknown,
>>                                        u16 ethertype)
>>     {
>>         int err;
>> 
>>         if (chip->info->ops->port_set_frame_mode) {
>>             err = chip->info->ops->port_set_frame_mode(chip, port, frame_mode);
>>             if (err)
>>                 return err;
>>         }
>
> Ignoring that it is not implemented here is wrong. It must be
> implemented, or the device is not going to work. It is a question of,
> do we want an oops, or return an error code.

Since that is done at setup time, returning an error is enough IMO to
inform the DSA layer that something went wrong.

Thanks,

        Vivien

^ permalink raw reply

* [PATCH net-next v5] ipv6 addrconf: Implemented enhanced DAD (RFC7527)
From: Erik Nordmark @ 2016-12-02 22:00 UTC (permalink / raw)
  To: davem; +Cc: netdev, hannes, Erik Nordmark, Bob Gilligan

Implemented RFC7527 Enhanced DAD.
IPv6 duplicate address detection can fail if there is some temporary
loopback of Ethernet frames. RFC7527 solves this by including a random
nonce in the NS messages used for DAD, and if an NS is received with the
same nonce it is assumed to be a looped back DAD probe and is ignored.
RFC7527 is enabled by default. Can be disabled by setting both of
conf/{all,interface}/enhanced_dad to zero.

Signed-off-by: Erik Nordmark <nordmark@arista.com>
Signed-off-by: Bob Gilligan <gilligan@arista.com>
Reviewed-by: Hannes Frederic Sowa <hannes@stressinduktion.org>

---

v2: renamed sysctl and made it default to true, plus minor code review fixes
v3: respun with later net-next; fixed whitespace issues
v4: fixed kbuild test robot for route.c; added Reviewed-by
v5: using %pM for printk of nonce

 Documentation/networking/ip-sysctl.txt |  9 +++++++++
 include/linux/ipv6.h                   |  1 +
 include/net/if_inet6.h                 |  1 +
 include/net/ndisc.h                    |  5 ++++-
 include/uapi/linux/ipv6.h              |  1 +
 net/ipv6/addrconf.c                    | 22 +++++++++++++++++++++-
 net/ipv6/ndisc.c                       | 29 ++++++++++++++++++++++++++---
 net/ipv6/route.c                       |  2 +-
 8 files changed, 64 insertions(+), 6 deletions(-)

diff --git a/Documentation/networking/ip-sysctl.txt b/Documentation/networking/ip-sysctl.txt
index 5af48dd..d9ef566 100644
--- a/Documentation/networking/ip-sysctl.txt
+++ b/Documentation/networking/ip-sysctl.txt
@@ -1729,6 +1729,15 @@ drop_unsolicited_na - BOOLEAN
 
 	By default this is turned off.
 
+enhanced_dad - BOOLEAN
+	Include a nonce option in the IPv6 neighbor solicitation messages used for
+	duplicate address detection per RFC7527. A received DAD NS will only signal
+	a duplicate address if the nonce is different. This avoids any false
+	detection of duplicates due to loopback of the NS messages that we send.
+	The nonce option will be sent on an interface unless both of
+	conf/{all,interface}/enhanced_dad are set to FALSE.
+	Default: TRUE
+
 icmp/*:
 ratelimit - INTEGER
 	Limit the maximal rates for sending ICMPv6 packets.
diff --git a/include/linux/ipv6.h b/include/linux/ipv6.h
index 3f95233..671d014 100644
--- a/include/linux/ipv6.h
+++ b/include/linux/ipv6.h
@@ -68,6 +68,7 @@ struct ipv6_devconf {
 #ifdef CONFIG_IPV6_SEG6_HMAC
 	__s32		seg6_require_hmac;
 #endif
+	__u32		enhanced_dad;
 
 	struct ctl_table_header *sysctl_header;
 };
diff --git a/include/net/if_inet6.h b/include/net/if_inet6.h
index b0576cb..0fa4c32 100644
--- a/include/net/if_inet6.h
+++ b/include/net/if_inet6.h
@@ -55,6 +55,7 @@ struct inet6_ifaddr {
 	__u8			stable_privacy_retry;
 
 	__u16			scope;
+	__u64			dad_nonce;
 
 	unsigned long		cstamp;	/* created timestamp */
 	unsigned long		tstamp; /* updated timestamp */
diff --git a/include/net/ndisc.h b/include/net/ndisc.h
index be1fe228..d562a2f 100644
--- a/include/net/ndisc.h
+++ b/include/net/ndisc.h
@@ -31,6 +31,7 @@ enum {
 	ND_OPT_PREFIX_INFO = 3,		/* RFC2461 */
 	ND_OPT_REDIRECT_HDR = 4,	/* RFC2461 */
 	ND_OPT_MTU = 5,			/* RFC2461 */
+	ND_OPT_NONCE = 14,              /* RFC7527 */
 	__ND_OPT_ARRAY_MAX,
 	ND_OPT_ROUTE_INFO = 24,		/* RFC4191 */
 	ND_OPT_RDNSS = 25,		/* RFC5006 */
@@ -121,6 +122,7 @@ struct ndisc_options {
 #define nd_opts_pi_end			nd_opt_array[__ND_OPT_PREFIX_INFO_END]
 #define nd_opts_rh			nd_opt_array[ND_OPT_REDIRECT_HDR]
 #define nd_opts_mtu			nd_opt_array[ND_OPT_MTU]
+#define nd_opts_nonce			nd_opt_array[ND_OPT_NONCE]
 #define nd_802154_opts_src_lladdr	nd_802154_opt_array[ND_OPT_SOURCE_LL_ADDR]
 #define nd_802154_opts_tgt_lladdr	nd_802154_opt_array[ND_OPT_TARGET_LL_ADDR]
 
@@ -398,7 +400,8 @@ static inline struct neighbour *__ipv6_neigh_lookup(struct net_device *dev, cons
 int ndisc_rcv(struct sk_buff *skb);
 
 void ndisc_send_ns(struct net_device *dev, const struct in6_addr *solicit,
-		   const struct in6_addr *daddr, const struct in6_addr *saddr);
+		   const struct in6_addr *daddr, const struct in6_addr *saddr,
+		   u64 nonce);
 
 void ndisc_send_rs(struct net_device *dev,
 		   const struct in6_addr *saddr, const struct in6_addr *daddr);
diff --git a/include/uapi/linux/ipv6.h b/include/uapi/linux/ipv6.h
index 53561be..eaf65dc 100644
--- a/include/uapi/linux/ipv6.h
+++ b/include/uapi/linux/ipv6.h
@@ -181,6 +181,7 @@ enum {
 	DEVCONF_RTR_SOLICIT_MAX_INTERVAL,
 	DEVCONF_SEG6_ENABLED,
 	DEVCONF_SEG6_REQUIRE_HMAC,
+	DEVCONF_ENHANCED_DAD,
 	DEVCONF_MAX
 };
 
diff --git a/net/ipv6/addrconf.c b/net/ipv6/addrconf.c
index 4c387dc..c1e124b 100644
--- a/net/ipv6/addrconf.c
+++ b/net/ipv6/addrconf.c
@@ -242,6 +242,7 @@ static bool ipv6_chk_same_addr(struct net *net, const struct in6_addr *addr,
 #ifdef CONFIG_IPV6_SEG6_HMAC
 	.seg6_require_hmac	= 0,
 #endif
+	.enhanced_dad           = 1,
 };
 
 static struct ipv6_devconf ipv6_devconf_dflt __read_mostly = {
@@ -292,6 +293,7 @@ static bool ipv6_chk_same_addr(struct net *net, const struct in6_addr *addr,
 #ifdef CONFIG_IPV6_SEG6_HMAC
 	.seg6_require_hmac	= 0,
 #endif
+	.enhanced_dad           = 1,
 };
 
 /* Check if a valid qdisc is available */
@@ -3735,12 +3737,21 @@ static void addrconf_dad_kick(struct inet6_ifaddr *ifp)
 {
 	unsigned long rand_num;
 	struct inet6_dev *idev = ifp->idev;
+	u64 nonce;
 
 	if (ifp->flags & IFA_F_OPTIMISTIC)
 		rand_num = 0;
 	else
 		rand_num = prandom_u32() % (idev->cnf.rtr_solicit_delay ? : 1);
 
+	nonce = 0;
+	if (idev->cnf.enhanced_dad ||
+	    dev_net(idev->dev)->ipv6.devconf_all->enhanced_dad) {
+		do
+			get_random_bytes(&nonce, 6);
+		while (nonce == 0);
+	}
+	ifp->dad_nonce = nonce;
 	ifp->dad_probes = idev->cnf.dad_transmits;
 	addrconf_mod_dad_work(ifp, rand_num);
 }
@@ -3918,7 +3929,8 @@ static void addrconf_dad_work(struct work_struct *w)
 
 	/* send a neighbour solicitation for our addr */
 	addrconf_addr_solict_mult(&ifp->addr, &mcaddr);
-	ndisc_send_ns(ifp->idev->dev, &ifp->addr, &mcaddr, &in6addr_any);
+	ndisc_send_ns(ifp->idev->dev, &ifp->addr, &mcaddr, &in6addr_any,
+		      ifp->dad_nonce);
 out:
 	in6_ifa_put(ifp);
 	rtnl_unlock();
@@ -4962,6 +4974,7 @@ static inline void ipv6_store_devconf(struct ipv6_devconf *cnf,
 #ifdef CONFIG_IPV6_SEG6_HMAC
 	array[DEVCONF_SEG6_REQUIRE_HMAC] = cnf->seg6_require_hmac;
 #endif
+	array[DEVCONF_ENHANCED_DAD] = cnf->enhanced_dad;
 }
 
 static inline size_t inet6_ifla6_size(void)
@@ -6070,6 +6083,13 @@ int addrconf_sysctl_ignore_routes_with_linkdown(struct ctl_table *ctl,
 	},
 #endif
 	{
+		.procname       = "enhanced_dad",
+		.data           = &ipv6_devconf.enhanced_dad,
+		.maxlen         = sizeof(int),
+		.mode           = 0644,
+		.proc_handler   = proc_dointvec,
+	},
+	{
 		/* sentinel */
 	}
 };
diff --git a/net/ipv6/ndisc.c b/net/ipv6/ndisc.c
index d8e6714..7ebac63 100644
--- a/net/ipv6/ndisc.c
+++ b/net/ipv6/ndisc.c
@@ -233,6 +233,7 @@ struct ndisc_options *ndisc_parse_options(const struct net_device *dev,
 		case ND_OPT_SOURCE_LL_ADDR:
 		case ND_OPT_TARGET_LL_ADDR:
 		case ND_OPT_MTU:
+		case ND_OPT_NONCE:
 		case ND_OPT_REDIRECT_HDR:
 			if (ndopts->nd_opt_array[nd_opt->nd_opt_type]) {
 				ND_PRINTK(2, warn,
@@ -568,7 +569,8 @@ static void ndisc_send_unsol_na(struct net_device *dev)
 }
 
 void ndisc_send_ns(struct net_device *dev, const struct in6_addr *solicit,
-		   const struct in6_addr *daddr, const struct in6_addr *saddr)
+		   const struct in6_addr *daddr, const struct in6_addr *saddr,
+		   u64 nonce)
 {
 	struct sk_buff *skb;
 	struct in6_addr addr_buf;
@@ -588,6 +590,8 @@ void ndisc_send_ns(struct net_device *dev, const struct in6_addr *solicit,
 	if (inc_opt)
 		optlen += ndisc_opt_addr_space(dev,
 					       NDISC_NEIGHBOUR_SOLICITATION);
+	if (nonce != 0)
+		optlen += 8;
 
 	skb = ndisc_alloc_skb(dev, sizeof(*msg) + optlen);
 	if (!skb)
@@ -605,6 +609,13 @@ void ndisc_send_ns(struct net_device *dev, const struct in6_addr *solicit,
 		ndisc_fill_addr_option(skb, ND_OPT_SOURCE_LL_ADDR,
 				       dev->dev_addr,
 				       NDISC_NEIGHBOUR_SOLICITATION);
+	if (nonce != 0) {
+		u8 *opt = skb_put(skb, 8);
+
+		opt[0] = ND_OPT_NONCE;
+		opt[1] = 8 >> 3;
+		memcpy(opt + 2, &nonce, 6);
+	}
 
 	ndisc_send_skb(skb, daddr, saddr);
 }
@@ -693,12 +704,12 @@ static void ndisc_solicit(struct neighbour *neigh, struct sk_buff *skb)
 				  "%s: trying to ucast probe in NUD_INVALID: %pI6\n",
 				  __func__, target);
 		}
-		ndisc_send_ns(dev, target, target, saddr);
+		ndisc_send_ns(dev, target, target, saddr, 0);
 	} else if ((probes -= NEIGH_VAR(neigh->parms, APP_PROBES)) < 0) {
 		neigh_app_ns(neigh);
 	} else {
 		addrconf_addr_solict_mult(target, &mcaddr);
-		ndisc_send_ns(dev, target, &mcaddr, saddr);
+		ndisc_send_ns(dev, target, &mcaddr, saddr, 0);
 	}
 }
 
@@ -742,6 +753,7 @@ static void ndisc_recv_ns(struct sk_buff *skb)
 	int dad = ipv6_addr_any(saddr);
 	bool inc;
 	int is_router = -1;
+	u64 nonce = 0;
 
 	if (skb->len < sizeof(struct nd_msg)) {
 		ND_PRINTK(2, warn, "NS: packet too short\n");
@@ -786,6 +798,8 @@ static void ndisc_recv_ns(struct sk_buff *skb)
 			return;
 		}
 	}
+	if (ndopts.nd_opts_nonce)
+		memcpy(&nonce, (u8 *)(ndopts.nd_opts_nonce + 1), 6);
 
 	inc = ipv6_addr_is_multicast(daddr);
 
@@ -794,6 +808,15 @@ static void ndisc_recv_ns(struct sk_buff *skb)
 have_ifp:
 		if (ifp->flags & (IFA_F_TENTATIVE|IFA_F_OPTIMISTIC)) {
 			if (dad) {
+				if (nonce != 0 && ifp->dad_nonce == nonce) {
+					u8 *np = (u8 *)&nonce;
+					/* Matching nonce if looped back */
+					ND_PRINTK(2, notice,
+						  "%s: IPv6 DAD loopback for address %pI6c nonce %pM ignored\n",
+						  ifp->idev->dev->name,
+						  &ifp->addr, np);
+					goto out;
+				}
 				/*
 				 * We are colliding with another node
 				 * who is doing DAD
diff --git a/net/ipv6/route.c b/net/ipv6/route.c
index b317bb1..aac7818 100644
--- a/net/ipv6/route.c
+++ b/net/ipv6/route.c
@@ -527,7 +527,7 @@ static void rt6_probe_deferred(struct work_struct *w)
 		container_of(w, struct __rt6_probe_work, work);
 
 	addrconf_addr_solict_mult(&work->target, &mcaddr);
-	ndisc_send_ns(work->dev, &work->target, &mcaddr, NULL);
+	ndisc_send_ns(work->dev, &work->target, &mcaddr, NULL, 0);
 	dev_put(work->dev);
 	kfree(work);
 }
-- 
1.8.1.4

^ permalink raw reply related

* Re: [PATCHv2 net-next 2/4] net: dsa: mv88e6xxx: Monitor and Management tables
From: Vivien Didelot @ 2016-12-02 21:53 UTC (permalink / raw)
  To: Andrew Lunn; +Cc: David Miller, netdev
In-Reply-To: <20161202205656.GD30716@lunn.ch>

Hi Andrew,

Andrew Lunn <andrew@lunn.ch> writes:

> On Fri, Dec 02, 2016 at 02:32:39PM -0500, Vivien Didelot wrote:
>> Hi Andrew,
>> 
>> Andrew Lunn <andrew@lunn.ch> writes:
>> 
>> > @@ -3184,6 +3186,8 @@ static const struct mv88e6xxx_ops mv88e6085_ops = {
>> >  	.stats_get_sset_count = mv88e6095_stats_get_sset_count,
>> >  	.stats_get_strings = mv88e6095_stats_get_strings,
>> >  	.stats_get_stats = mv88e6095_stats_get_stats,
>> > +	.g1_set_cpu_port = mv88e6095_g1_set_cpu_port,
>> > +	.g1_set_egress_port = mv88e6095_g1_set_egress_port,
>> >  };
>> 
>> I like the implementation in this version better. But please explain me
>> why you are prefixing these operations with g1_?
>
> The prefix gives some basic grouping. port_ indicates it operates on a
> port, and is likely to be found in port.c. stats_ indicates it
> operates on statistics, ppu that is operates on the phy polling unit.

Yes, port_* operations operate on ports. But the port.c file is there to
implement the function of "Port Registers". "Port" can be confusing, but
it refers to the SMI internal device at address 0xsomething.

"port_", "ppu_", "stats_", in the mv88e6xxx_ops structure just give
implicit namespaces for the **features**, not their location!

> We are going to have some things which don't fall into a simple
> category, like these two. But it would however be nice to group them,
> so i picked which register bank they are in. These operations are
> always in g1. It is a useful hint as to where to find the different
> variants.

Absolutely not!

    .set_egress_port = mv88e6095_g1_set_egress_port,

                                 ^
                                 That is the useful hint!

At the higher level of chip.c, we don't care about where is implemented
the switch MAC setter. We just have to call the correctly defined
.set_switch_mac routine.

However if you do care to know, its _ops.set_switch_mac pointer will
tell you (_g1 vs _g2 prefix).

>> But let's imagine we can set the CPU port in some Global 2 registers.
>> You are going to wrap this in chip.c with something like:
>> 
>>     int mv88e6xxx_set_cpu_port(struct mv88e6xxx_chip *chip, int port)
>>     {
>>         if (chip->info->ops->g2_set_cpu_port)
>>             return chip->info->ops->g2_set_cpu_port(chip, port);
>>         else if (chip->info->ops->g1_set_cpu_port)
>>             return chip->info->ops->g1_set_cpu_port(chip, port);
>>         else
>>             return -EOPNOTSUPP;
>>     }
>
> I answered in one of my other emails. Frames with reserved MAC
> addresses can be forwarded to the CPU. For most devices, this is a g2
> operation. However, for 6390, it is a g1. In that case, my code does
> not use a prefix. Not having a prefix, when all the others do, also
> gives you information. It means the ops are spread around and you need
> to make a bigger effort to go find them.

Again, absolutely not. This is your interpretation of having a prefix or
not. A chip has only one way to access a feature, not two. Since you
seem to be focused on the Rsvd2CPU feature, here's an example with it:

What's the point of writing this:

    /* Consider the given MAC as MGMT */
    int mv88e6xxx_reserve_mac(struct mv88e6xxx_chip *chip, u8 *addr)
    {
            if (mac_is_0x(addr)) {
                    if (chip->info->ops->g1_set_rsvd2cpu0)
                            return chip->info->ops->g1_set_rsvd2cpu0(...);
                    else if (chip->info->ops->g2_set_rsvd2cpu0)
                            return chip->info->ops->g2_set_rsvd2cpu0(...);
            } else if (mac_is_2x(addr)) {
                    if (chip->info->ops->g1_set_rsvd2cpu2)
                            return chip->info->ops->g1_set_rsvd2cpu2(...);
                    else if (chip->info->ops->g2_set_rsvd2cpu2)
                            return chip->info->ops->g2_set_rsvd2cpu2(...);
            }

            return mv88e6xxx_atu_load(chip, addr, MGMT);
    }

Compared to this:

    /* Consider the given MAC as MGMT */
    int mv88e6xxx_reserve_mac(struct mv88e6xxx_chip *chip, u8 *addr)
    {
            if (mac_is_0x(addr)) {
                    if (chip->info->ops->set_rsvd2cpu0)
                            return chip->info->ops->set_rsvd2cpu0(...);
            } else if (mac_is_2x(addr)) {
                    if (chip->info->ops->set_rsvd2cpu2)
                            return chip->info->ops->set_rsvd2cpu2(...);
            }

            return mv88e6xxx_atu_load(chip, addr, MGMT);
    }

Your higher level API (chip.c) doesn't need to know where is implemented
a given feature. It just needs to know if it supports it or not.

Thanks,

        Vivien

^ permalink raw reply

* [PATCH] adm80211: add checks for dma mapping errors
From: Alexey Khoroshilov @ 2016-12-02 21:52 UTC (permalink / raw)
  To: Kalle Valo
  Cc: Alexey Khoroshilov, linux-wireless, netdev, linux-kernel,
	ldv-project

The driver does not check if mapping dma memory succeed.
The patch adds the checks and failure handling.

Found by Linux Driver Verification project (linuxtesting.org).

Signed-off-by: Alexey Khoroshilov <khoroshilov@ispras.ru>
---
 drivers/net/wireless/admtek/adm8211.c | 24 ++++++++++++++++++++++--
 1 file changed, 22 insertions(+), 2 deletions(-)

diff --git a/drivers/net/wireless/admtek/adm8211.c b/drivers/net/wireless/admtek/adm8211.c
index 70ecd82d674d..2b4a3eb38dfa 100644
--- a/drivers/net/wireless/admtek/adm8211.c
+++ b/drivers/net/wireless/admtek/adm8211.c
@@ -413,6 +413,13 @@ static void adm8211_interrupt_rci(struct ieee80211_hw *dev)
 						       skb_tail_pointer(newskb),
 						       RX_PKT_SIZE,
 						       PCI_DMA_FROMDEVICE);
+				if (pci_dma_mapping_error(priv->pdev,
+					   priv->rx_buffers[entry].mapping)) {
+					priv->rx_buffers[entry].skb = NULL;
+					dev_kfree_skb(newskb);
+					skb = NULL;
+					/* TODO: update rx dropped stats */
+				}
 			} else {
 				skb = NULL;
 				/* TODO: update rx dropped stats */
@@ -1450,6 +1457,12 @@ static int adm8211_init_rings(struct ieee80211_hw *dev)
 						  skb_tail_pointer(rx_info->skb),
 						  RX_PKT_SIZE,
 						  PCI_DMA_FROMDEVICE);
+		if (pci_dma_mapping_error(priv->pdev, rx_info->mapping)) {
+			dev_kfree_skb(rx_info->skb);
+			rx_info->skb = NULL;
+			break;
+		}
+
 		desc->buffer1 = cpu_to_le32(rx_info->mapping);
 		desc->status = cpu_to_le32(RDES0_STATUS_OWN | RDES0_STATUS_SQL);
 	}
@@ -1613,7 +1626,7 @@ static void adm8211_calc_durations(int *dur, int *plcp, size_t payload_len, int
 }
 
 /* Transmit skb w/adm8211_tx_hdr (802.11 header created by hardware) */
-static void adm8211_tx_raw(struct ieee80211_hw *dev, struct sk_buff *skb,
+static int adm8211_tx_raw(struct ieee80211_hw *dev, struct sk_buff *skb,
 			   u16 plcp_signal,
 			   size_t hdrlen)
 {
@@ -1625,6 +1638,8 @@ static void adm8211_tx_raw(struct ieee80211_hw *dev, struct sk_buff *skb,
 
 	mapping = pci_map_single(priv->pdev, skb->data, skb->len,
 				 PCI_DMA_TODEVICE);
+	if (pci_dma_mapping_error(priv->pdev, mapping))
+		return -ENOMEM;
 
 	spin_lock_irqsave(&priv->lock, flags);
 
@@ -1657,6 +1672,8 @@ static void adm8211_tx_raw(struct ieee80211_hw *dev, struct sk_buff *skb,
 
 	/* Trigger transmit poll */
 	ADM8211_CSR_WRITE(TDR, 0);
+
+	return 0;
 }
 
 /* Put adm8211_tx_hdr on skb and transmit */
@@ -1710,7 +1727,10 @@ static void adm8211_tx(struct ieee80211_hw *dev,
 
 	txhdr->retry_limit = info->control.rates[0].count;
 
-	adm8211_tx_raw(dev, skb, plcp_signal, hdrlen);
+	if (adm8211_tx_raw(dev, skb, plcp_signal, hdrlen)) {
+		/* Drop packet */
+		ieee80211_free_txskb(dev, skb);
+	}
 }
 
 static int adm8211_alloc_rings(struct ieee80211_hw *dev)
-- 
2.7.4

^ permalink raw reply related

* Re: [PATCH next] dctcp: update cwnd on congestion event
From: Florian Westphal @ 2016-12-02 21:49 UTC (permalink / raw)
  To: Neal Cardwell
  Cc: Florian Westphal, Netdev, Lawrence Brakmo, Andrew Shewmaker,
	Glenn Judd, Daniel Borkmann, Yuchung Cheng, Eric Dumazet,
	Soheil Hassas Yeganeh
In-Reply-To: <CADVnQymNZ+FQ5xJ92HuSkheAJfOTUyh-PsA11bxRWERZkD5zdQ@mail.gmail.com>

Neal Cardwell <ncardwell@google.com> wrote:
> On Mon, Nov 14, 2016 at 10:42 AM, Florian Westphal <fw@strlen.de> wrote:
> >
> > draft-ietf-tcpm-dctcp-02 says:
> >
> > ... when the sender receives an indication of congestion
> > (ECE), the sender SHOULD update cwnd as follows:
> >
> >          cwnd = cwnd * (1 - DCTCP.Alpha / 2)
> >
> > So, lets do this and reduce cwnd more smoothly (and faster), as per
> > current congestion estimate.
> 
> AFAICT this is doing a multiplicative decrease of cwnd on every ACK
> that has an ECE bit.
> 
> If I am reading the code correctly, then I would have two concerns:
> 
> 1) Has that been tested? That seems like an extremely dramatic
> decrease in cwnd. For example, if the cwnd is 80, and there are 40
> ACKs, and half the ACKs are ECE marked, then my back-of-the-envelope
> calculations seem to suggest that after just 11 ACKs the cwnd would be
> down to a minimal value of 2:
> 
> ack 1 cwnd=60
> ack 2 cwnd=45
> ack 3 cwnd=33
[..]

You are assuming alpha = 0.5?
Then, yes, looks correct.  Since some of these acks will most likely
also end an observation window acks might also cause change to alpha.

> 2) That seems to contradict another passage in the draft (v 02 or 03). Consider
>      https://tools.ietf.org/html/draft-ietf-tcpm-dctcp-03
> where it says
> 
>    Just as specified in [RFC3168], DCTCP does not react to congestion
>    indications more than once for every window of data.
> 
> So the draft seems to advocate not reacting to congestion indications
> more than once per window. Yet this patch reacts on every ECE-marked
> ACK within a window.
> 
> Am I reading something incorrectly?

No, I will raise this on tcpm next monday (if you want you
can of course do this yourself).

Would be easy to make it so this cwnd update only happens once in each
observation cycle, but it would be even better if this would get input
from draft authors.

Thanks Neal!

^ permalink raw reply

* Re: [PATCHv2 net-next 2/4] net: dsa: mv88e6xxx: Monitor and Management tables
From: Andrew Lunn @ 2016-12-02 20:56 UTC (permalink / raw)
  To: Vivien Didelot; +Cc: David Miller, netdev
In-Reply-To: <87mvgecejs.fsf@ketchup.i-did-not-set--mail-host-address--so-tickle-me>

On Fri, Dec 02, 2016 at 02:32:39PM -0500, Vivien Didelot wrote:
> Hi Andrew,
> 
> Andrew Lunn <andrew@lunn.ch> writes:
> 
> > @@ -3184,6 +3186,8 @@ static const struct mv88e6xxx_ops mv88e6085_ops = {
> >  	.stats_get_sset_count = mv88e6095_stats_get_sset_count,
> >  	.stats_get_strings = mv88e6095_stats_get_strings,
> >  	.stats_get_stats = mv88e6095_stats_get_stats,
> > +	.g1_set_cpu_port = mv88e6095_g1_set_cpu_port,
> > +	.g1_set_egress_port = mv88e6095_g1_set_egress_port,
> >  };
> 
> I like the implementation in this version better. But please explain me
> why you are prefixing these operations with g1_?

The prefix gives some basic grouping. port_ indicates it operates on a
port, and is likely to be found in port.c. stats_ indicates it
operates on statistics, ppu that is operates on the phy polling unit.

We are going to have some things which don't fall into a simple
category, like these two. But it would however be nice to group them,
so i picked which register bank they are in. These operations are
always in g1. It is a useful hint as to where to find the different
variants.

> But let's imagine we can set the CPU port in some Global 2 registers.
> You are going to wrap this in chip.c with something like:
> 
>     int mv88e6xxx_set_cpu_port(struct mv88e6xxx_chip *chip, int port)
>     {
>         if (chip->info->ops->g2_set_cpu_port)
>             return chip->info->ops->g2_set_cpu_port(chip, port);
>         else if (chip->info->ops->g1_set_cpu_port)
>             return chip->info->ops->g1_set_cpu_port(chip, port);
>         else
>             return -EOPNOTSUPP;
>     }

I answered in one of my other emails. Frames with reserved MAC
addresses can be forwarded to the CPU. For most devices, this is a g2
operation. However, for 6390, it is a g1. In that case, my code does
not use a prefix. Not having a prefix, when all the others do, also
gives you information. It means the ops are spread around and you need
to make a bigger effort to go find them.

   Andrew

^ permalink raw reply

* Re: [PATCHv2 net-next 4/4] net: dsa: mv88e6xxx: Refactor CPU and DSA port setup
From: Andrew Lunn @ 2016-12-02 21:18 UTC (permalink / raw)
  To: Vivien Didelot; +Cc: David Miller, netdev
In-Reply-To: <87h96mcadj.fsf@ketchup.i-did-not-set--mail-host-address--so-tickle-me>

> The port's EgressMode, FrameMode and EtherType are really tied together
> to compose the mode of the port.

Setting the EtherType is somewhat separate. It is only needed on ports
using EDSA. And that can only happen on a CPU port. Humm, actually, i
set it when i should not. But putting this in a wrapper actually hides
this.

> Could you add an helper in chip.c like:
> 
>     static int mv88e6xxx_set_port_mode(struct mv88e6xxx_chip *chip, int port,
>                                        enum mv88e6xxx_frame_mode frame_mode,
>                                        u16 egress_mode, bool egress_unknown,
>                                        u16 ethertype)
>     {
>         int err;
> 
>         if (chip->info->ops->port_set_frame_mode) {
>             err = chip->info->ops->port_set_frame_mode(chip, port, frame_mode);
>             if (err)
>                 return err;
>         }

Ignoring that it is not implemented here is wrong. It must be
implemented, or the device is not going to work. It is a question of,
do we want an oops, or return an error code.

New version coming.

    Andrew

^ permalink raw reply

* Re: [PATCH net-next 0/4] tcp: tsq: performance series
From: Eric Dumazet @ 2016-12-02 20:57 UTC (permalink / raw)
  To: David S . Miller; +Cc: netdev, Eric Dumazet, Eric Dumazet
In-Reply-To: <1480703159-2327-1-git-send-email-edumazet@google.com>

On Fri, Dec 2, 2016 at 10:25 AM, Eric Dumazet <edumazet@google.com> wrote:
> Under very high TX stress, CPU handling NIC TX completions can spend
> considerable amount of cycles handling TSQ (TCP Small Queues) logic.
>
> This patch series avoids some atomic operations, but more important
> patch is the 3rd one, allowing other cpus processing ACK packets and
> calling tcp_write_xmit() to grab TCP_TSQ_DEFERRED so that
> tcp_tasklet_func() can skip already processed sockets.
>
> This avoid lots of lock acquisitions and cache lines accesses,
> particularly under load.
>

Please do not merge this version.

I probably messed something, I need to make more tests.

Thanks.

^ permalink raw reply

* Re: [PATCHv2 net-next 4/4] net: dsa: mv88e6xxx: Refactor CPU and DSA port setup
From: Vivien Didelot @ 2016-12-02 21:02 UTC (permalink / raw)
  To: Andrew Lunn, David Miller; +Cc: netdev, Andrew Lunn
In-Reply-To: <1480701779-30633-5-git-send-email-andrew@lunn.ch>

Hi Andrew,

Andrew Lunn <andrew@lunn.ch> writes:

> +static int mv88e6xxx_setup_port_dsa(struct mv88e6xxx_chip *chip, int port,
> +				    int upstream_port)
> +{
> +	int err;
> +
> +	err = chip->info->ops->port_set_frame_mode(
> +		chip, port, MV88E6XXX_FRAME_MODE_DSA);
> +	if (err)
> +		return err;
> +
> +	err = chip->info->ops->port_set_egress_unknowns(
> +		chip, port, port == upstream_port);
> +	if (err)
> +		return err;
> +
> +	if (chip->info->ops->port_set_ether_type)
> +		return chip->info->ops->port_set_ether_type(
> +			chip, port, ETH_P_EDSA);
> +
> +	return 0;
> +}
> +
> +static int mv88e6xxx_setup_port_cpu(struct mv88e6xxx_chip *chip, int port)
> +{
> +	int err;
> +
> +	switch (chip->info->tag_protocol) {
> +	case DSA_TAG_PROTO_EDSA:
> +		err = chip->info->ops->port_set_frame_mode(
> +			chip, port, MV88E6XXX_FRAME_MODE_ETHERTYPE);
> +		if (err)
> +			return err;
> +
> +		err = mv88e6xxx_port_set_egress_mode(
> +			chip, port, PORT_CONTROL_EGRESS_ADD_TAG);
> +		if (err)
> +			return err;
> +
> +		if (chip->info->ops->port_set_ether_type)
> +			err = chip->info->ops->port_set_ether_type(
> +				chip, port, ETH_P_EDSA);
> +		break;
> +
> +	case DSA_TAG_PROTO_DSA:
> +		err = chip->info->ops->port_set_frame_mode(
> +			chip, port, MV88E6XXX_FRAME_MODE_DSA);
> +		if (err)
> +			return err;
> +
> +		err = mv88e6xxx_port_set_egress_mode(
> +			chip, port, PORT_CONTROL_EGRESS_UNMODIFIED);
> +		break;
> +	default:
> +		err = -EINVAL;
> +	}
> +
> +	if (err)
> +		return err;
> +
> +	return chip->info->ops->port_set_egress_unknowns(chip, port, true);
> +}
> +
> +static int mv88e6xxx_setup_port_normal(struct mv88e6xxx_chip *chip, int port)
> +{
> +	int err;
> +
> +	err = chip->info->ops->port_set_frame_mode(
> +		chip, port, MV88E6XXX_FRAME_MODE_NORMAL);
> +	if (err)
> +		return err;
> +
> +	return chip->info->ops->port_set_egress_unknowns(chip, port, false);
> +}

The port's EgressMode, FrameMode and EtherType are really tied together
to compose the mode of the port. Could you add an helper in chip.c like:

    static int mv88e6xxx_set_port_mode(struct mv88e6xxx_chip *chip, int port,
                                       enum mv88e6xxx_frame_mode frame_mode,
                                       u16 egress_mode, bool egress_unknown,
                                       u16 ethertype)
    {
        int err;

        if (chip->info->ops->port_set_frame_mode) {
            err = chip->info->ops->port_set_frame_mode(chip, port, frame_mode);
            if (err)
                return err;
        }

        err = mv88e6xxx_port_set_egress_mode(chip, port, egress_mode);
        if (err)
            return err;
        
        if (chip->info->ops->port_set_egress_unknown) {
            err = chip->info->ops->port_set_egress_unknown(chip, port, egress_unknown);
            if (err)
                return err;
        }
        
        if (chip->info->ops->port_set_ether_type) {
            err = chip->info->ops->port_set_ether_type(chip, port, ethertype);
            if (err)
                return err;
        }

        return 0;
    }

So that we correctly check for ops before calling them, and make
mv88e6xxx_setup_port_{dsa,cpu,normal} a bit more concise.

> +
>  static int mv88e6xxx_setup_port(struct mv88e6xxx_chip *chip, int port)
>  {
>  	struct dsa_switch *ds = chip->ds;
> @@ -2473,44 +2542,25 @@ static int mv88e6xxx_setup_port(struct mv88e6xxx_chip *chip, int port)
>  	 * If this is the upstream port for this switch, enable
>  	 * forwarding of unknown unicasts and multicasts.
>  	 */
> -	reg = 0;
> -	if (mv88e6xxx_6352_family(chip) || mv88e6xxx_6351_family(chip) ||
> -	    mv88e6xxx_6165_family(chip) || mv88e6xxx_6097_family(chip) ||
> -	    mv88e6xxx_6095_family(chip) || mv88e6xxx_6065_family(chip) ||
> -	    mv88e6xxx_6185_family(chip) || mv88e6xxx_6320_family(chip))
> -		reg = PORT_CONTROL_IGMP_MLD_SNOOP |
> +	reg = PORT_CONTROL_IGMP_MLD_SNOOP |
>  		PORT_CONTROL_USE_TAG | PORT_CONTROL_USE_IP |
>  		PORT_CONTROL_STATE_FORWARDING;
> +	err = mv88e6xxx_port_write(chip, port, PORT_CONTROL, reg);
> +	if (err)
> +		return err;
> +
>  	if (dsa_is_cpu_port(ds, port)) {
> -		if (chip->info->tag_protocol == DSA_TAG_PROTO_EDSA)
> -			reg |= PORT_CONTROL_FRAME_ETHER_TYPE_DSA |
> -				PORT_CONTROL_FORWARD_UNKNOWN_MC;
> -		else
> -			reg |= PORT_CONTROL_DSA_TAG;
> -		reg |= PORT_CONTROL_EGRESS_ADD_TAG |
> -			PORT_CONTROL_FORWARD_UNKNOWN;
> -	}
> -	if (dsa_is_dsa_port(ds, port)) {
> -		if (mv88e6xxx_6095_family(chip) ||
> -		    mv88e6xxx_6185_family(chip))
> -			reg |= PORT_CONTROL_DSA_TAG;
> -		if (mv88e6xxx_6352_family(chip) ||
> -		    mv88e6xxx_6351_family(chip) ||
> -		    mv88e6xxx_6165_family(chip) ||
> -		    mv88e6xxx_6097_family(chip) ||
> -		    mv88e6xxx_6320_family(chip)) {
> -			reg |= PORT_CONTROL_FRAME_MODE_DSA;
> +		err = mv88e6xxx_setup_port_cpu(chip, port);
> +	} else {
> +		if (dsa_is_dsa_port(ds, port)) {
> +			err = mv88e6xxx_setup_port_dsa(chip, port,
> +						       dsa_upstream_port(ds));
> +		} else {
> +			err = mv88e6xxx_setup_port_normal(chip, port);
>  		}
> -
> -		if (port == dsa_upstream_port(ds))
> -			reg |= PORT_CONTROL_FORWARD_UNKNOWN |
> -				PORT_CONTROL_FORWARD_UNKNOWN_MC;
> -	}
> -	if (reg) {
> -		err = mv88e6xxx_port_write(chip, port, PORT_CONTROL, reg);
> -		if (err)
> -			return err;
>  	}

The statement is weird. Can you please do:

        if (dsa_is_cpu_port(ds, port)) {
            // CPU port setup
        } else if (dsa_is_dsa_port(ds, port)) {
            // DSA port setup
        } else {
            // Normal port setup
        }
> +	if (err)
> +		return err;

> +	int (*port_set_frame_mode)(struct mv88e6xxx_chip *chip, int port,
> +				   enum mv88e6xxx_frame_mode mode);
> +	int (*port_set_egress_unknowns)(struct mv88e6xxx_chip *chip, int port,
> +					bool on);

"unknowns" sounds odd. "floods" is what all datasheet (except 6185)
use. That'd be better I think. Or at least, s/unknowns/unknown/...

Thanks,

        Vivien

^ permalink raw reply

* Re: [PATCH next] dctcp: update cwnd on congestion event
From: Neal Cardwell @ 2016-12-02 21:01 UTC (permalink / raw)
  To: Florian Westphal
  Cc: Netdev, Lawrence Brakmo, Andrew Shewmaker, Glenn Judd,
	Daniel Borkmann, Yuchung Cheng, Eric Dumazet,
	Soheil Hassas Yeganeh
In-Reply-To: <1479138121-32294-1-git-send-email-fw@strlen.de>

On Mon, Nov 14, 2016 at 10:42 AM, Florian Westphal <fw@strlen.de> wrote:
>
> draft-ietf-tcpm-dctcp-02 says:
>
> ... when the sender receives an indication of congestion
> (ECE), the sender SHOULD update cwnd as follows:
>
>          cwnd = cwnd * (1 - DCTCP.Alpha / 2)
>
> So, lets do this and reduce cwnd more smoothly (and faster), as per
> current congestion estimate.

AFAICT this is doing a multiplicative decrease of cwnd on every ACK
that has an ECE bit.

If I am reading the code correctly, then I would have two concerns:

1) Has that been tested? That seems like an extremely dramatic
decrease in cwnd. For example, if the cwnd is 80, and there are 40
ACKs, and half the ACKs are ECE marked, then my back-of-the-envelope
calculations seem to suggest that after just 11 ACKs the cwnd would be
down to a minimal value of 2:

ack 1 cwnd=60
ack 2 cwnd=45
ack 3 cwnd=33
ack 4 cwnd=24
ack 5 cwnd=18
ack 6 cwnd=13
ack 7 cwnd=9
ack 8 cwnd=6
ack 9 cwnd=4
ack 10 cwnd=3
ack 11 cwnd=2

2) That seems to contradict another passage in the draft (v 02 or 03). Consider
     https://tools.ietf.org/html/draft-ietf-tcpm-dctcp-03
where it says

   Just as specified in [RFC3168], DCTCP does not react to congestion
   indications more than once for every window of data.

So the draft seems to advocate not reacting to congestion indications
more than once per window. Yet this patch reacts on every ECE-marked
ACK within a window.

Am I reading something incorrectly?

cheers,
neal

^ permalink raw reply

* Re: [PATCHv2 net-next 3/4] net: dsa: mv88e6xxx: Move the tagging protocol into info
From: Andrew Lunn @ 2016-12-02 21:02 UTC (permalink / raw)
  To: Vivien Didelot; +Cc: David Miller, netdev
In-Reply-To: <87k2bice5n.fsf@ketchup.i-did-not-set--mail-host-address--so-tickle-me>

On Fri, Dec 02, 2016 at 02:41:08PM -0500, Vivien Didelot wrote:
> Hi Andrew,
> 
> Andrew Lunn <andrew@lunn.ch> writes:
> 
> > @@ -3749,6 +3756,7 @@ static const struct mv88e6xxx_info mv88e6xxx_table[] = {
> >  		.global1_addr = 0x1b,
> >  		.age_time_coeff = 15000,
> >  		.g1_irqs = 9,
> > +		.tag_protocol = DSA_TAG_PROTO_EDSA,
> >  		.flags = MV88E6XXX_FLAGS_FAMILY_6352,
> >  		.ops = &mv88e6172_ops,
> >  	},
> 
> Since some chips support several protocols, we will have to turn
> tag_protocol into a bitmask and introduce something like:

Why? We have made an implementation choice, this chip will be used in
this way. There is no strong reason to use it the other way. There is
a strong reason not to allow it to be configured, because it makes the
driver more complex and the DSA layer more complex, and no other
driver requires this complexity.

KISS.

	Andrew

^ permalink raw reply

* [net-next PATCH v4 6/6] virtio_net: xdp, add slowpath case for non contiguous buffers
From: John Fastabend @ 2016-12-02 20:51 UTC (permalink / raw)
  To: daniel, mst, shm, davem, tgraf, alexei.starovoitov
  Cc: john.r.fastabend, netdev, bblanco, john.fastabend, brouer
In-Reply-To: <20161202204804.4331.61904.stgit@john-Precision-Tower-5810>

virtio_net XDP support expects receive buffers to be contiguous.
If this is not the case we enable a slowpath to allow connectivity
to continue but at a significan performance overhead associated with
linearizing data. To make it painfully aware to users that XDP is
running in a degraded mode we throw an xdp buffer error.

To linearize packets we allocate a page and copy the segments of
the data, including the header, into it. After this the page can be
handled by XDP code flow as normal.

Then depending on the return code the page is either freed or sent
to the XDP xmit path. There is no attempt to optimize this path.

This case is being handled simple as a precaution in case some
unknown backend were to generate packets in this form. To test this
I had to hack qemu and force it to generate these packets. I do not
expect this case to be generated by "real" backends.

Signed-off-by: John Fastabend <john.r.fastabend@intel.com>
---
 drivers/net/virtio_net.c |   77 +++++++++++++++++++++++++++++++++++++++++++++-
 1 file changed, 75 insertions(+), 2 deletions(-)

diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio_net.c
index 137caba..13f463d 100644
--- a/drivers/net/virtio_net.c
+++ b/drivers/net/virtio_net.c
@@ -456,6 +456,64 @@ static struct sk_buff *receive_big(struct net_device *dev,
 	return NULL;
 }
 
+/* The conditions to enable XDP should preclude the underlying device from
+ * sending packets across multiple buffers (num_buf > 1). However per spec
+ * it does not appear to be illegal to do so but rather just against convention.
+ * So in order to avoid making a system unresponsive the packets are pushed
+ * into a page and the XDP program is run. This will be extremely slow and we
+ * push a warning to the user to fix this as soon as possible. Fixing this may
+ * require resolving the underlying hardware to determine why multiple buffers
+ * are being received or simply loading the XDP program in the ingress stack
+ * after the skb is built because there is no advantage to running it here
+ * anymore.
+ */
+static struct page *xdp_linearize_page(struct receive_queue *rq,
+				       u16 num_buf,
+				       struct page *p,
+				       int offset,
+				       unsigned int *len)
+{
+	struct page *page = alloc_page(GFP_ATOMIC);
+	unsigned int page_off = 0;
+
+	if (!page)
+		return NULL;
+
+	memcpy(page_address(page) + page_off, page_address(p) + offset, *len);
+	page_off += *len;
+
+	while (--num_buf) {
+		unsigned int buflen;
+		unsigned long ctx;
+		void *buf;
+		int off;
+
+		ctx = (unsigned long)virtqueue_get_buf(rq->vq, &buflen);
+		if (unlikely(!ctx))
+			goto err_buf;
+
+		/* guard against a misconfigured or uncooperative backend that
+		 * is sending packet larger than the MTU.
+		 */
+		if ((page_off + buflen) > PAGE_SIZE)
+			goto err_buf;
+
+		buf = mergeable_ctx_to_buf_address(ctx);
+		p = virt_to_head_page(buf);
+		off = buf - page_address(p);
+
+		memcpy(page_address(page) + page_off,
+		       page_address(p) + off, buflen);
+		page_off += buflen;
+	}
+
+	*len = page_off;
+	return page;
+err_buf:
+	__free_pages(page, 0);
+	return NULL;
+}
+
 static struct sk_buff *receive_mergeable(struct net_device *dev,
 					 struct virtnet_info *vi,
 					 struct receive_queue *rq,
@@ -476,6 +534,7 @@ static struct sk_buff *receive_mergeable(struct net_device *dev,
 	rcu_read_lock();
 	xdp_prog = rcu_dereference(rq->xdp_prog);
 	if (xdp_prog) {
+		struct page *xdp_page;
 		u32 act;
 
 		/* No known backend devices should send packets with
@@ -485,7 +544,15 @@ static struct sk_buff *receive_mergeable(struct net_device *dev,
 		 */
 		if (unlikely(num_buf > 1)) {
 			bpf_warn_invalid_xdp_buffer();
-			goto err_xdp;
+
+			/* linearize data for XDP */
+			xdp_page = xdp_linearize_page(rq, num_buf,
+						      page, offset, &len);
+			if (!xdp_page)
+				goto err_xdp;
+			offset = 0;
+		} else {
+			xdp_page = page;
 		}
 
 		/* Transient failure which in theory could occur if
@@ -496,15 +563,21 @@ static struct sk_buff *receive_mergeable(struct net_device *dev,
 		if (unlikely(hdr->hdr.gso_type || hdr->hdr.flags))
 			goto err_xdp;
 
-		act = do_xdp_prog(vi, xdp_prog, page, offset, len);
+		act = do_xdp_prog(vi, xdp_prog, xdp_page, offset, len);
 		switch (act) {
 		case XDP_PASS:
+			if (unlikely(xdp_page != page))
+				__free_pages(xdp_page, 0);
 			break;
 		case XDP_TX:
+			if (unlikely(xdp_page != page))
+				goto err_xdp;
 			rcu_read_unlock();
 			goto xdp_xmit;
 		case XDP_DROP:
 		default:
+			if (unlikely(xdp_page != page))
+				__free_pages(xdp_page, 0);
 			goto err_xdp;
 		}
 	}

^ permalink raw reply related

* [net-next PATCH v4 5/6] virtio_net: add XDP_TX support
From: John Fastabend @ 2016-12-02 20:51 UTC (permalink / raw)
  To: daniel, mst, shm, davem, tgraf, alexei.starovoitov
  Cc: john.r.fastabend, netdev, bblanco, john.fastabend, brouer
In-Reply-To: <20161202204804.4331.61904.stgit@john-Precision-Tower-5810>

This adds support for the XDP_TX action to virtio_net. When an XDP
program is run and returns the XDP_TX action the virtio_net XDP
implementation will transmit the packet on a TX queue that aligns
with the current CPU that the XDP packet was processed on.

Before sending the packet the header is zeroed.  Also XDP is expected
to handle checksum correctly so no checksum offload  support is
provided.

Signed-off-by: John Fastabend <john.r.fastabend@intel.com>
---
 drivers/net/virtio_net.c |   63 ++++++++++++++++++++++++++++++++++++++++++++--
 1 file changed, 60 insertions(+), 3 deletions(-)

diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio_net.c
index b67203e..137caba 100644
--- a/drivers/net/virtio_net.c
+++ b/drivers/net/virtio_net.c
@@ -330,12 +330,43 @@ static struct sk_buff *page_to_skb(struct virtnet_info *vi,
 	return skb;
 }
 
+static void virtnet_xdp_xmit(struct virtnet_info *vi,
+			     unsigned int qnum, struct xdp_buff *xdp)
+{
+	struct send_queue *sq = &vi->sq[qnum];
+	struct virtio_net_hdr_mrg_rxbuf *hdr;
+	unsigned int num_sg, len;
+	void *xdp_sent;
+	int err;
+
+	/* Free up any pending old buffers before queueing new ones. */
+	while ((xdp_sent = virtqueue_get_buf(sq->vq, &len)) != NULL) {
+		struct page *page = virt_to_head_page(xdp_sent);
+
+		put_page(page);
+	}
+
+	/* Zero header and leave csum up to XDP layers */
+	hdr = xdp->data;
+	memset(hdr, 0, vi->hdr_len);
+
+	num_sg = 1;
+	sg_init_one(sq->sg, xdp->data, xdp->data_end - xdp->data);
+	err = virtqueue_add_outbuf(sq->vq, sq->sg, num_sg,
+				   xdp->data, GFP_ATOMIC);
+	if (unlikely(err))
+		put_page(virt_to_head_page(xdp->data));
+	else
+		virtqueue_kick(sq->vq);
+}
+
 static u32 do_xdp_prog(struct virtnet_info *vi,
 		       struct bpf_prog *xdp_prog,
 		       struct page *page, int offset, int len)
 {
 	int hdr_padded_len;
 	struct xdp_buff xdp;
+	unsigned int qp;
 	u32 act;
 	u8 *buf;
 
@@ -353,9 +384,15 @@ static u32 do_xdp_prog(struct virtnet_info *vi,
 	switch (act) {
 	case XDP_PASS:
 		return XDP_PASS;
+	case XDP_TX:
+		qp = vi->curr_queue_pairs -
+			vi->xdp_queue_pairs +
+			smp_processor_id();
+		xdp.data = buf + (vi->mergeable_rx_bufs ? 0 : 4);
+		virtnet_xdp_xmit(vi, qp, &xdp);
+		return XDP_TX;
 	default:
 		bpf_warn_invalid_xdp_action(act);
-	case XDP_TX:
 	case XDP_ABORTED:
 	case XDP_DROP:
 		return XDP_DROP;
@@ -391,8 +428,16 @@ static struct sk_buff *receive_big(struct net_device *dev,
 		if (unlikely(hdr->hdr.gso_type || hdr->hdr.flags))
 			goto err_xdp;
 		act = do_xdp_prog(vi, xdp_prog, page, 0, len);
-		if (act == XDP_DROP)
+		switch (act) {
+		case XDP_PASS:
+			break;
+		case XDP_TX:
+			rcu_read_unlock();
+			goto xdp_xmit;
+		case XDP_DROP:
+		default:
 			goto err_xdp;
+		}
 	}
 	rcu_read_unlock();
 
@@ -407,6 +452,7 @@ static struct sk_buff *receive_big(struct net_device *dev,
 err:
 	dev->stats.rx_dropped++;
 	give_pages(rq, page);
+xdp_xmit:
 	return NULL;
 }
 
@@ -425,6 +471,8 @@ static struct sk_buff *receive_mergeable(struct net_device *dev,
 	struct bpf_prog *xdp_prog;
 	unsigned int truesize;
 
+	head_skb = NULL;
+
 	rcu_read_lock();
 	xdp_prog = rcu_dereference(rq->xdp_prog);
 	if (xdp_prog) {
@@ -449,8 +497,16 @@ static struct sk_buff *receive_mergeable(struct net_device *dev,
 			goto err_xdp;
 
 		act = do_xdp_prog(vi, xdp_prog, page, offset, len);
-		if (act == XDP_DROP)
+		switch (act) {
+		case XDP_PASS:
+			break;
+		case XDP_TX:
+			rcu_read_unlock();
+			goto xdp_xmit;
+		case XDP_DROP:
+		default:
 			goto err_xdp;
+		}
 	}
 	rcu_read_unlock();
 
@@ -528,6 +584,7 @@ static struct sk_buff *receive_mergeable(struct net_device *dev,
 err_buf:
 	dev->stats.rx_dropped++;
 	dev_kfree_skb(head_skb);
+xdp_xmit:
 	return NULL;
 }
 

^ permalink raw reply related

* [net-next PATCH v4 4/6] virtio_net: add dedicated XDP transmit queues
From: John Fastabend @ 2016-12-02 20:50 UTC (permalink / raw)
  To: daniel, mst, shm, davem, tgraf, alexei.starovoitov
  Cc: john.r.fastabend, netdev, bblanco, john.fastabend, brouer
In-Reply-To: <20161202204804.4331.61904.stgit@john-Precision-Tower-5810>

XDP requires using isolated transmit queues to avoid interference
with normal networking stack (BQL, NETDEV_TX_BUSY, etc). This patch
adds a XDP queue per cpu when a XDP program is loaded and does not
expose the queues to the OS via the normal API call to
netif_set_real_num_tx_queues(). This way the stack will never push
an skb to these queues.

However virtio/vhost/qemu implementation only allows for creating
TX/RX queue pairs at this time so creating only TX queues was not
possible. And because the associated RX queues are being created I
went ahead and exposed these to the stack and let the backend use
them. This creates more RX queues visible to the network stack than
TX queues which is worth mentioning but does not cause any issues as
far as I can tell.

Signed-off-by: John Fastabend <john.r.fastabend@intel.com>
---
 drivers/net/virtio_net.c |   30 ++++++++++++++++++++++++++++--
 1 file changed, 28 insertions(+), 2 deletions(-)

diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio_net.c
index 80b1cfc..b67203e 100644
--- a/drivers/net/virtio_net.c
+++ b/drivers/net/virtio_net.c
@@ -114,6 +114,9 @@ struct virtnet_info {
 	/* # of queue pairs currently used by the driver */
 	u16 curr_queue_pairs;
 
+	/* # of XDP queue pairs currently used by the driver */
+	u16 xdp_queue_pairs;
+
 	/* I like... big packets and I cannot lie! */
 	bool big_packets;
 
@@ -1552,7 +1555,8 @@ static int virtnet_xdp_set(struct net_device *dev, struct bpf_prog *prog)
 	unsigned long int max_sz = PAGE_SIZE - sizeof(struct padded_vnet_hdr);
 	struct virtnet_info *vi = netdev_priv(dev);
 	struct bpf_prog *old_prog;
-	int i;
+	u16 xdp_qp = 0, curr_qp;
+	int i, err;
 
 	if ((dev->features & NETIF_F_LRO) && prog) {
 		netdev_warn(dev, "can't set XDP while LRO is on, disable LRO first\n");
@@ -1569,12 +1573,34 @@ static int virtnet_xdp_set(struct net_device *dev, struct bpf_prog *prog)
 		return -EINVAL;
 	}
 
+	curr_qp = vi->curr_queue_pairs - vi->xdp_queue_pairs;
+	if (prog)
+		xdp_qp = nr_cpu_ids;
+
+	/* XDP requires extra queues for XDP_TX */
+	if (curr_qp + xdp_qp > vi->max_queue_pairs) {
+		netdev_warn(dev, "request %i queues but max is %i\n",
+			    curr_qp + xdp_qp, vi->max_queue_pairs);
+		return -ENOMEM;
+	}
+
+	err = virtnet_set_queues(vi, curr_qp + xdp_qp);
+	if (err) {
+		dev_warn(&dev->dev, "XDP Device queue allocation failure.\n");
+		return err;
+	}
+
 	if (prog) {
 		prog = bpf_prog_add(prog, vi->max_queue_pairs - 1);
-		if (IS_ERR(prog))
+		if (IS_ERR(prog)) {
+			virtnet_set_queues(vi, curr_qp);
 			return PTR_ERR(prog);
+		}
 	}
 
+	vi->xdp_queue_pairs = xdp_qp;
+	netif_set_real_num_rx_queues(dev, curr_qp + xdp_qp);
+
 	for (i = 0; i < vi->max_queue_pairs; i++) {
 		old_prog = rtnl_dereference(vi->rq[i].xdp_prog);
 		rcu_assign_pointer(vi->rq[i].xdp_prog, prog);

^ permalink raw reply related

* [net-next PATCH v4 3/6] virtio_net: Add XDP support
From: John Fastabend @ 2016-12-02 20:50 UTC (permalink / raw)
  To: daniel, mst, shm, davem, tgraf, alexei.starovoitov
  Cc: john.r.fastabend, netdev, bblanco, john.fastabend, brouer
In-Reply-To: <20161202204804.4331.61904.stgit@john-Precision-Tower-5810>

From: John Fastabend <john.fastabend@gmail.com>

This adds XDP support to virtio_net. Some requirements must be
met for XDP to be enabled depending on the mode. First it will
only be supported with LRO disabled so that data is not pushed
across multiple buffers. Second the MTU must be less than a page
size to avoid having to handle XDP across multiple pages.

If mergeable receive is enabled this patch only supports the case
where header and data are in the same buf which we can check when
a packet is received by looking at num_buf. If the num_buf is
greater than 1 and a XDP program is loaded the packet is dropped
and a warning is thrown. When any_header_sg is set this does not
happen and both header and data is put in a single buffer as expected
so we check this when XDP programs are loaded.  Subsequent patches
will process the packet in a degraded mode to ensure connectivity
and correctness is not lost even if backend pushes packets into
multiple buffers.

If big packets mode is enabled and MTU/LRO conditions above are
met then XDP is allowed.

This patch was tested with qemu with vhost=on and vhost=off where
mergeable and big_packet modes were forced via hard coding feature
negotiation. Multiple buffers per packet was forced via a small
test patch to vhost.c in the vhost=on qemu mode.

Suggested-by: Shrijeet Mukherjee <shrijeet@gmail.com>
Signed-off-by: John Fastabend <john.r.fastabend@intel.com>
---
 drivers/net/virtio_net.c |  175 +++++++++++++++++++++++++++++++++++++++++++++-
 1 file changed, 170 insertions(+), 5 deletions(-)

diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio_net.c
index d814e7cb..80b1cfc 100644
--- a/drivers/net/virtio_net.c
+++ b/drivers/net/virtio_net.c
@@ -22,6 +22,7 @@
 #include <linux/module.h>
 #include <linux/virtio.h>
 #include <linux/virtio_net.h>
+#include <linux/bpf.h>
 #include <linux/scatterlist.h>
 #include <linux/if_vlan.h>
 #include <linux/slab.h>
@@ -81,6 +82,8 @@ struct receive_queue {
 
 	struct napi_struct napi;
 
+	struct bpf_prog __rcu *xdp_prog;
+
 	/* Chain pages by the private ptr. */
 	struct page *pages;
 
@@ -324,6 +327,38 @@ static struct sk_buff *page_to_skb(struct virtnet_info *vi,
 	return skb;
 }
 
+static u32 do_xdp_prog(struct virtnet_info *vi,
+		       struct bpf_prog *xdp_prog,
+		       struct page *page, int offset, int len)
+{
+	int hdr_padded_len;
+	struct xdp_buff xdp;
+	u32 act;
+	u8 *buf;
+
+	buf = page_address(page) + offset;
+
+	if (vi->mergeable_rx_bufs)
+		hdr_padded_len = sizeof(struct virtio_net_hdr_mrg_rxbuf);
+	else
+		hdr_padded_len = sizeof(struct padded_vnet_hdr);
+
+	xdp.data = buf + hdr_padded_len;
+	xdp.data_end = xdp.data + (len - vi->hdr_len);
+
+	act = bpf_prog_run_xdp(xdp_prog, &xdp);
+	switch (act) {
+	case XDP_PASS:
+		return XDP_PASS;
+	default:
+		bpf_warn_invalid_xdp_action(act);
+	case XDP_TX:
+	case XDP_ABORTED:
+	case XDP_DROP:
+		return XDP_DROP;
+	}
+}
+
 static struct sk_buff *receive_small(struct virtnet_info *vi, void *buf, unsigned int len)
 {
 	struct sk_buff * skb = buf;
@@ -340,14 +375,32 @@ static struct sk_buff *receive_big(struct net_device *dev,
 				   void *buf,
 				   unsigned int len)
 {
+	struct bpf_prog *xdp_prog;
 	struct page *page = buf;
-	struct sk_buff *skb = page_to_skb(vi, rq, page, 0, len, PAGE_SIZE);
+	struct sk_buff *skb;
 
+	rcu_read_lock();
+	xdp_prog = rcu_dereference(rq->xdp_prog);
+	if (xdp_prog) {
+		struct virtio_net_hdr_mrg_rxbuf *hdr = buf;
+		u32 act;
+
+		if (unlikely(hdr->hdr.gso_type || hdr->hdr.flags))
+			goto err_xdp;
+		act = do_xdp_prog(vi, xdp_prog, page, 0, len);
+		if (act == XDP_DROP)
+			goto err_xdp;
+	}
+	rcu_read_unlock();
+
+	skb = page_to_skb(vi, rq, page, 0, len, PAGE_SIZE);
 	if (unlikely(!skb))
 		goto err;
 
 	return skb;
 
+err_xdp:
+	rcu_read_unlock();
 err:
 	dev->stats.rx_dropped++;
 	give_pages(rq, page);
@@ -365,11 +418,42 @@ static struct sk_buff *receive_mergeable(struct net_device *dev,
 	u16 num_buf = virtio16_to_cpu(vi->vdev, hdr->num_buffers);
 	struct page *page = virt_to_head_page(buf);
 	int offset = buf - page_address(page);
-	unsigned int truesize = max(len, mergeable_ctx_to_buf_truesize(ctx));
+	struct sk_buff *head_skb, *curr_skb;
+	struct bpf_prog *xdp_prog;
+	unsigned int truesize;
+
+	rcu_read_lock();
+	xdp_prog = rcu_dereference(rq->xdp_prog);
+	if (xdp_prog) {
+		u32 act;
+
+		/* No known backend devices should send packets with
+		 * more than a single buffer when XDP conditions are
+		 * met. However it is not strictly illegal so the case
+		 * is handled as an exception and a warning is thrown.
+		 */
+		if (unlikely(num_buf > 1)) {
+			bpf_warn_invalid_xdp_buffer();
+			goto err_xdp;
+		}
 
-	struct sk_buff *head_skb = page_to_skb(vi, rq, page, offset, len,
-					       truesize);
-	struct sk_buff *curr_skb = head_skb;
+		/* Transient failure which in theory could occur if
+		 * in-flight packets from before XDP was enabled reach
+		 * the receive path after XDP is loaded. In practice I
+		 * was not able to create this condition.
+		 */
+		if (unlikely(hdr->hdr.gso_type || hdr->hdr.flags))
+			goto err_xdp;
+
+		act = do_xdp_prog(vi, xdp_prog, page, offset, len);
+		if (act == XDP_DROP)
+			goto err_xdp;
+	}
+	rcu_read_unlock();
+
+	truesize = max(len, mergeable_ctx_to_buf_truesize(ctx));
+	head_skb = page_to_skb(vi, rq, page, offset, len, truesize);
+	curr_skb = head_skb;
 
 	if (unlikely(!curr_skb))
 		goto err_skb;
@@ -423,6 +507,8 @@ static struct sk_buff *receive_mergeable(struct net_device *dev,
 	ewma_pkt_len_add(&rq->mrg_avg_pkt_len, head_skb->len);
 	return head_skb;
 
+err_xdp:
+	rcu_read_unlock();
 err_skb:
 	put_page(page);
 	while (--num_buf) {
@@ -1328,6 +1414,13 @@ static int virtnet_set_channels(struct net_device *dev,
 	if (queue_pairs > vi->max_queue_pairs || queue_pairs == 0)
 		return -EINVAL;
 
+	/* For now we don't support modifying channels while XDP is loaded
+	 * also when XDP is loaded all RX queues have XDP programs so we only
+	 * need to check a single RX queue.
+	 */
+	if (vi->rq[0].xdp_prog)
+		return -EINVAL;
+
 	get_online_cpus();
 	err = virtnet_set_queues(vi, queue_pairs);
 	if (!err) {
@@ -1454,6 +1547,69 @@ static int virtnet_set_features(struct net_device *netdev,
 	return 0;
 }
 
+static int virtnet_xdp_set(struct net_device *dev, struct bpf_prog *prog)
+{
+	unsigned long int max_sz = PAGE_SIZE - sizeof(struct padded_vnet_hdr);
+	struct virtnet_info *vi = netdev_priv(dev);
+	struct bpf_prog *old_prog;
+	int i;
+
+	if ((dev->features & NETIF_F_LRO) && prog) {
+		netdev_warn(dev, "can't set XDP while LRO is on, disable LRO first\n");
+		return -EINVAL;
+	}
+
+	if (vi->mergeable_rx_bufs && !vi->any_header_sg) {
+		netdev_warn(dev, "XDP expects header/data in single page\n");
+		return -EINVAL;
+	}
+
+	if (dev->mtu > max_sz) {
+		netdev_warn(dev, "XDP requires MTU less than %lu\n", max_sz);
+		return -EINVAL;
+	}
+
+	if (prog) {
+		prog = bpf_prog_add(prog, vi->max_queue_pairs - 1);
+		if (IS_ERR(prog))
+			return PTR_ERR(prog);
+	}
+
+	for (i = 0; i < vi->max_queue_pairs; i++) {
+		old_prog = rtnl_dereference(vi->rq[i].xdp_prog);
+		rcu_assign_pointer(vi->rq[i].xdp_prog, prog);
+		if (old_prog)
+			bpf_prog_put(old_prog);
+	}
+
+	return 0;
+}
+
+static bool virtnet_xdp_query(struct net_device *dev)
+{
+	struct virtnet_info *vi = netdev_priv(dev);
+	int i;
+
+	for (i = 0; i < vi->max_queue_pairs; i++) {
+		if (vi->rq[i].xdp_prog)
+			return true;
+	}
+	return false;
+}
+
+static int virtnet_xdp(struct net_device *dev, struct netdev_xdp *xdp)
+{
+	switch (xdp->command) {
+	case XDP_SETUP_PROG:
+		return virtnet_xdp_set(dev, xdp->prog);
+	case XDP_QUERY_PROG:
+		xdp->prog_attached = virtnet_xdp_query(dev);
+		return 0;
+	default:
+		return -EINVAL;
+	}
+}
+
 static const struct net_device_ops virtnet_netdev = {
 	.ndo_open            = virtnet_open,
 	.ndo_stop   	     = virtnet_close,
@@ -1471,6 +1627,7 @@ static int virtnet_set_features(struct net_device *netdev,
 	.ndo_busy_poll		= virtnet_busy_poll,
 #endif
 	.ndo_set_features	= virtnet_set_features,
+	.ndo_xdp		= virtnet_xdp,
 };
 
 static void virtnet_config_changed_work(struct work_struct *work)
@@ -1532,12 +1689,20 @@ static void virtnet_free_queues(struct virtnet_info *vi)
 
 static void free_receive_bufs(struct virtnet_info *vi)
 {
+	struct bpf_prog *old_prog;
 	int i;
 
+	rtnl_lock();
 	for (i = 0; i < vi->max_queue_pairs; i++) {
 		while (vi->rq[i].pages)
 			__free_pages(get_a_page(&vi->rq[i], GFP_KERNEL), 0);
+
+		old_prog = rtnl_dereference(vi->rq[i].xdp_prog);
+		RCU_INIT_POINTER(vi->rq[i].xdp_prog, NULL);
+		if (old_prog)
+			bpf_prog_put(old_prog);
 	}
+	rtnl_unlock();
 }
 
 static void free_receive_page_frags(struct virtnet_info *vi)

^ permalink raw reply related

* [net-next PATCH v4 2/6] net: xdp: add invalid buffer warning
From: John Fastabend @ 2016-12-02 20:50 UTC (permalink / raw)
  To: daniel, mst, shm, davem, tgraf, alexei.starovoitov
  Cc: john.r.fastabend, netdev, bblanco, john.fastabend, brouer
In-Reply-To: <20161202204804.4331.61904.stgit@john-Precision-Tower-5810>

This adds a warning for drivers to use when encountering an invalid
buffer for XDP. For normal cases this should not happen but to catch
this in virtual/qemu setups that I may not have expected from the
emulation layer having a standard warning is useful.

Signed-off-by: John Fastabend <john.r.fastabend@intel.com>
---
 include/linux/filter.h |    1 +
 net/core/filter.c      |    6 ++++++
 2 files changed, 7 insertions(+)

diff --git a/include/linux/filter.h b/include/linux/filter.h
index 7f246a2..90dfc3c 100644
--- a/include/linux/filter.h
+++ b/include/linux/filter.h
@@ -595,6 +595,7 @@ int sk_get_filter(struct sock *sk, struct sock_filter __user *filter,
 struct bpf_prog *bpf_patch_insn_single(struct bpf_prog *prog, u32 off,
 				       const struct bpf_insn *patch, u32 len);
 void bpf_warn_invalid_xdp_action(u32 act);
+void bpf_warn_invalid_xdp_buffer(void);
 
 #ifdef CONFIG_BPF_JIT
 extern int bpf_jit_enable;
diff --git a/net/core/filter.c b/net/core/filter.c
index 698a262..7926dd0 100644
--- a/net/core/filter.c
+++ b/net/core/filter.c
@@ -2783,6 +2783,12 @@ void bpf_warn_invalid_xdp_action(u32 act)
 }
 EXPORT_SYMBOL_GPL(bpf_warn_invalid_xdp_action);
 
+void bpf_warn_invalid_xdp_buffer(void)
+{
+	WARN_ONCE(1, "Illegal XDP buffer encountered, expect throughput degradation\n");
+}
+EXPORT_SYMBOL_GPL(bpf_warn_invalid_xdp_buffer);
+
 static u32 sk_filter_convert_ctx_access(enum bpf_access_type type, int dst_reg,
 					int src_reg, int ctx_off,
 					struct bpf_insn *insn_buf,

^ permalink raw reply related

* [net-next PATCH v4 1/6] net: virtio dynamically disable/enable LRO
From: John Fastabend @ 2016-12-02 20:49 UTC (permalink / raw)
  To: daniel, mst, shm, davem, tgraf, alexei.starovoitov
  Cc: john.r.fastabend, netdev, bblanco, john.fastabend, brouer
In-Reply-To: <20161202204804.4331.61904.stgit@john-Precision-Tower-5810>

This adds support for dynamically setting the LRO feature flag. The
message to control guest features in the backend uses the
CTRL_GUEST_OFFLOADS msg type.

Signed-off-by: John Fastabend <john.r.fastabend@intel.com>
---
 drivers/net/virtio_net.c |   45 ++++++++++++++++++++++++++++++++++++++++++++-
 1 file changed, 44 insertions(+), 1 deletion(-)

diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio_net.c
index a21d93a..d814e7cb 100644
--- a/drivers/net/virtio_net.c
+++ b/drivers/net/virtio_net.c
@@ -1419,6 +1419,41 @@ static void virtnet_init_settings(struct net_device *dev)
 	.set_settings = virtnet_set_settings,
 };
 
+static int virtnet_set_features(struct net_device *netdev,
+				netdev_features_t features)
+{
+	struct virtnet_info *vi = netdev_priv(netdev);
+	struct virtio_device *vdev = vi->vdev;
+	struct scatterlist sg;
+	u64 offloads = 0;
+
+	if (features & NETIF_F_LRO)
+		offloads |= (1 << VIRTIO_NET_F_GUEST_TSO4) |
+			    (1 << VIRTIO_NET_F_GUEST_TSO6);
+
+	if (features & NETIF_F_RXCSUM)
+		offloads |= (1 << VIRTIO_NET_F_GUEST_CSUM);
+
+	if (virtio_has_feature(vdev, VIRTIO_NET_F_CTRL_GUEST_OFFLOADS)) {
+		sg_init_one(&sg, &offloads, sizeof(uint64_t));
+		if (!virtnet_send_command(vi,
+					  VIRTIO_NET_CTRL_GUEST_OFFLOADS,
+					  VIRTIO_NET_CTRL_GUEST_OFFLOADS_SET,
+					  &sg)) {
+			dev_warn(&netdev->dev,
+				 "Failed to set guest offloads by virtnet command.\n");
+			return -EINVAL;
+		}
+	} else if (virtio_has_feature(vdev, VIRTIO_NET_F_CTRL_GUEST_OFFLOADS) &&
+		   !virtio_has_feature(vdev, VIRTIO_F_VERSION_1)) {
+		dev_warn(&netdev->dev,
+			 "No support for setting offloads pre version_1.\n");
+		return -EINVAL;
+	}
+
+	return 0;
+}
+
 static const struct net_device_ops virtnet_netdev = {
 	.ndo_open            = virtnet_open,
 	.ndo_stop   	     = virtnet_close,
@@ -1435,6 +1470,7 @@ static void virtnet_init_settings(struct net_device *dev)
 #ifdef CONFIG_NET_RX_BUSY_POLL
 	.ndo_busy_poll		= virtnet_busy_poll,
 #endif
+	.ndo_set_features	= virtnet_set_features,
 };
 
 static void virtnet_config_changed_work(struct work_struct *work)
@@ -1815,6 +1851,12 @@ static int virtnet_probe(struct virtio_device *vdev)
 	if (virtio_has_feature(vdev, VIRTIO_NET_F_GUEST_CSUM))
 		dev->features |= NETIF_F_RXCSUM;
 
+	if (virtio_has_feature(vdev, VIRTIO_NET_F_GUEST_TSO4) &&
+	    virtio_has_feature(vdev, VIRTIO_NET_F_GUEST_TSO6)) {
+		dev->features |= NETIF_F_LRO;
+		dev->hw_features |= NETIF_F_LRO;
+	}
+
 	dev->vlan_features = dev->features;
 
 	/* MTU range: 68 - 65535 */
@@ -2057,7 +2099,8 @@ static int virtnet_restore(struct virtio_device *vdev)
 	VIRTIO_NET_F_CTRL_RX, VIRTIO_NET_F_CTRL_VLAN, \
 	VIRTIO_NET_F_GUEST_ANNOUNCE, VIRTIO_NET_F_MQ, \
 	VIRTIO_NET_F_CTRL_MAC_ADDR, \
-	VIRTIO_NET_F_MTU
+	VIRTIO_NET_F_MTU, \
+	VIRTIO_NET_F_CTRL_GUEST_OFFLOADS
 
 static unsigned int features[] = {
 	VIRTNET_FEATURES,

^ permalink raw reply related

* [net-next PATCH v4 0/6] XDP for virtio_net
From: John Fastabend @ 2016-12-02 20:49 UTC (permalink / raw)
  To: daniel, mst, shm, davem, tgraf, alexei.starovoitov
  Cc: john.r.fastabend, netdev, bblanco, john.fastabend, brouer

This implements virtio_net for the mergeable buffers and big_packet
modes. I tested this with vhost_net running on qemu and did not see
any issues. For testing num_buf > 1 I added a hack to vhost driver
to only use 100 bytes per buffer so that packets were pushed across
multiple buffers.

There are some restrictions for XDP to be enabled and work well
(see patch 3) for more details.

  1. LRO must be off
  2. MTU must be less than PAGE_SIZE
  3. queues must be available to dedicate to XDP
  4. num_bufs received in mergeable buffers must be 1
  5. big_packet mode must have all data on single page

Please review any comments/feedback welcome as always.

---

John Fastabend (6):
      net: virtio dynamically disable/enable LRO
      net: xdp: add invalid buffer warning
      virtio_net: Add XDP support
      virtio_net: add dedicated XDP transmit queues
      virtio_net: add XDP_TX support
      virtio_net: xdp, add slowpath case for non contiguous buffers

 drivers/net/virtio_net.c |  376 +++++++++++++++++++++++++++++++++++++++++++++-
 include/linux/filter.h   |    1 
 net/core/filter.c        |    6 +
 3 files changed, 377 insertions(+), 6 deletions(-)

^ permalink raw reply

* Re: [PATCH] net: wireless: realtek: constify rate_control_ops structures
From: Larry Finger @ 2016-12-02 20:39 UTC (permalink / raw)
  To: Bhumika Goyal, julia.lawall, chaoming_li, kvalo, linux-wireless,
	netdev, linux-kernel
In-Reply-To: <1480672254-4986-1-git-send-email-bhumirks@gmail.com>

On 12/02/2016 03:50 AM, Bhumika Goyal wrote:
> The structures rate_control_ops are only passed as an argument to the
> functions ieee80211_rate_control_{register/unregister}. This argument is
> of type const, so rate_control_ops having this property can also be
> declared as const.
> Done using Coccinelle:
>
> @r1 disable optional_qualifier @
> identifier i;
> position p;
> @@
> static struct rate_control_ops i@p = {...};
>
> @ok1@
> identifier r1.i;
> position p;
> @@
> ieee80211_rate_control_register(&i@p)
>
> @ok2@
> identifier r1.i;
> position p;
> @@
> ieee80211_rate_control_unregister(&i@p)
>
> @bad@
> position p!={r1.p,ok1.p,ok2.p};
> identifier r1.i;
> @@
> i@p
>
> @depends on !bad disable optional_qualifier@
> identifier r1.i;
> @@
> static
> +const
> struct rate_control_ops i={...};
>
> @depends on !bad disable optional_qualifier@
> identifier r1.i;
> @@
> +const
> struct rate_control_ops i;
>
> File size before:
>    text	   data	    bss	    dec	    hex	filename
>    1991	    104	      0	   2095	    82f wireless/realtek/rtlwifi/rc.o
>
> File size after:
>    text	   data	    bss	    dec	    hex	filename
>    2095	      0	      0	   2095	    wireless/realtek/rtlwifi/rc.o
>
> Signed-off-by: Bhumika Goyal <bhumirks@gmail.com>
> ---
>  drivers/net/wireless/realtek/rtlwifi/rc.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
>
> diff --git a/drivers/net/wireless/realtek/rtlwifi/rc.c b/drivers/net/wireless/realtek/rtlwifi/rc.c
> index ce8621a..107c13c 100644
> --- a/drivers/net/wireless/realtek/rtlwifi/rc.c
> +++ b/drivers/net/wireless/realtek/rtlwifi/rc.c
> @@ -284,7 +284,7 @@ static void rtl_rate_free_sta(void *rtlpriv,
>  	kfree(rate_priv);
>  }
>
> -static struct rate_control_ops rtl_rate_ops = {
> +static const struct rate_control_ops rtl_rate_ops = {
>  	.name = "rtl_rc",
>  	.alloc = rtl_rate_alloc,
>  	.free = rtl_rate_free,
>

The content of your patch is OK; however, your subject is not. By convention, 
"net: wireless: realtek:" is assumed. We do, however, include "rtlwifi:" to 
indicate which part of drivers/net/wireless/realtek/ is referenced.

NACK

Larry

^ permalink raw reply

* Re: [PATCH net-next 2/2] net/sched: cls_flower: Support matching on ICMP type and code
From: Simon Horman @ 2016-12-02 20:33 UTC (permalink / raw)
  To: Jiri Pirko
  Cc: David Miller, netdev, Jay Vosburgh, Veaceslav Falico,
	Andy Gospodarek, Jamal Hadi Salim, Jiri Pirko
In-Reply-To: <20161202191712.GA32484@penelope.horms.nl>

Hi Jiri,

On Fri, Dec 02, 2016 at 08:17:13PM +0100, Simon Horman wrote:
> On Fri, Dec 02, 2016 at 07:38:48PM +0100, Jiri Pirko wrote:
> > Fri, Dec 02, 2016 at 07:05:51PM CET, simon.horman@netronome.com wrote:
> > >Support matching on ICMP type and code.

...

> > This hunk looks like it should be squashed to the previous patch.
> 
> I included it in this patch as it is where these helpers are used
> for the first time. I can shuffle it into the first patch if you prefer;
> I agree it does make sense to put all the dissector changes there.

I moved things around as you suggested and posted v2.

^ permalink raw reply

page: next (older) | prev (newer) | latest
- recent:[subjects (threaded)|topics (new)|topics (active)]

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox