Netdev List

Netdev List
 help / color / mirror / Atom feed

* Re: [PATCH net-next 2/6] net: dsa: mv88e6xxx: add helper to hardware reset
From: Andrew Lunn @ 2016-11-30 23:21 UTC (permalink / raw)
  To: Vivien Didelot
  Cc: netdev, linux-kernel, kernel, David S. Miller, Florian Fainelli
In-Reply-To: <20161130225930.25510-3-vivien.didelot@savoirfairelinux.com>

On Wed, Nov 30, 2016 at 05:59:26PM -0500, Vivien Didelot wrote:
> Add an helper to toggle the eventual GPIO connected to the reset pin.
> 
> Signed-off-by: Vivien Didelot <vivien.didelot@savoirfairelinux.com>

Reviewed-by: Andrew Lunn <andrew@lunn.ch>

    Andrew

^ permalink raw reply

* Re: [PATCH net-next 3/6] net: dsa: mv88e6xxx: add a software reset op
From: Andrew Lunn @ 2016-11-30 23:26 UTC (permalink / raw)
  To: Vivien Didelot
  Cc: netdev, linux-kernel, kernel, David S. Miller, Florian Fainelli
In-Reply-To: <20161130225930.25510-4-vivien.didelot@savoirfairelinux.com>

> diff --git a/drivers/net/dsa/mv88e6xxx/mv88e6xxx.h b/drivers/net/dsa/mv88e6xxx/mv88e6xxx.h
> index ab52c37..9e51405 100644
> --- a/drivers/net/dsa/mv88e6xxx/mv88e6xxx.h
> +++ b/drivers/net/dsa/mv88e6xxx/mv88e6xxx.h
> @@ -765,6 +765,9 @@ struct mv88e6xxx_ops {
>  	int (*phy_write)(struct mv88e6xxx_chip *chip, int addr, int reg,
>  			 u16 val);
>  
> +	/* Switch Software Reset */
> +	int (*reset)(struct mv88e6xxx_chip *chip);
> +

Hi Vivien

In my huge patch series of 6390, i've been using a g1_ prefix for
functionality which is in global 1, g2_ for global 2, etc.  This has
worked for everything so far with the exception of setting which
reserved MAC addresses should be sent to the CPU. Most devices have it
in g2, but 6390 has it in g1.

Please could you add the prefix.

       Andrew

^ permalink raw reply

* Re: [PATCH net-next 5/6] net: dsa: mv88e6xxx: add helper for switch ready
From: Andrew Lunn @ 2016-11-30 23:38 UTC (permalink / raw)
  To: Vivien Didelot
  Cc: netdev, linux-kernel, kernel, David S. Miller, Florian Fainelli
In-Reply-To: <20161130225930.25510-6-vivien.didelot@savoirfairelinux.com>

> +static int mv88e6xxx_wait_switch_ready(struct mv88e6xxx_chip *chip)
> +{
> +	const unsigned long timeout = jiffies + 1 * HZ;
> +	bool ready;
> +	int err;
> +
> +	/* Wait up to 1 second for switch to be ready.
> +	 * The switch is ready when all units inside the device (ATU, VTU, etc.)
> +	 * have finished their initialization and are ready to accept frames.
> +	 */
> +	while (time_before(jiffies, timeout)) {
> +		err = mv88e6xxx_g1_init_ready(chip, &ready);
> +		if (err)
> +			return err;
> +
> +		if (ready)
> +			break;
> +
> +		usleep_range(1000, 2000);
> +	}
> +
> +	if (time_after(jiffies, timeout))
> +		return -ETIMEDOUT;

As we have seen in the past, this sort of loop is broken if we end up
sleeping for a long time. Please take the opportunity to replace it
with one of our _wait() helpers, e.g. mv88e6xxx_g1_wait()

> +int mv88e6xxx_g1_init_ready(struct mv88e6xxx_chip *chip, bool *ready)
> +{
> +	u16 val;
> +	int err;
> +
> +	/* Check the value of the InitReady bit 11 */
> +	err = mv88e6xxx_g1_read(chip, GLOBAL_STATUS, &val);
> +	if (err)
> +		return err;
> +
> +	*ready = !!(val & GLOBAL_STATUS_INIT_READY);

I would actually do the wait here.

> +
> +	return 0;
> +}
> +

Thanks

  Andrew

^ permalink raw reply

* Re: [PATCH net-next v4 3/4] bpf: BPF for lightweight tunnel infrastructure
From: Alexei Starovoitov @ 2016-11-30 23:43 UTC (permalink / raw)
  To: Thomas Graf; +Cc: davem, netdev, daniel, tom, roopa, hannes
In-Reply-To: <950bbc4f82150683dd87e26dbd41412c26a38eba.1480522144.git.tgraf@suug.ch>

On Wed, Nov 30, 2016 at 05:10:10PM +0100, Thomas Graf wrote:
> Registers new BPF program types which correspond to the LWT hooks:
>   - BPF_PROG_TYPE_LWT_IN   => dst_input()
>   - BPF_PROG_TYPE_LWT_OUT  => dst_output()
>   - BPF_PROG_TYPE_LWT_XMIT => lwtunnel_xmit()
> 
> The separate program types are required to differentiate between the
> capabilities each LWT hook allows:
> 
>  * Programs attached to dst_input() or dst_output() are restricted and
>    may only read the data of an skb. This prevent modification and
>    possible invalidation of already validated packet headers on receive
>    and the construction of illegal headers while the IP headers are
>    still being assembled.
> 
>  * Programs attached to lwtunnel_xmit() are allowed to modify packet
>    content as well as prepending an L2 header via a newly introduced
>    helper bpf_skb_change_head(). This is safe as lwtunnel_xmit() is
>    invoked after the IP header has been assembled completely.
> 
> All BPF programs receive an skb with L3 headers attached and may return
> one of the following error codes:
> 
>  BPF_OK - Continue routing as per nexthop
>  BPF_DROP - Drop skb and return EPERM
>  BPF_REDIRECT - Redirect skb to device as per redirect() helper.
>                 (Only valid in lwtunnel_xmit() context)
> 
> The return codes are binary compatible with their TC_ACT_
> relatives to ease compatibility.
> 
> Signed-off-by: Thomas Graf <tgraf@suug.ch>

Looks great.

Acked-by: Alexei Starovoitov <ast@kernel.org>

^ permalink raw reply

* Re: [PATCH net-next v4 4/4] bpf: Add tests and samples for LWT-BPF
From: Alexei Starovoitov @ 2016-11-30 23:46 UTC (permalink / raw)
  To: Thomas Graf; +Cc: davem, netdev, daniel, tom, roopa, hannes
In-Reply-To: <81603830e3684a9a48d5c18687e47a9a0df6b1bf.1480522144.git.tgraf@suug.ch>

On Wed, Nov 30, 2016 at 05:10:11PM +0100, Thomas Graf wrote:
> Adds a series of tests to verify the functionality of attaching
> BPF programs at LWT hooks.
> 
> Also adds a sample which collects a histogram of packet sizes which
> pass through an LWT hook.
> 
> $ ./lwt_len_hist.sh
> Starting netserver with host 'IN(6)ADDR_ANY' port '12865' and family AF_UNSPEC
> MIGRATED TCP STREAM TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to 192.168.253.2 () port 0 AF_INET : demo
> Recv   Send    Send
> Socket Socket  Message  Elapsed
> Size   Size    Size     Time     Throughput
> bytes  bytes   bytes    secs.    10^6bits/sec
> 
>  87380  16384  16384    10.00    39857.69
>        1 -> 1        : 0        |                                      |
>        2 -> 3        : 0        |                                      |
>        4 -> 7        : 0        |                                      |
>        8 -> 15       : 0        |                                      |
>       16 -> 31       : 0        |                                      |
>       32 -> 63       : 22       |                                      |
>       64 -> 127      : 98       |                                      |
>      128 -> 255      : 213      |                                      |
>      256 -> 511      : 1444251  |********                              |
>      512 -> 1023     : 660610   |***                                   |
>     1024 -> 2047     : 535241   |**                                    |
>     2048 -> 4095     : 19       |                                      |
>     4096 -> 8191     : 180      |                                      |
>     8192 -> 16383    : 5578023  |************************************* |
>    16384 -> 32767    : 632099   |***                                   |
>    32768 -> 65535    : 6575     |                                      |
> 
> Signed-off-by: Thomas Graf <tgraf@suug.ch>
> ---
>  samples/bpf/Makefile            |   4 +
>  samples/bpf/bpf_helpers.h       |   4 +
>  samples/bpf/lwt_len_hist.sh     |  37 ++++
>  samples/bpf/lwt_len_hist_kern.c |  82 +++++++++
>  samples/bpf/lwt_len_hist_user.c |  76 ++++++++
>  samples/bpf/test_lwt_bpf.c      | 253 +++++++++++++++++++++++++
>  samples/bpf/test_lwt_bpf.sh     | 399 ++++++++++++++++++++++++++++++++++++++++
>  7 files changed, 855 insertions(+)
>  create mode 100755 samples/bpf/lwt_len_hist.sh
>  create mode 100644 samples/bpf/lwt_len_hist_kern.c
>  create mode 100644 samples/bpf/lwt_len_hist_user.c
>  create mode 100644 samples/bpf/test_lwt_bpf.c
>  create mode 100755 samples/bpf/test_lwt_bpf.sh
...
> +static inline void __fill_garbage(struct __sk_buff *skb)
> +{
> +	uint64_t f = 0xFFFFFFFFFFFFFFFF;
> +
> +	bpf_skb_store_bytes(skb, 0, &f, sizeof(f), 0);
> +	bpf_skb_store_bytes(skb, 8, &f, sizeof(f), 0);
> +	bpf_skb_store_bytes(skb, 16, &f, sizeof(f), 0);
> +	bpf_skb_store_bytes(skb, 24, &f, sizeof(f), 0);
> +	bpf_skb_store_bytes(skb, 32, &f, sizeof(f), 0);
> +	bpf_skb_store_bytes(skb, 40, &f, sizeof(f), 0);
> +	bpf_skb_store_bytes(skb, 48, &f, sizeof(f), 0);
> +	bpf_skb_store_bytes(skb, 56, &f, sizeof(f), 0);
> +	bpf_skb_store_bytes(skb, 64, &f, sizeof(f), 0);
> +	bpf_skb_store_bytes(skb, 72, &f, sizeof(f), 0);
> +	bpf_skb_store_bytes(skb, 80, &f, sizeof(f), 0);
> +	bpf_skb_store_bytes(skb, 88, &f, sizeof(f), 0);
> +}
> +
> +SEC("fill_garbage")
> +int do_fill_garbage(struct __sk_buff *skb)
> +{
> +	__fill_garbage(skb);
> +	printk("Set initial 96 bytes of header to FF\n");
> +	return BPF_OK;
> +}
> +
> +SEC("fill_garbage_and_redirect")
> +int do_fill_garbage_and_redirect(struct __sk_buff *skb)
> +{
> +	int ifindex = DST_IFINDEX;
> +	__fill_garbage(skb);
> +	printk("redirected to %d\n", ifindex);
> +	return bpf_redirect(ifindex, 0);
> +}

Two 'garbage' tests! Pure garbage ;)

Acked-by: Alexei Starovoitov <ast@kernel.org>

^ permalink raw reply

* [PATCH net-next v3] ipv6 addrconf: Implemented enhanced DAD (RFC7527)
From: Erik Nordmark @ 2016-11-30 23:39 UTC (permalink / raw)
  To: davem; +Cc: netdev, hannes, Erik Nordmark, Bob Gilligan

Implemented RFC7527 Enhanced DAD.
IPv6 duplicate address detection can fail if there is some temporary
loopback of Ethernet frames. RFC7527 solves this by including a random
nonce in the NS messages used for DAD, and if an NS is received with the
same nonce it is assumed to be a looped back DAD probe and is ignored.
RFC7527 is enabled by default. Can be disabled by setting both of
conf/{all,interface}/enhanced_dad to zero.

Signed-off-by: Erik Nordmark <nordmark@arista.com>
Signed-off-by: Bob Gilligan <gilligan@arista.com>
---

v2: renamed sysctl and made it default to true, plus minor code review fixes
v3: respun with later net-next; fixed whitespace issues

 Documentation/networking/ip-sysctl.txt |  9 +++++++++
 include/linux/ipv6.h                   |  1 +
 include/net/if_inet6.h                 |  1 +
 include/net/ndisc.h                    |  5 ++++-
 include/uapi/linux/ipv6.h              |  1 +
 net/ipv6/addrconf.c                    | 22 +++++++++++++++++++++-
 net/ipv6/ndisc.c                       | 31 ++++++++++++++++++++++++++++---
 7 files changed, 65 insertions(+), 5 deletions(-)

diff --git a/Documentation/networking/ip-sysctl.txt b/Documentation/networking/ip-sysctl.txt
index 5af48dd..d9ef566 100644
--- a/Documentation/networking/ip-sysctl.txt
+++ b/Documentation/networking/ip-sysctl.txt
@@ -1729,6 +1729,15 @@ drop_unsolicited_na - BOOLEAN
 
 	By default this is turned off.
 
+enhanced_dad - BOOLEAN
+	Include a nonce option in the IPv6 neighbor solicitation messages used for
+	duplicate address detection per RFC7527. A received DAD NS will only signal
+	a duplicate address if the nonce is different. This avoids any false
+	detection of duplicates due to loopback of the NS messages that we send.
+	The nonce option will be sent on an interface unless both of
+	conf/{all,interface}/enhanced_dad are set to FALSE.
+	Default: TRUE
+
 icmp/*:
 ratelimit - INTEGER
 	Limit the maximal rates for sending ICMPv6 packets.
diff --git a/include/linux/ipv6.h b/include/linux/ipv6.h
index 3f95233..671d014 100644
--- a/include/linux/ipv6.h
+++ b/include/linux/ipv6.h
@@ -68,6 +68,7 @@ struct ipv6_devconf {
 #ifdef CONFIG_IPV6_SEG6_HMAC
 	__s32		seg6_require_hmac;
 #endif
+	__u32		enhanced_dad;
 
 	struct ctl_table_header *sysctl_header;
 };
diff --git a/include/net/if_inet6.h b/include/net/if_inet6.h
index b0576cb..0fa4c32 100644
--- a/include/net/if_inet6.h
+++ b/include/net/if_inet6.h
@@ -55,6 +55,7 @@ struct inet6_ifaddr {
 	__u8			stable_privacy_retry;
 
 	__u16			scope;
+	__u64			dad_nonce;
 
 	unsigned long		cstamp;	/* created timestamp */
 	unsigned long		tstamp; /* updated timestamp */
diff --git a/include/net/ndisc.h b/include/net/ndisc.h
index be1fe228..d562a2f 100644
--- a/include/net/ndisc.h
+++ b/include/net/ndisc.h
@@ -31,6 +31,7 @@ enum {
 	ND_OPT_PREFIX_INFO = 3,		/* RFC2461 */
 	ND_OPT_REDIRECT_HDR = 4,	/* RFC2461 */
 	ND_OPT_MTU = 5,			/* RFC2461 */
+	ND_OPT_NONCE = 14,              /* RFC7527 */
 	__ND_OPT_ARRAY_MAX,
 	ND_OPT_ROUTE_INFO = 24,		/* RFC4191 */
 	ND_OPT_RDNSS = 25,		/* RFC5006 */
@@ -121,6 +122,7 @@ struct ndisc_options {
 #define nd_opts_pi_end			nd_opt_array[__ND_OPT_PREFIX_INFO_END]
 #define nd_opts_rh			nd_opt_array[ND_OPT_REDIRECT_HDR]
 #define nd_opts_mtu			nd_opt_array[ND_OPT_MTU]
+#define nd_opts_nonce			nd_opt_array[ND_OPT_NONCE]
 #define nd_802154_opts_src_lladdr	nd_802154_opt_array[ND_OPT_SOURCE_LL_ADDR]
 #define nd_802154_opts_tgt_lladdr	nd_802154_opt_array[ND_OPT_TARGET_LL_ADDR]
 
@@ -398,7 +400,8 @@ static inline struct neighbour *__ipv6_neigh_lookup(struct net_device *dev, cons
 int ndisc_rcv(struct sk_buff *skb);
 
 void ndisc_send_ns(struct net_device *dev, const struct in6_addr *solicit,
-		   const struct in6_addr *daddr, const struct in6_addr *saddr);
+		   const struct in6_addr *daddr, const struct in6_addr *saddr,
+		   u64 nonce);
 
 void ndisc_send_rs(struct net_device *dev,
 		   const struct in6_addr *saddr, const struct in6_addr *daddr);
diff --git a/include/uapi/linux/ipv6.h b/include/uapi/linux/ipv6.h
index 53561be..eaf65dc 100644
--- a/include/uapi/linux/ipv6.h
+++ b/include/uapi/linux/ipv6.h
@@ -181,6 +181,7 @@ enum {
 	DEVCONF_RTR_SOLICIT_MAX_INTERVAL,
 	DEVCONF_SEG6_ENABLED,
 	DEVCONF_SEG6_REQUIRE_HMAC,
+	DEVCONF_ENHANCED_DAD,
 	DEVCONF_MAX
 };
 
diff --git a/net/ipv6/addrconf.c b/net/ipv6/addrconf.c
index 4c387dc..c1e124b 100644
--- a/net/ipv6/addrconf.c
+++ b/net/ipv6/addrconf.c
@@ -242,6 +242,7 @@ static bool ipv6_chk_same_addr(struct net *net, const struct in6_addr *addr,
 #ifdef CONFIG_IPV6_SEG6_HMAC
 	.seg6_require_hmac	= 0,
 #endif
+	.enhanced_dad           = 1,
 };
 
 static struct ipv6_devconf ipv6_devconf_dflt __read_mostly = {
@@ -292,6 +293,7 @@ static bool ipv6_chk_same_addr(struct net *net, const struct in6_addr *addr,
 #ifdef CONFIG_IPV6_SEG6_HMAC
 	.seg6_require_hmac	= 0,
 #endif
+	.enhanced_dad           = 1,
 };
 
 /* Check if a valid qdisc is available */
@@ -3735,12 +3737,21 @@ static void addrconf_dad_kick(struct inet6_ifaddr *ifp)
 {
 	unsigned long rand_num;
 	struct inet6_dev *idev = ifp->idev;
+	u64 nonce;
 
 	if (ifp->flags & IFA_F_OPTIMISTIC)
 		rand_num = 0;
 	else
 		rand_num = prandom_u32() % (idev->cnf.rtr_solicit_delay ? : 1);
 
+	nonce = 0;
+	if (idev->cnf.enhanced_dad ||
+	    dev_net(idev->dev)->ipv6.devconf_all->enhanced_dad) {
+		do
+			get_random_bytes(&nonce, 6);
+		while (nonce == 0);
+	}
+	ifp->dad_nonce = nonce;
 	ifp->dad_probes = idev->cnf.dad_transmits;
 	addrconf_mod_dad_work(ifp, rand_num);
 }
@@ -3918,7 +3929,8 @@ static void addrconf_dad_work(struct work_struct *w)
 
 	/* send a neighbour solicitation for our addr */
 	addrconf_addr_solict_mult(&ifp->addr, &mcaddr);
-	ndisc_send_ns(ifp->idev->dev, &ifp->addr, &mcaddr, &in6addr_any);
+	ndisc_send_ns(ifp->idev->dev, &ifp->addr, &mcaddr, &in6addr_any,
+		      ifp->dad_nonce);
 out:
 	in6_ifa_put(ifp);
 	rtnl_unlock();
@@ -4962,6 +4974,7 @@ static inline void ipv6_store_devconf(struct ipv6_devconf *cnf,
 #ifdef CONFIG_IPV6_SEG6_HMAC
 	array[DEVCONF_SEG6_REQUIRE_HMAC] = cnf->seg6_require_hmac;
 #endif
+	array[DEVCONF_ENHANCED_DAD] = cnf->enhanced_dad;
 }
 
 static inline size_t inet6_ifla6_size(void)
@@ -6070,6 +6083,13 @@ int addrconf_sysctl_ignore_routes_with_linkdown(struct ctl_table *ctl,
 	},
 #endif
 	{
+		.procname       = "enhanced_dad",
+		.data           = &ipv6_devconf.enhanced_dad,
+		.maxlen         = sizeof(int),
+		.mode           = 0644,
+		.proc_handler   = proc_dointvec,
+	},
+	{
 		/* sentinel */
 	}
 };
diff --git a/net/ipv6/ndisc.c b/net/ipv6/ndisc.c
index d8e6714..eb35f73 100644
--- a/net/ipv6/ndisc.c
+++ b/net/ipv6/ndisc.c
@@ -233,6 +233,7 @@ struct ndisc_options *ndisc_parse_options(const struct net_device *dev,
 		case ND_OPT_SOURCE_LL_ADDR:
 		case ND_OPT_TARGET_LL_ADDR:
 		case ND_OPT_MTU:
+		case ND_OPT_NONCE:
 		case ND_OPT_REDIRECT_HDR:
 			if (ndopts->nd_opt_array[nd_opt->nd_opt_type]) {
 				ND_PRINTK(2, warn,
@@ -568,7 +569,8 @@ static void ndisc_send_unsol_na(struct net_device *dev)
 }
 
 void ndisc_send_ns(struct net_device *dev, const struct in6_addr *solicit,
-		   const struct in6_addr *daddr, const struct in6_addr *saddr)
+		   const struct in6_addr *daddr, const struct in6_addr *saddr,
+		   u64 nonce)
 {
 	struct sk_buff *skb;
 	struct in6_addr addr_buf;
@@ -588,6 +590,8 @@ void ndisc_send_ns(struct net_device *dev, const struct in6_addr *solicit,
 	if (inc_opt)
 		optlen += ndisc_opt_addr_space(dev,
 					       NDISC_NEIGHBOUR_SOLICITATION);
+	if (nonce != 0)
+		optlen += 8;
 
 	skb = ndisc_alloc_skb(dev, sizeof(*msg) + optlen);
 	if (!skb)
@@ -605,6 +609,13 @@ void ndisc_send_ns(struct net_device *dev, const struct in6_addr *solicit,
 		ndisc_fill_addr_option(skb, ND_OPT_SOURCE_LL_ADDR,
 				       dev->dev_addr,
 				       NDISC_NEIGHBOUR_SOLICITATION);
+	if (nonce != 0) {
+		u8 *opt = skb_put(skb, 8);
+
+		opt[0] = ND_OPT_NONCE;
+		opt[1] = 8 >> 3;
+		memcpy(opt + 2, &nonce, 6);
+	}
 
 	ndisc_send_skb(skb, daddr, saddr);
 }
@@ -693,12 +704,12 @@ static void ndisc_solicit(struct neighbour *neigh, struct sk_buff *skb)
 				  "%s: trying to ucast probe in NUD_INVALID: %pI6\n",
 				  __func__, target);
 		}
-		ndisc_send_ns(dev, target, target, saddr);
+		ndisc_send_ns(dev, target, target, saddr, 0);
 	} else if ((probes -= NEIGH_VAR(neigh->parms, APP_PROBES)) < 0) {
 		neigh_app_ns(neigh);
 	} else {
 		addrconf_addr_solict_mult(target, &mcaddr);
-		ndisc_send_ns(dev, target, &mcaddr, saddr);
+		ndisc_send_ns(dev, target, &mcaddr, saddr, 0);
 	}
 }
 
@@ -742,6 +753,7 @@ static void ndisc_recv_ns(struct sk_buff *skb)
 	int dad = ipv6_addr_any(saddr);
 	bool inc;
 	int is_router = -1;
+	u64 nonce = 0;
 
 	if (skb->len < sizeof(struct nd_msg)) {
 		ND_PRINTK(2, warn, "NS: packet too short\n");
@@ -786,6 +798,8 @@ static void ndisc_recv_ns(struct sk_buff *skb)
 			return;
 		}
 	}
+	if (ndopts.nd_opts_nonce)
+		memcpy(&nonce, (u8 *)(ndopts.nd_opts_nonce + 1), 6);
 
 	inc = ipv6_addr_is_multicast(daddr);
 
@@ -794,6 +808,17 @@ static void ndisc_recv_ns(struct sk_buff *skb)
 have_ifp:
 		if (ifp->flags & (IFA_F_TENTATIVE|IFA_F_OPTIMISTIC)) {
 			if (dad) {
+				if (nonce != 0 && ifp->dad_nonce == nonce) {
+					u8 *np = (u8 *)&nonce;
+					/* Matching nonce if looped back */
+					ND_PRINTK(2, notice,
+						  "%s: IPv6 DAD loopback for address %pI6c nonce %02x:%02x:%02x:%02x:%02x:%02x ignored\n",
+						  ifp->idev->dev->name,
+						  &ifp->addr,
+						  np[0], np[1], np[2], np[3],
+						  np[4], np[5]);
+					goto out;
+				}
 				/*
 				 * We are colliding with another node
 				 * who is doing DAD
-- 
1.8.1.4

^ permalink raw reply related

* Package has been sent.
From: Linda Guo @ 2016-11-30 23:52 UTC (permalink / raw)


[-- Attachment #1: Type: text/plain, Size: 1416 bytes --]

Your shipment(s) is scheduled for delivery

Scheduled Delivery Date: 12/02/2016

Shipper: Chambers Group.

Kindly view the attached document for shipment/delivery details and
tracking procedure. You can also request a delivery change (e.g.
reschedule or reroute) from the tracking detail.

Approximate Delivery Time: between 3:00 PM and 7:00 PM DHL Service:
DHL 2nd Day Air.

We are pleased to provide you with delivery that fits your life.
----------------------------------------------------------------------------------------------------------------------------------------------------------

© 2014 Parcel Service of the World. DHL, the DHL brandmark, and the
color brown are trademarks of DHL Service of America, Inc. All rights
reserved. All trademarks, trade names, or service marks that appear in
connection with DHL services are the property of their respective
owners. For more information on DHL's privacy practices, refer to the
DHL Privacy Notice. Please do not reply directly to this e-mail. DHL
will not receive any reply message. For questions or comments, contact
DHL.

Privacy Notice: This communication contains proprietary information
and may be confidential. If you are not the intended recipient, the
reading, copying, disclosure or other use of the contents of this
e-mail is strictly prohibited and you are instructed to please delete
this e-mail immediately.

[-- Attachment #2: DHL.pdf --]
[-- Type: application/pdf, Size: 54297 bytes --]

^ permalink raw reply

* Re: [PATCH net] tipc: check minimum bearer MTU
From: Ben Hutchings @ 2016-12-01  0:05 UTC (permalink / raw)
  To: Michal Kubecek, Jon Maloy, Ying Xue
  Cc: David S. Miller, tipc-discussion, netdev, linux-kernel,
	Qian Zhang
In-Reply-To: <20161130102424.GC19119@unicorn.suse.cz>

[-- Attachment #1: Type: text/plain, Size: 1488 bytes --]

On Wed, 2016-11-30 at 11:24 +0100, Michal Kubecek wrote:
> On Wed, Nov 30, 2016 at 10:57:02AM +0100, Michal Kubecek wrote:
> > Qian Zhang (张谦) reported a potential socket buffer overflow in
> > tipc_msg_build() which is also known as CVE-2016-8632: due to
> > insufficient checks, a buffer overflow can occur if MTU is too short for
> > even tipc headers. As anyone can set device MTU in a user/net namespace,
> > this issue can be abused by a regular user.
> > 
> > As agreed in the discussion on Ben Hutchings' original patch, we should
> > check the MTU at the moment a bearer is attached rather than for each
> > processed packet. We also need to repeat the check when bearer MTU is
> > adjusted to new device MTU. UDP case also needs a check to avoid
> > overflow when calculating bearer MTU.
> > 
> > Fixes: b97bf3fd8f6a ("[TIPC] Initial merge")
> > Signed-off-by: Michal Kubecek <mkubecek@suse.cz>
> > Reported-by: Qian Zhang (张谦) <zhangqian-c@360.cn>
> 
> Self-NACK.
> 
> Im sorry, while testing this, I overlooked that an attempt to change
> MTU of an underlying device to low value issues a warning but it
> succeeds anyway.
[...]

I'm not sure that TIPC should block the MTU change, anyway.  For IPv4
and IPv6 we disable the protocol on a device if its MTU is reduced
below the minimum.  I think TIPC should behave the same way.

Ben.

-- 
Ben Hutchings
Never attribute to conspiracy what can adequately be explained by
stupidity.


[-- Attachment #2: This is a digitally signed message part --]
[-- Type: application/pgp-signature, Size: 833 bytes --]

^ permalink raw reply

* Re: [PATCH net] fib_trie: Avoid expensive update of suffix length when not required
From: Robert Shearman @ 2016-12-01  0:27 UTC (permalink / raw)
  To: Alexander Duyck; +Cc: David Miller, Netdev, Alexander Duyck
In-Reply-To: <CAKgT0UcK-+t3DK=TtxY14y6XB8nqtgdv57E53QkOAjmdB36-vw@mail.gmail.com>

On 29/11/16 23:14, Alexander Duyck wrote:
> On Tue, Nov 29, 2016 at 9:46 AM, Robert Shearman <rshearma@brocade.com> wrote:
>> With certain distributions of routes it can take a long time to add
>> and delete further routes at scale. For example, with a route for
>> 10.37.96.0/20 present it takes 47s to add ~200k contiguous /24 routes
>> from 8.0.0.0/24 through to 11.138.207.0/24. Perf shows the bottleneck
>> is update_suffix:
>>
>>       40.39%  [kernel]                      [k] update_suffix
>>        8.02%  libnl-3.so.200.19.0           [.] nl_hash_table_lookup
>
> Well yeah, it will be expensive when you have over 512K entries in a
> single node.  I'm assuming that is the case based on the fact that
> 200K routes will double up in the trie node due to broadcast and the
> route ending up in adjacent key vectors.

The example scenario was in fact using a large scale of just routes 
rather than addresses so there are no broadcast leafs (nor local leafs):

         +-- 8.0.0.0/6 18 2 52436 suffix/20
            |-- 8.0.0.0
               /24 universe UNICAST
            |-- 8.0.1.0
               /24 universe UNICAST
            |-- 8.0.2.0
               /24 universe UNICAST

(the "suffix/20" is for test purposes as per below). In this case the 
8.0.0.0/6 node has a child array of size 2^18 = 262144.

>
>> With these changes, the time is reduced to 4s for the same scale and
>> distribution of routes.
>>
>> The issue is that update_suffix does an O(n) walk on the children of a
>> node and the with a dense distribtion of routes the number of children
>> in a node tends towards the number of nodes in the tree.
>
> Other than the performance I was just wondering if you did any other
> validation to confirm that nothing else differs.  Basically it would
> just be a matter of verifying that /proc/net/fib_trie is the same
> before and after your patch.

Yes, to verify these changes I applied some local changes to dump the 
slen field of trie nodes from /proc/net/fib_trie. I presumed that the 
format of /proc/net/fib_trie is a public API that we can't break so I 
intend to submit this as a patch. I verified by inspection that the 
suffix length listed in the nodes was as expected when exercising the 
insert and remove functions for routes with both larger and smaller 
suffixes than in the current nodes, and then for the inflate, halve and 
flush cases.

I've now verified that a diff of /proc/net/fib_trie before and after my 
patch with all the routes added of my example scenario shows no changes.

>
...
>> Fixes: 5405afd1a306 ("fib_trie: Add tracking value for suffix length")
>> Signed-off-by: Robert Shearman <rshearma@brocade.com>
>
> So I am not entirely convinced this is a regression, I was wondering
> what your numbers looked like before the patch you are saying this
> fixes?  I recall I fixed a number of issues leading up to this so I am
> just curious.

At commit 21d1f11db0e2f20a549ad8322879507850544670 ("fib_trie: Remove 
checks for index >= tnode_child_length from tnode_get_child") which is 
the parent of 5405afd1a306:

$ time sudo ip route restore < ~/allroutes
RTNETLINK answers: File exists
RTNETLINK answers: File exists
RTNETLINK answers: File exists
RTNETLINK answers: File exists

real	0m3.451s
user	0m0.184s
sys	0m2.004s


At commit 5405afd1a30620de8601436bae739c67c0bc7324 ("fib_trie: Add 
tracking value for suffix length"):

$ time sudo ip route restore < ~/allroutes
RTNETLINK answers: File exists
RTNETLINK answers: File exists
RTNETLINK answers: File exists
RTNETLINK answers: File exists

real	0m48.624s
user	0m0.360s
sys	0m46.960s

So does that warrant a fixes tag for this patch?

>
> My thought is this seems more like a performance optimization than a
> fix.  If that is the case this probably belongs in net-next.
>
> It seems to me we might be able to simplify update_suffix rather than
> going through all this change.  That might be something that is more
> acceptable for net.  Have you looked at simply adding code so that
> there is a break inside update_suffix if (slen == tn->slen)?  We don't
> have to call it for if a leaf is added so it might make sense to add
> that check.

That doesn't really help since the search always starts from the 
smallest suffix length and thus could potentially require visiting a 
large number of children before finding the node that makes slen == 
tn->slen.

In the example above, 10.37.96.0/20 would be child number 140640 in node 
8.0.0.0/6 18. Since the loop starts out with stride = 2 this means that 
it requires 70320 iterations round to find 10.37.96.0/20 which 
contributes the largest suffix length to the node.

Now it may be possible to improve the algorithm by starting the search 
from the largest suffix length towards the smallest suffix length 
instead of the current smallest to largest, but there would still be 
distributions of routes that would lead to needing to visit a large 
number of nodes only to find that nothing has changed.

>
>> ---
>>  net/ipv4/fib_trie.c | 86 ++++++++++++++++++++++++++++++++++++++---------------
>>  1 file changed, 62 insertions(+), 24 deletions(-)
>>
>> diff --git a/net/ipv4/fib_trie.c b/net/ipv4/fib_trie.c
>> index 026f309c51e9..701cae8af44a 100644
>> --- a/net/ipv4/fib_trie.c
>> +++ b/net/ipv4/fib_trie.c
>> @@ -421,8 +421,22 @@ static inline int tnode_full(struct key_vector *tn, struct key_vector *n)
>>         return n && ((n->pos + n->bits) == tn->pos) && IS_TNODE(n);
>>  }
>>
>> +static void node_push_suffix(struct key_vector *tn, struct key_vector *l)
>> +{
>> +       while (tn->slen < l->slen) {
>> +               tn->slen = l->slen;
>> +               tn = node_parent(tn);
>> +               if (!tn)
>> +                       break;
>
> I don't think the NULL check is necessary, at least it wasn't with the old code.

It wasn't necessary before because the root node had the largest 
possible suffix length of KEYLENGTH. This isn't the case for the nodes 
this is now called on since they're either nodes without parents or 
sub-tries that end up in node without a parent, where they're 
initialised to have the smallest possible suffix length for the node 
(equal to their position).

>
>> +       }
>> +}
>> +
>>  /* Add a child at position i overwriting the old value.
>>   * Update the value of full_children and empty_children.
>> + *
>> + * The suffix length of the parent node and the rest of the tree is
>> + * updated (if required) when adding/replacing a node, but is caller's
>> + * responsibility on removal.
>>   */
>>  static void put_child(struct key_vector *tn, unsigned long i,
>>                       struct key_vector *n)
>> @@ -447,8 +461,8 @@ static void put_child(struct key_vector *tn, unsigned long i,
>>         else if (!wasfull && isfull)
>>                 tn_info(tn)->full_children++;
>>
>> -       if (n && (tn->slen < n->slen))
>> -               tn->slen = n->slen;
>> +       if (n)
>> +               node_push_suffix(tn, n);
>
> This is just creating redundant work if we have to perform a resize.

The only real redundant work is the assignment of slen in tn, since the 
propagation up the trie has to happen regardless and if a subsequent 
resize results in changes to the trie then the propagation will stop at 
tn's parent since the suffix length of the tn's parent will not be less 
than tn's suffix length.

>
>>         rcu_assign_pointer(tn->tnode[i], n);
>>  }
...
>> -static void leaf_pull_suffix(struct key_vector *tp, struct key_vector *l)
>> +static void node_set_suffix(struct key_vector *tp, unsigned char slen)
>>  {
>> -       while ((tp->slen > tp->pos) && (tp->slen > l->slen)) {
>> -               if (update_suffix(tp) > l->slen)
>> +       if (slen > tp->slen)
>> +               tp->slen = slen;
>> +}
>> +
>> +static void node_pull_suffix(struct key_vector *tn)
>> +{
>> +       struct key_vector *tp;
>> +       unsigned char slen;
>> +
>> +       slen = update_suffix(tn);
>> +       tp = node_parent(tn);
>> +       while ((tp->slen > tp->pos) && (tp->slen > slen)) {
>> +               if (update_suffix(tp) > slen)
>>                         break;
>>                 tp = node_parent(tp);
>>         }
>>  }
>>
>
> I really hate all the renaming.  Can't you just leave these as
> leaf_pull_suffix and leaf_push _suffix.  I'm fine with us dropping the
> leaf of the suffix but the renaming just makes adds a bunch of noise.
> Really it might work better if you broke the replacement of the
> key_vector as a leaf with the suffix length into a separate patch,
> then you could do the rename as a part of that.

Ok, I'll leave the naming of leaf_push_suffix alone. Note that 
leaf_pull_suffix hasn't been renamed, the below in the diff is just an 
artifact of how diff decided to present the changes along with the 
moving of leaf_push_suffix.

>
>> -static void leaf_push_suffix(struct key_vector *tn, struct key_vector *l)
>> +static void leaf_pull_suffix(struct key_vector *tp, struct key_vector *l)
>>  {
>> -       /* if this is a new leaf then tn will be NULL and we can sort
>> -        * out parent suffix lengths as a part of trie_rebalance
>> -        */
>> -       while (tn->slen < l->slen) {
>> -               tn->slen = l->slen;
>> -               tn = node_parent(tn);
>> +       while ((tp->slen > tp->pos) && (tp->slen > l->slen)) {
>> +               if (update_suffix(tp) > l->slen)
>> +                       break;
>> +               tp = node_parent(tp);
>>         }
>>  }
>
> If possible it would work better if you could avoid moving functions
> around as a result of your changes.

Ok, I can add a forward declaration instead.

>
>> @@ -1107,7 +1122,7 @@ static int fib_insert_alias(struct trie *t, struct key_vector *tp,
>>         /* if we added to the tail node then we need to update slen */
>>         if (l->slen < new->fa_slen) {
>>                 l->slen = new->fa_slen;
>> -               leaf_push_suffix(tp, l);
>> +               node_push_suffix(tp, l);
>>         }
>>
>>         return 0;
...
>> @@ -1783,6 +1801,16 @@ void fib_table_flush_external(struct fib_table *tb)
>>                         if (IS_TRIE(pn))
>>                                 break;
>>
>> +                       /* push the suffix length to the parent node,
>> +                        * since a previous leaf removal may have
>> +                        * caused it to change
>> +                        */
>> +                       if (pn->slen > pn->pos) {
>> +                               unsigned char slen = update_suffix(pn);
>> +
>> +                               node_set_suffix(node_parent(pn), slen);
>> +                       }
>> +
>
> Why bother with the local variable?  You could just pass
> update_suffix(pn) as the parameter to your node_set_suffix function.

To avoid a long line on the node_set_suffix call or splitting it across 
lines, but I'll remove the local variable as you suggest and the same below.

>
>>                         /* resize completed node */
>>                         pn = resize(t, pn);
>>                         cindex = get_index(pkey, pn);
>> @@ -1849,6 +1877,16 @@ int fib_table_flush(struct net *net, struct fib_table *tb)
>>                         if (IS_TRIE(pn))
>>                                 break;
>>
>> +                       /* push the suffix length to the parent node,
>> +                        * since a previous leaf removal may have
>> +                        * caused it to change
>> +                        */
>> +                       if (pn->slen > pn->pos) {
>> +                               unsigned char slen = update_suffix(pn);
>> +
>> +                               node_set_suffix(node_parent(pn), slen);
>> +                       }
>> +
>
> Same here.
>
>>                         /* resize completed node */
>>                         pn = resize(t, pn);
>>                         cindex = get_index(pkey, pn);
>> --
>> 2.1.4
>>

Thanks,
Rob

^ permalink raw reply

* Re: [WIP] net+mlx4: auto doorbell
From: Eric Dumazet @ 2016-12-01  0:27 UTC (permalink / raw)
  To: Jesper Dangaard Brouer, Tom Herbert, Willem de Bruijn
  Cc: Rick Jones, netdev, Saeed Mahameed, Tariq Toukan, Achiad Shochat
In-Reply-To: <1480534200.18162.203.camel@edumazet-glaptop3.roam.corp.google.com>

Another issue I found during my tests last days, is a problem with BQL,
and more generally when driver stops/starts the queue.

When under stress and BQL stops the queue, driver TX completion does a
lot of work, and servicing CPU also takes over further qdisc_run().

The work-flow is :

1) collect up to 64 (or 256 packets for mlx4) packets from TX ring, and
unmap things, queue skbs for freeing.

2) Calls netdev_tx_completed_queue(ring->tx_queue, packets, bytes); 

if (test_and_clear_bit(__QUEUE_STATE_STACK_XOFF, &dev_queue->state))
     netif_schedule_queue(dev_queue);

This leaves a very tiny window where other cpus could grab __QDISC_STATE_SCHED
(They absolutely have no chance to grab it)

So we end up with one cpu doing the ndo_start_xmit() and TX completions,
and RX work.

This problem is magnified when XPS is used, if one mono-threaded application deals with
thousands of TCP sockets.

We should use an additional bit (__QDISC_STATE_PLEASE_GRAB_ME) or some way
to allow another cpu to service the qdisc and spare us.

^ permalink raw reply

* Re: [WIP] net+mlx4: auto doorbell
From: Tom Herbert @ 2016-12-01  1:16 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: Jesper Dangaard Brouer, Willem de Bruijn, Rick Jones,
	Linux Kernel Network Developers, Saeed Mahameed, Tariq Toukan,
	Achiad Shochat
In-Reply-To: <1480552059.18162.239.camel@edumazet-glaptop3.roam.corp.google.com>

On Wed, Nov 30, 2016 at 4:27 PM, Eric Dumazet <eric.dumazet@gmail.com> wrote:
>
> Another issue I found during my tests last days, is a problem with BQL,
> and more generally when driver stops/starts the queue.
>
> When under stress and BQL stops the queue, driver TX completion does a
> lot of work, and servicing CPU also takes over further qdisc_run().
>
Hmm, hard to say if this is problem or a feature I think ;-). One way
to "solve" this would be to use IRQ round robin, that would spread the
load of networking across CPUs, but that gives us no additional
parallelism and reduces locality-- it's generally considered a bad
idea. The question might be: is it better to continuously ding one CPU
with all the networking work or try to artificially spread it out?
Note this is orthogonal to MQ also, so we should already have multiple
CPUs doing netif_schedule_queue for queues they manage.

Do you have a test or application that shows this is causing pain?

> The work-flow is :
>
> 1) collect up to 64 (or 256 packets for mlx4) packets from TX ring, and
> unmap things, queue skbs for freeing.
>
> 2) Calls netdev_tx_completed_queue(ring->tx_queue, packets, bytes);
>
> if (test_and_clear_bit(__QUEUE_STATE_STACK_XOFF, &dev_queue->state))
>      netif_schedule_queue(dev_queue);
>
> This leaves a very tiny window where other cpus could grab __QDISC_STATE_SCHED
> (They absolutely have no chance to grab it)
>
> So we end up with one cpu doing the ndo_start_xmit() and TX completions,
> and RX work.
>
> This problem is magnified when XPS is used, if one mono-threaded application deals with
> thousands of TCP sockets.
>
Do you know of an application doing this? The typical way XPS and most
of the other steering mechanisms are configured assume that workloads
tend towards a normal distribution. Such mechanisms can become
problematic under asymmetric loads, but then I would expect these are
likely dedicated servers so that the mechanisms can be tuned
accordingly. For instance, XPS can be configured to map more than one
queue to a CPU. Alternatively, IIRC Windows has some functionality to
tune networking for the load (spin up queues, reconfigure things
similar to XPS/RPS, etc.)-- that's promising up the point that we need
a lot of heuristics and measurement; but then we lose determinism and
risk edge case where we get completely unsatisfied results (one of my
concerns with the recent attempt to put configuration in the kernel).

> We should use an additional bit (__QDISC_STATE_PLEASE_GRAB_ME) or some way
> to allow another cpu to service the qdisc and spare us.
>
Wouldn't this need to be an active operation? That is to queue the
qdisc on another CPU's output_queue?

Tom

>
>

^ permalink raw reply

* linux-next: manual merge of the net-next tree with the arm-soc tree
From: Stephen Rothwell @ 2016-12-01  1:31 UTC (permalink / raw)
  To: David Miller, Networking, Olof Johansson, Arnd Bergmann, ARM
  Cc: Rob Rice, Florian Fainelli, linux-next, linux-kernel, Jon Mason

Hi all,

Today's linux-next merge of the net-next tree got a conflict in:

  arch/arm64/boot/dts/broadcom/ns2.dtsi

between commit:

  e79249143f46 ("arm64: dts: Add Broadcom Northstar2 device tree entries for PDC driver.")

from the arm-soc tree and commit:

  dddc3c9d7d02 ("arm64: dts: NS2: add AMAC ethernet support")

from the net-next tree.

I fixed it up (see below) and can carry the fix as necessary. This
is now fixed as far as linux-next is concerned, but any non trivial
conflicts should be mentioned to your upstream maintainer when your tree
is submitted for merging.  You may also want to consider cooperating
with the maintainer of the conflicting tree to minimise any particularly
complex conflicts.

-- 
Cheers,
Stephen Rothwell

diff --cc arch/arm64/boot/dts/broadcom/ns2.dtsi
index 863503d78f57,773ed593da4d..000000000000
--- a/arch/arm64/boot/dts/broadcom/ns2.dtsi
+++ b/arch/arm64/boot/dts/broadcom/ns2.dtsi
@@@ -197,42 -191,18 +197,54 @@@
  
  		#include "ns2-clock.dtsi"
  
 +		pdc0: iproc-pdc0@612c0000 {
 +			compatible = "brcm,iproc-pdc-mbox";
 +			reg = <0x612c0000 0x445>;  /* PDC FS0 regs */
 +			interrupts = <GIC_SPI 187 IRQ_TYPE_LEVEL_HIGH>;
 +			#mbox-cells = <1>;
 +			brcm,rx-status-len = <32>;
 +			brcm,use-bcm-hdr;
 +		};
 +
 +		pdc1: iproc-pdc1@612e0000 {
 +			compatible = "brcm,iproc-pdc-mbox";
 +			reg = <0x612e0000 0x445>;  /* PDC FS1 regs */
 +			interrupts = <GIC_SPI 189 IRQ_TYPE_LEVEL_HIGH>;
 +			#mbox-cells = <1>;
 +			brcm,rx-status-len = <32>;
 +			brcm,use-bcm-hdr;
 +		};
 +
 +		pdc2: iproc-pdc2@61300000 {
 +			compatible = "brcm,iproc-pdc-mbox";
 +			reg = <0x61300000 0x445>;  /* PDC FS2 regs */
 +			interrupts = <GIC_SPI 191 IRQ_TYPE_LEVEL_HIGH>;
 +			#mbox-cells = <1>;
 +			brcm,rx-status-len = <32>;
 +			brcm,use-bcm-hdr;
 +		};
 +
 +		pdc3: iproc-pdc3@61320000 {
 +			compatible = "brcm,iproc-pdc-mbox";
 +			reg = <0x61320000 0x445>;  /* PDC FS3 regs */
 +			interrupts = <GIC_SPI 193 IRQ_TYPE_LEVEL_HIGH>;
 +			#mbox-cells = <1>;
 +			brcm,rx-status-len = <32>;
 +			brcm,use-bcm-hdr;
 +		};
 +
+ 		enet: ethernet@61000000 {
+ 			compatible = "brcm,ns2-amac";
+ 			reg = <0x61000000 0x1000>,
+ 			      <0x61090000 0x1000>,
+ 			      <0x61030000 0x100>;
+ 			reg-names = "amac_base", "idm_base", "nicpm_base";
+ 			interrupts = <GIC_SPI 341 IRQ_TYPE_LEVEL_HIGH>;
+ 			phy-handle = <&gphy0>;
+ 			phy-mode = "rgmii";
+ 			status = "disabled";
+ 		};
+ 
  		dma0: dma@61360000 {
  			compatible = "arm,pl330", "arm,primecell";
  			reg = <0x61360000 0x1000>;

^ permalink raw reply

* linux-next: manual merge of the net-next tree with the net tree
From: Stephen Rothwell @ 2016-12-01  1:36 UTC (permalink / raw)
  To: David Miller, Networking
  Cc: linux-next, linux-kernel, Cyrille Pitchen, Zach Brown

Hi all,

Today's linux-next merge of the net-next tree got a conflict in:

  drivers/net/ethernet/cadence/macb.c

between commit:

  a0b44eea372b ("net: macb: fix the RX queue reset in macb_rx()")

from the net tree and commit:

  b410d13e10db ("net: macb: Use variables with defaults for tx/rx ring sizes instead of hardcoded values")

from the net-next tree.

I fixed it up (see below) and can carry the fix as necessary. This
is now fixed as far as linux-next is concerned, but any non trivial
conflicts should be mentioned to your upstream maintainer when your tree
is submitted for merging.  You may also want to consider cooperating
with the maintainer of the conflicting tree to minimise any particularly
complex conflicts.

-- 
Cheers,
Stephen Rothwell

diff --cc drivers/net/ethernet/cadence/macb.c
index ec09fcece711,0e489bb82456..000000000000
--- a/drivers/net/ethernet/cadence/macb.c
+++ b/drivers/net/ethernet/cadence/macb.c
@@@ -974,8 -990,7 +990,8 @@@ static inline void macb_init_rx_ring(st
  		bp->rx_ring[i].ctrl = 0;
  		addr += bp->rx_buffer_size;
  	}
- 	bp->rx_ring[RX_RING_SIZE - 1].addr |= MACB_BIT(RX_WRAP);
+ 	bp->rx_ring[bp->rx_ring_size - 1].addr |= MACB_BIT(RX_WRAP);
 +	bp->rx_tail = 0;
  }
  
  static int macb_rx(struct macb *bp, int budget)
@@@ -1617,7 -1735,9 +1737,7 @@@ static void macb_init_rings(struct mac
  	}
  	bp->queues[0].tx_head = 0;
  	bp->queues[0].tx_tail = 0;
- 	bp->queues[0].tx_ring[TX_RING_SIZE - 1].ctrl |= MACB_BIT(TX_WRAP);
+ 	bp->queues[0].tx_ring[bp->tx_ring_size - 1].ctrl |= MACB_BIT(TX_WRAP);
 -
 -	bp->rx_tail = 0;
  }
  
  static void macb_reset_hw(struct macb *bp)

^ permalink raw reply

* linux-next: manual merge of the net-next tree with the net tree
From: Stephen Rothwell @ 2016-12-01  1:41 UTC (permalink / raw)
  To: David Miller, Networking; +Cc: linux-next, linux-kernel, Jiri Pirko, Roi Dayan

Hi all,

Today's linux-next merge of the net-next tree got a conflict in:

  net/sched/cls_flower.c

between commit:

  725cbb62e7ad ("sched: cls_flower: remove from hashtable only in case skip sw flag is not set")

from the net tree and commit:

  13fa876ebd03 ("net/sched: cls_flower: merge filter delete/destroy common code")

from the net-next tree.

I fixed it up (see below) and can carry the fix as necessary. This
is now fixed as far as linux-next is concerned, but any non trivial
conflicts should be mentioned to your upstream maintainer when your tree
is submitted for merging.  You may also want to consider cooperating
with the maintainer of the conflicting tree to minimise any particularly
complex conflicts.

-- 
Cheers,
Stephen Rothwell

diff --cc net/sched/cls_flower.c
index 904442421db3,e8dd09af0d0c..000000000000
--- a/net/sched/cls_flower.c
+++ b/net/sched/cls_flower.c
@@@ -273,24 -272,14 +276,32 @@@ static void fl_hw_update_stats(struct t
  	dev->netdev_ops->ndo_setup_tc(dev, tp->q->handle, tp->protocol, &tc);
  }
  
 +static void fl_destroy_sleepable(struct work_struct *work)
 +{
 +	struct cls_fl_head *head = container_of(work, struct cls_fl_head,
 +						work);
 +	if (head->mask_assigned)
 +		rhashtable_destroy(&head->ht);
 +	kfree(head);
 +	module_put(THIS_MODULE);
 +}
 +
 +static void fl_destroy_rcu(struct rcu_head *rcu)
 +{
 +	struct cls_fl_head *head = container_of(rcu, struct cls_fl_head, rcu);
 +
 +	INIT_WORK(&head->work, fl_destroy_sleepable);
 +	schedule_work(&head->work);
 +}
 +
+ static void __fl_delete(struct tcf_proto *tp, struct cls_fl_filter *f)
+ {
+ 	list_del_rcu(&f->list);
+ 	fl_hw_destroy_filter(tp, (unsigned long)f);
+ 	tcf_unbind_filter(tp, &f->res);
+ 	call_rcu(&f->rcu, fl_destroy_filter);
+ }
+ 
  static bool fl_destroy(struct tcf_proto *tp, bool force)
  {
  	struct cls_fl_head *head = rtnl_dereference(tp->root);
@@@ -299,14 -288,12 +310,11 @@@
  	if (!force && !list_empty(&head->filters))
  		return false;
  
- 	list_for_each_entry_safe(f, next, &head->filters, list) {
- 		fl_hw_destroy_filter(tp, (unsigned long)f);
- 		list_del_rcu(&f->list);
- 		call_rcu(&f->rcu, fl_destroy_filter);
- 	}
+ 	list_for_each_entry_safe(f, next, &head->filters, list)
+ 		__fl_delete(tp, f);
 -	RCU_INIT_POINTER(tp->root, NULL);
 -	if (head->mask_assigned)
 -		rhashtable_destroy(&head->ht);
 -	kfree_rcu(head, rcu);
 +
 +	__module_get(THIS_MODULE);
 +	call_rcu(&head->rcu, fl_destroy_rcu);
  	return true;
  }
  
@@@ -761,13 -782,9 +804,10 @@@ static int fl_delete(struct tcf_proto *
  	struct cls_fl_head *head = rtnl_dereference(tp->root);
  	struct cls_fl_filter *f = (struct cls_fl_filter *) arg;
  
 -	rhashtable_remove_fast(&head->ht, &f->ht_node,
 -			       head->ht_params);
 +	if (!tc_skip_sw(f->flags))
 +		rhashtable_remove_fast(&head->ht, &f->ht_node,
 +				       head->ht_params);
- 	list_del_rcu(&f->list);
- 	fl_hw_destroy_filter(tp, (unsigned long)f);
- 	tcf_unbind_filter(tp, &f->res);
- 	call_rcu(&f->rcu, fl_destroy_filter);
+ 	__fl_delete(tp, f);
  	return 0;
  }
  

^ permalink raw reply

* [PATCH net v2 3/3] Revert: "ip6_tunnel: Update skb->protocol to ETH_P_IPV6 in ip6_tnl_xmit()"
From: Eli Cooper @ 2016-12-01  2:05 UTC (permalink / raw)
  To: netdev, David S . Miller; +Cc: Eric Dumazet
In-Reply-To: <20161201020512.21661-1-elicooper@gmx.com>

This reverts commit ae148b085876fa771d9ef2c05f85d4b4bf09ce0d
("ip6_tunnel: Update skb->protocol to ETH_P_IPV6 in ip6_tnl_xmit()").

skb->protocol is now set in __ip_local_out() and __ip6_local_out() before
dst_output() is called. It is no longer necessary to do it for each tunnel.

Cc: stable@vger.kernel.org
Signed-off-by: Eli Cooper <elicooper@gmx.com>
---
 net/ipv6/ip6_tunnel.c | 1 -
 1 file changed, 1 deletion(-)

diff --git a/net/ipv6/ip6_tunnel.c b/net/ipv6/ip6_tunnel.c
index 0a4759b..d76674e 100644
--- a/net/ipv6/ip6_tunnel.c
+++ b/net/ipv6/ip6_tunnel.c
@@ -1181,7 +1181,6 @@ int ip6_tnl_xmit(struct sk_buff *skb, struct net_device *dev, __u8 dsfield,
 	if (err)
 		return err;

-	skb->protocol = htons(ETH_P_IPV6);
 	skb_push(skb, sizeof(struct ipv6hdr));
 	skb_reset_network_header(skb);
 	ipv6h = ipv6_hdr(skb);
-- 
2.10.2

^ permalink raw reply related

* [PATCH net v2 1/3] ipv4: Set skb->protocol properly for local output
From: Eli Cooper @ 2016-12-01  2:05 UTC (permalink / raw)
  To: netdev, David S . Miller; +Cc: Eric Dumazet

When xfrm is applied to TSO/GSO packets, it follows this path:

    xfrm_output() -> xfrm_output_gso() -> skb_gso_segment()

where skb_gso_segment() relies on skb->protocol to function properly.

This patch sets skb->protocol to ETH_P_IP before dst_output() is called,
fixing a bug where GSO packets sent through a sit tunnel are dropped
when xfrm is involved.

Cc: stable@vger.kernel.org
Signed-off-by: Eli Cooper <elicooper@gmx.com>
---
v2: place the assignment before the netfilter hook

 net/ipv4/ip_output.c | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/net/ipv4/ip_output.c b/net/ipv4/ip_output.c
index 105908d..877bdb0 100644
--- a/net/ipv4/ip_output.c
+++ b/net/ipv4/ip_output.c
@@ -107,6 +107,8 @@ int __ip_local_out(struct net *net, struct sock *sk, struct sk_buff *skb)
 	if (unlikely(!skb))
 		return 0;

+	skb->protocol = htons(ETH_P_IP);
+
 	return nf_hook(NFPROTO_IPV4, NF_INET_LOCAL_OUT,
 		       net, sk, skb, NULL, skb_dst(skb)->dev,
 		       dst_output);
-- 
2.10.2

^ permalink raw reply related

* [PATCH net v2 2/3] ipv6: Set skb->protocol properly for local output
From: Eli Cooper @ 2016-12-01  2:05 UTC (permalink / raw)
  To: netdev, David S . Miller; +Cc: Eric Dumazet
In-Reply-To: <20161201020512.21661-1-elicooper@gmx.com>

When xfrm is applied to TSO/GSO packets, it follows this path:

    xfrm_output() -> xfrm_output_gso() -> skb_gso_segment()

where skb_gso_segment() relies on skb->protocol to function properly.

This patch sets skb->protocol to ETH_P_IPV6 before dst_output() is called,
fixing a bug where GSO packets sent through an ipip6 tunnel are dropped
when xfrm is involved.

Cc: stable@vger.kernel.org
Signed-off-by: Eli Cooper <elicooper@gmx.com>
---
v2: place the assignment before the netfilter hook

 net/ipv6/output_core.c | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/net/ipv6/output_core.c b/net/ipv6/output_core.c
index 7cca8ac..cd42523 100644
--- a/net/ipv6/output_core.c
+++ b/net/ipv6/output_core.c
@@ -155,6 +155,8 @@ int __ip6_local_out(struct net *net, struct sock *sk, struct sk_buff *skb)
 	if (unlikely(!skb))
 		return 0;
 
+	skb->protocol = htons(ETH_P_IPV6);
+
 	return nf_hook(NFPROTO_IPV6, NF_INET_LOCAL_OUT,
 		       net, sk, skb, NULL, skb_dst(skb)->dev,
 		       dst_output);
-- 
2.10.2

^ permalink raw reply related

* Re: [WIP] net+mlx4: auto doorbell
From: Eric Dumazet @ 2016-12-01  2:32 UTC (permalink / raw)
  To: Tom Herbert
  Cc: Jesper Dangaard Brouer, Willem de Bruijn, Rick Jones,
	Linux Kernel Network Developers, Saeed Mahameed, Tariq Toukan,
	Achiad Shochat
In-Reply-To: <CALx6S35UZcD+kXTTGeML4DdDYaUoArSfEPeUoTuFEeRWKsM7sQ@mail.gmail.com>

On Wed, 2016-11-30 at 17:16 -0800, Tom Herbert wrote:
> On Wed, Nov 30, 2016 at 4:27 PM, Eric Dumazet <eric.dumazet@gmail.com> wrote:
> >
> > Another issue I found during my tests last days, is a problem with BQL,
> > and more generally when driver stops/starts the queue.
> >
> > When under stress and BQL stops the queue, driver TX completion does a
> > lot of work, and servicing CPU also takes over further qdisc_run().
> >
> Hmm, hard to say if this is problem or a feature I think ;-). One way
> to "solve" this would be to use IRQ round robin, that would spread the
> load of networking across CPUs, but that gives us no additional
> parallelism and reduces locality-- it's generally considered a bad
> idea. The question might be: is it better to continuously ding one CPU
> with all the networking work or try to artificially spread it out?
> Note this is orthogonal to MQ also, so we should already have multiple
> CPUs doing netif_schedule_queue for queues they manage.
> 
> Do you have a test or application that shows this is causing pain?

Yes, just launch enough TCP senders (more than 10,000) to fully utilize
the NIC, with small messages.

super_netperf is not good for that, because you would need 10,000
processes and would spend too much cycles just dealing with an enormous
working set, you would not activate BQL.

> 
> > The work-flow is :
> >
> > 1) collect up to 64 (or 256 packets for mlx4) packets from TX ring, and
> > unmap things, queue skbs for freeing.
> >
> > 2) Calls netdev_tx_completed_queue(ring->tx_queue, packets, bytes);
> >
> > if (test_and_clear_bit(__QUEUE_STATE_STACK_XOFF, &dev_queue->state))
> >      netif_schedule_queue(dev_queue);
> >
> > This leaves a very tiny window where other cpus could grab __QDISC_STATE_SCHED
> > (They absolutely have no chance to grab it)
> >
> > So we end up with one cpu doing the ndo_start_xmit() and TX completions,
> > and RX work.
> >
> > This problem is magnified when XPS is used, if one mono-threaded application deals with
> > thousands of TCP sockets.
> >
> Do you know of an application doing this? The typical way XPS and most
> of the other steering mechanisms are configured assume that workloads
> tend towards a normal distribution. Such mechanisms can become
> problematic under asymmetric loads, but then I would expect these are
> likely dedicated servers so that the mechanisms can be tuned
> accordingly. For instance, XPS can be configured to map more than one
> queue to a CPU. Alternatively, IIRC Windows has some functionality to
> tune networking for the load (spin up queues, reconfigure things
> similar to XPS/RPS, etc.)-- that's promising up the point that we need
> a lot of heuristics and measurement; but then we lose determinism and
> risk edge case where we get completely unsatisfied results (one of my
> concerns with the recent attempt to put configuration in the kernel).

We have thousands of applications, and some of them 'kind of multicast'
events to a broad number of TCP sockets.

Very often, applications writers use a single thread for doing this,
when all they need is to send small packets to 10,000 sockets, and they
do not really care of doing this very fast. They also do not want to
hurt other applications sharing the NIC.

Very often, process scheduler will also run this single thread in a
single cpu, ie avoiding expensive migrations if they are not needed.

Problem is this behavior locks one TX queue for the duration of the
multicast, since XPS will force all the TX packets to go to one TX
queue.

Other flows that would share the locked CPU experience high P99
latencies.

> 
> > We should use an additional bit (__QDISC_STATE_PLEASE_GRAB_ME) or some way
> > to allow another cpu to service the qdisc and spare us.
> >
> Wouldn't this need to be an active operation? That is to queue the
> qdisc on another CPU's output_queue?

I simply suggest we try to queue the qdisc for further servicing as we
do today, from net_tx_action(), but we might use a different bit, so
that we leave the opportunity for another cpu to get __QDISC_STATE_SCHED
before we grab it from net_tx_action(), maybe 100 usec later (time to
flush all skbs queued in napi_consume_skb() and maybe RX processing,
since most NIC handle TX completion before doing RX processing from thei
napi poll handler.

Should be doable with few changes in __netif_schedule() and
net_tx_action(), plus some control paths that will need to take care of
the new bit at dismantle time, right ?

^ permalink raw reply

* Re: Initial thoughts on TXDP
From: Florian Westphal @ 2016-12-01  2:44 UTC (permalink / raw)
  To: Tom Herbert; +Cc: Linux Kernel Network Developers
In-Reply-To: <CALx6S34qPqXa7s1eHmk9V-k6xb=36dfiQvx3JruaNnqg4v8r9g@mail.gmail.com>

Tom Herbert <tom@herbertland.com> wrote:
> Posting for discussion....

Warning: You are not going to like this reply...

> Now that XDP seems to be nicely gaining traction

Yes, I regret to see that.  XDP seems useful to create impressive
benchmark numbers (and little else).

I will send a separate email to keep that flamebait part away from
this thread though.

[..]

> addresses the performance gap for stateless packet processing). The
> problem statement is analogous to that which we had for XDP, namely
> can we create a mode in the kernel that offer the same performance
> that is seen with L4 protocols over kernel bypass

Why?  If you want to bypass the kernel, then DO IT.

There is nothing wrong with DPDK.  The ONLY problem is that the kernel
does not offer a userspace fastpath like Windows RIO or FreeBSDs netmap.

But even without that its not difficult to get DPDK running.

(T)XDP seems born from spite, not technical rationale.
IMO everyone would be better off if we'd just have something netmap-esqe
in the network core (also see below).

> I imagine there are a few reasons why userspace TCP stacks can get
> good performance:
> 
> - Spin polling (we already can do this in kernel)
> - Lockless, I would assume that threads typically have exclusive
> access to a queue pair for a connection
> - Minimal TCP/IP stack code
> - Zero copy TX/RX
> - Light weight structures for queuing
> - No context switches
> - Fast data path for in order, uncongested flows
> - Silo'ing between application and device queues

I only see two cases:

1. Many applications running (standard Os model) that need to
send/receive data
-> Linux Network Stack

2. Single dedicated application that does all rx/tx

-> no queueing needed (can block network rx completely if receiver
is slow)
-> no allocations needed at runtime at all
-> no locking needed (single produce, single consumer)

If you have #2 and you need to be fast etc then full userspace
bypass is fine.  We will -- no matter what we do in kernel -- never
be able to keep up with the speed you can get with that
because we have to deal with #1.  (Plus the ease of use/freedom of doing
userspace programming).  And yes, I think that #2 is something we
should address solely by providing netmap or something similar.

But even considering #1 there are ways to speed stack up:

I'd kill RPF/RPS so we don't have IPI anymore and skb stays
on same cpu up to where it gets queued (ofo or rx queue).

Then we could tell driver what happened with the skb it gave us, e.g.
we can tell driver it can do immediate page/dma reuse, for example
in pure ack case as opposed to skb sitting in ofo or receive queue.

(RPS/RFS functionality could still be provided via one of the gazillion
 hooks we now have in the stack for those that need/want it), so I do
not think we would lose functionality.

>   - Call into TCP/IP stack with page data directly from driver-- no
> skbuff allocation or interface. This is essentially provided by the
> XDP API although we would need to generalize the interface to call
> stack functions (I previously posted patches for that). We will also
> need a new action, XDP_HELD?, that indicates the XDP function held the
> packet (put on a socket for instance).

Seems this will not work at all with the planned page pool thing when
pages start to be held indefinitely.

You can also never get even close to userspace offload stacks once you
need/do this; allocations in hotpath are too expensive.

[..]

>   - When we transmit, it would be nice to go straight from TCP
> connection to an XDP device queue and in particular skip the qdisc
> layer. This follows the principle of low latency being first criteria.

It will never be lower than userspace offloads so anyone with serious
"low latency" requirement (trading) will use that instead.

Whats your target audience?

> longer latencies in effect which likely means TXDP isn't appropriate
> in such a cases. BQL is also out, however we would want the TX
> batching of XDP.

Right, congestion control and buffer bloat are totally overrated .. 8-(

So far I haven't seen anything that would need XDP at all.

What makes it technically impossible to apply these miracles to the
stack...?

E.g. "mini-skb": Even if we assume that this provides a speedup
(where does that come from? should make no difference if a 32 or
 320 byte buffer gets allocated).

If we assume that its the zeroing of sk_buff (but iirc it made little
to no difference), could add

unsigned long skb_extensions[1];

to sk_buff, then move everything not needed for tcp fastpath
(e.g. secpath, conntrack, nf_bridge, tunnel encap, tc, ...)
below that

Then convert accesses to accessors and init it on demand.

One could probably also split cb[] into a smaller fastpath area
and another one at the end that won't be touched at allocation time.

> Miscellaneous

> contemplating that connections/sockets can be bound to particularly
> CPUs and that any operations (socket operations, timers, receive
> processing) must occur on that CPU. The CPU would be the one where RX
> happens. Note this implies perfect silo'ing, everything for driver RX
> to application processing happens inline on the CPU. The stack would
> not cross CPUs for a connection while in this mode.

Again don't see how this relates to xdp.  Could also be done with
current stack if we make rps/rfs pluggable since nothing else
currently pushes skb to another cpu (except when scheduler is involved
via tc mirred, netfilter userspace queueing etc) but that is always
explicit (i.e. skb leaves softirq protection).

Can we please fix and improve what we already have rather than creating
yet another NIH thing that will have to be maintained forever?

Thanks.

^ permalink raw reply

* RE: [PATCH net] vhost_net: don't continue to call the recvmsg when meet errors
From: wangyunjian @ 2016-12-01  2:48 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: jasowang@redhat.com, netdev@vger.kernel.org,
	linux-kernel@vger.kernel.org, caihe
In-Reply-To: <20161130152004-mutt-send-email-mst@kernel.org>


>-----Original Message-----
>From: Michael S. Tsirkin [mailto:mst@redhat.com] 
>Sent: Wednesday, November 30, 2016 9:41 PM
>To: wangyunjian
>Cc: jasowang@redhat.com; netdev@vger.kernel.org; linux-kernel@vger.kernel.org; caihe
>Subject: Re: [PATCH net] vhost_net: don't continue to call the recvmsg when meet errors
>
>On Wed, Nov 30, 2016 at 08:10:57PM +0800, Yunjian Wang wrote:
>> When we meet an error(err=-EBADFD) recvmsg,
>
>How do you get EBADFD? Won't vhost_net_rx_peek_head_len
>return 0 in this case, breaking the loop?

We started many guest VMs while attaching/detaching some virtio-net nics for loop. 
The soft lockup might happened. The err is -EBADFD.

meesage log:
kernel:[609608.510180]BUG: soft lockup - CPU#18 stuck for 23s! [vhost-60898:126093]
call trace:
[<fffffffa0132967>]vhost_get_vq_desc+0x1e7/0x984 [vhost]
[<fffffffa02037e6>]handle_rx+0x226/0x810 [vhost_net]
[<fffffffa0203de5>]handle_rx_net+0x15/0x20 [vhost_net]
[<fffffffa013160b>]vhost_worker+0xfb/0x1e0 [vhost]
[<fffffffa0131510>]? vhost_dev_reset_owner+0x50/0x50 [vhost]
[<fffffff810a5c7f>]kthread+0xcf/0xe0
[<fffffff810a5bb0>]? kthread_create_on_node+0x140/0x140
[<fffffff81648898>]ret_from_fork+0x58/0x90
[<fffffff810a5bb0>]? kthread_create_on_node+0x140/0x140

>> the error handling in vhost
>> handle_rx() will continue. This will cause a soft CPU lockup in vhost thread.
>> 
>> Signed-off-by: Yunjian Wang <wangyunjian@huawei.com>
>> ---
>>  drivers/vhost/net.c | 3 +++
>>  1 file changed, 3 insertions(+)
>> 
>> diff --git a/drivers/vhost/net.c b/drivers/vhost/net.c
>> index 5dc128a..edc470b 100644
>> --- a/drivers/vhost/net.c
>> +++ b/drivers/vhost/net.c
>> @@ -717,6 +717,9 @@ static void handle_rx(struct vhost_net *net)
>>  			pr_debug("Discarded rx packet: "
>>  				 " len %d, expected %zd\n", err, sock_len);
>>  			vhost_discard_vq_desc(vq, headcount);
>> +			/* Don't continue to do, when meet errors. */
>> +			if (err < 0)
>> +				goto out;
>
>You might get e.g. EAGAIN and I think you need to retry
>in this case.
>
>>  			continue;
>>  		}
>>  		/* Supply virtio_net_hdr if VHOST_NET_F_VIRTIO_NET_HDR */
>> -- 
>> 1.9.5.msysgit.1
>>

^ permalink raw reply

* Re: [WIP] net+mlx4: auto doorbell
From: Eric Dumazet @ 2016-12-01  2:50 UTC (permalink / raw)
  To: Tom Herbert
  Cc: Jesper Dangaard Brouer, Willem de Bruijn, Rick Jones,
	Linux Kernel Network Developers, Saeed Mahameed, Tariq Toukan,
	Achiad Shochat
In-Reply-To: <1480559566.18162.253.camel@edumazet-glaptop3.roam.corp.google.com>

On Wed, 2016-11-30 at 18:32 -0800, Eric Dumazet wrote:

> I simply suggest we try to queue the qdisc for further servicing as we
> do today, from net_tx_action(), but we might use a different bit, so
> that we leave the opportunity for another cpu to get __QDISC_STATE_SCHED
> before we grab it from net_tx_action(), maybe 100 usec later (time to
> flush all skbs queued in napi_consume_skb() and maybe RX processing,
> since most NIC handle TX completion before doing RX processing from thei
> napi poll handler.
> 
> Should be doable with few changes in __netif_schedule() and
> net_tx_action(), plus some control paths that will need to take care of
> the new bit at dismantle time, right ?

Hmm... this is silly. Code already implements a different bit.

qdisc_run() seems to run more often from net_tx_action(), I have to find
out why.

^ permalink raw reply

* Re: [PATCH net] vhost_net: don't continue to call the recvmsg when meet errors
From: Jason Wang @ 2016-12-01  3:13 UTC (permalink / raw)
  To: wangyunjian, Michael S. Tsirkin
  Cc: netdev@vger.kernel.org, linux-kernel@vger.kernel.org, caihe
In-Reply-To: <34EFBCA9F01B0748BEB6B629CE643AE60B0A7B68@szxeml561-mbx.china.huawei.com>



On 2016年12月01日 10:48, wangyunjian wrote:
>> -----Original Message-----
>> From: Michael S. Tsirkin [mailto:mst@redhat.com]
>> Sent: Wednesday, November 30, 2016 9:41 PM
>> To: wangyunjian
>> Cc: jasowang@redhat.com; netdev@vger.kernel.org; linux-kernel@vger.kernel.org; caihe
>> Subject: Re: [PATCH net] vhost_net: don't continue to call the recvmsg when meet errors
>>
>> On Wed, Nov 30, 2016 at 08:10:57PM +0800, Yunjian Wang wrote:
>>> When we meet an error(err=-EBADFD) recvmsg,
>> How do you get EBADFD? Won't vhost_net_rx_peek_head_len
>> return 0 in this case, breaking the loop?
> We started many guest VMs while attaching/detaching some virtio-net nics for loop.
> The soft lockup might happened. The err is -EBADFD.

How did you do the attaching/detaching? AFAIK, the -EBADFD can only 
happens when you deleting tun device during vhost_net transmission.

>
> meesage log:
> kernel:[609608.510180]BUG: soft lockup - CPU#18 stuck for 23s! [vhost-60898:126093]
> call trace:
> [<fffffffa0132967>]vhost_get_vq_desc+0x1e7/0x984 [vhost]
> [<fffffffa02037e6>]handle_rx+0x226/0x810 [vhost_net]
> [<fffffffa0203de5>]handle_rx_net+0x15/0x20 [vhost_net]
> [<fffffffa013160b>]vhost_worker+0xfb/0x1e0 [vhost]
> [<fffffffa0131510>]? vhost_dev_reset_owner+0x50/0x50 [vhost]
> [<fffffff810a5c7f>]kthread+0xcf/0xe0
> [<fffffff810a5bb0>]? kthread_create_on_node+0x140/0x140
> [<fffffff81648898>]ret_from_fork+0x58/0x90
> [<fffffff810a5bb0>]? kthread_create_on_node+0x140/0x140
>
>>> the error handling in vhost
>>> handle_rx() will continue. This will cause a soft CPU lockup in vhost thread.
>>>
>>> Signed-off-by: Yunjian Wang <wangyunjian@huawei.com>
>>> ---
>>>   drivers/vhost/net.c | 3 +++
>>>   1 file changed, 3 insertions(+)
>>>
>>> diff --git a/drivers/vhost/net.c b/drivers/vhost/net.c
>>> index 5dc128a..edc470b 100644
>>> --- a/drivers/vhost/net.c
>>> +++ b/drivers/vhost/net.c
>>> @@ -717,6 +717,9 @@ static void handle_rx(struct vhost_net *net)
>>>   			pr_debug("Discarded rx packet: "
>>>   				 " len %d, expected %zd\n", err, sock_len);
>>>   			vhost_discard_vq_desc(vq, headcount);
>>> +			/* Don't continue to do, when meet errors. */
>>> +			if (err < 0)
>>> +				goto out;
>> You might get e.g. EAGAIN and I think you need to retry
>> in this case.
>>
>>>   			continue;
>>>   		}
>>>   		/* Supply virtio_net_hdr if VHOST_NET_F_VIRTIO_NET_HDR */
>>> -- 
>>> 1.9.5.msysgit.1
>>>

^ permalink raw reply

* Re: [PATCH net] vhost_net: don't continue to call the recvmsg when meet errors
From: Jason Wang @ 2016-12-01  3:15 UTC (permalink / raw)
  To: Michael S. Tsirkin, Yunjian Wang; +Cc: netdev, linux-kernel, caihe
In-Reply-To: <20161130152004-mutt-send-email-mst@kernel.org>



On 2016年11月30日 21:40, Michael S. Tsirkin wrote:
> On Wed, Nov 30, 2016 at 08:10:57PM +0800, Yunjian Wang wrote:
>> When we meet an error(err=-EBADFD) recvmsg,
> How do you get EBADFD? Won't vhost_net_rx_peek_head_len
> return 0 in this case, breaking the loop?
>
>> the error handling in vhost
>> handle_rx() will continue. This will cause a soft CPU lockup in vhost thread.
>>
>> Signed-off-by: Yunjian Wang <wangyunjian@huawei.com>
>> ---
>>   drivers/vhost/net.c | 3 +++
>>   1 file changed, 3 insertions(+)
>>
>> diff --git a/drivers/vhost/net.c b/drivers/vhost/net.c
>> index 5dc128a..edc470b 100644
>> --- a/drivers/vhost/net.c
>> +++ b/drivers/vhost/net.c
>> @@ -717,6 +717,9 @@ static void handle_rx(struct vhost_net *net)
>>   			pr_debug("Discarded rx packet: "
>>   				 " len %d, expected %zd\n", err, sock_len);
>>   			vhost_discard_vq_desc(vq, headcount);
>> +			/* Don't continue to do, when meet errors. */
>> +			if (err < 0)
>> +				goto out;
> You might get e.g. EAGAIN and I think you need to retry
> in this case.

Yes.

>>   			continue;
>>   		}
>>   		/* Supply virtio_net_hdr if VHOST_NET_F_VIRTIO_NET_HDR */
>> -- 
>> 1.9.5.msysgit.1
>>

^ permalink raw reply

* Re: [PATCH net] vhost_net: don't continue to call the recvmsg when meet errors
From: Michael S. Tsirkin @ 2016-12-01  3:21 UTC (permalink / raw)
  To: wangyunjian
  Cc: jasowang@redhat.com, netdev@vger.kernel.org,
	linux-kernel@vger.kernel.org, caihe
In-Reply-To: <34EFBCA9F01B0748BEB6B629CE643AE60B0A7B68@szxeml561-mbx.china.huawei.com>

On Thu, Dec 01, 2016 at 02:48:59AM +0000, wangyunjian wrote:
> 
> >-----Original Message-----
> >From: Michael S. Tsirkin [mailto:mst@redhat.com] 
> >Sent: Wednesday, November 30, 2016 9:41 PM
> >To: wangyunjian
> >Cc: jasowang@redhat.com; netdev@vger.kernel.org; linux-kernel@vger.kernel.org; caihe
> >Subject: Re: [PATCH net] vhost_net: don't continue to call the recvmsg when meet errors
> >
> >On Wed, Nov 30, 2016 at 08:10:57PM +0800, Yunjian Wang wrote:
> >> When we meet an error(err=-EBADFD) recvmsg,
> >
> >How do you get EBADFD? Won't vhost_net_rx_peek_head_len
> >return 0 in this case, breaking the loop?
> 
> We started many guest VMs while attaching/detaching some virtio-net nics for loop. 
> The soft lockup might happened. The err is -EBADFD.
> 

OK, I'd like to figure out what happened here. why don't
we get 0 when we peek at the head?

EBADFD is from here:
        struct tun_struct *tun = __tun_get(tfile);
...
        if (!tun)
                return -EBADFD;

but then:
static int tun_peek_len(struct socket *sock)
{       

...

        struct tun_struct *tun;
...
        
        tun = __tun_get(tfile);
        if (!tun) 
                return 0;


so peek len should return 0.

then while will exit:
        while ((sock_len = vhost_net_rx_peek_head_len(net, sock->sk)))
...


> meesage log:
> kernel:[609608.510180]BUG: soft lockup - CPU#18 stuck for 23s! [vhost-60898:126093]
> call trace:
> [<fffffffa0132967>]vhost_get_vq_desc+0x1e7/0x984 [vhost]
> [<fffffffa02037e6>]handle_rx+0x226/0x810 [vhost_net]
> [<fffffffa0203de5>]handle_rx_net+0x15/0x20 [vhost_net]
> [<fffffffa013160b>]vhost_worker+0xfb/0x1e0 [vhost]
> [<fffffffa0131510>]? vhost_dev_reset_owner+0x50/0x50 [vhost]
> [<fffffff810a5c7f>]kthread+0xcf/0xe0
> [<fffffff810a5bb0>]? kthread_create_on_node+0x140/0x140
> [<fffffff81648898>]ret_from_fork+0x58/0x90
> [<fffffff810a5bb0>]? kthread_create_on_node+0x140/0x140

So somehow you keep seeing something in tun when we peek.
IMO this is the problem we should try to fix.
Could you try debugging the root cause here?

> >> the error handling in vhost
> >> handle_rx() will continue. This will cause a soft CPU lockup in vhost thread.
> >> 
> >> Signed-off-by: Yunjian Wang <wangyunjian@huawei.com>
> >> ---
> >>  drivers/vhost/net.c | 3 +++
> >>  1 file changed, 3 insertions(+)
> >> 
> >> diff --git a/drivers/vhost/net.c b/drivers/vhost/net.c
> >> index 5dc128a..edc470b 100644
> >> --- a/drivers/vhost/net.c
> >> +++ b/drivers/vhost/net.c
> >> @@ -717,6 +717,9 @@ static void handle_rx(struct vhost_net *net)
> >>  			pr_debug("Discarded rx packet: "
> >>  				 " len %d, expected %zd\n", err, sock_len);
> >>  			vhost_discard_vq_desc(vq, headcount);
> >> +			/* Don't continue to do, when meet errors. */
> >> +			if (err < 0)
> >> +				goto out;
> >
> >You might get e.g. EAGAIN and I think you need to retry
> >in this case.
> >
> >>  			continue;
> >>  		}
> >>  		/* Supply virtio_net_hdr if VHOST_NET_F_VIRTIO_NET_HDR */
> >> -- 
> >> 1.9.5.msysgit.1
> >>

^ permalink raw reply

* Re: [PATCH net] vhost_net: don't continue to call the recvmsg when meet errors
From: Jason Wang @ 2016-12-01  3:26 UTC (permalink / raw)
  To: Michael S. Tsirkin, wangyunjian
  Cc: netdev@vger.kernel.org, linux-kernel@vger.kernel.org, caihe
In-Reply-To: <20161201051207-mutt-send-email-mst@kernel.org>



On 2016年12月01日 11:21, Michael S. Tsirkin wrote:
> On Thu, Dec 01, 2016 at 02:48:59AM +0000, wangyunjian wrote:
>>> -----Original Message-----
>>> From: Michael S. Tsirkin [mailto:mst@redhat.com]
>>> Sent: Wednesday, November 30, 2016 9:41 PM
>>> To: wangyunjian
>>> Cc: jasowang@redhat.com; netdev@vger.kernel.org; linux-kernel@vger.kernel.org; caihe
>>> Subject: Re: [PATCH net] vhost_net: don't continue to call the recvmsg when meet errors
>>>
>>> On Wed, Nov 30, 2016 at 08:10:57PM +0800, Yunjian Wang wrote:
>>>> When we meet an error(err=-EBADFD) recvmsg,
>>> How do you get EBADFD? Won't vhost_net_rx_peek_head_len
>>> return 0 in this case, breaking the loop?
>> We started many guest VMs while attaching/detaching some virtio-net nics for loop.
>> The soft lockup might happened. The err is -EBADFD.
>>
> OK, I'd like to figure out what happened here. why don't
> we get 0 when we peek at the head?
>
> EBADFD is from here:
>          struct tun_struct *tun = __tun_get(tfile);
> ...
>          if (!tun)
>                  return -EBADFD;
>
> but then:
> static int tun_peek_len(struct socket *sock)
> {
>
> ...
>
>          struct tun_struct *tun;
> ...
>          
>          tun = __tun_get(tfile);
>          if (!tun)
>                  return 0;
>
>
> so peek len should return 0.
>
> then while will exit:
>          while ((sock_len = vhost_net_rx_peek_head_len(net, sock->sk)))
> ...
>

Consider this case: user do ip link del link tap0 before recvmsg() but 
after tun_peek_len() ?

>> meesage log:
>> kernel:[609608.510180]BUG: soft lockup - CPU#18 stuck for 23s! [vhost-60898:126093]
>> call trace:
>> [<fffffffa0132967>]vhost_get_vq_desc+0x1e7/0x984 [vhost]
>> [<fffffffa02037e6>]handle_rx+0x226/0x810 [vhost_net]
>> [<fffffffa0203de5>]handle_rx_net+0x15/0x20 [vhost_net]
>> [<fffffffa013160b>]vhost_worker+0xfb/0x1e0 [vhost]
>> [<fffffffa0131510>]? vhost_dev_reset_owner+0x50/0x50 [vhost]
>> [<fffffff810a5c7f>]kthread+0xcf/0xe0
>> [<fffffff810a5bb0>]? kthread_create_on_node+0x140/0x140
>> [<fffffff81648898>]ret_from_fork+0x58/0x90
>> [<fffffff810a5bb0>]? kthread_create_on_node+0x140/0x140
> So somehow you keep seeing something in tun when we peek.
> IMO this is the problem we should try to fix.
> Could you try debugging the root cause here?
>
>>>> the error handling in vhost
>>>> handle_rx() will continue. This will cause a soft CPU lockup in vhost thread.
>>>>
>>>> Signed-off-by: Yunjian Wang <wangyunjian@huawei.com>
>>>> ---
>>>>   drivers/vhost/net.c | 3 +++
>>>>   1 file changed, 3 insertions(+)
>>>>
>>>> diff --git a/drivers/vhost/net.c b/drivers/vhost/net.c
>>>> index 5dc128a..edc470b 100644
>>>> --- a/drivers/vhost/net.c
>>>> +++ b/drivers/vhost/net.c
>>>> @@ -717,6 +717,9 @@ static void handle_rx(struct vhost_net *net)
>>>>   			pr_debug("Discarded rx packet: "
>>>>   				 " len %d, expected %zd\n", err, sock_len);
>>>>   			vhost_discard_vq_desc(vq, headcount);
>>>> +			/* Don't continue to do, when meet errors. */
>>>> +			if (err < 0)
>>>> +				goto out;
>>> You might get e.g. EAGAIN and I think you need to retry
>>> in this case.
>>>
>>>>   			continue;
>>>>   		}
>>>>   		/* Supply virtio_net_hdr if VHOST_NET_F_VIRTIO_NET_HDR */
>>>> -- 
>>>> 1.9.5.msysgit.1
>>>>

^ permalink raw reply

page: next (older) | prev (newer) | latest
- recent:[subjects (threaded)|topics (new)|topics (active)]

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox