Netdev List
* Re: [PATCH iproute2 0/2] tc: flower: Support matching on SCTP ports
From: Stephen Hemminger @ 2016-11-12  7:32 UTC (permalink / raw)
  To: Simon Horman; +Cc: David Miller, netdev
In-Reply-To: <1478176001-29174-1-git-send-email-simon.horman@netronome.com>

On Thu,  3 Nov 2016 13:26:39 +0100
Simon Horman <simon.horman@netronome.com> wrote:

> Hi,
> 
> this short series adds support for matching on SCTP ports in the same way
> that matching on TCP and UDP ports is already supported. It corresponds to
> a net-next patch to add the same support to the kernel.
> 
> Example usage:
> 
> tc qdisc add dev eth0 ingress
> 
> tc filter add dev eth0 protocol ip parent ffff: \
>     flower indev eth0 ip_proto sctp dst_port 80 \
>     action drop
> 
> 
> Simon Horman (2):
>   tc: update headers for TCA_FLOWER_KEY_SCTP_*
>   tc: flower: Support matching on SCTP ports
> 
>  include/linux/pkt_cls.h |  5 ++++
>  tc/f_flower.c           | 65 +++++++++++++++++++++++--------------------------
>  2 files changed, 36 insertions(+), 34 deletions(-)
> 

Applied, thanks.

^ permalink raw reply

* Re: [PATCH iproute2] tc: flower: Fix usage message
From: Stephen Hemminger @ 2016-11-12  7:28 UTC (permalink / raw)
  To: Paul Blakey; +Cc: netdev, Or Gerlitz
In-Reply-To: <1478099398-21639-1-git-send-email-paulb@mellanox.com>

On Wed,  2 Nov 2016 17:09:58 +0200
Paul Blakey <paulb@mellanox.com> wrote:

> Remove left over usage from removal of eth_type argument.
> 
> Fixes: 488b41d020fb ('tc: flower no need to specify the ethertype')
> Signed-off-by: Paul Blakey <paulb@mellanox.com>
> ---

Applied, thanks.
I then adjusted the usage message to silence checkpatch's long-line warnings.

^ permalink raw reply

* Re: [PATCH v4] iproute2: macvlan: add "source" mode
From: Stephen Hemminger @ 2016-11-12  7:12 UTC (permalink / raw)
  To: Michael Braun; +Cc: netdev, projekt-wlan, steweg
In-Reply-To: <1477565076-18968-1-git-send-email-michael-dev@fami-braun.de>

On Thu, 27 Oct 2016 12:44:36 +0200
Michael Braun <michael-dev@fami-braun.de> wrote:

> Adjusting iproute2 utility to support new macvlan link type mode called
> "source".
> 
> Example of commands that can be applied:
>   ip link add link eth0 name macvlan0 type macvlan mode source
>   ip link set link dev macvlan0 type macvlan macaddr add 00:11:11:11:11:11
>   ip link set link dev macvlan0 type macvlan macaddr del 00:11:11:11:11:11
>   ip link set link dev macvlan0 type macvlan macaddr flush
>   ip -details link show dev macvlan0
> 
> Based on previous work of Stefan Gula <steweg@gmail.com>
> 
> Signed-off-by: Michael Braun <michael-dev@fami-braun.de>
> 
> Cc: steweg@gmail.com
> 
> v4:
>  - add MACADDR_SET support
>  - skip FLAG_UNICAST / FLAG_UNICAST_ALL as this is not upstream
>  - fix man page

The patch looks good, but needs to be cleaned up.

It does not apply to current iproute2 git, and also has a minor checkpatch issue.

--- ip/iplink_macvlan.c
+++ ip/iplink_macvlan.c
@@ -46,7 +51,14 @@ static void explain(struct link_util *lu)
 
 static int mode_arg(const char *arg)
 {
-        fprintf(stderr, "Error: argument of \"mode\" must be \"private\", \"vepa\", \"bridge\" or \"passthru\", not \"%s\"\n",
+	fprintf(stderr, "Error: argument of \"mode\" must be \"private\", \"vepa\", \"bridge\", \"passthru\" or \"source\", not \"%s\"\n",
+		arg);
+	return -1;
+}
+
+static int flag_arg(const char *arg)
+{
+	fprintf(stderr, "Error: argument of \"flag\" must be \"nopromisc\" or \"null\", not \"%s\"\n",
 		arg);
 	return -1;
 }


WARNING: unnecessary whitespace before a quoted newline
#59: FILE: ip/iplink_macvlan.c:35:
+		"MODE_FLAG: null | nopromisc \n"

^ permalink raw reply

* Re: [PATCH] iproute2: ss: escape all null bytes in abstract unix domain socket
From: Stephen Hemminger @ 2016-11-12  7:17 UTC (permalink / raw)
  To: Isaac Boukris; +Cc: davem, netdev, linux-kernel
In-Reply-To: <1477768820-1295-1-git-send-email-iboukris@gmail.com>

On Sat, 29 Oct 2016 22:20:19 +0300
Isaac Boukris <iboukris@gmail.com> wrote:

> Abstract unix domain socket may embed null characters,
> these should be translated to '@' when printed by ss the
> same way the null prefix is currently being translated.
> 
> Signed-off-by: Isaac Boukris <iboukris@gmail.com>

Applied
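
For illustration, a minimal Python sketch of the translation the commit
message describes: every null byte in an abstract socket name, not only
the leading one, is shown as '@'. The function name is hypothetical and
this is not the ss code itself.

```python
def format_abstract_name(raw: bytes) -> str:
    """Render a unix socket name, showing every NUL byte as '@'."""
    # Replace NUL bytes first, then decode whatever remains for display.
    return raw.replace(b"\x00", b"@").decode("ascii", errors="replace")


if __name__ == "__main__":
    # The leading NUL marks the abstract namespace; the embedded NUL is
    # simply part of the name.
    print(format_abstract_name(b"\x00foo\x00bar"))  # @foo@bar
```

Translating only the first byte would leave the embedded NUL in the
printed name.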

^ permalink raw reply

* Re: [PATCH iproute2] ip: update link types to show 6lowpan and ieee802.15.4 monitor
From: Stephen Hemminger @ 2016-11-12  7:15 UTC (permalink / raw)
  To: Stefan Schmidt; +Cc: netdev, linux-wpan
In-Reply-To: <1477647723-14641-1-git-send-email-stefan@datenfreihafen.org>

On Fri, 28 Oct 2016 11:42:03 +0200
Stefan Schmidt <stefan@datenfreihafen.org> wrote:

> Both types have been missing here and thus ip always showed
> only the numbers.
> 
> Based on a suggestion from Alexander Aring.
> 
> Signed-off-by: Stefan Schmidt <stefan@datenfreihafen.org>

Applied 

^ permalink raw reply

* Re: [PATCH net-next] bpf: fix range arithmetic for bpf map access
From: Alexei Starovoitov @ 2016-11-12  3:13 UTC (permalink / raw)
  To: Josef Bacik; +Cc: jannh, ast, daniel, davem, netdev
In-Reply-To: <1478900859-7807-1-git-send-email-jbacik@fb.com>

On Fri, Nov 11, 2016 at 04:47:39PM -0500, Josef Bacik wrote:
> I made some invalid assumptions with BPF_AND and BPF_MOD that could result in
> invalid accesses to bpf map entries.  Fix this up by doing a few things
> 
> 1) Kill BPF_MOD support.  This doesn't actually get used by the compiler in real
> life and just adds extra complexity.
> 
> 2) Fix the logic for BPF_AND, don't allow AND of negative numbers and set the
> minimum value to 0 for positive AND's.
> 
> 3) Don't do operations on the ranges if they are set to the limits, as they are
> by definition undefined, and allowing arithmetic operations on those values
> could make them appear valid when they really aren't.
> 
> This fixes the testcase provided by Jann as well as a few other theoretical
> problems.
> 
> Reported-by: Jann Horn <jannh@google.com>
> Signed-off-by: Josef Bacik <jbacik@fb.com>
> ---
>  include/linux/bpf_verifier.h |  3 +-
>  kernel/bpf/verifier.c        | 70 +++++++++++++++++++++++++++++---------------
>  2 files changed, 49 insertions(+), 24 deletions(-)
> 
> diff --git a/include/linux/bpf_verifier.h b/include/linux/bpf_verifier.h
> index ac5b393..15ceb7f 100644
> --- a/include/linux/bpf_verifier.h
> +++ b/include/linux/bpf_verifier.h
> @@ -22,7 +22,8 @@ struct bpf_reg_state {
>  	 * Used to determine if any memory access using this register will
>  	 * result in a bad access.
>  	 */
> -	u64 min_value, max_value;
> +	s64 min_value;
> +	u64 max_value;
>  	u32 id;
>  	union {
>  		/* valid when type == CONST_IMM | PTR_TO_STACK | UNKNOWN_VALUE */
> diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c
> index 89f787c..709fe0e 100644
> --- a/kernel/bpf/verifier.c
> +++ b/kernel/bpf/verifier.c
> @@ -234,8 +234,8 @@ static void print_verifier_state(struct bpf_verifier_state *state)
>  				reg->map_ptr->value_size,
>  				reg->id);
>  		if (reg->min_value != BPF_REGISTER_MIN_RANGE)
> -			verbose(",min_value=%llu",
> -				(unsigned long long)reg->min_value);
> +			verbose(",min_value=%lld",
> +				(long long)reg->min_value);
>  		if (reg->max_value != BPF_REGISTER_MAX_RANGE)
>  			verbose(",max_value=%llu",
>  				(unsigned long long)reg->max_value);
> @@ -778,7 +778,7 @@ static int check_mem_access(struct bpf_verifier_env *env, u32 regno, int off,
>  			 * index'es we need to make sure that whatever we use
>  			 * will have a set floor within our range.
>  			 */
> -			if ((s64)reg->min_value < 0) {
> +			if (reg->min_value < 0) {
>  				verbose("R%d min value is negative, either use unsigned index or do a if (index >=0) check.\n",
>  					regno);
>  				return -EACCES;
> @@ -1490,7 +1490,8 @@ static void check_reg_overflow(struct bpf_reg_state *reg)
>  {
>  	if (reg->max_value > BPF_REGISTER_MAX_RANGE)
>  		reg->max_value = BPF_REGISTER_MAX_RANGE;
> -	if ((s64)reg->min_value < BPF_REGISTER_MIN_RANGE)
> +	if (reg->min_value < BPF_REGISTER_MIN_RANGE ||
> +	    reg->min_value > BPF_REGISTER_MAX_RANGE)
>  		reg->min_value = BPF_REGISTER_MIN_RANGE;
>  }
>  
> @@ -1498,7 +1499,8 @@ static void adjust_reg_min_max_vals(struct bpf_verifier_env *env,
>  				    struct bpf_insn *insn)
>  {
>  	struct bpf_reg_state *regs = env->cur_state.regs, *dst_reg;
> -	u64 min_val = BPF_REGISTER_MIN_RANGE, max_val = BPF_REGISTER_MAX_RANGE;
> +	s64 min_val = BPF_REGISTER_MIN_RANGE;
> +	u64 max_val = BPF_REGISTER_MAX_RANGE;
>  	u8 opcode = BPF_OP(insn->code);
>  
>  	dst_reg = &regs[insn->dst_reg];
> @@ -1532,22 +1534,43 @@ static void adjust_reg_min_max_vals(struct bpf_verifier_env *env,
>  		return;
>  	}
>  
> +	/* If one of our values was at the end of our ranges then we can't just
> +	 * do our normal operations to the register, we need to set the values
> +	 * to the min/max since they are undefined.
> +	 */
> +	if (min_val == BPF_REGISTER_MIN_RANGE)
> +		dst_reg->min_value = BPF_REGISTER_MIN_RANGE;
> +	if (max_val == BPF_REGISTER_MAX_RANGE)
> +		dst_reg->max_value = BPF_REGISTER_MAX_RANGE;
> +
>  	switch (opcode) {
>  	case BPF_ADD:
> -		dst_reg->min_value += min_val;
> -		dst_reg->max_value += max_val;
> +		if (dst_reg->min_value != BPF_REGISTER_MIN_RANGE)
> +			dst_reg->min_value += min_val;
> +		if (dst_reg->max_value != BPF_REGISTER_MAX_RANGE)
> +			dst_reg->max_value += max_val;
>  		break;
>  	case BPF_SUB:
> -		dst_reg->min_value -= min_val;
> -		dst_reg->max_value -= max_val;
> +		if (dst_reg->min_value != BPF_REGISTER_MIN_RANGE)
> +			dst_reg->min_value -= min_val;
> +		if (dst_reg->max_value != BPF_REGISTER_MAX_RANGE)
> +			dst_reg->max_value -= max_val;
>  		break;
>  	case BPF_MUL:
> -		dst_reg->min_value *= min_val;
> -		dst_reg->max_value *= max_val;
> +		if (dst_reg->min_value != BPF_REGISTER_MIN_RANGE)
> +			dst_reg->min_value *= min_val;

There look to be a few issues here with negative values as well.
If the dst_reg range is [-2, 5] and the right-hand side range is [-2, 10],
then the above will compute the new minimum as -2 * -2 == 4.
But even if we do -1 * abs(dst_reg->min) * abs(min), it's still
incorrect, since dst_reg could be 5 and be multiplied by -2 (== -10),
which is less than what the simple math on the min values gives...
So I'd suggest disabling negative values everywhere.
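
A small Python sketch of the arithmetic behind this point, using the
ranges from the example above. The function names are made up for
illustration and this is not verifier code; Python integers do not
overflow, so the 64-bit wrap-around concerns are left out.

```python
def naive_mul_range(a_min, a_max, b_min, b_max):
    """Multiply min by min and max by max, as the quoted BPF_MUL hunk does."""
    return a_min * b_min, a_max * b_max


def exact_mul_range(a_min, a_max, b_min, b_max):
    """Exact (min, max) of x * y for x in [a_min, a_max], y in [b_min, b_max]."""
    # With mixed signs the extremes can come from any pairing of endpoints,
    # so all four corner products have to be considered.
    corners = [a_min * b_min, a_min * b_max, a_max * b_min, a_max * b_max]
    return min(corners), max(corners)


if __name__ == "__main__":
    print(naive_mul_range(-2, 5, -2, 10))  # (4, 50)
    print(exact_mul_range(-2, 5, -2, 10))  # (-20, 50)
```

The true minimum here is -20 (from -2 * 10), lower still than the -10
case mentioned above, while min-times-min reports 4.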

> +		if (dst_reg->max_value != BPF_REGISTER_MAX_RANGE)
> +			dst_reg->max_value *= max_val;
>  		break;
>  	case BPF_AND:
> -		/* & is special since it could end up with 0 bits set. */
> -		dst_reg->min_value &= min_val;
> +		/* Disallow AND'ing of negative numbers, ain't nobody got time
> +		 * for that.  Otherwise the minimum is 0 and the max is the max
> +		 * value we could AND against.
> +		 */
> +		if (min_val < 0)
> +			dst_reg->min_value = BPF_REGISTER_MIN_RANGE;
> +		else
> +			dst_reg->min_value = 0;
>  		dst_reg->max_value = max_val;
>  		break;
>  	case BPF_LSH:
> @@ -1557,24 +1580,25 @@ static void adjust_reg_min_max_vals(struct bpf_verifier_env *env,
>  		 */
>  		if (min_val > ilog2(BPF_REGISTER_MAX_RANGE))
>  			dst_reg->min_value = BPF_REGISTER_MIN_RANGE;
> -		else
> +		else if (dst_reg->min_value != BPF_REGISTER_MIN_RANGE)
>  			dst_reg->min_value <<= min_val;
>  
>  		if (max_val > ilog2(BPF_REGISTER_MAX_RANGE))
>  			dst_reg->max_value = BPF_REGISTER_MAX_RANGE;
> -		else
> +		else if (dst_reg->max_value != BPF_REGISTER_MAX_RANGE)
>  			dst_reg->max_value <<= max_val;
>  		break;
>  	case BPF_RSH:
> -		dst_reg->min_value >>= min_val;
> -		dst_reg->max_value >>= max_val;
> -		break;
> -	case BPF_MOD:
> -		/* % is special since it is an unsigned modulus, so the floor
> -		 * will always be 0.
> +		/* RSH by a negative number is undefined, and the BPF_RSH is an
> +		 * unsigned shift, so make the appropriate casts.
>  		 */
> -		dst_reg->min_value = 0;
> -		dst_reg->max_value = max_val - 1;
> +		if (min_val < 0)
> +			dst_reg->min_value = BPF_REGISTER_MIN_RANGE;
> +		else if (dst_reg->min_value != BPF_REGISTER_MIN_RANGE)
> +			dst_reg->min_value =
> +				(u64)(dst_reg->min_value) >> min_val;

When min_val is negative, both >> and << are undefined,
so we need to avoid negative values in these cases as well.

> +		if (dst_reg->max_value != BPF_REGISTER_MAX_RANGE)
> +			dst_reg->max_value >>= max_val;

And for max_val too, we need to make sure that max_val >= 0.

To address all of it, I'm thinking it will be easier to set
BPF_REGISTER_MIN_RANGE to -1.
I don't think we can kill tracking of min_val completely
and assume a valid min starts at zero, since we need either min
tracking or a boolean flag that indicates negative overflow, and
min tracking is imo cleaner (though a valid min will always be >= 0
and an invalid min is -1).

Also, this patch has to go to the 'net' tree, so rebasing onto net-next
wasn't necessary.
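
One possible reading of the "-1 means invalid" idea, as a rough Python
sketch. MIN_UNKNOWN and both helpers are hypothetical names, this is not
the eventual kernel change, and overflow clamping against the maximum
range is deliberately left out.

```python
MIN_UNKNOWN = -1  # assumed sentinel, standing in for BPF_REGISTER_MIN_RANGE == -1


def add_min(dst_min, src_min):
    """Lower bound of dst + src, or MIN_UNKNOWN if either input is invalid."""
    # Any negative bound is treated as "not tracked", so the result is too.
    if dst_min < 0 or src_min < 0:
        return MIN_UNKNOWN
    return dst_min + src_min


def lsh_min(dst_min, shift_min):
    """Lower bound of dst << shift, or MIN_UNKNOWN if either input is invalid."""
    # A negative shift count is undefined, so refuse to track the result.
    if dst_min < 0 or shift_min < 0:
        return MIN_UNKNOWN
    return dst_min << shift_min


if __name__ == "__main__":
    print(add_min(3, 4))   # 7
    print(add_min(-1, 4))  # -1
    print(lsh_min(3, 2))   # 12
    print(lsh_min(3, -1))  # -1
```

With a single sentinel, every operation only needs one "is either side
negative" test before doing its normal arithmetic.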

^ permalink raw reply

* Re: TCP performance problems - GSO/TSO, MSS, 8139cp related
From: David Miller @ 2016-11-12  2:52 UTC (permalink / raw)
  To: linux; +Cc: dwmw2, netdev, qemu-devel
In-Reply-To: <20161111223307.GF1041@n2100.armlinux.org.uk>

From: Russell King - ARM Linux <linux@armlinux.org.uk>
Date: Fri, 11 Nov 2016 22:33:08 +0000

> "The new buffer management algorithm provides capabilities of Microsoft
> Large-Send offload" and as yet I haven't found anything that describes
> what this is or how it works.

For once I will give Microsoft a big shout out here.

This, and everything a Microsoft networking driver interfaces to, is
_very_ much documented in extreme detail in the Microsoft NDIS
(Network Driver Interface Specification).

Microsoft's networking driver interfaces and expectations are
documented 1,000 times better than that of Linux.

^ permalink raw reply

* Re: Source address fib invalidation on IPv6
From: Jason A. Donenfeld @ 2016-11-12  2:18 UTC (permalink / raw)
  To: David Ahern; +Cc: Netdev, LKML, WireGuard mailing list
In-Reply-To: <31e050e2-0499-a77e-f698-86e58ad2fa6b@cumulusnetworks.com>

Hi David,

On Fri, Nov 11, 2016 at 11:14 PM, David Ahern <dsa@cumulusnetworks.com> wrote:
> What do you mean by 'valid dst'? ipv6 returns net->ipv6.ip6_null_entry on lookup failures so yes dst is non-NULL but that does not mean the lookup succeeded.

What I mean is that it returns an ordinary dst, as if that source
address _hadn't_ been removed from the interface, even though I just
removed it. Is this buggy behavior? If so, let me know and I'll try to
track it down. The expected behavior, as far as I can see, would be
the same as that of ip_route_output_flow: returning -EINVAL when the
saddr isn't valid. At the moment, when the saddr is invalid,
ipv6_stub->ipv6_dst_lookup returns 0 and &dst contains a real entry.

Regards,
Jason

^ permalink raw reply

* Re: Long delays creating a netns after deleting one (possibly RCU related)
From: Cong Wang @ 2016-11-12  0:55 UTC (permalink / raw)
  To: Paul E. McKenney
  Cc: Rolf Neugebauer, LKML, Linux Kernel Network Developers,
	Justin Cormack, Ian Campbell
In-Reply-To: <20161112002347.GL4127@linux.vnet.ibm.com>

On Fri, Nov 11, 2016 at 4:23 PM, Paul E. McKenney
<paulmck@linux.vnet.ibm.com> wrote:
>
> Ah!  This net_mutex is different than RTNL.  Should synchronize_net() be
> modified to check for net_mutex being held in addition to the current
> checks for RTNL being held?
>

Good point!

As with commit be3fc413da9eb17cce0991f214ab0, checking
for net_mutex in this case seems to be an optimization; I assume
synchronize_rcu_expedited() and synchronize_rcu() have the same
behavior...

diff --git a/net/core/dev.c b/net/core/dev.c
index eaad4c2..3415b6b 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -7762,7 +7762,7 @@ EXPORT_SYMBOL(free_netdev);
 void synchronize_net(void)
 {
        might_sleep();
-       if (rtnl_is_locked())
+       if (rtnl_is_locked() || lockdep_is_held(&net_mutex))
                synchronize_rcu_expedited();
        else
                synchronize_rcu();

^ permalink raw reply related

* Re: Long delays creating a netns after deleting one (possibly RCU related)
From: Paul E. McKenney @ 2016-11-12  0:23 UTC (permalink / raw)
  To: Rolf Neugebauer
  Cc: Cong Wang, LKML, Linux Kernel Network Developers, Justin Cormack,
	Ian Campbell
In-Reply-To: <CA+pO-2ddoJYvhVoV3Bo+-3K7V63eY9sbKzKwYrut65GZXs6_1g@mail.gmail.com>

On Fri, Nov 11, 2016 at 01:11:01PM +0000, Rolf Neugebauer wrote:
> On Thu, Nov 10, 2016 at 9:24 PM, Paul E. McKenney
> <paulmck@linux.vnet.ibm.com> wrote:
> > On Thu, Nov 10, 2016 at 09:37:47AM -0800, Cong Wang wrote:
> >> (Cc'ing Paul)
> >>
> >> On Wed, Nov 9, 2016 at 7:42 AM, Rolf Neugebauer
> >> <rolf.neugebauer@docker.com> wrote:
> >> > Hi
> >> >
> >> > We noticed some long delays starting docker containers on some newer
> >> > kernels (starting with 4.5.x and still present in 4.9-rc4, 4.4.x is
> >> > fine). We narrowed this down to the creation of a network namespace
> >> > being delayed directly after removing another one (details and
> >> > reproduction below). We have seen delays of up to 60s on some systems.
> >> >
> >> > - The delay is proportional to the number of CPUs (online or offline).
> >> > We first discovered it with a Hyper-V Linux VM. Hyper-V advertises up
> >> > to 240 offline vCPUs even if one configures the VM with only, say 2
> >> > vCPUs. We see linear increase in delay when we change NR_CPUS in the
> >> > kernel config.
> >> >
> >> > - The delay is also dependent on some tunnel network interfaces being
> >> > present (which we had compiled in in one of our kernel configs).
> >> >
> >> > - We can reproduce this issue with stock kernels from
> >> > http://kernel.ubuntu.com/~kernel-ppa/mainline/ running in Hyper-V VMs
> >> > as well as other hypervisors like qemu and hyperkit where we have good
> >> > control over the number of CPUs.
> >> >
> >> > A simple test is:
> >> > modprobe ipip
> >> > modprobe ip_gre
> >> > modprobe ip_vti
> >> > echo -n "add netns foo ===> "; /usr/bin/time -f "%E" ip netns add foo
> >> > echo -n "del netns foo ===> "; /usr/bin/time -f "%E" ip netns delete foo
> >> > echo -n "add netns bar ===> "; /usr/bin/time -f "%E" ip netns add bar
> >> > echo -n "del netns bar ===> "; /usr/bin/time -f "%E" ip netns delete bar
> >> >
> >> > with an output like:
> >> > add netns foo ===> 0:00.00
> >> > del netns foo ===> 0:00.01
> >> > add netns bar ===> 0:08.53
> >> > del netns bar ===> 0:00.01
> >> >
> >> > This is on a 4.9-rc4 kernel from the above URL configured with
> >> > NR_CPUS=256 running in a Hyper-V VM (kernel config attached).
> >> >
> >> > Below is a dump of the work queues while the second 'ip netns add' is
> >> > hanging. The state of the work queues does not seem to change while
> >> > the command is delayed and the pattern shown is consistent across
> >> > different kernel versions.
> >> >
> >> > Is this a known issue and/or is someone working on a fix?
> >>
> >> Not to me.
> >>
> >>
> >> >
> >> > [  610.356272] sysrq: SysRq : Show Blocked State
> >> > [  610.356742]   task                        PC stack   pid father
> >> > [  610.357252] kworker/u480:1  D    0  1994      2 0x00000000
> >> > [  610.357752] Workqueue: netns cleanup_net
> >> > [  610.358239]  ffff9892f1065800 0000000000000000 ffff9892ee1e1e00
> >> > ffff9892f8e59340
> >> > [  610.358705]  ffff9892f4526900 ffffbf0104b5ba88 ffffffffbe486df3
> >> > ffffbf0104b5ba60
> >> > [  610.359168]  00ffffffbdcbe663 ffff9892f8e59340 0000000100012e70
> >> > ffff9892ee1e1e00
> >> > [  610.359677] Call Trace:
> >> > [  610.360169]  [<ffffffffbe486df3>] ? __schedule+0x233/0x6e0
> >> > [  610.360723]  [<ffffffffbe4872d6>] schedule+0x36/0x80
> >> > [  610.361194]  [<ffffffffbe48a9ca>] schedule_timeout+0x22a/0x3f0
> >> > [  610.361789]  [<ffffffffbe486dfb>] ? __schedule+0x23b/0x6e0
> >> > [  610.362260]  [<ffffffffbe487d24>] wait_for_completion+0xb4/0x140
> >> > [  610.362736]  [<ffffffffbdcb05a0>] ? wake_up_q+0x80/0x80
> >> > [  610.363306]  [<ffffffffbdceb528>] __wait_rcu_gp+0xc8/0xf0
> >> > [  610.363782]  [<ffffffffbdceea5c>] synchronize_sched+0x5c/0x80
> >> > [  610.364137]  [<ffffffffbdcf0010>] ? call_rcu_bh+0x20/0x20
> >> > [  610.364742]  [<ffffffffbdceb440>] ?
> >> > trace_raw_output_rcu_utilization+0x60/0x60
> >> > [  610.365337]  [<ffffffffbe3696bc>] synchronize_net+0x1c/0x30
> >>
> >> This is a worker which holds the net_mutex and is waiting for
> >> a RCU grace period to elapse.

Ah!  This net_mutex is different than RTNL.  Should synchronize_net() be
modified to check for net_mutex being held in addition to the current
checks for RTNL being held?

							Thanx, Paul

> >> > [  610.365846]  [<ffffffffbe369803>] netif_napi_del+0x23/0x80
> >> > [  610.367494]  [<ffffffffc057f6f8>] ip_tunnel_dev_free+0x68/0xf0 [ip_tunnel]
> >> > [  610.368007]  [<ffffffffbe372c10>] netdev_run_todo+0x230/0x330
> >> > [  610.368454]  [<ffffffffbe37eb4e>] rtnl_unlock+0xe/0x10
> >> > [  610.369001]  [<ffffffffc057f4df>] ip_tunnel_delete_net+0xdf/0x120 [ip_tunnel]
> >> > [  610.369500]  [<ffffffffc058b92c>] ipip_exit_net+0x2c/0x30 [ipip]
> >> > [  610.369997]  [<ffffffffbe362688>] ops_exit_list.isra.4+0x38/0x60
> >> > [  610.370636]  [<ffffffffbe363674>] cleanup_net+0x1c4/0x2b0
> >> > [  610.371130]  [<ffffffffbdc9e4ac>] process_one_work+0x1fc/0x4b0
> >> > [  610.371812]  [<ffffffffbdc9e7ab>] worker_thread+0x4b/0x500
> >> > [  610.373074]  [<ffffffffbdc9e760>] ? process_one_work+0x4b0/0x4b0
> >> > [  610.373622]  [<ffffffffbdc9e760>] ? process_one_work+0x4b0/0x4b0
> >> > [  610.374100]  [<ffffffffbdca4b09>] kthread+0xd9/0xf0
> >> > [  610.374574]  [<ffffffffbdca4a30>] ? kthread_park+0x60/0x60
> >> > [  610.375198]  [<ffffffffbe48c2b5>] ret_from_fork+0x25/0x30
> >> > [  610.375678] ip              D    0  2149   2148 0x00000000
> >> > [  610.376185]  ffff9892f0a99000 0000000000000000 ffff9892f0a66900
> >> > [  610.376185]  ffff9892f8e59340
> >> > [  610.376185]  ffff9892f4526900 ffffbf0101173db8 ffffffffbe486df3
> >> > [  610.376753]  00000005fecffd76
> >> > [  610.376762]  00ff9892f11d9820 ffff9892f8e59340 ffff989200000000
> >> > ffff9892f0a66900
> >> > [  610.377274] Call Trace:
> >> > [  610.377789]  [<ffffffffbe486df3>] ? __schedule+0x233/0x6e0
> >> > [  610.378306]  [<ffffffffbe4872d6>] schedule+0x36/0x80
> >> > [  610.378992]  [<ffffffffbe48756e>] schedule_preempt_disabled+0xe/0x10
> >> > [  610.379514]  [<ffffffffbe489199>] __mutex_lock_slowpath+0xb9/0x130
> >> > [  610.380031]  [<ffffffffbde0fce2>] ? __kmalloc+0x162/0x1e0
> >> > [  610.380556]  [<ffffffffbe48922f>] mutex_lock+0x1f/0x30
> >> > [  610.381135]  [<ffffffffbe3637ff>] copy_net_ns+0x9f/0x170
> >> > [  610.381647]  [<ffffffffbdca5e6b>] create_new_namespaces+0x11b/0x200
> >> > [  610.382249]  [<ffffffffbdca60fa>] unshare_nsproxy_namespaces+0x5a/0xb0
> >> > [  610.382818]  [<ffffffffbdc82dcd>] SyS_unshare+0x1cd/0x360
> >> > [  610.383319]  [<ffffffffbe48c03b>] entry_SYSCALL_64_fastpath+0x1e/0xad
> >>
> >> This process is apparently waiting for the net_mutex held by the previous one.
> >>
> >> Either the RCU implementation is broken or something else is missing.
> >> Do you have more stack traces of related processes? For example,
> >> rcu_tasks_kthread. And if there is anything you can do to help narrow
> >> down the problem, that would be great.
> >
> > Did you set the rcu_normal boot parameter?  Doing so would have this effect.
> >
> > (It is intended for real-time users who don't like expedited grace periods.)
> 
> rcu_normal is not set on the kernel command line and
> /sys/kernel/rcu_normal and /sys/kernel/rcu_expedited  both show 0.
> 
> 
> >
> >                                                         Thanx, Paul
> >
> 

^ permalink raw reply

* Re: [PATCH net-next v2 2/7] vxlan: simplify exception handling
From: Pravin Shelar @ 2016-11-11 23:24 UTC (permalink / raw)
  To: Jiri Benc; +Cc: Linux Kernel Network Developers
In-Reply-To: <20161111121436.56b214ab@griffin>

On Fri, Nov 11, 2016 at 3:14 AM, Jiri Benc <jbenc@redhat.com> wrote:
> On Thu, 10 Nov 2016 11:21:19 -0800, Pravin Shelar wrote:
>> One additional variable is not bad but look at what has happened in
>> vxlan_xmit_one(). There are already more than 20 variables defined. It
>> is hard to read code in this case.
>
> I agree that the function is horrible.
>
> What I was thinking about was separating the vxlan data and control
> plane. The vxlan data plane would perform encapsulation and
> decapsulation based on lwtunnel infrastructure and the rest of the
> "classical" vxlan would be just one of the users of that. Basically
> replacing vxlan_rdst by ip_tunnel_info, among other things.
>
> That would make the vxlan code much much cleaner.
>
I have a patch that does something similar for geneve, but it is tricky
to do for vxlan.

>> anyways I can add another variable to the function. I do not feel that
>> strongly about this.
>
> Me neither, actually. I prefer another variable but I won't oppose the
> patchset just based on that if you choose differently.
>
I have already updated the patches and will post them soon.

^ permalink raw reply

* Re: TCP performance problems - GSO/TSO, MSS, 8139cp related
From: Russell King - ARM Linux @ 2016-11-11 22:44 UTC (permalink / raw)
  To: David Woodhouse; +Cc: netdev, qemu-devel
In-Reply-To: <1478899423.3892.7.camel@infradead.org>

On Fri, Nov 11, 2016 at 09:23:43PM +0000, David Woodhouse wrote:
> It's also *fairly* unlikely that the kernel in the guest has developed
> a bug and isn't setting gso_size sanely. I'm more inclined to suspect
> that qemu isn't properly emulating those bits. But at first glance at
> the code, it looks like *that's* been there for the last decade too...

I take issue with that, having looked at the qemu rtl8139 code:

                if ((txdw0 & CP_TX_LGSEN) && ip_protocol == IP_PROTO_TCP)
                {
                    int large_send_mss = (txdw0 >> 16) & CP_TC_LGSEN_MSS_MASK;

                    DPRINTF("+++ C+ mode offloaded task TSO MTU=%d IP data %d "
                        "frame data %d specified MSS=%d\n", ETH_MTU,
                        ip_data_len, saved_size - ETH_HLEN, large_send_mss);

That's the only reference to "large_send_mss" there; other than that,
the MSS value that gets stuck into the field by 8139cp.c is completely
unused.  Instead, qemu does this:

                eth_payload_data = saved_buffer + ETH_HLEN;
                eth_payload_len  = saved_size   - ETH_HLEN;

                ip = (ip_header*)eth_payload_data;

                    hlen = IP_HEADER_LENGTH(ip);
                    ip_data_len = be16_to_cpu(ip->ip_len) - hlen;

                    tcp_header *p_tcp_hdr = (tcp_header*)(eth_payload_data + hlen);
                    int tcp_hlen = TCP_HEADER_DATA_OFFSET(p_tcp_hdr);

                    /* ETH_MTU = ip header len + tcp header len + payload */
                    int tcp_data_len = ip_data_len - tcp_hlen;
                    int tcp_chunk_size = ETH_MTU - hlen - tcp_hlen;

                    for (tcp_send_offset = 0; tcp_send_offset < tcp_data_len; tcp_send_offset += tcp_chunk_size)
                    {

It uses a fixed value of ETH_MTU to calculate the size of the TCP
data chunks, and this is, not surprisingly, the well-known:

#define ETH_MTU     1500

Qemu seems to be buggy - it ignores the MSS value, and always tries to
send 1500 byte frames.
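
To make the consequence concrete, here is a minimal Python sketch (not
qemu code; the helper names are made up for illustration) of the chunking
the quoted loop performs, next to chunking by an MSS-derived size. The
20-byte IP header, the 32-byte TCP header (12 bytes of options), and the
1440-byte MSS-derived size are assumptions.

```python
ETH_MTU = 1500  # the fixed value used by the quoted qemu code


def qemu_chunk_size(ip_hlen, tcp_hlen):
    """Chunk size as computed in the quoted loop: derived from ETH_MTU only."""
    return ETH_MTU - ip_hlen - tcp_hlen


def segment_sizes(tcp_data_len, chunk_size):
    """Split tcp_data_len bytes of payload into chunks of at most chunk_size."""
    sizes = []
    offset = 0
    while offset < tcp_data_len:
        sizes.append(min(chunk_size, tcp_data_len - offset))
        offset += chunk_size
    return sizes


if __name__ == "__main__":
    # 2880 bytes of payload, cut by the ETH_MTU-based size and by 1440.
    print(segment_sizes(2880, qemu_chunk_size(20, 32)))  # [1448, 1432]
    print(segment_sizes(2880, 1440))                     # [1440, 1440]
```

With 2880 bytes of payload the ETH_MTU-based split gives 1448 + 1432
bytes, which matches the segment lengths in the tcpdump trace elsewhere
in this thread; splitting by 1440 would give two equal segments.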

-- 
RMK's Patch system: http://www.armlinux.org.uk/developer/patches/
FTTC broadband for 0.8mile line: currently at 9.6Mbps down 400kbps up
according to speedtest.net.

^ permalink raw reply

* Re: TCP performance problems - GSO/TSO, MSS, 8139cp related
From: Russell King - ARM Linux @ 2016-11-11 22:33 UTC (permalink / raw)
  To: David Woodhouse; +Cc: netdev, qemu-devel
In-Reply-To: <1478899423.3892.7.camel@infradead.org>

On Fri, Nov 11, 2016 at 09:23:43PM +0000, David Woodhouse wrote:
> On Fri, 2016-11-11 at 21:05 +0000, Russell King - ARM Linux wrote:
> > 
> > 18:59:38.782818 IP (tos 0x0, ttl 52, id 35619, offset 0, flags [DF], proto TCP (6), length 60)
> >     84.xx.xxx.196.61236 > 195.92.253.2.http: Flags [S], cksum 0x88db (correct), seq 158975430, win 29200, options [mss 1452,sackOK,TS val 1377914597 ecr 0,nop,wscale 7], length 0
> 
> ... (MSS 1452)
> 
> > 18:59:38.816371 IP (tos 0x0, ttl 64, id 25879, offset 0, flags [DF], proto TCP (6), length 1500)
> >     195.92.253.2.http > 84.xx.xxx.196.61236: Flags [.], seq 1:1449, ack 154, win 235, options [nop,nop,TS val 3363137952 ecr 1377914627], length 1448: HTTP, length: 1448
> > 18:59:38.816393 IP (tos 0x0, ttl 64, id 25880, offset 0, flags [DF], proto TCP (6), length 1484)
> >     195.92.253.2.http > 84.xx.xxx.196.61236: Flags [.], seq 1449:2881, ack 154, win 235, options [nop,nop,TS val 3363137952 ecr 1377914627], length 1432: HTTP
> 
> Can you instrument cp_start_xmit() in 8139cp.c and get it to print the
> value of 'mss' when this happens?

Well, I'm not going to fiddle in such a way with a public box... that
would be utter madness.  I'll fiddle with mvneta locally on 4.9-rc
instead - and yes, I know that's not the F23 4.4 kernel, so doesn't
really tell us very much.

I _could_ ask bryce to setup another VM on ZenV for me to play with,
but we'll have to wait for bryce to be around for that... I don't
want to break zenv or zeniv. :)

> All we do is take that value from skb_shinfo(skb)->gso_size, shift it a
> bit, and shove it in the descriptor ring. There's not much scope for a
> driver-specific bug.

Unless there's a different interpretation of what the MSS field in the
driver means...

Looking at mvneta, which works correctly,

- On mvneta (192.168.1.59):

21:39:38.535549 IP (tos 0x0, ttl 64, id 27668, offset 0, flags [DF], proto TCP (6), length 7252)
    192.168.1.59.55170 > 192.168.1.18.5001: Flags [.], seq 25:7225, ack 1, win 229, options [nop,nop,TS val 62231754 ecr 1387514367], length 7200

- On laptop (192.168.1.18):

21:39:38.537442 IP (tos 0x0, ttl 64, id 27668, offset 0, flags [DF], proto TCP (6), length 1492)
    192.168.1.59.55170 > 192.168.1.18.commplex-link: Flags [.], seq 25:1465, ack 1, win 229, options [nop,nop,TS val 62231754 ecr 1387514367], length 1440
21:39:38.537453 IP (tos 0x0, ttl 64, id 27669, offset 0, flags [DF], proto TCP (6), length 1492)
    192.168.1.59.55170 > 192.168.1.18.commplex-link: Flags [.], seq 1465:2905, ack 1, win 229, options [nop,nop,TS val 62231754 ecr 1387514367], length 1440
21:39:38.537461 IP (tos 0x0, ttl 64, id 27670, offset 0, flags [DF], proto TCP (6), length 1492)
    192.168.1.59.55170 > 192.168.1.18.commplex-link: Flags [.], seq 2905:4345, ack 1, win 229, options [nop,nop,TS val 62231754 ecr 1387514367], length 1440
21:39:38.537464 IP (tos 0x0, ttl 64, id 9968, offset 0, flags [DF], proto TCP (6), length 52)
    192.168.1.18.commplex-link > 192.168.1.59.55170: Flags [.], cksum 0x83c4 (incorrect -> 0xa338), ack 1465, win 249, options [nop,nop,TS val 1387514368 ecr 62231754], length 0
21:39:38.537465 IP (tos 0x0, ttl 64, id 27671, offset 0, flags [DF], proto TCP (6), length 1492)
    192.168.1.59.55170 > 192.168.1.18.commplex-link: Flags [.], seq 4345:5785, ack 1, win 229, options [nop,nop,TS val 62231754 ecr 1387514367], length 1440
21:39:38.537469 IP (tos 0x0, ttl 64, id 27672, offset 0, flags [DF], proto TCP (6), length 1492)
    192.168.1.59.55170 > 192.168.1.18.commplex-link: Flags [.], seq 5785:7225, ack 1, win 229, options [nop,nop,TS val 62231754 ecr 1387514367], length 1440

which is all correct.  Now, these packets have a larger TCP header
due to the options:

        0x0000:  0022 6815 37dd 0050 4321 0201 0800 4500  ."h.7..PC!....E.
                 ^mac                               ^iphdr
        0x0010:  05d4 6c14 4000 4006 4572 c0a8 013b c0a8  ..l.@.@.Er...;..
        0x0020:  0112 d782 1389 4cb4 f8f4 7454 ef10 8010  ......L...tT....
                      ^tcphdr
        0x0030:  00e5 2a80 0000 0101 080a 03b5 94ca 52b3  ..*...........R.
                                ^tcpopts
        0x0040:  c9ff 0000 0000 0000 0001 0000 1389 0000  ................
                      ^start of data
        0x0050:  0000 0000 0000 ffff fc18 3435 3637 3839  ..........456789
        0x0060:  3031 3233 3435 3637 3839 3031 3233 3435  0123456789012345

So the data starts at offset 66 (0x42) in this packet, followed by 1440 bytes
of data.  Looking at drivers/net/ethernet/marvell/mvneta.c, the only
way this can happen is if skb_shinfo(skb)->gso_size is 1440.  I'll
instrument mvneta to dump this value...

While waiting for the kernel to build, I've been reading the TCP code,
and found this:

/* Compute the current effective MSS, taking SACKs and IP options,
 * and even PMTU discovery events into account.
 */
unsigned int tcp_current_mss(struct sock *sk)
...
        /* The mss_cache is sized based on tp->tcp_header_len, which assumes
         * some common options. If this is an odd packet (because we have SACK
         * blocks etc) then our calculated header_len will be different, and
         * we have to adjust mss_now correspondingly */

mss_now is what becomes gso_size, which means that gso_size will be
adjusted for the TCP options - which makes sense.  So, because there
are 12 bytes of options in the above hex packet dump, negotiated MSS
- 12 gives the data payload size of 1440, and so gso_size will be
1440.
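
As a quick sanity check of the arithmetic above, here is a small sketch
in Python.  The constant and function names are mine, not kernel
identifiers, and it assumes no IP options and only the 12-byte
nop/nop/timestamp option block seen in the hex dump.

```python
# Header sizes for the segment in the hex dump (names are illustrative only).
ETH_HLEN = 14      # Ethernet header
IP_HLEN = 20       # IPv4 header, assuming no IP options
TCP_HLEN = 20      # base TCP header
TCP_OPT_LEN = 12   # nop, nop, 10-byte timestamp option


def header_len(tcp_opt_len=TCP_OPT_LEN):
    """Offset of the first payload byte within the Ethernet frame."""
    return ETH_HLEN + IP_HLEN + TCP_HLEN + tcp_opt_len


def payload_per_segment(mss, tcp_opt_len=TCP_OPT_LEN):
    """TCP payload per segment once the options are charged against the MSS."""
    return mss - tcp_opt_len


print(header_len())               # 66
print(payload_per_segment(1452))  # 1440
```

The same subtraction applied to an unclamped MSS of 1460 would give
1448 bytes of payload per segment.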

And now, going back to that kernel that finished compiling...

[   53.468319] skb len=7266 hdr len=66 gso_size=1440
...
[   53.728752] skb len=64866 hdr len=66 gso_size=1440

so my guesses were right, at least for 4.9-rc4.  Whether that holds
for the fedora f23 4.4 kernel is another matter.  For the record,
removing the TCPMSS clamp gives the expected result:

[  231.244018] skb len=7306 hdr len=66 gso_size=1448

So, gso_size is the size of the TCP payload in each segment, i.e. the
data that follows the TCP header and its options.

The other thing to notice is that the SKB length minus header
length is divisible by the gso size for these full-sized packets.
There is no "one packet larger next packet smaller" here.
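
To make the divisibility point concrete, a short Python check over the
three values logged above.  The helper name is mine; the numbers are
copied from the debug output.

```python
def gso_segments(skb_len, hdr_len, gso_size):
    """Return (full_segments, leftover_bytes) for one GSO skb."""
    return divmod(skb_len - hdr_len, gso_size)


# (skb len, hdr len, gso_size) as printed by the instrumented mvneta
samples = [(7266, 66, 1440), (64866, 66, 1440), (7306, 66, 1448)]
for skb_len, hdr_len, gso_size in samples:
    print(skb_len, gso_segments(skb_len, hdr_len, gso_size))
```

A leftover of zero in every case is what "divisible by the gso size"
means here: 5, 45 and 5 full segments respectively, with no short
trailing segment.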

> It's also *fairly* unlikely that the kernel in the guest has developed
> a bug and isn't setting gso_size sanely. I'm more inclined to suspect
> that qemu isn't properly emulating those bits. But at first glance at
> the code, it looks like *that's* been there for the last decade too...

Whether or not it's been there for a decade is kind of irrelevant -
bugs can be around for a decade and remain undiscovered.

Looking at the 8139C information, it says:

"The new buffer management algorithm provides capabilities of Microsoft
Large-Send offload" and as yet I haven't found anything that describes
what this is or how it works.  How certain are we that the LSO MSS
value is the same as our gso_size, iow the size of the data after
the TCP header and options?

-- 
RMK's Patch system: http://www.armlinux.org.uk/developer/patches/
FTTC broadband for 0.8mile line: currently at 9.6Mbps down 400kbps up
according to speedtest.net.

* [PATCH] net: alx: use new api ethtool_{get|set}_link_ksettings
From: Philippe Reynes @ 2016-11-11 22:30 UTC
  To: jcliburn, chris.snook, davem; +Cc: netdev, linux-kernel, Philippe Reynes

The ethtool API {get|set}_settings is deprecated.
Move this driver to the new API {get|set}_link_ksettings.

Signed-off-by: Philippe Reynes <tremyfr@gmail.com>
---
 drivers/net/ethernet/atheros/alx/ethtool.c |   59 ++++++++++++++++-----------
 1 files changed, 35 insertions(+), 24 deletions(-)

diff --git a/drivers/net/ethernet/atheros/alx/ethtool.c b/drivers/net/ethernet/atheros/alx/ethtool.c
index 08e22df..2f4eabf 100644
--- a/drivers/net/ethernet/atheros/alx/ethtool.c
+++ b/drivers/net/ethernet/atheros/alx/ethtool.c
@@ -125,64 +125,75 @@ static u32 alx_get_supported_speeds(struct alx_hw *hw)
 	return supported;
 }
 
-static int alx_get_settings(struct net_device *netdev, struct ethtool_cmd *ecmd)
+static int alx_get_link_ksettings(struct net_device *netdev,
+				  struct ethtool_link_ksettings *cmd)
 {
 	struct alx_priv *alx = netdev_priv(netdev);
 	struct alx_hw *hw = &alx->hw;
+	u32 supported, advertising;
 
-	ecmd->supported = SUPPORTED_Autoneg |
+	supported = SUPPORTED_Autoneg |
 			  SUPPORTED_TP |
 			  SUPPORTED_Pause |
 			  SUPPORTED_Asym_Pause;
 	if (alx_hw_giga(hw))
-		ecmd->supported |= SUPPORTED_1000baseT_Full;
-	ecmd->supported |= alx_get_supported_speeds(hw);
+		supported |= SUPPORTED_1000baseT_Full;
+	supported |= alx_get_supported_speeds(hw);
 
-	ecmd->advertising = ADVERTISED_TP;
+	advertising = ADVERTISED_TP;
 	if (hw->adv_cfg & ADVERTISED_Autoneg)
-		ecmd->advertising |= hw->adv_cfg;
+		advertising |= hw->adv_cfg;
 
-	ecmd->port = PORT_TP;
-	ecmd->phy_address = 0;
+	cmd->base.port = PORT_TP;
+	cmd->base.phy_address = 0;
 
 	if (hw->adv_cfg & ADVERTISED_Autoneg)
-		ecmd->autoneg = AUTONEG_ENABLE;
+		cmd->base.autoneg = AUTONEG_ENABLE;
 	else
-		ecmd->autoneg = AUTONEG_DISABLE;
-	ecmd->transceiver = XCVR_INTERNAL;
+		cmd->base.autoneg = AUTONEG_DISABLE;
 
 	if (hw->flowctrl & ALX_FC_ANEG && hw->adv_cfg & ADVERTISED_Autoneg) {
 		if (hw->flowctrl & ALX_FC_RX) {
-			ecmd->advertising |= ADVERTISED_Pause;
+			advertising |= ADVERTISED_Pause;
 
 			if (!(hw->flowctrl & ALX_FC_TX))
-				ecmd->advertising |= ADVERTISED_Asym_Pause;
+				advertising |= ADVERTISED_Asym_Pause;
 		} else if (hw->flowctrl & ALX_FC_TX) {
-			ecmd->advertising |= ADVERTISED_Asym_Pause;
+			advertising |= ADVERTISED_Asym_Pause;
 		}
 	}
 
-	ethtool_cmd_speed_set(ecmd, hw->link_speed);
-	ecmd->duplex = hw->duplex;
+	cmd->base.speed = hw->link_speed;
+	cmd->base.duplex = hw->duplex;
+
+	ethtool_convert_legacy_u32_to_link_mode(cmd->link_modes.supported,
+						supported);
+	ethtool_convert_legacy_u32_to_link_mode(cmd->link_modes.advertising,
+						advertising);
 
 	return 0;
 }
 
-static int alx_set_settings(struct net_device *netdev, struct ethtool_cmd *ecmd)
+static int alx_set_link_ksettings(struct net_device *netdev,
+				  const struct ethtool_link_ksettings *cmd)
 {
 	struct alx_priv *alx = netdev_priv(netdev);
 	struct alx_hw *hw = &alx->hw;
 	u32 adv_cfg;
+	u32 advertising;
 
 	ASSERT_RTNL();
 
-	if (ecmd->autoneg == AUTONEG_ENABLE) {
-		if (ecmd->advertising & ~alx_get_supported_speeds(hw))
+	ethtool_convert_link_mode_to_legacy_u32(&advertising,
+						cmd->link_modes.advertising);
+
+	if (cmd->base.autoneg == AUTONEG_ENABLE) {
+		if (advertising & ~alx_get_supported_speeds(hw))
 			return -EINVAL;
-		adv_cfg = ecmd->advertising | ADVERTISED_Autoneg;
+		adv_cfg = advertising | ADVERTISED_Autoneg;
 	} else {
-		adv_cfg = alx_speed_to_ethadv(ethtool_cmd_speed(ecmd),
-					      ecmd->duplex);
+		adv_cfg = alx_speed_to_ethadv(cmd->base.speed,
+					      cmd->base.duplex);
 
 		if (!adv_cfg || adv_cfg == ADVERTISED_1000baseT_Full)
 			return -EINVAL;
@@ -300,8 +311,6 @@ static int alx_get_sset_count(struct net_device *netdev, int sset)
 }
 
 const struct ethtool_ops alx_ethtool_ops = {
-	.get_settings	= alx_get_settings,
-	.set_settings	= alx_set_settings,
 	.get_pauseparam	= alx_get_pauseparam,
 	.set_pauseparam	= alx_set_pauseparam,
 	.get_msglevel	= alx_get_msglevel,
@@ -310,4 +319,6 @@ static int alx_get_sset_count(struct net_device *netdev, int sset)
 	.get_strings	= alx_get_strings,
 	.get_sset_count	= alx_get_sset_count,
 	.get_ethtool_stats	= alx_get_ethtool_stats,
+	.get_link_ksettings	= alx_get_link_ksettings,
+	.set_link_ksettings	= alx_set_link_ksettings,
 };
-- 
1.7.4.4

* Re: Source address fib invalidation on IPv6
From: David Ahern @ 2016-11-11 22:14 UTC
  To: Jason A. Donenfeld, Netdev; +Cc: LKML, WireGuard mailing list
In-Reply-To: <CAHmME9qi7_C7c=wsZg=EwBg3jzFzVmW1eiFGGXgcX8fCcOOZcA@mail.gmail.com>

On 11/11/16 12:29 PM, Jason A. Donenfeld wrote:
> Hi folks,
> 
> If I'm replying to a UDP packet, I generally want to use a source
> address that's the same as the destination address of the packet to
> which I'm replying. For example:
> 
> Peer A sends packet: src = 10.0.0.1,  dst = 10.0.0.3
> Peer B replies with: src = 10.0.0.3, dst = 10.0.0.1
> 
> But let's complicate things. Let's say Peer B has multiple IPs on an
> interface: 10.0.0.2, 10.0.0.3. The default route uses 10.0.0.2. In
> this case what do you think should happen?
> 
> Case 1:
> Peer A sends packet: src = 10.0.0.1,  dst = 10.0.0.3
> Peer B replies with: src = 10.0.0.2, dst = 10.0.0.1
> 
> Case 2:
> Peer A sends packet: src = 10.0.0.1,  dst = 10.0.0.3
> Peer B replies with: src = 10.0.0.3, dst = 10.0.0.1
> 
> Intuition tells me the answer is "Case 2". If you agree, keep reading.
> If you disagree, stop reading here, and instead correct my poor
> intuition.
> 
> So, assuming "Case 2", when Peer B receives the first packet, he notes
> that packet's destination address, so that he can use it as a source
> address next. When replying, Peer B sets the stored source address and
> calls the routing function:
> 
>     struct flowi4 fl = {
>        .saddr = from_daddr_of_previous_packet,
>        .daddr = from_saddr_of_previous_packet,
>     };
>     rt = ip_route_output_flow(sock_net(sock), &fl, sock);
> 
> What if, however, by the time Peer B chooses to reply, his interface
> no longer has that source address? No problem, because
> ip_route_output_flow will return -EINVAL in that case. So, we can do
> this:
> 
>     struct flowi4 fl = {
>        .saddr = from_daddr_of_previous_packet,
>        .daddr = from_saddr_of_previous_packet,
>     };
>     rt = ip_route_output_flow(sock_net(sock), &fl, sock);
>     if (unlikely(IS_ERR(rt))) {
>         fl.saddr = 0;
>         rt = ip_route_output_flow(sock_net(sock), &fl, sock);
>     }
> 
> And then all is good in the neighborhood. This solution works. Done.
> 
> But what about IPv6? That's where we get into trouble:
> 
>     struct flowi6 fl = {
>        .saddr = from_daddr_of_previous_packet,
>        .daddr = from_saddr_of_previous_packet,
>     };
>     ret = ipv6_stub->ipv6_dst_lookup(sock_net(sock), sock, &dst, &fl);
> 
> In this case, IPv6 returns a valid dst, when no interface has the
> source address anymore! So, there's no way to know whether or not the
> source address for replying has gone stale. We don't have a means of
> falling back to inaddr_any for the source address.

What do you mean by 'valid dst'? IPv6 returns net->ipv6.ip6_null_entry on lookup failures, so yes, dst is non-NULL, but that does not mean the lookup succeeded.

For example take a look at ip6_dst_lookup_tail():
        if (!*dst)
                *dst = ip6_route_output_flags(net, sk, fl6, flags);

        err = (*dst)->error;
        if (err)
                goto out_err_release;


perhaps I should add dst->error to the fib tracepoints ...

> 
> Primary question: is this behavior a bug? Or is this some consequence
> of a fundamental IPv6 difference with v4? Or is something else
> happening here?
> 
> Thanks,
> Jason
> 

* [PATCH net-next] bpf: fix range arithmetic for bpf map access
From: Josef Bacik @ 2016-11-11 21:47 UTC
  To: jannh, ast, daniel, davem, netdev

I made some invalid assumptions with BPF_AND and BPF_MOD that could result in
invalid accesses to bpf map entries.  Fix this up by doing a few things:

1) Kill BPF_MOD support.  This doesn't actually get used by the compiler in real
life and just adds extra complexity.

2) Fix the logic for BPF_AND, don't allow AND of negative numbers and set the
minimum value to 0 for positive AND's.

3) Don't do operations on the ranges if they are set to the limits, as they are
by definition undefined, and allowing arithmetic operations on those values
could make them appear valid when they really aren't.

This fixes the testcase provided by Jann as well as a few other theoretical
problems.

Reported-by: Jann Horn <jannh@google.com>
Signed-off-by: Josef Bacik <jbacik@fb.com>
---
 include/linux/bpf_verifier.h |  3 +-
 kernel/bpf/verifier.c        | 70 +++++++++++++++++++++++++++++---------------
 2 files changed, 49 insertions(+), 24 deletions(-)

diff --git a/include/linux/bpf_verifier.h b/include/linux/bpf_verifier.h
index ac5b393..15ceb7f 100644
--- a/include/linux/bpf_verifier.h
+++ b/include/linux/bpf_verifier.h
@@ -22,7 +22,8 @@ struct bpf_reg_state {
 	 * Used to determine if any memory access using this register will
 	 * result in a bad access.
 	 */
-	u64 min_value, max_value;
+	s64 min_value;
+	u64 max_value;
 	u32 id;
 	union {
 		/* valid when type == CONST_IMM | PTR_TO_STACK | UNKNOWN_VALUE */
diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c
index 89f787c..709fe0e 100644
--- a/kernel/bpf/verifier.c
+++ b/kernel/bpf/verifier.c
@@ -234,8 +234,8 @@ static void print_verifier_state(struct bpf_verifier_state *state)
 				reg->map_ptr->value_size,
 				reg->id);
 		if (reg->min_value != BPF_REGISTER_MIN_RANGE)
-			verbose(",min_value=%llu",
-				(unsigned long long)reg->min_value);
+			verbose(",min_value=%lld",
+				(long long)reg->min_value);
 		if (reg->max_value != BPF_REGISTER_MAX_RANGE)
 			verbose(",max_value=%llu",
 				(unsigned long long)reg->max_value);
@@ -778,7 +778,7 @@ static int check_mem_access(struct bpf_verifier_env *env, u32 regno, int off,
 			 * index'es we need to make sure that whatever we use
 			 * will have a set floor within our range.
 			 */
-			if ((s64)reg->min_value < 0) {
+			if (reg->min_value < 0) {
 				verbose("R%d min value is negative, either use unsigned index or do a if (index >=0) check.\n",
 					regno);
 				return -EACCES;
@@ -1490,7 +1490,8 @@ static void check_reg_overflow(struct bpf_reg_state *reg)
 {
 	if (reg->max_value > BPF_REGISTER_MAX_RANGE)
 		reg->max_value = BPF_REGISTER_MAX_RANGE;
-	if ((s64)reg->min_value < BPF_REGISTER_MIN_RANGE)
+	if (reg->min_value < BPF_REGISTER_MIN_RANGE ||
+	    reg->min_value > BPF_REGISTER_MAX_RANGE)
 		reg->min_value = BPF_REGISTER_MIN_RANGE;
 }
 
@@ -1498,7 +1499,8 @@ static void adjust_reg_min_max_vals(struct bpf_verifier_env *env,
 				    struct bpf_insn *insn)
 {
 	struct bpf_reg_state *regs = env->cur_state.regs, *dst_reg;
-	u64 min_val = BPF_REGISTER_MIN_RANGE, max_val = BPF_REGISTER_MAX_RANGE;
+	s64 min_val = BPF_REGISTER_MIN_RANGE;
+	u64 max_val = BPF_REGISTER_MAX_RANGE;
 	u8 opcode = BPF_OP(insn->code);
 
 	dst_reg = &regs[insn->dst_reg];
@@ -1532,22 +1534,43 @@ static void adjust_reg_min_max_vals(struct bpf_verifier_env *env,
 		return;
 	}
 
+	/* If one of our values was at the end of our ranges then we can't just
+	 * do our normal operations to the register, we need to set the values
+	 * to the min/max since they are undefined.
+	 */
+	if (min_val == BPF_REGISTER_MIN_RANGE)
+		dst_reg->min_value = BPF_REGISTER_MIN_RANGE;
+	if (max_val == BPF_REGISTER_MAX_RANGE)
+		dst_reg->max_value = BPF_REGISTER_MAX_RANGE;
+
 	switch (opcode) {
 	case BPF_ADD:
-		dst_reg->min_value += min_val;
-		dst_reg->max_value += max_val;
+		if (dst_reg->min_value != BPF_REGISTER_MIN_RANGE)
+			dst_reg->min_value += min_val;
+		if (dst_reg->max_value != BPF_REGISTER_MAX_RANGE)
+			dst_reg->max_value += max_val;
 		break;
 	case BPF_SUB:
-		dst_reg->min_value -= min_val;
-		dst_reg->max_value -= max_val;
+		if (dst_reg->min_value != BPF_REGISTER_MIN_RANGE)
+			dst_reg->min_value -= min_val;
+		if (dst_reg->max_value != BPF_REGISTER_MAX_RANGE)
+			dst_reg->max_value -= max_val;
 		break;
 	case BPF_MUL:
-		dst_reg->min_value *= min_val;
-		dst_reg->max_value *= max_val;
+		if (dst_reg->min_value != BPF_REGISTER_MIN_RANGE)
+			dst_reg->min_value *= min_val;
+		if (dst_reg->max_value != BPF_REGISTER_MAX_RANGE)
+			dst_reg->max_value *= max_val;
 		break;
 	case BPF_AND:
-		/* & is special since it could end up with 0 bits set. */
-		dst_reg->min_value &= min_val;
+		/* Disallow AND'ing of negative numbers, ain't nobody got time
+		 * for that.  Otherwise the minimum is 0 and the max is the max
+		 * value we could AND against.
+		 */
+		if (min_val < 0)
+			dst_reg->min_value = BPF_REGISTER_MIN_RANGE;
+		else
+			dst_reg->min_value = 0;
 		dst_reg->max_value = max_val;
 		break;
 	case BPF_LSH:
@@ -1557,24 +1580,25 @@ static void adjust_reg_min_max_vals(struct bpf_verifier_env *env,
 		 */
 		if (min_val > ilog2(BPF_REGISTER_MAX_RANGE))
 			dst_reg->min_value = BPF_REGISTER_MIN_RANGE;
-		else
+		else if (dst_reg->min_value != BPF_REGISTER_MIN_RANGE)
 			dst_reg->min_value <<= min_val;
 
 		if (max_val > ilog2(BPF_REGISTER_MAX_RANGE))
 			dst_reg->max_value = BPF_REGISTER_MAX_RANGE;
-		else
+		else if (dst_reg->max_value != BPF_REGISTER_MAX_RANGE)
 			dst_reg->max_value <<= max_val;
 		break;
 	case BPF_RSH:
-		dst_reg->min_value >>= min_val;
-		dst_reg->max_value >>= max_val;
-		break;
-	case BPF_MOD:
-		/* % is special since it is an unsigned modulus, so the floor
-		 * will always be 0.
+		/* RSH by a negative number is undefined, and the BPF_RSH is an
+		 * unsigned shift, so make the appropriate casts.
 		 */
-		dst_reg->min_value = 0;
-		dst_reg->max_value = max_val - 1;
+		if (min_val < 0)
+			dst_reg->min_value = BPF_REGISTER_MIN_RANGE;
+		else if (dst_reg->min_value != BPF_REGISTER_MIN_RANGE)
+			dst_reg->min_value =
+				(u64)(dst_reg->min_value) >> min_val;
+		if (dst_reg->max_value != BPF_REGISTER_MAX_RANGE)
+			dst_reg->max_value >>= max_val;
 		break;
 	default:
 		reset_reg_range_values(regs, insn->dst_reg);
-- 
2.5.5

* Re: TCP performance problems - GSO/TSO, MSS, 8139cp related
From: David Woodhouse @ 2016-11-11 21:23 UTC
  To: Russell King - ARM Linux, netdev; +Cc: qemu-devel
In-Reply-To: <20161111210500.GE1041@n2100.armlinux.org.uk>

On Fri, 2016-11-11 at 21:05 +0000, Russell King - ARM Linux wrote:
> 
> 18:59:38.782818 IP (tos 0x0, ttl 52, id 35619, offset 0, flags [DF], proto TCP (6), length 60)
>     84.xx.xxx.196.61236 > 195.92.253.2.http: Flags [S], cksum 0x88db (correct), seq 158975430, win 29200, options [mss 1452,sackOK,TS val 1377914597 ecr 0,nop,wscale 7], length 0

... (MSS 1452)

> 18:59:38.816371 IP (tos 0x0, ttl 64, id 25879, offset 0, flags [DF], proto TCP (6), length 1500)
>     195.92.253.2.http > 84.xx.xxx.196.61236: Flags [.], seq 1:1449, ack 154, win 235, options [nop,nop,TS val 3363137952 ecr 1377914627], length 1448: HTTP, length: 1448
> 18:59:38.816393 IP (tos 0x0, ttl 64, id 25880, offset 0, flags [DF], proto TCP (6), length 1484)
>     195.92.253.2.http > 84.xx.xxx.196.61236: Flags [.], seq 1449:2881, ack 154, win 235, options [nop,nop,TS val 3363137952 ecr 1377914627], length 1432: HTTP

Can you instrument cp_start_xmit() in 8139cp.c and get it to print the
value of 'mss' when this happens?

All we do is take that value from skb_shinfo(skb)->gso_size, shift it a
bit, and shove it in the descriptor ring. There's not much scope for a
driver-specific bug.

It's also *fairly* unlikely that the kernel in the guest has developed
a bug and isn't setting gso_size sanely. I'm more inclined to suspect
that qemu isn't properly emulating those bits. But at first glance at
the code, it looks like *that's* been there for the last decade too...

-- 
dwmw2

* do bridge members need to be listed in /proc/net/dev_mcast
From: Brian J. Murrell @ 2016-11-11 21:18 UTC
  To: netdev

Hi.

I have a Linux router running 3.18.23 with IPv6 as well as IPv4
interfaces.  It doesn't seem to be hearing IPv6 multicast packets
though.

For example, it won't hear and respond to either router or neighbour
discovery packets unless I put the interface in promiscuous mode with
tcpdump.  I'm a bit stumped at what could cause that.

The interface that is not hearing the IPv6 multicast packets is a
bridge with one ethernet and two wifi interfaces as members:

# brctl show br-lan
bridge name	bridge id		STP enabled	interfaces
br-lan		7fff.26d42cb3eadf	no		eth0.1
							wlan0
							wlan1

The bridge does have the right multicast addresses configured in
/proc/net/dev_mcast:

8    br-lan          1     0     333300000001
8    br-lan          1     0     333300000002
8    br-lan          1     0     01005e000001
8    br-lan          1     0     3333ff000001
8    br-lan          1     0     3333ffb3eadf
8    br-lan          1     0     3333ff000000
8    br-lan          1     0     01005e000005
8    br-lan          1     0     01005e000006
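
For anyone wanting to check entries like these against the groups they
stand for, here is a small Python sketch.  It assumes the standard
Ethernet mappings (33:33 plus the low 32 bits of the group for IPv6,
01:00:5e plus the low 23 bits for IPv4); the function names are mine.

```python
import ipaddress


def ipv6_mcast_mac(addr):
    """Ethernet address for an IPv6 multicast group: 33:33 + low 32 bits."""
    packed = ipaddress.IPv6Address(addr).packed
    return "3333" + packed[-4:].hex()


def ipv4_mcast_mac(addr):
    """Ethernet address for an IPv4 multicast group: 01:00:5e + low 23 bits."""
    low23 = int(ipaddress.IPv4Address(addr)) & 0x7FFFFF
    return "01005e" + format(low23, "06x")


print(ipv6_mcast_mac("ff02::1"))            # all-nodes
print(ipv6_mcast_mac("ff02::2"))            # all-routers
print(ipv6_mcast_mac("ff02::1:ffb3:eadf"))  # solicited-node for ...b3:eadf
print(ipv4_mcast_mac("224.0.0.1"))          # all-hosts
```

Under those assumptions the four calls produce 333300000001,
333300000002, 3333ffb3eadf and 01005e000001, all of which appear in
the br-lan list.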

But what is interesting is that the wlan{0,1} interfaces that are in
the br-lan bridge are also listed in /proc/net/dev_mcast:

15   wlan1           2     0     333300000001
15   wlan1           2     0     333300000002
15   wlan1           2     0     01005e000001
15   wlan1           2     0     3333fff51e4c
15   wlan1           2     0     3333ff000000
16   wlan0           2     0     333300000001
16   wlan0           2     0     333300000002
16   wlan0           2     0     01005e000001
16   wlan0           2     0     3333fff51e4a
16   wlan0           2     0     3333ff000000

But the ethernet member, eth0.1, is not.

Is it sufficient to have a bridge interface in /proc/net/dev_mcast or
do all of its member interfaces need the respective multicast
addresses listed in that file also?  It just seems odd to me that the
wlan interfaces are there but the ethernet interface is not.

If it is sufficient to have just the bridge in /proc/net/dev_mcast what
else could be causing this "deafness" to multicast that is resolved by
putting the interface into promiscuous mode?

Cheers,
b.

* TCP performance problems - GSO/TSO, MSS, 8139cp related
From: Russell King - ARM Linux @ 2016-11-11 21:05 UTC
  To: netdev, dwmw2

Hi,

I seem to have found a severe performance issue somewhere in the
networking code.

This involves ZenIV.linux.org.uk, which is a qemu-kvm guest instance
on ZenV, which is configured to use macvtap for ZenIV to gain its
network access, with ZenIV using the 8139cp driver.

My initial testing was from my laptop (running 4.5.7), through a
router box (also running 4.5.7) and out my FTTC link, across the
Internet to ZenV (4.4.8-300.fc23.x86_64) and then onto the ZenIV
(also 4.4.8-300.fc23.x86_64) guest.  Thinking that it may be an
issue with my crappy FTTC, I switched the routing at my end over
to the ADSL line, which showed the same issues.

Eventually, what fixed it was disabling both TSO and GSO in the
ZenIV guest.

Now, both my FTTC and ADSL links have a reduced MTU, and I'm having
to use TCPMSS on the router box to clamp the MSS - which gets
clamped to 1452, 8 bytes lower than the usual 1460 for standard
ethernet.

With TSO on, I see the guest sending TCP packets with a 2880 byte
payload:

17:36:07.006009 IP (tos 0x0, ttl 52, id 17517, offset 0, flags [DF], proto TCP (6), length 60)
    84.xx.xxx.196.60846 > 195.92.253.2.http: Flags [S], cksum 0x2c25 (correct), seq 356291023, win 29200, options [mss 1452,sackOK,TS val 1372902818 ecr 0,nop,wscale 7], length 0
17:36:07.006122 IP (tos 0x0, ttl 64, id 0, offset 0, flags [DF], proto TCP (6), length 60)
    195.92.253.2.http > 84.xx.xxx.196.60846: Flags [S.], cksum 0xed7f (incorrect -> 0x674a), seq 2784716623, ack 356291024, win 28960, options [mss 1460,sackOK,TS val 3358126141 ecr 1372902818,nop,wscale 7], length 0
17:36:07.035531 IP (tos 0x0, ttl 52, id 17518, offset 0, flags [DF], proto TCP (6), length 52)
    84.xx.xxx.196.60846 > 195.92.253.2.http: Flags [.], cksum 0x0634 (correct), ack 1, win 229, options [nop,nop,TS val 1372902848 ecr 3358126141], length 0
17:36:07.038233 IP (tos 0x0, ttl 52, id 17519, offset 0, flags [DF], proto TCP (6), length 205)
    84.xx.xxx.196.60846 > 195.92.253.2.http: Flags [P.], cksum 0x3a1e (correct), seq 1:154, ack 1, win 229, options [nop,nop,TS val 1372902848 ecr 3358126141], length 153: HTTP, length: 153
17:36:07.038356 IP (tos 0x0, ttl 64, id 38669, offset 0, flags [DF], proto TCP (6), length 52)
    195.92.253.2.http > 84.xx.xxx.196.60846: Flags [.], cksum 0xed77 (incorrect -> 0x0575), ack 154, win 235, options [nop,nop,TS val 3358126173 ecr 1372902848], length 0
17:36:07.039255 IP (tos 0x0, ttl 64, id 38670, offset 0, flags [DF], proto TCP (6), length 2932)
    195.92.253.2.http > 84.xx.xxx.196.60846: Flags [.], seq 1:2881, ack 154, win 235, options [nop,nop,TS val 3358126174 ecr 1372902848], length 2880: HTTP, length: 2880
17:36:07.039442 IP (tos 0x0, ttl 64, id 38672, offset 0, flags [DF], proto TCP (6), length 2932)
    195.92.253.2.http > 84.xx.xxx.196.60846: Flags [.], seq 2881:5761, ack 154, win 235, options [nop,nop,TS val 3358126174 ecr 1372902848], length 2880: HTTP
17:36:07.039579 IP (tos 0x0, ttl 64, id 38674, offset 0, flags [DF], proto TCP (6), length 2932)
    195.92.253.2.http > 84.xx.xxx.196.60846: Flags [.], seq 5761:8641, ack 154, win 235, options [nop,nop,TS val 3358126174 ecr 1372902848], length 2880: HTTP
...etc...

On the macvtap side, however, which is post-segmentation by the
virtualised 8139cp hardware (this taken at a later time):

18:59:38.782818 IP (tos 0x0, ttl 52, id 35619, offset 0, flags [DF], proto TCP (6), length 60)
    84.xx.xxx.196.61236 > 195.92.253.2.http: Flags [S], cksum 0x88db (correct), seq 158975430, win 29200, options [mss 1452,sackOK,TS val 1377914597 ecr 0,nop,wscale 7], length 0
18:59:38.783270 IP (tos 0x0, ttl 64, id 0, offset 0, flags [DF], proto TCP (6), length 60)
    195.92.253.2.http > 84.xx.xxx.196.61236: Flags [S.], cksum 0x575d (correct), seq 4091022471, ack 158975431, win 28960, options [mss 1460,sackOK,TS val 3363137919 ecr 1377914597,nop,wscale 7], length 0
18:59:38.812089 IP (tos 0x0, ttl 52, id 35620, offset 0, flags [DF], proto TCP (6), length 52)
    84.xx.xxx.196.61236 > 195.92.253.2.http: Flags [.], cksum 0xf646 (correct), ack 1, win 229, options [nop,nop,TS val 1377914627 ecr 3363137919], length 0
18:59:38.814623 IP (tos 0x0, ttl 52, id 35621, offset 0, flags [DF], proto TCP (6), length 205)
    84.xx.xxx.196.61236 > 195.92.253.2.http: Flags [P.], cksum 0x2a31 (correct), seq 1:154, ack 1, win 229, options [nop,nop,TS val 1377914627 ecr 3363137919], length 153: HTTP, length: 153
18:59:38.815025 IP (tos 0x0, ttl 64, id 25878, offset 0, flags [DF], proto TCP (6), length 52)
    195.92.253.2.http > 84.xx.xxx.196.61236: Flags [.], cksum 0xf588 (correct), ack 154, win 235, options [nop,nop,TS val 3363137950 ecr 1377914627], length 0
18:59:38.816371 IP (tos 0x0, ttl 64, id 25879, offset 0, flags [DF], proto TCP (6), length 1500)
    195.92.253.2.http > 84.xx.xxx.196.61236: Flags [.], seq 1:1449, ack 154, win 235, options [nop,nop,TS val 3363137952 ecr 1377914627], length 1448: HTTP, length: 1448
18:59:38.816393 IP (tos 0x0, ttl 64, id 25880, offset 0, flags [DF], proto TCP (6), length 1484)
    195.92.253.2.http > 84.xx.xxx.196.61236: Flags [.], seq 1449:2881, ack 154, win 235, options [nop,nop,TS val 3363137952 ecr 1377914627], length 1432: HTTP
18:59:38.816471 IP (tos 0x0, ttl 64, id 25881, offset 0, flags [DF], proto TCP (6), length 1500)
    195.92.253.2.http > 84.xx.xxx.196.61236: Flags [.], seq 2881:4329, ack 154, win 235, options [nop,nop,TS val 3363137952 ecr 1377914627], length 1448: HTTP
18:59:38.816501 IP (tos 0x0, ttl 64, id 25882, offset 0, flags [DF], proto TCP (6), length 1484)
    195.92.253.2.http > 84.xx.xxx.196.61236: Flags [.], seq 4329:5761, ack 154, win 235, options [nop,nop,TS val 3363137952 ecr 1377914627], length 1432: HTTP
18:59:38.816660 IP (tos 0x0, ttl 64, id 25883, offset 0, flags [DF], proto TCP (6), length 1500)
    195.92.253.2.http > 84.xx.xxx.196.61236: Flags [.], seq 5761:7209, ack 154, win 235, options [nop,nop,TS val 3363137952 ecr 1377914627], length 1448: HTTP

Now, every packet which has 1448 bytes of payload is 1514 bytes in length,
which gets dropped on its way to me at the ISP end of the link, because
the PPPoE link seems unable to handle this sized packet (annoyingly.)
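
The sizes involved can be checked with a few lines of Python.  This is
only a sketch: the names are mine, and the 8 bytes of PPPoE overhead is
an assumption that happens to match the 1460 to 1452 clamp mentioned
above.

```python
ETH_HLEN = 14
IP_TCP_HLEN = 20 + 20 + 12   # IPv4 + TCP + timestamp options
ETH_MTU = 1500
PPPOE_OVERHEAD = 8           # assumption: 6-byte PPPoE header + 2-byte PPP ID


def ip_len(payload):
    """IP total length for a segment carrying `payload` bytes of TCP data."""
    return payload + IP_TCP_HLEN


def frame_len(payload):
    """Ethernet frame length (without FCS) for the same segment."""
    return ip_len(payload) + ETH_HLEN


def fits_pppoe(payload):
    """True if the IP packet fits in the assumed PPPoE MTU."""
    return ip_len(payload) <= ETH_MTU - PPPOE_OVERHEAD


for payload in (1448, 1432, 1440):
    print(payload, ip_len(payload), frame_len(payload), fits_pppoe(payload))
```

With those assumptions, only the 1448-byte segments (IP length 1500,
frame length 1514) exceed what the link will carry; the 1432-byte and
1440-byte ones fit.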

The result is that the oversized "200 OK" packet gets lost and has to be
re-transmitted - here it is on the guest side:

17:36:07.176351 IP (tos 0x0, ttl 64, id 38681, offset 0, flags [DF], proto TCP (6), length 1492)
    195.92.253.2.http > 84.xx.xxx.196.60846: Flags [.], seq 1:1441, ack 154, win 235, options [nop,nop,TS val 3358126311 ecr 1372902989], length 1440: HTTP, length: 1440

notice that it is 1440 bytes in size now... and of course it comes
through on the macvtap side correctly:

18:59:38.950513 IP (tos 0x0, ttl 64, id 25890, offset 0, flags [DF], proto TCP (6), length 1492)
    195.92.253.2.http > 84.xx.xxx.196.61236: Flags [.], seq 1:1441, ack 154, win 235, options [nop,nop,TS val 3363138086 ecr 1377914764], length 1440: HTTP, length: 1440

This kind of thing goes on throughout the transfer - whenever the guest
sends a GSO/TSO packet, it is incorrectly segmented, resulting in the
over-sized segments being dropped, and causing lots of retransmissions.
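
The 1448/1432 pattern is what falls out of cutting each 2880-byte
payload at 1448 instead of 1440.  A sketch in Python (the function is
purely illustrative, and whether the emulation really segments this
way is a guess on my part):

```python
def split_payload(total, seg_size):
    """Cut a TSO payload into segments of at most seg_size bytes."""
    sizes = []
    while total > 0:
        sizes.append(min(seg_size, total))
        total -= sizes[-1]
    return sizes


print(split_payload(2880, 1440))  # what the guest intended
print(split_payload(2880, 1448))  # what shows up on the macvtap side
```

The first call gives two 1440-byte segments; the second gives 1448
followed by 1432, matching the capture above.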

The result is that with TSO/GSO on, I get around 70-80KB/s, but with
TSO/GSO off, I get 723KB/s - around a factor of 10 faster.

Doing some local testing between the 4.5.7 laptop and a Marvell board
running 4.9-rc, and using TCPMSS to clamp the MSS to 1452 between these
(on both the SYN and SYNACK packets) shows that the laptop's E1000e
driver and the 4.5.7 net stack correctly segment - I end up with TCP
packets with 1440 byte payloads being spat out of the E1000e NIC.

So, my guess is there's something wrong with either 8139cp (and dwmw2's
commit says to scream at him if it breaks!) or something wrong in the
qemu 8139cp hardware emulation.

I've suggested to bryce (who set up the VM and knows it better than I)
to try switching ZenIV to E1000e to see whether that makes any
difference - that would point towards either the 8139cp driver or the
qemu 8139 hardware emulation being broken, rather than something in
the network stack.

However, it may be worth someone testing TSO/GSO with real 8139cp
hardware - the MSS can be clamped with:

# iptables -t mangle -I INPUT -p tcp --tcp-flags SYN,RST SYN \
	-j TCPMSS --set-mss 1452
# iptables -t mangle -I OUTPUT -p tcp --tcp-flags SYN,RST SYN \
	-j TCPMSS --set-mss 1452

and testing with something like wget/iperf.  You'll need to ensure
that GRO is disabled on the box receiving the TCP packets from the
8139cp machine to see the raw packets in tcpdump, otherwise you'll
get much larger packets reassembled by the GRO code.  You should
see the TCP packets with a data size of 1440 bytes, not alternating
between 1448 and 1432 bytes.

-- 
RMK's Patch system: http://www.armlinux.org.uk/developer/patches/
FTTC broadband for 0.8mile line: currently at 9.6Mbps down 400kbps up
according to speedtest.net.

* Re: [PATCH v2 1/6] qed: Add support for hardware offloaded iSCSI.
From: Arun Easi @ 2016-11-11 18:12 UTC
  To: Hannes Reinecke
  Cc: Manish Rangankar, martin.petersen, lduncan, cleech, linux-scsi,
	netdev, QLogic-Storage-Upstream, Yuval.Mintz
In-Reply-To: <1c5639c1-4160-72fe-ecc7-ca6257ec333f@suse.de>

On Fri, 11 Nov 2016, 7:57am, Hannes Reinecke wrote:

> On 11/08/2016 07:56 AM, Manish Rangankar wrote:
> > From: Yuval Mintz <Yuval.Mintz@cavium.com>
> > 
> > This adds the backbone required for the various HW initalizations
> > which are necessary for the iSCSI driver (qedi) for QLogic FastLinQ
> > 4xxxx line of adapters - FW notification, resource initializations, etc.
> > 
> > Signed-off-by: Arun Easi <arun.easi@cavium.com>
> > Signed-off-by: Yuval Mintz <yuval.mintz@cavium.com>
> > ---
> >  drivers/net/ethernet/qlogic/Kconfig            |   15 +
> >  drivers/net/ethernet/qlogic/qed/Makefile       |    1 +
> >  drivers/net/ethernet/qlogic/qed/qed.h          |    7 +-
> >  drivers/net/ethernet/qlogic/qed/qed_dev.c      |   12 +
> >  drivers/net/ethernet/qlogic/qed/qed_int.h      |    1 -
> >  drivers/net/ethernet/qlogic/qed/qed_iscsi.c    | 1276 ++++++++++++++++++++++++
> >  drivers/net/ethernet/qlogic/qed/qed_iscsi.h    |   52 +
> >  drivers/net/ethernet/qlogic/qed/qed_l2.c       |    1 -
> >  drivers/net/ethernet/qlogic/qed/qed_ll2.c      |    4 +-
> >  drivers/net/ethernet/qlogic/qed/qed_reg_addr.h |    2 +
> >  drivers/net/ethernet/qlogic/qed/qed_spq.c      |   15 +
> >  include/linux/qed/qed_if.h                     |    2 +
> >  include/linux/qed/qed_iscsi_if.h               |  229 +++++
> >  13 files changed, 1613 insertions(+), 4 deletions(-)
> >  create mode 100644 drivers/net/ethernet/qlogic/qed/qed_iscsi.c
> >  create mode 100644 drivers/net/ethernet/qlogic/qed/qed_iscsi.h
> >  create mode 100644 include/linux/qed/qed_iscsi_if.h
> > 
> > diff --git a/drivers/net/ethernet/qlogic/Kconfig b/drivers/net/ethernet/qlogic/Kconfig
> > index 32f2a45..2832570 100644
> > --- a/drivers/net/ethernet/qlogic/Kconfig
> > +++ b/drivers/net/ethernet/qlogic/Kconfig
> > @@ -110,4 +110,19 @@ config QEDE
> >  config QED_RDMA
> >  	bool
> > 
> > +config QED_ISCSI
> > +	bool
> > +
> > +config QEDI
> > +	tristate "QLogic QED 25/40/100Gb iSCSI driver"
> > +	depends on QED
> > +	select QED_LL2
> > +	select QED_ISCSI
> > +	default n
> > +	---help---
> > +	  This provides a temporary node that allows the compilation
> > +	  and logical testing of the hardware offload iSCSI support
> > +	  for QLogic QED. This would be replaced by the 'real' option
> > +	  once the QEDI driver is added [+relocated].
> > +
> >  endif # NET_VENDOR_QLOGIC
> > diff --git a/drivers/net/ethernet/qlogic/qed/Makefile b/drivers/net/ethernet/qlogic/qed/Makefile
> > index 967acf3..597e15c 100644
> > --- a/drivers/net/ethernet/qlogic/qed/Makefile
> > +++ b/drivers/net/ethernet/qlogic/qed/Makefile
> > @@ -6,3 +6,4 @@ qed-y := qed_cxt.o qed_dev.o qed_hw.o qed_init_fw_funcs.o qed_init_ops.o \
> >  qed-$(CONFIG_QED_SRIOV) += qed_sriov.o qed_vf.o
> >  qed-$(CONFIG_QED_LL2) += qed_ll2.o
> >  qed-$(CONFIG_QED_RDMA) += qed_roce.o
> > +qed-$(CONFIG_QED_ISCSI) += qed_iscsi.o
> > diff --git a/drivers/net/ethernet/qlogic/qed/qed.h b/drivers/net/ethernet/qlogic/qed/qed.h
> > index 50b8a01..15286c1 100644
> > --- a/drivers/net/ethernet/qlogic/qed/qed.h
> > +++ b/drivers/net/ethernet/qlogic/qed/qed.h
> > @@ -35,6 +35,7 @@
> > 
> >  #define QED_WFQ_UNIT	100
> > 
> > +#define ISCSI_BDQ_ID(_port_id) (_port_id)
> >  #define QED_WID_SIZE            (1024)
> >  #define QED_PF_DEMS_SIZE        (4)
> > 
> > @@ -392,6 +393,7 @@ struct qed_hwfn {
> >  	bool				using_ll2;
> >  	struct qed_ll2_info		*p_ll2_info;
> >  	struct qed_rdma_info		*p_rdma_info;
> > +	struct qed_iscsi_info		*p_iscsi_info;
> >  	struct qed_pf_params		pf_params;
> > 
> >  	bool b_rdma_enabled_in_prs;
> > @@ -593,6 +595,8 @@ struct qed_dev {
> >  	/* Linux specific here */
> >  	struct  qede_dev		*edev;
> >  	struct  pci_dev			*pdev;
> > +	u32 flags;
> > +#define QED_FLAG_STORAGE_STARTED	(BIT(0))
> >  	int				msg_enable;
> > 
> >  	struct pci_params		pci_params;
> > @@ -606,6 +610,7 @@ struct qed_dev {
> >  	union {
> >  		struct qed_common_cb_ops	*common;
> >  		struct qed_eth_cb_ops		*eth;
> > +		struct qed_iscsi_cb_ops		*iscsi;
> >  	} protocol_ops;
> >  	void				*ops_cookie;
> > 
> > @@ -615,7 +620,7 @@ struct qed_dev {
> >  	struct qed_cb_ll2_info		*ll2;
> >  	u8				ll2_mac_address[ETH_ALEN];
> >  #endif
> > -
> > +	DECLARE_HASHTABLE(connections, 10);
> >  	const struct firmware		*firmware;
> > 
> >  	u32 rdma_max_sge;
> 10 connections? Only?
> Hmm.

10 is the hash bits => 2^10 hash buckets, allowing for a large number of 
connections. qedi driver currently uses 1k connections per port.
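
To illustrate the point about bits versus entries: in the kernel,
DECLARE_HASHTABLE(name, bits) declares an array of 1 << bits list
heads, and any number of entries can be chained off each head.  The
following is a minimal userspace sketch of that idea only.  The
struct, the function names and the toy hash are all made up for
illustration; this is not the qed code.

```c
#include <assert.h>
#include <stddef.h>

#define CONN_HASH_BITS 10                      /* the "10" from the patch */
#define CONN_HASH_SIZE (1u << CONN_HASH_BITS)  /* 2^10 = 1024 buckets */

/* Hypothetical connection entry, not the qed structure. */
struct conn {
	unsigned int id;
	struct conn *next;	/* entries sharing a bucket are chained */
};

static struct conn *buckets[CONN_HASH_SIZE];

/* Toy hash: keep the low CONN_HASH_BITS bits of the id. */
static unsigned int conn_bucket(unsigned int id)
{
	return id & (CONN_HASH_SIZE - 1);
}

/* Insert at the head of the bucket's chain. */
static void conn_add(struct conn *c)
{
	unsigned int b;

	assert(c != NULL);
	b = conn_bucket(c->id);
	c->next = buckets[b];
	buckets[b] = c;
}

/* Walk the bucket's chain looking for a matching id. */
static struct conn *conn_find(unsigned int id)
{
	struct conn *c;

	for (c = buckets[conn_bucket(id)]; c; c = c->next)
		if (c->id == id)
			return c;
	return NULL;
}
```

Two ids that differ by 1024 land in the same bucket and are both still
found, so the bucket count bounds lookup cost rather than the number
of connections.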

Thanks for the reviews, Hannes.

Regards,
-Arun

> 
> Other than that:
> 
> Reviewed-by: Hannes Reinecke <hare@suse.com>
> 
> Cheers,
> 
> Hannes
> 

^ permalink raw reply

* [RFC PATCH net-next] net: ethtool: add support for forward error correction modes
From: Casey Leedom @ 2016-11-11 19:51 UTC (permalink / raw)
  To: netdev@vger.kernel.org

N.B.  Sorry I'm not able to respond to the original message since I
wasn't subscribed to netdev when it was sent a couple of weeks ago.

This feature is something that Chelsio's cxgb4 driver needs.
As we've tested our adapters against a number of switches,
we've discovered a few which use varying defaults for FEC.
And when Auto-Negotiation isn't used (or even possible with
Optical Links), we need to be able to control turning FEC on/off.

For our part, we default FEC Off for Optical Transceivers.
For Copper, we read the Cable's EEPROM to determine
how to default FEC.  For some switches this works, but
for at least one switch, which enables FEC for Optical
Transceivers, it doesn't.  For that switch we had to
hard-wire FEC on.  Obviously that's not a good solution and we
need an administrative interface so the system
administrator can configure our adapter to use the
appropriate FEC setting to match that of the switch.

So this is basically a long-winded ACK of Vidya's patch
and we would immediately implement this new ethtool API
as soon as it's available.

Casey

^ permalink raw reply

* [PATCH] icmp: Restore resistance to abnormal messages
From: Vicente Jimenez Aguilar @ 2016-11-11 20:20 UTC (permalink / raw)
  To: David S . Miller, Alexey Kuznetsov, James Morris,
	Hideaki YOSHIFUJI, Patrick McHardy
  Cc: netdev, linux-kernel, Vicente Jimenez Aguilar

Restore network resistance to abnormal ICMP fragmentation-needed messages
whose next-hop MTU is equal to (or exceeds) the dropped packet size.

Fixes: 46517008e116 ("ipv4: Kill ip_rt_frag_needed().")
Signed-off-by: Vicente Jimenez Aguilar <googuy@gmail.com>
---
 net/ipv4/icmp.c | 7 +++++++
 1 file changed, 7 insertions(+)

diff --git a/net/ipv4/icmp.c b/net/ipv4/icmp.c
index 38abe70..4c90d76 100644
--- a/net/ipv4/icmp.c
+++ b/net/ipv4/icmp.c
@@ -773,6 +773,7 @@ static bool icmp_tag_validation(int proto)
 static bool icmp_unreach(struct sk_buff *skb)
 {
 	const struct iphdr *iph;
+	unsigned short old_mtu;
 	struct icmphdr *icmph;
 	struct net *net;
 	u32 info = 0;
@@ -819,6 +820,12 @@ static bool icmp_unreach(struct sk_buff *skb)
 				/* fall through */
 			case 0:
 				info = ntohs(icmph->un.frag.mtu);
+				/* Handle weird case where next hop MTU is
+				 * equal to or exceeding dropped packet size
+				 */
+				old_mtu = ntohs(iph->tot_len);
+				if (info >= old_mtu)
+					info = old_mtu - 2;
 			}
 			break;
 		case ICMP_SR_FAILED:
-- 
2.9.3
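
For readers who want to try the clamp outside the kernel, the logic of
the hunk above can be modelled as a small userspace function.  This is
only a sketch: the function name clamp_frag_needed_mtu() is made up,
both arguments are assumed to be in host byte order already, and the
assert() is a guard added for the sketch that the patch itself does
not have.

```c
#include <assert.h>
#include <stdint.h>

/* Userspace model of the check added to icmp_unreach().
 * reported_mtu: next-hop MTU taken from the ICMP message.
 * tot_len:      total length of the dropped packet, taken from the
 *               IP header embedded in the ICMP payload. */
static uint32_t clamp_frag_needed_mtu(uint32_t reported_mtu, uint16_t tot_len)
{
	/* Sketch-only guard (not in the patch): keep the subtraction
	 * below from going negative and wrapping. */
	assert(tot_len >= 2);

	/* A "fragmentation needed" message whose MTU is not smaller
	 * than the packet that was dropped makes no sense, so fall
	 * back to slightly less than the dropped packet's size. */
	if (reported_mtu >= tot_len)
		return (uint32_t)tot_len - 2;

	return reported_mtu;
}
```

A sane report (1400 for a 1500-byte packet) passes through unchanged,
while a report of 1500 or more for the same packet is reduced to 1498.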

^ permalink raw reply related

