Netdev List
* Re: [PATCH net-next 4/4] net/sched: act_mirred: Implement ingress actions
From: Eric Dumazet @ 2016-09-22 18:42 UTC (permalink / raw)
  To: Shmulik Ladkani
  Cc: David S. Miller, Jamal Hadi Salim, WANG Cong, Eric Dumazet,
	netdev
In-Reply-To: <20160922212745.789734c5@halley>

On Thu, 2016-09-22 at 21:27 +0300, Shmulik Ladkani wrote:
> On Thu, 22 Sep 2016 07:54:13 -0700 Eric Dumazet <eric.dumazet@gmail.com> wrote:
> > Hmm... we probably need to apply the full rcu protection before this
> > patch.
> > 
> > https://patchwork.ozlabs.org/patch/667680/
> 
> Are you referring to order of application into net-next?
> 
> This patch seems to present no new tcf_mirred_params members nor
> need-to-be-protected code regions (please, correct me if wrong).
> So it does not _depend_ on the 'full rcu fixes', does it?

No, simply a reminder that we run lockless there, so you might need to
read some control variables once, and in a consistent way.

(Or a concurrent writer could change params in the middle of the
function)
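
The "read once, in a consistent way" advice can be sketched in userspace C11
as below. This is only an illustration under stated assumptions: the
demo_params struct and the demo_* helpers are hypothetical and are not the
act_mirred code, and the sketch leaves out freeing old parameters, which in
the kernel would need an RCU grace period.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>

/* Hypothetical stand-in for the parameters a lockless reader consumes. */
struct demo_params {
	int action;
	int ifindex;
};

/* Writers publish a whole new object instead of editing fields in place. */
static _Atomic(struct demo_params *) demo_active;

static void demo_publish(struct demo_params *p)
{
	assert(p != NULL);
	atomic_store_explicit(&demo_active, p, memory_order_release);
}

/* Reader: load the pointer exactly once, then use only that snapshot, so
 * both fields come from the same published object.
 */
static int demo_read(int *action, int *ifindex)
{
	struct demo_params *p =
		atomic_load_explicit(&demo_active, memory_order_acquire);

	if (!p)
		return -1;
	*action = p->action;
	*ifindex = p->ifindex;
	return 0;
}
```

Re-reading demo_active for each field would allow a concurrent
demo_publish() to hand the reader an action from one object and an ifindex
from another.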


* [PATCH v3] tcp: fix wrong checksum calculation on MTU probing
From: Douglas Caetano dos Santos @ 2016-09-22 18:52 UTC (permalink / raw)
  To: Sergei Shtylyov, David Miller; +Cc: kuznet, jmorris, yoshfuji, kaber, netdev
In-Reply-To: <5ffc1020-a36f-75b4-ff25-4f58c243f803@cogentembedded.com>

With TCP MTU probing enabled and TX checksum offload disabled,
tcp_mtu_probe() calculated the wrong checksum when a fragment being copied
into the probe's SKB had an odd length. This was caused by the direct use
of skb_copy_and_csum_bits() to calculate the checksum, as it pads the
fragment being copied, if needed. When this fragment was not the last, a
subsequent call used the previous checksum without considering this
padding.

The effect was a connection stalled in one direction, as even retransmissions
wouldn't solve the problem, because the checksum was never recalculated for
the full SKB length.
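
The padding problem can be reproduced outside the kernel. The sketch below is
a userspace model, assuming a plain big-endian one's-complement sum over small
buffers; demo_csum() and demo_block_add() are hypothetical helpers that only
mimic the roles of skb_copy_and_csum_bits() and csum_block_add(), not their
implementations.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Fold a 32-bit accumulator into a 16-bit one's-complement sum. */
static uint32_t demo_fold(uint32_t sum)
{
	while (sum >> 16)
		sum = (sum & 0xffff) + (sum >> 16);
	return sum;
}

/* Add the big-endian 16-bit words of buf to sum. An odd trailing byte is
 * padded with a zero byte, which is the padding the commit message refers to.
 */
static uint32_t demo_csum(const uint8_t *buf, size_t len, uint32_t sum)
{
	size_t i;

	assert(buf != NULL);
	for (i = 0; i + 1 < len; i += 2)
		sum += ((uint32_t)buf[i] << 8) | buf[i + 1];
	if (len & 1)
		sum += (uint32_t)buf[len - 1] << 8;
	return demo_fold(sum);
}

/* Merge the sum of a block that starts at byte `offset` of the whole buffer.
 * A block at an odd offset has its bytes in the opposite halves of each
 * 16-bit word, so its sum is byte-swapped before being added.
 */
static uint32_t demo_block_add(uint32_t total, uint32_t part, size_t offset)
{
	part = demo_fold(part);
	if (offset & 1)
		part = ((part & 0xff) << 8) | (part >> 8);
	return demo_fold(total + part);
}
```

For the 7 bytes 01 02 03 04 05 06 07 split as 3 + 4, chaining demo_csum()
across the two fragments gives 0x0e0e, while the whole buffer sums to 0x100c;
summing the second fragment from zero and merging it with
demo_block_add(first, second, 3) gives 0x100c again.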

Signed-off-by: Douglas Caetano dos Santos <douglascs@taghos.com.br>
---
 net/ipv4/tcp_output.c | 12 +++++++-----
 1 file changed, 7 insertions(+), 5 deletions(-)

diff --git a/net/ipv4/tcp_output.c b/net/ipv4/tcp_output.c
index f53d0cc..2d32952 100644
--- a/net/ipv4/tcp_output.c
+++ b/net/ipv4/tcp_output.c
@@ -1966,12 +1966,14 @@ static int tcp_mtu_probe(struct sock *sk)
 	len = 0;
 	tcp_for_write_queue_from_safe(skb, next, sk) {
 		copy = min_t(int, skb->len, probe_size - len);
-		if (nskb->ip_summed)
+		if (nskb->ip_summed) {
 			skb_copy_bits(skb, 0, skb_put(nskb, copy), copy);
-		else
-			nskb->csum = skb_copy_and_csum_bits(skb, 0,
-							    skb_put(nskb, copy),
-							    copy, nskb->csum);
+		} else {
+			__wsum csum = skb_copy_and_csum_bits(skb, 0,
+							     skb_put(nskb, copy),
+							     copy, 0);
+			nskb->csum = csum_block_add(nskb->csum, csum, len);
+		}

 		if (skb->len <= copy) {
 			/* We've eaten all the data from this skb.
-- 
2.5.0


* Re: [PATCH net] net: rtnl_register in net_ns_init need rtnl_lock
From: Hannes Frederic Sowa @ 2016-09-22 19:02 UTC (permalink / raw)
  To: Cong Wang; +Cc: Eric Dumazet, Linux Kernel Network Developers
In-Reply-To: <CAM_iQpUsYLWxWvcuVzHt=Q9hRjZ1vKJHWjS6ZKWc0rjdtdOo8w@mail.gmail.com>

On Thu, Sep 22, 2016, at 19:20, Cong Wang wrote:
> > I don't think it is a big issue but wanted the writes to the
> > rtnl_msg_handlers array to be strictly serialized. I was working on
> > adding this to other places, too. Maybe better for net-next even?
> 
> But they are called during boot, why is it possible to have a parallel
> reader/writer at that time?

It also happens during module load time, which isn't synchronized with
rtnl_lock.

Bye,
Hannes


* [RFC] net: store port/representative id in metadata_dst
From: Jakub Kicinski @ 2016-09-22 19:26 UTC (permalink / raw)
  To: netdev, Thomas Graf
  Cc: Roopa Prabhu, Jiri Pirko, ogerlitz, John Fastabend,
	sridhar.samudrala, ast, daniel, simon.horman, Paolo Abeni,
	Pravin B Shelar, Jiri Benc, hannes, kubakici, Jakub Kicinski

Switches and modern SR-IOV enabled NICs may multiplex traffic from
representatives and control messages over a single set of hardware queues.
Control messages and muxed traffic may need ordered delivery.

Those requirements make it hard to comfortably use TC infrastructure
today unless we have a way of attaching metadata to skbs at the upper
device.  Because a single set of queues is used for many netdevs, reliably
stopping the TC/sched queues of all of them is impossible, and the lower
device has to fall back to returning NETDEV_TX_BUSY and usually
has to take extra locks on the fastpath.

This patch attempts to enable port/representative devs to attach metadata
to skbs which carries the port id.  This way representatives can be queueless
and all queuing can be performed at the lower netdev in the usual way.

Traffic arriving on the port/representative interfaces will have
metadata attached and will subsequently be queued to the lower device
for transmission.  The lower device should recognize the metadata and
translate it to a HW specific format, which is most likely either a special
header inserted before the network headers or descriptor/metadata fields.

Metadata is associated with the lower device by storing the netdev pointer
along with the port id, so that if TC decides to redirect or mirror, the new
netdev will not try to interpret it.
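
The ownership check described above can be modelled in plain C. This is a
userspace sketch only: demo_md, demo_netdev and demo_port_id() are
hypothetical stand-ins for metadata_dst, net_device and the lookup a lower
driver might do, and they are not part of the patch.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

enum demo_md_type { DEMO_MD_IP_TUNNEL, DEMO_MD_HW_PORT_MUX };

struct demo_netdev {
	const char *name;
};

/* Tagged union: `type` says which member of `u` is valid. */
struct demo_md {
	enum demo_md_type type;
	union {
		struct {
			uint8_t options_len;
		} tun_info;
		struct {
			const struct demo_netdev *lower_dev;
			uint32_t port_id;
		} port_info;
	} u;
};

/* Return 0 and store the port id only when the metadata is of the port-mux
 * type and was created for `dev`; anything else is left uninterpreted.
 */
static int demo_port_id(const struct demo_md *md,
			const struct demo_netdev *dev, uint32_t *port_id)
{
	assert(dev != NULL && port_id != NULL);
	if (!md || md->type != DEMO_MD_HW_PORT_MUX)
		return -1;
	if (md->u.port_info.lower_dev != dev)
		return -1;
	*port_id = md->u.port_info.port_id;
	return 0;
}
```

A device that receives the skb after a redirect fails the lower_dev
comparison and so treats the packet as carrying no port metadata.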

This is mostly for SR-IOV devices since switches don't have lower
netdevs today.

Since I don't have any real user in the tree at this point please
allow me to present a trivial example use here:

void upper_init(struct upper *upper, struct lower *lower, unsigned int id)
{
	upper->lower_dev = lower;

	/* allocation failure handling omitted for brevity */
	upper->dst_meta = metadata_dst_alloc(0, METADATA_HW_PORT_MUX,
					     GFP_KERNEL);
	upper->dst_meta->u.port_info.lower_dev = lower->netdev;
	upper->dst_meta->u.port_info.port_id = id;
}

int upper_tx(struct sk_buff *skb, struct net_device *netdev)
{
	struct upper *upper = netdev_priv(netdev);

	skb_dst_drop(skb);
	skb_dst_set_noref(skb, &upper->dst_meta->dst);

	/* hand the skb over to the lower device's queues */
	skb->dev = upper->lower_dev->netdev;
	return dev_queue_xmit(skb);
}

int lower_tx(struct sk_buff *skb, struct net_device *netdev)
{
	struct metadata_dst *md = skb_metadata_dst(skb);
	struct lower *lower = netdev_priv(netdev);

	if (md && md->type == METADATA_HW_PORT_MUX &&
	    md->u.port_info.lower_dev == netdev) {
		/* use md->u.port_info.port_id to set port in
		 * descriptor/metadata/do encap
		 */
	}
	...
}

Other approaches considered but found inferior:
 - in-data tags - inserting tags into data will be confusing
   to classifiers which start parsing from mac headers, also
   in-band data is less perfect and allows sufficiently privileged
   user to inject control messages from userspace (this is DSA model
   - note that in SR-IOV switchdev mode I control both upper and lower
   device which differs from DSA where lower device can be any MAC);
 - per-VFR HW queues - requiring a queue per VF is a little wasteful and
   less scalable, muxing allows us to use all PF queues to transmit
   and receive with full RSS (this is model of existing SR-IOV switchdev
   mode implementations);
 - per-VFR TC queue - we could use per-VFR queue in the lower device,
   tag traffic and TX on smaller set of HW queues but again scaling
   would suffer, we would need to lock an extra queue and we have no
   way to stop all queues when HW queues fill up reliably (this model
   would piggy back on dev_queue_xmit_accel() to select queue).


Any comments, reactions would be much appreciated!

Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
---
 include/net/dst_metadata.h     | 35 ++++++++++++++++++++++++++++-------
 net/core/dst.c                 | 14 +++++++++-----
 net/core/filter.c              |  1 +
 net/ipv4/ip_tunnel_core.c      |  5 +++--
 net/openvswitch/flow_netlink.c |  3 ++-
 5 files changed, 43 insertions(+), 15 deletions(-)

diff --git a/include/net/dst_metadata.h b/include/net/dst_metadata.h
index 6965c8f68ade..6d7e1e4f3acd 100644
--- a/include/net/dst_metadata.h
+++ b/include/net/dst_metadata.h
@@ -5,10 +5,22 @@
 #include <net/ip_tunnels.h>
 #include <net/dst.h>
 
+enum metadata_type {
+	METADATA_IP_TUNNEL,
+	METADATA_HW_PORT_MUX,
+};
+
+struct hw_port_info {
+	struct net_device *lower_dev;
+	u32 port_id;
+};
+
 struct metadata_dst {
 	struct dst_entry		dst;
+	enum metadata_type		type;
 	union {
 		struct ip_tunnel_info	tun_info;
+		struct hw_port_info	port_info;
 	} u;
 };
 
@@ -27,7 +39,7 @@ static inline struct ip_tunnel_info *skb_tunnel_info(struct sk_buff *skb)
 	struct metadata_dst *md_dst = skb_metadata_dst(skb);
 	struct dst_entry *dst;
 
-	if (md_dst)
+	if (md_dst && md_dst->type == METADATA_IP_TUNNEL)
 		return &md_dst->u.tun_info;
 
 	dst = skb_dst(skb);
@@ -55,7 +67,14 @@ static inline int skb_metadata_dst_cmp(const struct sk_buff *skb_a,
 	a = (const struct metadata_dst *) skb_dst(skb_a);
 	b = (const struct metadata_dst *) skb_dst(skb_b);
 
-	if (!a != !b || a->u.tun_info.options_len != b->u.tun_info.options_len)
+	if (!a != !b || a->type != b->type)
+		return 1;
+
+	if (a->type == METADATA_HW_PORT_MUX)
+		return memcmp(&a->u.port_info, &b->u.port_info,
+			      sizeof(a->u.port_info));
+
+	if (a->u.tun_info.options_len != b->u.tun_info.options_len)
 		return 1;
 
 	return memcmp(&a->u.tun_info, &b->u.tun_info,
@@ -63,14 +82,16 @@ static inline int skb_metadata_dst_cmp(const struct sk_buff *skb_a,
 }
 
 void metadata_dst_free(struct metadata_dst *);
-struct metadata_dst *metadata_dst_alloc(u8 optslen, gfp_t flags);
-struct metadata_dst __percpu *metadata_dst_alloc_percpu(u8 optslen, gfp_t flags);
+struct metadata_dst *metadata_dst_alloc(u8 optslen, enum metadata_type type,
+					gfp_t flags);
+struct metadata_dst __percpu *
+metadata_dst_alloc_percpu(u8 optslen, enum metadata_type type, gfp_t flags);
 
 static inline struct metadata_dst *tun_rx_dst(int md_size)
 {
 	struct metadata_dst *tun_dst;
 
-	tun_dst = metadata_dst_alloc(md_size, GFP_ATOMIC);
+	tun_dst = metadata_dst_alloc(md_size, METADATA_IP_TUNNEL, GFP_ATOMIC);
 	if (!tun_dst)
 		return NULL;
 
@@ -85,11 +106,11 @@ static inline struct metadata_dst *tun_dst_unclone(struct sk_buff *skb)
 	int md_size;
 	struct metadata_dst *new_md;
 
-	if (!md_dst)
+	if (!md_dst || md_dst->type != METADATA_IP_TUNNEL)
 		return ERR_PTR(-EINVAL);
 
 	md_size = md_dst->u.tun_info.options_len;
-	new_md = metadata_dst_alloc(md_size, GFP_ATOMIC);
+	new_md = metadata_dst_alloc(md_size, METADATA_IP_TUNNEL, GFP_ATOMIC);
 	if (!new_md)
 		return ERR_PTR(-ENOMEM);
 
diff --git a/net/core/dst.c b/net/core/dst.c
index b5cbbe07f786..dc8c0c0b197b 100644
--- a/net/core/dst.c
+++ b/net/core/dst.c
@@ -367,7 +367,8 @@ static int dst_md_discard(struct sk_buff *skb)
 	return 0;
 }
 
-static void __metadata_dst_init(struct metadata_dst *md_dst, u8 optslen)
+static void __metadata_dst_init(struct metadata_dst *md_dst,
+				enum metadata_type type, u8 optslen)
 {
 	struct dst_entry *dst;
 
@@ -379,9 +380,11 @@ static void __metadata_dst_init(struct metadata_dst *md_dst, u8 optslen)
 	dst->output = dst_md_discard_out;
 
 	memset(dst + 1, 0, sizeof(*md_dst) + optslen - sizeof(*dst));
+	md_dst->type = type;
 }
 
-struct metadata_dst *metadata_dst_alloc(u8 optslen, gfp_t flags)
+struct metadata_dst *metadata_dst_alloc(u8 optslen, enum metadata_type type,
+					gfp_t flags)
 {
 	struct metadata_dst *md_dst;
 
@@ -389,7 +392,7 @@ struct metadata_dst *metadata_dst_alloc(u8 optslen, gfp_t flags)
 	if (!md_dst)
 		return NULL;
 
-	__metadata_dst_init(md_dst, optslen);
+	__metadata_dst_init(md_dst, type, optslen);
 
 	return md_dst;
 }
@@ -403,7 +406,8 @@ void metadata_dst_free(struct metadata_dst *md_dst)
 	kfree(md_dst);
 }
 
-struct metadata_dst __percpu *metadata_dst_alloc_percpu(u8 optslen, gfp_t flags)
+struct metadata_dst __percpu *
+metadata_dst_alloc_percpu(u8 optslen, enum metadata_type type, gfp_t flags)
 {
 	int cpu;
 	struct metadata_dst __percpu *md_dst;
@@ -414,7 +418,7 @@ struct metadata_dst __percpu *metadata_dst_alloc_percpu(u8 optslen, gfp_t flags)
 		return NULL;
 
 	for_each_possible_cpu(cpu)
-		__metadata_dst_init(per_cpu_ptr(md_dst, cpu), optslen);
+		__metadata_dst_init(per_cpu_ptr(md_dst, cpu), type, optslen);
 
 	return md_dst;
 }
diff --git a/net/core/filter.c b/net/core/filter.c
index 0920c2ac1d00..61536a7e932e 100644
--- a/net/core/filter.c
+++ b/net/core/filter.c
@@ -2386,6 +2386,7 @@ bpf_get_skb_set_tunnel_proto(enum bpf_func_id which)
 		 * that is holding verifier mutex.
 		 */
 		md_dst = metadata_dst_alloc_percpu(IP_TUNNEL_OPTS_MAX,
+						   METADATA_IP_TUNNEL,
 						   GFP_KERNEL);
 		if (!md_dst)
 			return NULL;
diff --git a/net/ipv4/ip_tunnel_core.c b/net/ipv4/ip_tunnel_core.c
index 777bc1883870..12ffbc4a4daa 100644
--- a/net/ipv4/ip_tunnel_core.c
+++ b/net/ipv4/ip_tunnel_core.c
@@ -145,10 +145,11 @@ struct metadata_dst *iptunnel_metadata_reply(struct metadata_dst *md,
 	struct metadata_dst *res;
 	struct ip_tunnel_info *dst, *src;
 
-	if (!md || md->u.tun_info.mode & IP_TUNNEL_INFO_TX)
+	if (!md || md->type != METADATA_IP_TUNNEL ||
+	    md->u.tun_info.mode & IP_TUNNEL_INFO_TX)
 		return NULL;
 
-	res = metadata_dst_alloc(0, flags);
+	res = metadata_dst_alloc(0, METADATA_IP_TUNNEL, flags);
 	if (!res)
 		return NULL;
 
diff --git a/net/openvswitch/flow_netlink.c b/net/openvswitch/flow_netlink.c
index ae25ded82b3b..c9971701d0af 100644
--- a/net/openvswitch/flow_netlink.c
+++ b/net/openvswitch/flow_netlink.c
@@ -2072,7 +2072,8 @@ static int validate_and_copy_set_tun(const struct nlattr *attr,
 	if (start < 0)
 		return start;
 
-	tun_dst = metadata_dst_alloc(key.tun_opts_len, GFP_KERNEL);
+	tun_dst = metadata_dst_alloc(key.tun_opts_len, METADATA_IP_TUNNEL,
+				     GFP_KERNEL);
 	if (!tun_dst)
 		return -ENOMEM;
 
-- 
1.9.1


* Re: [PATCH v1] bpf: Set register type according to is_valid_access()
From: Daniel Borkmann @ 2016-09-22 19:41 UTC (permalink / raw)
  To: Mickaël Salaün, linux-kernel
  Cc: Alexei Starovoitov, Andy Lutomirski, Kees Cook, Sargun Dhillon,
	Tejun Heo, netdev
In-Reply-To: <20160922183512.13576-1-mic@digikod.net>

On 09/22/2016 08:35 PM, Mickaël Salaün wrote:
> This fix a pointer leak when an unprivileged eBPF program read a pointer
> value from the context. Even if is_valid_access() returns a pointer
> type, the eBPF verifier replace it with UNKNOWN_VALUE. The register
> value containing an address is then allowed to leak. Moreover, this
> prevented unprivileged eBPF programs to use functions with (legitimate)
> pointer arguments.
>
> This bug is not an issue for now because the only unprivileged eBPF
> program allowed is of type BPF_PROG_TYPE_SOCKET_FILTER and all the types
> from its context are UNKNOWN_VALUE. However, this fix is important for
> future unprivileged eBPF program types which could use pointers in their
> context.
>
> Signed-off-by: Mickaël Salaün <mic@digikod.net>
> Fixes: 969bf05eb3ce ("bpf: direct packet access")
> Cc: Alexei Starovoitov <ast@kernel.org>
> Cc: Andy Lutomirski <luto@amacapital.net>
> Cc: Daniel Borkmann <daniel@iogearbox.net>
> Cc: Kees Cook <keescook@chromium.org>
> Acked-by: Sargun Dhillon <sargun@sargun.me>
> ---
>   kernel/bpf/verifier.c | 6 ++----
>   1 file changed, 2 insertions(+), 4 deletions(-)
>
> diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c
> index daea765d72e6..0698ccd67715 100644
> --- a/kernel/bpf/verifier.c
> +++ b/kernel/bpf/verifier.c
> @@ -794,10 +794,8 @@ static int check_mem_access(struct verifier_env *env, u32 regno, int off,
>   		}
>   		err = check_ctx_access(env, off, size, t, &reg_type);
>   		if (!err && t == BPF_READ && value_regno >= 0) {
> -			mark_reg_unknown_value(state->regs, value_regno);
> -			if (env->allow_ptr_leaks)
> -				/* note that reg.[id|off|range] == 0 */
> -				state->regs[value_regno].type = reg_type;
> +			/* note that reg.[id|off|range] == 0 */
> +			state->regs[value_regno].type = reg_type;

True that it's not an issue currently, since reg_type is only set for
PTR_TO_PACKET/PTR_TO_PACKET_END in xdp and tc programs that can only be
loaded as privileged. So not an issue for BPF_PROG_TYPE_SOCKET_FILTER.

One thing I don't quite follow is why you remove the mark_reg_unknown_value()
as this also clears imm? I think this could result in an actual verifier
bug when it would reuse previous tracked imm value of that dst register?
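
The concern about a stale imm can be shown with a toy register model. This is
a simplified sketch, assuming a register state with only a type and an imm
field; the demo_* names are hypothetical and the structs are not the
verifier's real ones.

```c
#include <assert.h>
#include <stddef.h>

enum demo_reg_type {
	DEMO_UNKNOWN_VALUE,
	DEMO_CONST_IMM,
	DEMO_PTR_TO_PACKET,
};

struct demo_reg {
	enum demo_reg_type type;
	long imm;
};

/* Plays the role of mark_reg_unknown_value(): resets type and imm. */
static void demo_mark_unknown(struct demo_reg *r)
{
	assert(r != NULL);
	r->type = DEMO_UNKNOWN_VALUE;
	r->imm = 0;
}

/* v1-style context load: only the type is overwritten. */
static void demo_ctx_load_v1(struct demo_reg *dst, enum demo_reg_type t)
{
	assert(dst != NULL);
	dst->type = t;
}

/* Context load that resets the register first, then sets the type. */
static void demo_ctx_load_reset(struct demo_reg *dst, enum demo_reg_type t)
{
	demo_mark_unknown(dst);
	dst->type = t;
}
```

If the destination register previously tracked a constant, the v1-style load
leaves that constant in imm under the new type, while the resetting variant
starts from a clean state.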

>   		}
>
>   	} else if (reg->type == FRAME_PTR || reg->type == PTR_TO_STACK) {
>


* Re: [PATCH v1] bpf: Set register type according to is_valid_access()
From: Mickaël Salaün @ 2016-09-22 19:53 UTC (permalink / raw)
  To: Daniel Borkmann, linux-kernel
  Cc: Alexei Starovoitov, Andy Lutomirski, Kees Cook, Sargun Dhillon,
	Tejun Heo, netdev
In-Reply-To: <57E433F0.90407@iogearbox.net>



On 22/09/2016 21:41, Daniel Borkmann wrote:
> On 09/22/2016 08:35 PM, Mickaël Salaün wrote:
>> This fix a pointer leak when an unprivileged eBPF program read a pointer
>> value from the context. Even if is_valid_access() returns a pointer
>> type, the eBPF verifier replace it with UNKNOWN_VALUE. The register
>> value containing an address is then allowed to leak. Moreover, this
>> prevented unprivileged eBPF programs to use functions with (legitimate)
>> pointer arguments.
>>
>> This bug is not an issue for now because the only unprivileged eBPF
>> program allowed is of type BPF_PROG_TYPE_SOCKET_FILTER and all the types
>> from its context are UNKNOWN_VALUE. However, this fix is important for
>> future unprivileged eBPF program types which could use pointers in their
>> context.
>>
>> Signed-off-by: Mickaël Salaün <mic@digikod.net>
>> Fixes: 969bf05eb3ce ("bpf: direct packet access")
>> Cc: Alexei Starovoitov <ast@kernel.org>
>> Cc: Andy Lutomirski <luto@amacapital.net>
>> Cc: Daniel Borkmann <daniel@iogearbox.net>
>> Cc: Kees Cook <keescook@chromium.org>
>> Acked-by: Sargun Dhillon <sargun@sargun.me>
>> ---
>>   kernel/bpf/verifier.c | 6 ++----
>>   1 file changed, 2 insertions(+), 4 deletions(-)
>>
>> diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c
>> index daea765d72e6..0698ccd67715 100644
>> --- a/kernel/bpf/verifier.c
>> +++ b/kernel/bpf/verifier.c
>> @@ -794,10 +794,8 @@ static int check_mem_access(struct verifier_env
>> *env, u32 regno, int off,
>>           }
>>           err = check_ctx_access(env, off, size, t, &reg_type);
>>           if (!err && t == BPF_READ && value_regno >= 0) {
>> -            mark_reg_unknown_value(state->regs, value_regno);
>> -            if (env->allow_ptr_leaks)
>> -                /* note that reg.[id|off|range] == 0 */
>> -                state->regs[value_regno].type = reg_type;
>> +            /* note that reg.[id|off|range] == 0 */
>> +            state->regs[value_regno].type = reg_type;
> 
> True that it's not an issue currently, since reg_type is only set for
> PTR_TO_PACKET/PTR_TO_PACKET_END in xdp and tc programs that can only be
> loaded as privileged. So not an issue for BPF_PROG_TYPE_SOCKET_FILTER.
> 
> One thing I don't quite follow is why you remove the
> mark_reg_unknown_value()
> as this also clears imm? I think this could result in an actual verifier
> bug when it would reuse previous tracked imm value of that dst register?

Good catch, I missed the imm initialization. I'm going to send a new patch.

> 
>>           }
>>
>>       } else if (reg->type == FRAME_PTR || reg->type == PTR_TO_STACK) {
>>
> 



* [PATCH v2] bpf: Set register type according to is_valid_access()
From: Mickaël Salaün @ 2016-09-22 19:56 UTC (permalink / raw)
  To: linux-kernel
  Cc: Mickaël Salaün, Alexei Starovoitov, Andy Lutomirski,
	Daniel Borkmann, Kees Cook, Sargun Dhillon, Tejun Heo, netdev

This fixes a pointer leak when an unprivileged eBPF program reads a pointer
value from the context. Even if is_valid_access() returns a pointer
type, the eBPF verifier replaces it with UNKNOWN_VALUE. The register
value containing an address is then allowed to leak. Moreover, this
prevents unprivileged eBPF programs from using functions with (legitimate)
pointer arguments.

This bug is not an issue for now because the only unprivileged eBPF
program allowed is of type BPF_PROG_TYPE_SOCKET_FILTER and all the types
from its context are UNKNOWN_VALUE. However, this fix is important for
future unprivileged eBPF program types which could use pointers in their
context.

Signed-off-by: Mickaël Salaün <mic@digikod.net>
Fixes: 969bf05eb3ce ("bpf: direct packet access")
Cc: Alexei Starovoitov <ast@kernel.org>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Daniel Borkmann <daniel@iogearbox.net>
Cc: Kees Cook <keescook@chromium.org>
Acked-by: Sargun Dhillon <sargun@sargun.me>
---
 kernel/bpf/verifier.c | 5 ++---
 1 file changed, 2 insertions(+), 3 deletions(-)

diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c
index daea765d72e6..adbc7c161ba5 100644
--- a/kernel/bpf/verifier.c
+++ b/kernel/bpf/verifier.c
@@ -795,9 +795,8 @@ static int check_mem_access(struct verifier_env *env, u32 regno, int off,
 		err = check_ctx_access(env, off, size, t, &reg_type);
 		if (!err && t == BPF_READ && value_regno >= 0) {
 			mark_reg_unknown_value(state->regs, value_regno);
-			if (env->allow_ptr_leaks)
-				/* note that reg.[id|off|range] == 0 */
-				state->regs[value_regno].type = reg_type;
+			/* note that reg.[id|off|range] == 0 */
+			state->regs[value_regno].type = reg_type;
 		}
 
 	} else if (reg->type == FRAME_PTR || reg->type == PTR_TO_STACK) {
-- 
2.9.3


* Re: [PATCH net-next 2/3] udp: implement memory accounting helpers
From: Paolo Abeni @ 2016-09-22 20:27 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: Edward Cree, netdev, David S. Miller, James Morris,
	Trond Myklebust, Alexander Duyck, Daniel Borkmann, Eric Dumazet,
	Tom Herbert, Hannes Frederic Sowa, linux-nfs
In-Reply-To: <1474561848.23058.133.camel@edumazet-glaptop3.roam.corp.google.com>

On Thu, 2016-09-22 at 09:30 -0700, Eric Dumazet wrote:
> On Thu, 2016-09-22 at 18:14 +0200, Paolo Abeni wrote:
> 
> > I think that the idea behind using atomic ops directly on
> > sk_forward_alloc is to avoid adding other fields to the udp_socket. 
> > 
> > If we can add some fields to the udp_sock structure, the schema proposed
> > in this patch should fit better (modulo bugs ;-), always requiring a
> > single atomic operation at memory reclaiming time and at memory
> > allocation time.
> 
> But do we want any additional atomic to begin with ?
> 
> Given typical number of UDP sockets on a host, we could reserve/forward
> alloc at socket creation time, and when SO_RCVBUF is changed.

That would be very efficient and would probably work in most scenarios,
but if/when the system reaches udp memory pressure things will be
very bad: forward allocation on open() will fail and nobody will be able
to create any new udp socket, right ?

We are working on a v2 incorporating the feedback of your previous email
- still keeping the new udp_sock fields.
It looks much simpler than v1, will work reasonably well in memory
pressure scenarios, and performance is measurably better than v1, most
probably comparable with the above solution, since usually no additional
atomic operations (beyond sk_rmem_alloc updating) are performed on
enqueue/dequeue.

Paolo


* Re: [PATCH net-next 2/3] udp: implement memory accounting helpers
From: Eric Dumazet @ 2016-09-22 20:34 UTC (permalink / raw)
  To: Paolo Abeni
  Cc: Edward Cree, netdev, David S. Miller, James Morris,
	Trond Myklebust, Alexander Duyck, Daniel Borkmann, Eric Dumazet,
	Tom Herbert, Hannes Frederic Sowa, linux-nfs
In-Reply-To: <1474576020.7120.1.camel@redhat.com>

On Thu, 2016-09-22 at 22:27 +0200, Paolo Abeni wrote:
> On Thu, 2016-09-22 at 09:30 -0700, Eric Dumazet wrote:
> > On Thu, 2016-09-22 at 18:14 +0200, Paolo Abeni wrote:
> > 
> > > I think that the idea behind using atomic ops directly on
> > > sk_forward_alloc is to avoid adding other fields to the udp_socket. 
> > > 
> > > If we can add some fields to the udp_sock structure, the schema proposed
> > > in this patch should fit better (modulo bugs ;-), always requiring a
> > > single atomic operation at memory reclaiming time and at memory
> > > allocation time.
> > 
> > But do we want any additional atomic to begin with ?
> > 
> > Given typical number of UDP sockets on a host, we could reserve/forward
> > alloc at socket creation time, and when SO_RCVBUF is changed.
> 
> That would be very efficient and would probably work on most scenario,
> but if/when the system will reach udp memory pressure things will be
> very bad: forward allocation on open() will fail and nobody will be able
> to create any new udp socket, right ?
> 

No, we could allow one page per socket (udp_mem[0]) and applications
would still work.

TCP has the notion of memory pressure, and behaves roughly the same in
this case (one skb is allowed to be received)

The other (fat) sockets could notice udp_memory_pressure is set and
start reclaiming their forward allocations for other sockets.

We have a counter of UDP sockets, so probably doable to compute
udp_mem[2]/number 

Anyway, just an idea.

> We are working on a v2 incorporating the feedback of your previous email
> - still keeping the new udp_sock fields.
> It looks quite simpler than v1, will work reasonably well in memory
> pressure scenario, and performance are measurably better than v1, most
> probably comparable with the above solution, since usually no additional
> atomic operations  (beyond sk_rmem_alloc updating) are performed on
> enqueue/dequeue.


* Re: [PATCH net-next 2/3] udp: implement memory accounting helpers
From: Eric Dumazet @ 2016-09-22 20:37 UTC (permalink / raw)
  To: Paolo Abeni
  Cc: Edward Cree, netdev, David S. Miller, James Morris,
	Trond Myklebust, Alexander Duyck, Daniel Borkmann, Eric Dumazet,
	Tom Herbert, Hannes Frederic Sowa, linux-nfs
In-Reply-To: <1474576448.28155.7.camel@edumazet-glaptop3.roam.corp.google.com>

On Thu, 2016-09-22 at 13:34 -0700, Eric Dumazet wrote:
> On Thu, 2016-09-22 at 22:27 +0200, Paolo Abeni wrote:
> > On Thu, 2016-09-22 at 09:30 -0700, Eric Dumazet wrote:
> > > On Thu, 2016-09-22 at 18:14 +0200, Paolo Abeni wrote:
> > > 
> > > > I think that the idea behind using atomic ops directly on
> > > > sk_forward_alloc is to avoid adding other fields to the udp_socket. 
> > > > 
> > > > If we can add some fields to the udp_sock structure, the schema proposed
> > > > in this patch should fit better (modulo bugs ;-), always requiring a
> > > > single atomic operation at memory reclaiming time and at memory
> > > > allocation time.
> > > 
> > > But do we want any additional atomic to begin with ?
> > > 
> > > Given typical number of UDP sockets on a host, we could reserve/forward
> > > alloc at socket creation time, and when SO_RCVBUF is changed.
> > 
> > That would be very efficient and would probably work on most scenario,
> > but if/when the system will reach udp memory pressure things will be
> > very bad: forward allocation on open() will fail and nobody will be able
> > to create any new udp socket, right ?
> > 
> 
> No, we could allow one page per socket (udp_mem[0]) and applications
> would still work.

I meant udp_rmem_min, not udp_mem[0]

udp_rmem_min - INTEGER
        Minimal size of receive buffer used by UDP sockets in moderation.
        Each UDP socket is able to use the size for receiving data, even if
        total pages of UDP sockets exceed udp_mem pressure. The unit is byte.
        Default: 1 page


* [PATCH net-next 1/4] net: dsa: add port STP state helper
From: Vivien Didelot @ 2016-09-22 20:49 UTC (permalink / raw)
  To: netdev
  Cc: linux-kernel, kernel, David S. Miller, Florian Fainelli,
	Andrew Lunn, John Crispin, Vivien Didelot
In-Reply-To: <20160922204924.16229-1-vivien.didelot@savoirfairelinux.com>

Add a void helper to set the STP state of a port, checking first if the
required routine is provided by the driver.

Signed-off-by: Vivien Didelot <vivien.didelot@savoirfairelinux.com>
---
 net/dsa/slave.c | 17 ++++++++++-------
 1 file changed, 10 insertions(+), 7 deletions(-)

diff --git a/net/dsa/slave.c b/net/dsa/slave.c
index 9ecbe78..fd78d4c 100644
--- a/net/dsa/slave.c
+++ b/net/dsa/slave.c
@@ -69,6 +69,12 @@ static inline bool dsa_port_is_bridged(struct dsa_slave_priv *p)
 	return !!p->bridge_dev;
 }
 
+static void dsa_port_set_stp_state(struct dsa_switch *ds, int port, u8 state)
+{
+	if (ds->ops->port_stp_state_set)
+		ds->ops->port_stp_state_set(ds, port, state);
+}
+
 static int dsa_slave_open(struct net_device *dev)
 {
 	struct dsa_slave_priv *p = netdev_priv(dev);
@@ -104,8 +110,7 @@ static int dsa_slave_open(struct net_device *dev)
 			goto clear_promisc;
 	}
 
-	if (ds->ops->port_stp_state_set)
-		ds->ops->port_stp_state_set(ds, p->port, stp_state);
+	dsa_port_set_stp_state(ds, p->port, stp_state);
 
 	if (p->phy)
 		phy_start(p->phy);
@@ -147,8 +152,7 @@ static int dsa_slave_close(struct net_device *dev)
 	if (ds->ops->port_disable)
 		ds->ops->port_disable(ds, p->port, p->phy);
 
-	if (ds->ops->port_stp_state_set)
-		ds->ops->port_stp_state_set(ds, p->port, BR_STATE_DISABLED);
+	dsa_port_set_stp_state(ds, p->port, BR_STATE_DISABLED);
 
 	return 0;
 }
@@ -354,7 +358,7 @@ static int dsa_slave_stp_state_set(struct net_device *dev,
 	if (switchdev_trans_ph_prepare(trans))
 		return ds->ops->port_stp_state_set ? 0 : -EOPNOTSUPP;
 
-	ds->ops->port_stp_state_set(ds, p->port, attr->u.stp_state);
+	dsa_port_set_stp_state(ds, p->port, attr->u.stp_state);
 
 	return 0;
 }
@@ -556,8 +560,7 @@ static void dsa_slave_bridge_port_leave(struct net_device *dev)
 	/* Port left the bridge, put in BR_STATE_DISABLED by the bridge layer,
 	 * so allow it to be in BR_STATE_FORWARDING to be kept functional
 	 */
-	if (ds->ops->port_stp_state_set)
-		ds->ops->port_stp_state_set(ds, p->port, BR_STATE_FORWARDING);
+	dsa_port_set_stp_state(ds, p->port, BR_STATE_FORWARDING);
 }
 
 static int dsa_slave_port_attr_get(struct net_device *dev,
-- 
2.10.0


* [PATCH net-next 2/4] net: dsa: add port fast ageing
From: Vivien Didelot @ 2016-09-22 20:49 UTC (permalink / raw)
  To: netdev
  Cc: linux-kernel, kernel, David S. Miller, Florian Fainelli,
	Andrew Lunn, John Crispin, Vivien Didelot
In-Reply-To: <20160922204924.16229-1-vivien.didelot@savoirfairelinux.com>

Today the DSA drivers are in charge of flushing the MAC addresses
associated with a port when its STP state changes from Learning or
Forwarding to Disabled, Blocking or Listening.

This makes the drivers more complex and hides the generic switch logic.
Introduce a new optional port_fast_age operation to dsa_switch_ops, to
move this logic to the DSA layer and keep drivers simple.
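
The transition rule described above can be checked in isolation. The sketch
below is a userspace restatement, assuming the state numbering of the
BR_STATE_* constants in the bridge UAPI header; the DEMO_* names and
demo_needs_fast_age() are hypothetical and only mirror the condition, not the
DSA code path.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Same numbering as BR_STATE_* in include/uapi/linux/if_bridge.h. */
enum {
	DEMO_BR_STATE_DISABLED = 0,
	DEMO_BR_STATE_LISTENING = 1,
	DEMO_BR_STATE_LEARNING = 2,
	DEMO_BR_STATE_FORWARDING = 3,
	DEMO_BR_STATE_BLOCKING = 4,
};

/* True when a port leaves a state in which it learns addresses for a state
 * in which it does not, so its learned entries should be flushed.
 */
static bool demo_needs_fast_age(uint8_t old_state, uint8_t new_state)
{
	bool was_learning, stops_learning;

	assert(old_state <= DEMO_BR_STATE_BLOCKING);
	assert(new_state <= DEMO_BR_STATE_BLOCKING);

	was_learning = old_state == DEMO_BR_STATE_LEARNING ||
		       old_state == DEMO_BR_STATE_FORWARDING;
	stops_learning = new_state == DEMO_BR_STATE_DISABLED ||
			 new_state == DEMO_BR_STATE_BLOCKING ||
			 new_state == DEMO_BR_STATE_LISTENING;

	return was_learning && stops_learning;
}
```

Keeping the previous state per port is what makes this check possible in the
common layer, since the driver callback only sees the new state.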

Signed-off-by: Vivien Didelot <vivien.didelot@savoirfairelinux.com>
---
 include/net/dsa.h |  2 ++
 net/dsa/slave.c   | 18 ++++++++++++++++++
 2 files changed, 20 insertions(+)

diff --git a/include/net/dsa.h b/include/net/dsa.h
index 7556646..b122196 100644
--- a/include/net/dsa.h
+++ b/include/net/dsa.h
@@ -143,6 +143,7 @@ struct dsa_port {
 	struct net_device	*netdev;
 	struct device_node	*dn;
 	unsigned int		ageing_time;
+	u8			stp_state;
 };
 
 struct dsa_switch {
@@ -339,6 +340,7 @@ struct dsa_switch_ops {
 	void	(*port_bridge_leave)(struct dsa_switch *ds, int port);
 	void	(*port_stp_state_set)(struct dsa_switch *ds, int port,
 				      u8 state);
+	void	(*port_fast_age)(struct dsa_switch *ds, int port);
 
 	/*
 	 * VLAN support
diff --git a/net/dsa/slave.c b/net/dsa/slave.c
index fd78d4c..6b1282c 100644
--- a/net/dsa/slave.c
+++ b/net/dsa/slave.c
@@ -71,8 +71,26 @@ static inline bool dsa_port_is_bridged(struct dsa_slave_priv *p)
 
 static void dsa_port_set_stp_state(struct dsa_switch *ds, int port, u8 state)
 {
+	struct dsa_port *dp = &ds->ports[port];
+
 	if (ds->ops->port_stp_state_set)
 		ds->ops->port_stp_state_set(ds, port, state);
+
+	if (ds->ops->port_fast_age) {
+		/* Fast age FDB entries or flush appropriate forwarding database
+		 * for the given port, if we are moving it from Learning or
+		 * Forwarding state, to Disabled or Blocking or Listening state.
+		 */
+
+		if ((dp->stp_state == BR_STATE_LEARNING ||
+		     dp->stp_state == BR_STATE_FORWARDING) &&
+		    (state == BR_STATE_DISABLED ||
+		     state == BR_STATE_BLOCKING ||
+		     state == BR_STATE_LISTENING))
+			ds->ops->port_fast_age(ds, port);
+	}
+
+	dp->stp_state = state;
 }
 
 static int dsa_slave_open(struct net_device *dev)
-- 
2.10.0

^ permalink raw reply related

* [PATCH net-next 3/4] net: dsa: b53: implement DSA port fast ageing
From: Vivien Didelot @ 2016-09-22 20:49 UTC (permalink / raw)
  To: netdev
  Cc: linux-kernel, kernel, David S. Miller, Florian Fainelli,
	Andrew Lunn, John Crispin, Vivien Didelot
In-Reply-To: <20160922204924.16229-1-vivien.didelot@savoirfairelinux.com>

Remove the fast ageing logic from b53_br_set_stp_state and implement the
new DSA switch port_fast_age operation instead.

Signed-off-by: Vivien Didelot <vivien.didelot@savoirfairelinux.com>
---
 drivers/net/dsa/b53/b53_common.c | 31 +++++++++++--------------------
 1 file changed, 11 insertions(+), 20 deletions(-)

diff --git a/drivers/net/dsa/b53/b53_common.c b/drivers/net/dsa/b53/b53_common.c
index 1a492c0..64be66d 100644
--- a/drivers/net/dsa/b53/b53_common.c
+++ b/drivers/net/dsa/b53/b53_common.c
@@ -1402,16 +1402,12 @@ static void b53_br_leave(struct dsa_switch *ds, int port)
 	}
 }
 
-static void b53_br_set_stp_state(struct dsa_switch *ds, int port,
-				 u8 state)
+static void b53_br_set_stp_state(struct dsa_switch *ds, int port, u8 state)
 {
 	struct b53_device *dev = ds->priv;
-	u8 hw_state, cur_hw_state;
+	u8 hw_state;
 	u8 reg;
 
-	b53_read8(dev, B53_CTRL_PAGE, B53_PORT_CTRL(port), &reg);
-	cur_hw_state = reg & PORT_CTRL_STP_STATE_MASK;
-
 	switch (state) {
 	case BR_STATE_DISABLED:
 		hw_state = PORT_CTRL_DIS_STATE;
@@ -1433,26 +1429,20 @@ static void b53_br_set_stp_state(struct dsa_switch *ds, int port,
 		return;
 	}
 
-	/* Fast-age ARL entries if we are moving a port from Learning or
-	 * Forwarding (cur_hw_state) state to Disabled, Blocking or Listening
-	 * state (hw_state)
-	 */
-	if (cur_hw_state != hw_state) {
-		if (cur_hw_state >= PORT_CTRL_LEARN_STATE &&
-		    hw_state <= PORT_CTRL_LISTEN_STATE) {
-			if (b53_fast_age_port(dev, port)) {
-				dev_err(ds->dev, "fast ageing failed\n");
-				return;
-			}
-		}
-	}
-
 	b53_read8(dev, B53_CTRL_PAGE, B53_PORT_CTRL(port), &reg);
 	reg &= ~PORT_CTRL_STP_STATE_MASK;
 	reg |= hw_state;
 	b53_write8(dev, B53_CTRL_PAGE, B53_PORT_CTRL(port), reg);
 }
 
+static void b53_br_fast_age(struct dsa_switch *ds, int port)
+{
+	struct b53_device *dev = ds->priv;
+
+	if (b53_fast_age_port(dev, port))
+		dev_err(ds->dev, "fast ageing failed\n");
+}
+
 static enum dsa_tag_protocol b53_get_tag_protocol(struct dsa_switch *ds)
 {
 	return DSA_TAG_PROTO_NONE;
@@ -1472,6 +1462,7 @@ static struct dsa_switch_ops b53_switch_ops = {
 	.port_bridge_join	= b53_br_join,
 	.port_bridge_leave	= b53_br_leave,
 	.port_stp_state_set	= b53_br_set_stp_state,
+	.port_fast_age		= b53_br_fast_age,
 	.port_vlan_filtering	= b53_vlan_filtering,
 	.port_vlan_prepare	= b53_vlan_prepare,
 	.port_vlan_add		= b53_vlan_add,
-- 
2.10.0

^ permalink raw reply related

* [PATCH net-next 4/4] net: dsa: mv88e6xxx: implement DSA port fast ageing
From: Vivien Didelot @ 2016-09-22 20:49 UTC (permalink / raw)
  To: netdev
  Cc: linux-kernel, kernel, David S. Miller, Florian Fainelli,
	Andrew Lunn, John Crispin, Vivien Didelot
In-Reply-To: <20160922204924.16229-1-vivien.didelot@savoirfairelinux.com>

Now that the DSA layer handles port fast ageing on the relevant STP state
changes, simplify _mv88e6xxx_port_state and implement mv88e6xxx_port_fast_age.

Signed-off-by: Vivien Didelot <vivien.didelot@savoirfairelinux.com>
---
 drivers/net/dsa/mv88e6xxx/chip.c | 45 ++++++++++++++++++++--------------------
 1 file changed, 23 insertions(+), 22 deletions(-)

diff --git a/drivers/net/dsa/mv88e6xxx/chip.c b/drivers/net/dsa/mv88e6xxx/chip.c
index 25bd3fa..122876c 100644
--- a/drivers/net/dsa/mv88e6xxx/chip.c
+++ b/drivers/net/dsa/mv88e6xxx/chip.c
@@ -1133,31 +1133,18 @@ static int _mv88e6xxx_port_state(struct mv88e6xxx_chip *chip, int port,
 
 	oldstate = reg & PORT_CONTROL_STATE_MASK;
 
-	if (oldstate != state) {
-		/* Flush forwarding database if we're moving a port
-		 * from Learning or Forwarding state to Disabled or
-		 * Blocking or Listening state.
-		 */
-		if ((oldstate == PORT_CONTROL_STATE_LEARNING ||
-		     oldstate == PORT_CONTROL_STATE_FORWARDING) &&
-		    (state == PORT_CONTROL_STATE_DISABLED ||
-		     state == PORT_CONTROL_STATE_BLOCKING)) {
-			err = _mv88e6xxx_atu_remove(chip, 0, port, false);
-			if (err)
-				return err;
-		}
+	reg &= ~PORT_CONTROL_STATE_MASK;
+	reg |= state;
 
-		reg = (reg & ~PORT_CONTROL_STATE_MASK) | state;
-		err = mv88e6xxx_port_write(chip, port, PORT_CONTROL, reg);
-		if (err)
-			return err;
+	err = mv88e6xxx_port_write(chip, port, PORT_CONTROL, reg);
+	if (err)
+		return err;
 
-		netdev_dbg(ds->ports[port].netdev, "PortState %s (was %s)\n",
-			   mv88e6xxx_port_state_names[state],
-			   mv88e6xxx_port_state_names[oldstate]);
-	}
+	netdev_dbg(ds->ports[port].netdev, "PortState %s (was %s)\n",
+		   mv88e6xxx_port_state_names[state],
+		   mv88e6xxx_port_state_names[oldstate]);
 
-	return err;
+	return 0;
 }
 
 static int _mv88e6xxx_port_based_vlan_map(struct mv88e6xxx_chip *chip, int port)
@@ -1232,6 +1219,19 @@ static void mv88e6xxx_port_stp_state_set(struct dsa_switch *ds, int port,
 			   mv88e6xxx_port_state_names[stp_state]);
 }
 
+static void mv88e6xxx_port_fast_age(struct dsa_switch *ds, int port)
+{
+	struct mv88e6xxx_chip *chip = ds->priv;
+	int err;
+
+	mutex_lock(&chip->reg_lock);
+	err = _mv88e6xxx_atu_remove(chip, 0, port, false);
+	mutex_unlock(&chip->reg_lock);
+
+	if (err)
+		netdev_err(ds->ports[port].netdev, "failed to flush ATU\n");
+}
+
 static int _mv88e6xxx_port_pvid(struct mv88e6xxx_chip *chip, int port,
 				u16 *new, u16 *old)
 {
@@ -3684,6 +3684,7 @@ static struct dsa_switch_ops mv88e6xxx_switch_ops = {
 	.port_bridge_join	= mv88e6xxx_port_bridge_join,
 	.port_bridge_leave	= mv88e6xxx_port_bridge_leave,
 	.port_stp_state_set	= mv88e6xxx_port_stp_state_set,
+	.port_fast_age		= mv88e6xxx_port_fast_age,
 	.port_vlan_filtering	= mv88e6xxx_port_vlan_filtering,
 	.port_vlan_prepare	= mv88e6xxx_port_vlan_prepare,
 	.port_vlan_add		= mv88e6xxx_port_vlan_add,
-- 
2.10.0

^ permalink raw reply related

* [PATCH net-next 0/4] net: dsa: add port fast ageing
From: Vivien Didelot @ 2016-09-22 20:49 UTC (permalink / raw)
  To: netdev
  Cc: linux-kernel, kernel, David S. Miller, Florian Fainelli,
	Andrew Lunn, John Crispin, Vivien Didelot

Today the DSA drivers are in charge of flushing the MAC addresses
associated with a port when its STP state changes from Learning or
Forwarding, to Disabled or Blocking or Listening.

This makes the drivers more complex and hides this generic switch logic.

This patchset introduces a new optional port_fast_age operation to
dsa_switch_ops, to move this logic to the DSA layer and keep drivers
simple. b53 and mv88e6xxx are updated accordingly.

Vivien Didelot (4):
  net: dsa: add port STP state helper
  net: dsa: add port fast ageing
  net: dsa: b53: implement DSA port fast ageing
  net: dsa: mv88e6xxx: implement DSA port fast ageing

 drivers/net/dsa/b53/b53_common.c | 31 ++++++++++-----------------
 drivers/net/dsa/mv88e6xxx/chip.c | 45 ++++++++++++++++++++--------------------
 include/net/dsa.h                |  2 ++
 net/dsa/slave.c                  | 35 ++++++++++++++++++++++++-------
 4 files changed, 64 insertions(+), 49 deletions(-)

-- 
2.10.0

^ permalink raw reply

* [PATCH net v2] L2TP:Adjust intf MTU,factor underlay L3,overlay L2
From: R. Parameswaran @ 2016-09-22 20:52 UTC (permalink / raw)
  To: kleptog, jchapman, netdev
  Cc: davem, linux-kernel, nprachan, rshearma, dfawcus, stephen, acme,
	lboccass, parameswaran.r7

>From ed585bdd6d3d2b3dec58d414f514cd764d89159d Mon Sep 17 00:00:00 2001
From: "R. Parameswaran" <rparames@brocade.com>
Date: Thu, 22 Sep 2016 13:19:25 -0700
Subject: [PATCH] L2TP:Adjust intf MTU,factor underlay L3,overlay L2

Take into account all of the tunnel encapsulation headers when setting
up the MTU on the L2TP logical interface device. Otherwise, packets
created by the applications on top of the L2TP layer are larger
than they ought to be, relative to the underlay MTU, leading to
needless fragmentation once the outer IP encap is added.

Specifically, take into account the (outer, underlay) IP header
imposed on the encapsulated L2TP packet, and the Layer 2 header
imposed on the inner IP packet prior to L2TP encapsulation.

Do not assume an Ethernet (non-jumbo) underlay. Use the PMTU mechanism
and the dst entry in the L2TP tunnel socket to directly pull up
the underlay MTU (as the baseline number on top of which the
encapsulation headers are factored in).  Fall back to Ethernet MTU
if this fails.

Signed-off-by: R. Parameswaran <rparames@brocade.com>

Reviewed-by: "N. Prachanda" <nprachan@brocade.com>
Reviewed-by: "R. Shearman" <rshearma@brocade.com>
Reviewed-by: "D. Fawcus" <dfawcus@brocade.com>
---
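As a worked example of the overhead arithmetic described above, here is a
small userspace sketch. The header sizes are the usual fixed sizes (IPv4
without options); l2tp_eth_overhead() and the L2TP header length passed to
it are illustrative assumptions, not code or values taken from the patch.

```c
#include <stdbool.h>

#define ETH_HLEN_BYTES	14	/* overlay (inner) Ethernet header */
#define IPV4_HDR_BYTES	20	/* sizeof(struct iphdr), no options */
#define IPV6_HDR_BYTES	40	/* sizeof(struct ipv6hdr) */
#define UDP_HDR_BYTES	8	/* sizeof(struct udphdr) */

/* Hypothetical helper mirroring the overhead computation: L2TP session
 * header + overlay L2 header + underlay L3 header, plus the UDP header
 * when the tunnel uses UDP encapsulation.
 */
static unsigned int l2tp_eth_overhead(unsigned int l2tp_hdr_len,
				      bool underlay_is_ipv6, bool udp_encap)
{
	unsigned int overhead = l2tp_hdr_len + ETH_HLEN_BYTES;

	overhead += underlay_is_ipv6 ? IPV6_HDR_BYTES : IPV4_HDR_BYTES;
	if (udp_encap)
		overhead += UDP_HDR_BYTES;

	return overhead;
}
```

With an assumed 8-byte L2TP header over IPv4/UDP this gives 50 bytes of
overhead, so a 1500-byte underlay MTU would leave 1450 for the L2TP
device. Unlike the patch, the sketch only models the two address families
and does not cover a socket that is neither AF_INET nor AF_INET6.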
 net/l2tp/l2tp_eth.c | 48 ++++++++++++++++++++++++++++++++++++++++++++----
 1 file changed, 44 insertions(+), 4 deletions(-)

diff --git a/net/l2tp/l2tp_eth.c b/net/l2tp/l2tp_eth.c
index 57fc5a4..dbcd6bd 100644
--- a/net/l2tp/l2tp_eth.c
+++ b/net/l2tp/l2tp_eth.c
@@ -30,6 +30,9 @@
 #include <net/xfrm.h>
 #include <net/net_namespace.h>
 #include <net/netns/generic.h>
+#include <linux/ip.h>
+#include <linux/ipv6.h>
+#include <linux/udp.h>
 
 #include "l2tp_core.h"
 
@@ -206,6 +209,46 @@ static void l2tp_eth_show(struct seq_file *m, void *arg)
 }
 #endif
 
+static void l2tp_eth_adjust_mtu(struct l2tp_tunnel *tunnel,
+				struct l2tp_session *session,
+				struct net_device *dev)
+{
+	unsigned int overhead = 0;
+	struct dst_entry *dst;
+
+	if (session->mtu != 0) {
+		dev->mtu = session->mtu;
+		dev->needed_headroom += session->hdr_len;
+		if (tunnel->encap == L2TP_ENCAPTYPE_UDP)
+			dev->needed_headroom += sizeof(struct udphdr);
+		return;
+	}
+	overhead = session->hdr_len;
+	/* Adjust MTU, factor overhead - underlay L3 hdr, overlay L2 hdr */
+	if (tunnel->sock->sk_family == AF_INET)
+		overhead += (ETH_HLEN + sizeof(struct iphdr));
+	else if (tunnel->sock->sk_family == AF_INET6)
+		overhead += (ETH_HLEN + sizeof(struct ipv6hdr));
+	/* Additionally, if the encap is UDP, account for UDP header size */
+	if (tunnel->encap == L2TP_ENCAPTYPE_UDP)
+		overhead += sizeof(struct udphdr);
+	/* If PMTU discovery was enabled, use discovered MTU on L2TP device */
+	dst = sk_dst_get(tunnel->sock);
+	if (dst) {
+		u32 pmtu = dst_mtu(dst);
+
+		if (pmtu != 0)
+			dev->mtu = pmtu;
+		dst_release(dst);
+	}
+	/* else (no PMTUD) L2TP dev MTU defaulted to Ethernet MTU in caller */
+	session->mtu = dev->mtu - overhead;
+	dev->mtu = session->mtu;
+	dev->needed_headroom += session->hdr_len;
+	if (tunnel->encap == L2TP_ENCAPTYPE_UDP)
+		dev->needed_headroom += sizeof(struct udphdr);
+}
+
 static int l2tp_eth_create(struct net *net, u32 tunnel_id, u32 session_id, u32 peer_session_id, struct l2tp_session_cfg *cfg)
 {
 	struct net_device *dev;
@@ -255,11 +298,8 @@ static int l2tp_eth_create(struct net *net, u32 tunnel_id, u32 session_id, u32 p
 	}
 
 	dev_net_set(dev, net);
-	if (session->mtu == 0)
-		session->mtu = dev->mtu - session->hdr_len;
-	dev->mtu = session->mtu;
-	dev->needed_headroom += session->hdr_len;
 
+	l2tp_eth_adjust_mtu(tunnel, session, dev);
 	priv = netdev_priv(dev);
 	priv->dev = dev;
 	priv->session = session;
-- 
2.1.4

^ permalink raw reply related

* [PATCH v2 net-next 0/2] bnx2x: page allocation failure
From: Jason Baron @ 2016-09-22 21:12 UTC (permalink / raw)
  To: davem; +Cc: netdev, Ariel.Elior, Yuval.Mintz

Hi,

While configuring ~500 multicast addrs, we ran into high order
page allocation failures. They don't need to be high order, and
thus I'm proposing to split them into at most PAGE_SIZE allocations.

Below is a sample failure.

Thanks,

-Jason

[1201902.617882] bnx2x: [bnx2x_set_mc_list:12374(eth0)]Failed to create multicast MACs list: -12
[1207325.695021] kworker/1:0: page allocation failure: order:2, mode:0xc020
[1207325.702059] CPU: 1 PID: 15805 Comm: kworker/1:0 Tainted: G        W
[1207325.712940] Hardware name: SYNNEX CORPORATION 1x8-X4i SSD 10GE/S5512LE, BIOS V8.810 05/16/2013
[1207325.722284] Workqueue: events bnx2x_sp_rtnl_task [bnx2x]
[1207325.728206]  0000000000000000 ffff88012d873a78 ffffffff8267f7c7 000000000000c020
[1207325.736754]  0000000000000000 ffff88012d873b08 ffffffff8212f8e0 fffffffc00000003
[1207325.745301]  ffff88041ffecd80 ffff880400000030 0000000000000002 0000c0206800da13
[1207325.753846] Call Trace:
[1207325.756789]  [<ffffffff8267f7c7>] dump_stack+0x4d/0x63
[1207325.762426]  [<ffffffff8212f8e0>] warn_alloc_failed+0xe0/0x130
[1207325.768756]  [<ffffffff8213c898>] ? wakeup_kswapd+0x48/0x140
[1207325.774914]  [<ffffffff82132afc>] __alloc_pages_nodemask+0x2bc/0x970
[1207325.781761]  [<ffffffff82173691>] alloc_pages_current+0x91/0x100
[1207325.788260]  [<ffffffff8212fa1e>] alloc_kmem_pages+0xe/0x10
[1207325.794329]  [<ffffffff8214c9c8>] kmalloc_order+0x18/0x50
[1207325.800227]  [<ffffffff8214ca26>] kmalloc_order_trace+0x26/0xb0
[1207325.806642]  [<ffffffff82451c68>] ? _xfer_secondary_pool+0xa8/0x1a0
[1207325.813404]  [<ffffffff8217cfda>] __kmalloc+0x19a/0x1b0
[1207325.819142]  [<ffffffffa02fe975>] bnx2x_set_rx_mode_inner+0x3d5/0x590 [bnx2x]
[1207325.827000]  [<ffffffffa02ff52d>] bnx2x_sp_rtnl_task+0x28d/0x760 [bnx2x]
[1207325.834197]  [<ffffffff820695d4>] process_one_work+0x134/0x3c0
[1207325.840522]  [<ffffffff82069981>] worker_thread+0x121/0x460
[1207325.846585]  [<ffffffff82069860>] ? process_one_work+0x3c0/0x3c0
[1207325.853089]  [<ffffffff8206f039>] kthread+0xc9/0xe0
[1207325.858459]  [<ffffffff82070000>] ? notify_die+0x10/0x40
[1207325.864263]  [<ffffffff8206ef70>] ? kthread_create_on_node+0x180/0x180
[1207325.871288]  [<ffffffff826852d2>] ret_from_fork+0x42/0x70
[1207325.877183]  [<ffffffff8206ef70>] ? kthread_create_on_node+0x180/0x180
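
To relate the ~500 addresses to the "order:2" in the trace, here is a rough
userspace sketch. It assumes 4 KiB pages and a 24-byte list element (a
list_head plus one pointer on 64-bit); both numbers are assumptions, not
values read from the trace. alloc_order() is an illustrative helper, not a
kernel function.

```c
#include <stddef.h>

#define SKETCH_PAGE_SIZE 4096UL	/* assumed page size */

/* Smallest order such that (PAGE_SIZE << order) covers 'size' bytes. */
static unsigned int alloc_order(size_t size)
{
	unsigned int order = 0;

	while ((SKETCH_PAGE_SIZE << order) < size)
		order++;

	return order;
}
```

Under those assumptions, 500 * 24 = 12000 bytes needs three pages, which
rounds up to an order-2 (four page) contiguous allocation.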

v2:
 -make use of list_next_entry()
 -only use PAGE_SIZE allocations

Jason Baron (2):
  bnx2x: allocate mac filtering 'mcast_list' in PAGE_SIZE increments
  bnx2x: allocate mac filtering pending list in PAGE_SIZE increments

 drivers/net/ethernet/broadcom/bnx2x/bnx2x_main.c |  79 +++++++++------
 drivers/net/ethernet/broadcom/bnx2x/bnx2x_sp.c   | 123 ++++++++++++++++-------
 2 files changed, 137 insertions(+), 65 deletions(-)

-- 
2.6.1

^ permalink raw reply

* [PATCH v2 net-next 1/2] bnx2x: allocate mac filtering 'mcast_list' in PAGE_SIZE increments
From: Jason Baron @ 2016-09-22 21:12 UTC (permalink / raw)
  To: davem; +Cc: netdev, Ariel.Elior, Yuval.Mintz
In-Reply-To: <cover.1474577624.git.jbaron@akamai.com>

From: Jason Baron <jbaron@akamai.com>

Currently, we can have high order page allocations that specify
GFP_ATOMIC when configuring multicast MAC address filters.

For example, we have seen order 2 page allocation failures with
~500 multicast addresses configured.

Convert the allocation for 'mcast_list' to be done in PAGE_SIZE
increments.

Signed-off-by: Jason Baron <jbaron@akamai.com>
Cc: Yuval Mintz <Yuval.Mintz@qlogic.com>
Cc: Ariel Elior <Ariel.Elior@qlogic.com>
---
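The sizing behind the PAGE_SIZE increments can be sketched in userspace as
follows. The struct and macro names below are simplified stand-ins for the
ones in the patch, and the 4 KiB page size is an assumption.

```c
#include <stddef.h>

#define SKETCH_PAGE_SIZE 4096UL	/* assumed page size */

/* Simplified stand-ins for the kernel types used by the patch. */
struct list_node {
	struct list_node *next, *prev;
};

struct mc_elem {
	struct list_node link;
	unsigned char *mac;
};

struct mc_elem_group {
	struct list_node group_link;
	struct mc_elem elems[];	/* fills the rest of the page */
};

/* Elements that fit in one page after the group header. */
#define ELEMS_PER_PAGE \
	((SKETCH_PAGE_SIZE - sizeof(struct mc_elem_group)) / \
	 sizeof(struct mc_elem))

/* Number of single-page allocations needed for nr_elems elements. */
static size_t pages_needed(size_t nr_elems)
{
	return (nr_elems + ELEMS_PER_PAGE - 1) / ELEMS_PER_PAGE;
}
```

With 8-byte pointers this works out to 170 elements per page, so ~500
addresses would take three order-0 allocations rather than one order-2
allocation.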
 drivers/net/ethernet/broadcom/bnx2x/bnx2x_main.c | 79 +++++++++++++++---------
 1 file changed, 51 insertions(+), 28 deletions(-)

diff --git a/drivers/net/ethernet/broadcom/bnx2x/bnx2x_main.c b/drivers/net/ethernet/broadcom/bnx2x/bnx2x_main.c
index dab61a81a3ba..20fe6a8c35c1 100644
--- a/drivers/net/ethernet/broadcom/bnx2x/bnx2x_main.c
+++ b/drivers/net/ethernet/broadcom/bnx2x/bnx2x_main.c
@@ -12563,43 +12563,64 @@ static int bnx2x_close(struct net_device *dev)
 	return 0;
 }
 
-static int bnx2x_init_mcast_macs_list(struct bnx2x *bp,
-				      struct bnx2x_mcast_ramrod_params *p)
+struct bnx2x_mcast_list_elem_group
 {
-	int mc_count = netdev_mc_count(bp->dev);
-	struct bnx2x_mcast_list_elem *mc_mac =
-		kcalloc(mc_count, sizeof(*mc_mac), GFP_ATOMIC);
-	struct netdev_hw_addr *ha;
+	struct list_head mcast_group_link;
+	struct bnx2x_mcast_list_elem mcast_elems[];
+};
 
-	if (!mc_mac) {
-		BNX2X_ERR("Failed to allocate mc MAC list\n");
-		return -ENOMEM;
+#define MCAST_ELEMS_PER_PG \
+	((PAGE_SIZE - sizeof(struct bnx2x_mcast_list_elem_group)) / \
+	sizeof(struct bnx2x_mcast_list_elem))
+
+static void bnx2x_free_mcast_macs_list(struct list_head *mcast_group_list)
+{
+	struct bnx2x_mcast_list_elem_group *current_mcast_group;
+
+	while (!list_empty(mcast_group_list)) {
+		current_mcast_group = list_first_entry(mcast_group_list,
+				      struct bnx2x_mcast_list_elem_group,
+				      mcast_group_link);
+		list_del(&current_mcast_group->mcast_group_link);
+		free_page((unsigned long)current_mcast_group);
 	}
+}
 
-	INIT_LIST_HEAD(&p->mcast_list);
+static int bnx2x_init_mcast_macs_list(struct bnx2x *bp,
+				      struct bnx2x_mcast_ramrod_params *p,
+				      struct list_head *mcast_group_list)
+{
+	struct bnx2x_mcast_list_elem *mc_mac;
+	struct netdev_hw_addr *ha;
+	struct bnx2x_mcast_list_elem_group *current_mcast_group = NULL;
+	int mc_count = netdev_mc_count(bp->dev);
+	int offset = 0;
 
+	INIT_LIST_HEAD(&p->mcast_list);
 	netdev_for_each_mc_addr(ha, bp->dev) {
+		if (!offset) {
+			current_mcast_group =
+				(struct bnx2x_mcast_list_elem_group *)
+				__get_free_page(GFP_ATOMIC);
+			if (!current_mcast_group) {
+				bnx2x_free_mcast_macs_list(mcast_group_list);
+				BNX2X_ERR("Failed to allocate mc MAC list\n");
+				return -ENOMEM;
+			}
+			list_add(&current_mcast_group->mcast_group_link,
+				 mcast_group_list);
+		}
+		mc_mac = &current_mcast_group->mcast_elems[offset];
 		mc_mac->mac = bnx2x_mc_addr(ha);
 		list_add_tail(&mc_mac->link, &p->mcast_list);
-		mc_mac++;
+		offset++;
+		if (offset == MCAST_ELEMS_PER_PG)
+			offset = 0;
 	}
-
 	p->mcast_list_len = mc_count;
-
 	return 0;
 }
 
-static void bnx2x_free_mcast_macs_list(
-	struct bnx2x_mcast_ramrod_params *p)
-{
-	struct bnx2x_mcast_list_elem *mc_mac =
-		list_first_entry(&p->mcast_list, struct bnx2x_mcast_list_elem,
-				 link);
-
-	WARN_ON(!mc_mac);
-	kfree(mc_mac);
-}
-
 /**
  * bnx2x_set_uc_list - configure a new unicast MACs list.
  *
@@ -12647,6 +12668,7 @@ static int bnx2x_set_uc_list(struct bnx2x *bp)
 
 static int bnx2x_set_mc_list_e1x(struct bnx2x *bp)
 {
+	LIST_HEAD(mcast_group_list);
 	struct net_device *dev = bp->dev;
 	struct bnx2x_mcast_ramrod_params rparam = {NULL};
 	int rc = 0;
@@ -12662,7 +12684,7 @@ static int bnx2x_set_mc_list_e1x(struct bnx2x *bp)
 
 	/* then, configure a new MACs list */
 	if (netdev_mc_count(dev)) {
-		rc = bnx2x_init_mcast_macs_list(bp, &rparam);
+		rc = bnx2x_init_mcast_macs_list(bp, &rparam, &mcast_group_list);
 		if (rc)
 			return rc;
 
@@ -12673,7 +12695,7 @@ static int bnx2x_set_mc_list_e1x(struct bnx2x *bp)
 			BNX2X_ERR("Failed to set a new multicast configuration: %d\n",
 				  rc);
 
-		bnx2x_free_mcast_macs_list(&rparam);
+		bnx2x_free_mcast_macs_list(&mcast_group_list);
 	}
 
 	return rc;
@@ -12681,6 +12703,7 @@ static int bnx2x_set_mc_list_e1x(struct bnx2x *bp)
 
 static int bnx2x_set_mc_list(struct bnx2x *bp)
 {
+	LIST_HEAD(mcast_group_list);
 	struct bnx2x_mcast_ramrod_params rparam = {NULL};
 	struct net_device *dev = bp->dev;
 	int rc = 0;
@@ -12692,7 +12715,7 @@ static int bnx2x_set_mc_list(struct bnx2x *bp)
 	rparam.mcast_obj = &bp->mcast_obj;
 
 	if (netdev_mc_count(dev)) {
-		rc = bnx2x_init_mcast_macs_list(bp, &rparam);
+		rc = bnx2x_init_mcast_macs_list(bp, &rparam, &mcast_group_list);
 		if (rc)
 			return rc;
 
@@ -12703,7 +12726,7 @@ static int bnx2x_set_mc_list(struct bnx2x *bp)
 			BNX2X_ERR("Failed to set a new multicast configuration: %d\n",
 				  rc);
 
-		bnx2x_free_mcast_macs_list(&rparam);
+		bnx2x_free_mcast_macs_list(&mcast_group_list);
 	} else {
 		/* If no mc addresses are required, flush the configuration */
 		rc = bnx2x_config_mcast(bp, &rparam, BNX2X_MCAST_CMD_DEL);
-- 
2.6.1

^ permalink raw reply related

* [PATCH v2 net-next 2/2] bnx2x: allocate mac filtering pending list in PAGE_SIZE increments
From: Jason Baron @ 2016-09-22 21:12 UTC (permalink / raw)
  To: davem; +Cc: netdev, Ariel.Elior, Yuval.Mintz
In-Reply-To: <cover.1474577624.git.jbaron@akamai.com>

From: Jason Baron <jbaron@akamai.com>

Currently, we can have high order page allocations that specify
GFP_ATOMIC when configuring multicast MAC address filters.

For example, we have seen order 2 page allocation failures with
~500 multicast addresses configured.

Convert the allocation for the pending list to be done in PAGE_SIZE
increments.

Signed-off-by: Jason Baron <jbaron@akamai.com>
Cc: Yuval Mintz <Yuval.Mintz@qlogic.com>
Cc: Ariel Elior <Ariel.Elior@qlogic.com>
---
 drivers/net/ethernet/broadcom/bnx2x/bnx2x_sp.c | 123 +++++++++++++++++--------
 1 file changed, 86 insertions(+), 37 deletions(-)

diff --git a/drivers/net/ethernet/broadcom/bnx2x/bnx2x_sp.c b/drivers/net/ethernet/broadcom/bnx2x/bnx2x_sp.c
index d468380c2a23..4947a9cbf0c1 100644
--- a/drivers/net/ethernet/broadcom/bnx2x/bnx2x_sp.c
+++ b/drivers/net/ethernet/broadcom/bnx2x/bnx2x_sp.c
@@ -2606,8 +2606,23 @@ struct bnx2x_mcast_bin_elem {
 	int type; /* BNX2X_MCAST_CMD_SET_{ADD, DEL} */
 };
 
+union bnx2x_mcast_elem {
+	struct bnx2x_mcast_bin_elem bin_elem;
+	struct bnx2x_mcast_mac_elem mac_elem;
+};
+
+struct bnx2x_mcast_elem_group {
+	struct list_head mcast_group_link;
+	union bnx2x_mcast_elem mcast_elems[];
+};
+
+#define MCAST_MAC_ELEMS_PER_PG \
+	((PAGE_SIZE - sizeof(struct bnx2x_mcast_elem_group)) / \
+	sizeof(union bnx2x_mcast_elem))
+
 struct bnx2x_pending_mcast_cmd {
 	struct list_head link;
+	struct list_head group_head;
 	int type; /* BNX2X_MCAST_CMD_X */
 	union {
 		struct list_head macs_head;
@@ -2638,16 +2653,29 @@ static int bnx2x_mcast_wait(struct bnx2x *bp,
 	return 0;
 }
 
+static void bnx2x_free_groups(struct list_head *mcast_group_list)
+{
+	struct bnx2x_mcast_elem_group *current_mcast_group;
+
+	while (!list_empty(mcast_group_list)) {
+		current_mcast_group = list_first_entry(mcast_group_list,
+				      struct bnx2x_mcast_elem_group,
+				      mcast_group_link);
+		list_del(&current_mcast_group->mcast_group_link);
+		free_page((unsigned long)current_mcast_group);
+	}
+}
+
 static int bnx2x_mcast_enqueue_cmd(struct bnx2x *bp,
 				   struct bnx2x_mcast_obj *o,
 				   struct bnx2x_mcast_ramrod_params *p,
 				   enum bnx2x_mcast_cmd cmd)
 {
-	int total_sz;
 	struct bnx2x_pending_mcast_cmd *new_cmd;
-	struct bnx2x_mcast_mac_elem *cur_mac = NULL;
 	struct bnx2x_mcast_list_elem *pos;
-	int macs_list_len = 0, macs_list_len_size;
+	struct bnx2x_mcast_elem_group *elem_group;
+	struct bnx2x_mcast_mac_elem *mac_elem;
+	int total_elems = 0, macs_list_len = 0, offset = 0;
 
 	/* When adding MACs we'll need to store their values */
 	if (cmd == BNX2X_MCAST_CMD_ADD || cmd == BNX2X_MCAST_CMD_SET)
@@ -2657,50 +2685,61 @@ static int bnx2x_mcast_enqueue_cmd(struct bnx2x *bp,
 	if (!p->mcast_list_len)
 		return 0;
 
-	/* For a set command, we need to allocate sufficient memory for all
-	 * the bins, since we can't analyze at this point how much memory would
-	 * be required.
-	 */
-	macs_list_len_size = macs_list_len *
-			     sizeof(struct bnx2x_mcast_mac_elem);
-	if (cmd == BNX2X_MCAST_CMD_SET) {
-		int bin_size = BNX2X_MCAST_BINS_NUM *
-			       sizeof(struct bnx2x_mcast_bin_elem);
-
-		if (bin_size > macs_list_len_size)
-			macs_list_len_size = bin_size;
-	}
-	total_sz = sizeof(*new_cmd) + macs_list_len_size;
-
 	/* Add mcast is called under spin_lock, thus calling with GFP_ATOMIC */
-	new_cmd = kzalloc(total_sz, GFP_ATOMIC);
-
+	new_cmd = kzalloc(sizeof(*new_cmd), GFP_ATOMIC);
 	if (!new_cmd)
 		return -ENOMEM;
 
-	DP(BNX2X_MSG_SP, "About to enqueue a new %d command. macs_list_len=%d\n",
-	   cmd, macs_list_len);
-
 	INIT_LIST_HEAD(&new_cmd->data.macs_head);
-
+	INIT_LIST_HEAD(&new_cmd->group_head);
 	new_cmd->type = cmd;
 	new_cmd->done = false;
 
+	DP(BNX2X_MSG_SP, "About to enqueue a new %d command. macs_list_len=%d\n",
+	   cmd, macs_list_len);
+
 	switch (cmd) {
 	case BNX2X_MCAST_CMD_ADD:
 	case BNX2X_MCAST_CMD_SET:
-		cur_mac = (struct bnx2x_mcast_mac_elem *)
-			  ((u8 *)new_cmd + sizeof(*new_cmd));
-
-		/* Push the MACs of the current command into the pending command
-		 * MACs list: FIFO
+		/* For a set command, we need to allocate sufficient memory for
+		 * all the bins, since we can't analyze at this point how much
+		 * memory would be required.
 		 */
+		total_elems = macs_list_len;
+		if (cmd == BNX2X_MCAST_CMD_SET) {
+			if (total_elems < BNX2X_MCAST_BINS_NUM)
+				total_elems = BNX2X_MCAST_BINS_NUM;
+		}
+		while (total_elems > 0) {
+			elem_group = (struct bnx2x_mcast_elem_group *)
+				     __get_free_page(GFP_ATOMIC | __GFP_ZERO);
+			if (!elem_group) {
+				bnx2x_free_groups(&new_cmd->group_head);
+				kfree(new_cmd);
+				return -ENOMEM;
+			}
+			total_elems -= MCAST_MAC_ELEMS_PER_PG;
+			list_add_tail(&elem_group->mcast_group_link,
+				      &new_cmd->group_head);
+		}
+		elem_group = list_first_entry(&new_cmd->group_head,
+					      struct bnx2x_mcast_elem_group,
+					      mcast_group_link);
 		list_for_each_entry(pos, &p->mcast_list, link) {
-			memcpy(cur_mac->mac, pos->mac, ETH_ALEN);
-			list_add_tail(&cur_mac->link, &new_cmd->data.macs_head);
-			cur_mac++;
+			mac_elem = &elem_group->mcast_elems[offset].mac_elem;
+			memcpy(mac_elem->mac, pos->mac, ETH_ALEN);
+			/* Push the MACs of the current command into the pending
+			 * command MACs list: FIFO
+			 */
+			list_add_tail(&mac_elem->link,
+				      &new_cmd->data.macs_head);
+			offset++;
+			if (offset == MCAST_MAC_ELEMS_PER_PG) {
+				offset = 0;
+				elem_group = list_next_entry(elem_group,
+							     mcast_group_link);
+			}
 		}
-
 		break;
 
 	case BNX2X_MCAST_CMD_DEL:
@@ -2978,7 +3017,8 @@ bnx2x_mcast_hdl_pending_set_e2_convert(struct bnx2x *bp,
 	u64 cur[BNX2X_MCAST_VEC_SZ], req[BNX2X_MCAST_VEC_SZ];
 	struct bnx2x_mcast_mac_elem *pmac_pos, *pmac_pos_n;
 	struct bnx2x_mcast_bin_elem *p_item;
-	int i, cnt = 0, mac_cnt = 0;
+	struct bnx2x_mcast_elem_group *elem_group;
+	int cnt = 0, mac_cnt = 0, offset = 0, i;
 
 	memset(req, 0, sizeof(u64) * BNX2X_MCAST_VEC_SZ);
 	memcpy(cur, o->registry.aprox_match.vec,
@@ -3001,9 +3041,10 @@ bnx2x_mcast_hdl_pending_set_e2_convert(struct bnx2x *bp,
 	 * a list that will be used to configure bins.
 	 */
 	cmd_pos->set_convert = true;
-	p_item = (struct bnx2x_mcast_bin_elem *)(cmd_pos + 1);
 	INIT_LIST_HEAD(&cmd_pos->data.macs_head);
-
+	elem_group = list_first_entry(&cmd_pos->group_head,
+				      struct bnx2x_mcast_elem_group,
+				      mcast_group_link);
 	for (i = 0; i < BNX2X_MCAST_BINS_NUM; i++) {
 		bool b_current = !!BIT_VEC64_TEST_BIT(cur, i);
 		bool b_required = !!BIT_VEC64_TEST_BIT(req, i);
@@ -3011,12 +3052,18 @@ bnx2x_mcast_hdl_pending_set_e2_convert(struct bnx2x *bp,
 		if (b_current == b_required)
 			continue;
 
+		p_item = &elem_group->mcast_elems[offset].bin_elem;
 		p_item->bin = i;
 		p_item->type = b_required ? BNX2X_MCAST_CMD_SET_ADD
 					  : BNX2X_MCAST_CMD_SET_DEL;
 		list_add_tail(&p_item->link , &cmd_pos->data.macs_head);
-		p_item++;
 		cnt++;
+		offset++;
+		if (offset == MCAST_MAC_ELEMS_PER_PG) {
+			offset = 0;
+			elem_group = list_next_entry(elem_group,
+						     mcast_group_link);
+		}
 	}
 
 	/* We now definitely know how many commands are hiding here.
@@ -3103,6 +3150,7 @@ static inline int bnx2x_mcast_handle_pending_cmds_e2(struct bnx2x *bp,
 		 */
 		if (cmd_pos->done) {
 			list_del(&cmd_pos->link);
+			bnx2x_free_groups(&cmd_pos->group_head);
 			kfree(cmd_pos);
 		}
 
@@ -3741,6 +3789,7 @@ static inline int bnx2x_mcast_handle_pending_cmds_e1(
 	}
 
 	list_del(&cmd_pos->link);
+	bnx2x_free_groups(&cmd_pos->group_head);
 	kfree(cmd_pos);
 
 	return cnt;
-- 
2.6.1

^ permalink raw reply related

* Re: [PATCH] L2TP:Adjust intf MTU, add underlay L3, overlay L2
From: R. Parameswaran @ 2016-09-22 21:19 UTC (permalink / raw)
  To: R. Parameswaran, kleptog, jchapman, mostrows, acme, netdev, davem,
	linux-kernel, nprachan, rshearma, stephen
In-Reply-To: <20160922085316.GA11264@dfawcus.brocade.com>



On Thu, 22 Sep 2016, Derek Fawcus wrote:

> On Wed, Sep 21, 2016 at 02:11:04pm -0700, R. Parameswaran wrote:
> > 
> [snip]
> 
> > @@ -206,6 +209,46 @@ static void l2tp_eth_show(struct seq_file *m, void
> > *arg)
> >  }
> >  #endif
> [snip]
> 
> > +
> >  static int l2tp_eth_create(struct net *net, u32 tunnel_id, u32 session_id,
> > u32 peer_session_id, struct l2tp_session_cfg *cfg)
> >  {
> >  	struct net_device *dev;
> > @@ -255,11 +298,8 @@ static int l2tp_eth_create(struct net *net, u32
> > tunnel_id, u32 session_id, u32 p
> >  	}
> > 
> 
> Your diff has whitespace errors,  probably where your MUA has decided to do
> 'intelligent' line wrapping.
> You should (re)send from a proper MUA which does not suffer from this issue.
> 
> DF
> 

Reposted the patch fixing this, and after rebasing the patch to the 
dmiller 'net' tree, verified that 'git am -c' applies the reposted patch 
successfully (after email header is removed) - thanks for identifying 
this.

regards,

Ramkumar

^ permalink raw reply

* Re: [PATCH net-next] net/vxlan: Avoid unaligned access in vxlan_build_skb()
From: Sowmini Varadhan @ 2016-09-22 21:30 UTC (permalink / raw)
  To: David Miller; +Cc: jbenc, netdev, hannes, aduyck, daniel, pabeni
In-Reply-To: <20160922.015242.735026657310158125.davem@davemloft.net>

On (09/22/16 01:52), David Miller wrote:
> Alternatively we can do Alexander Duyck's trick, by pushing
> the headers into the frag list, forcing a pull and realignment
> by the next protocol layer.

What is the "Alexander Duyck trick" (hints about module or commit id,
where this can be found, please)?

Is this basically about, e.g., putting the vxlanhdr in its own
skb_frag_t, or something else?

--Sowmini

^ permalink raw reply

* rfc: Are any of the seq_pad() uses really necessary?
From: Joe Perches @ 2016-09-22 21:52 UTC (permalink / raw)
  To: netdev; +Cc: Tetsuo Handa, LKML

$ git grep -w seq_pad net
net/ipv4/fib_trie.c:            seq_pad(seq, '\n');
net/ipv4/ping.c:        seq_pad(seq, '\n');
net/ipv4/tcp_ipv4.c:    seq_pad(seq, '\n');
net/ipv4/udp.c: seq_pad(seq, '\n');
net/phonet/socket.c:    seq_pad(seq, '\n');
net/phonet/socket.c:    seq_pad(seq, '\n');
net/sctp/objcnt.c:      seq_pad(seq, '\n');

What these uses do is pad the line with trailing blanks up to a
preset width and then append a newline.

None of these trailing pad bytes seem useful to me.

Are there really tools that expect specific line widths
when reading from things like /proc/<pid>/net/<file>?

For instance:

$ cat /proc/<pid>/net/udp
  sl  local_address rem_address   st tx_queue rx_queue tr tm->when retrnsmt   uid  timeout inode ref pointer drops             
  484: 00000000:14E9 00000000:0000 07 00000000:00000000 00:00000000 00000000   111        0 16961 2 0000000000000000 0         
  486: 00000000:14EB 00000000:0000 07 00000000:00000000 00:00000000 00000000   102        0 2022599 2 0000000000000000 0       
  788: 00000000:A619 00000000:0000 07 00000000:00000000 00:00000000 00000000  1000        0 4390482 2 0000000000000000 0       
 3081: 00000000:8F0E 00000000:0000 07 00000000:00000000 00:00000000 00000000   111        0 16963 2 0000000000000000 0         
 3376: 3500007F:0035 00000000:0000 07 00000000:00000000 00:00000000 00000000   102        0 2022601 2 0000000000000000 0       
 3391: 00000000:0044 00000000:0000 07 00000000:00000000 00:00000000 00000000     0        0 4546167 2 0000000000000000 0       

These seq_pad uses were modified by:

From 652586df95e5d76b37d07a11839126dcfede1621 Mon Sep 17 00:00:00 2001
From: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Date: Thu, 14 Nov 2013 14:31:57 -0800
Subject: [PATCH] seq_file: remove "%n" usage from seq_file users

All seq_printf() users are using "%n" for calculating padding size,
convert them to use seq_setwidth() / seq_pad() pair.

Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Signed-off-by: Kees Cook <keescook@chromium.org>
Cc: Joe Perches <joe@perches.com>
Cc: David Miller <davem@davemloft.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

If these are really necessary, then maybe the seq_pad function
could be optimized using a memset instead of
	seq_printf(, "%*s", len, "");
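For illustration only, here is a userspace sketch of the two padding
strategies. It is not the kernel code: struct demo_seq and every demo_*
helper are hypothetical stand-ins, and the width of 127 used in the usage
note is simply a value picked for the example. The sketch assumes padding
means "fill with blanks up to a remembered offset, then append one
character", as described above.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical userspace stand-in for a seq_file buffer (not the kernel struct). */
struct demo_seq {
	char buf[256];
	size_t count;      /* bytes written so far */
	size_t pad_until;  /* offset the current record should be padded to */
};

/* Remember where the record starting at the current offset should end. */
static void demo_setwidth(struct demo_seq *m, size_t width)
{
	m->pad_until = m->count + width;
}

static void demo_putc(struct demo_seq *m, char c)
{
	assert(m->count < sizeof(m->buf));
	m->buf[m->count++] = c;
}

static void demo_puts(struct demo_seq *m, const char *s)
{
	while (*s)
		demo_putc(m, *s++);
}

/* Padding through the formatter, mirroring the "%*s" approach. */
static void demo_pad_printf(struct demo_seq *m, char c)
{
	if (m->pad_until > m->count) {
		size_t room = sizeof(m->buf) - m->count;
		int len = (int)(m->pad_until - m->count);
		int n = snprintf(m->buf + m->count, room, "%*s", len, "");

		/* snprintf also writes a NUL, so it needs len + 1 bytes of room. */
		assert(n == len && (size_t)n < room);
		m->count += (size_t)n;
	}
	if (c)
		demo_putc(m, c);
}

/* Same padding with a plain memset, as suggested in the message. */
static void demo_pad_memset(struct demo_seq *m, char c)
{
	if (m->pad_until > m->count) {
		size_t len = m->pad_until - m->count;

		assert(m->pad_until <= sizeof(m->buf));
		memset(m->buf + m->count, ' ', len);
		m->count += len;
	}
	if (c)
		demo_putc(m, c);
}
```

Calling demo_setwidth(&m, 127), writing a short record with demo_puts(),
and then calling either pad helper with '\n' should leave the same bytes
in the buffer; the memset variant just skips the format-string parsing.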

^ permalink raw reply

* [PATCH 1/7] hv_netvsc: use consume_skb
From: sthemmin @ 2016-09-22 23:56 UTC (permalink / raw)
  To: K. Y. Srinivasan, Haiyang Zhang, davem; +Cc: netdev, Stephen Hemminger
In-Reply-To: <1474588595-16054-1-git-send-email-sthemmin@exchange.microsoft.com>

From: Stephen Hemminger <sthemmin@microsoft.com>

Packets that are transmitted in normal path should use consume_skb
instead of kfree_skb. This allows for better tracing of packet drops.

Signed-off-by: Stephen Hemminger <sthemmin@microsoft.com>
---
 drivers/net/hyperv/netvsc.c |    4 ++--
 1 files changed, 2 insertions(+), 2 deletions(-)

diff --git a/drivers/net/hyperv/netvsc.c b/drivers/net/hyperv/netvsc.c
index ff05b9b..720b5fa 100644
--- a/drivers/net/hyperv/netvsc.c
+++ b/drivers/net/hyperv/netvsc.c
@@ -635,7 +635,7 @@ static void netvsc_send_tx_complete(struct netvsc_device *net_device,
 		q_idx = nvsc_packet->q_idx;
 		channel = incoming_channel;
 
-		dev_kfree_skb_any(skb);
+		dev_consume_skb_any(skb);
 	}
 
 	num_outstanding_sends =
@@ -944,7 +944,7 @@ int netvsc_send(struct hv_device *device,
 		}
 
 		if (msdp->skb)
-			dev_kfree_skb_any(msdp->skb);
+			dev_consume_skb_any(msdp->skb);
 
 		if (xmit_more && !packet->cp_partial) {
 			msdp->skb = skb;
-- 
1.7.4.1
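
As a rough illustration of why the distinction in this patch matters for
tracing, the following userspace model counts "drop" and "consume" events
separately. It is only a sketch: struct demo_skb, the counters and the
demo_* functions are hypothetical and do not correspond to kernel code.
The assumption modelled here is that the free path used for dropped
packets is what drop tracing observes, while the consume path is not.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical packet buffer; only the lifetime matters in this model. */
struct demo_skb {
	int len;
};

/* Stand-ins for what a drop tracer and a consume tracer would observe. */
static unsigned long demo_drop_events;
static unsigned long demo_consume_events;

static struct demo_skb *demo_alloc_skb(int len)
{
	struct demo_skb *skb = malloc(sizeof(*skb));

	assert(skb != NULL);
	skb->len = len;
	return skb;
}

/* Free a packet that was dropped: visible to drop tracing. */
static void demo_kfree_skb(struct demo_skb *skb)
{
	if (!skb)
		return;
	demo_drop_events++;
	free(skb);
}

/* Free a packet that was sent normally: not reported as a drop. */
static void demo_consume_skb(struct demo_skb *skb)
{
	if (!skb)
		return;
	demo_consume_events++;
	free(skb);
}
```

In this model, a transmit-completion path that calls demo_kfree_skb()
for every packet would inflate the drop counter even though nothing was
lost, which is the noise the patch avoids.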

^ permalink raw reply related

* [PATCH 3/7] hv_netvsc: simplify callback event code
From: sthemmin @ 2016-09-22 23:56 UTC (permalink / raw)
  To: K. Y. Srinivasan, Haiyang Zhang, davem; +Cc: netdev, Stephen Hemminger
In-Reply-To: <1474588595-16054-1-git-send-email-sthemmin@exchange.microsoft.com>

From: Stephen Hemminger <sthemmin@microsoft.com>

The callback handler for netlink events can be simplified:
 * Consolidate check for netlink callback events about this driver itself.
 * Ignore non-Ethernet devices.

Signed-off-by: Stephen Hemminger <sthemmin@microsoft.com>
---
 drivers/net/hyperv/netvsc_drv.c |   28 ++++++++++------------------
 1 files changed, 10 insertions(+), 18 deletions(-)

diff --git a/drivers/net/hyperv/netvsc_drv.c b/drivers/net/hyperv/netvsc_drv.c
index e74dbcc..849b566 100644
--- a/drivers/net/hyperv/netvsc_drv.c
+++ b/drivers/net/hyperv/netvsc_drv.c
@@ -1238,10 +1238,6 @@ static int netvsc_register_vf(struct net_device *vf_netdev)
 	struct net_device *ndev;
 	struct net_device_context *net_device_ctx;
 	struct netvsc_device *netvsc_dev;
-	const struct ethtool_ops *eth_ops = vf_netdev->ethtool_ops;
-
-	if (eth_ops == NULL || eth_ops == &ethtool_ops)
-		return NOTIFY_DONE;
 
 	/*
 	 * We will use the MAC address to locate the synthetic interface to
@@ -1286,12 +1282,8 @@ static int netvsc_vf_up(struct net_device *vf_netdev)
 {
 	struct net_device *ndev;
 	struct netvsc_device *netvsc_dev;
-	const struct ethtool_ops *eth_ops = vf_netdev->ethtool_ops;
 	struct net_device_context *net_device_ctx;
 
-	if (eth_ops == &ethtool_ops)
-		return NOTIFY_DONE;
-
 	ndev = get_netvsc_net_device(vf_netdev->dev_addr);
 	if (!ndev)
 		return NOTIFY_DONE;
@@ -1329,10 +1321,6 @@ static int netvsc_vf_down(struct net_device *vf_netdev)
 	struct net_device *ndev;
 	struct netvsc_device *netvsc_dev;
 	struct net_device_context *net_device_ctx;
-	const struct ethtool_ops *eth_ops = vf_netdev->ethtool_ops;
-
-	if (eth_ops == &ethtool_ops)
-		return NOTIFY_DONE;
 
 	ndev = get_netvsc_net_device(vf_netdev->dev_addr);
 	if (!ndev)
@@ -1361,12 +1349,8 @@ static int netvsc_unregister_vf(struct net_device *vf_netdev)
 {
 	struct net_device *ndev;
 	struct netvsc_device *netvsc_dev;
-	const struct ethtool_ops *eth_ops = vf_netdev->ethtool_ops;
 	struct net_device_context *net_device_ctx;
 
-	if (eth_ops == &ethtool_ops)
-		return NOTIFY_DONE;
-
 	ndev = get_netvsc_net_device(vf_netdev->dev_addr);
 	if (!ndev)
 		return NOTIFY_DONE;
@@ -1542,13 +1526,21 @@ static int netvsc_netdev_event(struct notifier_block *this,
 {
 	struct net_device *event_dev = netdev_notifier_info_to_dev(ptr);
 
+	/* Skip our own events */
+	if (event_dev->netdev_ops == &device_ops)
+		return NOTIFY_DONE;
+
+	/* Avoid non-Ethernet type devices */
+	if (event_dev->type != ARPHRD_ETHER)
+		return NOTIFY_DONE;
+
 	/* Avoid Vlan dev with same MAC registering as VF */
 	if (event_dev->priv_flags & IFF_802_1Q_VLAN)
 		return NOTIFY_DONE;
 
 	/* Avoid Bonding master dev with same MAC registering as VF */
-	if (event_dev->priv_flags & IFF_BONDING &&
-	    event_dev->flags & IFF_MASTER)
+	if ((event_dev->priv_flags & IFF_BONDING) &&
+	    (event_dev->flags & IFF_MASTER))
 		return NOTIFY_DONE;
 
 	switch (event) {
-- 
1.7.4.1
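
The consolidated filtering in netvsc_netdev_event() can be modelled in
userspace roughly as follows. This is a sketch under stated assumptions:
struct demo_dev, struct demo_ops and the DEMO_* constants are hypothetical
names with illustrative values, not the kernel definitions. Note that in C
`&` binds more tightly than `&&`, so the parentheses added in the bonding
check improve readability without changing the result.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Illustrative values only; these are not the kernel's constants. */
#define DEMO_TYPE_ETHER    1u
#define DEMO_TYPE_OTHER    2u
#define DEMO_PRIV_VLAN     (1u << 0)
#define DEMO_PRIV_BONDING  (1u << 1)
#define DEMO_FLAG_MASTER   (1u << 2)

/* Hypothetical ops table; only its address is compared. */
struct demo_ops {
	int unused;
};

static const struct demo_ops demo_own_ops = { 0 };   /* "this driver" */
static const struct demo_ops demo_other_ops = { 0 }; /* some other driver */

/* Hypothetical, heavily reduced view of a network device. */
struct demo_dev {
	const struct demo_ops *ops;
	unsigned int type;
	unsigned int flags;
	unsigned int priv_flags;
};

/* Return true if an event for this device should be processed further. */
static bool demo_event_is_candidate(const struct demo_dev *dev)
{
	assert(dev != NULL);

	/* Skip events about the driver's own devices. */
	if (dev->ops == &demo_own_ops)
		return false;

	/* Skip non-Ethernet devices. */
	if (dev->type != DEMO_TYPE_ETHER)
		return false;

	/* Skip VLAN devices that may share the MAC address. */
	if (dev->priv_flags & DEMO_PRIV_VLAN)
		return false;

	/* Skip bonding masters that may share the MAC address. */
	if ((dev->priv_flags & DEMO_PRIV_BONDING) &&
	    (dev->flags & DEMO_FLAG_MASTER))
		return false;

	return true;
}
```

Doing the ownership test once in the event handler is what lets the
per-callback ethtool_ops comparisons be removed in the hunks above.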

^ permalink raw reply related

