Netdev List

Netdev List
 help / color / mirror / Atom feed

* Re: [PATCH iproute2 1/2] devlink: Add usage help for eswitch subcommand
From: Stephen Hemminger @ 2016-11-30  3:19 UTC (permalink / raw)
  To: Roi Dayan; +Cc: netdev, Or Gerlitz
In-Reply-To: <1480245663-814-2-git-send-email-roid@mellanox.com>

On Sun, 27 Nov 2016 13:21:02 +0200
Roi Dayan <roid@mellanox.com> wrote:

> Add missing usage help for devlink dev eswitch subcommand.
> 
> Signed-off-by: Roi Dayan <roid@mellanox.com>
> Reviewed-by: Or Gerlitz <ogerlitz@mellanox.com>

Ok. Applied both, will show in next merge

^ permalink raw reply

* [net-next v2] neigh: remove duplicate check for same neigh
From: Zhang Shengju @ 2016-11-30  3:24 UTC (permalink / raw)
  To: netdev, dsa

Currently loop index 'idx' is used as the index in the neigh list of interest.
It's increased only when the neigh is dumped. It's not the absolute index in
the list. Because there is no info to record which neigh has already be scanned
by previous loop. This will cause the filtered out neighs to be scanned mulitple
times.

This patch make idx as the absolute index in the list, it will increase no matter
whether the neigh is filtered. This will prevent the above problem.

And this is in line with other dump functions.

v2:
 - take David Ahern's advice to do simple change

Signed-off-by: Zhang Shengju <zhangshengju@cmss.chinamobile.com>
---
 net/core/neighbour.c | 15 +++++----------
 1 file changed, 5 insertions(+), 10 deletions(-)

diff --git a/net/core/neighbour.c b/net/core/neighbour.c
index 2ae929f..782dd86 100644
--- a/net/core/neighbour.c
+++ b/net/core/neighbour.c
@@ -2291,13 +2291,10 @@ static int neigh_dump_table(struct neigh_table *tbl, struct sk_buff *skb,
 		for (n = rcu_dereference_bh(nht->hash_buckets[h]), idx = 0;
 		     n != NULL;
 		     n = rcu_dereference_bh(n->next)) {
-			if (!net_eq(dev_net(n->dev), net))
-				continue;
-			if (neigh_ifindex_filtered(n->dev, filter_idx))
-				continue;
-			if (neigh_master_filtered(n->dev, filter_master_idx))
-				continue;
-			if (idx < s_idx)
+			if (idx < s_idx || !net_eq(dev_net(n->dev), net))
+				goto next;
+			if (neigh_ifindex_filtered(n->dev, filter_idx) ||
+			    neigh_master_filtered(n->dev, filter_master_idx))
 				goto next;
 			if (neigh_fill_info(skb, n, NETLINK_CB(cb->skb).portid,
 					    cb->nlh->nlmsg_seq,
@@ -2332,9 +2329,7 @@ static int pneigh_dump_table(struct neigh_table *tbl, struct sk_buff *skb,
 		if (h > s_h)
 			s_idx = 0;
 		for (n = tbl->phash_buckets[h], idx = 0; n; n = n->next) {
-			if (pneigh_net(n) != net)
-				continue;
-			if (idx < s_idx)
+			if (idx < s_idx || pneigh_net(n) != net)
 				goto next;
 			if (pneigh_fill_info(skb, n, NETLINK_CB(cb->skb).portid,
 					    cb->nlh->nlmsg_seq,
-- 
1.8.3.1

^ permalink raw reply related

* Re: [PATCH iproute2 V2 1/2] tc/cls_flower: Classify packet in ip tunnels
From: Stephen Hemminger @ 2016-11-30  3:26 UTC (permalink / raw)
  To: Amir Vadai
  Cc: netdev, David S. Miller, Jiri Benc, Or Gerlitz, Hadar Har-Zion,
	Roi Dayan
In-Reply-To: <20161128125136.3393-2-amir@vadai.me>

The overall design is fine, just a couple nits with the code.

> diff --git a/tc/f_flower.c b/tc/f_flower.c
> index 2d31d1aa832d..1cf0750b5b83 100644
> --- a/tc/f_flower.c
> +++ b/tc/f_flower.c

>  
> +static int flower_parse_key_id(char *str, int type, struct nlmsghdr *n)

str is not modified, therefore use: const char *str

> +{
> +	int ret;
> +	__be32 key_id;
> +
> +	ret = get_be32(&key_id, str, 10);
> +	if (ret)
> +		return -1;

Traditionally netlink attributes are in host order, why was flower
chosen to be different?

> +
> +	addattr32(n, MAX_MSG, type, key_id);
> +
> +	return 0;


Why lose the return value here?  Why not:

	ret = get_be32(&key_id, str, 10);
	if (!ret)
		addattr32(n, MAX_MSG, type, key_id);

> +}
> +
>  static int flower_parse_opt(struct filter_util *qu, char *handle,
>  			    int argc, char **argv, struct nlmsghdr *n)
>  {
> @@ -339,6 +359,38 @@ static int flower_parse_opt(struct filter_util *qu, char *handle,
>  				fprintf(stderr, "Illegal \"src_port\"\n");
>  				return -1;
>  			}
> +		} else if (matches(*argv, "enc_dst_ip") == 0) {
> +			NEXT_ARG();
> +			ret = flower_parse_ip_addr(*argv, 0,
> +						   TCA_FLOWER_KEY_ENC_IPV4_DST,
> +						   TCA_FLOWER_KEY_ENC_IPV4_DST_MASK,
> +						   TCA_FLOWER_KEY_ENC_IPV6_DST,
> +						   TCA_FLOWER_KEY_ENC_IPV6_DST_MASK,
> +						   n);
> +			if (ret < 0) {
> +				fprintf(stderr, "Illegal \"enc_dst_ip\"\n");
> +				return -1;
> +			}
> +		} else if (matches(*argv, "enc_src_ip") == 0) {
> +			NEXT_ARG();
> +			ret = flower_parse_ip_addr(*argv, 0,
> +						   TCA_FLOWER_KEY_ENC_IPV4_SRC,
> +						   TCA_FLOWER_KEY_ENC_IPV4_SRC_MASK,
> +						   TCA_FLOWER_KEY_ENC_IPV6_SRC,
> +						   TCA_FLOWER_KEY_ENC_IPV6_SRC_MASK,
> +						   n);
> +			if (ret < 0) {
> +				fprintf(stderr, "Illegal \"enc_src_ip\"\n");
> +				return -1;
> +			}
> +		} else if (matches(*argv, "enc_key_id") == 0) {
> +			NEXT_ARG();
> +			ret = flower_parse_key_id(*argv,
> +						  TCA_FLOWER_KEY_ENC_KEY_ID, n);
> +			if (ret < 0) {
> +				fprintf(stderr, "Illegal \"enc_key_id\"\n");
> +				return -1;
> +			}
>  		} else if (matches(*argv, "action") == 0) {
>  			NEXT_ARG();
>  			ret = parse_action(&argc, &argv, TCA_FLOWER_ACT, n);
> @@ -509,6 +561,14 @@ static void flower_print_port(FILE *f, char *name, __u8 ip_proto,
>  	fprintf(f, "\n  %s %d", name, ntohs(rta_getattr_u16(attr)));
>  }
>  
> +static void flower_print_key_id(FILE *f, char *name,
> +				struct rtattr *attr)

const char *name?


> +{
> +	if (!attr)
> +		return;
> +	fprintf(f, "\n  %s %d", name, ntohl(rta_getattr_u32(attr)));
> +}
> +

Why short circuit, just change the order:

	if (attr)
		fprintf(f, "\n  %s %s", name, ntohl(rta_getattr_u32(attr));

You might also want to introduce rta_getattr_be32()

Please change, retest and resubmit both patches.

^ permalink raw reply

* Re: [PATCH net-next 5/5] udp: add recvmmsg implementation
From: Hannes Frederic Sowa @ 2016-11-30  3:47 UTC (permalink / raw)
  To: David Miller; +Cc: pabeni, netdev, edumazet, brouer, sd
In-Reply-To: <20161129.192233.392359849404330043.davem@davemloft.net>

Hello,

On Wed, Nov 30, 2016, at 01:22, David Miller wrote:
> From: Hannes Frederic Sowa <hannes@stressinduktion.org>
> Date: Fri, 25 Nov 2016 18:09:00 +0100
> 
> > During review we discussed on how to handle major errors in the kernel:
> > 
> > The old code and the new code still can report back success even though
> > the kernel got back an EFAULT while copying from kernel space to user
> > space (due to bad pointers).
> > 
> > I favor that we drop all packets (also the already received batches) in
> > this case and let the code report -EFAULT and increase sk_drops for all
> > dropped packets from the queue.
> > 
> > Currently sk_err is set so the next syscall would get an -EFAULT, which
> > seems very bad and can also be overwritten by incoming icmp packets, so
> > we never get a notification that we actually had a bad pointer somewhere
> > in the mmsghdr. Also delivering -EFAULT on the follow-up syscalls really
> > will make people confused that use strace.
> > 
> > If people would like to know the amount of packets dropped we can make
> > sk_drops readable by an getsockopt.
> > 
> > Thoughts?
> > 
> > Unfortunately the interface doesn't allow for better error handling.
> 
> I think this is a major problem.
> 
> If, as a side effect of batch dequeueing the SKBs from the socket,
> you cannot stop properly mid-transfer if an error occurs, well then
> you simply cannot batch like that.
> 
> You have to stop the exact byte where an error occurs mid-stream,
> return the successful amount of bytes transferred, and then return
> the error on the next recvmmsg call.
> 
> There is no other sane error reporting strategy.

Actually I think there is no sane error handling strategy at all.

SIGSEGV and EFAULT should be delivered reliable in my opinion and all
the details become very difficult suddenly.

E.g. if we recvmmsg with -EFAULT and we try to deliver the fault on the
following socket call, I am pretty certain most programs don't bother
with close() return values, so the application might simply ignore it.
Also -EFAULT is not in our repository for error codes to return.

In case of UDP we can simply drop the packets and I would be okay with
that (in some cases we actually guarantee correctly ordered packets,
even for UDP, so we can't simply queue those packets back).

Also I would very much prefer ptrace/gdb to show me the syscall where
the memory management fault happened and not the next one.

> If I get 4 frames, and the kernel can successfully copy the first
> three and get an -EFAULT on the 4th.  Dammit you better tell the
> application this so it can properly process the first 3 packets and
> then determine how it is going to error out and recover for the 4th
> one.
> 
> If we need to add prioritized sk_err stuff, or another value like
> "sk_app_err" to handle the ICMP vs. -EFAULT issue, so be it.
> 
> I know what you guys are thinking, in that you can't figure out a
> way to avoid the transactional overhead if it is necessary to
> "put back" some SKBs if one of them in the batch gets a fault.

I prefer correctness over performance all the time. :)

> That's too bad, we need a proper implementation and proper error
> reporting.  Those performance numbers are useless if we effectively
> lose error notifications.

We have those problems right now and besides deprecating the syscalls I
have no idea how to fix this reliably and would probably need a lot of
changes (besides the sk_app_err solution, which I don't really favor at
all).

The syscall should have been designed in a way that the struct mmsghdr
-> msg_len would be ssize_t, so we could return error codes per fragment
and test before starting the batch that we have proper memory, so we
don't fail in the management code. :(

Bye,
Hannes

^ permalink raw reply

* Re: [RFC PATCH net-next v2] ipv6: implement consistent hashing for equal-cost multipath routing
From: Hannes Frederic Sowa @ 2016-11-30  3:52 UTC (permalink / raw)
  To: David Lebrun, netdev
In-Reply-To: <1480439718-18019-1-git-send-email-david.lebrun@uclouvain.be>

Hi,

On Tue, Nov 29, 2016, at 18:15, David Lebrun wrote:
> When multiple nexthops are available for a given route, the routing
> engine
> chooses a nexthop by computing the flow hash through get_hash_from_flowi6
> and by taking that value modulo the number of nexthops. The resulting
> value
> indexes the nexthop to select. This method causes issues when a new
> nexthop
> is added or one is removed (e.g. link failure). In that case, the number
> of nexthops changes and potentially all the flows get re-routed to
> another
> nexthop.
> 
> This patch implements a consistent hash method to select the nexthop in
> case of ECMP. The idea is to generate K slices (or intervals) for each
> route with multiple nexthops. The nexthops are randomly assigned to those
> slices, in a uniform manner. The number K is configurable through a
> sysctl
> net.ipv6.route.ecmp_slices and is always an exponent of 2. To select the
> nexthop, the algorithm takes the flow hash and computes an index which is
> the flow hash modulo K. As K = 2^x, the modulo can be computed using a
> simple binary AND operation (idx = hash & (K - 1)). The resulting index
> references the selected nexthop. The lookup time complexity is thus O(1).
> 
> When a nexthop is added, it steals K/N slices from the other nexthops,
> where N is the new number of nexthops. The slices are stolen randomly and
> uniformly from the other nexthops. When a nexthop is removed, the orphan
> slices are randomly reassigned to the other nexthops.
> 
> The number of slices for a route also fixes the maximum number of
> nexthops
> possible for that route.

In the worst case this causes 2GB (order 19) allocations (x == 31) to
happen in GFP_ATOMIC (due to write lock) context and could cause update
failures to the routing table due to fragmentation. Are you sure the
upper limit of 31 is reasonable? I would very much prefer an upper limit
of below or equal 25 for x to stay within the bounds of the slab
allocators (which is still a lot and probably causes errors!).
Unfortunately because of the nature of the sysctl you can't really
create its own cache for it. :/

Also by design, one day this should all be RCU and having that much data
outstanding worries me a bit during routing table mutation.

I am a fan of consistent hashing but I am not so sure if it belongs into
a generic ECMP implementation or into its own ipvs or netfilter module
where you specifically know how much memory to burn for it.

Also please convert the sysctl to a netlink attribute if you pursue this
because if I change the sysctl while my quagga is hammering the routing
table I would like to know which nodes allocate what amount of memory.

Bye,
Hannes

^ permalink raw reply

* Re: [RFC PATCH net-next v2] ipv6: implement consistent hashing for equal-cost multipath routing
From: Tom Herbert @ 2016-11-30  4:04 UTC (permalink / raw)
  To: David Lebrun; +Cc: Linux Kernel Network Developers
In-Reply-To: <1480439718-18019-1-git-send-email-david.lebrun@uclouvain.be>

On Tue, Nov 29, 2016 at 9:15 AM, David Lebrun <david.lebrun@uclouvain.be> wrote:
> When multiple nexthops are available for a given route, the routing engine
> chooses a nexthop by computing the flow hash through get_hash_from_flowi6
> and by taking that value modulo the number of nexthops. The resulting value
> indexes the nexthop to select. This method causes issues when a new nexthop
> is added or one is removed (e.g. link failure). In that case, the number
> of nexthops changes and potentially all the flows get re-routed to another
> nexthop.
>
This is a lot of code to make ECMP work better. Can you be more
specific as to what the "issues" are? Assuming this is just the
transient packet reorder that happens in one link flap I am wondering
if this complexity is justified.

Tom

> This patch implements a consistent hash method to select the nexthop in
> case of ECMP. The idea is to generate K slices (or intervals) for each
> route with multiple nexthops. The nexthops are randomly assigned to those
> slices, in a uniform manner. The number K is configurable through a sysctl
> net.ipv6.route.ecmp_slices and is always an exponent of 2. To select the
> nexthop, the algorithm takes the flow hash and computes an index which is
> the flow hash modulo K. As K = 2^x, the modulo can be computed using a
> simple binary AND operation (idx = hash & (K - 1)). The resulting index
> references the selected nexthop. The lookup time complexity is thus O(1).
>
> When a nexthop is added, it steals K/N slices from the other nexthops,
> where N is the new number of nexthops. The slices are stolen randomly and
> uniformly from the other nexthops. When a nexthop is removed, the orphan
> slices are randomly reassigned to the other nexthops.
>
> The number of slices for a route also fixes the maximum number of nexthops
> possible for that route.
>
> Signed-off-by: David Lebrun <david.lebrun@uclouvain.be>
> ---
>  include/net/ip6_fib.h    |  25 +++++++
>  include/net/netns/ipv6.h |   1 +
>  net/ipv6/ip6_fib.c       | 174 ++++++++++++++++++++++++++++++++++++++++++++++-
>  net/ipv6/route.c         |  58 ++++++++--------
>  4 files changed, 229 insertions(+), 29 deletions(-)
>
> diff --git a/include/net/ip6_fib.h b/include/net/ip6_fib.h
> index a74e2aa..29594cc 100644
> --- a/include/net/ip6_fib.h
> +++ b/include/net/ip6_fib.h
> @@ -93,6 +93,30 @@ struct rt6key {
>
>  struct fib6_table;
>
> +/* Multipath nexthop.
> + * @nh is the actual nexthop
> + * @nslices is the number of slices assigned to this nexthop
> + */
> +struct rt6_multi_nh {
> +       struct rt6_info *nh;
> +       unsigned int    nslices;
> +};
> +
> +/* Multipath routes map.
> + * @nhs is an array of available nexthops
> + * @size is the number of elements in @nhs
> + * @nslices is the number of slices and the number of elements in @index
> + *         and is always in the form 2^x to prevent using a division for
> + *         the computation of the modulo.
> + * @index is an array mapping a slice index to a nexthop index in @nhs
> + */
> +struct rt6_multi_map {
> +       struct rt6_multi_nh     *nhs;
> +       unsigned int            size;
> +       unsigned int            nslices;
> +       __u8                    *index;
> +};
> +
>  struct rt6_info {
>         struct dst_entry                dst;
>
> @@ -113,6 +137,7 @@ struct rt6_info {
>          */
>         struct list_head                rt6i_siblings;
>         unsigned int                    rt6i_nsiblings;
> +       struct rt6_multi_map            *rt6i_nh_map;
>
>         atomic_t                        rt6i_ref;
>
> diff --git a/include/net/netns/ipv6.h b/include/net/netns/ipv6.h
> index de7745e..4967846 100644
> --- a/include/net/netns/ipv6.h
> +++ b/include/net/netns/ipv6.h
> @@ -36,6 +36,7 @@ struct netns_sysctl_ipv6 {
>         int idgen_retries;
>         int idgen_delay;
>         int flowlabel_state_ranges;
> +       int ip6_rt_ecmp_slices;
>  };
>
>  struct netns_ipv6 {
> diff --git a/net/ipv6/ip6_fib.c b/net/ipv6/ip6_fib.c
> index ef54852..b6b1895 100644
> --- a/net/ipv6/ip6_fib.c
> +++ b/net/ipv6/ip6_fib.c
> @@ -727,6 +727,166 @@ static void fib6_purge_rt(struct rt6_info *rt, struct fib6_node *fn,
>         }
>  }
>
> +static void fib6_mp_free(struct rt6_info *rt)
> +{
> +       struct rt6_multi_map *nh_map = rt->rt6i_nh_map;
> +       struct rt6_info *sibling;
> +
> +       if (nh_map) {
> +               list_for_each_entry(sibling, &rt->rt6i_siblings,
> +                                   rt6i_siblings) {
> +                       sibling->rt6i_nh_map = NULL;
> +               }
> +
> +               rt->rt6i_nh_map = NULL;
> +
> +               kfree(nh_map->nhs);
> +               kfree(nh_map->index);
> +               kfree(nh_map);
> +       }
> +}
> +
> +static int fib6_mp_extend(struct net *net, struct rt6_info *sref,
> +                         struct rt6_info *rt)
> +{
> +       struct rt6_multi_map *nh_map = sref->rt6i_nh_map;
> +       struct rt6_multi_nh *tmp_nhs, *old_nhs;
> +       unsigned int kslices;
> +       unsigned int i, j;
> +
> +       if (!nh_map) {
> +               __u8 *index;
> +
> +               nh_map = kmalloc(sizeof(*nh_map), GFP_ATOMIC);
> +               if (!nh_map)
> +                       return -ENOMEM;
> +
> +               nh_map->size = 1;
> +               nh_map->nslices = (1 << net->ipv6.sysctl.ip6_rt_ecmp_slices);
> +
> +               tmp_nhs = kmalloc(sizeof(*tmp_nhs), GFP_ATOMIC);
> +               if (!tmp_nhs) {
> +                       kfree(nh_map);
> +                       return -ENOMEM;
> +               }
> +
> +               tmp_nhs[0].nh = sref;
> +               tmp_nhs[0].nslices = nh_map->nslices;
> +               nh_map->nhs = tmp_nhs;
> +
> +               index = kmalloc_array(nh_map->nslices, sizeof(*index),
> +                                     GFP_ATOMIC);
> +               if (!index) {
> +                       kfree(nh_map->nhs);
> +                       kfree(nh_map);
> +                       return -ENOMEM;
> +               }
> +
> +               for (i = 0; i < nh_map->nslices; i++)
> +                       index[i] = 0;
> +
> +               nh_map->index = index;
> +               sref->rt6i_nh_map = nh_map;
> +       }
> +
> +       if (nh_map->size == nh_map->nslices)
> +               return -ENOBUFS;
> +
> +       nh_map->size++;
> +
> +       tmp_nhs = kmalloc_array(nh_map->size, sizeof(*tmp_nhs), GFP_ATOMIC);
> +       if (!tmp_nhs)
> +               return -ENOMEM;
> +
> +       for (i = 0; i < nh_map->size - 1; i++)
> +               tmp_nhs[i] = nh_map->nhs[i];
> +
> +       tmp_nhs[nh_map->size - 1].nh = rt;
> +       tmp_nhs[nh_map->size - 1].nslices = 0;
> +
> +       kslices = nh_map->nslices / nh_map->size;
> +
> +       /* Loop over nexthops and steal a random slice until we have kslices for
> +        * the new nexthop. This algorithm ensures that flows are taken as
> +        * equally as possible from current nexthops.
> +        *
> +        * At each iteration, @j is the index in tmp_nhs of the nexthop
> +        * to steal from and @idx is the index of the slice to steal
> +        * among @j's slices.
> +        */
> +       for (i = 0, j = 0; i < kslices; i++) {
> +               unsigned int idx, k = 0;
> +
> +               if (tmp_nhs[j].nslices == 1)
> +                       continue;
> +
> +               idx = (prandom_u32() % tmp_nhs[j].nslices) + 1;
> +               do {
> +                       if (nh_map->index[k++] == j)
> +                               idx--;
> +               } while (idx);
> +
> +               nh_map->index[k - 1] = nh_map->size - 1;
> +               tmp_nhs[nh_map->size - 1].nslices++;
> +               tmp_nhs[j].nslices--;
> +
> +               j = (j + 1) % (nh_map->size - 1);
> +       }
> +
> +       WARN_ON(tmp_nhs[nh_map->size - 1].nslices != kslices);
> +
> +       old_nhs = nh_map->nhs;
> +       nh_map->nhs = tmp_nhs;
> +       kfree(old_nhs);
> +
> +       rt->rt6i_nh_map = nh_map;
> +
> +       return 0;
> +}
> +
> +static int fib6_mp_shrink(struct rt6_info *sref, struct rt6_info *rt)
> +{
> +       struct rt6_multi_map *nh_map = sref->rt6i_nh_map;
> +       struct rt6_multi_nh *tmp_nhs, *old_nhs;
> +       unsigned int i, j, idx = 0;
> +
> +       tmp_nhs = kmalloc_array(nh_map->size - 1, sizeof(*tmp_nhs), GFP_ATOMIC);
> +       if (!tmp_nhs)
> +               return -ENOMEM;
> +
> +       for (i = 0, j = 0; i < nh_map->size; i++) {
> +               if (nh_map->nhs[i].nh != rt)
> +                       tmp_nhs[j++] = nh_map->nhs[i];
> +               else
> +                       idx = i;
> +       }
> +
> +       nh_map->size--;
> +       old_nhs = nh_map->nhs;
> +       nh_map->nhs = tmp_nhs;
> +       kfree(old_nhs);
> +
> +       rt->rt6i_nh_map = NULL;
> +
> +       /* Update the slices index. For each slice mapping to the removed
> +        * nexthop, assign another random nexthop. For each slice mapping
> +        * to a nexthop that was after the removed nexthop, decrement the
> +        * nexthop index to reflect the shift. Nexthops that were before
> +        * the removed nexthop are unaffected.
> +        */
> +       for (i = 0; i < nh_map->nslices; i++) {
> +               if (nh_map->index[i] == idx) {
> +                       j = prandom_u32() % nh_map->size;
> +                       nh_map->index[i] = j;
> +                       nh_map->nhs[j].nslices++;
> +               } else if (nh_map->index[i] > idx) {
> +                       nh_map->index[i]--;
> +               }
> +       }
> +
> +       return 0;
> +}
> +
>  /*
>   *     Insert routing information in a node.
>   */
> @@ -837,6 +997,9 @@ static int fib6_add_rt2node(struct fib6_node *fn, struct rt6_info *rt,
>                         }
>                         sibling = sibling->dst.rt6_next;
>                 }
> +               err = fib6_mp_extend(info->nl_net, sibling, rt);
> +               if (err < 0)
> +                       return err;
>                 /* For each sibling in the list, increment the counter of
>                  * siblings. BUG() if counters does not match, list of siblings
>                  * is broken!
> @@ -900,6 +1063,8 @@ static int fib6_add_rt2node(struct fib6_node *fn, struct rt6_info *rt,
>                         fn->fn_flags |= RTN_RTINFO;
>                 }
>                 nsiblings = iter->rt6i_nsiblings;
> +               if (nsiblings)
> +                       fib6_mp_free(iter);
>                 fib6_purge_rt(iter, fn, info->nl_net);
>                 rt6_release(iter);
>
> @@ -1407,13 +1572,20 @@ static void fib6_del_route(struct fib6_node *fn, struct rt6_info **rtp,
>
>         /* Remove this entry from other siblings */
>         if (rt->rt6i_nsiblings) {
> -               struct rt6_info *sibling, *next_sibling;
> +               struct rt6_info *sibling, *next_sibling, *sref;
> +
> +               sref = list_first_entry(&rt->rt6i_siblings, struct rt6_info,
> +                                       rt6i_siblings);
>
>                 list_for_each_entry_safe(sibling, next_sibling,
>                                          &rt->rt6i_siblings, rt6i_siblings)
>                         sibling->rt6i_nsiblings--;
>                 rt->rt6i_nsiblings = 0;
>                 list_del_init(&rt->rt6i_siblings);
> +               if (!sref->rt6i_nsiblings)
> +                       fib6_mp_free(sref);
> +               else
> +                       fib6_mp_shrink(sref, rt);
>         }
>
>         /* Adjust walkers */
> diff --git a/net/ipv6/route.c b/net/ipv6/route.c
> index b317bb1..e9f13e0 100644
> --- a/net/ipv6/route.c
> +++ b/net/ipv6/route.c
> @@ -427,39 +427,27 @@ static bool rt6_check_expired(const struct rt6_info *rt)
>         return false;
>  }
>
> -/* Multipath route selection:
> - *   Hash based function using packet header and flowlabel.
> - * Adapted from fib_info_hashfn()
> - */
> -static int rt6_info_hash_nhsfn(unsigned int candidate_count,
> -                              const struct flowi6 *fl6)
> -{
> -       return get_hash_from_flowi6(fl6) % candidate_count;
> -}
> -
>  static struct rt6_info *rt6_multipath_select(struct rt6_info *match,
>                                              struct flowi6 *fl6, int oif,
>                                              int strict)
>  {
> -       struct rt6_info *sibling, *next_sibling;
> -       int route_choosen;
> +       struct rt6_multi_map *nh_map = match->rt6i_nh_map;
> +       __u32 hash = get_hash_from_flowi6(fl6);
> +       unsigned int slice, idx;
> +       struct rt6_info *res;
>
> -       route_choosen = rt6_info_hash_nhsfn(match->rt6i_nsiblings + 1, fl6);
> -       /* Don't change the route, if route_choosen == 0
> -        * (siblings does not include ourself)
> -        */
> -       if (route_choosen)
> -               list_for_each_entry_safe(sibling, next_sibling,
> -                               &match->rt6i_siblings, rt6i_siblings) {
> -                       route_choosen--;
> -                       if (route_choosen == 0) {
> -                               if (rt6_score_route(sibling, oif, strict) < 0)
> -                                       break;
> -                               match = sibling;
> -                               break;
> -                       }
> -               }
> -       return match;
> +       WARN_ON(!nh_map);
> +       if (!nh_map)
> +               return match;
> +
> +       slice = hash & (nh_map->nslices - 1);
> +       idx = nh_map->index[slice];
> +       res = nh_map->nhs[idx].nh;
> +
> +       if (rt6_score_route(res, oif, strict) < 0)
> +               res = match;
> +
> +       return res;
>  }
>
>  /*
> @@ -3550,6 +3538,9 @@ int ipv6_sysctl_rtcache_flush(struct ctl_table *ctl, int write,
>         return 0;
>  }
>
> +static int one = 1;
> +static int thirtyone = 31;
> +
>  struct ctl_table ipv6_route_table_template[] = {
>         {
>                 .procname       =       "flush",
> @@ -3621,6 +3612,15 @@ struct ctl_table ipv6_route_table_template[] = {
>                 .mode           =       0644,
>                 .proc_handler   =       proc_dointvec_ms_jiffies,
>         },
> +       {
> +               .procname       =       "ecmp_slices",
> +               .data           =       &init_net.ipv6.sysctl.ip6_rt_ecmp_slices,
> +               .maxlen         =       sizeof(int),
> +               .mode           =       0644,
> +               .proc_handler   =       proc_dointvec_minmax,
> +               .extra1         =       &one,
> +               .extra2         =       &thirtyone,
> +       },
>         { }
>  };
>
> @@ -3644,6 +3644,7 @@ struct ctl_table * __net_init ipv6_route_sysctl_init(struct net *net)
>                 table[7].data = &net->ipv6.sysctl.ip6_rt_mtu_expires;
>                 table[8].data = &net->ipv6.sysctl.ip6_rt_min_advmss;
>                 table[9].data = &net->ipv6.sysctl.ip6_rt_gc_min_interval;
> +               table[10].data = &net->ipv6.sysctl.ip6_rt_ecmp_slices;
>
>                 /* Don't export sysctls to unprivileged users */
>                 if (net->user_ns != &init_user_ns)
> @@ -3707,6 +3708,7 @@ static int __net_init ip6_route_net_init(struct net *net)
>         net->ipv6.sysctl.ip6_rt_gc_elasticity = 9;
>         net->ipv6.sysctl.ip6_rt_mtu_expires = 10*60*HZ;
>         net->ipv6.sysctl.ip6_rt_min_advmss = IPV6_MIN_MTU - 20 - 40;
> +       net->ipv6.sysctl.ip6_rt_ecmp_slices = 5;
>
>         net->ipv6.ip6_rt_gc_expire = 30*HZ;
>
> --
> 2.7.3
>

^ permalink raw reply

* Re: [PATCH net 00/16] net: fix fixed-link phydev leaks
From: David Miller @ 2016-11-30  4:17 UTC (permalink / raw)
  To: johan
  Cc: vbridger, f.fainelli, fugang.duan, pantelis.antoniou, vbordug,
	claudiu.manoil, leoli, thomas.petazzoni, nbd, blogic,
	matthias.bgg, sergei.shtylyov, lars.persson, mugunthanvnm,
	grygorii.strashko, robh+dt, frowand.list, andrew, vivien.didelot,
	netdev, nios2-dev, linux-kernel, linuxppc-dev, linux-mediatek,
	linux-renesas-soc, linux-omap, devicetree
In-Reply-To: <1480357509-28074-1-git-send-email-johan@kernel.org>

From: Johan Hovold <johan@kernel.org>
Date: Mon, 28 Nov 2016 19:24:53 +0100

> This series fixes failures to deregister and free fixed-link phydevs
> that have been registered using the of_phy_register_fixed_link()
> interface.
> 
> All but two drivers currently fail to do this and this series fixes most
> of them with the exception of a staging driver and the stmmac drivers
> which will be fixed by follow-on patches.
> 
> Included are also a couple of fixes for related of-node leaks.
> 
> Note that all patches except the of_mdio one have been compile-tested
> only.
> 
> Also note that the series is against net due to dependencies not yet in
> net-next.

Series applied, thanks Johan.

^ permalink raw reply

* Re: [PATCH] net: macb: Write only necessary bits in NCR in macb reset
From: Harini Katakam @ 2016-11-30  4:17 UTC (permalink / raw)
  To: David Miller
  Cc: Harini Katakam, Nicolas Ferre, netdev,
	linux-kernel@vger.kernel.org, michals@xilinx.com
In-Reply-To: <20161129.190526.1414678491662084583.davem@davemloft.net>

Hi David,

On Wed, Nov 30, 2016 at 5:35 AM, David Miller <davem@davemloft.net> wrote:
> From: Harini Katakam <harini.katakam@xilinx.com>
> Date: Mon, 28 Nov 2016 14:53:49 +0530
>
>> In macb_reset_hw, use read-modify-write to disable RX and TX.
>> This way exiting settings and reserved bits wont be disturbed.
>> Use the same method for clearing statistics as well.
>>
>> Signed-off-by: Harini Katakam <harinik@xilinx.com>
>
> This doesn't make much sense to me.
>
> Consider the two callers of this function.
>
> macb_init_hw() is going to do a non-masking write to the NCR
> register:
>
>         /* Enable TX and RX */
>         macb_writel(bp, NCR, MACB_BIT(RE) | MACB_BIT(TE) | MACB_BIT(MPE));
>
> So obviously no other writable fields matter at all for programming
> the chip properly, otherwise macb_init_hw() would "or" in the bits
> after a read of NCR.  But that's not what this code does, it
> writes "RE | TE | MPE" directly.
>
> And the other caller is macb_close() which is shutting down the
> chip so can zero out all the other bits and it can't possibly
> matter, also due to the assertion above about macb_init_hw()
> showing that only the RE, TE, and MPE bits matter for proper
> functioning of the chip.
>
> You haven't shown a issue caused by the way the code works now, so
> this patch isn't fixing a bug.  In fact, the "bit preserving" would
> even be misleading to someone reading the code.  They will ask
> themselves what bits need to be preserved, and as shown above none of
> them need to be.
>
> I'm not applying this, sorry.
>

Thanks for the detailed analysis.
I noticed an issue with respect to management port enable bit
when working on the patch series for macb mdio driver for
multi MAC - multi PHY access via same mdio bus.
MPE bit of the MAC (whose phy management register
is used) was being reset. But that is a separate
series still in review.
You are right, there is no existing bug that this
patch addresses. I will come back to this later if
required.

Regards,
Harini

^ permalink raw reply

* Re: netlink: GPF in sock_sndtimeo
From: Richard Guy Briggs @ 2016-11-30  4:52 UTC (permalink / raw)
  To: Cong Wang
  Cc: Dmitry Vyukov, David Miller, Johannes Berg, Florian Westphal,
	Eric Dumazet, Herbert Xu, netdev, LKML, syzkaller
In-Reply-To: <CAM_iQpX_KBEpL1dCBNMgZQ-VQJM9nmsVdRPVK-Aen9k0gV3f0w@mail.gmail.com>

On 2016-11-29 15:13, Cong Wang wrote:
> On Tue, Nov 29, 2016 at 8:48 AM, Richard Guy Briggs <rgb@redhat.com> wrote:
> > On 2016-11-26 17:11, Cong Wang wrote:
> >> It is racy on audit_sock, especially on the netns exit path.
> >
> > I think that is the only place it is racy.  The other places audit_sock
> > is set is when the socket failure has just triggered a reset.
> >
> > Is there a notifier callback for failed or reaped sockets?
> 
> Is NETLINK_URELEASE event what you are looking for?

Possibly, yes.  Thanks, I'll have a look.

- RGB

^ permalink raw reply

* [PATCH net 1/2] tun: handle ubuf refcount correctly when meet erros
From: Jason Wang @ 2016-11-30  5:17 UTC (permalink / raw)
  To: netdev, linux-kernel; +Cc: mst, wangyunjian, Jason Wang

We trigger uarg->callback() immediately after we decide do datacopy
even if caller want to do zerocopy. This will cause the callback
(vhost_net_zerocopy_callback) decrease the refcount. But when we meet
an error afterwards, the error handling in vhost handle_tx() will try
to decrease it again. This is wrong and fix this by delay the
uarg->callback() until we're sure there's no errors.

Reported-by: wangyunjian <wangyunjian@huawei.com>
Signed-off-by: Jason Wang <jasowang@redhat.com>
---
The patch is needed for -stable.
---
 drivers/net/tun.c | 10 ++++------
 1 file changed, 4 insertions(+), 6 deletions(-)

diff --git a/drivers/net/tun.c b/drivers/net/tun.c
index 8093e39..db6acec 100644
--- a/drivers/net/tun.c
+++ b/drivers/net/tun.c
@@ -1246,13 +1246,8 @@ static ssize_t tun_get_user(struct tun_struct *tun, struct tun_file *tfile,
 
 	if (zerocopy)
 		err = zerocopy_sg_from_iter(skb, from);
-	else {
+	else
 		err = skb_copy_datagram_from_iter(skb, 0, from, len);
-		if (!err && msg_control) {
-			struct ubuf_info *uarg = msg_control;
-			uarg->callback(uarg, false);
-		}
-	}
 
 	if (err) {
 		this_cpu_inc(tun->pcpu_stats->rx_dropped);
@@ -1298,6 +1293,9 @@ static ssize_t tun_get_user(struct tun_struct *tun, struct tun_file *tfile,
 		skb_shinfo(skb)->destructor_arg = msg_control;
 		skb_shinfo(skb)->tx_flags |= SKBTX_DEV_ZEROCOPY;
 		skb_shinfo(skb)->tx_flags |= SKBTX_SHARED_FRAG;
+	} else if (msg_control) {
+		struct ubuf_info *uarg = msg_control;
+		uarg->callback(uarg, false);
 	}
 
 	skb_reset_network_header(skb);
-- 
2.7.4

^ permalink raw reply related

* [PATCH net 2/2] macvtap: handle ubuf refcount correctly when meet erros
From: Jason Wang @ 2016-11-30  5:17 UTC (permalink / raw)
  To: netdev, linux-kernel; +Cc: mst, wangyunjian, Jason Wang
In-Reply-To: <1480483072-14201-1-git-send-email-jasowang@redhat.com>

We trigger uarg->callback() immediately after we decide do datacopy
even if caller want to do zerocopy. This will cause the callback
(vhost_net_zerocopy_callback) decrease the refcount. But when we meet
an error afterwards, the error handling in vhost handle_tx() will try
to decrease it again. This is wrong and fix this by delay the
uarg->callback() until we're sure there's no errors.

Signed-off-by: Jason Wang <jasowang@redhat.com>
---
The patch is needed for -stable.
---
 drivers/net/macvtap.c | 11 +++++------
 1 file changed, 5 insertions(+), 6 deletions(-)

diff --git a/drivers/net/macvtap.c b/drivers/net/macvtap.c
index bceca28..7869b06 100644
--- a/drivers/net/macvtap.c
+++ b/drivers/net/macvtap.c
@@ -742,13 +742,8 @@ static ssize_t macvtap_get_user(struct macvtap_queue *q, struct msghdr *m,
 
 	if (zerocopy)
 		err = zerocopy_sg_from_iter(skb, from);
-	else {
+	else
 		err = skb_copy_datagram_from_iter(skb, 0, from, len);
-		if (!err && m && m->msg_control) {
-			struct ubuf_info *uarg = m->msg_control;
-			uarg->callback(uarg, false);
-		}
-	}
 
 	if (err)
 		goto err_kfree;
@@ -779,7 +774,11 @@ static ssize_t macvtap_get_user(struct macvtap_queue *q, struct msghdr *m,
 		skb_shinfo(skb)->destructor_arg = m->msg_control;
 		skb_shinfo(skb)->tx_flags |= SKBTX_DEV_ZEROCOPY;
 		skb_shinfo(skb)->tx_flags |= SKBTX_SHARED_FRAG;
+	} else if (m && m->msg_control) {
+		struct ubuf_info *uarg = m->msg_control;
+		uarg->callback(uarg, false);
 	}
+
 	if (vlan) {
 		skb->dev = vlan->dev;
 		dev_queue_xmit(skb);
-- 
2.7.4

^ permalink raw reply related

* Re: [net-next PATCH v3 6/6] virtio_net: xdp, add slowpath case for non contiguous buffers
From: Alexei Starovoitov @ 2016-11-30  5:23 UTC (permalink / raw)
  To: John Fastabend
  Cc: Michael S. Tsirkin, eric.dumazet, daniel, shm, davem, tgraf,
	john.r.fastabend, netdev, bblanco, brouer
In-Reply-To: <583E3E6A.8050706@gmail.com>

On Tue, Nov 29, 2016 at 06:50:18PM -0800, John Fastabend wrote:
> On 16-11-29 04:37 PM, Alexei Starovoitov wrote:
> > On Tue, Nov 29, 2016 at 12:11:33PM -0800, John Fastabend wrote:
> >> virtio_net XDP support expects receive buffers to be contiguous.
> >> If this is not the case we enable a slowpath to allow connectivity
> >> to continue but at a significan performance overhead associated with
> >> linearizing data. To make it painfully aware to users that XDP is
> >> running in a degraded mode we throw an xdp buffer error.
> >>
> >> To linearize packets we allocate a page and copy the segments of
> >> the data, including the header, into it. After this the page can be
> >> handled by XDP code flow as normal.
> >>
> >> Then depending on the return code the page is either freed or sent
> >> to the XDP xmit path. There is no attempt to optimize this path.
> >>
> >> Signed-off-by: John Fastabend <john.r.fastabend@intel.com>
> > ...
> >> +/* The conditions to enable XDP should preclude the underlying device from
> >> + * sending packets across multiple buffers (num_buf > 1). However per spec
> >> + * it does not appear to be illegal to do so but rather just against convention.
> >> + * So in order to avoid making a system unresponsive the packets are pushed
> >> + * into a page and the XDP program is run. This will be extremely slow and we
> >> + * push a warning to the user to fix this as soon as possible. Fixing this may
> >> + * require resolving the underlying hardware to determine why multiple buffers
> >> + * are being received or simply loading the XDP program in the ingress stack
> >> + * after the skb is built because there is no advantage to running it here
> >> + * anymore.
> >> + */
> > ...
> >>  		if (num_buf > 1) {
> >>  			bpf_warn_invalid_xdp_buffer();
> 			^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
> 
> Here is the warn once call. I made it a defined bpf warning so that we
> can easily identify if needed and it can be used by other drivers as
> well.

ahh. I thought it has - in front of it :)
and you're deleting it in this patch.

> >> -			goto err_xdp;
> >> +
> >> +			/* linearize data for XDP */
> >> +			xdp_page = xdp_linearize_page(rq, num_buf,
> >> +						      page, offset, &len);
> >> +			if (!xdp_page)
> >> +				goto err_xdp;
> > 
> > in case when we're 'lucky' the performance will silently be bad.
> 
> Were you specifically worried about the alloc failing here and no
> indication? I was thinking a statistic counter could be added as a
> follow on series to catch this and other such cases in non XDP paths
> if needed.
> 
> > Can we do warn_once here? so at least something in dmesg points out
> > that performance is not as expected. Am I reading it correctly that
> > you had to do a special kernel hack to trigger this situation and
> > in all normal cases it's not the case?
> > 
> 
> Correct the only way to produce this with upstream qemu and Linux is to
> write a kernel hack to build these types of buffers. AFAIK I caught all
> the cases where it could happen otherwise in the setup xdp prog call and
> required user to configure device driver so they can not happen e.g.
> disable LRO, set MTU < PAGE_SIZE, etc.
> 
> If I missed anything, I don't see it now, I would call it a bug and fix
> it. Again AFAIK everything should be covered though. As Micheal pointed
> out earlier although unexpected its not strictly against virtio spec to
> build such packets so we should handle it at least in a degraded mode
> even though it is not expected in any qemu cases.

Ok. Makes sense. The above wasn't clear in the commit log.
Thanks

^ permalink raw reply

* Re: The ubufs->refcount maybe be subtracted twice when tun_get_user failed
From: Jason Wang @ 2016-11-30  5:24 UTC (permalink / raw)
  To: wangyunjian, mst@redhat.com, netdev@vger.kernel.org; +Cc: caihe
In-Reply-To: <fad5d47a-96ee-1afc-b10c-cb2a6d5dc784@redhat.com>



On 2016年11月30日 10:53, Jason Wang wrote:
>
>
> On 2016年11月29日 21:27, wangyunjian wrote:
>> Sorry, I didn't describe it clearly. In fact, the second subtraction 
>> happens in the function handle_tx,
>> when tun_get_user fails and zcopy_used is ture. Fllowing the steps:
>
> I get your meaning. Thanks for the reporting. Will post patches (since 
> macvtap has similar issue).

Posted, appreciate if you can have a test on them.

Thanks

^ permalink raw reply

* Re: [PATCH net-next v3 3/4] bpf: BPF for lightweight tunnel infrastructure
From: Alexei Starovoitov @ 2016-11-30  5:37 UTC (permalink / raw)
  To: John Fastabend; +Cc: Thomas Graf, davem, netdev, daniel, tom, roopa, hannes
In-Reply-To: <583E3EF4.7030107@gmail.com>

On Tue, Nov 29, 2016 at 06:52:36PM -0800, John Fastabend wrote:
> On 16-11-29 04:15 PM, Alexei Starovoitov wrote:
> > On Tue, Nov 29, 2016 at 02:21:22PM +0100, Thomas Graf wrote:
> >> Registers new BPF program types which correspond to the LWT hooks:
> >>   - BPF_PROG_TYPE_LWT_IN   => dst_input()
> >>   - BPF_PROG_TYPE_LWT_OUT  => dst_output()
> >>   - BPF_PROG_TYPE_LWT_XMIT => lwtunnel_xmit()
> >>
> >> The separate program types are required to differentiate between the
> >> capabilities each LWT hook allows:
> >>
> >>  * Programs attached to dst_input() or dst_output() are restricted and
> >>    may only read the data of an skb. This prevent modification and
> >>    possible invalidation of already validated packet headers on receive
> >>    and the construction of illegal headers while the IP headers are
> >>    still being assembled.
> >>
> >>  * Programs attached to lwtunnel_xmit() are allowed to modify packet
> >>    content as well as prepending an L2 header via a newly introduced
> >>    helper bpf_skb_push(). This is safe as lwtunnel_xmit() is invoked
> >>    after the IP header has been assembled completely.
> >>
> >> All BPF programs receive an skb with L3 headers attached and may return
> >> one of the following error codes:
> >>
> >>  BPF_OK - Continue routing as per nexthop
> >>  BPF_DROP - Drop skb and return EPERM
> >>  BPF_REDIRECT - Redirect skb to device as per redirect() helper.
> >>                 (Only valid in lwtunnel_xmit() context)
> >>
> >> The return codes are binary compatible with their TC_ACT_
> >> relatives to ease compatibility.
> >>
> >> Signed-off-by: Thomas Graf <tgraf@suug.ch>
> > ...
> >> +#define LWT_BPF_MAX_HEADROOM 128
> > 
> > why 128?
> > btw I'm thinking for XDP to use 256, so metadata can be stored in there.
> > 
> 
> hopefully not too off-topic but for XDP I would like to see this get

definitely off-topic. lwt->headroom is existing concept. Too late
to do anything about it.

> passed down with the program. It would be more generic and drivers could
> configure the headroom on demand and more importantly verify that a
> program pushing data is not going to fail at runtime.

For xdp I think it will be problematic, since we'd have to check for
that value at prog array access to make sure tailcalls are not broken.
Mix and match won't be possible.
So what does 'configure the headroom on demand' buys us?
Isn't it much easier to tell all drivers "always reserve this much" ?
We burn the page anyway.
If it's configurable per driver, then we'd need an api
to retrieve it. Yet the program author doesn't care what the value is.
If program needs to do udp encap, it will try do it. No matter what.
If xdp_adjust_head() helper fails, the program will likely decide
to drop the packet. In some cases it may decide to punt to stack
for further processing, but for high performance dataplane code
it's highly unlikely.
If it's configurable to something that is not cache line boundary
hw dma performance may suffer and so on.
So I see only cons in such 'configurable headroom' and propose
to have fixed 256 bytes headroom for XDP
which is enough for any sensible encap and metadata.

^ permalink raw reply

* Re: [PATCH net-next v5 2/3] bpf: Add new cgroup attach type to enable sock modifications
From: Alexei Starovoitov @ 2016-11-30  5:41 UTC (permalink / raw)
  To: David Ahern; +Cc: netdev, daniel, ast, daniel, maheshb, tgraf
In-Reply-To: <d3359d0c-8995-d605-cd2d-f0558fdd0408@cumulusnetworks.com>

On Tue, Nov 29, 2016 at 06:07:18PM -0700, David Ahern wrote:
> On 11/29/16 5:59 PM, Alexei Starovoitov wrote:
> > On Tue, Nov 29, 2016 at 05:43:08PM -0700, David Ahern wrote:
> >> On 11/29/16 1:01 PM, Alexei Starovoitov wrote:
> >>> Could you also expose sk_protcol and sk_type as read only fields?
> >>
> >> Those are bitfields in struct sock, so can't use offsetof or sizeof. Any existing use cases that try to load a bitfield in a bpf that I can look at?
> > 
> > pkt_type, vlan are also bitfileds in skb. Please see convert_skb_access()
> > There is a bit of ugliness due to __BIG_ENDIAN_BITFIELD though..
> > 
> 
> Given the added complexity I'd prefer to defer this second use case to a follow on patch set. This one introduces the infra for sockets and I don't see anything needing to change with it to add the read of 3 more sock elements. Agree?

I don't see a complexity. It was straightforward for skb bitfields,
but if there is some unforeseen issue, it's better to tackle it now
otherwise the feature may never come and this 'infra for sockets' will
stay as 'infra for vrf only' and I'm struggling to convince myself that it's ok.
So I'll try to tweak this patch to add these 3 fields...

^ permalink raw reply

* Re: [patch net / RFC] net: fec: increase frame size limitation to actually available buffer
From: Nikita Yushchenko @ 2016-11-30  6:36 UTC (permalink / raw)
  To: Andy Duan, David S. Miller, Troy Kisky, Andrew Lunn, Eric Nelson,
	Philippe Reynes, Johannes Berg, netdev@vger.kernel.org
  Cc: Chris Healy, Fabio Estevam, linux-kernel@vger.kernel.org
In-Reply-To: <AM4PR0401MB2260D022EC9B9E4C4333E621FF8C0@AM4PR0401MB2260.eurprd04.prod.outlook.com>

> But I think it is not necessary since the driver don't support jumbo frame.

Hardcoded 1522 raises two separate issues.

(1) When DSA is in use, frames processed by FEC chip contain DSA tag and
thus can be larger than hardcoded limit of 1522. This issue is not
FEC-specific, any driver that hardcodes maximum frame size to 1522 (many
do) will have this issue if used with DSA.

Clean solution for this must take into account that difference between
MTU and max frame size is no longer known at compile time. Actually this
is the case even without DSA, due to VLANs: max frame size is (MTU + 18)
without VLANs, but (MTU + 22) with VLANs. However currently drivers tend
to ignore this and hardcode 22.  With DSA, 22 is not enough, need to add
switch-specific tag size to that.

Not yet sure how to handle this. DSA-specific API to find out tag size
could be added, but generic solution should handle all cases of dynamic
difference between MTU and max frame size, not only DSA.

(2) There is some demand to use larger frames for optimization purposes.

FEC register fields that limit frame size are 14-bit, thus allowing
frames up to (4k-1). I'm about to prepare a larger patch:
- add ndo_change_mtu handler, allowing MTU up to (4k - overhead),
- set MAX_FL / TRUNC_FL based on configured MTU,
- if necessary, do buffer reallocation with larger buffers.

Is this suitable for upstreaming?
Is there any policy related to handling larger frames?

^ permalink raw reply

* Re: [PATCH net-next v3 3/4] bpf: BPF for lightweight tunnel infrastructure
From: Thomas Graf @ 2016-11-30  6:48 UTC (permalink / raw)
  To: Alexei Starovoitov; +Cc: davem, netdev, daniel, tom, roopa, hannes
In-Reply-To: <20161130001504.GA28238@ast-mbp.thefacebook.com>

On 11/29/16 at 04:15pm, Alexei Starovoitov wrote:
> On Tue, Nov 29, 2016 at 02:21:22PM +0100, Thomas Graf wrote:
> ...
> > +#define LWT_BPF_MAX_HEADROOM 128
> 
> why 128?
> btw I'm thinking for XDP to use 256, so metadata can be stored in there.

It's an arbitrary limit to catch obvious misconfiguration. I'm absolutely
fine with bumping it to 256.

> > +static int run_lwt_bpf(struct sk_buff *skb, struct bpf_lwt_prog *lwt,
> > +		       struct dst_entry *dst, bool can_redirect)
> > +{
> > +	int ret;
> > +
> > +	/* Preempt disable is needed to protect per-cpu redirect_info between
> > +	 * BPF prog and skb_do_redirect(). The call_rcu in bpf_prog_put() and
> > +	 * access to maps strictly require a rcu_read_lock() for protection,
> > +	 * mixing with BH RCU lock doesn't work.
> > +	 */
> > +	preempt_disable();
> > +	rcu_read_lock();
> > +	bpf_compute_data_end(skb);
> > +	ret = BPF_PROG_RUN(lwt->prog, skb);
> > +	rcu_read_unlock();
> > +
> > +	switch (ret) {
> > +	case BPF_OK:
> > +		break;
> > +
> > +	case BPF_REDIRECT:
> > +		if (!can_redirect) {
> > +			WARN_ONCE(1, "Illegal redirect return code in prog %s\n",
> > +				  lwt->name ? : "<unknown>");
> > +			ret = BPF_OK;
> > +		} else {
> > +			ret = skb_do_redirect(skb);
> 
> I think this assumes that program did bpf_skb_push and L2 header is present.
> Would it make sense to check that mac_header < network_header here to make
> sure that it actually happened? I think the cost of single 'if' isn't much.
> Also skb_do_redirect() can redirect to l3 tunnels like ipip ;)
> so program shouldn't be doing bpf_skb_push in such case...

We are currently guaranteeing mac_header <= network_header given that
bpf_skb_push() is calling skb_reset_mac_header() unconditionally.

Even if a program were to push an L2 header and then redirect to an l3
tunnel, __bpf_redirect_no_mac will pull it off again and correct the
mac_header offset.

Should we check in __bpf_redirect_common() whether mac_header <
nework_header then or add it to lwt-bpf conditional on
dev_is_mac_header_xmit()?

> May be rename bpf_skb_push to bpf_skb_push_l2 ?
> since it's doing skb_reset_mac_header(skb); at the end of it?
> Or it's probably better to use 'flags' argument to tell whether
> bpf_skb_push() should set mac_header or not ?

I added the flags for this purpose but left it unused for now. This
will allow to add a flag later to extend the l3 header prior to
pushing an l2 header, hence the current helper name. But even then,
we would always reset the mac header as well to ensure we never
leave an skb with mac_header > network_header. Are you seeing a
scenario where we would want to omit the mac header reset?

> Then this bit:
> 
> > +		case BPF_OK:
> > +			/* If the L3 header was expanded, headroom might be too
> > +			 * small for L2 header now, expand as needed.
> > +			 */
> > +			ret = xmit_check_hhlen(skb);
> 
> will work fine as well...
> which probably needs "mac_header wasn't set" check? or it's fine?

Right. The check is not strictly required right now but is a safety net
to ensure that the skb passed to dst_neigh_output() which will add the
l2 header in the non-redirect case to always have sufficient headroom.
It will currently be a NOP as we are not allowing to extend the l3 header
in bpf_skb_push() yet. I left this in there to ensure that we are not
missing to add this later.

^ permalink raw reply

* Re: [PATCH] netns: avoid disabling irq for netns id
From: Cong Wang @ 2016-11-30  6:51 UTC (permalink / raw)
  To: Paul Moore; +Cc: Linux Kernel Network Developers, linux-audit
In-Reply-To: <CAGH-KgsK=Dc=Ptfg8P39pemFAsxVr9NKccev7sOW=w1suqKk+w@mail.gmail.com>

On Tue, Nov 29, 2016 at 2:14 PM, Paul Moore <pmoore@redhat.com> wrote:
> On Tue, Nov 29, 2016 at 5:11 PM, Paul Moore <pmoore@redhat.com> wrote:
>> From: Paul Moore <paul@paul-moore.com>
>>
>> Bring back commit bc51dddf98c9 ("netns: avoid disabling irq for netns
>> id") now that we've fixed some audit multicast issues that caused
>> problems with original attempt.  Additional information, and history,
>> can be found in the links below:
>>
>>  * https://github.com/linux-audit/audit-kernel/issues/22
>>  * https://github.com/linux-audit/audit-kernel/issues/23
>>
>> Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
>> Signed-off-by: Paul Moore <paul@paul-moore.com>
>> ---
>>  net/core/net_namespace.c |   35 +++++++++++++++--------------------
>>  1 file changed, 15 insertions(+), 20 deletions(-)
>
> Cong Wang, I added your sign-off to the patch since you were the
> original author, if you would prefer I leave it off or want it changed
> just let me know.

Thanks for not forgetting it. I am fine with signed-off-by.

^ permalink raw reply

* Re: [PATCH net-next v3 4/4] bpf: Add tests and samples for LWT-BPF
From: Thomas Graf @ 2016-11-30  6:52 UTC (permalink / raw)
  To: Alexei Starovoitov; +Cc: davem, netdev, daniel, tom, roopa, hannes
In-Reply-To: <20161130001750.GB28238@ast-mbp.thefacebook.com>

On 11/29/16 at 04:17pm, Alexei Starovoitov wrote:
> On Tue, Nov 29, 2016 at 02:21:23PM +0100, Thomas Graf wrote:
> > Adds a series of test to verify the functionality of attaching
> > BPF programs at LWT hooks.
> > 
> > Also adds a sample which collects a histogram of packet sizes which
> > pass through an LWT hook.
> > 
> > $ ./lwt_len_hist.sh
> > Starting netserver with host 'IN(6)ADDR_ANY' port '12865' and family AF_UNSPEC
> > MIGRATED TCP STREAM TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to 192.168.253.2 () port 0 AF_INET : demo
> > Recv   Send    Send
> > Socket Socket  Message  Elapsed
> > Size   Size    Size     Time     Throughput
> > bytes  bytes   bytes    secs.    10^6bits/sec
> > 
> >  87380  16384  16384    10.00    39857.69
> 
> Nice!
> 
> > +	ret = bpf_redirect(ifindex, 0);
> > +	if (ret < 0) {
> > +		printk("bpf_redirect() failed: %d\n", ret);
> > +		return BPF_DROP;
> > +	}
> 
> this 'if' looks a bit weird. You're passing 0 as flags,
> so this helper will always succeed.
> Other sample code often does 'return bpf_redirect(...)'
> due to this reasoning.

Right, the if branch is absolutely useless. I will remove it.

^ permalink raw reply

* Re: [PATCH net-next v3 3/4] bpf: BPF for lightweight tunnel infrastructure
From: Alexei Starovoitov @ 2016-11-30  7:01 UTC (permalink / raw)
  To: Thomas Graf; +Cc: davem, netdev, daniel, tom, roopa, hannes
In-Reply-To: <20161130064850.GB16856@pox.localdomain>

On Wed, Nov 30, 2016 at 07:48:51AM +0100, Thomas Graf wrote:
> On 11/29/16 at 04:15pm, Alexei Starovoitov wrote:
> > On Tue, Nov 29, 2016 at 02:21:22PM +0100, Thomas Graf wrote:
> > ...
> > > +#define LWT_BPF_MAX_HEADROOM 128
> > 
> > why 128?
> > btw I'm thinking for XDP to use 256, so metadata can be stored in there.
> 
> It's an arbitrary limit to catch obvious misconfiguration. I'm absolutely
> fine with bumping it to 256.
> 
> > > +static int run_lwt_bpf(struct sk_buff *skb, struct bpf_lwt_prog *lwt,
> > > +		       struct dst_entry *dst, bool can_redirect)
> > > +{
> > > +	int ret;
> > > +
> > > +	/* Preempt disable is needed to protect per-cpu redirect_info between
> > > +	 * BPF prog and skb_do_redirect(). The call_rcu in bpf_prog_put() and
> > > +	 * access to maps strictly require a rcu_read_lock() for protection,
> > > +	 * mixing with BH RCU lock doesn't work.
> > > +	 */
> > > +	preempt_disable();
> > > +	rcu_read_lock();
> > > +	bpf_compute_data_end(skb);
> > > +	ret = BPF_PROG_RUN(lwt->prog, skb);
> > > +	rcu_read_unlock();
> > > +
> > > +	switch (ret) {
> > > +	case BPF_OK:
> > > +		break;
> > > +
> > > +	case BPF_REDIRECT:
> > > +		if (!can_redirect) {
> > > +			WARN_ONCE(1, "Illegal redirect return code in prog %s\n",
> > > +				  lwt->name ? : "<unknown>");
> > > +			ret = BPF_OK;
> > > +		} else {
> > > +			ret = skb_do_redirect(skb);
> > 
> > I think this assumes that program did bpf_skb_push and L2 header is present.
> > Would it make sense to check that mac_header < network_header here to make
> > sure that it actually happened? I think the cost of single 'if' isn't much.
> > Also skb_do_redirect() can redirect to l3 tunnels like ipip ;)
> > so program shouldn't be doing bpf_skb_push in such case...
> 
> We are currently guaranteeing mac_header <= network_header given that
> bpf_skb_push() is calling skb_reset_mac_header() unconditionally.
> 
> Even if a program were to push an L2 header and then redirect to an l3
> tunnel, __bpf_redirect_no_mac will pull it off again and correct the
> mac_header offset.

yes. that part is fine.

> Should we check in __bpf_redirect_common() whether mac_header <
> nework_header then or add it to lwt-bpf conditional on
> dev_is_mac_header_xmit()?

may be only extra 'if' in lwt-bpf is all we need?
I'm still missing what will happen if we 'forget' to do
bpf_skb_push() inside the lwt-bpf program, but still do redirect
in lwt_xmit stage to l2 netdev...

^ permalink raw reply

* Re: [PATCH iproute2 V2 1/2] tc/cls_flower: Classify packet in ip tunnels
From: Amir Vadai @ 2016-11-30  7:17 UTC (permalink / raw)
  To: Stephen Hemminger
  Cc: netdev, David S. Miller, Jiri Benc, Or Gerlitz, Hadar Har-Zion,
	Roi Dayan
In-Reply-To: <20161129192658.5a4c3e2c@samsung9.wavecable.com>

On Tue, Nov 29, 2016 at 07:26:58PM -0800, Stephen Hemminger wrote:
> The overall design is fine, just a couple nits with the code.
> 
> > diff --git a/tc/f_flower.c b/tc/f_flower.c
> > index 2d31d1aa832d..1cf0750b5b83 100644
> > --- a/tc/f_flower.c
> > +++ b/tc/f_flower.c
> 
> >  
> > +static int flower_parse_key_id(char *str, int type, struct nlmsghdr *n)
> 
> str is not modified, therefore use: const char *str
ack

> 
> > +{
> > +	int ret;
> > +	__be32 key_id;
> > +
> > +	ret = get_be32(&key_id, str, 10);
> > +	if (ret)
> > +		return -1;
> 
> Traditionally netlink attributes are in host order, why was flower
> chosen to be different?
I don't know, maybe Jiri (cc'ed) can explain, but it is all over the
flower code.

> 
> > +
> > +	addattr32(n, MAX_MSG, type, key_id);
> > +
> > +	return 0;
> 
> 
> Why lose the return value here?  Why not:
> 
> 	ret = get_be32(&key_id, str, 10);
> 	if (!ret)
> 		addattr32(n, MAX_MSG, type, key_id);
The get_*() function can return only -1 or 0. But you are right, and it
is better the way you suggested.  Changing accordingly in V3.

> 
> > +}
> > +
> >  static int flower_parse_opt(struct filter_util *qu, char *handle,
> >  			    int argc, char **argv, struct nlmsghdr *n)
> >  {
> > @@ -339,6 +359,38 @@ static int flower_parse_opt(struct filter_util *qu, char *handle,
> >  				fprintf(stderr, "Illegal \"src_port\"\n");
> >  				return -1;
> >  			}
> > +		} else if (matches(*argv, "enc_dst_ip") == 0) {
> > +			NEXT_ARG();
> > +			ret = flower_parse_ip_addr(*argv, 0,
> > +						   TCA_FLOWER_KEY_ENC_IPV4_DST,
> > +						   TCA_FLOWER_KEY_ENC_IPV4_DST_MASK,
> > +						   TCA_FLOWER_KEY_ENC_IPV6_DST,
> > +						   TCA_FLOWER_KEY_ENC_IPV6_DST_MASK,
> > +						   n);
> > +			if (ret < 0) {
> > +				fprintf(stderr, "Illegal \"enc_dst_ip\"\n");
> > +				return -1;
> > +			}
> > +		} else if (matches(*argv, "enc_src_ip") == 0) {
> > +			NEXT_ARG();
> > +			ret = flower_parse_ip_addr(*argv, 0,
> > +						   TCA_FLOWER_KEY_ENC_IPV4_SRC,
> > +						   TCA_FLOWER_KEY_ENC_IPV4_SRC_MASK,
> > +						   TCA_FLOWER_KEY_ENC_IPV6_SRC,
> > +						   TCA_FLOWER_KEY_ENC_IPV6_SRC_MASK,
> > +						   n);
> > +			if (ret < 0) {
> > +				fprintf(stderr, "Illegal \"enc_src_ip\"\n");
> > +				return -1;
> > +			}
> > +		} else if (matches(*argv, "enc_key_id") == 0) {
> > +			NEXT_ARG();
> > +			ret = flower_parse_key_id(*argv,
> > +						  TCA_FLOWER_KEY_ENC_KEY_ID, n);
> > +			if (ret < 0) {
> > +				fprintf(stderr, "Illegal \"enc_key_id\"\n");
> > +				return -1;
> > +			}
> >  		} else if (matches(*argv, "action") == 0) {
> >  			NEXT_ARG();
> >  			ret = parse_action(&argc, &argv, TCA_FLOWER_ACT, n);
> > @@ -509,6 +561,14 @@ static void flower_print_port(FILE *f, char *name, __u8 ip_proto,
> >  	fprintf(f, "\n  %s %d", name, ntohs(rta_getattr_u16(attr)));
> >  }
> >  
> > +static void flower_print_key_id(FILE *f, char *name,
> > +				struct rtattr *attr)
> 
> const char *name?
ack

> 
> 
> > +{
> > +	if (!attr)
> > +		return;
> > +	fprintf(f, "\n  %s %d", name, ntohl(rta_getattr_u32(attr)));
> > +}
> > +
> 
> Why short circuit, just change the order:
> 
> 	if (attr)
> 		fprintf(f, "\n  %s %s", name, ntohl(rta_getattr_u32(attr));
> 
> You might also want to introduce rta_getattr_be32()
ack

> 
> Please change, retest and resubmit both patches.
ack

Thanks for reviewing,
Amir

^ permalink raw reply

* Re: [PATCH v3 net-next 2/3] openvswitch: Use is_skb_forwardable() for length check.
From: Pravin Shelar @ 2016-11-30  7:23 UTC (permalink / raw)
  To: Jarno Rajahalme; +Cc: Linux Kernel Network Developers, Jiri Benc, Eric Garver
In-Reply-To: <1480462253-114713-2-git-send-email-jarno@ovn.org>

On Tue, Nov 29, 2016 at 3:30 PM, Jarno Rajahalme <jarno@ovn.org> wrote:
> Use is_skb_forwardable() instead of an explicit length check.  This
> gets around the apparent MTU check failure in OVS test cases when
> skb->protocol is not properly set in case of non-accelerated VLAN
> skbs.
>
> Suggested-by: Pravin Shelar <pshelar@ovn.org>
> Fixes: 5108bbaddc ("openvswitch: add processing of L3 packets")
> Signed-off-by: Jarno Rajahalme <jarno@ovn.org>
> ---
> v3: New patch suggested by Pravin.
>
> net/openvswitch/vport.c | 38 ++++++++++++++------------------------
>  1 file changed, 14 insertions(+), 24 deletions(-)
>
> diff --git a/net/openvswitch/vport.c b/net/openvswitch/vport.c
> index b6c8524..076b39f 100644
> --- a/net/openvswitch/vport.c
> +++ b/net/openvswitch/vport.c

>  void ovs_vport_send(struct vport *vport, struct sk_buff *skb, u8 mac_proto)
>  {
> -       int mtu = vport->dev->mtu;
> -
>         switch (vport->dev->type) {
>         case ARPHRD_NONE:
>                 if (mac_proto == MAC_PROTO_ETHERNET) {
> @@ -504,11 +485,20 @@ void ovs_vport_send(struct vport *vport, struct sk_buff *skb, u8 mac_proto)
>                 goto drop;
>         }
>
> -       if (unlikely(packet_length(skb, vport->dev) > mtu &&
> -                    !skb_is_gso(skb))) {
> -               net_warn_ratelimited("%s: dropped over-mtu packet: %d > %d\n",
> -                                    vport->dev->name,
> -                                    packet_length(skb, vport->dev), mtu);
> +       if (unlikely(!is_skb_forwardable(vport->dev, skb))) {

This would easy to read if you inverse the if condition here.

> +               /* Log only if the device is up. */
> +               if (vport->dev->flags & IFF_UP) {

since is_skb_forwardable() is checking for IFF_UP we can remove same
check from internal-device send() op.

> +                       unsigned int length = skb->len
> +                               - vport->dev->hard_header_len;
> +
> +                       if (!skb_vlan_tag_present(skb)
> +                           && eth_type_vlan(skb->protocol))
> +                               length -= VLAN_HLEN;
> +
> +                       net_warn_ratelimited("%s: dropped over-mtu packet %d > %d\n",
> +                                            vport->dev->name, length,
> +                                            vport->dev->mtu);
> +               }
>                 vport->dev->stats.tx_errors++;
>                 goto drop;
>         }
> --
> 2.1.4
>

^ permalink raw reply

* Re: [patch net / RFC] net: fec: increase frame size limitation to actually available buffer
From: Toshiaki Makita @ 2016-11-30  7:25 UTC (permalink / raw)
  To: Nikita Yushchenko, Andy Duan, David S. Miller, Troy Kisky,
	Andrew Lunn, Eric Nelson, Philippe Reynes, Johannes Berg,
	netdev@vger.kernel.org
  Cc: Chris Healy, Fabio Estevam, linux-kernel@vger.kernel.org
In-Reply-To: <dab27a3d-821c-8d07-ad98-2bdd8c442bc5@cogentembedded.com>

On 2016/11/30 15:36, Nikita Yushchenko wrote:
>> But I think it is not necessary since the driver don't support jumbo frame.
> 
> Hardcoded 1522 raises two separate issues.
> 
> (1) When DSA is in use, frames processed by FEC chip contain DSA tag and
> thus can be larger than hardcoded limit of 1522. This issue is not
> FEC-specific, any driver that hardcodes maximum frame size to 1522 (many
> do) will have this issue if used with DSA.
> 
> Clean solution for this must take into account that difference between
> MTU and max frame size is no longer known at compile time. Actually this
> is the case even without DSA, due to VLANs: max frame size is (MTU + 18)
> without VLANs, but (MTU + 22) with VLANs. However currently drivers tend
> to ignore this and hardcode 22.  With DSA, 22 is not enough, need to add
> switch-specific tag size to that.
> 
> Not yet sure how to handle this. DSA-specific API to find out tag size
> could be added, but generic solution should handle all cases of dynamic
> difference between MTU and max frame size, not only DSA.

BTW I'm trying to introduce envelope frames to solve this kind of problems.
http://marc.info/?t=147496691500005&r=1&w=2
http://marc.info/?t=147496691500003&r=1&w=2
http://marc.info/?t=147496691500002&r=1&w=2
http://marc.info/?t=147496691500004&r=1&w=2
http://marc.info/?t=147496691500001&r=1&w=2

It needs jumbo frame support of NICs though.

Regards,
Toshiaki Makita

^ permalink raw reply

* Re: [WIP] net+mlx4: auto doorbell
From: Alexei Starovoitov @ 2016-11-30  7:28 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: Jesper Dangaard Brouer, Rick Jones, netdev@vger.kernel.org,
	Saeed Mahameed, Tariq Toukan

On Mon, Nov 28, 2016 at 10:58 PM, Eric Dumazet <eric.dumazet@gmail.com> wrote:
>  {
> @@ -496,8 +531,13 @@ static bool mlx4_en_process_tx_cq(struct net_device *dev,
>         wmb();
>
>         /* we want to dirty this cache line once */
> -       ACCESS_ONCE(ring->last_nr_txbb) = last_nr_txbb;
> -       ACCESS_ONCE(ring->cons) = ring_cons + txbbs_skipped;
> +       WRITE_ONCE(ring->last_nr_txbb, last_nr_txbb);
> +       ring_cons += txbbs_skipped;
> +       WRITE_ONCE(ring->cons, ring_cons);
> +       WRITE_ONCE(ring->ncons, ring_cons + last_nr_txbb);
> +
> +       if (dev->doorbell_opt)
> +               mlx4_en_xmit_doorbell(dev, ring);
...
> +       /* Doorbell avoidance : We can omit doorbell if we know a TX completion
> +        * will happen shortly.
> +        */
> +       if (send_doorbell &&
> +           dev->doorbell_opt &&
> +           (s32)(READ_ONCE(ring->prod_bell) - READ_ONCE(ring->ncons)) > 0)
> +               send_doorbell = false;

this is awesome idea!
I'm missing though why you need all these read_once/write_once.
It's a tx ring that is already protected by tx queue lock
or completion is happening without grabbing the lock?
Then how do we make sure that we don't race here
and indeed above prod_bell - ncons > 0 condition is safe?
Good description of algorithm would certainly help ;)

^ permalink raw reply

* Re: [PATCH v3 net-next 3/3] openvswitch: Fix skb->protocol for vlan frames.
From: Pravin Shelar @ 2016-11-30  7:34 UTC (permalink / raw)
  To: Jarno Rajahalme; +Cc: Linux Kernel Network Developers, Jiri Benc, Eric Garver
In-Reply-To: <1480462253-114713-3-git-send-email-jarno@ovn.org>

On Tue, Nov 29, 2016 at 3:30 PM, Jarno Rajahalme <jarno@ovn.org> wrote:
> Do not always set skb->protocol to be the ethertype of the L3 header.
> For a packet with non-accelerated VLAN tags skb->protocol needs to be
> the ethertype of the outermost non-accelerated VLAN ethertype.
>
> Any VLAN offloading is undone on the OVS netlink interface, and any
> VLAN tags added by userspace are non-accelerated, as are double tagged
> VLAN packets.
>
> Fixes: 018c1dda5f ("openvswitch: 802.1AD Flow handling, actions, vlan parsing, netlink attributes")
> Fixes: 5108bbaddc ("openvswitch: add processing of L3 packets")
> Signed-off-by: Jarno Rajahalme <jarno@ovn.org>

Looks much better now. Thanks for fixing it. I have one minor comment.

Acked-by: Pravin B Shelar <pshelar@ovn.org>

> ---
> v3: Set skb->protocol properly also for double tagged packets, as suggested
>     by Pravin.  This patch is no longer needed for the MTU test failure, as
>     the new patch 2/3 addresses that.
>
>  net/openvswitch/datapath.c |  1 -
>  net/openvswitch/flow.c     | 62 +++++++++++++++++++++++-----------------------
>  2 files changed, 31 insertions(+), 32 deletions(-)
>
> diff --git a/net/openvswitch/flow.c b/net/openvswitch/flow.c
> index 08aa926..e2fe2e5 100644
> --- a/net/openvswitch/flow.c
> +++ b/net/openvswitch/flow.c
> @@ -354,6 +354,7 @@ static int parse_vlan(struct sk_buff *skb, struct sw_flow_key *key)
>                 res = parse_vlan_tag(skb, &key->eth.vlan);
>                 if (res <= 0)
>                         return res;
> +               skb->protocol = key->eth.vlan.tpid;
>         }
>
>         /* Parse inner vlan tag. */
> @@ -361,6 +362,11 @@ static int parse_vlan(struct sk_buff *skb, struct sw_flow_key *key)
>         if (res <= 0)
>                 return res;
>
> +       /* If the outer vlan tag was accelerated, skb->protocol should
> +        * refelect the inner vlan type. */
> +       if (!eth_type_vlan(skb->protocol))

Since you would be spinning another version, can you change this
condition to directly check for skb-vlan-tag rather than indirectly
checking for the vlan accelerated case?

^ permalink raw reply

page: next (older) | prev (newer) | latest
- recent:[subjects (threaded)|topics (new)|topics (active)]

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox