Netdev List

Netdev List
 help / color / mirror / Atom feed

* Re: [PATCH net] udp: fix SO_BINDTODEVICE
From: David Miller @ 2018-05-10 19:43 UTC (permalink / raw)
  To: pabeni; +Cc: netdev, dnman, dsahern
In-Reply-To: <9445dd5d149af16463df4d0502b2667ee2b6f4e8.1525862461.git.pabeni@redhat.com>

From: Paolo Abeni <pabeni@redhat.com>
Date: Wed,  9 May 2018 12:42:34 +0200

> Damir reported a breakage of SO_BINDTODEVICE for UDP sockets.
> In absence of VRF devices, after commit fb74c27735f0 ("net:
> ipv4: add second dif to udp socket lookups") the dif mismatch
> isn't fatal anymore for UDP socket lookup with non null
> sk_bound_dev_if, breaking SO_BINDTODEVICE semantics.
> 
> This changeset addresses the issue making the dif match mandatory
> again in the above scenario.
> 
> Reported-by: Damir Mansurov <dnman@oktetlabs.ru>
> Fixes: fb74c27735f0 ("net: ipv4: add second dif to udp socket lookups")
> Fixes: 1801b570dd2a ("net: ipv6: add second dif to udp socket lookups")
> Signed-off-by: Paolo Abeni <pabeni@redhat.com>

Applied.

^ permalink raw reply

* Re: [PATCH net] ipv4: reset fnhe_mtu_locked after cache route flushed
From: David Miller @ 2018-05-10 19:43 UTC (permalink / raw)
  To: liuhangbin; +Cc: netdev, sd, sbrivio
In-Reply-To: <1525860404-15012-1-git-send-email-liuhangbin@gmail.com>

From: Hangbin Liu <liuhangbin@gmail.com>
Date: Wed,  9 May 2018 18:06:44 +0800

> After route cache is flushed via ipv4_sysctl_rtcache_flush(), we forget
> to reset fnhe_mtu_locked in rt_bind_exception(). When pmtu is updated
> in __ip_rt_update_pmtu(), it will return directly since the pmtu is
> still locked. e.g.
> 
> + ip netns exec client ping 10.10.1.1 -c 1 -s 1400 -M do
> PING 10.10.1.1 (10.10.1.1) 1400(1428) bytes of data.
> From 10.10.0.254 icmp_seq=1 Frag needed and DF set (mtu = 0)
> 
> --- 10.10.1.1 ping statistics ---
> 1 packets transmitted, 0 received, +1 errors, 100% packet loss, time 0ms
> 
> + ip netns exec client ip route get 10.10.1.1
> 10.10.1.1 via 10.10.0.254 dev veth0_c src 10.10.0.1 uid 0
>     cache expires 599sec mtu lock 552
> + ip netns exec client ip route flush cache
> + ip netns exec client ip route get 10.10.1.1
> 10.10.1.1 via 10.10.0.254 dev veth0_c src 10.10.0.1 uid 0
>     cache
> + ip netns exec client ping 10.10.1.1 -c 1 -s 1400 -M do
> PING 10.10.1.1 (10.10.1.1) 1400(1428) bytes of data.
> ping: local error: Message too long, mtu=576
> 
> --- 10.10.1.1 ping statistics ---
> 1 packets transmitted, 0 received, +1 errors, 100% packet loss, time 0ms
> 
> + ip netns exec client ip route get 10.10.1.1
> 10.10.1.1 via 10.10.0.254 dev veth0_c src 10.10.0.1 uid 0
>     cache
> 
> Fixes: d52e5a7e7ca49 ("ipv4: lock mtu in fnhe when received PMTU < net.ipv4.route.min_pmtu")
> Reported-by: Jianlin Shi <jishi@redhat.com>
> Reviewed-by: Stefano Brivio <sbrivio@redhat.com>
> Signed-off-by: Hangbin Liu <liuhangbin@gmail.com>

Applied.

^ permalink raw reply

* Re: [PATCH v2] hv_netvsc: Fix net device attach on older Windows hosts
From: David Miller @ 2018-05-10 19:37 UTC (permalink / raw)
  To: mgamal; +Cc: sthemmin, netdev, haiyangz, linux-kernel, devel, vkuznets
In-Reply-To: <1525853854-8277-1-git-send-email-mgamal@redhat.com>

From: Mohammed Gamal <mgamal@redhat.com>
Date: Wed,  9 May 2018 10:17:34 +0200

> On older windows hosts the net_device instance is returned to
> the caller of rndis_filter_device_add() without having the presence
> bit set first. This would cause any subsequent calls to network device
> operations (e.g. MTU change, channel change) to fail after the device
> is detached once, returning -ENODEV.
> 
> Instead of returning the device instabce, we take the exit path where
> we call netif_device_attach()
> 
> Fixes: 7b2ee50c0cd5 ("hv_netvsc: common detach logic")
> 
> Signed-off-by: Mohammed Gamal <mgamal@redhat.com>

Applied.

^ permalink raw reply

* Re: [PATCH 09/18] ptp: Remove pr_fmt duplicate logging prefixes
From: Richard Cochran @ 2018-05-10 19:35 UTC (permalink / raw)
  To: Joe Perches; +Cc: netdev, linux-kernel
In-Reply-To: <5b315a623c63fd82df0540e27a130aaabd8c3b49.1525964385.git.joe@perches.com>

On Thu, May 10, 2018 at 08:45:35AM -0700, Joe Perches wrote:
> Converting pr_fmt from a simple define to use KBUILD_MODNAME added
> some duplicate logging prefixes to existing uses.
> 
> Remove them.
> 
> Signed-off-by: Joe Perches <joe@perches.com>

Acked-by: Richard Cochran <richardcochran@gmail.com>

^ permalink raw reply

* Re: [PATCH net] nfp: flower: remove headroom from max MTU calculation
From: David Miller @ 2018-05-10 19:28 UTC (permalink / raw)
  To: jakub.kicinski; +Cc: oss-drivers, netdev, pieter.jansenvanvuuren
In-Reply-To: <20180509071858.11124-1-jakub.kicinski@netronome.com>

From: Jakub Kicinski <jakub.kicinski@netronome.com>
Date: Wed,  9 May 2018 00:18:58 -0700

> From: Pieter Jansen van Vuuren <pieter.jansenvanvuuren@netronome.com>
> 
> Since commit 29a5dcae2790 ("nfp: flower: offload phys port MTU change") we
> take encapsulation headroom into account when calculating the max allowed
> MTU.  This is unnecessary as the max MTU advertised by firmware should have
> already accounted for encap headroom.
> 
> Subtracting headroom twice brings the max MTU below what's necessary for
> some deployments.
> 
> Fixes: 29a5dcae2790 ("nfp: flower: offload phys port MTU change")
> Signed-off-by: Pieter Jansen van Vuuren <pieter.jansenvanvuuren@netronome.com>
> Reviewed-by: John Hurley <john.hurley@netronome.com>
> Reviewed-by: Jakub Kicinski <jakub.kicinski@netronome.com>

Applied.

^ permalink raw reply

* Re: [bpf-next v3 8/9] bpf: Provide helper to do forwarding lookups in kernel FIB table
From: Mathieu Xhonneux @ 2018-05-10 19:27 UTC (permalink / raw)
  To: David Ahern
  Cc: netdev, borkmann, ast, David Miller, shm, roopa, brouer, toke,
	john.fastabend
In-Reply-To: <20180510033427.20756-9-dsahern@gmail.com>

I'm quite interested in this helper to implement OAM features (through
other hooks, e.g. the BPF LWT hook). Do you have an idea about how it
behaves with ECMP routes (with IPv4 and/or IPv6) ? In IPv6, I'm
guessing that the returned gateway address is always a link-local
address ?

Thanks.

2018-05-10 5:34 GMT+02:00 David Ahern <dsahern@gmail.com>:
>
> Provide a helper for doing a FIB and neighbor lookup in the kernel
> tables from an XDP program. The helper provides a fastpath for forwarding
> packets. If the packet is a local delivery or for any reason is not a
> simple lookup and forward, the packet continues up the stack.
>
> If it is to be forwarded, the forwarding can be done directly if the
> neighbor is already known. If the neighbor does not exist, the first
> few packets go up the stack for neighbor resolution. Once resolved, the
> xdp program provides the fast path.
>
> On successful lookup the nexthop dmac, current device smac and egress
> device index are returned.
>
> The API supports IPv4, IPv6 and MPLS protocols, but only IPv4 and IPv6
> are implemented in this patch. The API includes layer 4 parameters if
> the XDP program chooses to do deep packet inspection to allow compare
> against ACLs implemented as FIB rules.
>
> Header rewrite is left to the XDP program.
>
> The lookup takes 2 flags:
> - BPF_FIB_LOOKUP_DIRECT to do a lookup that bypasses FIB rules and goes
>   straight to the table associated with the device (expert setting for
>   those looking to maximize throughput)
>
> - BPF_FIB_LOOKUP_OUTPUT to do a lookup from the egress perspective.
>   Default is an ingress lookup.
>
> Initial performance numbers collected by Jesper, forwarded packets/sec:
>
>        Full stack    XDP FIB lookup    XDP Direct lookup
> IPv4   1,947,969       7,074,156          7,415,333
> IPv6   1,728,000       6,165,504          7,262,720
>
> These number are single CPU core forwarding on a Broadwell
> E5-1650 v4 @ 3.60GHz.
>
> Signed-off-by: David Ahern <dsahern@gmail.com>
> ---
>  include/uapi/linux/bpf.h |  81 +++++++++++++-
>  net/core/filter.c        | 267 +++++++++++++++++++++++++++++++++++++++++++++++
>  2 files changed, 347 insertions(+), 1 deletion(-)
>
> diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
> index d615c777b573..02e4112510f8 100644
> --- a/include/uapi/linux/bpf.h
> +++ b/include/uapi/linux/bpf.h
> @@ -1828,6 +1828,33 @@ union bpf_attr {
>   *     Return
>   *             0 on success, or a negative error in case of failure.
>   *
> + *
> + * int bpf_fib_lookup(void *ctx, struct bpf_fib_lookup *params, int plen, u32 flags)
> + *     Description
> + *             Do FIB lookup in kernel tables using parameters in *params*.
> + *             If lookup is successful and result shows packet is to be
> + *             forwarded, the neighbor tables are searched for the nexthop.
> + *             If successful (ie., FIB lookup shows forwarding and nexthop
> + *             is resolved), the nexthop address is returned in ipv4_dst,
> + *             ipv6_dst or mpls_out based on family, smac is set to mac
> + *             address of egress device, dmac is set to nexthop mac address,
> + *             rt_metric is set to metric from route.
> + *
> + *             *plen* argument is the size of the passed in struct.
> + *             *flags* argument can be one or more BPF_FIB_LOOKUP_ flags:
> + *
> + *             **BPF_FIB_LOOKUP_DIRECT** means do a direct table lookup vs
> + *             full lookup using FIB rules
> + *             **BPF_FIB_LOOKUP_OUTPUT** means do lookup from an egress
> + *             perspective (default is ingress)
> + *
> + *             *ctx* is either **struct xdp_md** for XDP programs or
> + *             **struct sk_buff** tc cls_act programs.
> + *
> + *     Return
> + *             Egress device index on success, 0 if packet needs to continue
> + *             up the stack for further processing or a negative error in case
> + *             of failure.
>   */
>  #define __BPF_FUNC_MAPPER(FN)          \
>         FN(unspec),                     \
> @@ -1898,7 +1925,8 @@ union bpf_attr {
>         FN(xdp_adjust_tail),            \
>         FN(skb_get_xfrm_state),         \
>         FN(get_stack),                  \
> -       FN(skb_load_bytes_relative),
> +       FN(skb_load_bytes_relative),    \
> +       FN(fib_lookup),
>
>  /* integer value in 'imm' field of BPF_CALL instruction selects which helper
>   * function eBPF program intends to call
> @@ -2321,4 +2349,55 @@ struct bpf_raw_tracepoint_args {
>         __u64 args[0];
>  };
>
> +/* DIRECT:  Skip the FIB rules and go to FIB table associated with device
> + * OUTPUT:  Do lookup from egress perspective; default is ingress
> + */
> +#define BPF_FIB_LOOKUP_DIRECT  BIT(0)
> +#define BPF_FIB_LOOKUP_OUTPUT  BIT(1)
> +
> +struct bpf_fib_lookup {
> +       /* input */
> +       __u8    family;   /* network family, AF_INET, AF_INET6, AF_MPLS */
> +
> +       /* set if lookup is to consider L4 data - e.g., FIB rules */
> +       __u8    l4_protocol;
> +       __be16  sport;
> +       __be16  dport;
> +
> +       /* total length of packet from network header - used for MTU check */
> +       __u16   tot_len;
> +       __u32   ifindex;  /* L3 device index for lookup */
> +
> +       union {
> +               /* inputs to lookup */
> +               __u8    tos;            /* AF_INET  */
> +               __be32  flowlabel;      /* AF_INET6 */
> +
> +               /* output: metric of fib result */
> +               __u32 rt_metric;
> +       };
> +
> +       union {
> +               __be32          mpls_in;
> +               __be32          ipv4_src;
> +               __u32           ipv6_src[4];  /* in6_addr; network order */
> +       };
> +
> +       /* input to bpf_fib_lookup, *dst is destination address.
> +        * output: bpf_fib_lookup sets to gateway address
> +        */
> +       union {
> +               /* return for MPLS lookups */
> +               __be32          mpls_out[4];  /* support up to 4 labels */
> +               __be32          ipv4_dst;
> +               __u32           ipv6_dst[4];  /* in6_addr; network order */
> +       };
> +
> +       /* output */
> +       __be16  h_vlan_proto;
> +       __be16  h_vlan_TCI;
> +       __u8    smac[6];     /* ETH_ALEN */
> +       __u8    dmac[6];     /* ETH_ALEN */
> +};
> +
>  #endif /* _UAPI__LINUX_BPF_H__ */
> diff --git a/net/core/filter.c b/net/core/filter.c
> index 0baa715e4699..ca60d2872da4 100644
> --- a/net/core/filter.c
> +++ b/net/core/filter.c
> @@ -60,6 +60,10 @@
>  #include <net/xfrm.h>
>  #include <linux/bpf_trace.h>
>  #include <net/xdp_sock.h>
> +#include <linux/inetdevice.h>
> +#include <net/ip_fib.h>
> +#include <net/flow.h>
> +#include <net/arp.h>
>
>  /**
>   *     sk_filter_trim_cap - run a packet through a socket filter
> @@ -4032,6 +4036,265 @@ static const struct bpf_func_proto bpf_skb_get_xfrm_state_proto = {
>  };
>  #endif
>
> +#if IS_ENABLED(CONFIG_INET) || IS_ENABLED(CONFIG_IPV6)
> +static int bpf_fib_set_fwd_params(struct bpf_fib_lookup *params,
> +                                 const struct neighbour *neigh,
> +                                 const struct net_device *dev)
> +{
> +       memcpy(params->dmac, neigh->ha, ETH_ALEN);
> +       memcpy(params->smac, dev->dev_addr, ETH_ALEN);
> +       params->h_vlan_TCI = 0;
> +       params->h_vlan_proto = 0;
> +
> +       return dev->ifindex;
> +}
> +#endif
> +
> +#if IS_ENABLED(CONFIG_INET)
> +static int bpf_ipv4_fib_lookup(struct net *net, struct bpf_fib_lookup *params,
> +                              u32 flags)
> +{
> +       struct in_device *in_dev;
> +       struct neighbour *neigh;
> +       struct net_device *dev;
> +       struct fib_result res;
> +       struct fib_nh *nh;
> +       struct flowi4 fl4;
> +       int err;
> +
> +       dev = dev_get_by_index_rcu(net, params->ifindex);
> +       if (unlikely(!dev))
> +               return -ENODEV;
> +
> +       /* verify forwarding is enabled on this interface */
> +       in_dev = __in_dev_get_rcu(dev);
> +       if (unlikely(!in_dev || !IN_DEV_FORWARD(in_dev)))
> +               return 0;
> +
> +       if (flags & BPF_FIB_LOOKUP_OUTPUT) {
> +               fl4.flowi4_iif = 1;
> +               fl4.flowi4_oif = params->ifindex;
> +       } else {
> +               fl4.flowi4_iif = params->ifindex;
> +               fl4.flowi4_oif = 0;
> +       }
> +       fl4.flowi4_tos = params->tos & IPTOS_RT_MASK;
> +       fl4.flowi4_scope = RT_SCOPE_UNIVERSE;
> +       fl4.flowi4_flags = 0;
> +
> +       fl4.flowi4_proto = params->l4_protocol;
> +       fl4.daddr = params->ipv4_dst;
> +       fl4.saddr = params->ipv4_src;
> +       fl4.fl4_sport = params->sport;
> +       fl4.fl4_dport = params->dport;
> +
> +       if (flags & BPF_FIB_LOOKUP_DIRECT) {
> +               u32 tbid = l3mdev_fib_table_rcu(dev) ? : RT_TABLE_MAIN;
> +               struct fib_table *tb;
> +
> +               tb = fib_get_table(net, tbid);
> +               if (unlikely(!tb))
> +                       return 0;
> +
> +               err = fib_table_lookup(tb, &fl4, &res, FIB_LOOKUP_NOREF);
> +       } else {
> +               fl4.flowi4_mark = 0;
> +               fl4.flowi4_secid = 0;
> +               fl4.flowi4_tun_key.tun_id = 0;
> +               fl4.flowi4_uid = sock_net_uid(net, NULL);
> +
> +               err = fib_lookup(net, &fl4, &res, FIB_LOOKUP_NOREF);
> +       }
> +
> +       if (err || res.type != RTN_UNICAST)
> +               return 0;
> +
> +       if (res.fi->fib_nhs > 1)
> +               fib_select_path(net, &res, &fl4, NULL);
> +
> +       nh = &res.fi->fib_nh[res.nh_sel];
> +
> +       /* do not handle lwt encaps right now */
> +       if (nh->nh_lwtstate)
> +               return 0;
> +
> +       dev = nh->nh_dev;
> +       if (unlikely(!dev))
> +               return 0;
> +
> +       if (nh->nh_gw)
> +               params->ipv4_dst = nh->nh_gw;
> +
> +       params->rt_metric = res.fi->fib_priority;
> +
> +       /* xdp and cls_bpf programs are run in RCU-bh so
> +        * rcu_read_lock_bh is not needed here
> +        */
> +       neigh = __ipv4_neigh_lookup_noref(dev, (__force u32)params->ipv4_dst);
> +       if (neigh)
> +               return bpf_fib_set_fwd_params(params, neigh, dev);
> +
> +       return 0;
> +}
> +#endif
> +
> +#if IS_ENABLED(CONFIG_IPV6)
> +static int bpf_ipv6_fib_lookup(struct net *net, struct bpf_fib_lookup *params,
> +                              u32 flags)
> +{
> +       struct in6_addr *src = (struct in6_addr *) params->ipv6_src;
> +       struct in6_addr *dst = (struct in6_addr *) params->ipv6_dst;
> +       struct neighbour *neigh;
> +       struct net_device *dev;
> +       struct inet6_dev *idev;
> +       struct fib6_info *f6i;
> +       struct flowi6 fl6;
> +       int strict = 0;
> +       int oif;
> +
> +       /* link local addresses are never forwarded */
> +       if (rt6_need_strict(dst) || rt6_need_strict(src))
> +               return 0;
> +
> +       dev = dev_get_by_index_rcu(net, params->ifindex);
> +       if (unlikely(!dev))
> +               return -ENODEV;
> +
> +       idev = __in6_dev_get_safely(dev);
> +       if (unlikely(!idev || !net->ipv6.devconf_all->forwarding))
> +               return 0;
> +
> +       if (flags & BPF_FIB_LOOKUP_OUTPUT) {
> +               fl6.flowi6_iif = 1;
> +               oif = fl6.flowi6_oif = params->ifindex;
> +       } else {
> +               oif = fl6.flowi6_iif = params->ifindex;
> +               fl6.flowi6_oif = 0;
> +               strict = RT6_LOOKUP_F_HAS_SADDR;
> +       }
> +       fl6.flowlabel = params->flowlabel;
> +       fl6.flowi6_scope = 0;
> +       fl6.flowi6_flags = 0;
> +       fl6.mp_hash = 0;
> +
> +       fl6.flowi6_proto = params->l4_protocol;
> +       fl6.daddr = *dst;
> +       fl6.saddr = *src;
> +       fl6.fl6_sport = params->sport;
> +       fl6.fl6_dport = params->dport;
> +
> +       if (flags & BPF_FIB_LOOKUP_DIRECT) {
> +               u32 tbid = l3mdev_fib_table_rcu(dev) ? : RT_TABLE_MAIN;
> +               struct fib6_table *tb;
> +
> +               tb = ipv6_stub->fib6_get_table(net, tbid);
> +               if (unlikely(!tb))
> +                       return 0;
> +
> +               f6i = ipv6_stub->fib6_table_lookup(net, tb, oif, &fl6, strict);
> +       } else {
> +               fl6.flowi6_mark = 0;
> +               fl6.flowi6_secid = 0;
> +               fl6.flowi6_tun_key.tun_id = 0;
> +               fl6.flowi6_uid = sock_net_uid(net, NULL);
> +
> +               f6i = ipv6_stub->fib6_lookup(net, oif, &fl6, strict);
> +       }
> +
> +       if (unlikely(IS_ERR_OR_NULL(f6i) || f6i == net->ipv6.fib6_null_entry))
> +               return 0;
> +
> +       if (unlikely(f6i->fib6_flags & RTF_REJECT ||
> +           f6i->fib6_type != RTN_UNICAST))
> +               return 0;
> +
> +       if (f6i->fib6_nsiblings && fl6.flowi6_oif == 0)
> +               f6i = ipv6_stub->fib6_multipath_select(net, f6i, &fl6,
> +                                                      fl6.flowi6_oif, NULL,
> +                                                      strict);
> +
> +       if (f6i->fib6_nh.nh_lwtstate)
> +               return 0;
> +
> +       if (f6i->fib6_flags & RTF_GATEWAY)
> +               *dst = f6i->fib6_nh.nh_gw;
> +
> +       dev = f6i->fib6_nh.nh_dev;
> +       params->rt_metric = f6i->fib6_metric;
> +
> +       /* xdp and cls_bpf programs are run in RCU-bh so rcu_read_lock_bh is
> +        * not needed here. Can not use __ipv6_neigh_lookup_noref here
> +        * because we need to get nd_tbl via the stub
> +        */
> +       neigh = ___neigh_lookup_noref(ipv6_stub->nd_tbl, neigh_key_eq128,
> +                                     ndisc_hashfn, dst, dev);
> +       if (neigh)
> +               return bpf_fib_set_fwd_params(params, neigh, dev);
> +
> +       return 0;
> +}
> +#endif
> +
> +BPF_CALL_4(bpf_xdp_fib_lookup, struct xdp_buff *, ctx,
> +          struct bpf_fib_lookup *, params, int, plen, u32, flags)
> +{
> +       if (plen < sizeof(*params))
> +               return -EINVAL;
> +
> +       switch (params->family) {
> +#if IS_ENABLED(CONFIG_INET)
> +       case AF_INET:
> +               return bpf_ipv4_fib_lookup(dev_net(ctx->rxq->dev), params,
> +                                          flags);
> +#endif
> +#if IS_ENABLED(CONFIG_IPV6)
> +       case AF_INET6:
> +               return bpf_ipv6_fib_lookup(dev_net(ctx->rxq->dev), params,
> +                                          flags);
> +#endif
> +       }
> +       return 0;
> +}
> +
> +static const struct bpf_func_proto bpf_xdp_fib_lookup_proto = {
> +       .func           = bpf_xdp_fib_lookup,
> +       .gpl_only       = true,
> +       .ret_type       = RET_INTEGER,
> +       .arg1_type      = ARG_PTR_TO_CTX,
> +       .arg2_type      = ARG_PTR_TO_MEM,
> +       .arg3_type      = ARG_CONST_SIZE,
> +       .arg4_type      = ARG_ANYTHING,
> +};
> +
> +BPF_CALL_4(bpf_skb_fib_lookup, struct sk_buff *, skb,
> +          struct bpf_fib_lookup *, params, int, plen, u32, flags)
> +{
> +       if (plen < sizeof(*params))
> +               return -EINVAL;
> +
> +       switch (params->family) {
> +#if IS_ENABLED(CONFIG_INET)
> +       case AF_INET:
> +               return bpf_ipv4_fib_lookup(dev_net(skb->dev), params, flags);
> +#endif
> +#if IS_ENABLED(CONFIG_IPV6)
> +       case AF_INET6:
> +               return bpf_ipv6_fib_lookup(dev_net(skb->dev), params, flags);
> +#endif
> +       }
> +       return -ENOTSUPP;
> +}
> +
> +static const struct bpf_func_proto bpf_skb_fib_lookup_proto = {
> +       .func           = bpf_skb_fib_lookup,
> +       .gpl_only       = true,
> +       .ret_type       = RET_INTEGER,
> +       .arg1_type      = ARG_PTR_TO_CTX,
> +       .arg2_type      = ARG_PTR_TO_MEM,
> +       .arg3_type      = ARG_CONST_SIZE,
> +       .arg4_type      = ARG_ANYTHING,
> +};
> +
>  static const struct bpf_func_proto *
>  bpf_base_func_proto(enum bpf_func_id func_id)
>  {
> @@ -4181,6 +4444,8 @@ tc_cls_act_func_proto(enum bpf_func_id func_id, const struct bpf_prog *prog)
>         case BPF_FUNC_skb_get_xfrm_state:
>                 return &bpf_skb_get_xfrm_state_proto;
>  #endif
> +       case BPF_FUNC_fib_lookup:
> +               return &bpf_skb_fib_lookup_proto;
>         default:
>                 return bpf_base_func_proto(func_id);
>         }
> @@ -4206,6 +4471,8 @@ xdp_func_proto(enum bpf_func_id func_id, const struct bpf_prog *prog)
>                 return &bpf_xdp_redirect_map_proto;
>         case BPF_FUNC_xdp_adjust_tail:
>                 return &bpf_xdp_adjust_tail_proto;
> +       case BPF_FUNC_fib_lookup:
> +               return &bpf_xdp_fib_lookup_proto;
>         default:
>                 return bpf_base_func_proto(func_id);
>         }
> --
> 2.11.0
>

^ permalink raw reply

* Re: [PATCH net-next] net: dsa: fix added_by_user switchdev notification
From: David Miller @ 2018-05-10 19:27 UTC (permalink / raw)
  To: vivien.didelot
  Cc: netdev, linux-kernel, kernel, petrm, jiri, idosch, ivecera,
	stephen, andrew, f.fainelli, nikolay, bridge
In-Reply-To: <20180509030312.29548-1-vivien.didelot@savoirfairelinux.com>

From: Vivien Didelot <vivien.didelot@savoirfairelinux.com>
Date: Tue,  8 May 2018 23:03:12 -0400

> Commit 161d82de1ff8 ("net: bridge: Notify about !added_by_user FDB
> entries") causes the below oops when bringing up a slave interface,
> because dsa_port_fdb_add is still scheduled, but with a NULL address.
> 
> To fix this, keep the dsa_slave_switchdev_event function agnostic of the
> notified info structure and handle the added_by_user flag in the
> specific dsa_slave_switchdev_event_work function.
 ...
> Fixes: 816a3bed9549 ("switchdev: Add fdb.added_by_user to switchdev notifications")
> Signed-off-by: Vivien Didelot <vivien.didelot@savoirfairelinux.com>

Applied, thanks Vivien.

^ permalink raw reply

* Re: [PATCH] net/9p: fix spelling mistake: "suspsend" -> "suspend"
From: David Miller @ 2018-05-10 19:24 UTC (permalink / raw)
  To: colin.king
  Cc: ericvh, rminnich, lucho, v9fs-developer, netdev, kernel-janitors,
	linux-kernel
In-Reply-To: <20180509094833.15871-1-colin.king@canonical.com>

From: Colin King <colin.king@canonical.com>
Date: Wed,  9 May 2018 10:48:33 +0100

> From: Colin Ian King <colin.king@canonical.com>
> 
> Trivial fix to spelling mistake in dev_warn message text
> 
> Signed-off-by: Colin Ian King <colin.king@canonical.com>

Applied.

^ permalink raw reply

* Re: [PATCH] sctp: fix spelling mistake: "max_retans" -> "max_retrans"
From: David Miller @ 2018-05-10 19:24 UTC (permalink / raw)
  To: colin.king
  Cc: vyasevich, nhorman, marcelo.leitner, linux-sctp, netdev,
	kernel-janitors, linux-kernel
In-Reply-To: <20180508222428.24874-1-colin.king@canonical.com>

From: Colin King <colin.king@canonical.com>
Date: Tue,  8 May 2018 23:24:28 +0100

> From: Colin Ian King <colin.king@canonical.com>
> 
> Trivial fix to spelling mistake in error string
> 
> Signed-off-by: Colin Ian King <colin.king@canonical.com>

Applied.

^ permalink raw reply

* Re: [PATCH] [ATM] firestream: fix spelling mistake: "reseverd" -> "reserved"
From: David Miller @ 2018-05-10 19:24 UTC (permalink / raw)
  To: colin.king
  Cc: 3chas3, linux-atm-general, netdev, kernel-janitors, linux-kernel
In-Reply-To: <20180508220151.20725-1-colin.king@canonical.com>

From: Colin King <colin.king@canonical.com>
Date: Tue,  8 May 2018 23:01:51 +0100

> From: Colin Ian King <colin.king@canonical.com>
> 
> Trivial fix to spelling mistake in res_strings string array
> 
> Signed-off-by: Colin Ian King <colin.king@canonical.com>

Applied.

^ permalink raw reply

* Re: [PATCH net 0/2] qed*: Rdma fixes
From: David Miller @ 2018-05-10 19:22 UTC (permalink / raw)
  To: Michal.Kalderon; +Cc: netdev, linux-rdma, chad.dupuis, Sudarsana.Kalluru
In-Reply-To: <20180508182919.23590-1-Michal.Kalderon@cavium.com>

From: Michal Kalderon <Michal.Kalderon@cavium.com>
Date: Tue, 8 May 2018 21:29:17 +0300

> This patch series include two fixes for bugs related to rdma.
> The first has to do with loading the driver over an iWARP
> device. 
> The second fixes a previous commit that added proper link
> indication for iWARP / RoCE.
> 
> Signed-off-by: Michal Kalderon <Michal.Kalderon@cavium.com>
> Signed-off-by: Sudarsana Kalluru <Sudarsana.Kalluru@cavium.com>

Series applied, thanks.

^ permalink raw reply

* Re: [PATCH net-next] tun: Do SIOCGSKNS out of rtnl_lock()
From: David Miller @ 2018-05-10 19:17 UTC (permalink / raw)
  To: ktkhai; +Cc: jasowang, edumazet, mst, brouer, peterpenkov96, sd, netdev
In-Reply-To: <152579647246.21100.10461408116587658568.stgit@localhost.localdomain>

From: Kirill Tkhai <ktkhai@virtuozzo.com>
Date: Tue, 08 May 2018 19:21:34 +0300

> Since net ns of tun device is assigned on the device creation,
> and it never changes, we do not need to use any lock to get it
> from alive tun.
> 
> Signed-off-by: Kirill Tkhai <ktkhai@virtuozzo.com>

Applied, thank you.

^ permalink raw reply

* Re: [PATCH -next 0/6] net: Update static keys to modern api
From: David Miller @ 2018-05-10 19:14 UTC (permalink / raw)
  To: dave; +Cc: netdev, linux-kernel
In-Reply-To: <20180508160703.8125-1-dave@stgolabs.net>

From: Davidlohr Bueso <dave@stgolabs.net>
Date: Tue,  8 May 2018 09:06:57 -0700

> The following patches update pretty much all core net static key users
> to the modern api. Changes are mostly trivial conversion without affecting
> any semantics. The motivation is a resend of patches 1 and 2 from a while[1]
> back, and the rest are added patches, specific for -net.
> 
> Applies against today's linux-next. Compile tested only.
> 
> [1] lkml.kernel.org/r/20180326210929.5244-1-dave@stgolabs.net

Series applied, thank you.

^ permalink raw reply

* Re: [PATCH net-next 00/10] net: stmmac: Be less dependent on Synopsys ID
From: David Miller @ 2018-05-10 19:06 UTC (permalink / raw)
  To: Jose.Abreu
  Cc: netdev, Joao.Pinto, Vitor.Soares, peppe.cavallaro,
	alexandre.torgue
In-Reply-To: <cover.1525683832.git.joabreu@synopsys.com>

From: Jose Abreu <Jose.Abreu@synopsys.com>
Date: Tue,  8 May 2018 15:45:23 +0100

> This series was based on top of [1]. Please see my reply on [1] and let me
> know if I need to rebase against -next without this dependecy.
 ...
> [1] https://patchwork.ozlabs.org/patch/908618/

V1 of that patch is what ended up in my tree before you even posted
V2.  So if you have to update to what's in V2 you must send in a
relative patch against net-next to fix it up.

Thank you.

^ permalink raw reply

* Re: [PATCH net-next] net:sched: add gkprio scheduler
From: Michel Machado @ 2018-05-10 19:06 UTC (permalink / raw)
  To: Cong Wang
  Cc: Nishanth Devarajan, Jiri Pirko, Jamal Hadi Salim, David Miller,
	Linux Kernel Network Developers, Cody Doucette
In-Reply-To: <CAM_iQpX5iwRN5C7oQ3X9F0viG8Ya15C0gcYFR_CJDxqhXCPYqQ@mail.gmail.com>

On 05/10/2018 01:38 PM, Cong Wang wrote:
> On Wed, May 9, 2018 at 7:09 AM, Michel Machado <michel@digirati.com.br> wrote:
>> On 05/08/2018 10:24 PM, Cong Wang wrote:
>>>
>>> On Tue, May 8, 2018 at 5:59 AM, Michel Machado <michel@digirati.com.br>
>>> wrote:
>>>>>>
>>>>>> Overall it looks good to me, just one thing below:
>>>>>>
>>>>>>> +struct Qdisc_ops gkprio_qdisc_ops __read_mostly = {
>>>>>>> +       .id             =       "gkprio",
>>>>>>> +       .priv_size      =       sizeof(struct gkprio_sched_data),
>>>>>>> +       .enqueue        =       gkprio_enqueue,
>>>>>>> +       .dequeue        =       gkprio_dequeue,
>>>>>>> +       .peek           =       qdisc_peek_dequeued,
>>>>>>> +       .init           =       gkprio_init,
>>>>>>> +       .reset          =       gkprio_reset,
>>>>>>> +       .change         =       gkprio_change,
>>>>>>> +       .dump           =       gkprio_dump,
>>>>>>> +       .destroy        =       gkprio_destroy,
>>>>>>> +       .owner          =       THIS_MODULE,
>>>>>>> +};
>>>>>>
>>>>>>
>>>>>>
>>>>>> You probably want to add Qdisc_class_ops here so that you can
>>>>>> dump the stats of each internal queue.
>>>>
>>>>
>>>>
>>>> Hi Cong,
>>>>
>>>>      In the production scenario we are targeting, this priority queue must
>>>> be
>>>> classless; being classful would only bloat the code for us. I don't see
>>>> making this queue classful as a problem per se, but I suggest leaving it
>>>> as
>>>> a future improvement for when someone can come up with a useful scenario
>>>> for
>>>> it.
>>>
>>>
>>>
>>> Take a look at sch_prio, it is fairly simple since your internal
>>> queues are just an array... Per-queue stats are quite useful
>>> in production, we definitely want to observe which queues are
>>> full which are not.
>>>
>>
>> DSprio cannot add Qdisc_class_ops without a rewrite of other queue
>> disciplines, which doesn't seem desirable. Since the method cops->leaf is
>> required (see register_qdisc()), we would need to replace the array struct
>> sk_buff_head qdiscs[GKPRIO_MAX_PRIORITY] in struct gkprio_sched_data with
>> the array struct Qdisc *queues[GKPRIO_MAX_PRIORITY] to be able to return a
>> Qdisc in dsprio_leaf(). The problem with this change is that Qdisc does not
>> have a method to dequeue from its tail. This new method may not even make
>> sense in other queue disciplines. But without this method, gkprio_enqueue()
>> cannot drop the lowest priority packet when the queue is full and an
>> incoming packet has higher priority.
> 
> Sorry for giving you a bad example. Take a look at sch_fq_codel instead,
> it returns NULL for ->leaf() and maps its internal flows to classes.
> 
> I thought sch_prio uses internal qdiscs, but I was wrong, as you noticed
> it actually exposes them to user via classes.
> 
> My point is never to make it classful, just want to expose the useful stats,
> like how fq_codel dumps its internal flows.
> 
> 
>>
>> Nevertheless, I see your point on being able to observe the distribution of
>> queued packets per priority. A solution for that would be to add the array
>> __u32 qlen[GKPRIO_MAX_PRIORITY] in struct tc_gkprio_qopt. This solution even
>> avoids adding overhead in the critical paths of DSprio. Do you see a better
>> solution?
> 
> I believe you can return NULL for ->leaf() and don't need to worry about
> ->graft() either. ;)

Thank you for pointing sch_fq_codel out. We'll follow its example.

[ ]'s
Michel Machado

^ permalink raw reply

* Re: [PATCH net-next 01/10] net: stmmac: Let descriptor code set skbuff address
From: David Miller @ 2018-05-10 19:06 UTC (permalink / raw)
  To: Jose.Abreu
  Cc: netdev, Joao.Pinto, Vitor.Soares, peppe.cavallaro,
	alexandre.torgue
In-Reply-To: <cfc227a5bce0caf2058a8efe1bf5071b9dec48e0.1525683833.git.joabreu@synopsys.com>

From: Jose Abreu <Jose.Abreu@synopsys.com>
Date: Tue,  8 May 2018 15:45:24 +0100

> Stop using if conditions depending on the GMAC version for setting the
> the descriptor skbuff address and use instead a helper implemented in
> the descriptor files.
> 
> Signed-off-by: Jose Abreu <joabreu@synopsys.com>

With Spectre mitigations, indirect calls are extremely expensive.  Much
more expensive than conditional checks.

And since this is the descriptor setup in the fast paths of the driver,
I advise that you keep these conditionals.

^ permalink raw reply

* RE: [PATCH] e1000e: Ignore TSYNCRXCTL when getting I219 clock attributes
From: Keller, Jacob E @ 2018-05-10 18:42 UTC (permalink / raw)
  To: Benjamin Poirier, Kirsher, Jeffrey T
  Cc: Achim Mildenberger, olouvignes@gmail.com, jayanth@goubiq.com,
	ehabkost@redhat.com, postmodern.mod3@gmail.com,
	Bart.VanAssche@wdc.com, intel-wired-lan@lists.osuosl.org,
	netdev@vger.kernel.org, linux-kernel@vger.kernel.org
In-Reply-To: <20180510072835.5549-1-bpoirier@suse.com>

> -----Original Message-----
> From: Benjamin Poirier [mailto:bpoirier@suse.com]
> Sent: Thursday, May 10, 2018 12:29 AM
> To: Kirsher, Jeffrey T <jeffrey.t.kirsher@intel.com>
> Cc: Keller, Jacob E <jacob.e.keller@intel.com>; Achim Mildenberger
> <admin@fph.physik.uni-karlsruhe.de>; olouvignes@gmail.com;
> jayanth@goubiq.com; ehabkost@redhat.com; postmodern.mod3@gmail.com;
> Bart.VanAssche@wdc.com; intel-wired-lan@lists.osuosl.org;
> netdev@vger.kernel.org; linux-kernel@vger.kernel.org
> Subject: [PATCH] e1000e: Ignore TSYNCRXCTL when getting I219 clock attributes
> 
> There have been multiple reports of crashes that look like
> kernel: RIP: 0010:[<ffffffff8110303f>] timecounter_read+0xf/0x50
> [...]
> kernel: Call Trace:
> kernel:  [<ffffffffa0806b0f>] e1000e_phc_gettime+0x2f/0x60 [e1000e]
> kernel:  [<ffffffffa0806c5d>] e1000e_systim_overflow_work+0x1d/0x80 [e1000e]
> kernel:  [<ffffffff810992c5>] process_one_work+0x155/0x440
> kernel:  [<ffffffff81099e16>] worker_thread+0x116/0x4b0
> kernel:  [<ffffffff8109f422>] kthread+0xd2/0xf0
> kernel:  [<ffffffff8163184f>] ret_from_fork+0x3f/0x70
> 
> These can be traced back to the fact that e1000e_systim_reset() skips the
> timecounter_init() call if e1000e_get_base_timinca() returns -EINVAL, which
> leads to a null deref in timecounter_read().
> 
> Commit 83129b37ef35 ("e1000e: fix systim issues", v4.2-rc1) reworked
> e1000e_get_base_timinca() in such a way that it can return -EINVAL for
> e1000_pch_spt if the SYSCFI bit is not set in TSYNCRXCTL.
> 
> Some experimentation has shown that on I219 (e1000_pch_spt, "MAC: 12")
> adapters, the E1000_TSYNCRXCTL_SYSCFI flag is unstable; TSYNCRXCTL reads
> sometimes don't have the SYSCFI bit set. Retrying the read shortly after
> finds the bit to be set. This was observed at boot (probe) but also link up
> and link down.
> 
> Moreover, the phc (PTP Hardware Clock) seems to operate normally even after
> reads where SYSCFI=0. Therefore, remove this register read and
> unconditionally set the clock parameters.
> 
> Reported-by: Achim Mildenberger <admin@fph.physik.uni-karlsruhe.de>
> Message-Id: <20180425065243.g5mqewg5irkwgwgv@f2>
> Bugzilla: https://bugzilla.suse.com/show_bug.cgi?id=1075876
> Fixes: 83129b37ef35 ("e1000e: fix systim issues")
> Signed-off-by: Benjamin Poirier <bpoirier@suse.com>
> ---
>  drivers/net/ethernet/intel/e1000e/netdev.c | 15 ++++++---------
>  1 file changed, 6 insertions(+), 9 deletions(-)
> 
> diff --git a/drivers/net/ethernet/intel/e1000e/netdev.c
> b/drivers/net/ethernet/intel/e1000e/netdev.c
> index ec4a9759a6f2..3afb1f3b6f91 100644
> --- a/drivers/net/ethernet/intel/e1000e/netdev.c
> +++ b/drivers/net/ethernet/intel/e1000e/netdev.c
> @@ -3546,15 +3546,12 @@ s32 e1000e_get_base_timinca(struct e1000_adapter
> *adapter, u32 *timinca)
>  		}
>  		break;
>  	case e1000_pch_spt:
> -		if (er32(TSYNCRXCTL) & E1000_TSYNCRXCTL_SYSCFI) {
> -			/* Stable 24MHz frequency */
> -			incperiod = INCPERIOD_24MHZ;
> -			incvalue = INCVALUE_24MHZ;
> -			shift = INCVALUE_SHIFT_24MHZ;
> -			adapter->cc.shift = shift;
> -			break;
> -		}
> -		return -EINVAL;
> +		/* Stable 24MHz frequency */
> +		incperiod = INCPERIOD_24MHZ;
> +		incvalue = INCVALUE_24MHZ;
> +		shift = INCVALUE_SHIFT_24MHZ;
> +		adapter->cc.shift = shift;
> +		break;
>  	case e1000_pch_cnp:
>  		if (er32(TSYNCRXCTL) & E1000_TSYNCRXCTL_SYSCFI) {
>  			/* Stable 24MHz frequency */
> --
> 2.16.3

Given testing showing that the clock operates fine regardless of the register read, I think this is probably fine. Normally I believe the register was used to check which frequency was in use, but it doesn't seem to serve that purpose here.

Thanks,
Jake

^ permalink raw reply

* Re: [RFC bpf-next 04/10] bpf: cfg: detect loop use domination information
From: John Fastabend @ 2018-05-10 18:17 UTC (permalink / raw)
  To: Jiong Wang, alexei.starovoitov, daniel; +Cc: netdev, oss-drivers
In-Reply-To: <1525688567-19618-5-git-send-email-jiong.wang@netronome.com>

On 05/07/2018 03:22 AM, Jiong Wang wrote:
> If one bb is dominating its predecessor, then there is loop.
> 
> Signed-off-by: Jiong Wang <jiong.wang@netronome.com>
> ---
>  kernel/bpf/cfg.c      | 22 ++++++++++++++++++++++
>  kernel/bpf/cfg.h      |  1 +
>  kernel/bpf/verifier.c |  8 ++++++++
>  3 files changed, 31 insertions(+)
> 
> diff --git a/kernel/bpf/cfg.c b/kernel/bpf/cfg.c
> index b50937a..90692e4 100644
> --- a/kernel/bpf/cfg.c
> +++ b/kernel/bpf/cfg.c
> @@ -568,6 +568,28 @@ int subprog_build_dom_info(struct bpf_subprog_info *subprog)
>  	return ret;
>  }
>  
> +bool subprog_has_loop(struct bpf_subprog_info *subprog)
> +{
> +	int lane_len = BITS_TO_LONGS(subprog->bb_num - 2);
> +	struct list_head *bb_list = &subprog->bbs;
> +	struct bb_node *bb, *entry_bb;
> +	struct edge_node *e;
> +
> +	entry_bb = entry_bb(bb_list);
> +	bb = bb_next(entry_bb);
> +	list_for_each_entry_from(bb, &exit_bb(bb_list)->l, l)
> +		list_for_each_entry(e, &bb->e_prevs, l) {
> +			struct bb_node *latch = e->src;
> +
> +			if (latch != entry_bb &&
> +			    test_bit(bb->idx,
> +				     subprog->dtree + latch->idx * lane_len))
> +				return true;
> +		}
> +
> +	return false;
> +}
> +

Because we are using this to guard against loops we need to detect
all loops not just reducible loops. And because (assuming my understanding
is correct) Tarjan's algorithm will only detect all loops when the
graph is reducible we need additional tests.

There are a couple options to fix this with varying levels of complexity.
Because I'm using this to build loop info structures to find induction
variables and show termination. After the loop structures are built we
could search for any back-edges not in valid loops. This would be similar
to the existing back-edge detection code but with an extra check to
allow edges that have been validated. I would need to check that this
doesn't have any escapes before actually proposing it though.

The other method would be to properly test for reducibility using one of
the algorithms for this. I think the most intuitive is to remove back-edges
and test the graph is acyclic. This would be run before the dom tree is
built. This is IMO what we should do, it seems the most "correct" way to
do this.

The most complex would be to handle irreducible programs using some of the
more complex methods. I really don't think this is necessary but in theory
at least we could use something like Havlak-Tarjan algorithm and allow
some programs with irreducible loops. This is likely overkill especially
in a first iteration.

Here is a sample that fails without this series, using original back-edge
detection, but is allowed with this patch,

SEC("classifier_tc_mark")
int _tc_mark(struct __sk_buff *ctx)
{
        void *data      = (void *)(unsigned long)ctx->data;
        void *data_end  = (void *)(unsigned long)ctx->data_end;
        void *data_meta = (void *)(unsigned long)ctx->data_meta;
        struct meta_info *meta = data_meta;
        volatile int mark = ctx->mark;

        mark += 1;

        if (meta + 1 > data) {
B:
                mark += 2;

                if (mark < ctx->mark * 3)
                        goto C;
        } else if (meta < data) {
C:
                mark += 1;
                if (mark < 1000)
                        goto  B;
        }

        return TC_ACT_OK;
}

A more concise example could be made but I just hacked on one of the
sample programs. This generates the CFG as follows (I have a patch
on top of your stack to print the CFG and DOM tables)

CFG:  65535[-1,-1] -> 0[0,9]  0[0,9] -> 3[20,20]  0[0,9] -> 1[10,18]  1[10,18] -> 4[21,28]  1[10,18] -> 2[19,19]  2[19,19] -> 5[29,30]  3[20,20] -> 5[29,30]  3[20,20] -> 4[21,28]  4[21,28] -> 1[10,18]  4[21,28] -> 5[29,30]  5[29,30] -> 65534[31,65534] 
DOM:
 1  0  0  0  0  0 
 1  1  0  0  0  0 
 1  1  1  0  0  0 
 1  0  0  1  0  0 
 1  0  0  0  1  0 
 1  0  0  0  0  1 

Here we have the loop 1[10,18]->4[21,28] and the back-edge 4[21,28]->1[10,18].
The notation is #idx[head_insn,tail_insn]. The above can then be imported
into dot notation and graphed if needed.

Jiong, please verify this analysis is correct.

Thanks,
John

^ permalink raw reply

* Re: [PATCH bpf v3] x86/cpufeature: bpf hack for clang not supporting asm goto
From: Borislav Petkov @ 2018-05-10 18:16 UTC (permalink / raw)
  To: Alexei Starovoitov
  Cc: Peter Zijlstra, Yonghong Song, mingo, torvalds, ast, daniel,
	linux-kernel, x86, netdev, kernel-team, Thomas Gleixner
In-Reply-To: <20180510175833.nujqinu26mydeape@ast-mbp.dhcp.thefacebook.com>

On Thu, May 10, 2018 at 10:58:35AM -0700, Alexei Starovoitov wrote:
> libbcc is a library. It's not only used by bcc scripts, but by production
> services that compile bpf programs with clang.

Let me get this straight: libbcc fails to compile because it
includes (through some long include chain) the kernel header
arch/x86/include/asm/cpufeature.h.

Is that it?

-- 
Regards/Gruss,
    Boris.

Good mailing practices for 400: avoid top-posting and trim the reply.

^ permalink raw reply

* Re: [PATCH ipsec-next] xfrm: Allow Output Mark to be Updated Using UPDSA
From: Nathan Harold @ 2018-05-10 18:14 UTC (permalink / raw)
  To: Eyal Birger; +Cc: netdev, Steffen Klassert, tobias
In-Reply-To: <20180510084459.598ce663@jimi>

That makes sense to me; the restriction about which you inquire is a
practical one rather than a philosophical one, which I will be happy
to see lifted.

With the new set_mark, a non-zero mask will indicate that the caller
has a set an "explicit" zero mark, which sidesteps the
currently-ambiguous situation; the logic can then become "if (set_mark
|| set_mark_mask) { // update mark and update mask}". There is a
question of the behavior for a caller who sets a set_mark and
set_mark_mask, then subsequently calls UPDSA with only a mark
(omitting the mask, or with explicit set_mask == 0). I think it's fair
and appropriate the mask be re-set to 0xFFFFFFFF (to avoid the
special-case of (if new_set_mask == 0 && set_mask != 0xFFFFFFFF). Of
course, this means that the inability to return to zero limitation
that I currently mention as being on the output_mark would transfer
under that proposal to the set_mark_mask. All of this is fix-able by
having the update take into account the presence or absence of the
XFRMAs sent rather than just looking at a built xfrm_state, but I'm
couldn't fathom any use cases for reverting the mark scheme back to an
"unused" state while the SA remains ACTIVE, so I think simpler is
better (same reasoning applied to the current change).

-Nathan

On Wed, May 9, 2018 at 10:44 PM, Eyal Birger <eyal.birger@gmail.com> wrote:
> Hi Nathan,
>
> On Wed,  9 May 2018 13:46:26 -0700
> Nathan Harold <nharold@google.com> wrote:
>
>> Allow UPDSA to change output_mark to permit
>> policy separation of packet routing decisions from
>> SA keying in systems that use mark-based routing.
>>
>> In the output_mark, used as a routing and firewall
>> mark for outbound packets, is made update-able which
>> allows routing decisions to be handled independently
>> of keying/SA creation. To maintain consistency with
>> other optional attributes, the output mark is only
>> updated if sent with a non-zero value. Once set, the
>> output mark may not be reset to zero, which ensures
>> that updating the SA does not require the mark to
>> be re-sent to avoid the value being clobbered.
>
> There is an attempt to extend the 'output_mark' to support the input
> direction and masking.
>
> In the proposed implementation, output_mark is converted to type 'struct
> xfrm_mark' where the semantics are as follows:
>
> - If mark is given by XFRMA_OUTPUT_MARK (renamed to XFRMA_SET_MARK)
>   then a new XFRMA_SET_MARK_MASK attribute is consulted to set the mask
>   value
> - if no XFRMA_SET_MARK_MASK attribute is provided, the mask is set to
>   0xffffffff
>
> Therefore, if the mask value is 0, we can regard the mark as 'not
> given'.
>
> My question is, in the context of this patch, it seems that the
> "Once set, the output mark may not be reset to zero" restriction may be
> lifted in favor of updating the mark only if the new mask is non zero.
>
> Does this make sense to you?
> Eyal

^ permalink raw reply

* Re: [PATCH net] tun: fix use after free for ptr_ring
From: Cong Wang @ 2018-05-10 18:08 UTC (permalink / raw)
  To: Jason Wang
  Cc: Linux Kernel Network Developers, LKML, Eric Dumazet,
	Michael S . Tsirkin
In-Reply-To: <1525849198-9786-1-git-send-email-jasowang@redhat.com>

On Tue, May 8, 2018 at 11:59 PM, Jason Wang <jasowang@redhat.com> wrote:
> We used to initialize ptr_ring during TUNSETIFF, this is because its
> size depends on the tx_queue_len of netdevice. And we try to clean it
> up when socket were detached from netdevice. A race were spotted when
> trying to do uninit during a read which will lead a use after free for
> pointer ring. Solving this by always initialize a zero size ptr_ring
> in open() and do resizing during TUNSETIFF, and then we can safely do
> cleanup during close(). With this, there's no need for the workaround
> that was introduced by commit 4df0bfc79904 ("tun: fix a memory leak
> for tfile->tx_array").
>

Ah, I didn't know ptr_ring_init(0) could work... Nice patch!
Except one thing below.


> diff --git a/drivers/net/tun.c b/drivers/net/tun.c
> index ef33950..298cb96 100644
> --- a/drivers/net/tun.c
> +++ b/drivers/net/tun.c
> @@ -681,15 +681,6 @@ static void tun_queue_purge(struct tun_file *tfile)
>         skb_queue_purge(&tfile->sk.sk_error_queue);
>  }
>
> -static void tun_cleanup_tx_ring(struct tun_file *tfile)
> -{
> -       if (tfile->tx_ring.queue) {
> -               ptr_ring_cleanup(&tfile->tx_ring, tun_ptr_free);
> -               xdp_rxq_info_unreg(&tfile->xdp_rxq);
> -               memset(&tfile->tx_ring, 0, sizeof(tfile->tx_ring));
> -       }
> -}


I don't think you can totally remove ptr_ring_cleanup(), it should be
called unconditionally with your ptr_ring_init(0) trick, right?

^ permalink raw reply

* [PATCH net] ipv4: fix memory leaks in udp_sendmsg, ping_v4_sendmsg
From: Andrey Ignatov @ 2018-05-10 17:59 UTC (permalink / raw)
  To: netdev, davem; +Cc: Andrey Ignatov, ast, eric.dumazet, kernel-team

Fix more memory leaks in ip_cmsg_send() callers. Part of them were fixed
earlier in 919483096bfe.

* udp_sendmsg one was there since the beginning when linux sources were
  first added to git;
* ping_v4_sendmsg one was copy/pasted in c319b4d76b9e.

Whenever return happens in udp_sendmsg() or ping_v4_sendmsg() IP options
have to be freed if they were allocated previously.

Add label so that future callers (if any) can use it instead of kfree()
before return that is easy to forget.

Fixes: c319b4d76b9e (net: ipv4: add IPPROTO_ICMP socket kind)
Signed-off-by: Andrey Ignatov <rdna@fb.com>
---
 net/ipv4/ping.c | 7 +++++--
 net/ipv4/udp.c  | 7 +++++--
 2 files changed, 10 insertions(+), 4 deletions(-)

diff --git a/net/ipv4/ping.c b/net/ipv4/ping.c
index 05e47d7..56a0106 100644
--- a/net/ipv4/ping.c
+++ b/net/ipv4/ping.c
@@ -775,8 +775,10 @@ static int ping_v4_sendmsg(struct sock *sk, struct msghdr *msg, size_t len)
 	ipc.addr = faddr = daddr;
 
 	if (ipc.opt && ipc.opt->opt.srr) {
-		if (!daddr)
-			return -EINVAL;
+		if (!daddr) {
+			err = -EINVAL;
+			goto out_free;
+		}
 		faddr = ipc.opt->opt.faddr;
 	}
 	tos = get_rttos(&ipc, inet);
@@ -842,6 +844,7 @@ static int ping_v4_sendmsg(struct sock *sk, struct msghdr *msg, size_t len)
 
 out:
 	ip_rt_put(rt);
+out_free:
 	if (free)
 		kfree(ipc.opt);
 	if (!err) {
diff --git a/net/ipv4/udp.c b/net/ipv4/udp.c
index 24b5c59..dc4fc46 100644
--- a/net/ipv4/udp.c
+++ b/net/ipv4/udp.c
@@ -952,8 +952,10 @@ int udp_sendmsg(struct sock *sk, struct msghdr *msg, size_t len)
 	sock_tx_timestamp(sk, ipc.sockc.tsflags, &ipc.tx_flags);
 
 	if (ipc.opt && ipc.opt->opt.srr) {
-		if (!daddr)
-			return -EINVAL;
+		if (!daddr) {
+			err = -EINVAL;
+			goto out_free;
+		}
 		faddr = ipc.opt->opt.faddr;
 		connected = 0;
 	}
@@ -1074,6 +1076,7 @@ int udp_sendmsg(struct sock *sk, struct msghdr *msg, size_t len)
 
 out:
 	ip_rt_put(rt);
+out_free:
 	if (free)
 		kfree(ipc.opt);
 	if (!err)
-- 
2.9.5

^ permalink raw reply related

* Re: [PATCH bpf v3] x86/cpufeature: bpf hack for clang not supporting asm goto
From: Alexei Starovoitov @ 2018-05-10 17:58 UTC (permalink / raw)
  To: Borislav Petkov
  Cc: Peter Zijlstra, Yonghong Song, mingo, torvalds, ast, daniel,
	linux-kernel, x86, netdev, kernel-team, Thomas Gleixner
In-Reply-To: <20180510162028.GB16130@pd.tnic>

On Thu, May 10, 2018 at 06:20:28PM +0200, Borislav Petkov wrote:
> On Thu, May 10, 2018 at 08:52:42AM -0700, Alexei Starovoitov wrote:
> > That makes me wonder what happened with "we do not break user space" rule?
> 
> As someone already pointed out on IRC, arch/x86/include/asm/cpufeature.h
> is solely a kernel header so nothing but kernel should include it. So
> forget the userspace breakage "argument".

libbcc is a library. It's not only used by bcc scripts, but by production
services that compile bpf programs with clang.
Without this kernel fix the companies won't be able to upgrade the kernel.
Even if we could somehow hack libbcc and workaround in user space
(which is not possible as already explained), it's a long dependency chain
to recompile and upgrade core services before upgrading the kernel
which makes even theoretical user space fix impractical.
I see no option, but to fix the kernel.
Regardless whether it's called user space breakage or kernel breakage.

^ permalink raw reply

* Re: [PATCH bpf v3] x86/cpufeature: bpf hack for clang not supporting asm goto
From: Gianluca Borello @ 2018-05-10 17:57 UTC (permalink / raw)
  To: bp
  Cc: Alexei Starovoitov, peterz, Yonghong Song, mingo, torvalds,
	Alexei Starovoitov, Daniel Borkmann, linux-kernel, x86,
	Linux Networking Development Mailing List, kernel-team, tglx
In-Reply-To: <20180510162028.GB16130@pd.tnic>

On Thu, May 10, 2018 at 9:28 AM Borislav Petkov <bp@alien8.de> wrote:

> As someone already pointed out on IRC, arch/x86/include/asm/cpufeature.h
> is solely a kernel header so nothing but kernel should include it. So
> forget the userspace breakage "argument".

For what is worth, I have the same exact problem in a relatively popular
open source system call tracer, and my attempt to fix the issue from user
space has been:
https://github.com/draios/sysdig/commit/2958eb1d52e047f4b93d1238be803e7c405bdec2

While I can definitely live with that (and I'd be happy to submit a patch
to samples/ following the same approach) and absolutely respect the
technical authority of the reviewers here, it would be nice to recognize
that these changes actually affect users to a certain degree, even if from
a technical point of view they don't break userspace.

 From a practical point of view, BPF is widely used from userspace programs
to access some kernel data structures to gather visibility information, and
even the simplest use case, such as including linux/sched.h to access some
task_struct members, ends up pulling in arch/x86/include/asm/cpufeature.h,
thus (ab)using that file. Adding these quirks definitely increases the
complexity a developer needs to keep in mind in order to take advantage of
a BPF based instrumentation.

Thanks

^ permalink raw reply

* Re: [PATCH net-next 1/1] net sched actions: fix invalid pointer dereferencing if skbedit flags missing
From: Cong Wang @ 2018-05-10 17:54 UTC (permalink / raw)
  To: Roman Mashak
  Cc: David Miller, Linux Kernel Network Developers, kernel,
	Jamal Hadi Salim, Jiri Pirko, Alexander Duyck
In-Reply-To: <1525900723-17625-1-git-send-email-mrv@mojatatu.com>

On Wed, May 9, 2018 at 2:18 PM, Roman Mashak <mrv@mojatatu.com> wrote:
> The caller calls action's ->init() and passes pointer to "struct tc_action *a",
> which later may initialized to point at the existing action, otherwise
> "struct tc_action *a" is still invalid, and therefore dereferencing it is error.

So technically we just need to check if(exists) before calling
tcf_idr_release(), right?


>
> So checking flags should be done as early as possible, before idr lookups.

In theory, this could break the case of binding an existing action.
But in practice it seems nonsense to pass invalid flags in this case
either, so probably this is okay.

BTW, this should be a candidate for -net and possibly -stable too.

^ permalink raw reply

page: next (older) | prev (newer) | latest
- recent:[subjects (threaded)|topics (new)|topics (active)]

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox