Netdev List
 help / color / mirror / Atom feed
* Re: [RFC PATCH 4/5] mlx4: add support for fast rx drop bpf program
From: Johannes Berg @ 2016-04-02 18:40 UTC (permalink / raw)
  To: Brenden Blanco, davem
  Cc: netdev, tom, alexei.starovoitov, gerlitz, daniel, john.fastabend,
	brouer
In-Reply-To: <1459560118-5582-5-git-send-email-bblanco@plumgrid.com>

On Fri, 2016-04-01 at 18:21 -0700, Brenden Blanco wrote:

> +static int mlx4_bpf_set(struct net_device *dev, int fd)
> +{
[...]
> +		if (prog->type != BPF_PROG_TYPE_PHYS_DEV) {
> +			bpf_prog_put(prog);
> +			return -EINVAL;
> +		}
> +	}

Why wouldn't this check be done in the generic code that calls
ndo_bpf_set()?

johannes

^ permalink raw reply

* Re: Question on rhashtable in worst-case scenario.
From: Johannes Berg @ 2016-04-02 18:33 UTC (permalink / raw)
  To: Herbert Xu
  Cc: Ben Greear, David Miller, linux-kernel, linux-wireless, netdev,
	tgraf
In-Reply-To: <20160402014653.GA19235@gondor.apana.org.au>

On Sat, 2016-04-02 at 09:46 +0800, Herbert Xu wrote:
> On Fri, Apr 01, 2016 at 11:34:10PM +0200, Johannes Berg wrote:
> > 
> > 
> > I was thinking about that one - it's not obvious to me from the
> > code
> > how this "explicitly checking for dups" would be done or let's say
> > how
> > rhashtable differentiates. But since it seems to work for Ben until
> > hitting a certain number of identical keys, surely that's just me
> > not
> > understanding the code rather than anything else :)
> It's really simple, rhashtable_insert_fast does not check for dups
> while rhashtable_lookup_insert_* do.

Oh, ok, thanks :)

johannes

^ permalink raw reply

* [PATCH v2] net: remove unimplemented RTNH_F_PERVASIVE
From: Quentin Armitage @ 2016-04-02 16:51 UTC (permalink / raw)
  To: David S. Miller, Alexey Kuznetsov, James Morris,
	Hideaki YOSHIFUJI, Patrick McHardy, netdev
  Cc: Quentin Armitage
In-Reply-To: <56FFED8B.7010701@cogentembedded.com>

Linux 2.1.68 introduced RTNH_F_PERVASIVE, but it had no implementation
and couldn't be enabled since the required config parameter wasn't in
any Kconfig file (see commit d088dde7b196 ("ipv4: obsolete config in
kernel source (IP_ROUTE_PERVASIVE)")).

This commit removes all remaining references to RTNH_F_PERVASIVE.
Although this will cause userspace applications that were using the
flag to fail to build, they will be alerted to the fact that using
RTNH_F_PERVASIVE was not achieving anything.

Signed-off-by: Quentin Armitage <quentin@armitage.org.uk>
---
 include/uapi/linux/rtnetlink.h |    2 +-
 net/decnet/dn_fib.c            |    2 +-
 net/ipv4/fib_semantics.c       |    2 +-
 3 files changed, 3 insertions(+), 3 deletions(-)

diff --git a/include/uapi/linux/rtnetlink.h b/include/uapi/linux/rtnetlink.h
index ca764b5..58e6ba0 100644
--- a/include/uapi/linux/rtnetlink.h
+++ b/include/uapi/linux/rtnetlink.h
@@ -339,7 +339,7 @@ struct rtnexthop {
 /* rtnh_flags */
 
 #define RTNH_F_DEAD		1	/* Nexthop is dead (used by multipath)	*/
-#define RTNH_F_PERVASIVE	2	/* Do recursive gateway lookup	*/
+					/* 2 was RTNH_F_PERVASIVE (never implemented) */
 #define RTNH_F_ONLINK		4	/* Gateway is forced on link	*/
 #define RTNH_F_OFFLOAD		8	/* offloaded route */
 #define RTNH_F_LINKDOWN		16	/* carrier-down on nexthop */
diff --git a/net/decnet/dn_fib.c b/net/decnet/dn_fib.c
index df48034..c53aa74 100644
--- a/net/decnet/dn_fib.c
+++ b/net/decnet/dn_fib.c
@@ -243,7 +243,7 @@ out:
 	} else {
 		struct net_device *dev;
 
-		if (nh->nh_flags&(RTNH_F_PERVASIVE|RTNH_F_ONLINK))
+		if (nh->nh_flags & RTNH_F_ONLINK)
 			return -EINVAL;
 
 		dev = __dev_get_by_index(&init_net, nh->nh_oif);
diff --git a/net/ipv4/fib_semantics.c b/net/ipv4/fib_semantics.c
index d97268e..3883860 100644
--- a/net/ipv4/fib_semantics.c
+++ b/net/ipv4/fib_semantics.c
@@ -803,7 +803,7 @@ static int fib_check_nh(struct fib_config *cfg, struct fib_info *fi,
 	} else {
 		struct in_device *in_dev;
 
-		if (nh->nh_flags & (RTNH_F_PERVASIVE | RTNH_F_ONLINK))
+		if (nh->nh_flags & RTNH_F_ONLINK)
 			return -EINVAL;
 
 		rcu_read_lock();
-- 
1.7.7.6

^ permalink raw reply related

* Re: [RFC PATCH 0/5] Add driver bpf hook for early packet drop
From: Tom Herbert @ 2016-04-02 16:47 UTC (permalink / raw)
  To: Brenden Blanco
  Cc: David S. Miller, Linux Kernel Network Developers,
	Alexei Starovoitov, gerlitz, Daniel Borkmann, john fastabend,
	Jesper Dangaard Brouer
In-Reply-To: <1459560118-5582-1-git-send-email-bblanco@plumgrid.com>

On Fri, Apr 1, 2016 at 9:21 PM, Brenden Blanco <bblanco@plumgrid.com> wrote:
> This patch set introduces new infrastructure for programmatically
> processing packets in the earliest stages of rx, as part of an effort
> others are calling Express Data Path (XDP) [1]. Start this effort by
> introducing a new bpf program type for early packet filtering, before even
> an skb has been allocated.
>
> With this, hope to enable line rate filtering, with this initial
> implementation providing drop/allow action only.
>
> Patch 1 introduces the new prog type and helpers for validating the bpf
> program. A new userspace struct is defined containing only len as a field,
> with others to follow in the future.
> In patch 2, create a new ndo to pass the fd to support drivers.
> In patch 3, expose a new rtnl option to userspace.
> In patch 4, enable support in mlx4 driver. No skb allocation is required,
> instead a static percpu skb is kept in the driver and minimally initialized
> for each driver frag.
> In patch 5, create a sample drop and count program. With single core,
> achieved ~14.5 Mpps drop rate on a 40G mlx4. This includes packet data
> access, bpf array lookup, and increment.
>
Very nice! Do you think this hook will be sufficient to implement a
fast forward patch also?

Tom

> Interestingly, accessing packet data from the program did not have a
> noticeable impact on performance. Even so, future enhancements to
> prefetching / batching / page-allocs should hopefully improve the
> performance in this path.
>
> [1] https://github.com/iovisor/bpf-docs/blob/master/Express_Data_Path.pdf
>
> Brenden Blanco (5):
>   bpf: add PHYS_DEV prog type for early driver filter
>   net: add ndo to set bpf prog in adapter rx
>   rtnl: add option for setting link bpf prog
>   mlx4: add support for fast rx drop bpf program
>   Add sample for adding simple drop program to link
>
>  drivers/net/ethernet/mellanox/mlx4/en_netdev.c |  61 ++++++++++
>  drivers/net/ethernet/mellanox/mlx4/en_rx.c     |  18 +++
>  drivers/net/ethernet/mellanox/mlx4/mlx4_en.h   |   2 +
>  include/linux/netdevice.h                      |   8 ++
>  include/uapi/linux/bpf.h                       |   5 +
>  include/uapi/linux/if_link.h                   |   1 +
>  kernel/bpf/verifier.c                          |   1 +
>  net/core/dev.c                                 |  12 ++
>  net/core/filter.c                              |  68 +++++++++++
>  net/core/rtnetlink.c                           |  10 ++
>  samples/bpf/Makefile                           |   4 +
>  samples/bpf/bpf_load.c                         |   8 ++
>  samples/bpf/netdrvx1_kern.c                    |  26 +++++
>  samples/bpf/netdrvx1_user.c                    | 155 +++++++++++++++++++++++++
>  14 files changed, 379 insertions(+)
>  create mode 100644 samples/bpf/netdrvx1_kern.c
>  create mode 100644 samples/bpf/netdrvx1_user.c
>
> --
> 2.8.0
>

^ permalink raw reply

* Re: [RFC PATCH 1/5] bpf: add PHYS_DEV prog type for early driver filter
From: Tom Herbert @ 2016-04-02 16:39 UTC (permalink / raw)
  To: Brenden Blanco
  Cc: David S. Miller, Linux Kernel Network Developers,
	Alexei Starovoitov, gerlitz, Daniel Borkmann, john fastabend,
	Jesper Dangaard Brouer
In-Reply-To: <1459560118-5582-2-git-send-email-bblanco@plumgrid.com>

On Fri, Apr 1, 2016 at 9:21 PM, Brenden Blanco <bblanco@plumgrid.com> wrote:
> Add a new bpf prog type that is intended to run in early stages of the
> packet rx path. Only minimal packet metadata will be available, hence a new
> context type, struct xdp_metadata, is exposed to userspace. So far only
> expose the readable packet length, and only in read mode.
>
This would eventually be a generic abstraction of receive descriptors?

> The PHYS_DEV name is chosen to represent that the program is meant only
> for physical adapters, rather than all netdevs.
>
Is there a hard restriction that this could only work with physical devices?

> While the user visible struct is new, the underlying context must be
> implemented as a minimal skb in order for the packet load_* instructions
> to work. The skb filled in by the driver must have skb->len, skb->head,
> and skb->data set, and skb->data_len == 0.
>
> Signed-off-by: Brenden Blanco <bblanco@plumgrid.com>
> ---
>  include/uapi/linux/bpf.h |  5 ++++
>  kernel/bpf/verifier.c    |  1 +
>  net/core/filter.c        | 68 ++++++++++++++++++++++++++++++++++++++++++++++++
>  3 files changed, 74 insertions(+)
>
> diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
> index 924f537..b8a4ef2 100644
> --- a/include/uapi/linux/bpf.h
> +++ b/include/uapi/linux/bpf.h
> @@ -92,6 +92,7 @@ enum bpf_prog_type {
>         BPF_PROG_TYPE_KPROBE,
>         BPF_PROG_TYPE_SCHED_CLS,
>         BPF_PROG_TYPE_SCHED_ACT,
> +       BPF_PROG_TYPE_PHYS_DEV,
>  };
>
>  #define BPF_PSEUDO_MAP_FD      1
> @@ -367,6 +368,10 @@ struct __sk_buff {
>         __u32 tc_classid;
>  };
>
> +struct xdp_metadata {
> +       __u32 len;
> +};
> +
>  struct bpf_tunnel_key {
>         __u32 tunnel_id;
>         union {
> diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c
> index 2e08f8e..804ca70 100644
> --- a/kernel/bpf/verifier.c
> +++ b/kernel/bpf/verifier.c
> @@ -1340,6 +1340,7 @@ static bool may_access_skb(enum bpf_prog_type type)
>         case BPF_PROG_TYPE_SOCKET_FILTER:
>         case BPF_PROG_TYPE_SCHED_CLS:
>         case BPF_PROG_TYPE_SCHED_ACT:
> +       case BPF_PROG_TYPE_PHYS_DEV:
>                 return true;
>         default:
>                 return false;
> diff --git a/net/core/filter.c b/net/core/filter.c
> index b7177d0..c417db6 100644
> --- a/net/core/filter.c
> +++ b/net/core/filter.c
> @@ -2018,6 +2018,12 @@ tc_cls_act_func_proto(enum bpf_func_id func_id)
>         }
>  }
>
> +static const struct bpf_func_proto *
> +phys_dev_func_proto(enum bpf_func_id func_id)
> +{
> +       return sk_filter_func_proto(func_id);
> +}
> +
>  static bool __is_valid_access(int off, int size, enum bpf_access_type type)
>  {
>         /* check bounds */
> @@ -2073,6 +2079,36 @@ static bool tc_cls_act_is_valid_access(int off, int size,
>         return __is_valid_access(off, size, type);
>  }
>
> +static bool __is_valid_xdp_access(int off, int size,
> +                                 enum bpf_access_type type)
> +{
> +       if (off < 0 || off >= sizeof(struct xdp_metadata))
> +               return false;
> +
> +       if (off % size != 0)
> +               return false;
> +
> +       if (size != 4)
> +               return false;
> +
> +       return true;
> +}
> +
> +static bool phys_dev_is_valid_access(int off, int size,
> +                                    enum bpf_access_type type)
> +{
> +       if (type == BPF_WRITE)
> +               return false;
> +
> +       switch (off) {
> +       case offsetof(struct xdp_metadata, len):
> +               break;
> +       default:
> +               return false;
> +       }
> +       return __is_valid_xdp_access(off, size, type);
> +}
> +
>  static u32 bpf_net_convert_ctx_access(enum bpf_access_type type, int dst_reg,
>                                       int src_reg, int ctx_off,
>                                       struct bpf_insn *insn_buf,
> @@ -2210,6 +2246,26 @@ static u32 bpf_net_convert_ctx_access(enum bpf_access_type type, int dst_reg,
>         return insn - insn_buf;
>  }
>
> +static u32 bpf_phys_dev_convert_ctx_access(enum bpf_access_type type,
> +                                          int dst_reg, int src_reg,
> +                                          int ctx_off,
> +                                          struct bpf_insn *insn_buf,
> +                                          struct bpf_prog *prog)
> +{
> +       struct bpf_insn *insn = insn_buf;
> +
> +       switch (ctx_off) {
> +       case offsetof(struct xdp_metadata, len):
> +               BUILD_BUG_ON(FIELD_SIZEOF(struct sk_buff, len) != 4);
> +
> +               *insn++ = BPF_LDX_MEM(BPF_W, dst_reg, src_reg,
> +                                     offsetof(struct sk_buff, len));
> +               break;
> +       }
> +
> +       return insn - insn_buf;
> +}
> +
>  static const struct bpf_verifier_ops sk_filter_ops = {
>         .get_func_proto = sk_filter_func_proto,
>         .is_valid_access = sk_filter_is_valid_access,
> @@ -2222,6 +2278,12 @@ static const struct bpf_verifier_ops tc_cls_act_ops = {
>         .convert_ctx_access = bpf_net_convert_ctx_access,
>  };
>
> +static const struct bpf_verifier_ops phys_dev_ops = {
> +       .get_func_proto = phys_dev_func_proto,
> +       .is_valid_access = phys_dev_is_valid_access,
> +       .convert_ctx_access = bpf_phys_dev_convert_ctx_access,
> +};
> +
>  static struct bpf_prog_type_list sk_filter_type __read_mostly = {
>         .ops = &sk_filter_ops,
>         .type = BPF_PROG_TYPE_SOCKET_FILTER,
> @@ -2237,11 +2299,17 @@ static struct bpf_prog_type_list sched_act_type __read_mostly = {
>         .type = BPF_PROG_TYPE_SCHED_ACT,
>  };
>
> +static struct bpf_prog_type_list phys_dev_type __read_mostly = {
> +       .ops = &phys_dev_ops,
> +       .type = BPF_PROG_TYPE_PHYS_DEV,
> +};
> +
>  static int __init register_sk_filter_ops(void)
>  {
>         bpf_register_prog_type(&sk_filter_type);
>         bpf_register_prog_type(&sched_cls_type);
>         bpf_register_prog_type(&sched_act_type);
> +       bpf_register_prog_type(&phys_dev_type);
>
>         return 0;
>  }
> --
> 2.8.0
>

^ permalink raw reply

* Re: [PATCH v2 net-next 01/11] net: add SOCK_RCU_FREE socket flag
From: Tom Herbert @ 2016-04-02 16:35 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: David S . Miller, netdev, Eric Dumazet, Willem de Bruijn,
	Neal Cardwell, Maciej Żenczykowski
In-Reply-To: <1459525942-30399-2-git-send-email-edumazet@google.com>

On Fri, Apr 1, 2016 at 11:52 AM, Eric Dumazet <edumazet@google.com> wrote:
> We want a generic way to insert an RCU grace period before socket
> freeing for cases where RCU_SLAB_DESTROY_BY_RCU is adding too
> much overhead.
>
> SLAB_DESTROY_BY_RCU strict rules force us to take a reference
> on the socket sk_refcnt, and it is a performance problem for UDP
> encapsulation, or TCP synflood behavior, as many CPUs might
> attempt the atomic operations on a shared sk_refcnt
>
> UDP sockets and TCP listeners can set SOCK_RCU_FREE so that their
> lookup can use traditional RCU rules, without refcount changes.
> They can set the flag only once hashed and visible by other cpus.
>
> Signed-off-by: Eric Dumazet <edumazet@google.com>
> Cc: Tom Herbert <tom@herbertland.com>
> ---
>  include/net/sock.h |  2 ++
>  net/core/sock.c    | 14 +++++++++++++-
>  2 files changed, 15 insertions(+), 1 deletion(-)
>
> diff --git a/include/net/sock.h b/include/net/sock.h
> index 255d3e03727b..c88785a3e76c 100644
> --- a/include/net/sock.h
> +++ b/include/net/sock.h
> @@ -438,6 +438,7 @@ struct sock {
>                                                   struct sk_buff *skb);
>         void                    (*sk_destruct)(struct sock *sk);
>         struct sock_reuseport __rcu     *sk_reuseport_cb;
> +       struct rcu_head         sk_rcu;
>  };
>
>  #define __sk_user_data(sk) ((*((void __rcu **)&(sk)->sk_user_data)))
> @@ -720,6 +721,7 @@ enum sock_flags {
>                      */
>         SOCK_FILTER_LOCKED, /* Filter cannot be changed anymore */
>         SOCK_SELECT_ERR_QUEUE, /* Wake select on error queue */
> +       SOCK_RCU_FREE, /* wait rcu grace period in sk_destruct() */
>  };
>
>  #define SK_FLAGS_TIMESTAMP ((1UL << SOCK_TIMESTAMP) | (1UL << SOCK_TIMESTAMPING_RX_SOFTWARE))
> diff --git a/net/core/sock.c b/net/core/sock.c
> index b67b9aedb230..238a94f879ca 100644
> --- a/net/core/sock.c
> +++ b/net/core/sock.c
> @@ -1418,8 +1418,12 @@ struct sock *sk_alloc(struct net *net, int family, gfp_t priority,
>  }
>  EXPORT_SYMBOL(sk_alloc);
>
> -void sk_destruct(struct sock *sk)
> +/* Sockets having SOCK_RCU_FREE will call this function after one RCU
> + * grace period. This is the case for UDP sockets and TCP listeners.
> + */
> +static void __sk_destruct(struct rcu_head *head)
>  {
> +       struct sock *sk = container_of(head, struct sock, sk_rcu);
>         struct sk_filter *filter;
>
>         if (sk->sk_destruct)
> @@ -1448,6 +1452,14 @@ void sk_destruct(struct sock *sk)
>         sk_prot_free(sk->sk_prot_creator, sk);
>  }
>
> +void sk_destruct(struct sock *sk)
> +{
> +       if (sock_flag(sk, SOCK_RCU_FREE))
> +               call_rcu(&sk->sk_rcu, __sk_destruct);
> +       else
> +               __sk_destruct(&sk->sk_rcu);
> +}
> +
>  static void __sk_free(struct sock *sk)
>  {
>         if (unlikely(sock_diag_has_destroy_listeners(sk) && sk->sk_net_refcnt))
> --
> 2.8.0.rc3.226.g39d4020
>

Tested-by: Tom Herbert <tom@herbertland.com>

^ permalink raw reply

* Re: [PATCH v2 net-next 02/11] udp: no longer use SLAB_DESTROY_BY_RCU
From: Tom Herbert @ 2016-04-02 16:34 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: David S . Miller, netdev, Eric Dumazet, Willem de Bruijn,
	Neal Cardwell, Maciej Żenczykowski
In-Reply-To: <1459525942-30399-3-git-send-email-edumazet@google.com>

On Fri, Apr 1, 2016 at 11:52 AM, Eric Dumazet <edumazet@google.com> wrote:
> Tom Herbert would like not touching UDP socket refcnt for encapsulated
> traffic. For this to happen, we need to use normal RCU rules, with a grace
> period before freeing a socket. UDP sockets are not short lived in the
> high usage case, so the added cost of call_rcu() should not be a concern.
>
> This actually removes a lot of complexity in UDP stack.
>
> Multicast receives no longer need to hold a bucket spinlock.
>
> Note that ip early demux still needs to take a reference on the socket.
>
> Same remark for functions used by xt_socket and xt_PROXY netfilter modules,
> but this might be changed later.
>
> Performance for a single UDP socket receiving flood traffic from
> many RX queues/cpus.
>
> Simple udp_rx using simple recvfrom() loop :
> 438 kpps instead of 374 kpps : 17 % increase of the peak rate.
>
> v2: Addressed Willem de Bruijn feedback in multicast handling
>  - keep early demux break in __udp4_lib_demux_lookup()
>
Works fine with UDP encapsulation also.

Tested-by: Tom Herbert <tom@herbertland.com>

> Signed-off-by: Eric Dumazet <edumazet@google.com>
> Cc: Tom Herbert <tom@herbertland.com>
> Cc: Willem de Bruijn <willemb@google.com>
> ---
>  include/linux/udp.h |   8 +-
>  include/net/sock.h  |  12 +--
>  include/net/udp.h   |   2 +-
>  net/ipv4/udp.c      | 293 ++++++++++++++++------------------------------------
>  net/ipv4/udp_diag.c |  18 ++--
>  net/ipv6/udp.c      | 196 ++++++++++++-----------------------
>  6 files changed, 171 insertions(+), 358 deletions(-)
>
> diff --git a/include/linux/udp.h b/include/linux/udp.h
> index 87c094961bd5..32342754643a 100644
> --- a/include/linux/udp.h
> +++ b/include/linux/udp.h
> @@ -98,11 +98,11 @@ static inline bool udp_get_no_check6_rx(struct sock *sk)
>         return udp_sk(sk)->no_check6_rx;
>  }
>
> -#define udp_portaddr_for_each_entry(__sk, node, list) \
> -       hlist_nulls_for_each_entry(__sk, node, list, __sk_common.skc_portaddr_node)
> +#define udp_portaddr_for_each_entry(__sk, list) \
> +       hlist_for_each_entry(__sk, list, __sk_common.skc_portaddr_node)
>
> -#define udp_portaddr_for_each_entry_rcu(__sk, node, list) \
> -       hlist_nulls_for_each_entry_rcu(__sk, node, list, __sk_common.skc_portaddr_node)
> +#define udp_portaddr_for_each_entry_rcu(__sk, list) \
> +       hlist_for_each_entry_rcu(__sk, list, __sk_common.skc_portaddr_node)
>
>  #define IS_UDPLITE(__sk) (udp_sk(__sk)->pcflag)
>
> diff --git a/include/net/sock.h b/include/net/sock.h
> index c88785a3e76c..c3a707d1cee8 100644
> --- a/include/net/sock.h
> +++ b/include/net/sock.h
> @@ -178,7 +178,7 @@ struct sock_common {
>         int                     skc_bound_dev_if;
>         union {
>                 struct hlist_node       skc_bind_node;
> -               struct hlist_nulls_node skc_portaddr_node;
> +               struct hlist_node       skc_portaddr_node;
>         };
>         struct proto            *skc_prot;
>         possible_net_t          skc_net;
> @@ -670,18 +670,18 @@ static inline void sk_add_bind_node(struct sock *sk,
>         hlist_for_each_entry(__sk, list, sk_bind_node)
>
>  /**
> - * sk_nulls_for_each_entry_offset - iterate over a list at a given struct offset
> + * sk_for_each_entry_offset_rcu - iterate over a list at a given struct offset
>   * @tpos:      the type * to use as a loop cursor.
>   * @pos:       the &struct hlist_node to use as a loop cursor.
>   * @head:      the head for your list.
>   * @offset:    offset of hlist_node within the struct.
>   *
>   */
> -#define sk_nulls_for_each_entry_offset(tpos, pos, head, offset)                       \
> -       for (pos = (head)->first;                                              \
> -            (!is_a_nulls(pos)) &&                                             \
> +#define sk_for_each_entry_offset_rcu(tpos, pos, head, offset)                 \
> +       for (pos = rcu_dereference((head)->first);                             \
> +            pos != NULL &&                                                    \
>                 ({ tpos = (typeof(*tpos) *)((void *)pos - offset); 1;});       \
> -            pos = pos->next)
> +            pos = rcu_dereference(pos->next))
>
>  static inline struct user_namespace *sk_user_ns(struct sock *sk)
>  {
> diff --git a/include/net/udp.h b/include/net/udp.h
> index 92927f729ac8..d870ec1611c4 100644
> --- a/include/net/udp.h
> +++ b/include/net/udp.h
> @@ -59,7 +59,7 @@ struct udp_skb_cb {
>   *     @lock:  spinlock protecting changes to head/count
>   */
>  struct udp_hslot {
> -       struct hlist_nulls_head head;
> +       struct hlist_head       head;
>         int                     count;
>         spinlock_t              lock;
>  } __attribute__((aligned(2 * sizeof(long))));
> diff --git a/net/ipv4/udp.c b/net/ipv4/udp.c
> index 08eed5e16df0..0475aaf95040 100644
> --- a/net/ipv4/udp.c
> +++ b/net/ipv4/udp.c
> @@ -143,10 +143,9 @@ static int udp_lib_lport_inuse(struct net *net, __u16 num,
>                                unsigned int log)
>  {
>         struct sock *sk2;
> -       struct hlist_nulls_node *node;
>         kuid_t uid = sock_i_uid(sk);
>
> -       sk_nulls_for_each(sk2, node, &hslot->head) {
> +       sk_for_each(sk2, &hslot->head) {
>                 if (net_eq(sock_net(sk2), net) &&
>                     sk2 != sk &&
>                     (bitmap || udp_sk(sk2)->udp_port_hash == num) &&
> @@ -177,12 +176,11 @@ static int udp_lib_lport_inuse2(struct net *net, __u16 num,
>                                                   bool match_wildcard))
>  {
>         struct sock *sk2;
> -       struct hlist_nulls_node *node;
>         kuid_t uid = sock_i_uid(sk);
>         int res = 0;
>
>         spin_lock(&hslot2->lock);
> -       udp_portaddr_for_each_entry(sk2, node, &hslot2->head) {
> +       udp_portaddr_for_each_entry(sk2, &hslot2->head) {
>                 if (net_eq(sock_net(sk2), net) &&
>                     sk2 != sk &&
>                     (udp_sk(sk2)->udp_port_hash == num) &&
> @@ -207,11 +205,10 @@ static int udp_reuseport_add_sock(struct sock *sk, struct udp_hslot *hslot,
>                                                     bool match_wildcard))
>  {
>         struct net *net = sock_net(sk);
> -       struct hlist_nulls_node *node;
>         kuid_t uid = sock_i_uid(sk);
>         struct sock *sk2;
>
> -       sk_nulls_for_each(sk2, node, &hslot->head) {
> +       sk_for_each(sk2, &hslot->head) {
>                 if (net_eq(sock_net(sk2), net) &&
>                     sk2 != sk &&
>                     sk2->sk_family == sk->sk_family &&
> @@ -333,17 +330,18 @@ found:
>                         goto fail_unlock;
>                 }
>
> -               sk_nulls_add_node_rcu(sk, &hslot->head);
> +               sk_add_node_rcu(sk, &hslot->head);
>                 hslot->count++;
>                 sock_prot_inuse_add(sock_net(sk), sk->sk_prot, 1);
>
>                 hslot2 = udp_hashslot2(udptable, udp_sk(sk)->udp_portaddr_hash);
>                 spin_lock(&hslot2->lock);
> -               hlist_nulls_add_head_rcu(&udp_sk(sk)->udp_portaddr_node,
> +               hlist_add_head_rcu(&udp_sk(sk)->udp_portaddr_node,
>                                          &hslot2->head);
>                 hslot2->count++;
>                 spin_unlock(&hslot2->lock);
>         }
> +       sock_set_flag(sk, SOCK_RCU_FREE);
>         error = 0;
>  fail_unlock:
>         spin_unlock_bh(&hslot->lock);
> @@ -497,37 +495,27 @@ static struct sock *udp4_lib_lookup2(struct net *net,
>                 struct sk_buff *skb)
>  {
>         struct sock *sk, *result;
> -       struct hlist_nulls_node *node;
>         int score, badness, matches = 0, reuseport = 0;
> -       bool select_ok = true;
>         u32 hash = 0;
>
> -begin:
>         result = NULL;
>         badness = 0;
> -       udp_portaddr_for_each_entry_rcu(sk, node, &hslot2->head) {
> +       udp_portaddr_for_each_entry_rcu(sk, &hslot2->head) {
>                 score = compute_score2(sk, net, saddr, sport,
>                                       daddr, hnum, dif);
>                 if (score > badness) {
> -                       result = sk;
> -                       badness = score;
>                         reuseport = sk->sk_reuseport;
>                         if (reuseport) {
>                                 hash = udp_ehashfn(net, daddr, hnum,
>                                                    saddr, sport);
> -                               if (select_ok) {
> -                                       struct sock *sk2;
> -
> -                                       sk2 = reuseport_select_sock(sk, hash, skb,
> +                               result = reuseport_select_sock(sk, hash, skb,
>                                                         sizeof(struct udphdr));
> -                                       if (sk2) {
> -                                               result = sk2;
> -                                               select_ok = false;
> -                                               goto found;
> -                                       }
> -                               }
> +                               if (result)
> +                                       return result;
>                                 matches = 1;
>                         }
> +                       badness = score;
> +                       result = sk;
>                 } else if (score == badness && reuseport) {
>                         matches++;
>                         if (reciprocal_scale(hash, matches) == 0)
> @@ -535,23 +523,6 @@ begin:
>                         hash = next_pseudo_random32(hash);
>                 }
>         }
> -       /*
> -        * if the nulls value we got at the end of this lookup is
> -        * not the expected one, we must restart lookup.
> -        * We probably met an item that was moved to another chain.
> -        */
> -       if (get_nulls_value(node) != slot2)
> -               goto begin;
> -       if (result) {
> -found:
> -               if (unlikely(!atomic_inc_not_zero_hint(&result->sk_refcnt, 2)))
> -                       result = NULL;
> -               else if (unlikely(compute_score2(result, net, saddr, sport,
> -                                 daddr, hnum, dif) < badness)) {
> -                       sock_put(result);
> -                       goto begin;
> -               }
> -       }
>         return result;
>  }
>
> @@ -563,15 +534,12 @@ struct sock *__udp4_lib_lookup(struct net *net, __be32 saddr,
>                 int dif, struct udp_table *udptable, struct sk_buff *skb)
>  {
>         struct sock *sk, *result;
> -       struct hlist_nulls_node *node;
>         unsigned short hnum = ntohs(dport);
>         unsigned int hash2, slot2, slot = udp_hashfn(net, hnum, udptable->mask);
>         struct udp_hslot *hslot2, *hslot = &udptable->hash[slot];
>         int score, badness, matches = 0, reuseport = 0;
> -       bool select_ok = true;
>         u32 hash = 0;
>
> -       rcu_read_lock();
>         if (hslot->count > 10) {
>                 hash2 = udp4_portaddr_hash(net, daddr, hnum);
>                 slot2 = hash2 & udptable->mask;
> @@ -593,35 +561,27 @@ struct sock *__udp4_lib_lookup(struct net *net, __be32 saddr,
>                                                   htonl(INADDR_ANY), hnum, dif,
>                                                   hslot2, slot2, skb);
>                 }
> -               rcu_read_unlock();
>                 return result;
>         }
>  begin:
>         result = NULL;
>         badness = 0;
> -       sk_nulls_for_each_rcu(sk, node, &hslot->head) {
> +       sk_for_each_rcu(sk, &hslot->head) {
>                 score = compute_score(sk, net, saddr, hnum, sport,
>                                       daddr, dport, dif);
>                 if (score > badness) {
> -                       result = sk;
> -                       badness = score;
>                         reuseport = sk->sk_reuseport;
>                         if (reuseport) {
>                                 hash = udp_ehashfn(net, daddr, hnum,
>                                                    saddr, sport);
> -                               if (select_ok) {
> -                                       struct sock *sk2;
> -
> -                                       sk2 = reuseport_select_sock(sk, hash, skb,
> +                               result = reuseport_select_sock(sk, hash, skb,
>                                                         sizeof(struct udphdr));
> -                                       if (sk2) {
> -                                               result = sk2;
> -                                               select_ok = false;
> -                                               goto found;
> -                                       }
> -                               }
> +                               if (result)
> +                                       return result;
>                                 matches = 1;
>                         }
> +                       result = sk;
> +                       badness = score;
>                 } else if (score == badness && reuseport) {
>                         matches++;
>                         if (reciprocal_scale(hash, matches) == 0)
> @@ -629,25 +589,6 @@ begin:
>                         hash = next_pseudo_random32(hash);
>                 }
>         }
> -       /*
> -        * if the nulls value we got at the end of this lookup is
> -        * not the expected one, we must restart lookup.
> -        * We probably met an item that was moved to another chain.
> -        */
> -       if (get_nulls_value(node) != slot)
> -               goto begin;
> -
> -       if (result) {
> -found:
> -               if (unlikely(!atomic_inc_not_zero_hint(&result->sk_refcnt, 2)))
> -                       result = NULL;
> -               else if (unlikely(compute_score(result, net, saddr, hnum, sport,
> -                                 daddr, dport, dif) < badness)) {
> -                       sock_put(result);
> -                       goto begin;
> -               }
> -       }
> -       rcu_read_unlock();
>         return result;
>  }
>  EXPORT_SYMBOL_GPL(__udp4_lib_lookup);
> @@ -663,13 +604,24 @@ static inline struct sock *__udp4_lib_lookup_skb(struct sk_buff *skb,
>                                  udptable, skb);
>  }
>
> +/* Must be called under rcu_read_lock().
> + * Does increment socket refcount.
> + */
> +#if IS_ENABLED(CONFIG_NETFILTER_XT_MATCH_SOCKET) || \
> +    IS_ENABLED(CONFIG_NETFILTER_XT_TARGET_TPROXY)
>  struct sock *udp4_lib_lookup(struct net *net, __be32 saddr, __be16 sport,
>                              __be32 daddr, __be16 dport, int dif)
>  {
> -       return __udp4_lib_lookup(net, saddr, sport, daddr, dport, dif,
> -                                &udp_table, NULL);
> +       struct sock *sk;
> +
> +       sk = __udp4_lib_lookup(net, saddr, sport, daddr, dport,
> +                              dif, &udp_table, NULL);
> +       if (sk && !atomic_inc_not_zero(&sk->sk_refcnt))
> +               sk = NULL;
> +       return sk;
>  }
>  EXPORT_SYMBOL_GPL(udp4_lib_lookup);
> +#endif
>
>  static inline bool __udp_is_mcast_sock(struct net *net, struct sock *sk,
>                                        __be16 loc_port, __be32 loc_addr,
> @@ -771,7 +723,7 @@ void __udp4_lib_err(struct sk_buff *skb, u32 info, struct udp_table *udptable)
>         sk->sk_err = err;
>         sk->sk_error_report(sk);
>  out:
> -       sock_put(sk);
> +       return;
>  }
>
>  void udp_err(struct sk_buff *skb, u32 info)
> @@ -1474,13 +1426,13 @@ void udp_lib_unhash(struct sock *sk)
>                 spin_lock_bh(&hslot->lock);
>                 if (rcu_access_pointer(sk->sk_reuseport_cb))
>                         reuseport_detach_sock(sk);
> -               if (sk_nulls_del_node_init_rcu(sk)) {
> +               if (sk_del_node_init_rcu(sk)) {
>                         hslot->count--;
>                         inet_sk(sk)->inet_num = 0;
>                         sock_prot_inuse_add(sock_net(sk), sk->sk_prot, -1);
>
>                         spin_lock(&hslot2->lock);
> -                       hlist_nulls_del_init_rcu(&udp_sk(sk)->udp_portaddr_node);
> +                       hlist_del_init_rcu(&udp_sk(sk)->udp_portaddr_node);
>                         hslot2->count--;
>                         spin_unlock(&hslot2->lock);
>                 }
> @@ -1513,12 +1465,12 @@ void udp_lib_rehash(struct sock *sk, u16 newhash)
>
>                         if (hslot2 != nhslot2) {
>                                 spin_lock(&hslot2->lock);
> -                               hlist_nulls_del_init_rcu(&udp_sk(sk)->udp_portaddr_node);
> +                               hlist_del_init_rcu(&udp_sk(sk)->udp_portaddr_node);
>                                 hslot2->count--;
>                                 spin_unlock(&hslot2->lock);
>
>                                 spin_lock(&nhslot2->lock);
> -                               hlist_nulls_add_head_rcu(&udp_sk(sk)->udp_portaddr_node,
> +                               hlist_add_head_rcu(&udp_sk(sk)->udp_portaddr_node,
>                                                          &nhslot2->head);
>                                 nhslot2->count++;
>                                 spin_unlock(&nhslot2->lock);
> @@ -1697,35 +1649,6 @@ drop:
>         return -1;
>  }
>
> -static void flush_stack(struct sock **stack, unsigned int count,
> -                       struct sk_buff *skb, unsigned int final)
> -{
> -       unsigned int i;
> -       struct sk_buff *skb1 = NULL;
> -       struct sock *sk;
> -
> -       for (i = 0; i < count; i++) {
> -               sk = stack[i];
> -               if (likely(!skb1))
> -                       skb1 = (i == final) ? skb : skb_clone(skb, GFP_ATOMIC);
> -
> -               if (!skb1) {
> -                       atomic_inc(&sk->sk_drops);
> -                       UDP_INC_STATS_BH(sock_net(sk), UDP_MIB_RCVBUFERRORS,
> -                                        IS_UDPLITE(sk));
> -                       UDP_INC_STATS_BH(sock_net(sk), UDP_MIB_INERRORS,
> -                                        IS_UDPLITE(sk));
> -               }
> -
> -               if (skb1 && udp_queue_rcv_skb(sk, skb1) <= 0)
> -                       skb1 = NULL;
> -
> -               sock_put(sk);
> -       }
> -       if (unlikely(skb1))
> -               kfree_skb(skb1);
> -}
> -
>  /* For TCP sockets, sk_rx_dst is protected by socket lock
>   * For UDP, we use xchg() to guard against concurrent changes.
>   */
> @@ -1749,14 +1672,14 @@ static int __udp4_lib_mcast_deliver(struct net *net, struct sk_buff *skb,
>                                     struct udp_table *udptable,
>                                     int proto)
>  {
> -       struct sock *sk, *stack[256 / sizeof(struct sock *)];
> -       struct hlist_nulls_node *node;
> +       struct sock *sk, *first = NULL;
>         unsigned short hnum = ntohs(uh->dest);
>         struct udp_hslot *hslot = udp_hashslot(udptable, net, hnum);
> -       int dif = skb->dev->ifindex;
> -       unsigned int count = 0, offset = offsetof(typeof(*sk), sk_nulls_node);
>         unsigned int hash2 = 0, hash2_any = 0, use_hash2 = (hslot->count > 10);
> -       bool inner_flushed = false;
> +       unsigned int offset = offsetof(typeof(*sk), sk_node);
> +       int dif = skb->dev->ifindex;
> +       struct hlist_node *node;
> +       struct sk_buff *nskb;
>
>         if (use_hash2) {
>                 hash2_any = udp4_portaddr_hash(net, htonl(INADDR_ANY), hnum) &
> @@ -1767,23 +1690,28 @@ start_lookup:
>                 offset = offsetof(typeof(*sk), __sk_common.skc_portaddr_node);
>         }
>
> -       spin_lock(&hslot->lock);
> -       sk_nulls_for_each_entry_offset(sk, node, &hslot->head, offset) {
> -               if (__udp_is_mcast_sock(net, sk,
> -                                       uh->dest, daddr,
> -                                       uh->source, saddr,
> -                                       dif, hnum)) {
> -                       if (unlikely(count == ARRAY_SIZE(stack))) {
> -                               flush_stack(stack, count, skb, ~0);
> -                               inner_flushed = true;
> -                               count = 0;
> -                       }
> -                       stack[count++] = sk;
> -                       sock_hold(sk);
> +       sk_for_each_entry_offset_rcu(sk, node, &hslot->head, offset) {
> +               if (!__udp_is_mcast_sock(net, sk, uh->dest, daddr,
> +                                        uh->source, saddr, dif, hnum))
> +                       continue;
> +
> +               if (!first) {
> +                       first = sk;
> +                       continue;
>                 }
> -       }
> +               nskb = skb_clone(skb, GFP_ATOMIC);
>
> -       spin_unlock(&hslot->lock);
> +               if (unlikely(!nskb)) {
> +                       atomic_inc(&sk->sk_drops);
> +                       UDP_INC_STATS_BH(net, UDP_MIB_RCVBUFERRORS,
> +                                        IS_UDPLITE(sk));
> +                       UDP_INC_STATS_BH(net, UDP_MIB_INERRORS,
> +                                        IS_UDPLITE(sk));
> +                       continue;
> +               }
> +               if (udp_queue_rcv_skb(sk, nskb) > 0)
> +                       consume_skb(nskb);
> +       }
>
>         /* Also lookup *:port if we are using hash2 and haven't done so yet. */
>         if (use_hash2 && hash2 != hash2_any) {
> @@ -1791,16 +1719,13 @@ start_lookup:
>                 goto start_lookup;
>         }
>
> -       /*
> -        * do the slow work with no lock held
> -        */
> -       if (count) {
> -               flush_stack(stack, count, skb, count - 1);
> +       if (first) {
> +               if (udp_queue_rcv_skb(first, skb) > 0)
> +                       consume_skb(skb);
>         } else {
> -               if (!inner_flushed)
> -                       UDP_INC_STATS_BH(net, UDP_MIB_IGNOREDMULTI,
> -                                        proto == IPPROTO_UDPLITE);
> -               consume_skb(skb);
> +               kfree_skb(skb);
> +               UDP_INC_STATS_BH(net, UDP_MIB_IGNOREDMULTI,
> +                                proto == IPPROTO_UDPLITE);
>         }
>         return 0;
>  }
> @@ -1897,7 +1822,6 @@ int __udp4_lib_rcv(struct sk_buff *skb, struct udp_table *udptable,
>                                                  inet_compute_pseudo);
>
>                 ret = udp_queue_rcv_skb(sk, skb);
> -               sock_put(sk);
>
>                 /* a return value > 0 means to resubmit the input, but
>                  * it wants the return to be -protocol, or 0
> @@ -1958,49 +1882,24 @@ static struct sock *__udp4_lib_mcast_demux_lookup(struct net *net,
>                                                   int dif)
>  {
>         struct sock *sk, *result;
> -       struct hlist_nulls_node *node;
>         unsigned short hnum = ntohs(loc_port);
> -       unsigned int count, slot = udp_hashfn(net, hnum, udp_table.mask);
> +       unsigned int slot = udp_hashfn(net, hnum, udp_table.mask);
>         struct udp_hslot *hslot = &udp_table.hash[slot];
>
>         /* Do not bother scanning a too big list */
>         if (hslot->count > 10)
>                 return NULL;
>
> -       rcu_read_lock();
> -begin:
> -       count = 0;
>         result = NULL;
> -       sk_nulls_for_each_rcu(sk, node, &hslot->head) {
> -               if (__udp_is_mcast_sock(net, sk,
> -                                       loc_port, loc_addr,
> -                                       rmt_port, rmt_addr,
> -                                       dif, hnum)) {
> +       sk_for_each_rcu(sk, &hslot->head) {
> +               if (__udp_is_mcast_sock(net, sk, loc_port, loc_addr,
> +                                       rmt_port, rmt_addr, dif, hnum)) {
> +                       if (result)
> +                               return NULL;
>                         result = sk;
> -                       ++count;
> -               }
> -       }
> -       /*
> -        * if the nulls value we got at the end of this lookup is
> -        * not the expected one, we must restart lookup.
> -        * We probably met an item that was moved to another chain.
> -        */
> -       if (get_nulls_value(node) != slot)
> -               goto begin;
> -
> -       if (result) {
> -               if (count != 1 ||
> -                   unlikely(!atomic_inc_not_zero_hint(&result->sk_refcnt, 2)))
> -                       result = NULL;
> -               else if (unlikely(!__udp_is_mcast_sock(net, result,
> -                                                      loc_port, loc_addr,
> -                                                      rmt_port, rmt_addr,
> -                                                      dif, hnum))) {
> -                       sock_put(result);
> -                       result = NULL;
>                 }
>         }
> -       rcu_read_unlock();
> +
>         return result;
>  }
>
> @@ -2013,37 +1912,22 @@ static struct sock *__udp4_lib_demux_lookup(struct net *net,
>                                             __be16 rmt_port, __be32 rmt_addr,
>                                             int dif)
>  {
> -       struct sock *sk, *result;
> -       struct hlist_nulls_node *node;
>         unsigned short hnum = ntohs(loc_port);
>         unsigned int hash2 = udp4_portaddr_hash(net, loc_addr, hnum);
>         unsigned int slot2 = hash2 & udp_table.mask;
>         struct udp_hslot *hslot2 = &udp_table.hash2[slot2];
>         INET_ADDR_COOKIE(acookie, rmt_addr, loc_addr);
>         const __portpair ports = INET_COMBINED_PORTS(rmt_port, hnum);
> +       struct sock *sk;
>
> -       rcu_read_lock();
> -       result = NULL;
> -       udp_portaddr_for_each_entry_rcu(sk, node, &hslot2->head) {
> -               if (INET_MATCH(sk, net, acookie,
> -                              rmt_addr, loc_addr, ports, dif))
> -                       result = sk;
> +       udp_portaddr_for_each_entry_rcu(sk, &hslot2->head) {
> +               if (INET_MATCH(sk, net, acookie, rmt_addr,
> +                              loc_addr, ports, dif))
> +                       return sk;
>                 /* Only check first socket in chain */
>                 break;
>         }
> -
> -       if (result) {
> -               if (unlikely(!atomic_inc_not_zero_hint(&result->sk_refcnt, 2)))
> -                       result = NULL;
> -               else if (unlikely(!INET_MATCH(sk, net, acookie,
> -                                             rmt_addr, loc_addr,
> -                                             ports, dif))) {
> -                       sock_put(result);
> -                       result = NULL;
> -               }
> -       }
> -       rcu_read_unlock();
> -       return result;
> +       return NULL;
>  }
>
>  void udp_v4_early_demux(struct sk_buff *skb)
> @@ -2051,7 +1935,7 @@ void udp_v4_early_demux(struct sk_buff *skb)
>         struct net *net = dev_net(skb->dev);
>         const struct iphdr *iph;
>         const struct udphdr *uh;
> -       struct sock *sk;
> +       struct sock *sk = NULL;
>         struct dst_entry *dst;
>         int dif = skb->dev->ifindex;
>         int ours;
> @@ -2083,11 +1967,9 @@ void udp_v4_early_demux(struct sk_buff *skb)
>         } else if (skb->pkt_type == PACKET_HOST) {
>                 sk = __udp4_lib_demux_lookup(net, uh->dest, iph->daddr,
>                                              uh->source, iph->saddr, dif);
> -       } else {
> -               return;
>         }
>
> -       if (!sk)
> +       if (!sk || !atomic_inc_not_zero_hint(&sk->sk_refcnt, 2))
>                 return;
>
>         skb->sk = sk;
> @@ -2387,14 +2269,13 @@ static struct sock *udp_get_first(struct seq_file *seq, int start)
>
>         for (state->bucket = start; state->bucket <= state->udp_table->mask;
>              ++state->bucket) {
> -               struct hlist_nulls_node *node;
>                 struct udp_hslot *hslot = &state->udp_table->hash[state->bucket];
>
> -               if (hlist_nulls_empty(&hslot->head))
> +               if (hlist_empty(&hslot->head))
>                         continue;
>
>                 spin_lock_bh(&hslot->lock);
> -               sk_nulls_for_each(sk, node, &hslot->head) {
> +               sk_for_each(sk, &hslot->head) {
>                         if (!net_eq(sock_net(sk), net))
>                                 continue;
>                         if (sk->sk_family == state->family)
> @@ -2413,7 +2294,7 @@ static struct sock *udp_get_next(struct seq_file *seq, struct sock *sk)
>         struct net *net = seq_file_net(seq);
>
>         do {
> -               sk = sk_nulls_next(sk);
> +               sk = sk_next(sk);
>         } while (sk && (!net_eq(sock_net(sk), net) || sk->sk_family != state->family));
>
>         if (!sk) {
> @@ -2622,12 +2503,12 @@ void __init udp_table_init(struct udp_table *table, const char *name)
>
>         table->hash2 = table->hash + (table->mask + 1);
>         for (i = 0; i <= table->mask; i++) {
> -               INIT_HLIST_NULLS_HEAD(&table->hash[i].head, i);
> +               INIT_HLIST_HEAD(&table->hash[i].head);
>                 table->hash[i].count = 0;
>                 spin_lock_init(&table->hash[i].lock);
>         }
>         for (i = 0; i <= table->mask; i++) {
> -               INIT_HLIST_NULLS_HEAD(&table->hash2[i].head, i);
> +               INIT_HLIST_HEAD(&table->hash2[i].head);
>                 table->hash2[i].count = 0;
>                 spin_lock_init(&table->hash2[i].lock);
>         }
> diff --git a/net/ipv4/udp_diag.c b/net/ipv4/udp_diag.c
> index df1966f3b6ec..3d5ccf4b1412 100644
> --- a/net/ipv4/udp_diag.c
> +++ b/net/ipv4/udp_diag.c
> @@ -36,10 +36,11 @@ static int udp_dump_one(struct udp_table *tbl, struct sk_buff *in_skb,
>                         const struct inet_diag_req_v2 *req)
>  {
>         int err = -EINVAL;
> -       struct sock *sk;
> +       struct sock *sk = NULL;
>         struct sk_buff *rep;
>         struct net *net = sock_net(in_skb->sk);
>
> +       rcu_read_lock();
>         if (req->sdiag_family == AF_INET)
>                 sk = __udp4_lib_lookup(net,
>                                 req->id.idiag_src[0], req->id.idiag_sport,
> @@ -54,9 +55,9 @@ static int udp_dump_one(struct udp_table *tbl, struct sk_buff *in_skb,
>                                 req->id.idiag_dport,
>                                 req->id.idiag_if, tbl, NULL);
>  #endif
> -       else
> -               goto out_nosk;
> -
> +       if (sk && !atomic_inc_not_zero(&sk->sk_refcnt))
> +               sk = NULL;
> +       rcu_read_unlock();
>         err = -ENOENT;
>         if (!sk)
>                 goto out_nosk;
> @@ -96,24 +97,23 @@ static void udp_dump(struct udp_table *table, struct sk_buff *skb,
>                      struct netlink_callback *cb,
>                      const struct inet_diag_req_v2 *r, struct nlattr *bc)
>  {
> -       int num, s_num, slot, s_slot;
>         struct net *net = sock_net(skb->sk);
> +       int num, s_num, slot, s_slot;
>
>         s_slot = cb->args[0];
>         num = s_num = cb->args[1];
>
>         for (slot = s_slot; slot <= table->mask; s_num = 0, slot++) {
> -               struct sock *sk;
> -               struct hlist_nulls_node *node;
>                 struct udp_hslot *hslot = &table->hash[slot];
> +               struct sock *sk;
>
>                 num = 0;
>
> -               if (hlist_nulls_empty(&hslot->head))
> +               if (hlist_empty(&hslot->head))
>                         continue;
>
>                 spin_lock_bh(&hslot->lock);
> -               sk_nulls_for_each(sk, node, &hslot->head) {
> +               sk_for_each(sk, &hslot->head) {
>                         struct inet_sock *inet = inet_sk(sk);
>
>                         if (!net_eq(sock_net(sk), net))
> diff --git a/net/ipv6/udp.c b/net/ipv6/udp.c
> index 8125931106be..b28c0dc63c63 100644
> --- a/net/ipv6/udp.c
> +++ b/net/ipv6/udp.c
> @@ -213,37 +213,28 @@ static struct sock *udp6_lib_lookup2(struct net *net,
>                 struct sk_buff *skb)
>  {
>         struct sock *sk, *result;
> -       struct hlist_nulls_node *node;
>         int score, badness, matches = 0, reuseport = 0;
> -       bool select_ok = true;
>         u32 hash = 0;
>
> -begin:
>         result = NULL;
>         badness = -1;
> -       udp_portaddr_for_each_entry_rcu(sk, node, &hslot2->head) {
> +       udp_portaddr_for_each_entry_rcu(sk, &hslot2->head) {
>                 score = compute_score2(sk, net, saddr, sport,
>                                       daddr, hnum, dif);
>                 if (score > badness) {
> -                       result = sk;
> -                       badness = score;
>                         reuseport = sk->sk_reuseport;
>                         if (reuseport) {
>                                 hash = udp6_ehashfn(net, daddr, hnum,
>                                                     saddr, sport);
> -                               if (select_ok) {
> -                                       struct sock *sk2;
>
> -                                       sk2 = reuseport_select_sock(sk, hash, skb,
> +                               result = reuseport_select_sock(sk, hash, skb,
>                                                         sizeof(struct udphdr));
> -                                       if (sk2) {
> -                                               result = sk2;
> -                                               select_ok = false;
> -                                               goto found;
> -                                       }
> -                               }
> +                               if (result)
> +                                       return result;
>                                 matches = 1;
>                         }
> +                       result = sk;
> +                       badness = score;
>                 } else if (score == badness && reuseport) {
>                         matches++;
>                         if (reciprocal_scale(hash, matches) == 0)
> @@ -251,27 +242,10 @@ begin:
>                         hash = next_pseudo_random32(hash);
>                 }
>         }
> -       /*
> -        * if the nulls value we got at the end of this lookup is
> -        * not the expected one, we must restart lookup.
> -        * We probably met an item that was moved to another chain.
> -        */
> -       if (get_nulls_value(node) != slot2)
> -               goto begin;
> -
> -       if (result) {
> -found:
> -               if (unlikely(!atomic_inc_not_zero_hint(&result->sk_refcnt, 2)))
> -                       result = NULL;
> -               else if (unlikely(compute_score2(result, net, saddr, sport,
> -                                 daddr, hnum, dif) < badness)) {
> -                       sock_put(result);
> -                       goto begin;
> -               }
> -       }
>         return result;
>  }
>
> +/* rcu_read_lock() must be held */
>  struct sock *__udp6_lib_lookup(struct net *net,
>                                       const struct in6_addr *saddr, __be16 sport,
>                                       const struct in6_addr *daddr, __be16 dport,
> @@ -279,15 +253,12 @@ struct sock *__udp6_lib_lookup(struct net *net,
>                                       struct sk_buff *skb)
>  {
>         struct sock *sk, *result;
> -       struct hlist_nulls_node *node;
>         unsigned short hnum = ntohs(dport);
>         unsigned int hash2, slot2, slot = udp_hashfn(net, hnum, udptable->mask);
>         struct udp_hslot *hslot2, *hslot = &udptable->hash[slot];
>         int score, badness, matches = 0, reuseport = 0;
> -       bool select_ok = true;
>         u32 hash = 0;
>
> -       rcu_read_lock();
>         if (hslot->count > 10) {
>                 hash2 = udp6_portaddr_hash(net, daddr, hnum);
>                 slot2 = hash2 & udptable->mask;
> @@ -309,34 +280,26 @@ struct sock *__udp6_lib_lookup(struct net *net,
>                                                   &in6addr_any, hnum, dif,
>                                                   hslot2, slot2, skb);
>                 }
> -               rcu_read_unlock();
>                 return result;
>         }
>  begin:
>         result = NULL;
>         badness = -1;
> -       sk_nulls_for_each_rcu(sk, node, &hslot->head) {
> +       sk_for_each_rcu(sk, &hslot->head) {
>                 score = compute_score(sk, net, hnum, saddr, sport, daddr, dport, dif);
>                 if (score > badness) {
> -                       result = sk;
> -                       badness = score;
>                         reuseport = sk->sk_reuseport;
>                         if (reuseport) {
>                                 hash = udp6_ehashfn(net, daddr, hnum,
>                                                     saddr, sport);
> -                               if (select_ok) {
> -                                       struct sock *sk2;
> -
> -                                       sk2 = reuseport_select_sock(sk, hash, skb,
> +                               result = reuseport_select_sock(sk, hash, skb,
>                                                         sizeof(struct udphdr));
> -                                       if (sk2) {
> -                                               result = sk2;
> -                                               select_ok = false;
> -                                               goto found;
> -                                       }
> -                               }
> +                               if (result)
> +                                       return result;
>                                 matches = 1;
>                         }
> +                       result = sk;
> +                       badness = score;
>                 } else if (score == badness && reuseport) {
>                         matches++;
>                         if (reciprocal_scale(hash, matches) == 0)
> @@ -344,25 +307,6 @@ begin:
>                         hash = next_pseudo_random32(hash);
>                 }
>         }
> -       /*
> -        * if the nulls value we got at the end of this lookup is
> -        * not the expected one, we must restart lookup.
> -        * We probably met an item that was moved to another chain.
> -        */
> -       if (get_nulls_value(node) != slot)
> -               goto begin;
> -
> -       if (result) {
> -found:
> -               if (unlikely(!atomic_inc_not_zero_hint(&result->sk_refcnt, 2)))
> -                       result = NULL;
> -               else if (unlikely(compute_score(result, net, hnum, saddr, sport,
> -                                       daddr, dport, dif) < badness)) {
> -                       sock_put(result);
> -                       goto begin;
> -               }
> -       }
> -       rcu_read_unlock();
>         return result;
>  }
>  EXPORT_SYMBOL_GPL(__udp6_lib_lookup);
> @@ -382,12 +326,24 @@ static struct sock *__udp6_lib_lookup_skb(struct sk_buff *skb,
>                                  udptable, skb);
>  }
>
> +/* Must be called under rcu_read_lock().
> + * Does increment socket refcount.
> + */
> +#if IS_ENABLED(CONFIG_NETFILTER_XT_MATCH_SOCKET) || \
> +    IS_ENABLED(CONFIG_NETFILTER_XT_TARGET_TPROXY)
>  struct sock *udp6_lib_lookup(struct net *net, const struct in6_addr *saddr, __be16 sport,
>                              const struct in6_addr *daddr, __be16 dport, int dif)
>  {
> -       return __udp6_lib_lookup(net, saddr, sport, daddr, dport, dif, &udp_table, NULL);
> +       struct sock *sk;
> +
> +       sk =  __udp6_lib_lookup(net, saddr, sport, daddr, dport,
> +                               dif, &udp_table, NULL);
> +       if (sk && !atomic_inc_not_zero(&sk->sk_refcnt))
> +               sk = NULL;
> +       return sk;
>  }
>  EXPORT_SYMBOL_GPL(udp6_lib_lookup);
> +#endif
>
>  /*
>   *     This should be easy, if there is something there we
> @@ -585,7 +541,7 @@ void __udp6_lib_err(struct sk_buff *skb, struct inet6_skb_parm *opt,
>         sk->sk_err = err;
>         sk->sk_error_report(sk);
>  out:
> -       sock_put(sk);
> +       return;
>  }
>
>  static int __udpv6_queue_rcv_skb(struct sock *sk, struct sk_buff *skb)
> @@ -747,33 +703,6 @@ static bool __udp_v6_is_mcast_sock(struct net *net, struct sock *sk,
>         return true;
>  }
>
> -static void flush_stack(struct sock **stack, unsigned int count,
> -                       struct sk_buff *skb, unsigned int final)
> -{
> -       struct sk_buff *skb1 = NULL;
> -       struct sock *sk;
> -       unsigned int i;
> -
> -       for (i = 0; i < count; i++) {
> -               sk = stack[i];
> -               if (likely(!skb1))
> -                       skb1 = (i == final) ? skb : skb_clone(skb, GFP_ATOMIC);
> -               if (!skb1) {
> -                       atomic_inc(&sk->sk_drops);
> -                       UDP6_INC_STATS_BH(sock_net(sk), UDP_MIB_RCVBUFERRORS,
> -                                         IS_UDPLITE(sk));
> -                       UDP6_INC_STATS_BH(sock_net(sk), UDP_MIB_INERRORS,
> -                                         IS_UDPLITE(sk));
> -               }
> -
> -               if (skb1 && udpv6_queue_rcv_skb(sk, skb1) <= 0)
> -                       skb1 = NULL;
> -               sock_put(sk);
> -       }
> -       if (unlikely(skb1))
> -               kfree_skb(skb1);
> -}
> -
>  static void udp6_csum_zero_error(struct sk_buff *skb)
>  {
>         /* RFC 2460 section 8.1 says that we SHOULD log
> @@ -792,15 +721,15 @@ static int __udp6_lib_mcast_deliver(struct net *net, struct sk_buff *skb,
>                 const struct in6_addr *saddr, const struct in6_addr *daddr,
>                 struct udp_table *udptable, int proto)
>  {
> -       struct sock *sk, *stack[256 / sizeof(struct sock *)];
> +       struct sock *sk, *first = NULL;
>         const struct udphdr *uh = udp_hdr(skb);
> -       struct hlist_nulls_node *node;
>         unsigned short hnum = ntohs(uh->dest);
>         struct udp_hslot *hslot = udp_hashslot(udptable, net, hnum);
> -       int dif = inet6_iif(skb);
> -       unsigned int count = 0, offset = offsetof(typeof(*sk), sk_nulls_node);
> +       unsigned int offset = offsetof(typeof(*sk), sk_node);
>         unsigned int hash2 = 0, hash2_any = 0, use_hash2 = (hslot->count > 10);
> -       bool inner_flushed = false;
> +       int dif = inet6_iif(skb);
> +       struct hlist_node *node;
> +       struct sk_buff *nskb;
>
>         if (use_hash2) {
>                 hash2_any = udp6_portaddr_hash(net, &in6addr_any, hnum) &
> @@ -811,27 +740,32 @@ start_lookup:
>                 offset = offsetof(typeof(*sk), __sk_common.skc_portaddr_node);
>         }
>
> -       spin_lock(&hslot->lock);
> -       sk_nulls_for_each_entry_offset(sk, node, &hslot->head, offset) {
> -               if (__udp_v6_is_mcast_sock(net, sk,
> -                                          uh->dest, daddr,
> -                                          uh->source, saddr,
> -                                          dif, hnum) &&
> -                   /* If zero checksum and no_check is not on for
> -                    * the socket then skip it.
> -                    */
> -                   (uh->check || udp_sk(sk)->no_check6_rx)) {
> -                       if (unlikely(count == ARRAY_SIZE(stack))) {
> -                               flush_stack(stack, count, skb, ~0);
> -                               inner_flushed = true;
> -                               count = 0;
> -                       }
> -                       stack[count++] = sk;
> -                       sock_hold(sk);
> +       sk_for_each_entry_offset_rcu(sk, node, &hslot->head, offset) {
> +               if (!__udp_v6_is_mcast_sock(net, sk, uh->dest, daddr,
> +                                           uh->source, saddr, dif, hnum))
> +                       continue;
> +               /* If zero checksum and no_check is not on for
> +                * the socket then skip it.
> +                */
> +               if (!uh->check && !udp_sk(sk)->no_check6_rx)
> +                       continue;
> +               if (!first) {
> +                       first = sk;
> +                       continue;
> +               }
> +               nskb = skb_clone(skb, GFP_ATOMIC);
> +               if (unlikely(!nskb)) {
> +                       atomic_inc(&sk->sk_drops);
> +                       UDP6_INC_STATS_BH(net, UDP_MIB_RCVBUFERRORS,
> +                                         IS_UDPLITE(sk));
> +                       UDP6_INC_STATS_BH(net, UDP_MIB_INERRORS,
> +                                         IS_UDPLITE(sk));
> +                       continue;
>                 }
> -       }
>
> -       spin_unlock(&hslot->lock);
> +               if (udpv6_queue_rcv_skb(sk, nskb) > 0)
> +                       consume_skb(nskb);
> +       }
>
>         /* Also lookup *:port if we are using hash2 and haven't done so yet. */
>         if (use_hash2 && hash2 != hash2_any) {
> @@ -839,13 +773,13 @@ start_lookup:
>                 goto start_lookup;
>         }
>
> -       if (count) {
> -               flush_stack(stack, count, skb, count - 1);
> +       if (first) {
> +               if (udpv6_queue_rcv_skb(first, skb) > 0)
> +                       consume_skb(skb);
>         } else {
> -               if (!inner_flushed)
> -                       UDP6_INC_STATS_BH(net, UDP_MIB_IGNOREDMULTI,
> -                                         proto == IPPROTO_UDPLITE);
> -               consume_skb(skb);
> +               kfree_skb(skb);
> +               UDP6_INC_STATS_BH(net, UDP_MIB_IGNOREDMULTI,
> +                                 proto == IPPROTO_UDPLITE);
>         }
>         return 0;
>  }
> @@ -853,10 +787,10 @@ start_lookup:
>  int __udp6_lib_rcv(struct sk_buff *skb, struct udp_table *udptable,
>                    int proto)
>  {
> +       const struct in6_addr *saddr, *daddr;
>         struct net *net = dev_net(skb->dev);
> -       struct sock *sk;
>         struct udphdr *uh;
> -       const struct in6_addr *saddr, *daddr;
> +       struct sock *sk;
>         u32 ulen = 0;
>
>         if (!pskb_may_pull(skb, sizeof(struct udphdr)))
> @@ -910,7 +844,6 @@ int __udp6_lib_rcv(struct sk_buff *skb, struct udp_table *udptable,
>                 int ret;
>
>                 if (!uh->check && !udp_sk(sk)->no_check6_rx) {
> -                       sock_put(sk);
>                         udp6_csum_zero_error(skb);
>                         goto csum_error;
>                 }
> @@ -920,7 +853,6 @@ int __udp6_lib_rcv(struct sk_buff *skb, struct udp_table *udptable,
>                                                  ip6_compute_pseudo);
>
>                 ret = udpv6_queue_rcv_skb(sk, skb);
> -               sock_put(sk);
>
>                 /* a return value > 0 means to resubmit the input */
>                 if (ret > 0)
> --
> 2.8.0.rc3.226.g39d4020
>

^ permalink raw reply

* Re: [PATCH] net: remove unimplemented RTNH_F_PERVASIVE
From: Sergei Shtylyov @ 2016-04-02 16:04 UTC (permalink / raw)
  To: Quentin Armitage, David S. Miller, Alexey Kuznetsov, James Morris,
	Hideaki YOSHIFUJI, Patrick McHardy, netdev
In-Reply-To: <1459586603-24006-1-git-send-email-quentin@armitage.org.uk>

Hello.

On 4/2/2016 11:43 AM, Quentin Armitage wrote:

> Linux 2.1.68 introduced RTNH_F_PERVASIVE, but it had no implementation
> and couldn't be enabled since the required config parameter wasn't in
> any Kconfig file (see commit d088dde7b).

    scripts/checkpatch.pl now enforces certain commit citing format, your 
doesn't match it, i.e. you need 12-digit SHA1 and ("<commit summary") after that.

> This commit removes all remaining references to RTNH_F_PERVASIVE.
> Although this will cause userspace applications that were using the
> flag to fail to build, they will be alerted to the fact that using
> RTNH_F_PERVASIVE was not achieving anything.
>
> Signed-off-by: Quentin Armitage <quentin@armitage.org.uk>
[...]

> diff --git a/net/decnet/dn_fib.c b/net/decnet/dn_fib.c
> index df48034..f5660c6 100644
> --- a/net/decnet/dn_fib.c
> +++ b/net/decnet/dn_fib.c
> @@ -243,7 +243,7 @@ out:
>   	} else {
>   		struct net_device *dev;
>
> -		if (nh->nh_flags&(RTNH_F_PERVASIVE|RTNH_F_ONLINK))
> +		if (nh->nh_flags&RTNH_F_ONLINK)

    Please enclose & into spaces like below.

> diff --git a/net/ipv4/fib_semantics.c b/net/ipv4/fib_semantics.c
> index d97268e..3883860 100644
> --- a/net/ipv4/fib_semantics.c
> +++ b/net/ipv4/fib_semantics.c
> @@ -803,7 +803,7 @@ static int fib_check_nh(struct fib_config *cfg, struct fib_info *fi,
>   	} else {
>   		struct in_device *in_dev;
>
> -		if (nh->nh_flags & (RTNH_F_PERVASIVE | RTNH_F_ONLINK))
> +		if (nh->nh_flags & RTNH_F_ONLINK)
>   			return -EINVAL;
>
>   		rcu_read_lock();

MBR, Sergei

^ permalink raw reply

* net: memory leak due to CLONE_NEWNET
From: Dmitry Vyukov @ 2016-04-02 13:55 UTC (permalink / raw)
  To: David S. Miller, Nicolas Dichtel, Thomas Graf, netdev, LKML,
	Eric Dumazet
  Cc: syzkaller, Kostya Serebryany, Alexander Potapenko, Sasha Levin

Hello,

The following program leads to memory leaks in:

unreferenced object 0xffff88005c10d208 (size 96):
  comm "a.out", pid 10753, jiffies 4296778619 (age 43.118s)
  hex dump (first 32 bytes):
    e8 31 85 2d 00 88 ff ff 0f 00 00 00 00 00 00 00  .1.-............
    00 00 00 00 ad 4e ad de ff ff ff ff 00 00 00 00  .....N..........
  backtrace:
    [<ffffffff8679bb23>] kmemleak_alloc+0x63/0xa0 mm/kmemleak.c:915
    [<     inline     >] kmemleak_alloc_recursive include/linux/kmemleak.h:47
    [<     inline     >] slab_post_alloc_hook mm/slab.h:406
    [<     inline     >] slab_alloc_node mm/slub.c:2602
    [<     inline     >] slab_alloc mm/slub.c:2610
    [<ffffffff8179b4c0>] kmem_cache_alloc_trace+0x160/0x3d0 mm/slub.c:2627
    [<     inline     >] kmalloc include/linux/slab.h:478
    [<     inline     >] tc_action_net_init include/net/act_api.h:122
    [<ffffffff8574e62e>] csum_init_net+0x15e/0x450 net/sched/act_csum.c:593
    [<ffffffff8564ffc9>] ops_init+0xa9/0x3a0 net/core/net_namespace.c:109
    [<ffffffff85650474>] setup_net+0x1b4/0x3e0 net/core/net_namespace.c:287
    [<ffffffff85651a56>] copy_net_ns+0xd6/0x1a0 net/core/net_namespace.c:367
    [<ffffffff813d01bf>] create_new_namespaces+0x37f/0x740 kernel/nsproxy.c:106
    [<ffffffff813d0b69>] unshare_nsproxy_namespaces+0xa9/0x1d0
kernel/nsproxy.c:205
    [<     inline     >] SYSC_unshare kernel/fork.c:2019
    [<ffffffff81365283>] SyS_unshare+0x3b3/0x800 kernel/fork.c:1969
    [<ffffffff867b9400>] entry_SYSCALL_64_fastpath+0x23/0xc1
arch/x86/entry/entry_64.S:207
    [<ffffffffffffffff>] 0xffffffffffffffff
unreferenced object 0xffff88005c10e1c8 (size 96):
  comm "a.out", pid 10753, jiffies 4296778620 (age 43.117s)
  hex dump (first 32 bytes):
    e8 0b 85 2d 00 88 ff ff 0f 00 00 00 00 00 ad de  ...-............
    00 00 00 00 ad 4e ad de ff ff ff ff 00 00 00 00  .....N..........
  backtrace:
    [<ffffffff8679bb23>] kmemleak_alloc+0x63/0xa0 mm/kmemleak.c:915
    [<     inline     >] kmemleak_alloc_recursive include/linux/kmemleak.h:47
    [<     inline     >] slab_post_alloc_hook mm/slab.h:406
    [<     inline     >] slab_alloc_node mm/slub.c:2602
    [<     inline     >] slab_alloc mm/slub.c:2610
    [<ffffffff8179b4c0>] kmem_cache_alloc_trace+0x160/0x3d0 mm/slub.c:2627
    [<     inline     >] kmalloc include/linux/slab.h:478
    [<     inline     >] tc_action_net_init include/net/act_api.h:122
    [<ffffffff85753f7e>] ife_init_net+0x15e/0x450 net/sched/act_ife.c:838
    [<ffffffff8564ffc9>] ops_init+0xa9/0x3a0 net/core/net_namespace.c:109
    [<ffffffff85650474>] setup_net+0x1b4/0x3e0 net/core/net_namespace.c:287
    [<ffffffff85651a56>] copy_net_ns+0xd6/0x1a0 net/core/net_namespace.c:367
    [<ffffffff813d01bf>] create_new_namespaces+0x37f/0x740 kernel/nsproxy.c:106
    [<ffffffff813d0b69>] unshare_nsproxy_namespaces+0xa9/0x1d0
kernel/nsproxy.c:205
    [<     inline     >] SYSC_unshare kernel/fork.c:2019
    [<ffffffff81365283>] SyS_unshare+0x3b3/0x800 kernel/fork.c:1969
    [<ffffffff867b9400>] entry_SYSCALL_64_fastpath+0x23/0xc1
arch/x86/entry/entry_64.S:207
    [<ffffffffffffffff>] 0xffffffffffffffff
unreferenced object 0xffff880025a55b08 (size 96):
  comm "a.out", pid 10702, jiffies 4296768144 (age 61.526s)
  hex dump (first 32 bytes):
    28 ed 55 2b 00 88 ff ff 0f 00 00 00 00 00 00 00  (.U+............
    00 00 00 00 ad 4e ad de ff ff ff ff 00 00 00 00  .....N..........
  backtrace:
    [<ffffffff8679bb23>] kmemleak_alloc+0x63/0xa0 mm/kmemleak.c:915
    [<     inline     >] kmemleak_alloc_recursive include/linux/kmemleak.h:47
    [<     inline     >] slab_post_alloc_hook mm/slab.h:406
    [<     inline     >] slab_alloc_node mm/slub.c:2602
    [<     inline     >] slab_alloc mm/slub.c:2610
    [<ffffffff8179b4c0>] kmem_cache_alloc_trace+0x160/0x3d0 mm/slub.c:2627
    [<     inline     >] kmalloc include/linux/slab.h:478
    [<     inline     >] tc_action_net_init include/net/act_api.h:122
    [<ffffffff8574890e>] nat_init_net+0x15e/0x450 net/sched/act_nat.c:311
    [<ffffffff8564ffc9>] ops_init+0xa9/0x3a0 net/core/net_namespace.c:109
    [<ffffffff85650474>] setup_net+0x1b4/0x3e0 net/core/net_namespace.c:287
    [<ffffffff85651a56>] copy_net_ns+0xd6/0x1a0 net/core/net_namespace.c:367
    [<ffffffff813d01bf>] create_new_namespaces+0x37f/0x740 kernel/nsproxy.c:106
    [<ffffffff813d0b69>] unshare_nsproxy_namespaces+0xa9/0x1d0
kernel/nsproxy.c:205
    [<     inline     >] SYSC_unshare kernel/fork.c:2019
    [<ffffffff81365283>] SyS_unshare+0x3b3/0x800 kernel/fork.c:1969
    [<ffffffff867b9400>] entry_SYSCALL_64_fastpath+0x23/0xc1
arch/x86/entry/entry_64.S:207
    [<ffffffffffffffff>] 0xffffffffffffffff


#include <stdlib.h>
#include <unistd.h>
#include <sched.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <linux/sched.h>

int main()
{
        int pid, status;

        pid = fork();
        if (pid == 0) {
                unshare(CLONE_NEWNET);
                exit(0);
        }
        while (waitpid(pid, &status, 0) != pid) {
        }
        return 0;
}


grep "kmalloc-96" /proc/slabinfo confirms the leak.

I am on commit 05cf8077e54b20dddb756eaa26f3aeb5c38dd3cf (Apr 1).

^ permalink raw reply

* [PATCH] net: remove unimplemented RTNH_F_PERVASIVE
From: Quentin Armitage @ 2016-04-02  8:43 UTC (permalink / raw)
  To: David S. Miller, Alexey Kuznetsov, James Morris,
	Hideaki YOSHIFUJI, Patrick McHardy, netdev
  Cc: Quentin Armitage

Linux 2.1.68 introduced RTNH_F_PERVASIVE, but it had no implementation
and couldn't be enabled since the required config parameter wasn't in
any Kconfig file (see commit d088dde7b).

This commit removes all remaining references to RTNH_F_PERVASIVE.
Although this will cause userspace applications that were using the
flag to fail to build, they will be alerted to the fact that using
RTNH_F_PERVASIVE was not achieving anything.

Signed-off-by: Quentin Armitage <quentin@armitage.org.uk>
---
 include/uapi/linux/rtnetlink.h |    2 +-
 net/decnet/dn_fib.c            |    2 +-
 net/ipv4/fib_semantics.c       |    2 +-
 3 files changed, 3 insertions(+), 3 deletions(-)

diff --git a/include/uapi/linux/rtnetlink.h b/include/uapi/linux/rtnetlink.h
index ca764b5..58e6ba0 100644
--- a/include/uapi/linux/rtnetlink.h
+++ b/include/uapi/linux/rtnetlink.h
@@ -339,7 +339,7 @@ struct rtnexthop {
 /* rtnh_flags */
 
 #define RTNH_F_DEAD		1	/* Nexthop is dead (used by multipath)	*/
-#define RTNH_F_PERVASIVE	2	/* Do recursive gateway lookup	*/
+					/* 2 was RTNH_F_PERVASIVE (never implemented) */
 #define RTNH_F_ONLINK		4	/* Gateway is forced on link	*/
 #define RTNH_F_OFFLOAD		8	/* offloaded route */
 #define RTNH_F_LINKDOWN		16	/* carrier-down on nexthop */
diff --git a/net/decnet/dn_fib.c b/net/decnet/dn_fib.c
index df48034..f5660c6 100644
--- a/net/decnet/dn_fib.c
+++ b/net/decnet/dn_fib.c
@@ -243,7 +243,7 @@ out:
 	} else {
 		struct net_device *dev;
 
-		if (nh->nh_flags&(RTNH_F_PERVASIVE|RTNH_F_ONLINK))
+		if (nh->nh_flags&RTNH_F_ONLINK)
 			return -EINVAL;
 
 		dev = __dev_get_by_index(&init_net, nh->nh_oif);
diff --git a/net/ipv4/fib_semantics.c b/net/ipv4/fib_semantics.c
index d97268e..3883860 100644
--- a/net/ipv4/fib_semantics.c
+++ b/net/ipv4/fib_semantics.c
@@ -803,7 +803,7 @@ static int fib_check_nh(struct fib_config *cfg, struct fib_info *fi,
 	} else {
 		struct in_device *in_dev;
 
-		if (nh->nh_flags & (RTNH_F_PERVASIVE | RTNH_F_ONLINK))
+		if (nh->nh_flags & RTNH_F_ONLINK)
 			return -EINVAL;
 
 		rcu_read_lock();
-- 
1.7.7.6

^ permalink raw reply related

* Re: [RFC PATCH 4/5] mlx4: add support for fast rx drop bpf program
From: Jesper Dangaard Brouer @ 2016-04-02  8:23 UTC (permalink / raw)
  To: Brenden Blanco
  Cc: davem, netdev, tom, alexei.starovoitov, gerlitz, daniel,
	john.fastabend, brouer
In-Reply-To: <1459560118-5582-5-git-send-email-bblanco@plumgrid.com>


First of all, I'm very happy to see people start working on this!
Thanks you Brenden!

On Fri,  1 Apr 2016 18:21:57 -0700
Brenden Blanco <bblanco@plumgrid.com> wrote:

> Add support for the BPF_PROG_TYPE_PHYS_DEV hook in mlx4 driver.  Since
> bpf programs require a skb context to navigate the packet, build a
> percpu fake skb with the minimal fields. This avoids the costly
> allocation for packets that end up being dropped.
> 
> Since mlx4 is so far the only user of this pseudo skb, the build
> function is defined locally.
> 
> Signed-off-by: Brenden Blanco <bblanco@plumgrid.com>
> ---
[...]
> diff --git a/drivers/net/ethernet/mellanox/mlx4/en_rx.c b/drivers/net/ethernet/mellanox/mlx4/en_rx.c
> index 86bcfe5..03fe005 100644
> --- a/drivers/net/ethernet/mellanox/mlx4/en_rx.c
> +++ b/drivers/net/ethernet/mellanox/mlx4/en_rx.c
[...]
> @@ -764,6 +765,8 @@ int mlx4_en_process_rx_cq(struct net_device *dev, struct mlx4_en_cq *cq, int bud
>  	if (budget <= 0)
>  		return polled;
>  
> +	prog = READ_ONCE(priv->prog);
> +
>  	/* We assume a 1:1 mapping between CQEs and Rx descriptors, so Rx
>  	 * descriptor offset can be deduced from the CQE index instead of
>  	 * reading 'cqe->index' */
> @@ -840,6 +843,21 @@ int mlx4_en_process_rx_cq(struct net_device *dev, struct mlx4_en_cq *cq, int bud
>  		l2_tunnel = (dev->hw_enc_features & NETIF_F_RXCSUM) &&
>  			(cqe->vlan_my_qpn & cpu_to_be32(MLX4_CQE_L2_TUNNEL));
>  
> +		/* A bpf program gets first chance to drop the packet. It may
> +		 * read bytes but not past the end of the frag. A non-zero
> +		 * return indicates packet should be dropped.
> +		 */
> +		if (prog) {
> +			struct ethhdr *ethh;
> +

I think you need to DMA sync RX-page before you can safely access
packet data in page (on all arch's).

> +			ethh = (struct ethhdr *)(page_address(frags[0].page) +
> +						 frags[0].page_offset);
> +			if (mlx4_call_bpf(prog, ethh, length)) {

AFAIK length here covers all the frags[n].page, thus potentially
causing the BPF program to access memory out of bound (crash).

Having several page fragments is AFAIK an optimization for jumbo-frames
on PowerPC (which is a bit annoying for you use-case ;-)).


> +				priv->stats.rx_dropped++;
> +				goto next;
> +			}
> +		}
> +



-- 
Best regards,
  Jesper Dangaard Brouer
  MSc.CS, Principal Kernel Engineer at Red Hat
  Author of http://www.iptv-analyzer.org
  LinkedIn: http://www.linkedin.com/in/brouer

^ permalink raw reply

* re:
From: Ali Saeed @ 2016-04-02  8:16 UTC (permalink / raw)




I have a project request...

^ permalink raw reply

* Re: [net PATCH 2/2] ipv4/GRO: Make GRO conform to RFC 6864
From: Alexander Duyck @ 2016-04-02  6:02 UTC (permalink / raw)
  To: David Miller
  Cc: Eric Dumazet, Alex Duyck, Herbert Xu, Tom Herbert, Jesse Gross,
	Eric Dumazet, Netdev
In-Reply-To: <20160401.221632.846129753053296932.davem@davemloft.net>

On Fri, Apr 1, 2016 at 7:16 PM, David Miller <davem@davemloft.net> wrote:
> From: Alexander Duyck <alexander.duyck@gmail.com>
> Date: Fri, 1 Apr 2016 12:58:41 -0700
>
>> RFC 6864 is pretty explicit about this, IPv4 ID used only for
>> fragmentation.  https://tools.ietf.org/html/rfc6864#section-4.1
>>
>> The goal with this change is to try and keep most of the existing
>> behavior in tact without violating this rule?  I would think the
>> sequence number should give you the ability to infer a drop in the
>> case of TCP.  In the case of UDP tunnels we are now getting a bit more
>> data since we were ignoring the outer IP header ID before.
>
> When retransmits happen, the sequence numbers are the same.  But you
> can then use the IP ID to see exactly what happened.  You can even
> tell if multiple retransmits got reordered.
>
> Eric's use case is extremely useful, and flat out eliminates ambiguity
> when analyzing TCP traces.

I'm not really sure the IP ID is really all that useful.  On a 10G
link it takes about 80ms for it to wrap using an MTU of 1500.  That
value is going to keep dropping as we move up to 40G and 100G.  That
was one of the motivations behind RFC 6864 because it starts becoming
a bottle-neck if you want to keep the IP IDs truly unique.  In
addition while this change would allow you to combined disjointed IP
IDs I don't think you would lose the re-transmission as there would
likely be a gap in sequence numbers that would cause the flow to be
flushed from GRO, and it isn't as if we can prepend the retransmit to
the aggregated frame, we are always adding to the tail.

I would think the most likely case where this change would foul up any
IP IDs would be the garbage-in/garbage-out case like the IPv6 to IPv4
translation that is using the fixed IP ID of 0.  I agree that it isn't
desirable, however at the same time per RFC 6864 there is nothing
there to say we cannot do that.  In addition it would likely help to
mitigate some of the impact of IP ID on things like SLHC compression
since the resegmented frames would be incrementing so it would reduce
the number of times the IP ID would have to be updated.

- Alex

^ permalink raw reply

* Re: [PATCH] RDS: sync congestion map updating
From: santosh.shilimkar-QHcLZuEGTsvQT0dZR+AlfA @ 2016-04-02  4:30 UTC (permalink / raw)
  To: linux-rdma-u79uwXL29TY76Z2rM5mHXA, Wengang Wang,
	netdev-u79uwXL29TY76Z2rM5mHXA
In-Reply-To: <20160402011459.GC8565-2ukJVAZIZ/Y@public.gmane.org>



On 4/1/16 6:14 PM, Leon Romanovsky wrote:
> On Fri, Apr 01, 2016 at 12:47:24PM -0700, santosh shilimkar wrote:
>> (cc-ing netdev)
>> On 3/30/2016 7:59 PM, Wengang Wang wrote:
>>>
>>>
>>> 在 2016年03月31日 09:51, Wengang Wang 写道:
>>>>
>>>>
>>>> 在 2016年03月31日 01:16, santosh shilimkar 写道:
>>>>> Hi Wengang,
>>>>>
>>>>> On 3/30/2016 9:19 AM, Leon Romanovsky wrote:
>>>>>> On Wed, Mar 30, 2016 at 05:08:22PM +0800, Wengang Wang wrote:
>>>>>>> Problem is found that some among a lot of parallel RDS
>>>>>>> communications hang.
>>>>>>> In my test ten or so among 33 communications hang. The send
>>>>>>> requests got
>>>>>>> -ENOBUF error meaning the peer socket (port) is congested. But
>>>>>>> meanwhile,
>>>>>>> peer socket (port) is not congested.
>>>>>>>
>>>>>>> The congestion map updating can happen in two paths: one is in
>>>>>>> rds_recvmsg path
>>>>>>> and the other is when it receives packets from the hardware. There
>>>>>>> is no
>>>>>>> synchronization when updating the congestion map. So a bit
>>>>>>> operation (clearing)
>>>>>>> in the rds_recvmsg path can be skipped by another bit operation
>>>>>>> (setting) in
>>>>>>> hardware packet receving path.
>>>>>>>
>>>
>>> To be more detailed.  Here, the two paths (user calls recvmsg and
>>> hardware receives data) are for different rds socks. thus the
>>> rds_sock->rs_recv_lock is not helpful to sync the updating on congestion
>>> map.
>>>
>> For archive purpose, let me try to conclude the thread. I synced
>> with Wengang offlist and came up with below fix. I was under
>> impression that __set_bit_le() was atmoic version. After fixing
>> it like patch(end of the email), the bug gets addressed.
>>
>> I will probably send this as fix for stable as well.
>>
>>
>>  From 5614b61f6fdcd6ae0c04e50b97efd13201762294 Mon Sep 17 00:00:00 2001
>> From: Santosh Shilimkar <santosh.shilimkar-QHcLZuEGTsvQT0dZR+AlfA@public.gmane.org>
>> Date: Wed, 30 Mar 2016 23:26:47 -0700
>> Subject: [PATCH] RDS: Fix the atomicity for congestion map update
>>
>> Two different threads with different rds sockets may be in
>> rds_recv_rcvbuf_delta() via receive path. If their ports
>> both map to the same word in the congestion map, then
>> using non-atomic ops to update it could cause the map to
>> be incorrect. Lets use atomics to avoid such an issue.
>>
>> Full credit to Wengang <wen.gang.wang-QHcLZuEGTsvQT0dZR+AlfA@public.gmane.org> for
>> finding the issue, analysing it and also pointing out
>> to offending code with spin lock based fix.
>
> I'm glad that you solved the issue without spinlocks.
> Out of curiosity, I see that this patch is needed to be sent
> to Dave and applied by him. Is it right?
>
Right. I was planning send this one along with one more fix
together on netdev for Dave to pick it up.

> ➜  linus-tree git:(master) ./scripts/get_maintainer.pl -f net/rds/cong.c
> Santosh Shilimkar <santosh.shilimkar-QHcLZuEGTsvQT0dZR+AlfA@public.gmane.org> (supporter:RDS -
> RELIABLE DATAGRAM SOCKETS)
> "David S. Miller" <davem-fT/PcQaiUtIeIZ0/mPfg9Q@public.gmane.org> (maintainer:NETWORKING
> [GENERAL])
> netdev-u79uwXL29TY76Z2rM5mHXA@public.gmane.org (open list:RDS - RELIABLE DATAGRAM SOCKETS)
> linux-rdma-u79uwXL29TY76Z2rM5mHXA@public.gmane.org (open list:RDS - RELIABLE DATAGRAM SOCKETS)
> rds-devel-N0ozoZBvEnrZJqsBc5GL+g@public.gmane.org (moderated list:RDS - RELIABLE DATAGRAM
> SOCKETS)
> linux-kernel-u79uwXL29TY76Z2rM5mHXA@public.gmane.org (open list)
>
>>
>> Signed-off-by: Wengang Wang <wen.gang.wang-QHcLZuEGTsvQT0dZR+AlfA@public.gmane.org>
>> Signed-off-by: Santosh Shilimkar <santosh.shilimkar-QHcLZuEGTsvQT0dZR+AlfA@public.gmane.org>
>
> Reviewed-by: Leon Romanovsky <leon-2ukJVAZIZ/Y@public.gmane.org>
>
Thanks for review.
--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply

* Re: [RFC PATCH 4/5] mlx4: add support for fast rx drop bpf program
From: Alexei Starovoitov @ 2016-04-02  2:47 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: Brenden Blanco, davem, netdev, tom, ogerlitz, daniel,
	john.fastabend, brouer
In-Reply-To: <1459562911.6473.299.camel@edumazet-glaptop3.roam.corp.google.com>

On Fri, Apr 01, 2016 at 07:08:31PM -0700, Eric Dumazet wrote:
> On Fri, 2016-04-01 at 18:21 -0700, Brenden Blanco wrote:
> > Add support for the BPF_PROG_TYPE_PHYS_DEV hook in mlx4 driver.  Since
> > bpf programs require a skb context to navigate the packet, build a
> > percpu fake skb with the minimal fields. This avoids the costly
> > allocation for packets that end up being dropped.
> > 
> 
> 
> > +		/* A bpf program gets first chance to drop the packet. It may
> > +		 * read bytes but not past the end of the frag. A non-zero
> > +		 * return indicates packet should be dropped.
> > +		 */
> > +		if (prog) {
> > +			struct ethhdr *ethh;
> > +
> > +			ethh = (struct ethhdr *)(page_address(frags[0].page) +
> > +						 frags[0].page_offset);
> > +			if (mlx4_call_bpf(prog, ethh, length)) {
> > +				priv->stats.rx_dropped++;
> > +				goto next;
> > +			}
> > +		}
> > +
> 
> 
> 1) mlx4 can use multiple fragments (priv->num_frags) to hold an Ethernet
> frame. 
> 
> Still you pass a single fragment but total 'length' here : BPF program
> can read past the end of this first fragment and panic the box.
> 
> Please take a look at mlx4_en_complete_rx_desc() and you'll see what I
> mean.

yep.
my reading of that part was that num_frags > 1 is only for large
mtu sizes, so if we limit this for num_frags==1 only for now
we should be ok and it's still applicable for most of the use cases ?

> 2) priv->stats.rx_dropped is shared by all the RX queues -> false
> sharing.

yes. good point. I bet it was copy pasted from few lines below.
Should be trivial to convert it to percpu.

>    This is probably the right time to add a rx_dropped field in struct
> mlx4_en_rx_ring since you guys want to drop 14 Mpps, and 50 Mpps on
> higher speed links.

yes, could be per ring as well.
My guess we're hitting 14.5Mpps limit for empty bpf program
and for program that actually looks into the packet because we're
hitting 10G phy limit of 40G nic. Since physically 40G nic
consists of four 10G phys. There will be the same problem
with 100G and 50G nics. Both will be hitting 25G phy limit.
We need to vary packets somehow. Hopefully Or can explain that
bit of hw design.
Jesper's experiments with mlx4 showed the same 14.5Mpps limit
when sender blasting the same packet over and over again.
Great to see the experiments converging.

^ permalink raw reply

* Re: [net PATCH 2/2] ipv4/GRO: Make GRO conform to RFC 6864
From: Eric Dumazet @ 2016-04-02  2:26 UTC (permalink / raw)
  To: Herbert Xu; +Cc: aduyck, tom, jesse, alexander.duyck, edumazet, netdev, davem
In-Reply-To: <20160402021917.GA19570@gondor.apana.org.au>

On Sat, 2016-04-02 at 10:19 +0800, Herbert Xu wrote:
> On Fri, Apr 01, 2016 at 07:15:33PM -0700, Eric Dumazet wrote:
> > On Sat, 2016-04-02 at 09:57 +0800, Herbert Xu wrote:
> > > 
> > > We could easily fix that by adding a feature bit to control this,
> > > something like SKB_GSO_TCP_FIXEDID.
> > 
> > I understood the patch allowed to aggregate 4 segments having
> > 
> > ID=12   ID=40   ID=80  ID=1000
> 
> Right.  But I haven't seen any justification that we should aggregate
> such packets at all.  The only valid case that I have seen is for
> the case of unchanging IDs, no?

Presumably repeats of "DF=1 ID=0" should be what we really need to
handle.

On my wish list, having some reordering logic in GRO would be far more
interesting than these minor details ;)

^ permalink raw reply

* Re: [net PATCH 2/2] ipv4/GRO: Make GRO conform to RFC 6864
From: Eric Dumazet @ 2016-04-02  2:21 UTC (permalink / raw)
  To: David Miller
  Cc: alexander.duyck, aduyck, herbert, tom, jesse, edumazet, netdev
In-Reply-To: <20160401.221632.846129753053296932.davem@davemloft.net>

On Fri, 2016-04-01 at 22:16 -0400, David Miller wrote:
> From: Alexander Duyck <alexander.duyck@gmail.com>
> Date: Fri, 1 Apr 2016 12:58:41 -0700
> 
> > RFC 6864 is pretty explicit about this, IPv4 ID used only for
> > fragmentation.  https://tools.ietf.org/html/rfc6864#section-4.1
> > 
> > The goal with this change is to try and keep most of the existing
> > behavior in tact without violating this rule?  I would think the
> > sequence number should give you the ability to infer a drop in the
> > case of TCP.  In the case of UDP tunnels we are now getting a bit more
> > data since we were ignoring the outer IP header ID before.
> 
> When retransmits happen, the sequence numbers are the same.  But you
> can then use the IP ID to see exactly what happened.  You can even
> tell if multiple retransmits got reordered.
> 
> Eric's use case is extremely useful, and flat out eliminates ambiguity
> when analyzing TCP traces.

Yes, our team (including Van Jacobson ;) ) would be sad to not have
sequential IP ID (but then we don't have them for IPv6 ;) )

Since the cost of generating them is pretty small (inet->inet_id
counter), we probably should keep them in linux. Their usage will phase
out as IPv6 wins the Internet war...

^ permalink raw reply

* Re: [PATCH v2 net-next] net/core: generic support for disabling netdev features down stack
From: Michał Mirosław @ 2016-04-02  2:21 UTC (permalink / raw)
  To: Jarod Wilson
  Cc: Linux Kernel, David S. Miller, Eric Dumazet, Jay Vosburgh,
	Veaceslav Falico, Andy Gospodarek, Jiri Pirko,
	Nikolay Aleksandrov, Michal Kubecek, Alexander Duyck, netdev
In-Reply-To: <1446519359-21400-1-git-send-email-jarod@redhat.com>

Hi,

Sorry for digging up an old patch, but... ;-)

dev_disable_lro() is a leftover from ancient times. If you read commit
27660515a,
there is a hint where it should go. Please, read on if you'd like to
fix this properly.

2015-11-03 3:55 GMT+01:00 Jarod Wilson <jarod@redhat.com>:
> There are some netdev features, which when disabled on an upper device,
> such as a bonding master or a bridge, must be disabled and cannot be
> re-enabled on underlying devices.
[...]
> --- a/net/core/dev.c
> +++ b/net/core/dev.c
> @@ -6288,6 +6288,44 @@ static void rollback_registered(struct net_device *dev)
>         list_del(&single);
>  }
>
> +static netdev_features_t netdev_sync_upper_features(struct net_device *lower,
> +       struct net_device *upper, netdev_features_t features)
> +{
> +       netdev_features_t upper_disables = NETIF_F_UPPER_DISABLES;
> +       netdev_features_t feature;
> +
> +       for_each_netdev_feature(&upper_disables, feature) {
> +               if (!(upper->wanted_features & feature)
> +                   && (features & feature)) {
> +                       netdev_dbg(lower, "Dropping feature %pNF, upper dev %s has it off.\n",
> +                                  &feature, upper->name);
> +                       features &= ~feature;
> +               }
> +       }

You could do this once:

disable_features = ~upper->features & features & NETIF_F_UPPER_DISABLES;
if (features & disable_features)
  netdev_dbg(...)
features &= ~disable_features;

> +
> +       return features;
> +}
[...]
> @@ -6370,6 +6410,10 @@ int __netdev_update_features(struct net_device *dev)
>         /* driver might be less strict about feature dependencies */
>         features = netdev_fix_features(dev, features);
>
> +       /* some features can't be enabled if they're off an an upper device */
> +       netdev_for_each_upper_dev_rcu(dev, upper, iter)
> +               features = netdev_sync_upper_features(dev, upper, features);
> +
>         if (dev->features == features)
>                 return 0;
>

This should go into netdev_fix_features(), as it is a one single place,
where are feature dependencies are verified.

[...]
> @@ -6386,6 +6430,12 @@ int __netdev_update_features(struct net_device *dev)
>                 return -1;
>         }
>
> +       /* some features must be disabled on lower devices when disabled
> +        * on an upper device (think: bonding master or bridge)
> +        */
> +       netdev_for_each_lower_dev(dev, lower, iter)
> +               netdev_sync_lower_features(dev, lower, features);
> +
[...]

This should be equal in resulting flags to:

   netdev_for_each_lower_dev(dev, lower...)
     netdev_update_features(lower);

because netdev_fix_features(lower) will see already changed dev->features.

Best Regards,
Michał Mirosław

^ permalink raw reply

* Re: [net PATCH 2/2] ipv4/GRO: Make GRO conform to RFC 6864
From: Herbert Xu @ 2016-04-02  2:19 UTC (permalink / raw)
  To: Eric Dumazet; +Cc: aduyck, tom, jesse, alexander.duyck, edumazet, netdev, davem
In-Reply-To: <1459563333.6473.302.camel@edumazet-glaptop3.roam.corp.google.com>

On Fri, Apr 01, 2016 at 07:15:33PM -0700, Eric Dumazet wrote:
> On Sat, 2016-04-02 at 09:57 +0800, Herbert Xu wrote:
> > 
> > We could easily fix that by adding a feature bit to control this,
> > something like SKB_GSO_TCP_FIXEDID.
> 
> I understood the patch allowed to aggregate 4 segments having
> 
> ID=12   ID=40   ID=80  ID=1000

Right.  But I haven't seen any justification that we should aggregate
such packets at all.  The only valid case that I have seen is for
the case of unchanging IDs, no?

Cheers,
-- 
Email: Herbert Xu <herbert@gondor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt

^ permalink raw reply

* Re: [net PATCH 2/2] ipv4/GRO: Make GRO conform to RFC 6864
From: David Miller @ 2016-04-02  2:16 UTC (permalink / raw)
  To: alexander.duyck
  Cc: eric.dumazet, aduyck, herbert, tom, jesse, edumazet, netdev
In-Reply-To: <CAKgT0UcLvAF9gTzcaR=GZyTkM2UQ6ymDmnisjgZiG8hTE+PdLA@mail.gmail.com>

From: Alexander Duyck <alexander.duyck@gmail.com>
Date: Fri, 1 Apr 2016 12:58:41 -0700

> RFC 6864 is pretty explicit about this, IPv4 ID used only for
> fragmentation.  https://tools.ietf.org/html/rfc6864#section-4.1
> 
> The goal with this change is to try and keep most of the existing
> behavior in tact without violating this rule?  I would think the
> sequence number should give you the ability to infer a drop in the
> case of TCP.  In the case of UDP tunnels we are now getting a bit more
> data since we were ignoring the outer IP header ID before.

When retransmits happen, the sequence numbers are the same.  But you
can then use the IP ID to see exactly what happened.  You can even
tell if multiple retransmits got reordered.

Eric's use case is extremely useful, and flat out eliminates ambiguity
when analyzing TCP traces.

^ permalink raw reply

* Re: [net PATCH 2/2] ipv4/GRO: Make GRO conform to RFC 6864
From: Eric Dumazet @ 2016-04-02  2:15 UTC (permalink / raw)
  To: Herbert Xu; +Cc: aduyck, tom, jesse, alexander.duyck, edumazet, netdev, davem
In-Reply-To: <20160402015709.GA19365@gondor.apana.org.au>

On Sat, 2016-04-02 at 09:57 +0800, Herbert Xu wrote:
> Eric Dumazet <eric.dumazet@gmail.com> wrote:
> >
> > I do not particularly care, but it is worth mentioning that GRO+TSO
> > would not be idempotent anymore.
> 
> We could easily fix that by adding a feature bit to control this,
> something like SKB_GSO_TCP_FIXEDID.

I understood the patch allowed to aggregate 4 segments having

ID=12   ID=40   ID=80  ID=1000

-> resulting GRO packet with 4 segments and ID=12.  TSO would generate
12,13,14,15   or 12,12,12,12 with your flag ?

(Before the patch no aggregation took place and exact same packets were
forwarded with 12, 40, 80, 1000)

As I said, I am not sure we should care :)

^ permalink raw reply

* Re: [RFC PATCH 4/5] mlx4: add support for fast rx drop bpf program
From: Eric Dumazet @ 2016-04-02  2:08 UTC (permalink / raw)
  To: Brenden Blanco
  Cc: davem, netdev, tom, alexei.starovoitov, gerlitz, daniel,
	john.fastabend, brouer
In-Reply-To: <1459560118-5582-5-git-send-email-bblanco@plumgrid.com>

On Fri, 2016-04-01 at 18:21 -0700, Brenden Blanco wrote:
> Add support for the BPF_PROG_TYPE_PHYS_DEV hook in mlx4 driver.  Since
> bpf programs require a skb context to navigate the packet, build a
> percpu fake skb with the minimal fields. This avoids the costly
> allocation for packets that end up being dropped.
> 


> +		/* A bpf program gets first chance to drop the packet. It may
> +		 * read bytes but not past the end of the frag. A non-zero
> +		 * return indicates packet should be dropped.
> +		 */
> +		if (prog) {
> +			struct ethhdr *ethh;
> +
> +			ethh = (struct ethhdr *)(page_address(frags[0].page) +
> +						 frags[0].page_offset);
> +			if (mlx4_call_bpf(prog, ethh, length)) {
> +				priv->stats.rx_dropped++;
> +				goto next;
> +			}
> +		}
> +


1) mlx4 can use multiple fragments (priv->num_frags) to hold an Ethernet
frame. 

Still you pass a single fragment but total 'length' here : BPF program
can read past the end of this first fragment and panic the box.

Please take a look at mlx4_en_complete_rx_desc() and you'll see what I
mean.

2) priv->stats.rx_dropped is shared by all the RX queues -> false
sharing.

   This is probably the right time to add a rx_dropped field in struct
mlx4_en_rx_ring since you guys want to drop 14 Mpps, and 50 Mpps on
higher speed links.

^ permalink raw reply

* Re: [RESEND PATCH V4 2/3] IB/hns: Add HiSilicon RoCE driver support
From: Leon Romanovsky @ 2016-04-02  1:58 UTC (permalink / raw)
  To: Lijun Ou
  Cc: dledford-H+wXaHxf7aLQT0dZR+AlfA,
	sean.hefty-ral2JQCrhuEAvxtiuMwx3w,
	hal.rosenstock-Re5JQEeQqe8AvxtiuMwx3w,
	davem-fT/PcQaiUtIeIZ0/mPfg9Q,
	jeffrey.t.kirsher-ral2JQCrhuEAvxtiuMwx3w,
	jiri-VPRAkNaXOzVWk0Htik3J/w, ogerlitz-VPRAkNaXOzVWk0Htik3J/w,
	linux-rdma-u79uwXL29TY76Z2rM5mHXA,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	netdev-u79uwXL29TY76Z2rM5mHXA,
	gongyangming-hv44wF8Li93QT0dZR+AlfA,
	xiaokun-hv44wF8Li93QT0dZR+AlfA,
	tangchaofei-hv44wF8Li93QT0dZR+AlfA,
	haifeng.wei-hv44wF8Li93QT0dZR+AlfA,
	yisen.zhuang-hv44wF8Li93QT0dZR+AlfA,
	yankejian-hv44wF8Li93QT0dZR+AlfA,
	lisheng011-hv44wF8Li93QT0dZR+AlfA,
	charles.chenxin-hv44wF8Li93QT0dZR+AlfA,
	linuxarm-hv44wF8Li93QT0dZR+AlfA
In-Reply-To: <1459502492-19891-3-git-send-email-oulijun-hv44wF8Li93QT0dZR+AlfA@public.gmane.org>

On Fri, Apr 01, 2016 at 05:21:31PM +0800, Lijun Ou wrote:
> The driver for HiSilicon RoCE is a platform driver.
> The driver will support multiple versions of hardware. Currently only "v1"
> for hip06 SoC is supported.
> The driver includes two parts: common driver and hardware-specific
> operations. hns_roce_v1_hw.c and hns_roce_v1_hw.h are files for
> hardware-specific operations only for v1 engine, and other files(.c and .h)
> for common algorithm and common hardware operations.
> 
> Signed-off-by: Lijun Ou <oulijun-hv44wF8Li93QT0dZR+AlfA@public.gmane.org>
> Signed-off-by: Wei Hu(Xavier) <xavier.huwei-hv44wF8Li93QT0dZR+AlfA@public.gmane.org>
> Signed-off-by: Znlong <zhaonenglong-hv44wF8Li93QT0dZR+AlfA@public.gmane.org>
> ---
>  MAINTAINERS                                        |    8 +
>  drivers/infiniband/Kconfig                         |    1 +
>  drivers/infiniband/hw/Makefile                     |    1 +
>  drivers/infiniband/hw/hisilicon/hns/Kconfig        |   10 +
>  drivers/infiniband/hw/hisilicon/hns/Makefile       |    9 +
>  drivers/infiniband/hw/hisilicon/hns/hns_roce_ah.c  |  110 +

We are not adding name of company (hisilicon) for infiniband HW drivers
drivers/infiniband/hw/hisilicon/hns/hns_roce_ah.c
--->
drivers/infiniband/hw/hns/hns_roce_ah.c


>  .../infiniband/hw/hisilicon/hns/hns_roce_alloc.c   |  239 ++
 ^^^^^^
Please fix you paths.

>  drivers/infiniband/hw/hisilicon/hns/hns_roce_cmd.c |  338 +++
>  drivers/infiniband/hw/hisilicon/hns/hns_roce_cmd.h |   80 +
>  .../infiniband/hw/hisilicon/hns/hns_roce_common.h  |  308 +++
>  drivers/infiniband/hw/hisilicon/hns/hns_roce_cq.c  |  436 +++
>  .../infiniband/hw/hisilicon/hns/hns_roce_device.h  |  794 ++++++
>  drivers/infiniband/hw/hisilicon/hns/hns_roce_eq.c  |  758 ++++++
>  drivers/infiniband/hw/hisilicon/hns/hns_roce_eq.h  |  132 +
>  drivers/infiniband/hw/hisilicon/hns/hns_roce_icm.c |  578 ++++
>  drivers/infiniband/hw/hisilicon/hns/hns_roce_icm.h |  112 +
>  .../infiniband/hw/hisilicon/hns/hns_roce_main.c    | 1097 ++++++++
>  drivers/infiniband/hw/hisilicon/hns/hns_roce_mr.c  |  605 +++++
>  drivers/infiniband/hw/hisilicon/hns/hns_roce_pd.c  |  124 +
>  drivers/infiniband/hw/hisilicon/hns/hns_roce_qp.c  |  841 ++++++
>  .../infiniband/hw/hisilicon/hns/hns_roce_user.h    |   31 +
>  .../infiniband/hw/hisilicon/hns/hns_roce_v1_hw.c   | 2832 ++++++++++++++++++++
>  .../infiniband/hw/hisilicon/hns/hns_roce_v1_hw.h   |  985 +++++++
                                          ^^^^^^
Do you support v1 of RoCE or v1 of your HW?

>  23 files changed, 10429 insertions(+)

Please appreciate the effort needed to review such large patch and
invest time and effort to divide this to number of small easy review patches.
--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply

* Re: [net PATCH 2/2] ipv4/GRO: Make GRO conform to RFC 6864
From: Herbert Xu @ 2016-04-02  1:57 UTC (permalink / raw)
  To: Eric Dumazet; +Cc: aduyck, tom, jesse, alexander.duyck, edumazet, netdev, davem
In-Reply-To: <1459536543.6473.289.camel@edumazet-glaptop3.roam.corp.google.com>

Eric Dumazet <eric.dumazet@gmail.com> wrote:
>
> I do not particularly care, but it is worth mentioning that GRO+TSO
> would not be idempotent anymore.

We could easily fix that by adding a feature bit to control this,
something like SKB_GSO_TCP_FIXEDID.

Cheers,
-- 
Email: Herbert Xu <herbert@gondor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt

^ permalink raw reply

* Re: Question on rhashtable in worst-case scenario.
From: Herbert Xu @ 2016-04-02  1:46 UTC (permalink / raw)
  To: Johannes Berg
  Cc: Ben Greear, David Miller, linux-kernel, linux-wireless, netdev,
	tgraf
In-Reply-To: <1459546450.3342.22.camel@sipsolutions.net>

On Fri, Apr 01, 2016 at 11:34:10PM +0200, Johannes Berg wrote:
>
> I was thinking about that one - it's not obvious to me from the code
> how this "explicitly checking for dups" would be done or let's say how
> rhashtable differentiates. But since it seems to work for Ben until
> hitting a certain number of identical keys, surely that's just me not
> understanding the code rather than anything else :)

It's really simple, rhashtable_insert_fast does not check for dups
while rhashtable_lookup_insert_* do.

Cheers,
-- 
Email: Herbert Xu <herbert@gondor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt

^ permalink raw reply


This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox