Netdev List

Netdev List
 help / color / mirror / Atom feed

* Re: [PATCH v3 26/30] net: sk_buff rbnode reorg
From: Christoph Paasch @ 2018-10-18 16:01 UTC (permalink / raw)
  To: Stephen Hemminger
  Cc: David Miller, gregkh, netdev, stable, Eric Dumazet, soheil,
	weiwan, willemb
In-Reply-To: <20180913145902.17531-27-sthemmin@microsoft.com>

Hello,

On Thu, Sep 13, 2018 at 8:00 AM Stephen Hemminger
<stephen@networkplumber.org> wrote:
>
> From: Eric Dumazet <edumazet@google.com>
>
> commit bffa72cf7f9df842f0016ba03586039296b4caaf upstream
>
> skb->rbnode shares space with skb->next, skb->prev and skb->tstamp
>
> Current uses (TCP receive ofo queue and netem) need to save/restore
> tstamp, while skb->dev is either NULL (TCP) or a constant for a given
> queue (netem).
>
> Since we plan using an RB tree for TCP retransmit queue to speedup SACK
> processing with large BDP, this patch exchanges skb->dev and
> skb->tstamp.
>
> This saves some overhead in both TCP and netem.
>
> v2: removes the swtstamp field from struct tcp_skb_cb
>
> Signed-off-by: Eric Dumazet <edumazet@google.com>
> Cc: Soheil Hassas Yeganeh <soheil@google.com>
> Cc: Wei Wang <weiwan@google.com>
> Cc: Willem de Bruijn <willemb@google.com>
> Acked-by: Soheil Hassas Yeganeh <soheil@google.com>
> Signed-off-by: David S. Miller <davem@davemloft.net>
> ---
>  include/linux/skbuff.h                  |  24 ++--
>  include/net/inet_frag.h                 |   3 +-
>  net/ipv4/inet_fragment.c                |  16 ++-
>  net/ipv4/ip_fragment.c                  | 182 +++++++++++++-----------
>  net/ipv6/netfilter/nf_conntrack_reasm.c |   1 +
>  net/ipv6/reassembly.c                   |   1 +
>  6 files changed, 129 insertions(+), 98 deletions(-)

this change is missing the pieces in sch_netem.c that Eric did as well.

Thus, there is a panic that can be hit à la:
https://github.com/multipath-tcp/mptcp/issues/285#issuecomment-431001559


The following diff fixes it (I can submit a proper patch later today).
Storing of the tstamp can be changed as well (as Eric did in the
original patch), but that's not causing any issues. I see a backport
to v4.9 has been submitted as well...
---
diff --git a/net/sched/sch_netem.c b/net/sched/sch_netem.c
index 2a2ab6bfe5d8..3d325b840802 100644
--- a/net/sched/sch_netem.c
+++ b/net/sched/sch_netem.c
@@ -624,6 +624,10 @@ static struct sk_buff *netem_dequeue(struct Qdisc *sch)
  skb->next = NULL;
  skb->prev = NULL;
  skb->tstamp = netem_skb_cb(skb)->tstamp_save;
+ /* skb->dev shares skb->rbnode area,
+ * we need to restore its value.
+ */
+ skb->dev = qdisc_dev(sch);

 #ifdef CONFIG_NET_CLS_ACT
  /*





>
> diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h
> index 2837e55df03e..f64e88444082 100644
> --- a/include/linux/skbuff.h
> +++ b/include/linux/skbuff.h
> @@ -663,23 +663,27 @@ struct sk_buff {
>                         struct sk_buff          *prev;
>
>                         union {
> -                               ktime_t         tstamp;
> -                               u64             skb_mstamp;
> +                               struct net_device       *dev;
> +                               /* Some protocols might use this space to store information,
> +                                * while device pointer would be NULL.
> +                                * UDP receive path is one user.
> +                                */
> +                               unsigned long           dev_scratch;
>                         };
>                 };
> -               struct rb_node  rbnode; /* used in netem & tcp stack */
> +               struct rb_node          rbnode; /* used in netem, ip4 defrag, and tcp stack */
> +               struct list_head        list;
>         };
> -       struct sock             *sk;
>
>         union {
> -               struct net_device       *dev;
> -               /* Some protocols might use this space to store information,
> -                * while device pointer would be NULL.
> -                * UDP receive path is one user.
> -                */
> -               unsigned long           dev_scratch;
> +               struct sock             *sk;
>                 int                     ip_defrag_offset;
>         };
> +
> +       union {
> +               ktime_t         tstamp;
> +               u64             skb_mstamp;
> +       };
>         /*
>          * This is the control buffer. It is free to use for every
>          * layer. Please put your private variables there. If you
> diff --git a/include/net/inet_frag.h b/include/net/inet_frag.h
> index ed07e3786d98..e4c71a7644be 100644
> --- a/include/net/inet_frag.h
> +++ b/include/net/inet_frag.h
> @@ -75,7 +75,8 @@ struct inet_frag_queue {
>         struct timer_list       timer;
>         spinlock_t              lock;
>         refcount_t              refcnt;
> -       struct sk_buff          *fragments;
> +       struct sk_buff          *fragments;  /* Used in IPv6. */
> +       struct rb_root          rb_fragments; /* Used in IPv4. */
>         struct sk_buff          *fragments_tail;
>         ktime_t                 stamp;
>         int                     len;
> diff --git a/net/ipv4/inet_fragment.c b/net/ipv4/inet_fragment.c
> index c9e35b81d093..6904cbb7de1a 100644
> --- a/net/ipv4/inet_fragment.c
> +++ b/net/ipv4/inet_fragment.c
> @@ -136,12 +136,16 @@ void inet_frag_destroy(struct inet_frag_queue *q)
>         fp = q->fragments;
>         nf = q->net;
>         f = nf->f;
> -       while (fp) {
> -               struct sk_buff *xp = fp->next;
> -
> -               sum_truesize += fp->truesize;
> -               kfree_skb(fp);
> -               fp = xp;
> +       if (fp) {
> +               do {
> +                       struct sk_buff *xp = fp->next;
> +
> +                       sum_truesize += fp->truesize;
> +                       kfree_skb(fp);
> +                       fp = xp;
> +               } while (fp);
> +       } else {
> +               sum_truesize = skb_rbtree_purge(&q->rb_fragments);
>         }
>         sum = sum_truesize + f->qsize;
>
> diff --git a/net/ipv4/ip_fragment.c b/net/ipv4/ip_fragment.c
> index 960bf5eab59f..0e8f8de77e71 100644
> --- a/net/ipv4/ip_fragment.c
> +++ b/net/ipv4/ip_fragment.c
> @@ -136,7 +136,7 @@ static void ip_expire(struct timer_list *t)
>  {
>         struct inet_frag_queue *frag = from_timer(frag, t, timer);
>         const struct iphdr *iph;
> -       struct sk_buff *head;
> +       struct sk_buff *head = NULL;
>         struct net *net;
>         struct ipq *qp;
>         int err;
> @@ -152,14 +152,31 @@ static void ip_expire(struct timer_list *t)
>
>         ipq_kill(qp);
>         __IP_INC_STATS(net, IPSTATS_MIB_REASMFAILS);
> -
> -       head = qp->q.fragments;
> -
>         __IP_INC_STATS(net, IPSTATS_MIB_REASMTIMEOUT);
>
> -       if (!(qp->q.flags & INET_FRAG_FIRST_IN) || !head)
> +       if (!qp->q.flags & INET_FRAG_FIRST_IN)
>                 goto out;
>
> +       /* sk_buff::dev and sk_buff::rbnode are unionized. So we
> +        * pull the head out of the tree in order to be able to
> +        * deal with head->dev.
> +        */
> +       if (qp->q.fragments) {
> +               head = qp->q.fragments;
> +               qp->q.fragments = head->next;
> +       } else {
> +               head = skb_rb_first(&qp->q.rb_fragments);
> +               if (!head)
> +                       goto out;
> +               rb_erase(&head->rbnode, &qp->q.rb_fragments);
> +               memset(&head->rbnode, 0, sizeof(head->rbnode));
> +               barrier();
> +       }
> +       if (head == qp->q.fragments_tail)
> +               qp->q.fragments_tail = NULL;
> +
> +       sub_frag_mem_limit(qp->q.net, head->truesize);
> +
>         head->dev = dev_get_by_index_rcu(net, qp->iif);
>         if (!head->dev)
>                 goto out;
> @@ -179,16 +196,16 @@ static void ip_expire(struct timer_list *t)
>             (skb_rtable(head)->rt_type != RTN_LOCAL))
>                 goto out;
>
> -       skb_get(head);
>         spin_unlock(&qp->q.lock);
>         icmp_send(head, ICMP_TIME_EXCEEDED, ICMP_EXC_FRAGTIME, 0);
> -       kfree_skb(head);
>         goto out_rcu_unlock;
>
>  out:
>         spin_unlock(&qp->q.lock);
>  out_rcu_unlock:
>         rcu_read_unlock();
> +       if (head)
> +               kfree_skb(head);
>         ipq_put(qp);
>  }
>
> @@ -231,7 +248,7 @@ static int ip_frag_too_far(struct ipq *qp)
>         end = atomic_inc_return(&peer->rid);
>         qp->rid = end;
>
> -       rc = qp->q.fragments && (end - start) > max;
> +       rc = qp->q.fragments_tail && (end - start) > max;
>
>         if (rc) {
>                 struct net *net;
> @@ -245,7 +262,6 @@ static int ip_frag_too_far(struct ipq *qp)
>
>  static int ip_frag_reinit(struct ipq *qp)
>  {
> -       struct sk_buff *fp;
>         unsigned int sum_truesize = 0;
>
>         if (!mod_timer(&qp->q.timer, jiffies + qp->q.net->timeout)) {
> @@ -253,20 +269,14 @@ static int ip_frag_reinit(struct ipq *qp)
>                 return -ETIMEDOUT;
>         }
>
> -       fp = qp->q.fragments;
> -       do {
> -               struct sk_buff *xp = fp->next;
> -
> -               sum_truesize += fp->truesize;
> -               kfree_skb(fp);
> -               fp = xp;
> -       } while (fp);
> +       sum_truesize = skb_rbtree_purge(&qp->q.rb_fragments);
>         sub_frag_mem_limit(qp->q.net, sum_truesize);
>
>         qp->q.flags = 0;
>         qp->q.len = 0;
>         qp->q.meat = 0;
>         qp->q.fragments = NULL;
> +       qp->q.rb_fragments = RB_ROOT;
>         qp->q.fragments_tail = NULL;
>         qp->iif = 0;
>         qp->ecn = 0;
> @@ -278,7 +288,8 @@ static int ip_frag_reinit(struct ipq *qp)
>  static int ip_frag_queue(struct ipq *qp, struct sk_buff *skb)
>  {
>         struct net *net = container_of(qp->q.net, struct net, ipv4.frags);
> -       struct sk_buff *prev, *next;
> +       struct rb_node **rbn, *parent;
> +       struct sk_buff *skb1;
>         struct net_device *dev;
>         unsigned int fragsize;
>         int flags, offset;
> @@ -341,58 +352,58 @@ static int ip_frag_queue(struct ipq *qp, struct sk_buff *skb)
>         if (err)
>                 goto err;
>
> -       /* Find out which fragments are in front and at the back of us
> -        * in the chain of fragments so far.  We must know where to put
> -        * this fragment, right?
> -        */
> -       prev = qp->q.fragments_tail;
> -       if (!prev || prev->ip_defrag_offset < offset) {
> -               next = NULL;
> -               goto found;
> -       }
> -       prev = NULL;
> -       for (next = qp->q.fragments; next != NULL; next = next->next) {
> -               if (next->ip_defrag_offset >= offset)
> -                       break;  /* bingo! */
> -               prev = next;
> -       }
> +       /* Note : skb->rbnode and skb->dev share the same location. */
> +       dev = skb->dev;
> +       /* Makes sure compiler wont do silly aliasing games */
> +       barrier();
>
> -found:
>         /* RFC5722, Section 4, amended by Errata ID : 3089
>          *                          When reassembling an IPv6 datagram, if
>          *   one or more its constituent fragments is determined to be an
>          *   overlapping fragment, the entire datagram (and any constituent
>          *   fragments) MUST be silently discarded.
>          *
> -        * We do the same here for IPv4.
> +        * We do the same here for IPv4 (and increment an snmp counter).
>          */
>
> -       /* Is there an overlap with the previous fragment? */
> -       if (prev &&
> -           (prev->ip_defrag_offset + prev->len) > offset)
> -               goto discard_qp;
> -
> -       /* Is there an overlap with the next fragment? */
> -       if (next && next->ip_defrag_offset < end)
> -               goto discard_qp;
> +       /* Find out where to put this fragment.  */
> +       skb1 = qp->q.fragments_tail;
> +       if (!skb1) {
> +               /* This is the first fragment we've received. */
> +               rb_link_node(&skb->rbnode, NULL, &qp->q.rb_fragments.rb_node);
> +               qp->q.fragments_tail = skb;
> +       } else if ((skb1->ip_defrag_offset + skb1->len) < end) {
> +               /* This is the common/special case: skb goes to the end. */
> +               /* Detect and discard overlaps. */
> +               if (offset < (skb1->ip_defrag_offset + skb1->len))
> +                       goto discard_qp;
> +               /* Insert after skb1. */
> +               rb_link_node(&skb->rbnode, &skb1->rbnode, &skb1->rbnode.rb_right);
> +               qp->q.fragments_tail = skb;
> +       } else {
> +               /* Binary search. Note that skb can become the first fragment, but
> +                * not the last (covered above). */
> +               rbn = &qp->q.rb_fragments.rb_node;
> +               do {
> +                       parent = *rbn;
> +                       skb1 = rb_to_skb(parent);
> +                       if (end <= skb1->ip_defrag_offset)
> +                               rbn = &parent->rb_left;
> +                       else if (offset >= skb1->ip_defrag_offset + skb1->len)
> +                               rbn = &parent->rb_right;
> +                       else /* Found an overlap with skb1. */
> +                               goto discard_qp;
> +               } while (*rbn);
> +               /* Here we have parent properly set, and rbn pointing to
> +                * one of its NULL left/right children. Insert skb. */
> +               rb_link_node(&skb->rbnode, parent, rbn);
> +       }
> +       rb_insert_color(&skb->rbnode, &qp->q.rb_fragments);
>
> -       /* Note : skb->ip_defrag_offset and skb->dev share the same location */
> -       dev = skb->dev;
>         if (dev)
>                 qp->iif = dev->ifindex;
> -       /* Makes sure compiler wont do silly aliasing games */
> -       barrier();
>         skb->ip_defrag_offset = offset;
>
> -       /* Insert this fragment in the chain of fragments. */
> -       skb->next = next;
> -       if (!next)
> -               qp->q.fragments_tail = skb;
> -       if (prev)
> -               prev->next = skb;
> -       else
> -               qp->q.fragments = skb;
> -
>         qp->q.stamp = skb->tstamp;
>         qp->q.meat += skb->len;
>         qp->ecn |= ecn;
> @@ -414,7 +425,7 @@ static int ip_frag_queue(struct ipq *qp, struct sk_buff *skb)
>                 unsigned long orefdst = skb->_skb_refdst;
>
>                 skb->_skb_refdst = 0UL;
> -               err = ip_frag_reasm(qp, prev, dev);
> +               err = ip_frag_reasm(qp, skb, dev);
>                 skb->_skb_refdst = orefdst;
>                 return err;
>         }
> @@ -431,15 +442,15 @@ static int ip_frag_queue(struct ipq *qp, struct sk_buff *skb)
>         return err;
>  }
>
> -
>  /* Build a new IP datagram from all its fragments. */
> -
> -static int ip_frag_reasm(struct ipq *qp, struct sk_buff *prev,
> +static int ip_frag_reasm(struct ipq *qp, struct sk_buff *skb,
>                          struct net_device *dev)
>  {
>         struct net *net = container_of(qp->q.net, struct net, ipv4.frags);
>         struct iphdr *iph;
> -       struct sk_buff *fp, *head = qp->q.fragments;
> +       struct sk_buff *fp, *head = skb_rb_first(&qp->q.rb_fragments);
> +       struct sk_buff **nextp; /* To build frag_list. */
> +       struct rb_node *rbn;
>         int len;
>         int ihlen;
>         int err;
> @@ -453,25 +464,20 @@ static int ip_frag_reasm(struct ipq *qp, struct sk_buff *prev,
>                 goto out_fail;
>         }
>         /* Make the one we just received the head. */
> -       if (prev) {
> -               head = prev->next;
> -               fp = skb_clone(head, GFP_ATOMIC);
> +       if (head != skb) {
> +               fp = skb_clone(skb, GFP_ATOMIC);
>                 if (!fp)
>                         goto out_nomem;
> -
> -               fp->next = head->next;
> -               if (!fp->next)
> +               rb_replace_node(&skb->rbnode, &fp->rbnode, &qp->q.rb_fragments);
> +               if (qp->q.fragments_tail == skb)
>                         qp->q.fragments_tail = fp;
> -               prev->next = fp;
> -
> -               skb_morph(head, qp->q.fragments);
> -               head->next = qp->q.fragments->next;
> -
> -               consume_skb(qp->q.fragments);
> -               qp->q.fragments = head;
> +               skb_morph(skb, head);
> +               rb_replace_node(&head->rbnode, &skb->rbnode,
> +                               &qp->q.rb_fragments);
> +               consume_skb(head);
> +               head = skb;
>         }
>
> -       WARN_ON(!head);
>         WARN_ON(head->ip_defrag_offset != 0);
>
>         /* Allocate a new buffer for the datagram. */
> @@ -496,24 +502,35 @@ static int ip_frag_reasm(struct ipq *qp, struct sk_buff *prev,
>                 clone = alloc_skb(0, GFP_ATOMIC);
>                 if (!clone)
>                         goto out_nomem;
> -               clone->next = head->next;
> -               head->next = clone;
>                 skb_shinfo(clone)->frag_list = skb_shinfo(head)->frag_list;
>                 skb_frag_list_init(head);
>                 for (i = 0; i < skb_shinfo(head)->nr_frags; i++)
>                         plen += skb_frag_size(&skb_shinfo(head)->frags[i]);
>                 clone->len = clone->data_len = head->data_len - plen;
> -               head->data_len -= clone->len;
> -               head->len -= clone->len;
> +               skb->truesize += clone->truesize;
>                 clone->csum = 0;
>                 clone->ip_summed = head->ip_summed;
>                 add_frag_mem_limit(qp->q.net, clone->truesize);
> +               skb_shinfo(head)->frag_list = clone;
> +               nextp = &clone->next;
> +       } else {
> +               nextp = &skb_shinfo(head)->frag_list;
>         }
>
> -       skb_shinfo(head)->frag_list = head->next;
>         skb_push(head, head->data - skb_network_header(head));
>
> -       for (fp=head->next; fp; fp = fp->next) {
> +       /* Traverse the tree in order, to build frag_list. */
> +       rbn = rb_next(&head->rbnode);
> +       rb_erase(&head->rbnode, &qp->q.rb_fragments);
> +       while (rbn) {
> +               struct rb_node *rbnext = rb_next(rbn);
> +               fp = rb_to_skb(rbn);
> +               rb_erase(rbn, &qp->q.rb_fragments);
> +               rbn = rbnext;
> +               *nextp = fp;
> +               nextp = &fp->next;
> +               fp->prev = NULL;
> +               memset(&fp->rbnode, 0, sizeof(fp->rbnode));
>                 head->data_len += fp->len;
>                 head->len += fp->len;
>                 if (head->ip_summed != fp->ip_summed)
> @@ -524,7 +541,9 @@ static int ip_frag_reasm(struct ipq *qp, struct sk_buff *prev,
>         }
>         sub_frag_mem_limit(qp->q.net, head->truesize);
>
> +       *nextp = NULL;
>         head->next = NULL;
> +       head->prev = NULL;
>         head->dev = dev;
>         head->tstamp = qp->q.stamp;
>         IPCB(head)->frag_max_size = max(qp->max_df_size, qp->q.max_size);
> @@ -552,6 +571,7 @@ static int ip_frag_reasm(struct ipq *qp, struct sk_buff *prev,
>
>         __IP_INC_STATS(net, IPSTATS_MIB_REASMOKS);
>         qp->q.fragments = NULL;
> +       qp->q.rb_fragments = RB_ROOT;
>         qp->q.fragments_tail = NULL;
>         return 0;
>
> diff --git a/net/ipv6/netfilter/nf_conntrack_reasm.c b/net/ipv6/netfilter/nf_conntrack_reasm.c
> index 1d2f07cde01a..82ce0d0f54bf 100644
> --- a/net/ipv6/netfilter/nf_conntrack_reasm.c
> +++ b/net/ipv6/netfilter/nf_conntrack_reasm.c
> @@ -471,6 +471,7 @@ nf_ct_frag6_reasm(struct frag_queue *fq, struct sk_buff *prev,  struct net_devic
>                                           head->csum);
>
>         fq->q.fragments = NULL;
> +       fq->q.rb_fragments = RB_ROOT;
>         fq->q.fragments_tail = NULL;
>
>         return true;
> diff --git a/net/ipv6/reassembly.c b/net/ipv6/reassembly.c
> index afaad60dc2ac..ede0061b6f5d 100644
> --- a/net/ipv6/reassembly.c
> +++ b/net/ipv6/reassembly.c
> @@ -472,6 +472,7 @@ static int ip6_frag_reasm(struct frag_queue *fq, struct sk_buff *prev,
>         __IP6_INC_STATS(net, __in6_dev_get(dev), IPSTATS_MIB_REASMOKS);
>         rcu_read_unlock();
>         fq->q.fragments = NULL;
> +       fq->q.rb_fragments = RB_ROOT;
>         fq->q.fragments_tail = NULL;
>         return 1;
>
> --
> 2.18.0
>

^ permalink raw reply related

* [PATCH net-next] bnxt_en: Copy and paste bug in extended tx_stats
From: Dan Carpenter @ 2018-10-18  8:02 UTC (permalink / raw)
  To: Michael Chan; +Cc: David S. Miller, netdev, kernel-janitors

The struct type was copied from the line before but it should be "tx"
instead of "rx".  I have reviewed the code and I can't immediately see
that this bug causes a runtime issue.

Fixes: 36e53349b60b ("bnxt_en: Add additional extended port statistics.")
Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
---
This is from static analysis and I don't have a way to test it.

 drivers/net/ethernet/broadcom/bnxt/bnxt.h | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/net/ethernet/broadcom/bnxt/bnxt.h b/drivers/net/ethernet/broadcom/bnxt/bnxt.h
index 0fe57e36912b..498b373c992d 100644
--- a/drivers/net/ethernet/broadcom/bnxt/bnxt.h
+++ b/drivers/net/ethernet/broadcom/bnxt/bnxt.h
@@ -1471,7 +1471,7 @@ struct bnxt {
 	struct rx_port_stats	*hw_rx_port_stats;
 	struct tx_port_stats	*hw_tx_port_stats;
 	struct rx_port_stats_ext	*hw_rx_port_stats_ext;
-	struct rx_port_stats_ext	*hw_tx_port_stats_ext;
+	struct tx_port_stats_ext	*hw_tx_port_stats_ext;
 	dma_addr_t		hw_rx_port_stats_map;
 	dma_addr_t		hw_tx_port_stats_map;
 	dma_addr_t		hw_rx_port_stats_ext_map;
-- 
2.11.0

^ permalink raw reply related

* Re: [PATCH bpf-next 2/3] tools, perf: use smp_{rmb,mb} barriers instead of {rmb,mb}
From: Peter Zijlstra @ 2018-10-18  8:14 UTC (permalink / raw)
  To: Daniel Borkmann
  Cc: alexei.starovoitov, paulmck, will.deacon, acme, yhs,
	john.fastabend, netdev
In-Reply-To: <55f86215-44a8-2bb8-b1d0-a77a142dc697@iogearbox.net>

On Thu, Oct 18, 2018 at 01:10:15AM +0200, Daniel Borkmann wrote:

> Wouldn't this then also allow the kernel side to use smp_store_release()
> when it updates the head? We'd be pretty much at the model as described
> in Documentation/core-api/circular-buffers.rst.
> 
> Meaning, rough pseudo-code diff would look as:
> 
> diff --git a/kernel/events/ring_buffer.c b/kernel/events/ring_buffer.c
> index 5d3cf40..3d96275 100644
> --- a/kernel/events/ring_buffer.c
> +++ b/kernel/events/ring_buffer.c
> @@ -84,8 +84,9 @@ static void perf_output_put_handle(struct perf_output_handle *handle)
>  	 *
>  	 * See perf_output_begin().
>  	 */
> -	smp_wmb(); /* B, matches C */
> -	rb->user_page->data_head = head;
> +
> +	/* B, matches C */
> +	smp_store_release(&rb->user_page->data_head, head);

Yes, this would be correct.

The reason we didn't do this is because smp_store_release() ends up
being smp_mb() + WRITE_ONCE() for a fair number of platforms, even if
they have a cheaper smp_wmb(). Most notably ARM.

(ARM64 OTOH would like to have smp_store_release() there I imagine;
while x86 doesn't care either way around).

A similar concern exists for the smp_load_acquire() I proposed for the
userspace side, ARM would have to resort to smp_mb() in that situation,
instead of the cheaper smp_rmb().

The smp_store_release() on the userspace side will actually be of equal
cost or cheaper, since it already has an smp_mb(). Most notably, x86 can
avoid barrier entirely, because TSO doesn't allow the LOAD-STORE reorder
(it only allows the STORE-LOAD reorder). And PowerPC can use LWSYNC
instead of SYNC.

^ permalink raw reply

* [net PATCH] net: sched: Fix for duplicate class dump
From: Phil Sutter @ 2018-10-18  8:34 UTC (permalink / raw)
  To: David Miller; +Cc: netdev, Eric Dumazet

When dumping classes by parent, kernel would return classes twice:

| # tc qdisc add dev lo root prio
| # tc class show dev lo
| class prio 8001:1 parent 8001:
| class prio 8001:2 parent 8001:
| class prio 8001:3 parent 8001:
| # tc class show dev lo parent 8001:
| class prio 8001:1 parent 8001:
| class prio 8001:2 parent 8001:
| class prio 8001:3 parent 8001:
| class prio 8001:1 parent 8001:
| class prio 8001:2 parent 8001:
| class prio 8001:3 parent 8001:

This comes from qdisc_match_from_root() potentially returning the root
qdisc itself if its handle matched. Though in that case, root's classes
were already dumped a few lines above.

Fixes: cb395b2010879 ("net: sched: optimize class dumps")
Signed-off-by: Phil Sutter <phil@nwl.cc>
---
 net/sched/sch_api.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/net/sched/sch_api.c b/net/sched/sch_api.c
index 6684641ea3448..3dc0acf542454 100644
--- a/net/sched/sch_api.c
+++ b/net/sched/sch_api.c
@@ -2059,7 +2059,8 @@ static int tc_dump_tclass_root(struct Qdisc *root, struct sk_buff *skb,
 
 	if (tcm->tcm_parent) {
 		q = qdisc_match_from_root(root, TC_H_MAJ(tcm->tcm_parent));
-		if (q && tc_dump_tclass_qdisc(q, skb, tcm, cb, t_p, s_t) < 0)
+		if (q && q != root &&
+		    tc_dump_tclass_qdisc(q, skb, tcm, cb, t_p, s_t) < 0)
 			return -1;
 		return 0;
 	}
-- 
2.19.0

^ permalink raw reply related

* Re: [PATCH net] net/sched: properly init chain in case of multiple control actions
From: Davide Caratti @ 2018-10-18  8:38 UTC (permalink / raw)
  To: Cong Wang
  Cc: Jiri Pirko, Jamal Hadi Salim, David Miller,
	Linux Kernel Network Developers
In-Reply-To: <CAM_iQpUbUPgHn1LSZD2v+QhFmesyOn9cAYh-5TRCug-+ULyPgg@mail.gmail.com>

On Wed, 2018-10-17 at 22:35 -0700, Cong Wang wrote:
> > (well, after some more thinking I looked again at that patch and yes, it
> > lacked the most important thing:)
> 
> Hmm, as I said, I am not sure if the logic is correct, if we have two different
> goto actions, we must have two pointers.
> 
> I will re-think about it tomorrow. (I am at a conference so don't have much
> time on reviewing this.)
> 
> Thanks.

sure, ok. In the meanwhile, I will post a V2 that:

- adds the missing test that avoids having 'goto action' in the primary
and in the fallback control action at the same time
- fixes a very silly bug that made it fail the TDC 'gact' selftest 

regards,
-- 
davide

^ permalink raw reply

* Re: bond: take rcu lock in netpoll_send_skb_on_dev
From: Eran Ben Elisha @ 2018-10-18  8:40 UTC (permalink / raw)
  To: Cong Wang
  Cc: Dave Jones, Linux Kernel Network Developers, Tariq Toukan,
	Saeed Mahameed
In-Reply-To: <CAM_iQpUjdPuM3mH-j94gTga3v4Ts=DdKOiCoG2wQct-tiZaOaQ@mail.gmail.com>



On 10/18/2018 8:46 AM, Cong Wang wrote:
> On Mon, Oct 15, 2018 at 4:36 AM Eran Ben Elisha <eranbe@mellanox.com> wrote:
>> Hi,
>>
>> This suggested fix introduced a regression while using netconsole module
>> with mlx5_core module loaded.
> 
> It is already reported here:
> https://marc.info/?l=linux-kernel&m=153917359528669&w=2
> 
> 
>>
>> During irq handling, we hit a warning that this rcu_read_lock_bh cannot
>> be taken inside an IRQ.
> 
> Yes, I mentioned the same even before this patch was sent out:
> https://marc.info/?l=linux-netdev&m=153816136624679&w=2

Thanks Cong.
 From the discussion, I understand that the solution shouldn't be 
touching  netpoll_send_skb_on_dev. Some modules/drivers logging will now 
trigger traces while netconsole is loaded.

DaveJ,
Can you please submit a proper fix or at least revert the current one in 
the meanwhile.

Thanks,
Eran

> 
> Thanks.
> 

^ permalink raw reply

* [PATCH] virtio_net: add local_bh_disable() around u64_stats_update_begin
From: Sebastian Andrzej Siewior @ 2018-10-18  8:43 UTC (permalink / raw)
  To: Stephen Hemminger
  Cc: netdev, virtualization, tglx, Toshiaki Makita, Michael S. Tsirkin,
	Jason Wang, David S. Miller
In-Reply-To: <20181016114414.23ea73c3@xeon-e3>

on 32bit, lockdep notices that virtnet_open() and refill_work() invoke
try_fill_recv() from process context while virtnet_receive() invokes the
same function from BH context. The problem that the seqcounter within
u64_stats_update_begin() may deadlock if it is interrupted by BH and
then acquired again.

Introduce u64_stats_update_begin_bh() which disables BH on 32bit
architectures. Since the BH might interrupt the worker, this new
function should not limited to SMP like the others which are expected
to be used in softirq.

With this change we might lose increments but this is okay. The
important part that the two 32bit parts of the 64bit counter are not
corrupted.

Fixes: 461f03dc99cf6 ("virtio_net: Add kick stats").
Suggested-by: Stephen Hemminger <stephen@networkplumber.org>
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
---
 drivers/net/virtio_net.c       |  4 ++--
 include/linux/u64_stats_sync.h | 16 ++++++++++++++++
 2 files changed, 18 insertions(+), 2 deletions(-)

diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio_net.c
index dab504ec5e502..fbcfb4d272336 100644
--- a/drivers/net/virtio_net.c
+++ b/drivers/net/virtio_net.c
@@ -1206,9 +1206,9 @@ static bool try_fill_recv(struct virtnet_info *vi, struct receive_queue *rq,
 			break;
 	} while (rq->vq->num_free);
 	if (virtqueue_kick_prepare(rq->vq) && virtqueue_notify(rq->vq)) {
-		u64_stats_update_begin(&rq->stats.syncp);
+		u64_stats_update_begin_bh(&rq->stats.syncp);
 		rq->stats.kicks++;
-		u64_stats_update_end(&rq->stats.syncp);
+		u64_stats_update_end_bh(&rq->stats.syncp);
 	}
 
 	return !oom;
diff --git a/include/linux/u64_stats_sync.h b/include/linux/u64_stats_sync.h
index a27604f99ed04..46b6ad6175628 100644
--- a/include/linux/u64_stats_sync.h
+++ b/include/linux/u64_stats_sync.h
@@ -90,6 +90,22 @@ static inline void u64_stats_update_end(struct u64_stats_sync *syncp)
 #endif
 }
 
+static inline void u64_stats_update_begin_bh(struct u64_stats_sync *syncp)
+{
+#if BITS_PER_LONG==32
+	local_bh_disable();
+	write_seqcount_begin(&syncp->seq);
+#endif
+}
+
+static inline void u64_stats_update_end_bh(struct u64_stats_sync *syncp)
+{
+#if BITS_PER_LONG==32
+	write_seqcount_end(&syncp->seq);
+	local_bh_enable();
+#endif
+}
+
 static inline unsigned long
 u64_stats_update_begin_irqsave(struct u64_stats_sync *syncp)
 {
-- 
2.19.1

^ permalink raw reply related

* Re: [RFC] virtio_net: add local_bh_disable() around u64_stats_update_begin
From: Sebastian Andrzej Siewior @ 2018-10-18  8:47 UTC (permalink / raw)
  To: Jason Wang
  Cc: Toshiaki Makita, netdev, virtualization, tglx, Michael S. Tsirkin,
	David S. Miller
In-Reply-To: <a281371f-dd20-2036-d0a8-1081c2f6a452@redhat.com>

On 2018-10-17 14:48:02 [+0800], Jason Wang wrote:
> 
> On 2018/10/17 上午9:13, Toshiaki Makita wrote:
> > I'm not sure what condition triggered this warning.

If the seqlock is acquired once in softirq and then in process context
again it is enough evidence for lockdep to trigger this warning.

> > Toshiaki Makita
> 
> 
> Or maybe NAPI is enabled unexpectedly somewhere?
> 
> Btw, the schedule_delayed_work() in virtnet_open() is also suspicious, if
> the work is executed before virtnet_napi_enable(), there will be a deadloop
> for napi_disable().

something like this? It is also likely if it runs OOM on queue 2, it
will run OOM again on queue 3.

diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio_net.c
index fbcfb4d272336..87d6ec4765270 100644
--- a/drivers/net/virtio_net.c
+++ b/drivers/net/virtio_net.c
@@ -1263,22 +1263,22 @@ static void refill_work(struct work_struct *work)
 {
 	struct virtnet_info *vi =
 		container_of(work, struct virtnet_info, refill.work);
-	bool still_empty;
+	int still_empty = 0;
 	int i;
 
 	for (i = 0; i < vi->curr_queue_pairs; i++) {
 		struct receive_queue *rq = &vi->rq[i];
 
 		napi_disable(&rq->napi);
-		still_empty = !try_fill_recv(vi, rq, GFP_KERNEL);
+		if (!try_fill_recv(vi, rq, GFP_KERNEL))
+		    still_empty++;
 		virtnet_napi_enable(rq->vq, &rq->napi);
-
-		/* In theory, this can happen: if we don't get any buffers in
-		 * we will *never* try to fill again.
-		 */
-		if (still_empty)
-			schedule_delayed_work(&vi->refill, HZ/2);
 	}
+	/* In theory, this can happen: if we don't get any buffers in
+	 * we will *never* try to fill again.
+	 */
+	if (still_empty)
+		schedule_delayed_work(&vi->refill, HZ/2);
 }
 
 static int virtnet_receive(struct receive_queue *rq, int budget,
@@ -1407,12 +1407,13 @@ static int virtnet_open(struct net_device *dev)
 {
 	struct virtnet_info *vi = netdev_priv(dev);
 	int i, err;
+	int need_refill = 0;
 
 	for (i = 0; i < vi->max_queue_pairs; i++) {
 		if (i < vi->curr_queue_pairs)
 			/* Make sure we have some buffers: if oom use wq. */
 			if (!try_fill_recv(vi, &vi->rq[i], GFP_KERNEL))
-				schedule_delayed_work(&vi->refill, 0);
+				need_refill++;
 
 		err = xdp_rxq_info_reg(&vi->rq[i].xdp_rxq, dev, i);
 		if (err < 0)
@@ -1428,6 +1429,8 @@ static int virtnet_open(struct net_device *dev)
 		virtnet_napi_enable(vi->rq[i].vq, &vi->rq[i].napi);
 		virtnet_napi_tx_enable(vi, vi->sq[i].vq, &vi->sq[i].napi);
 	}
+	if (need_refill)
+		schedule_delayed_work(&vi->refill, 0);
 
 	return 0;
 }
@@ -2236,6 +2239,7 @@ static int virtnet_restore_up(struct virtio_device *vdev)
 {
 	struct virtnet_info *vi = vdev->priv;
 	int err, i;
+	int need_refill = 0;
 
 	err = init_vqs(vi);
 	if (err)
@@ -2246,13 +2250,15 @@ static int virtnet_restore_up(struct virtio_device *vdev)
 	if (netif_running(vi->dev)) {
 		for (i = 0; i < vi->curr_queue_pairs; i++)
 			if (!try_fill_recv(vi, &vi->rq[i], GFP_KERNEL))
-				schedule_delayed_work(&vi->refill, 0);
+				need_refill++;
 
 		for (i = 0; i < vi->max_queue_pairs; i++) {
 			virtnet_napi_enable(vi->rq[i].vq, &vi->rq[i].napi);
 			virtnet_napi_tx_enable(vi, vi->sq[i].vq,
 					       &vi->sq[i].napi);
 		}
+		if (need_refill)
+			schedule_delayed_work(&vi->refill, 0);
 	}
 
 	netif_device_attach(vi->dev);

> Thanks

Sebastian

^ permalink raw reply related

* Re: [RFC] virtio_net: add local_bh_disable() around u64_stats_update_begin
From: Sebastian Andrzej Siewior @ 2018-10-18  9:08 UTC (permalink / raw)
  To: Toshiaki Makita
  Cc: Jason Wang, netdev, virtualization, tglx, Michael S. Tsirkin,
	David S. Miller
In-Reply-To: <4849237d-38f5-9840-4ab9-4419de31db85@lab.ntt.co.jp>

On 2018-10-18 18:00:05 [+0900], Toshiaki Makita wrote:
> On 2018/10/18 17:47, Sebastian Andrzej Siewior wrote:
> > On 2018-10-17 14:48:02 [+0800], Jason Wang wrote:
> >>
> >> On 2018/10/17 上午9:13, Toshiaki Makita wrote:
> >>> I'm not sure what condition triggered this warning.
> > 
> > If the seqlock is acquired once in softirq and then in process context
> > again it is enough evidence for lockdep to trigger this warning.
> 
> No. As I said that should not happen because of NAPI guard.
Again: lockdep saw the lock in softirq context once and in process
context once and this is what triggers the warning. It does not matter
if NAPI is enabled or not during the access in process context. If you
want to allow this you need further lockdep annotation…

… but: refill_work() disables NAPI for &vi->rq[1] and refills + updates
stats while NAPI is enabled for &vi->rq[0].

Sebastian

^ permalink raw reply

* Re: [PATCH] virtio_net: add local_bh_disable() around u64_stats_update_begin
From: Sebastian Andrzej Siewior @ 2018-10-18  9:11 UTC (permalink / raw)
  To: Toshiaki Makita
  Cc: Stephen Hemminger, netdev, virtualization, tglx,
	Michael S. Tsirkin, Jason Wang, David S. Miller
In-Reply-To: <0dce3017-9973-af2d-f324-6fe8b5eb5469@lab.ntt.co.jp>

On 2018-10-18 18:06:57 [+0900], Toshiaki Makita wrote:
> NACK. Again, this race should not happen because of NAPI guard.
> We need to investigate why this warning happened.

I tried to explain this. Please see
	20181018090812.rry5qgnqxxrjxaii@linutronix.de

Sebastian

^ permalink raw reply

* Re: [PATCH v3 lora-next 1/5] regmap: Add regmap_noinc_write API
From: Mark Brown @ 2018-10-18 17:18 UTC (permalink / raw)
  To: Andreas Färber
  Cc: Ben Whitten, starnight, hasnain.virk, netdev, liuxuenetmail,
	shess, Ben Whitten, Greg Kroah-Hartman, Rafael J. Wysocki,
	linux-kernel, Stefan Rehm, linux-spi@vger.kernel.org
In-Reply-To: <2908e406-a2e3-29cb-579c-949ddd383e97@suse.de>

[-- Attachment #1: Type: text/plain, Size: 346 bytes --]

On Thu, Oct 18, 2018 at 06:59:52PM +0200, Andreas Färber wrote:

> Mark, please take this one through your tree - I'll rebase the LoRa
> parts on linux-next then.

I don't have a copy of it, if I didn't apply it already and send a pull
request I probably got confused and thought I'd done that already, sorry
- can someone resend please?

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 488 bytes --]

^ permalink raw reply

* Re: [RFC] virtio_net: add local_bh_disable() around u64_stats_update_begin
From: Sebastian Andrzej Siewior @ 2018-10-18  9:30 UTC (permalink / raw)
  To: Toshiaki Makita
  Cc: Jason Wang, netdev, virtualization, tglx, Michael S. Tsirkin,
	David S. Miller
In-Reply-To: <55f14915-744b-e11c-bc50-87a872218479@lab.ntt.co.jp>

On 2018-10-18 18:19:21 [+0900], Toshiaki Makita wrote:
> On 2018/10/18 18:08, Sebastian Andrzej Siewior wrote:
> > Again: lockdep saw the lock in softirq context once and in process
> > context once and this is what triggers the warning. It does not matter
> > if NAPI is enabled or not during the access in process context. If you
> > want to allow this you need further lockdep annotation…
> > 
> > … but: refill_work() disables NAPI for &vi->rq[1] and refills + updates
> > stats while NAPI is enabled for &vi->rq[0].
> 
> Do you mean this is false positive? rq[0] and rq[1] never race with each
> other...

Why? So you can't refill rq[1] and then be interrupted and process NAPI
for rq[0]?

But as I said. If lockdep saw the lock in acquired in softirq (what it
did) _and_ in process context (what it did as well) _once_ then this is
enough evidence for the warning.
If you claim that this can not happen due to NAPI guard [0] then this is
something lockdep does not know about.

[0] which I currently don't understand and therefore sent the patch [1]
    as Jason pointed out that in the ->ndo_open case the work is
    scheduled and then NAPI is enabled (which means the worker could
    disable NAPI and refill but before it finishes, ->ndo_open would
    continue and enable NAPI)).
[1] 20181018084753.wefvsypdevbzoadg@linutronix.de

Sebastian

^ permalink raw reply

* [PATCH net-next] net: ethernet: ti: cpsw: don't flush mcast entries while switch promisc mode
From: Ivan Khoronzhuk @ 2018-10-18 18:00 UTC (permalink / raw)
  To: grygorii.strashko, davem
  Cc: linux-omap, netdev, linux-kernel, Ivan Khoronzhuk

No need now to flush mcast entries in switch mode while toggling to
promiscuous mode. It's not needed as vlan reg_mcast = ALL_PORTS
and mcast/vlan ports = ALL_PORTS, the same happening for vlan
unreg_mcast, it's set to ALL_PORT_MASK just after calling promisc
mode routine by calling set allmulti. I suppose main reason to flush
them is to use unreg_mcast to receive all to host port. Thus, now, all
mcast packets are received anyway and no reason to flush mcast entries
unsafely, as they were synced with __dev_mc_sync() previously and are
not restored. Another way is to _dev_mc_unsync() them, but no need.

Signed-off-by: Ivan Khoronzhuk <ivan.khoronzhuk@linaro.org>
---

Based on net-next/master
Tasted on am572x EVM and BBB

 drivers/net/ethernet/ti/cpsw.c | 3 ---
 1 file changed, 3 deletions(-)

diff --git a/drivers/net/ethernet/ti/cpsw.c b/drivers/net/ethernet/ti/cpsw.c
index 226be2a56c1f..0e475020a674 100644
--- a/drivers/net/ethernet/ti/cpsw.c
+++ b/drivers/net/ethernet/ti/cpsw.c
@@ -638,9 +638,6 @@ static void cpsw_set_promiscious(struct net_device *ndev, bool enable)
 			} while (time_after(timeout, jiffies));
 			cpsw_ale_control_set(ale, 0, ALE_AGEOUT, 1);

-			/* Clear all mcast from ALE */
-			cpsw_ale_flush_multicast(ale, ALE_ALL_PORTS, -1);
-
 			/* Flood All Unicast Packets to Host port */
 			cpsw_ale_control_set(ale, 0, ALE_P0_UNI_FLOOD, 1);
 			dev_dbg(&ndev->dev, "promiscuity enabled\n");
-- 
2.17.1

^ permalink raw reply related

* pull request (net): ipsec 2018-10-18
From: Steffen Klassert @ 2018-10-18 10:25 UTC (permalink / raw)
  To: David Miller; +Cc: Herbert Xu, Steffen Klassert, netdev

1) Free the xfrm interface gro_cells when deleting the
   interface, otherwise we leak it. From Li RongQing.

2) net/core/flow.c does not exist anymore, so remove it
   from the MAINTAINERS file.

3) Fix a slab-out-of-bounds in _decode_session6.
   From Alexei Starovoitov.

4) Fix RCU protection when policies inserted into
   thei bydst lists. From Florian Westphal.

Please pull or let me know if there are problems.

Thanks!

The following changes since commit 92d7c74b6f72a8a7d04970d5dcfb99673daaf91d:

  Merge branch 'for-upstream' of git://git.kernel.org/pub/scm/linux/kernel/git/bluetooth/bluetooth (2018-10-01 22:40:39 -0700)

are available in the Git repository at:

  git://git.kernel.org/pub/scm/linux/kernel/git/klassert/ipsec.git master

for you to fetch changes up to 9dffff200fd178f11dd50eb1fd8ccd0650c9284e:

  xfrm: policy: use hlist rcu variants on insert (2018-10-11 13:24:46 +0200)

----------------------------------------------------------------
Alexei Starovoitov (1):
      net/xfrm: fix out-of-bounds packet access

Florian Westphal (1):
      xfrm: policy: use hlist rcu variants on insert

Li RongQing (1):
      xfrm: fix gro_cells leak when remove virtual xfrm interfaces

Steffen Klassert (1):
      MAINTAINERS: Remove net/core/flow.c

 MAINTAINERS               | 1 -
 net/ipv6/xfrm6_policy.c   | 4 ++--
 net/xfrm/xfrm_interface.c | 3 +++
 net/xfrm/xfrm_policy.c    | 8 ++++----
 4 files changed, 9 insertions(+), 7 deletions(-)

^ permalink raw reply

* [PATCH 1/4] xfrm: fix gro_cells leak when remove virtual xfrm interfaces
From: Steffen Klassert @ 2018-10-18 10:25 UTC (permalink / raw)
  To: David Miller; +Cc: Herbert Xu, Steffen Klassert, netdev
In-Reply-To: <20181018102521.24997-1-steffen.klassert@secunet.com>

From: Li RongQing <lirongqing@baidu.com>

The device gro_cells has been initialized, it should be freed,
otherwise it will be leaked

Fixes: f203b76d78092faf2 ("xfrm: Add virtual xfrm interfaces")
Signed-off-by: Zhang Yu <zhangyu31@baidu.com>
Signed-off-by: Li RongQing <lirongqing@baidu.com>
Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
---
 net/xfrm/xfrm_interface.c | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/net/xfrm/xfrm_interface.c b/net/xfrm/xfrm_interface.c
index 31acc6f33d98..6f05e831a73e 100644
--- a/net/xfrm/xfrm_interface.c
+++ b/net/xfrm/xfrm_interface.c
@@ -116,6 +116,9 @@ static void xfrmi_unlink(struct xfrmi_net *xfrmn, struct xfrm_if *xi)
 
 static void xfrmi_dev_free(struct net_device *dev)
 {
+	struct xfrm_if *xi = netdev_priv(dev);
+
+	gro_cells_destroy(&xi->gro_cells);
 	free_percpu(dev->tstats);
 }
 
-- 
2.17.1

^ permalink raw reply related

* [PATCH 2/4] MAINTAINERS: Remove net/core/flow.c
From: Steffen Klassert @ 2018-10-18 10:25 UTC (permalink / raw)
  To: David Miller; +Cc: Herbert Xu, Steffen Klassert, netdev
In-Reply-To: <20181018102521.24997-1-steffen.klassert@secunet.com>

net/core/flow.c does not exist anymore, so remove it
from the IPSEC NETWORKING section of the MAINTAINERS
file.

Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
---
 MAINTAINERS | 1 -
 1 file changed, 1 deletion(-)

diff --git a/MAINTAINERS b/MAINTAINERS
index dcb0191c4f54..4ff21dac9b45 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -10130,7 +10130,6 @@ L:	netdev@vger.kernel.org
 T:	git git://git.kernel.org/pub/scm/linux/kernel/git/klassert/ipsec.git
 T:	git git://git.kernel.org/pub/scm/linux/kernel/git/klassert/ipsec-next.git
 S:	Maintained
-F:	net/core/flow.c
 F:	net/xfrm/
 F:	net/key/
 F:	net/ipv4/xfrm*
-- 
2.17.1

^ permalink raw reply related

* [PATCH 3/4] net/xfrm: fix out-of-bounds packet access
From: Steffen Klassert @ 2018-10-18 10:25 UTC (permalink / raw)
  To: David Miller; +Cc: Herbert Xu, Steffen Klassert, netdev
In-Reply-To: <20181018102521.24997-1-steffen.klassert@secunet.com>

From: Alexei Starovoitov <ast@kernel.org>

BUG: KASAN: slab-out-of-bounds in _decode_session6+0x1331/0x14e0
net/ipv6/xfrm6_policy.c:161
Read of size 1 at addr ffff8801d882eec7 by task syz-executor1/6667
Call Trace:
  __dump_stack lib/dump_stack.c:77 [inline]
  dump_stack+0x1c9/0x2b4 lib/dump_stack.c:113
  print_address_description+0x6c/0x20b mm/kasan/report.c:256
  kasan_report_error mm/kasan/report.c:354 [inline]
  kasan_report.cold.7+0x242/0x30d mm/kasan/report.c:412
  __asan_report_load1_noabort+0x14/0x20 mm/kasan/report.c:430
  _decode_session6+0x1331/0x14e0 net/ipv6/xfrm6_policy.c:161
  __xfrm_decode_session+0x71/0x140 net/xfrm/xfrm_policy.c:2299
  xfrm_decode_session include/net/xfrm.h:1232 [inline]
  vti6_tnl_xmit+0x3c3/0x1bc1 net/ipv6/ip6_vti.c:542
  __netdev_start_xmit include/linux/netdevice.h:4313 [inline]
  netdev_start_xmit include/linux/netdevice.h:4322 [inline]
  xmit_one net/core/dev.c:3217 [inline]
  dev_hard_start_xmit+0x272/0xc10 net/core/dev.c:3233
  __dev_queue_xmit+0x2ab2/0x3870 net/core/dev.c:3803
  dev_queue_xmit+0x17/0x20 net/core/dev.c:3836

Reported-by: syzbot+acffccec848dc13fe459@syzkaller.appspotmail.com
Reported-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
---
 net/ipv6/xfrm6_policy.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/net/ipv6/xfrm6_policy.c b/net/ipv6/xfrm6_policy.c
index ef3defaf43b9..d35bcf92969c 100644
--- a/net/ipv6/xfrm6_policy.c
+++ b/net/ipv6/xfrm6_policy.c
@@ -146,8 +146,8 @@ _decode_session6(struct sk_buff *skb, struct flowi *fl, int reverse)
 	fl6->daddr = reverse ? hdr->saddr : hdr->daddr;
 	fl6->saddr = reverse ? hdr->daddr : hdr->saddr;
 
-	while (nh + offset + 1 < skb->data ||
-	       pskb_may_pull(skb, nh + offset + 1 - skb->data)) {
+	while (nh + offset + sizeof(*exthdr) < skb->data ||
+	       pskb_may_pull(skb, nh + offset + sizeof(*exthdr) - skb->data)) {
 		nh = skb_network_header(skb);
 		exthdr = (struct ipv6_opt_hdr *)(nh + offset);
 
-- 
2.17.1

^ permalink raw reply related

* [PATCH 4/4] xfrm: policy: use hlist rcu variants on insert
From: Steffen Klassert @ 2018-10-18 10:25 UTC (permalink / raw)
  To: David Miller; +Cc: Herbert Xu, Steffen Klassert, netdev
In-Reply-To: <20181018102521.24997-1-steffen.klassert@secunet.com>

From: Florian Westphal <fw@strlen.de>

bydst table/list lookups use rcu, so insertions must use rcu versions.

Fixes: a7c44247f704e ("xfrm: policy: make xfrm_policy_lookup_bytype lockless")
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
---
 net/xfrm/xfrm_policy.c | 8 ++++----
 1 file changed, 4 insertions(+), 4 deletions(-)

diff --git a/net/xfrm/xfrm_policy.c b/net/xfrm/xfrm_policy.c
index f094d4b3520d..119a427d9b2b 100644
--- a/net/xfrm/xfrm_policy.c
+++ b/net/xfrm/xfrm_policy.c
@@ -632,9 +632,9 @@ static void xfrm_hash_rebuild(struct work_struct *work)
 				break;
 		}
 		if (newpos)
-			hlist_add_behind(&policy->bydst, newpos);
+			hlist_add_behind_rcu(&policy->bydst, newpos);
 		else
-			hlist_add_head(&policy->bydst, chain);
+			hlist_add_head_rcu(&policy->bydst, chain);
 	}
 
 	spin_unlock_bh(&net->xfrm.xfrm_policy_lock);
@@ -774,9 +774,9 @@ int xfrm_policy_insert(int dir, struct xfrm_policy *policy, int excl)
 			break;
 	}
 	if (newpos)
-		hlist_add_behind(&policy->bydst, newpos);
+		hlist_add_behind_rcu(&policy->bydst, newpos);
 	else
-		hlist_add_head(&policy->bydst, chain);
+		hlist_add_head_rcu(&policy->bydst, chain);
 	__xfrm_policy_link(policy, dir);
 
 	/* After previous checking, family can either be AF_INET or AF_INET6 */
-- 
2.17.1

^ permalink raw reply related

* Re: [PATCH] net: ethernet: fec: Add missing SPEED_
From: Florian Fainelli @ 2018-10-18 18:39 UTC (permalink / raw)
  To: Corentin Labbe, andrew, davem, fugang.duan; +Cc: linux-kernel, netdev
In-Reply-To: <1539875100-11121-1-git-send-email-clabbe@baylibre.com>

On 10/18/2018 08:05 AM, Corentin Labbe wrote:
> Since commit 58056c1e1b0e ("net: ethernet: Use phy_set_max_speed() to limit advertised speed"), the fec driver is unable to get any link.
> This is due to missing SPEED_.

But SPEED_1000 is defined in include/uapi/linux/ethtool.h as 1000, so
surely this would amount to the same code paths being taken or am I
missing something here?

> 
> Fixes: 58056c1e1b0e ("net: ethernet: Use phy_set_max_speed() to limit advertised speed")
> Signed-off-by: Corentin Labbe <clabbe@baylibre.com>
> ---
>  drivers/net/ethernet/freescale/fec_main.c | 4 ++--
>  1 file changed, 2 insertions(+), 2 deletions(-)
> 
> diff --git a/drivers/net/ethernet/freescale/fec_main.c b/drivers/net/ethernet/freescale/fec_main.c
> index a17cc97..75fd7c9 100644
> --- a/drivers/net/ethernet/freescale/fec_main.c
> +++ b/drivers/net/ethernet/freescale/fec_main.c
> @@ -1946,7 +1946,7 @@ static int fec_enet_mii_probe(struct net_device *ndev)
>  
>  	/* mask with MAC supported features */
>  	if (fep->quirks & FEC_QUIRK_HAS_GBIT) {
> -		phy_set_max_speed(phy_dev, 1000);
> +		phy_set_max_speed(phy_dev, SPEED_1000);
>  		phy_remove_link_mode(phy_dev,
>  				     ETHTOOL_LINK_MODE_1000baseT_Half_BIT);
>  #if !defined(CONFIG_M5272)
> @@ -1954,7 +1954,7 @@ static int fec_enet_mii_probe(struct net_device *ndev)
>  #endif
>  	}
>  	else
> -		phy_set_max_speed(phy_dev, 100);
> +		phy_set_max_speed(phy_dev, SPEED_100);
>  
>  	fep->link = 0;
>  	fep->full_duplex = 0;
> 


-- 
Florian

^ permalink raw reply

* Re: [PATCH] brcmfmac: fix spelling mistake "Retreiving" -> "Retrieving"
From: Arend van Spriel @ 2018-10-18 10:49 UTC (permalink / raw)
  To: Colin King, Franky Lin, Hante Meuleman, Chi-Hsien Lin,
	Wright Feng, Kalle Valo, David S . Miller, Pieter-Paul Giesberts,
	linux-wireless, brcm80211-dev-list.pdl, brcm80211-dev-list,
	netdev
  Cc: kernel-janitors, linux-kernel
In-Reply-To: <20181016174342.1867-1-colin.king@canonical.com>

On 10/16/2018 7:43 PM, Colin King wrote:
> From: Colin Ian King <colin.king@canonical.com>
>
> Trivial fix to spelling mistake in brcmf_err error message.
>
Acked-by: Arend van Spriel <arend.vanspriel@broadcom.com>
> Signed-off-by: Colin Ian King <colin.king@canonical.com>
> ---
>  drivers/net/wireless/broadcom/brcm80211/brcmfmac/common.c | 4 ++--
>  1 file changed, 2 insertions(+), 2 deletions(-)

^ permalink raw reply

* Re: [PATCH] net: ethernet: fec: Add missing SPEED_
From: Florian Fainelli @ 2018-10-18 18:55 UTC (permalink / raw)
  To: LABBE Corentin; +Cc: andrew, davem, fugang.duan, linux-kernel, netdev
In-Reply-To: <20181018184715.GA31736@Red>

On 10/18/2018 11:47 AM, LABBE Corentin wrote:
> On Thu, Oct 18, 2018 at 11:39:24AM -0700, Florian Fainelli wrote:
>> On 10/18/2018 08:05 AM, Corentin Labbe wrote:
>>> Since commit 58056c1e1b0e ("net: ethernet: Use phy_set_max_speed() to limit advertised speed"), the fec driver is unable to get any link.
>>> This is due to missing SPEED_.
>>
>> But SPEED_1000 is defined in include/uapi/linux/ethtool.h as 1000, so
>> surely this would amount to the same code paths being taken or am I
>> missing something here?
> 
> The bisect session pointed your patch, reverting it fix the issue.
> BUT since the fix seemed trivial I sent the patch without more test then compile it.
> Sorry, I have just found some minutes ago that it didnt fix the issue.
> 
> But your patch is still the cause for sure.
> 

What you are writing is really lowering the confidence level, first
Andrew is the author of that patch, and second "just compiling" and
pretending this fixes a problem when it does not is not quite what I
would expect.

I don't have a problem helping you find the solution or the right fix
though, even if it is not my patch, but please get the author and actual
problem right so we can move forward in confidence, thanks!
-- 
Florian

^ permalink raw reply

* [PATCH 2/3] xfrm: use correct size to initialise sp->ovec
From: Steffen Klassert @ 2018-10-18 10:56 UTC (permalink / raw)
  To: David Miller; +Cc: Herbert Xu, Steffen Klassert, netdev
In-Reply-To: <20181018105654.32149-1-steffen.klassert@secunet.com>

From: Li RongQing <lirongqing@baidu.com>

This place should want to initialize array, not a element,
so it should be sizeof(array) instead of sizeof(element)

but now this array only has one element, so no error in
this condition that XFRM_MAX_OFFLOAD_DEPTH is 1

Signed-off-by: Li RongQing <lirongqing@baidu.com>
Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
---
 net/xfrm/xfrm_input.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/net/xfrm/xfrm_input.c b/net/xfrm/xfrm_input.c
index be3520e429c9..684c0bc01e2c 100644
--- a/net/xfrm/xfrm_input.c
+++ b/net/xfrm/xfrm_input.c
@@ -131,7 +131,7 @@ struct sec_path *secpath_dup(struct sec_path *src)
 	sp->len = 0;
 	sp->olen = 0;
 
-	memset(sp->ovec, 0, sizeof(sp->ovec[XFRM_MAX_OFFLOAD_DEPTH]));
+	memset(sp->ovec, 0, sizeof(sp->ovec));
 
 	if (src) {
 		int i;
-- 
2.17.1

^ permalink raw reply related

* [PATCH 3/3] xfrm: use complete IPv6 addresses for hash
From: Steffen Klassert @ 2018-10-18 10:56 UTC (permalink / raw)
  To: David Miller; +Cc: Herbert Xu, Steffen Klassert, netdev
In-Reply-To: <20181018105654.32149-1-steffen.klassert@secunet.com>

From: Michal Kubecek <mkubecek@suse.cz>

In some environments it is common that many hosts share the same lower half
of their IPv6 addresses (in particular ::1). As __xfrm6_addr_hash() and
__xfrm6_daddr_saddr_hash() calculate the hash only from the lower halves,
as much as 1/3 of the hosts ends up in one hashtable chain which harms the
performance.

Use complete IPv6 addresses when calculating the hashes. Rather than just
adding two more words to the xor, use jhash2() for consistency with
__xfrm6_pref_hash() and __xfrm6_dpref_spref_hash().

Signed-off-by: Michal Kubecek <mkubecek@suse.cz>
Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
---
 net/xfrm/xfrm_hash.h | 5 ++---
 1 file changed, 2 insertions(+), 3 deletions(-)

diff --git a/net/xfrm/xfrm_hash.h b/net/xfrm/xfrm_hash.h
index 61be810389d8..ce66323102f9 100644
--- a/net/xfrm/xfrm_hash.h
+++ b/net/xfrm/xfrm_hash.h
@@ -13,7 +13,7 @@ static inline unsigned int __xfrm4_addr_hash(const xfrm_address_t *addr)
 
 static inline unsigned int __xfrm6_addr_hash(const xfrm_address_t *addr)
 {
-	return ntohl(addr->a6[2] ^ addr->a6[3]);
+	return jhash2((__force u32 *)addr->a6, 4, 0);
 }
 
 static inline unsigned int __xfrm4_daddr_saddr_hash(const xfrm_address_t *daddr,
@@ -26,8 +26,7 @@ static inline unsigned int __xfrm4_daddr_saddr_hash(const xfrm_address_t *daddr,
 static inline unsigned int __xfrm6_daddr_saddr_hash(const xfrm_address_t *daddr,
 						    const xfrm_address_t *saddr)
 {
-	return ntohl(daddr->a6[2] ^ daddr->a6[3] ^
-		     saddr->a6[2] ^ saddr->a6[3]);
+	return __xfrm6_addr_hash(daddr) ^ __xfrm6_addr_hash(saddr);
 }
 
 static inline u32 __bits2mask32(__u8 bits)
-- 
2.17.1

^ permalink raw reply related

* [PATCH 1/3] xfrm: remove unnecessary check in xfrmi_get_stats64
From: Steffen Klassert @ 2018-10-18 10:56 UTC (permalink / raw)
  To: David Miller; +Cc: Herbert Xu, Steffen Klassert, netdev
In-Reply-To: <20181018105654.32149-1-steffen.klassert@secunet.com>

From: Li RongQing <lirongqing@baidu.com>

if tstats of a device is not allocated, this device is not
registered correctly and can not be used.

Signed-off-by: Li RongQing <lirongqing@baidu.com>
Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
---
 net/xfrm/xfrm_interface.c | 3 ---
 1 file changed, 3 deletions(-)

diff --git a/net/xfrm/xfrm_interface.c b/net/xfrm/xfrm_interface.c
index dc5b20bf29cf..abafd49cc65d 100644
--- a/net/xfrm/xfrm_interface.c
+++ b/net/xfrm/xfrm_interface.c
@@ -561,9 +561,6 @@ static void xfrmi_get_stats64(struct net_device *dev,
 {
 	int cpu;
 
-	if (!dev->tstats)
-		return;
-
 	for_each_possible_cpu(cpu) {
 		struct pcpu_sw_netstats *stats;
 		struct pcpu_sw_netstats tmp;
-- 
2.17.1

^ permalink raw reply related

* pull request (net-next): ipsec-next 2018-10-18
From: Steffen Klassert @ 2018-10-18 10:56 UTC (permalink / raw)
  To: David Miller; +Cc: Herbert Xu, Steffen Klassert, netdev

1) Remove an unnecessary dev->tstats check in xfrmi_get_stats64.
   From Li RongQing.

2) We currently do a sizeof(element) instead of a sizeof(array)
   check when initializing the ovec array of the secpath.
   Currently this array can have only one element, so code is
   OK but error-prone. Change this to do a sizeof(array)
   check so that we can add more elements in future.
   From Li RongQing.

3) Improve xfrm IPv6 address hashing by using the complete IPv6
   addresses for a hash. From Michal Kubecek.

Please pull or let me know if there are problems.

Thanks!

The following changes since commit abf1a08ff3237a27188ff8cc2904f2cea893af55:

  net: vhost: remove bad code line (2018-10-07 21:31:32 -0700)

are available in the Git repository at:

  git://git.kernel.org/pub/scm/linux/kernel/git/klassert/ipsec-next.git master

for you to fetch changes up to 8d4b6bce2559755cf2db6513a267fccdfbf7c3ab:

  xfrm: use complete IPv6 addresses for hash (2018-10-15 10:09:18 +0200)

----------------------------------------------------------------
Li RongQing (2):
      xfrm: remove unnecessary check in xfrmi_get_stats64
      xfrm: use correct size to initialise sp->ovec

Michal Kubecek (1):
      xfrm: use complete IPv6 addresses for hash

 net/xfrm/xfrm_hash.h      | 5 ++---
 net/xfrm/xfrm_input.c     | 2 +-
 net/xfrm/xfrm_interface.c | 3 ---
 3 files changed, 3 insertions(+), 7 deletions(-)

^ permalink raw reply

page: next (older) | prev (newer) | latest
- recent:[subjects (threaded)|topics (new)|topics (active)]

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox