Netdev List

Netdev List
 help / color / mirror / Atom feed

* Re: [PATCH v2 net-next] net/sched: add skbprio scheduler
From: Nishanth Devarajan @ 2018-06-26  0:11 UTC (permalink / raw)
  To: Alexander Duyck
  Cc: Cong Wang, Jamal Hadi Salim, Jiri Pirko, David Miller, netdev,
	doucette, michel
In-Reply-To: <CAKgT0UcGCK9dc-zOaiMGnWDVCPx=JQ_-q53Z1i2NThdhJp6YLg@mail.gmail.com>

On Sat, Jun 23, 2018 at 03:10:32PM -0700, Alexander Duyck wrote:
> On Sat, Jun 23, 2018 at 1:47 PM, Nishanth Devarajan <ndev2021@gmail.com> wrote:
> > net/sched: add skbprio scheduler
> >
> > Skbprio (SKB Priority Queue) is a queueing discipline that prioritizes packets
> > according to their skb->priority field. Although Skbprio can be employed in any
> > scenario in which a higher skb->priority field means a higher priority packet,
> > Skbprio was concieved as a solution for denial-of-service defenses that need to
> > route packets with different priorities.
> 
> Really this description is not very good. Reading it I was thinking to
> myself "why do we need this, prio already does this". It wasn't until
> I read through the code that I figured out that you are basically
> adding dropping of lower priority frames.
> 

OK, I'll take Jamal's suggestion on this and write up a new description.

> >
> > v2
> > *Use skb->priority field rather than DS field. Rename queueing discipline as
> > SKB Priority Queue (previously Gatekeeper Priority Queue).
> >
> > *Queueing discipline is made classful to expose Skbprio's internal priority
> > queues.
> >
> > Signed-off-by: Nishanth Devarajan <ndev2021@gmail.com>
> > Reviewed-by: Sachin Paryani <sachin.paryani@gmail.com>
> > Reviewed-by: Cody Doucette <doucette@bu.edu>
> > Reviewed-by: Michel Machado <michel@digirati.com.br>
> > ---
> >  include/uapi/linux/pkt_sched.h |  15 ++
> >  net/sched/Kconfig              |  13 ++
> >  net/sched/Makefile             |   1 +
> >  net/sched/sch_skbprio.c        | 347 +++++++++++++++++++++++++++++++++++++++++
> >  4 files changed, 376 insertions(+)
> >  create mode 100644 net/sched/sch_skbprio.c
> >
> > diff --git a/include/uapi/linux/pkt_sched.h b/include/uapi/linux/pkt_sched.h
> > index 37b5096..6fd07e8 100644
> > --- a/include/uapi/linux/pkt_sched.h
> > +++ b/include/uapi/linux/pkt_sched.h
> > @@ -124,6 +124,21 @@ struct tc_fifo_qopt {
> >         __u32   limit;  /* Queue length: bytes for bfifo, packets for pfifo */
> >  };
> >
> > +/* SKBPRIO section */
> > +
> > +/*
> > + * Priorities go from zero to (SKBPRIO_MAX_PRIORITY - 1).
> > + * SKBPRIO_MAX_PRIORITY should be at least 64 in order for skbprio to be able
> > + * to map one to one the DS field of IPV4 and IPV6 headers.
> > + * Memory allocation grows linearly with SKBPRIO_MAX_PRIORITY.
> > + */
> > +
> > +#define SKBPRIO_MAX_PRIORITY 64
> > +
> > +struct tc_skbprio_qopt {
> > +       __u32   limit;          /* Queue length in packets. */
> > +};
> > +
> >  /* PRIO section */
> >
> >  #define TCQ_PRIO_BANDS 16
> > diff --git a/net/sched/Kconfig b/net/sched/Kconfig
> > index a01169f..9ac4b53 100644
> > --- a/net/sched/Kconfig
> > +++ b/net/sched/Kconfig
> > @@ -240,6 +240,19 @@ config NET_SCH_MQPRIO
> >
> >           If unsure, say N.
> >
> > +config NET_SCH_SKBPRIO
> > +       tristate "SKB priority queue scheduler (SKBPRIO)"
> > +       help
> > +         Say Y here if you want to use the SKB priority queue
> > +         scheduler. This schedules packets according to skb->priority,
> > +         which is useful for request packets in DoS mitigation systems such
> > +         as Gatekeeper.
> > +
> > +         To compile this driver as a module, choose M here: the module will
> > +         be called sch_skbprio.
> > +
> > +         If unsure, say N.
> > +
> >  config NET_SCH_CHOKE
> >         tristate "CHOose and Keep responsive flow scheduler (CHOKE)"
> >         help
> > diff --git a/net/sched/Makefile b/net/sched/Makefile
> > index 8811d38..a4d8893 100644
> > --- a/net/sched/Makefile
> > +++ b/net/sched/Makefile
> > @@ -46,6 +46,7 @@ obj-$(CONFIG_NET_SCH_NETEM)   += sch_netem.o
> >  obj-$(CONFIG_NET_SCH_DRR)      += sch_drr.o
> >  obj-$(CONFIG_NET_SCH_PLUG)     += sch_plug.o
> >  obj-$(CONFIG_NET_SCH_MQPRIO)   += sch_mqprio.o
> > +obj-$(CONFIG_NET_SCH_SKBPRIO)  += sch_skbprio.o
> >  obj-$(CONFIG_NET_SCH_CHOKE)    += sch_choke.o
> >  obj-$(CONFIG_NET_SCH_QFQ)      += sch_qfq.o
> >  obj-$(CONFIG_NET_SCH_CODEL)    += sch_codel.o
> > diff --git a/net/sched/sch_skbprio.c b/net/sched/sch_skbprio.c
> > new file mode 100644
> > index 0000000..5e89446
> > --- /dev/null
> > +++ b/net/sched/sch_skbprio.c
> > @@ -0,0 +1,347 @@
> > +/*
> > + * net/sched/sch_skbprio.c  SKB Priority Queue.
> > + *
> > + *             This program is free software; you can redistribute it and/or
> > + *             modify it under the terms of the GNU General Public License
> > + *             as published by the Free Software Foundation; either version
> > + *             2 of the License, or (at your option) any later version.
> > + *
> > + * Authors:    Nishanth Devarajan, <ndev2021@gmail.com>
> > + *             Cody Doucette, <doucette@bu.edu>
> > + *             original idea by Michel Machado, Cody Doucette, and Qiaobin Fu
> > + */
> > +
> > +#include <linux/string.h>
> > +#include <linux/module.h>
> > +#include <linux/slab.h>
> > +#include <linux/types.h>
> > +#include <linux/kernel.h>
> > +#include <linux/errno.h>
> > +#include <linux/skbuff.h>
> > +#include <net/pkt_sched.h>
> > +#include <net/sch_generic.h>
> > +#include <net/inet_ecn.h>
> > +
> > +
> > +/*       SKB Priority Queue
> > + *     =================================
> > + *
> > + * This qdisc schedules a packet according to skb->priority, where a higher
> > + * value places the packet closer to the exit of the queue. When the queue is
> > + * full, the lowest priority packet in the queue is dropped to make room for
> > + * the packet to be added if it has higher priority. If the packet to be added
> > + * has lower priority than all packets in the queue, it is dropped.
> > + *
> > + * Without the SKB priority queue, queue length limits must be imposed
> > + * for individual queues, and there is no easy way to enforce a global queue
> > + * length limit across all priorities. With the SKBprio queue, a global
> > + * queue length limit can be enforced while not restricting the queue lengths
> > + * of individual priorities.
> > + *
> > + * This is especially useful for a denial-of-service defense system like
> > + * Gatekeeper, which prioritizes packets in flows that demonstrate expected
> > + * behavior of legitimate users. The queue is flexible to allow any number
> > + * of packets of any priority up to the global limit of the scheduler
> > + * without risking resource overconsumption by a flood of low priority packets.
> > + *
> > + * The Gatekeeper codebase is found here:
> > + *
> > + *             https://github.com/AltraMayor/gatekeeper
> 
> Do we really need to be linking to third party projects in the code? I
> am not even really sure it is related here since I cannot find
> anything that references the queuing discipline.
> 
> Also this seems like it should be in the Documentation/networking
> folder rather than being stored in the code itself.
>

The Skbprio qdisc was designed from Gatekeeper's functionality, and we intended
for it to represent Gatekeeper's deployment in production. But I understand
that this should be in Documentation, will change this in v3.  
 
> > + */
> > +
> > +struct skbprio_sched_data {
> > +       /* Parameters. */
> > +       u32 max_limit;
> > +
> > +       /* Queue state. */
> > +       struct sk_buff_head qdiscs[SKBPRIO_MAX_PRIORITY];
> > +       struct gnet_stats_queue qstats[SKBPRIO_MAX_PRIORITY];
> > +       u16 highest_prio;
> > +       u16 lowest_prio;
> > +};
> > +
> > +static u16 calc_new_high_prio(const struct skbprio_sched_data *q)
> > +{
> > +       int prio;
> > +
> > +       for (prio = q->highest_prio - 1; prio >= q->lowest_prio; prio--) {
> > +               if (!skb_queue_empty(&q->qdiscs[prio]))
> > +                       return prio;
> > +       }
> > +
> > +       /* SKB queue is empty, return 0 (default highest priority setting). */
> > +       return 0;
> > +}
> > +
> > +static u16 calc_new_low_prio(const struct skbprio_sched_data *q)
> > +{
> > +       int prio;
> > +
> > +       for (prio = q->lowest_prio + 1; prio <= q->highest_prio; prio++) {
> > +               if (!skb_queue_empty(&q->qdiscs[prio]))
> > +                       return prio;
> > +       }
> > +
> > +       /* SKB queue is empty, return SKBPRIO_MAX_PRIORITY - 1
> > +        * (default lowest priority setting).
> > +        */
> > +       return SKBPRIO_MAX_PRIORITY - 1;
> > +}
> > +
> > +static int skbprio_enqueue(struct sk_buff *skb, struct Qdisc *sch,
> > +                         struct sk_buff **to_free)
> > +{
> > +       const unsigned int max_priority = SKBPRIO_MAX_PRIORITY - 1;
> > +       struct skbprio_sched_data *q = qdisc_priv(sch);
> > +       struct sk_buff_head *qdisc;
> > +       struct sk_buff_head *lp_qdisc;
> > +       struct sk_buff *to_drop;
> > +       u16 prio, lp;
> > +
> > +       /* Obtain the priority of @skb. */
> > +       prio = min(skb->priority, max_priority);
> > +
> > +       qdisc = &q->qdiscs[prio];
> > +       if (sch->q.qlen < q->max_limit) {
> > +               __skb_queue_tail(qdisc, skb);
> > +               qdisc_qstats_backlog_inc(sch, skb);
> > +               q->qstats[prio].backlog += qdisc_pkt_len(skb);
> > +
> > +               /* Check to update highest and lowest priorities. */
> > +               if (prio > q->highest_prio)
> > +                       q->highest_prio = prio;
> > +
> > +               if (prio < q->lowest_prio)
> > +                       q->lowest_prio = prio;
> > +
> > +               sch->q.qlen++;
> > +               return NET_XMIT_SUCCESS;
> > +       }
> > +
> > +       /* If this packet has the lowest priority, drop it. */
> > +       lp = q->lowest_prio;
> > +       if (prio <= lp) {
> > +               q->qstats[prio].drops++;
> > +               return qdisc_drop(skb, sch, to_free);
> > +       }
> > +
> > +       /* Drop the packet at the tail of the lowest priority qdisc. */
> > +       lp_qdisc = &q->qdiscs[lp];
> > +       to_drop = __skb_dequeue_tail(lp_qdisc);
> 
> Is there a specific reason for dropping tail instead of head? I would
> think you might see better latency numbers with this if you were to
> focus on dropping older packets from the lowest priority instead of
> the newest ones. Then things like TCP would likely be able to respond
> more quickly to the congestion effects and would be able to more
> accurately determine actual round trip time versus buffer delay.
> 

You make a good point that TCP would respond better to congestion by dropping
at the head. But the idea of dropping at the tail was to relieve immediate 
congestion (from a denial of service standpoint) prioritizing older packets
in the queue over ingress attack packets. And since Skbprio is designed as a
denial of service defense, wouldn't this be a better solution?

> > +       BUG_ON(!to_drop);
> > +       qdisc_qstats_backlog_dec(sch, to_drop);
> > +       qdisc_drop(to_drop, sch, to_free);
> > +
> > +       q->qstats[lp].backlog -= qdisc_pkt_len(to_drop);
> > +       q->qstats[lp].drops++;
> > +
> > +       __skb_queue_tail(qdisc, skb);
> > +       qdisc_qstats_backlog_inc(sch, skb);
> > +       q->qstats[prio].backlog += qdisc_pkt_len(skb);
> > +
> 
> You could probably just take care of enqueueing first, and then take
> care of the dequeue. That way you can drop the references to skb
> earlier from the stack or free up whatever register is storing it.
>

OK, will change this in v3.
 
> > +       /* Check to update highest and lowest priorities. */
> > +       if (skb_queue_empty(lp_qdisc)) {
> > +               if (q->lowest_prio == q->highest_prio) {
> > +                       BUG_ON(sch->q.qlen);
> > +                       q->lowest_prio = prio;
> > +                       q->highest_prio = prio;
> > +               } else {
> > +                       q->lowest_prio = calc_new_low_prio(q);
> > +               }
> > +       }
> > +
> > +       if (prio > q->highest_prio)
> > +               q->highest_prio = prio;
> > +
> > +       return NET_XMIT_CN;
> > +}
> > +
> > +static struct sk_buff *skbprio_dequeue(struct Qdisc *sch)
> > +{
> > +       struct skbprio_sched_data *q = qdisc_priv(sch);
> > +       struct sk_buff_head *hpq = &q->qdiscs[q->highest_prio];
> > +       struct sk_buff *skb = __skb_dequeue(hpq);
> > +
> > +       if (unlikely(!skb))
> > +               return NULL;
> > +
> > +       sch->q.qlen--;
> > +       qdisc_qstats_backlog_dec(sch, skb);
> > +       qdisc_bstats_update(sch, skb);
> > +
> > +       q->qstats[q->highest_prio].backlog -= qdisc_pkt_len(skb);
> > +
> > +       /* Update highest priority field. */
> > +       if (skb_queue_empty(hpq)) {
> > +               if (q->lowest_prio == q->highest_prio) {
> > +                       BUG_ON(sch->q.qlen);
> > +                       q->highest_prio = 0;
> > +                       q->lowest_prio = SKBPRIO_MAX_PRIORITY - 1;
> > +               } else {
> > +                       q->highest_prio = calc_new_high_prio(q);
> > +               }
> > +       }
> > +       return skb;
> > +}
> > +
> > +static int skbprio_change(struct Qdisc *sch, struct nlattr *opt,
> > +                       struct netlink_ext_ack *extack)
> > +{
> > +       struct skbprio_sched_data *q = qdisc_priv(sch);
> > +       struct tc_skbprio_qopt *ctl = nla_data(opt);
> > +       const unsigned int min_limit = 1;
> > +
> > +       if (ctl->limit == (typeof(ctl->limit))-1)
> > +               q->max_limit = max(qdisc_dev(sch)->tx_queue_len, min_limit);
> > +       else if (ctl->limit < min_limit ||
> > +               ctl->limit > qdisc_dev(sch)->tx_queue_len)
> > +               return -EINVAL;
> > +       else
> > +               q->max_limit = ctl->limit;
> > +
> > +       return 0;
> > +}
> > +
> > +static int skbprio_init(struct Qdisc *sch, struct nlattr *opt,
> > +                       struct netlink_ext_ack *extack)
> > +{
> > +       struct skbprio_sched_data *q = qdisc_priv(sch);
> > +       const unsigned int min_limit = 1;
> > +       int prio;
> > +
> > +       /* Initialise all queues, one for each possible priority. */
> > +       for (prio = 0; prio < SKBPRIO_MAX_PRIORITY; prio++)
> > +               __skb_queue_head_init(&q->qdiscs[prio]);
> > +
> > +       memset(&q->qstats, 0, sizeof(q->qstats));
> > +       q->highest_prio = 0;
> > +       q->lowest_prio = SKBPRIO_MAX_PRIORITY - 1;
> > +       if (!opt) {
> > +               q->max_limit = max(qdisc_dev(sch)->tx_queue_len, min_limit);
> > +               return 0;
> > +       }
> > +       return skbprio_change(sch, opt, extack);
> > +}
> > +
> > +static int skbprio_dump(struct Qdisc *sch, struct sk_buff *skb)
> > +{
> > +       struct skbprio_sched_data *q = qdisc_priv(sch);
> > +       struct tc_skbprio_qopt opt;
> > +
> > +       opt.limit = q->max_limit;
> > +
> > +       if (nla_put(skb, TCA_OPTIONS, sizeof(opt), &opt))
> > +               return -1;
> > +
> > +       return skb->len;
> > +}
> > +
> > +static void skbprio_reset(struct Qdisc *sch)
> > +{
> > +       struct skbprio_sched_data *q = qdisc_priv(sch);
> > +       int prio;
> > +
> > +       sch->qstats.backlog = 0;
> > +       sch->q.qlen = 0;
> > +
> > +       for (prio = 0; prio < SKBPRIO_MAX_PRIORITY; prio++)
> > +               __skb_queue_purge(&q->qdiscs[prio]);
> > +
> > +       memset(&q->qstats, 0, sizeof(q->qstats));
> > +       q->highest_prio = 0;
> > +       q->lowest_prio = SKBPRIO_MAX_PRIORITY - 1;
> > +}
> > +
> > +static void skbprio_destroy(struct Qdisc *sch)
> > +{
> > +       struct skbprio_sched_data *q = qdisc_priv(sch);
> > +       int prio;
> > +
> > +       for (prio = 0; prio < SKBPRIO_MAX_PRIORITY; prio++)
> > +               __skb_queue_purge(&q->qdiscs[prio]);
> > +}
> > +
> > +static struct Qdisc *skbprio_leaf(struct Qdisc *sch, unsigned long arg)
> > +{
> > +       return NULL;
> > +}
> > +
> > +static unsigned long skbprio_find(struct Qdisc *sch, u32 classid)
> > +{
> > +       return 0;
> > +}
> > +
> > +static int skbprio_dump_class(struct Qdisc *sch, unsigned long cl,
> > +                            struct sk_buff *skb, struct tcmsg *tcm)
> > +{
> > +       tcm->tcm_handle |= TC_H_MIN(cl);
> > +       return 0;
> > +}
> > +
> > +static int skbprio_dump_class_stats(struct Qdisc *sch, unsigned long cl,
> > +                                  struct gnet_dump *d)
> > +{
> > +       struct skbprio_sched_data *q = qdisc_priv(sch);
> > +       if (gnet_stats_copy_queue(d, NULL, &q->qstats[cl - 1],
> > +               q->qstats[cl - 1].qlen) < 0)
> > +               return -1;
> > +       return 0;
> > +}
> > +
> > +static void skbprio_walk(struct Qdisc *sch, struct qdisc_walker *arg)
> > +{
> > +       unsigned int i;
> > +
> > +       if (arg->stop)
> > +               return;
> > +
> > +       for (i = 0; i < SKBPRIO_MAX_PRIORITY; i++) {
> > +               if (arg->count < arg->skip) {
> > +                       arg->count++;
> > +                       continue;
> > +               }
> > +               if (arg->fn(sch, i + 1, arg) < 0) {
> > +                       arg->stop = 1;
> > +                       break;
> > +               }
> > +               arg->count++;
> > +       }
> > +}
> > +
> > +static const struct Qdisc_class_ops skbprio_class_ops = {
> > +       .leaf           =       skbprio_leaf,
> > +       .find           =       skbprio_find,
> > +       .dump           =       skbprio_dump_class,
> > +       .dump_stats     =       skbprio_dump_class_stats,
> > +       .walk           =       skbprio_walk,
> > +};
> > +
> > +static struct Qdisc_ops skbprio_qdisc_ops __read_mostly = {
> > +       .cl_ops         =       &skbprio_class_ops,
> > +       .id             =       "skbprio",
> > +       .priv_size      =       sizeof(struct skbprio_sched_data),
> > +       .enqueue        =       skbprio_enqueue,
> > +       .dequeue        =       skbprio_dequeue,
> > +       .peek           =       qdisc_peek_dequeued,
> > +       .init           =       skbprio_init,
> > +       .reset          =       skbprio_reset,
> > +       .change         =       skbprio_change,
> > +       .dump           =       skbprio_dump,
> > +       .destroy        =       skbprio_destroy,
> > +       .owner          =       THIS_MODULE,
> > +};
> > +
> > +static int __init skbprio_module_init(void)
> > +{
> > +       return register_qdisc(&skbprio_qdisc_ops);
> > +}
> > +
> > +static void __exit skbprio_module_exit(void)
> > +{
> > +       unregister_qdisc(&skbprio_qdisc_ops);
> > +}
> > +
> > +module_init(skbprio_module_init)
> > +module_exit(skbprio_module_exit)
> > +
> > +MODULE_LICENSE("GPL");
> > --
> > 1.9.1
> >

^ permalink raw reply

* [PATCH net-next 6/6] selftests: forwarding: README: Require diagrams
From: Petr Machata @ 2018-06-26  0:08 UTC (permalink / raw)
  To: netdev, linux-kselftest; +Cc: davem, shuah
In-Reply-To: <cover.1529971148.git.petrm@mellanox.com>

ASCII art diagrams are well suited for presenting the topology that a
test uses while being easy to embed directly in the test file iteslf.
They make the information very easy to grasp even for simple topologies,
and for more complex ones they are almost essential, as figuring out the
interconnects from the script itself proves to be difficult.

Therefore state the requirement for topology ASCII art in README.

Signed-off-by: Petr Machata <petrm@mellanox.com>
Reviewed-by: Ido Schimmel <idosch@mellanox.com>
---
 tools/testing/selftests/net/forwarding/README | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/tools/testing/selftests/net/forwarding/README b/tools/testing/selftests/net/forwarding/README
index 4a0964c42860..b8a2af8fcfb7 100644
--- a/tools/testing/selftests/net/forwarding/README
+++ b/tools/testing/selftests/net/forwarding/README
@@ -46,6 +46,8 @@ Guidelines for Writing Tests
 
 o Where possible, reuse an existing topology for different tests instead
   of recreating the same topology.
+o Tests that use anything but the most trivial topologies should include
+  an ASCII art showing the topology.
 o Where possible, IPv6 and IPv4 addresses shall conform to RFC 3849 and
   RFC 5737, respectively.
 o Where possible, tests shall be written so that they can be reused by
-- 
2.4.11

^ permalink raw reply related

* [PATCH net-next 5/6] selftests: forwarding: Test multipath tunneling
From: Petr Machata @ 2018-06-26  0:08 UTC (permalink / raw)
  To: netdev, linux-kselftest; +Cc: davem, shuah
In-Reply-To: <cover.1529971148.git.petrm@mellanox.com>

Add a GRE-tunneling test such that there are two tunnels involved, with
a multipath route listing both as next hops. Similarly to
router_multipath.sh, test that the distribution of traffic to the
tunnels honors the configured weights.

Signed-off-by: Petr Machata <petrm@mellanox.com>
Reviewed-by: Ido Schimmel <idosch@mellanox.com>
---
 .../selftests/net/forwarding/gre_multipath.sh      | 354 +++++++++++++++++++++
 1 file changed, 354 insertions(+)
 create mode 100755 tools/testing/selftests/net/forwarding/gre_multipath.sh

diff --git a/tools/testing/selftests/net/forwarding/gre_multipath.sh b/tools/testing/selftests/net/forwarding/gre_multipath.sh
new file mode 100755
index 000000000000..982cc8c23200
--- /dev/null
+++ b/tools/testing/selftests/net/forwarding/gre_multipath.sh
@@ -0,0 +1,354 @@
+#!/bin/bash
+# SPDX-License-Identifier: GPL-2.0
+
+# Test traffic distribution when a wECMP route forwards traffic to two GRE
+# tunnels.
+#
+# +-------------------------+
+# | H1                      |
+# |               $h1 +     |
+# |      192.0.2.1/28 |     |
+# |  2001:db8:1::1/64 |     |
+# +-------------------|-----+
+#                     |
+# +-------------------|------------------------+
+# | SW1               |                        |
+# |              $ol1 +                        |
+# |      192.0.2.2/28                          |
+# |  2001:db8:1::2/64                          |
+# |                                            |
+# |  + g1a (gre)          + g1b (gre)          |
+# |    loc=192.0.2.65       loc=192.0.2.81     |
+# |    rem=192.0.2.66 --.   rem=192.0.2.82 --. |
+# |    tos=inherit      |   tos=inherit      | |
+# |  .------------------'                    | |
+# |  |                    .------------------' |
+# |  v                    v                    |
+# |  + $ul1.111 (vlan)    + $ul1.222 (vlan)    |
+# |  | 192.0.2.129/28     | 192.0.2.145/28     |
+# |   \                  /                     |
+# |    \________________/                      |
+# |            |                               |
+# |            + $ul1                          |
+# +------------|-------------------------------+
+#              |
+# +------------|-------------------------------+
+# | SW2        + $ul2                          |
+# |     _______|________                       |
+# |    /                \                      |
+# |   /                  \                     |
+# |  + $ul2.111 (vlan)    + $ul2.222 (vlan)    |
+# |  ^ 192.0.2.130/28     ^ 192.0.2.146/28     |
+# |  |                    |                    |
+# |  |                    '------------------. |
+# |  '------------------.                    | |
+# |  + g2a (gre)        | + g2b (gre)        | |
+# |    loc=192.0.2.66   |   loc=192.0.2.82   | |
+# |    rem=192.0.2.65 --'   rem=192.0.2.81 --' |
+# |    tos=inherit          tos=inherit        |
+# |                                            |
+# |              $ol2 +                        |
+# |     192.0.2.17/28 |                        |
+# |  2001:db8:2::1/64 |                        |
+# +-------------------|------------------------+
+#                     |
+# +-------------------|-----+
+# | H2                |     |
+# |               $h2 +     |
+# |     192.0.2.18/28       |
+# |  2001:db8:2::2/64       |
+# +-------------------------+
+
+ALL_TESTS="
+	ping_ipv4
+	ping_ipv6
+	multipath_ipv4
+	multipath_ipv6
+	multipath_ipv6_l4
+"
+
+NUM_NETIFS=6
+source lib.sh
+
+h1_create()
+{
+	simple_if_init $h1 192.0.2.1/28 2001:db8:1::1/64
+	ip route add vrf v$h1 192.0.2.16/28 via 192.0.2.2
+	ip route add vrf v$h1 2001:db8:2::/64 via 2001:db8:1::2
+}
+
+h1_destroy()
+{
+	ip route del vrf v$h1 2001:db8:2::/64 via 2001:db8:1::2
+	ip route del vrf v$h1 192.0.2.16/28 via 192.0.2.2
+	simple_if_fini $h1 192.0.2.1/28
+}
+
+sw1_create()
+{
+	simple_if_init $ol1 192.0.2.2/28 2001:db8:1::2/64
+	__simple_if_init $ul1 v$ol1
+	vlan_create $ul1 111 v$ol1 192.0.2.129/28
+	vlan_create $ul1 222 v$ol1 192.0.2.145/28
+
+	tunnel_create g1a gre 192.0.2.65 192.0.2.66 tos inherit dev v$ol1
+	__simple_if_init g1a v$ol1 192.0.2.65/32
+	ip route add vrf v$ol1 192.0.2.66/32 via 192.0.2.130
+
+	tunnel_create g1b gre 192.0.2.81 192.0.2.82 tos inherit dev v$ol1
+	__simple_if_init g1b v$ol1 192.0.2.81/32
+	ip route add vrf v$ol1 192.0.2.82/32 via 192.0.2.146
+
+	ip route add vrf v$ol1 192.0.2.16/28 \
+	   nexthop dev g1a \
+	   nexthop dev g1b
+	ip route add vrf v$ol1 2001:db8:2::/64 \
+	   nexthop dev g1a \
+	   nexthop dev g1b
+
+	tc qdisc add dev $ul1 clsact
+	tc filter add dev $ul1 egress pref 111 prot 802.1q \
+	   flower vlan_id 111 action pass
+	tc filter add dev $ul1 egress pref 222 prot 802.1q \
+	   flower vlan_id 222 action pass
+}
+
+sw1_destroy()
+{
+	tc qdisc del dev $ul1 clsact
+
+	ip route del vrf v$ol1 2001:db8:2::/64
+	ip route del vrf v$ol1 192.0.2.16/28
+
+	ip route del vrf v$ol1 192.0.2.82/32 via 192.0.2.146
+	__simple_if_fini g1b 192.0.2.81/32
+	tunnel_destroy g1b
+
+	ip route del vrf v$ol1 192.0.2.66/32 via 192.0.2.130
+	__simple_if_fini g1a 192.0.2.65/32
+	tunnel_destroy g1a
+
+	vlan_destroy $ul1 222
+	vlan_destroy $ul1 111
+	__simple_if_fini $ul1
+	simple_if_fini $ol1 192.0.2.2/28 2001:db8:1::2/64
+}
+
+sw2_create()
+{
+	simple_if_init $ol2 192.0.2.17/28 2001:db8:2::1/64
+	__simple_if_init $ul2 v$ol2
+	vlan_create $ul2 111 v$ol2 192.0.2.130/28
+	vlan_create $ul2 222 v$ol2 192.0.2.146/28
+
+	tunnel_create g2a gre 192.0.2.66 192.0.2.65 tos inherit dev v$ol2
+	__simple_if_init g2a v$ol2 192.0.2.66/32
+	ip route add vrf v$ol2 192.0.2.65/32 via 192.0.2.129
+
+	tunnel_create g2b gre 192.0.2.82 192.0.2.81 tos inherit dev v$ol2
+	__simple_if_init g2b v$ol2 192.0.2.82/32
+	ip route add vrf v$ol2 192.0.2.81/32 via 192.0.2.145
+
+	ip route add vrf v$ol2 192.0.2.0/28 \
+	   nexthop dev g2a \
+	   nexthop dev g2b
+	ip route add vrf v$ol2 2001:db8:1::/64 \
+	   nexthop dev g2a \
+	   nexthop dev g2b
+}
+
+sw2_destroy()
+{
+	ip route del vrf v$ol2 2001:db8:1::/64
+	ip route del vrf v$ol2 192.0.2.0/28
+
+	ip route del vrf v$ol2 192.0.2.81/32 via 192.0.2.145
+	__simple_if_fini g2b 192.0.2.82/32
+	tunnel_destroy g2b
+
+	ip route del vrf v$ol2 192.0.2.65/32 via 192.0.2.129
+	__simple_if_fini g2a 192.0.2.66/32
+	tunnel_destroy g2a
+
+	vlan_destroy $ul2 222
+	vlan_destroy $ul2 111
+	__simple_if_fini $ul2
+	simple_if_fini $ol2 192.0.2.17/28 2001:db8:2::1/64
+}
+
+h2_create()
+{
+	simple_if_init $h2 192.0.2.18/28 2001:db8:2::2/64
+	ip route add vrf v$h2 192.0.2.0/28 via 192.0.2.17
+	ip route add vrf v$h2 2001:db8:1::/64 via 2001:db8:2::1
+}
+
+h2_destroy()
+{
+	ip route del vrf v$h2 2001:db8:1::/64 via 2001:db8:2::1
+	ip route del vrf v$h2 192.0.2.0/28 via 192.0.2.17
+	simple_if_fini $h2 192.0.2.18/28 2001:db8:2::2/64
+}
+
+setup_prepare()
+{
+	h1=${NETIFS[p1]}
+	ol1=${NETIFS[p2]}
+
+	ul1=${NETIFS[p3]}
+	ul2=${NETIFS[p4]}
+
+	ol2=${NETIFS[p5]}
+	h2=${NETIFS[p6]}
+
+	vrf_prepare
+	h1_create
+	sw1_create
+	sw2_create
+	h2_create
+}
+
+cleanup()
+{
+	pre_cleanup
+
+	h2_destroy
+	sw2_destroy
+	sw1_destroy
+	h1_destroy
+	vrf_cleanup
+}
+
+multipath4_test()
+{
+	local what=$1; shift
+	local weight1=$1; shift
+	local weight2=$1; shift
+
+	sysctl_set net.ipv4.fib_multipath_hash_policy 1
+	ip route replace vrf v$ol1 192.0.2.16/28 \
+	   nexthop dev g1a weight $weight1 \
+	   nexthop dev g1b weight $weight2
+
+	local t0_111=$(tc_rule_stats_get $ul1 111 egress)
+	local t0_222=$(tc_rule_stats_get $ul1 222 egress)
+
+	ip vrf exec v$h1 \
+	   $MZ $h1 -q -p 64 -A 192.0.2.1 -B 192.0.2.18 \
+	       -d 1msec -t udp "sp=1024,dp=0-32768"
+
+	local t1_111=$(tc_rule_stats_get $ul1 111 egress)
+	local t1_222=$(tc_rule_stats_get $ul1 222 egress)
+
+	local d111=$((t1_111 - t0_111))
+	local d222=$((t1_222 - t0_222))
+	multipath_eval "$what" $weight1 $weight2 $d111 $d222
+
+	ip route replace vrf v$ol1 192.0.2.16/28 \
+	   nexthop dev g1a \
+	   nexthop dev g1b
+	sysctl_restore net.ipv4.fib_multipath_hash_policy
+}
+
+multipath6_l4_test()
+{
+	local what=$1; shift
+	local weight1=$1; shift
+	local weight2=$1; shift
+
+	sysctl_set net.ipv6.fib_multipath_hash_policy 1
+	ip route replace vrf v$ol1 2001:db8:2::/64 \
+	   nexthop dev g1a weight $weight1 \
+	   nexthop dev g1b weight $weight2
+
+	local t0_111=$(tc_rule_stats_get $ul1 111 egress)
+	local t0_222=$(tc_rule_stats_get $ul1 222 egress)
+
+	ip vrf exec v$h1 \
+	   $MZ $h1 -6 -q -p 64 -A 2001:db8:1::1 -B 2001:db8:2::2 \
+	       -d 1msec -t udp "sp=1024,dp=0-32768"
+
+	local t1_111=$(tc_rule_stats_get $ul1 111 egress)
+	local t1_222=$(tc_rule_stats_get $ul1 222 egress)
+
+	local d111=$((t1_111 - t0_111))
+	local d222=$((t1_222 - t0_222))
+	multipath_eval "$what" $weight1 $weight2 $d111 $d222
+
+	ip route replace vrf v$ol1 2001:db8:2::/64 \
+	   nexthop dev g1a \
+	   nexthop dev g1b
+	sysctl_restore net.ipv6.fib_multipath_hash_policy
+}
+
+multipath6_test()
+{
+	local what=$1; shift
+	local weight1=$1; shift
+	local weight2=$1; shift
+
+	ip route replace vrf v$ol1 2001:db8:2::/64 \
+	   nexthop dev g1a weight $weight1 \
+	   nexthop dev g1b weight $weight2
+
+	local t0_111=$(tc_rule_stats_get $ul1 111 egress)
+	local t0_222=$(tc_rule_stats_get $ul1 222 egress)
+
+        # Generate 16384 echo requests, each with a random flow label.
+	for ((i=0; i < 16384; ++i)); do
+		ip vrf exec v$h1 $PING6 2001:db8:2::2 -F 0 -c 1 -q &> /dev/null
+	done
+
+	local t1_111=$(tc_rule_stats_get $ul1 111 egress)
+	local t1_222=$(tc_rule_stats_get $ul1 222 egress)
+
+	local d111=$((t1_111 - t0_111))
+	local d222=$((t1_222 - t0_222))
+	multipath_eval "$what" $weight1 $weight2 $d111 $d222
+
+	ip route replace vrf v$ol1 2001:db8:2::/64 \
+	   nexthop dev g1a \
+	   nexthop dev g1b
+}
+
+ping_ipv4()
+{
+	ping_test $h1 192.0.2.18
+}
+
+ping_ipv6()
+{
+	ping6_test $h1 2001:db8:2::2
+}
+
+multipath_ipv4()
+{
+	log_info "Running IPv4 multipath tests"
+	multipath4_test "ECMP" 1 1
+	multipath4_test "Weighted MP 2:1" 2 1
+	multipath4_test "Weighted MP 11:45" 11 45
+}
+
+multipath_ipv6()
+{
+	log_info "Running IPv6 multipath tests"
+	multipath6_test "ECMP" 1 1
+	multipath6_test "Weighted MP 2:1" 2 1
+	multipath6_test "Weighted MP 11:45" 11 45
+}
+
+multipath_ipv6_l4()
+{
+	log_info "Running IPv6 L4 hash multipath tests"
+	multipath6_l4_test "ECMP" 1 1
+	multipath6_l4_test "Weighted MP 2:1" 2 1
+	multipath6_l4_test "Weighted MP 11:45" 11 45
+}
+
+trap cleanup EXIT
+
+setup_prepare
+setup_wait
+tests_run
+
+exit $EXIT_STATUS
-- 
2.4.11

^ permalink raw reply related

* [PATCH net-next 4/6] selftests: forwarding: lib: Extract interface-init functions
From: Petr Machata @ 2018-06-26  0:08 UTC (permalink / raw)
  To: netdev, linux-kselftest; +Cc: davem, shuah
In-Reply-To: <cover.1529971148.git.petrm@mellanox.com>

The function simple_if_init() does two things: it creates a VRF, then
moves an interface into this VRF and configures addresses. The latter
comes in handy when adding more interfaces into a VRF later on. The
situation is similar for simple_if_fini().

Therefore split the interface remastering and address de/initialization
logic to a new pair of helpers __simple_if_init() / __simple_if_fini(),
and defer to these helpers from simple_if_init() and simple_if_fini().

Signed-off-by: Petr Machata <petrm@mellanox.com>
Reviewed-by: Ido Schimmel <idosch@mellanox.com>
---
 tools/testing/selftests/net/forwarding/lib.sh | 32 +++++++++++++++++++++------
 1 file changed, 25 insertions(+), 7 deletions(-)

diff --git a/tools/testing/selftests/net/forwarding/lib.sh b/tools/testing/selftests/net/forwarding/lib.sh
index f94ea4bafa13..1dfdf14894e2 100644
--- a/tools/testing/selftests/net/forwarding/lib.sh
+++ b/tools/testing/selftests/net/forwarding/lib.sh
@@ -287,6 +287,29 @@ __addr_add_del()
 	done
 }
 
+__simple_if_init()
+{
+	local if_name=$1; shift
+	local vrf_name=$1; shift
+	local addrs=("${@}")
+
+	ip link set dev $if_name master $vrf_name
+	ip link set dev $if_name up
+
+	__addr_add_del $if_name add "${addrs[@]}"
+}
+
+__simple_if_fini()
+{
+	local if_name=$1; shift
+	local addrs=("${@}")
+
+	__addr_add_del $if_name del "${addrs[@]}"
+
+	ip link set dev $if_name down
+	ip link set dev $if_name nomaster
+}
+
 simple_if_init()
 {
 	local if_name=$1
@@ -298,11 +321,8 @@ simple_if_init()
 	array=("${@}")
 
 	vrf_create $vrf_name
-	ip link set dev $if_name master $vrf_name
 	ip link set dev $vrf_name up
-	ip link set dev $if_name up
-
-	__addr_add_del $if_name add "${array[@]}"
+	__simple_if_init $if_name $vrf_name "${array[@]}"
 }
 
 simple_if_fini()
@@ -315,9 +335,7 @@ simple_if_fini()
 	vrf_name=v$if_name
 	array=("${@}")
 
-	__addr_add_del $if_name del "${array[@]}"
-
-	ip link set dev $if_name down
+	__simple_if_fini $if_name "${array[@]}"
 	vrf_destroy $vrf_name
 }
 
-- 
2.4.11

^ permalink raw reply related

* [PATCH net-next 3/6] selftests: forwarding: tc_rule_stats_get: Parameterize direction
From: Petr Machata @ 2018-06-26  0:07 UTC (permalink / raw)
  To: netdev, linux-kselftest; +Cc: davem, shuah
In-Reply-To: <cover.1529971148.git.petrm@mellanox.com>

The GRE multipath tests need stats on an egress counter. Change
tc_rule_stats_get() to take direction as an optional argument, with
default of ingress.

Take the opportunity to change line continuation character from | to \.
Move the | to the next line, which indent.

Signed-off-by: Petr Machata <petrm@mellanox.com>
Reviewed-by: Ido Schimmel <idosch@mellanox.com>
---
 tools/testing/selftests/net/forwarding/lib.sh | 5 +++--
 1 file changed, 3 insertions(+), 2 deletions(-)

diff --git a/tools/testing/selftests/net/forwarding/lib.sh b/tools/testing/selftests/net/forwarding/lib.sh
index 911d753c4ff0..f94ea4bafa13 100644
--- a/tools/testing/selftests/net/forwarding/lib.sh
+++ b/tools/testing/selftests/net/forwarding/lib.sh
@@ -383,9 +383,10 @@ tc_rule_stats_get()
 {
 	local dev=$1; shift
 	local pref=$1; shift
+	local dir=$1; shift
 
-	tc -j -s filter show dev $dev ingress pref $pref |
-	jq '.[1].options.actions[].stats.packets'
+	tc -j -s filter show dev $dev ${dir:-ingress} pref $pref \
+	    | jq '.[1].options.actions[].stats.packets'
 }
 
 mac_get()
-- 
2.4.11

^ permalink raw reply related

* [PATCH net-next 2/6] selftests: forwarding: multipath_eval(): Improve style
From: Petr Machata @ 2018-06-26  0:07 UTC (permalink / raw)
  To: netdev, linux-kselftest; +Cc: davem, shuah
In-Reply-To: <cover.1529971148.git.petrm@mellanox.com>

- Change the indentation of the function body from 7 spaces to one tab.
- Move initialization of weights_ratio up so that it can be referenced
  from the error message about packet difference being zero.
- Move |'s consistently to continuation line, which reindent.

Signed-off-by: Petr Machata <petrm@mellanox.com>
Reviewed-by: Ido Schimmel <idosch@mellanox.com>
---
 tools/testing/selftests/net/forwarding/lib.sh | 74 ++++++++++++++-------------
 1 file changed, 39 insertions(+), 35 deletions(-)

diff --git a/tools/testing/selftests/net/forwarding/lib.sh b/tools/testing/selftests/net/forwarding/lib.sh
index 7fae805147ae..911d753c4ff0 100644
--- a/tools/testing/selftests/net/forwarding/lib.sh
+++ b/tools/testing/selftests/net/forwarding/lib.sh
@@ -559,41 +559,45 @@ tests_run()
 
 multipath_eval()
 {
-       local desc="$1"
-       local weight_rp12=$2
-       local weight_rp13=$3
-       local packets_rp12=$4
-       local packets_rp13=$5
-       local weights_ratio packets_ratio diff
-
-       RET=0
-
-       if [[ "$packets_rp12" -eq "0" || "$packets_rp13" -eq "0" ]]; then
-              check_err 1 "Packet difference is 0"
-              log_test "Multipath"
-              log_info "Expected ratio $weights_ratio"
-              return
-       fi
-
-       if [[ "$weight_rp12" -gt "$weight_rp13" ]]; then
-               weights_ratio=$(echo "scale=2; $weight_rp12 / $weight_rp13" \
-		       | bc -l)
-               packets_ratio=$(echo "scale=2; $packets_rp12 / $packets_rp13" \
-		       | bc -l)
-       else
-               weights_ratio=$(echo "scale=2; $weight_rp13 / $weight_rp12" | \
-		       bc -l)
-               packets_ratio=$(echo "scale=2; $packets_rp13 / $packets_rp12" | \
-		       bc -l)
-       fi
-
-       diff=$(echo $weights_ratio - $packets_ratio | bc -l)
-       diff=${diff#-}
-
-       test "$(echo "$diff / $weights_ratio > 0.15" | bc -l)" -eq 0
-       check_err $? "Too large discrepancy between expected and measured ratios"
-       log_test "$desc"
-       log_info "Expected ratio $weights_ratio Measured ratio $packets_ratio"
+	local desc="$1"
+	local weight_rp12=$2
+	local weight_rp13=$3
+	local packets_rp12=$4
+	local packets_rp13=$5
+	local weights_ratio packets_ratio diff
+
+	RET=0
+
+	if [[ "$weight_rp12" -gt "$weight_rp13" ]]; then
+		weights_ratio=$(echo "scale=2; $weight_rp12 / $weight_rp13" \
+				| bc -l)
+	else
+		weights_ratio=$(echo "scale=2; $weight_rp13 / $weight_rp12" \
+				| bc -l)
+	fi
+
+	if [[ "$packets_rp12" -eq "0" || "$packets_rp13" -eq "0" ]]; then
+	       check_err 1 "Packet difference is 0"
+	       log_test "Multipath"
+	       log_info "Expected ratio $weights_ratio"
+	       return
+	fi
+
+	if [[ "$weight_rp12" -gt "$weight_rp13" ]]; then
+		packets_ratio=$(echo "scale=2; $packets_rp12 / $packets_rp13" \
+				| bc -l)
+	else
+		packets_ratio=$(echo "scale=2; $packets_rp13 / $packets_rp12" \
+				| bc -l)
+	fi
+
+	diff=$(echo $weights_ratio - $packets_ratio | bc -l)
+	diff=${diff#-}
+
+	test "$(echo "$diff / $weights_ratio > 0.15" | bc -l)" -eq 0
+	check_err $? "Too large discrepancy between expected and measured ratios"
+	log_test "$desc"
+	log_info "Expected ratio $weights_ratio Measured ratio $packets_ratio"
 }
 
 ##############################################################################
-- 
2.4.11

^ permalink raw reply related

* [PATCH net-next 1/6] selftests: forwarding: Move multipath_eval() to lib.sh
From: Petr Machata @ 2018-06-26  0:06 UTC (permalink / raw)
  To: netdev, linux-kselftest; +Cc: davem, shuah
In-Reply-To: <cover.1529971148.git.petrm@mellanox.com>

This function will be useful for the GRE multipath test that is coming
later.

Signed-off-by: Petr Machata <petrm@mellanox.com>
Reviewed-by: Ido Schimmel <idosch@mellanox.com>
---
 tools/testing/selftests/net/forwarding/lib.sh      | 39 ++++++++++++++++++++++
 .../selftests/net/forwarding/router_multipath.sh   | 39 ----------------------
 2 files changed, 39 insertions(+), 39 deletions(-)

diff --git a/tools/testing/selftests/net/forwarding/lib.sh b/tools/testing/selftests/net/forwarding/lib.sh
index 7b18a53aa556..7fae805147ae 100644
--- a/tools/testing/selftests/net/forwarding/lib.sh
+++ b/tools/testing/selftests/net/forwarding/lib.sh
@@ -557,6 +557,45 @@ tests_run()
 	done
 }
 
+multipath_eval()
+{
+       local desc="$1"
+       local weight_rp12=$2
+       local weight_rp13=$3
+       local packets_rp12=$4
+       local packets_rp13=$5
+       local weights_ratio packets_ratio diff
+
+       RET=0
+
+       if [[ "$packets_rp12" -eq "0" || "$packets_rp13" -eq "0" ]]; then
+              check_err 1 "Packet difference is 0"
+              log_test "Multipath"
+              log_info "Expected ratio $weights_ratio"
+              return
+       fi
+
+       if [[ "$weight_rp12" -gt "$weight_rp13" ]]; then
+               weights_ratio=$(echo "scale=2; $weight_rp12 / $weight_rp13" \
+		       | bc -l)
+               packets_ratio=$(echo "scale=2; $packets_rp12 / $packets_rp13" \
+		       | bc -l)
+       else
+               weights_ratio=$(echo "scale=2; $weight_rp13 / $weight_rp12" | \
+		       bc -l)
+               packets_ratio=$(echo "scale=2; $packets_rp13 / $packets_rp12" | \
+		       bc -l)
+       fi
+
+       diff=$(echo $weights_ratio - $packets_ratio | bc -l)
+       diff=${diff#-}
+
+       test "$(echo "$diff / $weights_ratio > 0.15" | bc -l)" -eq 0
+       check_err $? "Too large discrepancy between expected and measured ratios"
+       log_test "$desc"
+       log_info "Expected ratio $weights_ratio Measured ratio $packets_ratio"
+}
+
 ##############################################################################
 # Tests
 
diff --git a/tools/testing/selftests/net/forwarding/router_multipath.sh b/tools/testing/selftests/net/forwarding/router_multipath.sh
index 8b6d0fb6d604..79a209927962 100755
--- a/tools/testing/selftests/net/forwarding/router_multipath.sh
+++ b/tools/testing/selftests/net/forwarding/router_multipath.sh
@@ -159,45 +159,6 @@ router2_destroy()
 	vrf_destroy "vrf-r2"
 }
 
-multipath_eval()
-{
-       local desc="$1"
-       local weight_rp12=$2
-       local weight_rp13=$3
-       local packets_rp12=$4
-       local packets_rp13=$5
-       local weights_ratio packets_ratio diff
-
-       RET=0
-
-       if [[ "$packets_rp12" -eq "0" || "$packets_rp13" -eq "0" ]]; then
-              check_err 1 "Packet difference is 0"
-              log_test "Multipath"
-              log_info "Expected ratio $weights_ratio"
-              return
-       fi
-
-       if [[ "$weight_rp12" -gt "$weight_rp13" ]]; then
-               weights_ratio=$(echo "scale=2; $weight_rp12 / $weight_rp13" \
-		       | bc -l)
-               packets_ratio=$(echo "scale=2; $packets_rp12 / $packets_rp13" \
-		       | bc -l)
-       else
-               weights_ratio=$(echo "scale=2; $weight_rp13 / $weight_rp12" | \
-		       bc -l)
-               packets_ratio=$(echo "scale=2; $packets_rp13 / $packets_rp12" | \
-		       bc -l)
-       fi
-
-       diff=$(echo $weights_ratio - $packets_ratio | bc -l)
-       diff=${diff#-}
-
-       test "$(echo "$diff / $weights_ratio > 0.15" | bc -l)" -eq 0
-       check_err $? "Too large discrepancy between expected and measured ratios"
-       log_test "$desc"
-       log_info "Expected ratio $weights_ratio Measured ratio $packets_ratio"
-}
-
 multipath4_test()
 {
        local desc="$1"
-- 
2.4.11

^ permalink raw reply related

* [PATCH net-next 0/6] Multipath tests for tunnel devices
From: Petr Machata @ 2018-06-26  0:05 UTC (permalink / raw)
  To: netdev, linux-kselftest; +Cc: davem, shuah

This patchset adds a test for ECMP and weighted ECMP between two GRE
tunnels.

In patches #1 and #2, the function multipath_eval() is first moved from
router_multipath.sh to lib.sh for ease of reuse, and then fixed up.

In patch #3, the function tc_rule_stats_get() is parameterized to be
useful for egress rules as well.

In patch #4, a new function __simple_if_init() is extracted from
simple_if_init(). This covers the logic that needs to be done for the
usual interface: VRF migration, upping and installation of IP addresses.

Patch #5 then adds the test itself.

Additionally in patch #6, a requirement to add diagrams to selftests is
documented.

Petr Machata (6):
  selftests: forwarding: Move multipath_eval() to lib.sh
  selftests: forwarding: multipath_eval(): Improve style
  selftests: forwarding: tc_rule_stats_get: Parameterize direction
  selftests: forwarding: lib: Extract interface-init functions
  selftests: forwarding: Test multipath tunneling
  selftests: forwarding: README: Require diagrams

 tools/testing/selftests/net/forwarding/README      |   2 +
 .../selftests/net/forwarding/gre_multipath.sh      | 354 +++++++++++++++++++++
 tools/testing/selftests/net/forwarding/lib.sh      |  80 ++++-
 .../selftests/net/forwarding/router_multipath.sh   |  39 ---
 4 files changed, 427 insertions(+), 48 deletions(-)
 create mode 100755 tools/testing/selftests/net/forwarding/gre_multipath.sh

-- 
2.4.11

^ permalink raw reply

* [PATCH net-next] net/tls: Remove VLA usage on nonce
From: Kees Cook @ 2018-06-25 23:55 UTC (permalink / raw)
  To: Dave Watson
  Cc: Boris Pismenny, Aviad Yehezkel, David S. Miller, netdev,
	linux-kernel

It looks like the prior VLA removal, commit b16520f7493d ("net/tls: Remove
VLA usage"), and a new VLA addition, commit c46234ebb4d1e ("tls: RX path
for ktls"), passed in the night. This removes the newly added VLA, which
happens to have its bounds based on the same max value.

Signed-off-by: Kees Cook <keescook@chromium.org>
---
 net/tls/tls_sw.c | 10 ++++++++--
 1 file changed, 8 insertions(+), 2 deletions(-)

diff --git a/net/tls/tls_sw.c b/net/tls/tls_sw.c
index f127fac88acf..3e0b64549552 100644
--- a/net/tls/tls_sw.c
+++ b/net/tls/tls_sw.c
@@ -941,7 +941,7 @@ static int tls_read_size(struct strparser *strp, struct sk_buff *skb)
 {
 	struct tls_context *tls_ctx = tls_get_ctx(strp->sk);
 	struct tls_sw_context_rx *ctx = tls_sw_ctx_rx(tls_ctx);
-	char header[tls_ctx->rx.prepend_size];
+	char header[TLS_HEADER_SIZE + MAX_IV_SIZE];
 	struct strp_msg *rxm = strp_msg(skb);
 	size_t cipher_overhead;
 	size_t data_len = 0;
@@ -951,6 +951,12 @@ static int tls_read_size(struct strparser *strp, struct sk_buff *skb)
 	if (rxm->offset + tls_ctx->rx.prepend_size > skb->len)
 		return 0;
 
+	/* Sanity-check size of on-stack buffer. */
+	if (WARN_ON(tls_ctx->rx.prepend_size > sizeof(header))) {
+		ret = -EINVAL;
+		goto read_failure;
+	}
+
 	/* Linearize header to local buffer */
 	ret = skb_copy_bits(skb, rxm->offset, header, tls_ctx->rx.prepend_size);
 
@@ -1111,7 +1117,7 @@ int tls_set_sw_offload(struct sock *sk, struct tls_context *ctx, int tx)
 	}
 
 	/* Sanity-check the IV size for stack allocations. */
-	if (iv_size > MAX_IV_SIZE) {
+	if (iv_size > MAX_IV_SIZE || nonce_size > MAX_IV_SIZE) {
 		rc = -EINVAL;
 		goto free_priv;
 	}
-- 
2.17.1


-- 
Kees Cook
Pixel Security

^ permalink raw reply related

* [PATCH v2 net-next 3/4] netdevsim: add ipsec offload testing
From: Shannon Nelson @ 2018-06-25 23:41 UTC (permalink / raw)
  To: davem, netdev, jakub.kicinski; +Cc: anders.roxell, linux-kselftest
In-Reply-To: <1529970096-23199-1-git-send-email-shannon.nelson@oracle.com>

Implement the IPsec/XFRM offload API for testing.

Signed-off-by: Shannon Nelson <shannon.nelson@oracle.com>
---
V2 - addressed formatting comments from Jakub Kicinski

 drivers/net/netdevsim/Makefile    |   4 +
 drivers/net/netdevsim/ipsec.c     | 298 ++++++++++++++++++++++++++++++++++++++
 drivers/net/netdevsim/netdev.c    |   7 +
 drivers/net/netdevsim/netdevsim.h |  41 ++++++
 4 files changed, 350 insertions(+)
 create mode 100644 drivers/net/netdevsim/ipsec.c

diff --git a/drivers/net/netdevsim/Makefile b/drivers/net/netdevsim/Makefile
index 449b2a1..0fee1d0 100644
--- a/drivers/net/netdevsim/Makefile
+++ b/drivers/net/netdevsim/Makefile
@@ -13,3 +13,7 @@ endif
 ifneq ($(CONFIG_NET_DEVLINK),)
 netdevsim-objs += devlink.o fib.o
 endif
+
+ifneq ($(CONFIG_XFRM_OFFLOAD),)
+netdevsim-objs += ipsec.o
+endif
diff --git a/drivers/net/netdevsim/ipsec.c b/drivers/net/netdevsim/ipsec.c
new file mode 100644
index 0000000..1a23426
--- /dev/null
+++ b/drivers/net/netdevsim/ipsec.c
@@ -0,0 +1,298 @@
+// SPDX-License-Identifier: GPL-2.0
+/* Copyright(c) 2018 Oracle and/or its affiliates. All rights reserved. */
+
+#include <crypto/aead.h>
+#include <linux/debugfs.h>
+#include <net/xfrm.h>
+
+#include "netdevsim.h"
+
+#define NSIM_IPSEC_AUTH_BITS	128
+
+static ssize_t nsim_dbg_netdev_ops_read(struct file *filp,
+					char __user *buffer,
+					size_t count, loff_t *ppos)
+{
+	struct netdevsim *ns = filp->private_data;
+	struct nsim_ipsec *ipsec = &ns->ipsec;
+	size_t bufsize;
+	char *buf, *p;
+	int len;
+	int i;
+
+	/* the buffer needed is
+	 * (num SAs * 3 lines each * ~60 bytes per line) + one more line
+	 */
+	bufsize = (ipsec->count * 4 * 60) + 60;
+	buf = kzalloc(bufsize, GFP_KERNEL);
+	if (!buf)
+		return -ENOMEM;
+
+	p = buf;
+	p += snprintf(p, bufsize - (p - buf),
+		      "SA count=%u tx=%u\n",
+		      ipsec->count, ipsec->tx);
+
+	for (i = 0; i < NSIM_IPSEC_MAX_SA_COUNT; i++) {
+		struct nsim_sa *sap = &ipsec->sa[i];
+
+		if (!sap->used)
+			continue;
+
+		p += snprintf(p, bufsize - (p - buf),
+			      "sa[%i] %cx ipaddr=0x%08x %08x %08x %08x\n",
+			      i, (sap->rx ? 'r' : 't'), sap->ipaddr[0],
+			      sap->ipaddr[1], sap->ipaddr[2], sap->ipaddr[3]);
+		p += snprintf(p, bufsize - (p - buf),
+			      "sa[%i]    spi=0x%08x proto=0x%x salt=0x%08x crypt=%d\n",
+			      i, be32_to_cpu(sap->xs->id.spi),
+			      sap->xs->id.proto, sap->salt, sap->crypt);
+		p += snprintf(p, bufsize - (p - buf),
+			      "sa[%i]    key=0x%08x %08x %08x %08x\n",
+			      i, sap->key[0], sap->key[1],
+			      sap->key[2], sap->key[3]);
+	}
+
+	len = simple_read_from_buffer(buffer, count, ppos, buf, p - buf);
+
+	kfree(buf);
+	return len;
+}
+
+static const struct file_operations ipsec_dbg_fops = {
+	.owner = THIS_MODULE,
+	.open = simple_open,
+	.read = nsim_dbg_netdev_ops_read,
+};
+
+static int nsim_ipsec_find_empty_idx(struct nsim_ipsec *ipsec)
+{
+	u32 i;
+
+	if (ipsec->count == NSIM_IPSEC_MAX_SA_COUNT)
+		return -ENOSPC;
+
+	/* search sa table */
+	for (i = 0; i < NSIM_IPSEC_MAX_SA_COUNT; i++) {
+		if (!ipsec->sa[i].used)
+			return i;
+	}
+
+	return -ENOSPC;
+}
+
+static int nsim_ipsec_parse_proto_keys(struct xfrm_state *xs,
+				       u32 *mykey, u32 *mysalt)
+{
+	const char aes_gcm_name[] = "rfc4106(gcm(aes))";
+	struct net_device *dev = xs->xso.dev;
+	unsigned char *key_data;
+	char *alg_name = NULL;
+	int key_len;
+
+	if (!xs->aead) {
+		netdev_err(dev, "Unsupported IPsec algorithm\n");
+		return -EINVAL;
+	}
+
+	if (xs->aead->alg_icv_len != NSIM_IPSEC_AUTH_BITS) {
+		netdev_err(dev, "IPsec offload requires %d bit authentication\n",
+			   NSIM_IPSEC_AUTH_BITS);
+		return -EINVAL;
+	}
+
+	key_data = &xs->aead->alg_key[0];
+	key_len = xs->aead->alg_key_len;
+	alg_name = xs->aead->alg_name;
+
+	if (strcmp(alg_name, aes_gcm_name)) {
+		netdev_err(dev, "Unsupported IPsec algorithm - please use %s\n",
+			   aes_gcm_name);
+		return -EINVAL;
+	}
+
+	/* 160 accounts for 16 byte key and 4 byte salt */
+	if (key_len > NSIM_IPSEC_AUTH_BITS) {
+		*mysalt = ((u32 *)key_data)[4];
+	} else if (key_len == NSIM_IPSEC_AUTH_BITS) {
+		*mysalt = 0;
+	} else {
+		netdev_err(dev, "IPsec hw offload only supports 128 bit keys with optional 32 bit salt\n");
+		return -EINVAL;
+	}
+	memcpy(mykey, key_data, 16);
+
+	return 0;
+}
+
+static int nsim_ipsec_add_sa(struct xfrm_state *xs)
+{
+	struct nsim_ipsec *ipsec;
+	struct net_device *dev;
+	struct netdevsim *ns;
+	struct nsim_sa sa;
+	u16 sa_idx;
+	int ret;
+
+	dev = xs->xso.dev;
+	ns = netdev_priv(dev);
+	ipsec = &ns->ipsec;
+
+	if (xs->id.proto != IPPROTO_ESP && xs->id.proto != IPPROTO_AH) {
+		netdev_err(dev, "Unsupported protocol 0x%04x for ipsec offload\n",
+			   xs->id.proto);
+		return -EINVAL;
+	}
+
+	if (xs->calg) {
+		netdev_err(dev, "Compression offload not supported\n");
+		return -EINVAL;
+	}
+
+	/* find the first unused index */
+	ret = nsim_ipsec_find_empty_idx(ipsec);
+	if (ret < 0) {
+		netdev_err(dev, "No space for SA in Rx table!\n");
+		return ret;
+	}
+	sa_idx = (u16)ret;
+
+	memset(&sa, 0, sizeof(sa));
+	sa.used = true;
+	sa.xs = xs;
+
+	if (sa.xs->id.proto & IPPROTO_ESP)
+		sa.crypt = xs->ealg || xs->aead;
+
+	/* get the key and salt */
+	ret = nsim_ipsec_parse_proto_keys(xs, sa.key, &sa.salt);
+	if (ret) {
+		netdev_err(dev, "Failed to get key data for SA table\n");
+		return ret;
+	}
+
+	if (xs->xso.flags & XFRM_OFFLOAD_INBOUND) {
+		sa.rx = true;
+
+		if (xs->props.family == AF_INET6)
+			memcpy(sa.ipaddr, &xs->id.daddr.a6, 16);
+		else
+			memcpy(&sa.ipaddr[3], &xs->id.daddr.a4, 4);
+	}
+
+	/* the preparations worked, so save the info */
+	memcpy(&ipsec->sa[sa_idx], &sa, sizeof(sa));
+
+	/* the XFRM stack doesn't like offload_handle == 0,
+	 * so add a bitflag in case our array index is 0
+	 */
+	xs->xso.offload_handle = sa_idx | NSIM_IPSEC_VALID;
+	ipsec->count++;
+
+	return 0;
+}
+
+static void nsim_ipsec_del_sa(struct xfrm_state *xs)
+{
+	struct net_device *dev = xs->xso.dev;
+	struct netdevsim *ns = netdev_priv(dev);
+	struct nsim_ipsec *ipsec = &ns->ipsec;
+	u16 sa_idx;
+
+	sa_idx = xs->xso.offload_handle & ~NSIM_IPSEC_VALID;
+	if (!ipsec->sa[sa_idx].used) {
+		netdev_err(dev, "Invalid SA for delete sa_idx=%d\n", sa_idx);
+		return;
+	}
+
+	memset(&ipsec->sa[sa_idx], 0, sizeof(struct nsim_sa));
+	ipsec->count--;
+}
+
+static bool nsim_ipsec_offload_ok(struct sk_buff *skb, struct xfrm_state *xs)
+{
+	struct net_device *dev = xs->xso.dev;
+	struct netdevsim *ns = netdev_priv(dev);
+	struct nsim_ipsec *ipsec = &ns->ipsec;
+
+	ipsec->ok++;
+
+	return true;
+}
+
+static const struct xfrmdev_ops nsim_xfrmdev_ops = {
+	.xdo_dev_state_add	= nsim_ipsec_add_sa,
+	.xdo_dev_state_delete	= nsim_ipsec_del_sa,
+	.xdo_dev_offload_ok	= nsim_ipsec_offload_ok,
+};
+
+bool nsim_ipsec_tx(struct netdevsim *ns, struct sk_buff *skb)
+{
+	struct nsim_ipsec *ipsec = &ns->ipsec;
+	struct xfrm_state *xs;
+	struct nsim_sa *tsa;
+	u32 sa_idx;
+
+	/* do we even need to check this packet? */
+	if (!skb->sp)
+		return true;
+
+	if (unlikely(!skb->sp->len)) {
+		netdev_err(ns->netdev, "no xfrm state len = %d\n",
+			   skb->sp->len);
+		return false;
+	}
+
+	xs = xfrm_input_state(skb);
+	if (unlikely(!xs)) {
+		netdev_err(ns->netdev, "no xfrm_input_state() xs = %p\n", xs);
+		return false;
+	}
+
+	sa_idx = xs->xso.offload_handle & ~NSIM_IPSEC_VALID;
+	if (unlikely(sa_idx > NSIM_IPSEC_MAX_SA_COUNT)) {
+		netdev_err(ns->netdev, "bad sa_idx=%d max=%d\n",
+			   sa_idx, NSIM_IPSEC_MAX_SA_COUNT);
+		return false;
+	}
+
+	tsa = &ipsec->sa[sa_idx];
+	if (unlikely(!tsa->used)) {
+		netdev_err(ns->netdev, "unused sa_idx=%d\n", sa_idx);
+		return false;
+	}
+
+	if (xs->id.proto != IPPROTO_ESP && xs->id.proto != IPPROTO_AH) {
+		netdev_err(ns->netdev, "unexpected proto=%d\n", xs->id.proto);
+		return false;
+	}
+
+	ipsec->tx++;
+
+	return true;
+}
+
+void nsim_ipsec_init(struct netdevsim *ns)
+{
+	ns->netdev->xfrmdev_ops = &nsim_xfrmdev_ops;
+
+#define NSIM_ESP_FEATURES	(NETIF_F_HW_ESP | \
+				 NETIF_F_HW_ESP_TX_CSUM | \
+				 NETIF_F_GSO_ESP)
+
+	ns->netdev->features |= NSIM_ESP_FEATURES;
+	ns->netdev->hw_enc_features |= NSIM_ESP_FEATURES;
+
+	ns->ipsec.pfile = debugfs_create_file("ipsec", 0400, ns->ddir, ns,
+					      &ipsec_dbg_fops);
+}
+
+void nsim_ipsec_teardown(struct netdevsim *ns)
+{
+	struct nsim_ipsec *ipsec = &ns->ipsec;
+
+	if (ipsec->count)
+		netdev_err(ns->netdev, "tearing down IPsec offload with %d SAs left\n",
+			   ipsec->count);
+	debugfs_remove_recursive(ipsec->pfile);
+}
diff --git a/drivers/net/netdevsim/netdev.c b/drivers/net/netdevsim/netdev.c
index ec68f38..6ce8604d 100644
--- a/drivers/net/netdevsim/netdev.c
+++ b/drivers/net/netdevsim/netdev.c
@@ -171,6 +171,8 @@ static int nsim_init(struct net_device *dev)
 	if (err)
 		goto err_unreg_dev;
 
+	nsim_ipsec_init(ns);
+
 	return 0;
 
 err_unreg_dev:
@@ -186,6 +188,7 @@ static void nsim_uninit(struct net_device *dev)
 {
 	struct netdevsim *ns = netdev_priv(dev);
 
+	nsim_ipsec_teardown(ns);
 	nsim_devlink_teardown(ns);
 	debugfs_remove_recursive(ns->ddir);
 	nsim_bpf_uninit(ns);
@@ -203,11 +206,15 @@ static netdev_tx_t nsim_start_xmit(struct sk_buff *skb, struct net_device *dev)
 {
 	struct netdevsim *ns = netdev_priv(dev);
 
+	if (!nsim_ipsec_tx(ns, skb))
+		goto out;
+
 	u64_stats_update_begin(&ns->syncp);
 	ns->tx_packets++;
 	ns->tx_bytes += skb->len;
 	u64_stats_update_end(&ns->syncp);
 
+out:
 	dev_kfree_skb(skb);
 
 	return NETDEV_TX_OK;
diff --git a/drivers/net/netdevsim/netdevsim.h b/drivers/net/netdevsim/netdevsim.h
index 3a8581a..29448e8 100644
--- a/drivers/net/netdevsim/netdevsim.h
+++ b/drivers/net/netdevsim/netdevsim.h
@@ -29,6 +29,27 @@ struct bpf_prog;
 struct dentry;
 struct nsim_vf_config;
 
+#define NSIM_IPSEC_MAX_SA_COUNT		33
+#define NSIM_IPSEC_VALID		BIT(31)
+
+struct nsim_sa {
+	struct xfrm_state *xs;
+	__be32 ipaddr[4];
+	u32 key[4];
+	u32 salt;
+	bool used;
+	bool crypt;
+	bool rx;
+};
+
+struct nsim_ipsec {
+	struct nsim_sa sa[NSIM_IPSEC_MAX_SA_COUNT];
+	struct dentry *pfile;
+	u32 count;
+	u32 tx;
+	u32 ok;
+};
+
 struct netdevsim {
 	struct net_device *netdev;
 
@@ -67,6 +88,7 @@ struct netdevsim {
 #if IS_ENABLED(CONFIG_NET_DEVLINK)
 	struct devlink *devlink;
 #endif
+	struct nsim_ipsec ipsec;
 };
 
 extern struct dentry *nsim_ddir;
@@ -147,6 +169,25 @@ static inline void nsim_devlink_exit(void)
 }
 #endif
 
+#if IS_ENABLED(CONFIG_XFRM_OFFLOAD)
+void nsim_ipsec_init(struct netdevsim *ns);
+void nsim_ipsec_teardown(struct netdevsim *ns);
+bool nsim_ipsec_tx(struct netdevsim *ns, struct sk_buff *skb);
+#else
+static inline void nsim_ipsec_init(struct netdevsim *ns)
+{
+}
+
+static inline void nsim_ipsec_teardown(struct netdevsim *ns)
+{
+}
+
+static inline bool nsim_ipsec_tx(struct netdevsim *ns, struct sk_buff *skb)
+{
+	return true;
+}
+#endif
+
 static inline struct netdevsim *to_nsim(struct device *ptr)
 {
 	return container_of(ptr, struct netdevsim, dev);
-- 
2.7.4

^ permalink raw reply related

* [PATCH v2 net-next 4/4] selftests: rtnetlink: add ipsec offload API test
From: Shannon Nelson @ 2018-06-25 23:41 UTC (permalink / raw)
  To: davem, netdev, jakub.kicinski; +Cc: anders.roxell, linux-kselftest
In-Reply-To: <1529970096-23199-1-git-send-email-shannon.nelson@oracle.com>

Using the netdevsim as a device for testing, try out the XFRM commands
for setting up IPsec hardware offloads.

Signed-off-by: Shannon Nelson <shannon.nelson@oracle.com>
---
 tools/testing/selftests/net/rtnetlink.sh | 114 +++++++++++++++++++++++++++++++
 1 file changed, 114 insertions(+)

diff --git a/tools/testing/selftests/net/rtnetlink.sh b/tools/testing/selftests/net/rtnetlink.sh
index 15948cf..9e1a82e 100755
--- a/tools/testing/selftests/net/rtnetlink.sh
+++ b/tools/testing/selftests/net/rtnetlink.sh
@@ -608,6 +608,119 @@ kci_test_ipsec()
 	echo "PASS: ipsec"
 }
 
+#-------------------------------------------------------------------
+# Example commands
+#   ip x s add proto esp src 14.0.0.52 dst 14.0.0.70 \
+#            spi 0x07 mode transport reqid 0x07 replay-window 32 \
+#            aead 'rfc4106(gcm(aes))' 1234567890123456dcba 128 \
+#            sel src 14.0.0.52/24 dst 14.0.0.70/24
+#            offload dev sim1 dir out
+#   ip x p add dir out src 14.0.0.52/24 dst 14.0.0.70/24 \
+#            tmpl proto esp src 14.0.0.52 dst 14.0.0.70 \
+#            spi 0x07 mode transport reqid 0x07
+#
+#-------------------------------------------------------------------
+kci_test_ipsec_offload()
+{
+	ret=0
+	algo="aead rfc4106(gcm(aes)) 0x3132333435363738393031323334353664636261 128"
+	srcip=192.168.123.3
+	dstip=192.168.123.4
+	dev=simx1
+	sysfsd=/sys/kernel/debug/netdevsim/$dev
+	sysfsf=$sysfsd/ipsec
+
+	# setup netdevsim since dummydev doesn't have offload support
+	modprobe netdevsim
+	check_err $?
+	if [ $ret -ne 0 ]; then
+		echo "FAIL: ipsec_offload can't load netdevsim"
+		return 1
+	fi
+
+	ip link add $dev type netdevsim
+	ip addr add $srcip dev $dev
+	ip link set $dev up
+	if [ ! -d $sysfsd ] ; then
+		echo "FAIL: ipsec_offload can't create device $dev"
+		return 1
+	fi
+	if [ ! -f $sysfsf ] ; then
+		echo "FAIL: ipsec_offload netdevsim doesn't support IPsec offload"
+		return 1
+	fi
+
+	# flush to be sure there's nothing configured
+	ip x s flush ; ip x p flush
+
+	# create offloaded SAs, both in and out
+	ip x p add dir out src $srcip/24 dst $dstip/24 \
+	    tmpl proto esp src $srcip dst $dstip spi 9 \
+	    mode transport reqid 42
+	check_err $?
+	ip x p add dir out src $dstip/24 dst $srcip/24 \
+	    tmpl proto esp src $dstip dst $srcip spi 9 \
+	    mode transport reqid 42
+	check_err $?
+
+	ip x s add proto esp src $srcip dst $dstip spi 9 \
+	    mode transport reqid 42 $algo sel src $srcip/24 dst $dstip/24 \
+	    offload dev $dev dir out
+	check_err $?
+	ip x s add proto esp src $dstip dst $srcip spi 9 \
+	    mode transport reqid 42 $algo sel src $dstip/24 dst $srcip/24 \
+	    offload dev $dev dir in
+	check_err $?
+	if [ $ret -ne 0 ]; then
+		echo "FAIL: ipsec_offload can't create SA"
+		return 1
+	fi
+
+	# does offload show up in ip output
+	lines=`ip x s list | grep -c "crypto offload parameters: dev $dev dir"`
+	if [ $lines -ne 2 ] ; then
+		echo "FAIL: ipsec_offload SA offload missing from list output"
+		check_err 1
+	fi
+
+	# use ping to exercise the Tx path
+	ping -I $dev -c 3 -W 1 -i 0 $dstip >/dev/null
+
+	# does driver have correct offload info
+	diff $sysfsf - << EOF
+SA count=2 tx=3
+sa[0] tx ipaddr=0x00000000 00000000 00000000 00000000
+sa[0]    spi=0x00000009 proto=0x32 salt=0x61626364 crypt=1
+sa[0]    key=0x34333231 38373635 32313039 36353433
+sa[1] rx ipaddr=0x00000000 00000000 00000000 037ba8c0
+sa[1]    spi=0x00000009 proto=0x32 salt=0x61626364 crypt=1
+sa[1]    key=0x34333231 38373635 32313039 36353433
+EOF
+	if [ $? -ne 0 ] ; then
+		echo "FAIL: ipsec_offload incorrect driver data"
+		check_err 1
+	fi
+
+	# does offload get removed from driver
+	ip x s flush
+	ip x p flush
+	lines=`grep -c "SA count=0" $sysfsf`
+	if [ $lines -ne 1 ] ; then
+		echo "FAIL: ipsec_offload SA not removed from driver"
+		check_err 1
+	fi
+
+	# clean up any leftovers
+	ip link del $dev
+	rmmod netdevsim
+
+	if [ $ret -ne 0 ]; then
+		echo "FAIL: ipsec_offload"
+		return 1
+	fi
+	echo "PASS: ipsec_offload"
+}
+
 kci_test_gretap()
 {
 	testns="testns"
@@ -862,6 +975,7 @@ kci_test_rtnl()
 	kci_test_encap
 	kci_test_macsec
 	kci_test_ipsec
+	kci_test_ipsec_offload
 
 	kci_del_dummy
 }
-- 
2.7.4

^ permalink raw reply related

* [PATCH v2 net-next 1/4] selftests: rtnetlink: clear the return code at start of ipsec test
From: Shannon Nelson @ 2018-06-25 23:41 UTC (permalink / raw)
  To: davem, netdev, jakub.kicinski; +Cc: anders.roxell, linux-kselftest
In-Reply-To: <1529970096-23199-1-git-send-email-shannon.nelson@oracle.com>

Following the custom from the other functions, clear the global
ret code before starting the test so as to not have previously
failed tests cause us to thing this test has failed.

Reported-by: Anders Roxell <anders.roxell@linaro.org>
Signed-off-by: Shannon Nelson <shannon.nelson@oracle.com>
---
 tools/testing/selftests/net/rtnetlink.sh | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/tools/testing/selftests/net/rtnetlink.sh b/tools/testing/selftests/net/rtnetlink.sh
index b33a371..261a981 100755
--- a/tools/testing/selftests/net/rtnetlink.sh
+++ b/tools/testing/selftests/net/rtnetlink.sh
@@ -522,6 +522,8 @@ kci_test_macsec()
 #-------------------------------------------------------------------
 kci_test_ipsec()
 {
+	ret=0
+
 	# find an ip address on this machine and make up a destination
 	srcip=`ip -o addr | awk '/inet / { print $4; }' | grep -v "^127" | head -1 | cut -f1 -d/`
 	net=`echo $srcip | cut -f1-3 -d.`
-- 
2.7.4

^ permalink raw reply related

* [PATCH v2 net-next 2/4] selftests: rtnetlink: use dummydev as a test device
From: Shannon Nelson @ 2018-06-25 23:41 UTC (permalink / raw)
  To: davem, netdev, jakub.kicinski; +Cc: anders.roxell, linux-kselftest
In-Reply-To: <1529970096-23199-1-git-send-email-shannon.nelson@oracle.com>

We really shouldn't mess with local system settings, so let's
use the already created dummy device instead for ipsec testing.
Oh, and let's put the temp file into a proper directory.

Signed-off-by: Shannon Nelson <shannon.nelson@oracle.com>
---
 tools/testing/selftests/net/rtnetlink.sh | 15 +++++++--------
 1 file changed, 7 insertions(+), 8 deletions(-)

diff --git a/tools/testing/selftests/net/rtnetlink.sh b/tools/testing/selftests/net/rtnetlink.sh
index 261a981..15948cf 100755
--- a/tools/testing/selftests/net/rtnetlink.sh
+++ b/tools/testing/selftests/net/rtnetlink.sh
@@ -523,21 +523,19 @@ kci_test_macsec()
 kci_test_ipsec()
 {
 	ret=0
-
-	# find an ip address on this machine and make up a destination
-	srcip=`ip -o addr | awk '/inet / { print $4; }' | grep -v "^127" | head -1 | cut -f1 -d/`
-	net=`echo $srcip | cut -f1-3 -d.`
-	base=`echo $srcip | cut -f4 -d.`
-	dstip="$net."`expr $base + 1`
-
 	algo="aead rfc4106(gcm(aes)) 0x3132333435363738393031323334353664636261 128"
+	srcip=192.168.123.1
+	dstip=192.168.123.2
+	spi=7
+
+	ip addr add $srcip dev $devdummy
 
 	# flush to be sure there's nothing configured
 	ip x s flush ; ip x p flush
 	check_err $?
 
 	# start the monitor in the background
-	tmpfile=`mktemp ipsectestXXX`
+	tmpfile=`mktemp /var/run/ipsectestXXX`
 	mpid=`(ip x m > $tmpfile & echo $!) 2>/dev/null`
 	sleep 0.2
 
@@ -601,6 +599,7 @@ kci_test_ipsec()
 	check_err $?
 	ip x p flush
 	check_err $?
+	ip addr del $srcip/32 dev $devdummy
 
 	if [ $ret -ne 0 ]; then
 		echo "FAIL: ipsec"
-- 
2.7.4

^ permalink raw reply related

* [PATCH v2 net-next 0/4] Updates for ipsec selftests
From: Shannon Nelson @ 2018-06-25 23:41 UTC (permalink / raw)
  To: davem, netdev, jakub.kicinski; +Cc: anders.roxell, linux-kselftest

Fix up the existing ipsec selftest and add tests for
the ipsec offload driver API.

v2: addressed formatting nits in netdevsim from Jakub Kicinski

Shannon Nelson (4):
  selftests: rtnetlink: clear the return code at start of ipsec test
  selftests: rtnetlink: use dummydev as a test device
  netdevsim: add ipsec offload testing
  selftests: rtnetlink: add ipsec offload API test

 drivers/net/netdevsim/Makefile           |   4 +
 drivers/net/netdevsim/ipsec.c            | 345 +++++++++++++++++++++++++++++++
 drivers/net/netdevsim/netdev.c           |   7 +
 drivers/net/netdevsim/netdevsim.h        |  37 ++++
 tools/testing/selftests/net/rtnetlink.sh | 132 +++++++++++-
 5 files changed, 518 insertions(+), 7 deletions(-)
 create mode 100644 drivers/net/netdevsim/ipsec.c

-- 
2.7.4

^ permalink raw reply

* [PATCH v2] fib_rules: match rules based on suppress_* properties too
From: Jason A. Donenfeld @ 2018-06-25 23:39 UTC (permalink / raw)
  To: roopa, netdev; +Cc: Jason A. Donenfeld
In-Reply-To: <CAJieiUiqUCRBGTqkCWQTFJuZa7AvcoAzPAbJc3LyQT6WjBan+A@mail.gmail.com>

Two rules with different values of suppress_prefix or suppress_ifgroup
are not the same. This fixes an -EEXIST when running:

   $ ip -4 rule add table main suppress_prefixlength 0

Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
Fixes: f9d4b0c1e969 ("fib_rules: move common handling of newrule delrule msgs into fib_nl2rule")
---
This adds the new condition you mentioned. I'm not sure what you make of
DaveM's remark about this not being in the original code, but here is
nonetheless the requested change.

 net/core/fib_rules.c | 8 ++++++++
 1 file changed, 8 insertions(+)

diff --git a/net/core/fib_rules.c b/net/core/fib_rules.c
index 126ffc5bc630..bc8425d81022 100644
--- a/net/core/fib_rules.c
+++ b/net/core/fib_rules.c
@@ -416,6 +416,14 @@ static struct fib_rule *rule_find(struct fib_rules_ops *ops,
 		if (rule->mark && r->mark != rule->mark)
 			continue;
 
+		if (rule->suppress_ifgroup != -1 &&
+		    r->suppress_ifgroup != rule->suppress_ifgroup)
+			continue;
+
+		if (rule->suppress_prefixlen != -1 &&
+		    r->suppress_prefixlen != rule->suppress_prefixlen)
+			continue;
+
 		if (rule->mark_mask && r->mark_mask != rule->mark_mask)
 			continue;
 
-- 
2.17.1

^ permalink raw reply related

* Re: [PATCH net 2/2] nfp: reject binding to shared blocks
From: kbuild test robot @ 2018-06-25 23:37 UTC (permalink / raw)
  To: Jakub Kicinski
  Cc: kbuild-all, davem, oss-drivers, netdev, John Hurley,
	Jakub Kicinski
In-Reply-To: <20180625202246.28871-3-jakub.kicinski@netronome.com>

[-- Attachment #1: Type: text/plain, Size: 1883 bytes --]

Hi John,

Thank you for the patch! Yet something to improve:

[auto build test ERROR on net/master]

url:    https://github.com/0day-ci/linux/commits/Jakub-Kicinski/nfp-MPLS-and-shared-blocks-TC-offload-fixes/20180626-042358
config: i386-randconfig-x001-201825 (attached as .config)
compiler: gcc-7 (Debian 7.3.0-16) 7.3.0
reproduce:
        # save the attached .config to linux build tree
        make ARCH=i386 

All errors (new ones prefixed by >>):

   drivers/net/ethernet/netronome/nfp/flower/offload.c: In function 'nfp_flower_setup_tc_block':
>> drivers/net/ethernet/netronome/nfp/flower/offload.c:634:6: error: implicit declaration of function 'tcf_block_shared'; did you mean 'tcf_block_dev'? [-Werror=implicit-function-declaration]
     if (tcf_block_shared(f->block))
         ^~~~~~~~~~~~~~~~
         tcf_block_dev
   cc1: some warnings being treated as errors

vim +634 drivers/net/ethernet/netronome/nfp/flower/offload.c

   625	
   626	static int nfp_flower_setup_tc_block(struct net_device *netdev,
   627					     struct tc_block_offload *f)
   628	{
   629		struct nfp_repr *repr = netdev_priv(netdev);
   630	
   631		if (f->binder_type != TCF_BLOCK_BINDER_TYPE_CLSACT_INGRESS)
   632			return -EOPNOTSUPP;
   633	
 > 634		if (tcf_block_shared(f->block))
   635			return -EOPNOTSUPP;
   636	
   637		switch (f->command) {
   638		case TC_BLOCK_BIND:
   639			return tcf_block_cb_register(f->block,
   640						     nfp_flower_setup_tc_block_cb,
   641						     repr, repr);
   642		case TC_BLOCK_UNBIND:
   643			tcf_block_cb_unregister(f->block,
   644						nfp_flower_setup_tc_block_cb,
   645						repr);
   646			return 0;
   647		default:
   648			return -EOPNOTSUPP;
   649		}
   650	}
   651	

---
0-DAY kernel test infrastructure                Open Source Technology Center
https://lists.01.org/pipermail/kbuild-all                   Intel Corporation

[-- Attachment #2: .config.gz --]
[-- Type: application/gzip, Size: 33423 bytes --]

^ permalink raw reply

* [PATCH 3/4] net: lan78xx: Add support for VLAN tag stripping.
From: Dave Stevenson @ 2018-06-25 14:07 UTC (permalink / raw)
  To: woojung.huh, UNGLinuxDriver, davem, netdev; +Cc: Dave Stevenson
In-Reply-To: <cover.1529935234.git.dave.stevenson@raspberrypi.org>

The chip supports stripping the VLAN tag and reporting it
in metadata.
Complete the support for this.

Signed-off-by: Dave Stevenson <dave.stevenson@raspberrypi.org>
---
 drivers/net/usb/lan78xx.c | 22 ++++++++++++++++++++++
 1 file changed, 22 insertions(+)

diff --git a/drivers/net/usb/lan78xx.c b/drivers/net/usb/lan78xx.c
index afe7fa3..f72a8f5 100644
--- a/drivers/net/usb/lan78xx.c
+++ b/drivers/net/usb/lan78xx.c
@@ -64,6 +64,7 @@
 #define DEFAULT_RX_CSUM_ENABLE		(true)
 #define DEFAULT_TSO_CSUM_ENABLE		(true)
 #define DEFAULT_VLAN_FILTER_ENABLE	(true)
+#define DEFAULT_VLAN_RX_OFFLOAD		(true)
 #define TX_OVERHEAD			(8)
 #define RXW_PADDING			2
 
@@ -2363,6 +2364,11 @@ static int lan78xx_set_features(struct net_device *netdev,
 		pdata->rfe_ctl &= ~(RFE_CTL_ICMP_COE_ | RFE_CTL_IGMP_COE_);
 	}
 
+	if (features & NETIF_F_HW_VLAN_CTAG_RX)
+		pdata->rfe_ctl |= RFE_CTL_VLAN_STRIP_;
+	else
+		pdata->rfe_ctl &= ~RFE_CTL_VLAN_STRIP_;
+
 	if (features & NETIF_F_HW_VLAN_CTAG_FILTER)
 		pdata->rfe_ctl |= RFE_CTL_VLAN_FILTER_;
 	else
@@ -2976,6 +2982,9 @@ static int lan78xx_bind(struct lan78xx_net *dev, struct usb_interface *intf)
 	if (DEFAULT_TSO_CSUM_ENABLE)
 		dev->net->features |= NETIF_F_TSO | NETIF_F_TSO6 | NETIF_F_SG;
 
+	if (DEFAULT_VLAN_RX_OFFLOAD)
+		dev->net->features |= NETIF_F_HW_VLAN_CTAG_RX;
+
 	if (DEFAULT_VLAN_FILTER_ENABLE)
 		dev->net->features |= NETIF_F_HW_VLAN_CTAG_FILTER;
 
@@ -3052,6 +3061,16 @@ static void lan78xx_rx_csum_offload(struct lan78xx_net *dev,
 	}
 }
 
+static void lan78xx_rx_vlan_offload(struct lan78xx_net *dev,
+				    struct sk_buff *skb,
+				    u32 rx_cmd_a, u32 rx_cmd_b)
+{
+	if ((dev->net->features & NETIF_F_HW_VLAN_CTAG_RX) &&
+	    (rx_cmd_a & RX_CMD_A_FVTG_))
+		__vlan_hwaccel_put_tag(skb, htons(ETH_P_8021Q),
+				       (rx_cmd_b & 0xffff));
+}
+
 static void lan78xx_skb_return(struct lan78xx_net *dev, struct sk_buff *skb)
 {
 	int		status;
@@ -3116,6 +3135,8 @@ static int lan78xx_rx(struct lan78xx_net *dev, struct sk_buff *skb)
 			if (skb->len == size) {
 				lan78xx_rx_csum_offload(dev, skb,
 							rx_cmd_a, rx_cmd_b);
+				lan78xx_rx_vlan_offload(dev, skb,
+							rx_cmd_a, rx_cmd_b);
 
 				skb_trim(skb, skb->len - 4); /* remove fcs */
 				skb->truesize = size + sizeof(struct sk_buff);
@@ -3134,6 +3155,7 @@ static int lan78xx_rx(struct lan78xx_net *dev, struct sk_buff *skb)
 			skb_set_tail_pointer(skb2, size);
 
 			lan78xx_rx_csum_offload(dev, skb2, rx_cmd_a, rx_cmd_b);
+			lan78xx_rx_vlan_offload(dev, skb2, rx_cmd_a, rx_cmd_b);
 
 			skb_trim(skb2, skb2->len - 4); /* remove fcs */
 			skb2->truesize = size + sizeof(struct sk_buff);
-- 
2.7.4

^ permalink raw reply related

* Re: [PATCH v2 net-next] net/sched: add skbprio scheduler
From: Nishanth Devarajan @ 2018-06-25 23:27 UTC (permalink / raw)
  To: Cong Wang
  Cc: Jamal Hadi Salim, Jiri Pirko, David Miller, netdev, doucette,
	michel
In-Reply-To: <CAM_iQpV9uiSA=T2ZN=gaJSuzB+dci9S9FiUx=dAQYomPORfjLg@mail.gmail.com>

On Sat, Jun 23, 2018 at 02:43:16PM -0700, Cong Wang wrote:
> On Sat, Jun 23, 2018 at 1:47 PM, Nishanth Devarajan <ndev2021@gmail.com> wrote:
> > diff --git a/include/uapi/linux/pkt_sched.h b/include/uapi/linux/pkt_sched.h
> > index 37b5096..6fd07e8 100644
> > --- a/include/uapi/linux/pkt_sched.h
> > +++ b/include/uapi/linux/pkt_sched.h
> ...
> > +#define SKBPRIO_MAX_PRIORITY 64
> > +
> > +struct tc_skbprio_qopt {
> > +       __u32   limit;          /* Queue length in packets. */
> > +};
> 
> 
> Since this is just an integer, you can just make it NLA_U32 instead
> of a struct?
> 
>

Making it NLA_U32, wouldn't that be incurring a nla_policy struct in the
code? I also feel uneasy that we'd be straying convention of having a tc qopt
struct to pass in essential parameters from userspace.

> > +static int skbprio_change(struct Qdisc *sch, struct nlattr *opt,
> > +                       struct netlink_ext_ack *extack)
> > +{
> > +       struct skbprio_sched_data *q = qdisc_priv(sch);
> > +       struct tc_skbprio_qopt *ctl = nla_data(opt);
> > +       const unsigned int min_limit = 1;
> > +
> > +       if (ctl->limit == (typeof(ctl->limit))-1)
> > +               q->max_limit = max(qdisc_dev(sch)->tx_queue_len, min_limit);
> > +       else if (ctl->limit < min_limit ||
> > +               ctl->limit > qdisc_dev(sch)->tx_queue_len)
> > +               return -EINVAL;
> > +       else
> > +               q->max_limit = ctl->limit;
> > +
> > +       return 0;
> > +}
> 
> Isn't q->max_limit same with sch->limit?
>

q->max_limit was intended to represent the maximum limit that Skbprio could
accomodate i.e the tx queue len of the device attached to the qdisc, to check
the limit parameter passed from userspace. I'll correct this in v3.
 
> Also, please avoid dev->tx_queue_len here, it may change
> independently of your qdisc change, unless you want to implement
> ops->change_tx_queue_len().

OK, will make this change.

^ permalink raw reply

* [PATCH net-next] selftests: forwarding: mirror_gre_vlan_bridge_1q: Unset rp_filter
From: Petr Machata @ 2018-06-25 23:20 UTC (permalink / raw)
  To: netdev, linux-kselftest; +Cc: davem, shuah

The IP addresses of tunnel endpoint at H3 are set at the VLAN device
$h3.555. Therefore when test_gretap_untagged_egress() sets vlan 555 to
egress untagged at $swp3, $h3's rp_filter rejects these packets. The
test then spuriously fails.

Therefore turn off net.ipv4.conf.{all, $h3}.rp_filter.

Fixes: 9c7c8a82442c ("selftests: forwarding: mirror_gre_vlan_bridge_1q: Add more tests")
Signed-off-by: Petr Machata <petrm@mellanox.com>
Reviewed-by: Ido Schimmel <idosch@mellanox.com>
---
 .../selftests/net/forwarding/mirror_gre_vlan_bridge_1q.sh        | 9 +++++++++
 1 file changed, 9 insertions(+)

diff --git a/tools/testing/selftests/net/forwarding/mirror_gre_vlan_bridge_1q.sh b/tools/testing/selftests/net/forwarding/mirror_gre_vlan_bridge_1q.sh
index 5dbc7a08f4bd..1ac5038ae256 100755
--- a/tools/testing/selftests/net/forwarding/mirror_gre_vlan_bridge_1q.sh
+++ b/tools/testing/selftests/net/forwarding/mirror_gre_vlan_bridge_1q.sh
@@ -39,6 +39,12 @@ setup_prepare()
 	swp3=${NETIFS[p5]}
 	h3=${NETIFS[p6]}
 
+	# gt4's remote address is at $h3.555, not $h3. Thus the packets arriving
+	# directly to $h3 for test_gretap_untagged_egress() are rejected by
+	# rp_filter and the test spuriously fails.
+	sysctl_set net.ipv4.conf.all.rp_filter 0
+	sysctl_set net.ipv4.conf.$h3.rp_filter 0
+
 	vrf_prepare
 	mirror_gre_topo_create
 
@@ -65,6 +71,9 @@ cleanup()
 
 	mirror_gre_topo_destroy
 	vrf_cleanup
+
+	sysctl_restore net.ipv4.conf.$h3.rp_filter
+	sysctl_restore net.ipv4.conf.all.rp_filter
 }
 
 test_vlan_match()
-- 
2.4.11

^ permalink raw reply related

* Re: [PATCH net-next 3/4] netdevsim: add ipsec offload testing
From: Jakub Kicinski @ 2018-06-25 23:09 UTC (permalink / raw)
  To: Shannon Nelson; +Cc: davem, netdev, anders.roxell, linux-kselftest
In-Reply-To: <c9658422-13a3-b926-cc18-c759a1a0d343@oracle.com>

On Mon, 25 Jun 2018 15:37:10 -0700, Shannon Nelson wrote:
> On 6/22/2018 9:07 PM, Jakub Kicinski wrote:
> > On Fri, 22 Jun 2018 17:31:37 -0700, Shannon Nelson wrote:  
> >> Implement the IPsec/XFRM offload API for testing.
> >>
> >> Signed-off-by: Shannon Nelson <shannon.nelson@oracle.com>  
> >> +#define NSIM_IPSEC_AUTH_BITS	128
> >> +
> >> +/**
> >> + * nsim_ipsec_dbg_read - read for ipsec data
> >> + * @filp: the opened file
> >> + * @buffer: where to write the data for the user to read
> >> + * @count: the size of the user's buffer
> >> + * @ppos: file position offset
> >> + **/
> >> +static ssize_t nsim_dbg_netdev_ops_read(struct file *filp,  
> > 
> > Doesn't match the kdoc.  Please run
> > 
> > ./scripts/kernel-doc -none $file
> > 
> > if you want kdoc.  Although IMHO you may as well drop the kdoc, your
> > code is quite self explanatory and local.  
> 
> By adding -v to that I got a couple of warnings that I didn't include 
> the Return information - is that what you were commenting on?  The rest 
> seems acceptable to the script I'm using from the net-next tree.

Hm, strange.  Two things: first kdoc requires () after function name,
second the function is called nsim_dbg_netdev_ops_read() while the doc
refers to nsim_ipsec_dbg_read().  Perhaps the combination of the two
makes the script miss the problem. 

> >> +					char __user *buffer,
> >> +					size_t count, loff_t *ppos)
> >> +{
> >> +	struct netdevsim *ns = filp->private_data;
> >> +	struct nsim_ipsec *ipsec = &ns->ipsec;
> >> +	size_t bufsize;
> >> +	char *buf, *p;
> >> +	int len;
> >> +	int i;
> >> +
> >> +	/* don't allow partial reads */
> >> +	if (*ppos != 0)
> >> +		return 0;
> >> +
> >> +	/* the buffer needed is
> >> +	 * (num SAs * 3 lines each * ~60 bytes per line) + one more line
> >> +	 */
> >> +	bufsize = (ipsec->count * 4 * 60) + 60;
> >> +	buf = kzalloc(bufsize, GFP_KERNEL);
> >> +	if (!buf)
> >> +		return -ENOMEM;
> >> +
> >> +	p = buf;
> >> +	p += snprintf(p, bufsize - (p - buf),
> >> +		      "SA count=%u tx=%u\n",
> >> +		      ipsec->count, ipsec->tx);
> >> +
> >> +	for (i = 0; i < NSIM_IPSEC_MAX_SA_COUNT; i++) {
> >> +		struct nsim_sa *sap = &ipsec->sa[i];
> >> +
> >> +		if (!sap->used)
> >> +			continue;
> >> +
> >> +		p += snprintf(p, bufsize - (p - buf),
> >> +			      "sa[%i] %cx ipaddr=0x%08x %08x %08x %08x\n",
> >> +			      i, (sap->rx ? 'r' : 't'), sap->ipaddr[0],
> >> +			      sap->ipaddr[1], sap->ipaddr[2], sap->ipaddr[3]);
> >> +		p += snprintf(p, bufsize - (p - buf),
> >> +			      "sa[%i]    spi=0x%08x proto=0x%x salt=0x%08x crypt=%d\n",
> >> +			      i, be32_to_cpu(sap->xs->id.spi),
> >> +			      sap->xs->id.proto, sap->salt, sap->crypt);
> >> +		p += snprintf(p, bufsize - (p - buf),
> >> +			      "sa[%i]    key=0x%08x %08x %08x %08x\n",
> >> +			      i, sap->key[0], sap->key[1],
> >> +			      sap->key[2], sap->key[3]);
> >> +	}
> >> +
> >> +	len = simple_read_from_buffer(buffer, count, ppos, buf, p - buf);  
> > 
> > Why not seq_file for this?  
> 
> Why bother with more interface code?  This is useful enough to support 
> the API testing needed.

No objection on this, seq_file is less error prone, but I don't mind.
FWIW you can drop the *ppos == 0 requirement, simple_read_from_buffer()
will handle other cases just fine.

> >> +	kfree(buf);
> >> +	return len;
> >> +}
> >> +
> >> +static const struct file_operations ipsec_dbg_fops = {
> >> +	.owner = THIS_MODULE,
> >> +	.open = simple_open,
> >> +	.read = nsim_dbg_netdev_ops_read,
> >> +};
> >> +
> >> +/**
> >> + * nsim_ipsec_find_empty_idx - find the first unused security parameter index
> >> + * @ipsec: pointer to ipsec struct
> >> + **/
> >> +static int nsim_ipsec_find_empty_idx(struct nsim_ipsec *ipsec)
> >> +{
> >> +	u32 i;
> >> +
> >> +	if (ipsec->count == NSIM_IPSEC_MAX_SA_COUNT)
> >> +		return -ENOSPC;
> >> +
> >> +	/* search sa table */
> >> +	for (i = 0; i < NSIM_IPSEC_MAX_SA_COUNT; i++) {
> >> +		if (!ipsec->sa[i].used)
> >> +			return i;
> >> +	}
> >> +
> >> +	return -ENOSPC;  
> > 
> > FWIW I personally find bitmaps and find_first_zero_bit() etc. nice and
> > concise for a small ID allocator, but no objection to open coding.  
> 
> Sure, we could add a parallel bitmap data structure to track usage of 
> our array elements, and probably would for a much larger array so as to 
> lessen the impact of a serial search.  But, since this is a short array 
> for simple testing purposes, the search time is minimal so I think the 
> simple code is fine.

Ack, no objection.

> >   
> >> +	} else if (key_len == 128) {
> >> +		*mysalt = 0;
> >> +	} else {
> >> +		netdev_err(dev, "IPsec hw offload only supports 128 bit keys with optional 32 bit salt\n");
> >> +		return -EINVAL;
> >> +	}
> >> +	memcpy(mykey, key_data, 16);
> >> +
> >> +	return 0;
> >> +}
> >> +
> >> +/**
> >> + * nsim_ipsec_add_sa - program device with a security association
> >> + * @xs: pointer to transformer state struct
> >> + **/
> >> +static int nsim_ipsec_add_sa(struct xfrm_state *xs)
> >> +{
> >> +	struct net_device *dev = xs->xso.dev;
> >> +	struct netdevsim *ns = netdev_priv(dev);
> >> +	struct nsim_ipsec *ipsec = &ns->ipsec;  
> > 
> > xmas tree again (initialize out of line if you have to)  
> 
> This one is pretty much the way I've done in the past with no complaints 
> and seems common enough in other net drivers, specifically when dealing 
> with netdev and netdevpriv elements.  Only the first line is out of 
> place, with the next lines dependent on it.

I know, but I'd really prefer you just followed the rule here.

> >> diff --git a/drivers/net/netdevsim/netdevsim.h b/drivers/net/netdevsim/netdevsim.h
> >> index 3a8581a..1708dee 100644
> >> --- a/drivers/net/netdevsim/netdevsim.h
> >> +++ b/drivers/net/netdevsim/netdevsim.h
> >> @@ -29,6 +29,29 @@ struct bpf_prog;
> >>   struct dentry;
> >>   struct nsim_vf_config;
> >>   
> >> +#if IS_ENABLED(CONFIG_XFRM_OFFLOAD)
> >> +#define NSIM_IPSEC_MAX_SA_COUNT		33  
> > 
> > 33 caught my eye - out of curiosity is it 2^5 + 1 to catch some type of
> > bug or failure mode?  
> 
> For test rigs, I often use something like this to help flush out any 
> interesting power-of-two or alignment assumptions in the code.  I don't 
> expect anything here, but it doesn't hurt.

Cool, seems like a good idea anyway!

> >> +#define NSIM_IPSEC_VALID		BIT(31)
> >> +
> >> +struct nsim_sa {
> >> +	struct xfrm_state *xs;
> >> +	__be32 ipaddr[4];
> >> +	u32 key[4];
> >> +	u32 salt;
> >> +	bool used;
> >> +	bool crypt;
> >> +	bool rx;
> >> +};
> >> +
> >> +struct nsim_ipsec {
> >> +	struct nsim_sa sa[NSIM_IPSEC_MAX_SA_COUNT];
> >> +	struct dentry *pfile;
> >> +	u32 count;
> >> +	u32 tx;
> >> +	u32 ok;
> >> +};
> >> +#endif  
> > 
> > No need to wrap struct definitions in #if/#endif.  
> 
> I suppose this is a philosophical point... Since CONFIG_XFRM_OFFLOAD is 
> not yet a common config setting, I'd like to keep it here to not break 
> other folks' builds or dirty them up with unused struct definitions when 
> they aren't playing with IPsec offload anyway.

The fewer ifdefs in the code the better.  Both from pure LoC stand
point but also because someone may actually change things around
without having CONFIG_XFRM_OFFLOAD enabled and break *your*
configuration.  E.g. you are depending on indirect include of
net/xfrm.h, what if someone was to change headers around and broke that
implicit dependency?  The fewer ifdefs the better.

^ permalink raw reply

* Re: [PATCH 07/14] net: davinci_emac: use nvmem to retrieve the mac address
From: Grygorii Strashko @ 2018-06-25 23:09 UTC (permalink / raw)
  To: Bartosz Golaszewski, Sekhar Nori, Kevin Hilman, Russell King,
	David S . Miller, Srinivas Kandagatla, Lukas Wunner, Rob Herring,
	Florian Fainelli, Dan Carpenter, Ivan Khoronzhuk, David Lechner,
	Greg Kroah-Hartman
  Cc: netdev, linux-omap, linux-kernel, linux-arm-kernel,
	Bartosz Golaszewski
In-Reply-To: <20180625155025.12567-8-brgl@bgdev.pl>



On 06/25/2018 10:50 AM, Bartosz Golaszewski wrote:
> From: Bartosz Golaszewski <bgolaszewski@baylibre.com>
> 
> All users which store the MAC address in EEPROM now register relevant
> nvmem cells. Switch to retrieving the MAC address over the nvmem
> framework.
> 
> Signed-off-by: Bartosz Golaszewski <bgolaszewski@baylibre.com>
> ---
>   drivers/net/ethernet/ti/davinci_emac.c | 29 +++++++++++++++++---------
>   1 file changed, 19 insertions(+), 10 deletions(-)
> 
> diff --git a/drivers/net/ethernet/ti/davinci_emac.c b/drivers/net/ethernet/ti/davinci_emac.c
> index a1a6445b5a7e..22c2322e46be 100644
> --- a/drivers/net/ethernet/ti/davinci_emac.c
> +++ b/drivers/net/ethernet/ti/davinci_emac.c
> @@ -67,7 +67,7 @@
>   #include <linux/of_irq.h>
>   #include <linux/of_net.h>
>   #include <linux/mfd/syscon.h>
> -
> +#include <linux/nvmem-consumer.h>
>   #include <asm/irq.h>
>   #include <asm/page.h>
>   
> @@ -1696,7 +1696,6 @@ davinci_emac_of_get_pdata(struct platform_device *pdev, struct emac_priv *priv)
>   	const struct of_device_id *match;
>   	const struct emac_platform_data *auxdata;
>   	struct emac_platform_data *pdata = NULL;
> -	const u8 *mac_addr;
>   
>   	if (!IS_ENABLED(CONFIG_OF) || !pdev->dev.of_node)
>   		return dev_get_platdata(&pdev->dev);
> @@ -1708,12 +1707,6 @@ davinci_emac_of_get_pdata(struct platform_device *pdev, struct emac_priv *priv)
>   	np = pdev->dev.of_node;
>   	pdata->version = EMAC_VERSION_2;
>   
> -	if (!is_valid_ether_addr(pdata->mac_addr)) {
> -		mac_addr = of_get_mac_address(np);
> -		if (mac_addr)
> -			ether_addr_copy(pdata->mac_addr, mac_addr);
> -	}
> -

Not sure it is correct. of_get_mac_address() is called when board data doesn't
provide MAC address with expectation that MAC address is defined in DT (usually by u-boot).
Standard DT properties for MAC address "local-mac-address" and "mac-address".

So, you can't just drop it and replace with nvmem.

Two options are here:
1) try to read MAC-address from nvmem then try DT (as it's now)
2) try to read DT then try nvmem [then random MAC | fail]


>   	of_property_read_u32(np, "ti,davinci-ctrl-reg-offset",
>   			     &pdata->ctrl_reg_offset);
>   
> @@ -1783,7 +1776,9 @@ static int davinci_emac_probe(struct platform_device *pdev)
>   	struct cpdma_params dma_params;
>   	struct clk *emac_clk;
>   	unsigned long emac_bus_frequency;
> -
> +	struct nvmem_cell *cell;
> +	void *mac_addr;
> +	size_t mac_addr_len;
>   
>   	/* obtain emac clock from kernel */
>   	emac_clk = devm_clk_get(&pdev->dev, NULL);
> @@ -1815,8 +1810,22 @@ static int davinci_emac_probe(struct platform_device *pdev)
>   		goto err_free_netdev;
>   	}
>   
> +	cell = nvmem_cell_get(&pdev->dev, "mac-address");
> +	if (!IS_ERR(cell)) {
> +		mac_addr = nvmem_cell_read(cell, &mac_addr_len);
> +		if (!IS_ERR(mac_addr)) {
> +			if (is_valid_ether_addr(mac_addr)) {
> +				dev_info(&pdev->dev,
> +					 "Read MAC addr from EEPROM: %pM\n",
> +					 mac_addr);
> +				memcpy(priv->mac_addr, mac_addr, ETH_ALEN);
> +			}
> +			kfree(mac_addr);
> +		}
> +		nvmem_cell_put(cell);
> +	}
> +
>   	/* MAC addr and PHY mask , RMII enable info from platform_data */
> -	memcpy(priv->mac_addr, pdata->mac_addr, ETH_ALEN);
>   	priv->phy_id = pdata->phy_id;
>   	priv->rmii_en = pdata->rmii_en;
>   	priv->version = pdata->version;
> 

-- 
regards,
-grygorii

^ permalink raw reply

* [net-next PATCH v4 7/7] Documentation: Add explanation for XPS using Rx-queue(s) map
From: Amritha Nambiar @ 2018-06-25 18:04 UTC (permalink / raw)
  To: netdev, davem
  Cc: alexander.h.duyck, willemdebruijn.kernel, amritha.nambiar,
	sridhar.samudrala, alexander.duyck, edumazet, hannes, tom
In-Reply-To: <152994950582.9733.3330634251364177102.stgit@anamhost.jf.intel.com>

Signed-off-by: Amritha Nambiar <amritha.nambiar@intel.com>
---
 Documentation/ABI/testing/sysfs-class-net-queues |   11 ++++
 Documentation/networking/scaling.txt             |   57 ++++++++++++++++++----
 2 files changed, 58 insertions(+), 10 deletions(-)

diff --git a/Documentation/ABI/testing/sysfs-class-net-queues b/Documentation/ABI/testing/sysfs-class-net-queues
index 0c0df91..978b763 100644
--- a/Documentation/ABI/testing/sysfs-class-net-queues
+++ b/Documentation/ABI/testing/sysfs-class-net-queues
@@ -42,6 +42,17 @@ Description:
 		network device transmit queue. Possible vaules depend on the
 		number of available CPU(s) in the system.
 
+What:		/sys/class/<iface>/queues/tx-<queue>/xps_rxqs
+Date:		June 2018
+KernelVersion:	4.18.0
+Contact:	netdev@vger.kernel.org
+Description:
+		Mask of the receive queue(s) currently enabled to participate
+		into the Transmit Packet Steering packet processing flow for this
+		network device transmit queue. Possible values depend on the
+		number of available receive queue(s) in the network device.
+		Default is disabled.
+
 What:		/sys/class/<iface>/queues/tx-<queue>/byte_queue_limits/hold_time
 Date:		November 2011
 KernelVersion:	3.3
diff --git a/Documentation/networking/scaling.txt b/Documentation/networking/scaling.txt
index f55639d..8336116 100644
--- a/Documentation/networking/scaling.txt
+++ b/Documentation/networking/scaling.txt
@@ -366,8 +366,13 @@ XPS: Transmit Packet Steering
 
 Transmit Packet Steering is a mechanism for intelligently selecting
 which transmit queue to use when transmitting a packet on a multi-queue
-device. To accomplish this, a mapping from CPU to hardware queue(s) is
-recorded. The goal of this mapping is usually to assign queues
+device. This can be accomplished by recording two kinds of maps, either
+a mapping of CPU to hardware queue(s) or a mapping of receive queue(s)
+to hardware transmit queue(s).
+
+1. XPS using CPUs map
+
+The goal of this mapping is usually to assign queues
 exclusively to a subset of CPUs, where the transmit completions for
 these queues are processed on a CPU within this set. This choice
 provides two benefits. First, contention on the device queue lock is
@@ -377,12 +382,35 @@ transmit queue). Secondly, cache miss rate on transmit completion is
 reduced, in particular for data cache lines that hold the sk_buff
 structures.
 
-XPS is configured per transmit queue by setting a bitmap of CPUs that
-may use that queue to transmit. The reverse mapping, from CPUs to
-transmit queues, is computed and maintained for each network device.
-When transmitting the first packet in a flow, the function
-get_xps_queue() is called to select a queue. This function uses the ID
-of the running CPU as a key into the CPU-to-queue lookup table. If the
+2. XPS using receive queues map
+
+This mapping is used to pick transmit queue based on the receive
+queue(s) map configuration set by the administrator. A set of receive
+queues can be mapped to a set of transmit queues (many:many), although
+the common use case is a 1:1 mapping. This will enable sending packets
+on the same queue associations for transmit and receive. This is useful for
+busy polling multi-threaded workloads where there are challenges in
+associating a given CPU to a given application thread. The application
+threads are not pinned to CPUs and each thread handles packets
+received on a single queue. The receive queue number is cached in the
+socket for the connection. In this model, sending the packets on the same
+transmit queue corresponding to the associated receive queue has benefits
+in keeping the CPU overhead low. Transmit completion work is locked into
+the same queue-association that a given application is polling on. This
+avoids the overhead of triggering an interrupt on another CPU. When the
+application cleans up the packets during the busy poll, transmit completion
+may be processed along with it in the same thread context and so result in
+reduced latency.
+
+XPS is configured per transmit queue by setting a bitmap of
+CPUs/receive-queues that may use that queue to transmit. The reverse
+mapping, from CPUs to transmit queues or from receive-queues to transmit
+queues, is computed and maintained for each network device. When
+transmitting the first packet in a flow, the function get_xps_queue() is
+called to select a queue. This function uses the ID of the receive queue
+for the socket connection for a match in the receive queue-to-transmit queue
+lookup table. Alternatively, this function can also use the ID of the
+running CPU as a key into the CPU-to-queue lookup table. If the
 ID matches a single queue, that is used for transmission. If multiple
 queues match, one is selected by using the flow hash to compute an index
 into the set.
@@ -404,11 +432,15 @@ acknowledged.
 
 XPS is only available if the kconfig symbol CONFIG_XPS is enabled (on by
 default for SMP). The functionality remains disabled until explicitly
-configured. To enable XPS, the bitmap of CPUs that may use a transmit
-queue is configured using the sysfs file entry:
+configured. To enable XPS, the bitmap of CPUs/receive-queues that may
+use a transmit queue is configured using the sysfs file entry:
 
+For selection based on CPUs map:
 /sys/class/net/<dev>/queues/tx-<n>/xps_cpus
 
+For selection based on receive-queues map:
+/sys/class/net/<dev>/queues/tx-<n>/xps_rxqs
+
 == Suggested Configuration
 
 For a network device with a single transmission queue, XPS configuration
@@ -421,6 +453,11 @@ best CPUs to share a given queue are probably those that share the cache
 with the CPU that processes transmit completions for that queue
 (transmit interrupts).
 
+For transmit queue selection based on receive queue(s), XPS has to be
+explicitly configured mapping receive-queue(s) to transmit queue(s). If the
+user configuration for receive-queue map does not apply, then the transmit
+queue is selected based on the CPUs map.
+
 Per TX Queue rate limitation:
 =============================
 

^ permalink raw reply related

* [net-next PATCH v4 6/7] net-sysfs: Add interface for Rx queue(s) map per Tx queue
From: Amritha Nambiar @ 2018-06-25 18:04 UTC (permalink / raw)
  To: netdev, davem
  Cc: alexander.h.duyck, willemdebruijn.kernel, amritha.nambiar,
	sridhar.samudrala, alexander.duyck, edumazet, hannes, tom
In-Reply-To: <152994950582.9733.3330634251364177102.stgit@anamhost.jf.intel.com>

Extend transmit queue sysfs attribute to configure Rx queue(s) map
per Tx queue. By default no receive queues are configured for the
Tx queue.

- /sys/class/net/eth0/queues/tx-*/xps_rxqs

Signed-off-by: Amritha Nambiar <amritha.nambiar@intel.com>
---
 net/core/net-sysfs.c |   81 ++++++++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 81 insertions(+)

diff --git a/net/core/net-sysfs.c b/net/core/net-sysfs.c
index b39987c..5d2ed02 100644
--- a/net/core/net-sysfs.c
+++ b/net/core/net-sysfs.c
@@ -1283,6 +1283,86 @@ static ssize_t xps_cpus_store(struct netdev_queue *queue,
 
 static struct netdev_queue_attribute xps_cpus_attribute __ro_after_init
 	= __ATTR_RW(xps_cpus);
+
+static ssize_t xps_rxqs_show(struct netdev_queue *queue, char *buf)
+{
+	struct net_device *dev = queue->dev;
+	struct xps_dev_maps *dev_maps;
+	unsigned long *mask, index;
+	int j, len, num_tc = 1, tc = 0;
+
+	mask = kcalloc(BITS_TO_LONGS(dev->num_rx_queues), sizeof(long),
+		       GFP_KERNEL);
+	if (!mask)
+		return -ENOMEM;
+
+	index = get_netdev_queue_index(queue);
+
+	if (dev->num_tc) {
+		num_tc = dev->num_tc;
+		tc = netdev_txq_to_tc(dev, index);
+		if (tc < 0)
+			return -EINVAL;
+	}
+
+	rcu_read_lock();
+	dev_maps = rcu_dereference(dev->xps_rxqs_map);
+	if (dev_maps) {
+		for (j = -1; j = attrmask_next(j, NULL, dev->num_rx_queues),
+		     j < dev->num_rx_queues;) {
+			int i, tci = j * num_tc + tc;
+			struct xps_map *map;
+
+			map = rcu_dereference(dev_maps->attr_map[tci]);
+			if (!map)
+				continue;
+
+			for (i = map->len; i--;) {
+				if (map->queues[i] == index) {
+					set_bit(j, mask);
+					break;
+				}
+			}
+		}
+	}
+
+	len = bitmap_print_to_pagebuf(false, buf, mask, dev->num_rx_queues);
+	rcu_read_unlock();
+	kfree(mask);
+
+	return len < PAGE_SIZE ? len : -EINVAL;
+}
+
+static ssize_t xps_rxqs_store(struct netdev_queue *queue, const char *buf,
+			      size_t len)
+{
+	struct net_device *dev = queue->dev;
+	unsigned long *mask, index;
+	int err;
+
+	if (!capable(CAP_NET_ADMIN))
+		return -EPERM;
+
+	mask = kcalloc(BITS_TO_LONGS(dev->num_rx_queues), sizeof(long),
+		       GFP_KERNEL);
+	if (!mask)
+		return -ENOMEM;
+
+	index = get_netdev_queue_index(queue);
+
+	err = bitmap_parse(buf, len, mask, dev->num_rx_queues);
+	if (err) {
+		kfree(mask);
+		return err;
+	}
+
+	err = __netif_set_xps_queue(dev, mask, index, true);
+	kfree(mask);
+	return err ? : len;
+}
+
+static struct netdev_queue_attribute xps_rxqs_attribute __ro_after_init
+	= __ATTR_RW(xps_rxqs);
 #endif /* CONFIG_XPS */
 
 static struct attribute *netdev_queue_default_attrs[] __ro_after_init = {
@@ -1290,6 +1370,7 @@ static struct attribute *netdev_queue_default_attrs[] __ro_after_init = {
 	&queue_traffic_class.attr,
 #ifdef CONFIG_XPS
 	&xps_cpus_attribute.attr,
+	&xps_rxqs_attribute.attr,
 	&queue_tx_maxrate.attr,
 #endif
 	NULL

^ permalink raw reply related

* [net-next PATCH v4 5/7] net: Enable Tx queue selection based on Rx queues
From: Amritha Nambiar @ 2018-06-25 18:04 UTC (permalink / raw)
  To: netdev, davem
  Cc: alexander.h.duyck, willemdebruijn.kernel, amritha.nambiar,
	sridhar.samudrala, alexander.duyck, edumazet, hannes, tom
In-Reply-To: <152994950582.9733.3330634251364177102.stgit@anamhost.jf.intel.com>

This patch adds support to pick Tx queue based on the Rx queue(s) map
configuration set by the admin through the sysfs attribute
for each Tx queue. If the user configuration for receive queue(s) map
does not apply, then the Tx queue selection falls back to CPU(s) map
based selection and finally to hashing.

Signed-off-by: Amritha Nambiar <amritha.nambiar@intel.com>
---
 include/net/sock.h |    4 +++
 net/core/dev.c     |   62 ++++++++++++++++++++++++++++++++++++++--------------
 2 files changed, 49 insertions(+), 17 deletions(-)

diff --git a/include/net/sock.h b/include/net/sock.h
index 0ff4416..cb18139 100644
--- a/include/net/sock.h
+++ b/include/net/sock.h
@@ -1710,6 +1710,10 @@ static inline void sk_rx_queue_set(struct sock *sk, const struct sk_buff *skb)
 #endif
 }
 
+static inline int sk_rx_queue_get(const struct sock *sk)
+{
+	return sk ? sk->sk_rx_queue_mapping - 1 : -1;
+}
 static inline void sk_set_socket(struct sock *sk, struct socket *sock)
 {
 	sk_tx_queue_clear(sk);
diff --git a/net/core/dev.c b/net/core/dev.c
index df2a78d..2450c5e 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -3454,35 +3454,63 @@ sch_handle_egress(struct sk_buff *skb, int *ret, struct net_device *dev)
 }
 #endif /* CONFIG_NET_EGRESS */
 
-static inline int get_xps_queue(struct net_device *dev, struct sk_buff *skb)
+#ifdef CONFIG_XPS
+static int __get_xps_queue_idx(struct net_device *dev, struct sk_buff *skb,
+			       struct xps_dev_maps *dev_maps, unsigned int tci)
+{
+	struct xps_map *map;
+	int queue_index = -1;
+
+	if (dev->num_tc) {
+		tci *= dev->num_tc;
+		tci += netdev_get_prio_tc_map(dev, skb->priority);
+	}
+
+	map = rcu_dereference(dev_maps->attr_map[tci]);
+	if (map) {
+		if (map->len == 1)
+			queue_index = map->queues[0];
+		else
+			queue_index = map->queues[reciprocal_scale(
+						skb_get_hash(skb), map->len)];
+		if (unlikely(queue_index >= dev->real_num_tx_queues))
+			queue_index = -1;
+	}
+	return queue_index;
+}
+#endif
+
+static int get_xps_queue(struct net_device *dev, struct sk_buff *skb)
 {
 #ifdef CONFIG_XPS
 	struct xps_dev_maps *dev_maps;
-	struct xps_map *map;
+	struct sock *sk = skb->sk;
 	int queue_index = -1;
 
 	if (!static_key_false(&xps_needed))
 		return -1;
 
 	rcu_read_lock();
-	dev_maps = rcu_dereference(dev->xps_cpus_map);
+	if (!static_key_false(&xps_rxqs_needed))
+		goto get_cpus_map;
+
+	dev_maps = rcu_dereference(dev->xps_rxqs_map);
 	if (dev_maps) {
-		unsigned int tci = skb->sender_cpu - 1;
+		int tci = sk_rx_queue_get(sk);
 
-		if (dev->num_tc) {
-			tci *= dev->num_tc;
-			tci += netdev_get_prio_tc_map(dev, skb->priority);
-		}
+		if (tci >= 0 && tci < dev->num_rx_queues)
+			queue_index = __get_xps_queue_idx(dev, skb, dev_maps,
+							  tci);
+	}
 
-		map = rcu_dereference(dev_maps->attr_map[tci]);
-		if (map) {
-			if (map->len == 1)
-				queue_index = map->queues[0];
-			else
-				queue_index = map->queues[reciprocal_scale(skb_get_hash(skb),
-									   map->len)];
-			if (unlikely(queue_index >= dev->real_num_tx_queues))
-				queue_index = -1;
+get_cpus_map:
+	if (queue_index < 0) {
+		dev_maps = rcu_dereference(dev->xps_cpus_map);
+		if (dev_maps) {
+			unsigned int tci = skb->sender_cpu - 1;
+
+			queue_index = __get_xps_queue_idx(dev, skb, dev_maps,
+							  tci);
 		}
 	}
 	rcu_read_unlock();

^ permalink raw reply related

* [net-next PATCH v4 4/7] net: Record receive queue number for a connection
From: Amritha Nambiar @ 2018-06-25 18:04 UTC (permalink / raw)
  To: netdev, davem
  Cc: alexander.h.duyck, willemdebruijn.kernel, amritha.nambiar,
	sridhar.samudrala, alexander.duyck, edumazet, hannes, tom
In-Reply-To: <152994950582.9733.3330634251364177102.stgit@anamhost.jf.intel.com>

This patch adds a new field to sock_common 'skc_rx_queue_mapping'
which holds the receive queue number for the connection. The Rx queue
is marked in tcp_finish_connect() to allow a client app to do
SO_INCOMING_NAPI_ID after a connect() call to get the right queue
association for a socket. Rx queue is also marked in tcp_conn_request()
to allow syn-ack to go on the right tx-queue associated with
the queue on which syn is received.

Signed-off-by: Amritha Nambiar <amritha.nambiar@intel.com>
Signed-off-by: Sridhar Samudrala <sridhar.samudrala@intel.com>
---
 include/net/busy_poll.h |    1 +
 include/net/sock.h      |   14 ++++++++++++++
 net/core/sock.c         |    4 ++++
 net/ipv4/tcp_input.c    |    3 +++
 4 files changed, 22 insertions(+)

diff --git a/include/net/busy_poll.h b/include/net/busy_poll.h
index c518743..9e36fda6 100644
--- a/include/net/busy_poll.h
+++ b/include/net/busy_poll.h
@@ -151,6 +151,7 @@ static inline void sk_mark_napi_id(struct sock *sk, const struct sk_buff *skb)
 #ifdef CONFIG_NET_RX_BUSY_POLL
 	sk->sk_napi_id = skb->napi_id;
 #endif
+	sk_rx_queue_set(sk, skb);
 }
 
 /* variant used for unconnected sockets */
diff --git a/include/net/sock.h b/include/net/sock.h
index 009fd30..0ff4416 100644
--- a/include/net/sock.h
+++ b/include/net/sock.h
@@ -139,6 +139,7 @@ typedef __u64 __bitwise __addrpair;
  *	@skc_node: main hash linkage for various protocol lookup tables
  *	@skc_nulls_node: main hash linkage for TCP/UDP/UDP-Lite protocol
  *	@skc_tx_queue_mapping: tx queue number for this connection
+ *	@skc_rx_queue_mapping: rx queue number for this connection
  *	@skc_flags: place holder for sk_flags
  *		%SO_LINGER (l_onoff), %SO_BROADCAST, %SO_KEEPALIVE,
  *		%SO_OOBINLINE settings, %SO_TIMESTAMPING settings
@@ -215,6 +216,9 @@ struct sock_common {
 		struct hlist_nulls_node skc_nulls_node;
 	};
 	unsigned short		skc_tx_queue_mapping;
+#ifdef CONFIG_XPS
+	unsigned short		skc_rx_queue_mapping;
+#endif
 	union {
 		int		skc_incoming_cpu;
 		u32		skc_rcv_wnd;
@@ -326,6 +330,9 @@ struct sock {
 #define sk_nulls_node		__sk_common.skc_nulls_node
 #define sk_refcnt		__sk_common.skc_refcnt
 #define sk_tx_queue_mapping	__sk_common.skc_tx_queue_mapping
+#ifdef CONFIG_XPS
+#define sk_rx_queue_mapping	__sk_common.skc_rx_queue_mapping
+#endif
 
 #define sk_dontcopy_begin	__sk_common.skc_dontcopy_begin
 #define sk_dontcopy_end		__sk_common.skc_dontcopy_end
@@ -1696,6 +1703,13 @@ static inline int sk_tx_queue_get(const struct sock *sk)
 	return sk ? sk->sk_tx_queue_mapping - 1 : -1;
 }
 
+static inline void sk_rx_queue_set(struct sock *sk, const struct sk_buff *skb)
+{
+#ifdef CONFIG_XPS
+	sk->sk_rx_queue_mapping = skb_get_rx_queue(skb) + 1;
+#endif
+}
+
 static inline void sk_set_socket(struct sock *sk, struct socket *sock)
 {
 	sk_tx_queue_clear(sk);
diff --git a/net/core/sock.c b/net/core/sock.c
index bcc4182..5e4715b 100644
--- a/net/core/sock.c
+++ b/net/core/sock.c
@@ -2818,6 +2818,10 @@ void sock_init_data(struct socket *sock, struct sock *sk)
 	sk->sk_pacing_rate = ~0U;
 	sk->sk_pacing_shift = 10;
 	sk->sk_incoming_cpu = -1;
+
+#ifdef CONFIG_XPS
+	sk->sk_rx_queue_mapping = ~0;
+#endif
 	/*
 	 * Before updating sk_refcnt, we must commit prior changes to memory
 	 * (Documentation/RCU/rculist_nulls.txt for details)
diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c
index 76ca88f..c404c53 100644
--- a/net/ipv4/tcp_input.c
+++ b/net/ipv4/tcp_input.c
@@ -78,6 +78,7 @@
 #include <linux/errqueue.h>
 #include <trace/events/tcp.h>
 #include <linux/static_key.h>
+#include <net/busy_poll.h>
 
 int sysctl_tcp_max_orphans __read_mostly = NR_FILE;
 
@@ -5584,6 +5585,7 @@ void tcp_finish_connect(struct sock *sk, struct sk_buff *skb)
 	if (skb) {
 		icsk->icsk_af_ops->sk_rx_dst_set(sk, skb);
 		security_inet_conn_established(sk, skb);
+		sk_mark_napi_id(sk, skb);
 	}
 
 	tcp_init_transfer(sk, BPF_SOCK_OPS_ACTIVE_ESTABLISHED_CB);
@@ -6412,6 +6414,7 @@ int tcp_conn_request(struct request_sock_ops *rsk_ops,
 	tcp_rsk(req)->snt_isn = isn;
 	tcp_rsk(req)->txhash = net_tx_rndhash();
 	tcp_openreq_init_rwin(req, sk, dst);
+	sk_rx_queue_set(req_to_sk(req), skb);
 	if (!want_cookie) {
 		tcp_reqsk_record_syn(sk, req, skb);
 		fastopen_sk = tcp_try_fastopen(sk, skb, req, &foc, dst);

^ permalink raw reply related

page: next (older) | prev (newer) | latest
- recent:[subjects (threaded)|topics (new)|topics (active)]

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox