Netdev List

Netdev List
 help / color / mirror / Atom feed

* Re: [PATCH net] net: sched: cls: Fix offloading when ingress dev is vxlan
From: Jakub Kicinski @ 2018-06-06 15:50 UTC (permalink / raw)
  To: Or Gerlitz
  Cc: Paul Blakey, Jiri Pirko, Cong Wang, Jamal Hadi Salim,
	David Miller, Linux Netdev List, Yevgeny Kliteynik, Roi Dayan,
	Shahar Klein, Mark Bloch, Or Gerlitz
In-Reply-To: <CAJ3xEMg0QJ2MdcXgWbe1J+7KssN9b36X_2+nBFaGAKjkcy2xPg@mail.gmail.com>

On Wed, 6 Jun 2018 08:12:20 +0300, Or Gerlitz wrote:
> On Tue, Jun 5, 2018 at 9:59 PM, Jakub Kicinski <kubakici@wp.pl> wrote:
> > On Tue, 5 Jun 2018 11:57:47 -0700, Jakub Kicinski wrote:  
> >> On Tue,  5 Jun 2018 11:04:03 +0300, Paul Blakey wrote:  
> >> > When using a vxlan device as the ingress dev, we count it as a
> >> > "no offload dev", so when such a rule comes and err stop is true,
> >> > we fail early and don't try the egdev route which can offload it
> >> > through the egress device.
> >> >
> >> > Fix that by not calling the block offload if one of the devices
> >> > attached to it is not offload capable, but make sure egress on such case
> >> > is capable instead.
> >> >
> >> > Fixes: caa7260156eb ("net: sched: keep track of offloaded filters [..]")
> >> > Reviewed-by: Roi Dayan <roid@mellanox.com>
> >> > Acked-by: Jiri Pirko <jiri@mellanox.com>
> >> > Signed-off-by: Paul Blakey <paulb@mellanox.com>  
> >>
> >> Very poor commit message.  What you're doing is re-enabling skip_sw
> >> filters on tunnel devices which semantically make no sense and
> >> shouldn't have been allowed in the first place.
> >>
> >> This will breaks block sharing between tunnels and HW netdevs (because
> >> you skip the tcf_block_cb_call() completely).  The entire egdev idea
> >> remains badly broken accepting filters like this:
> >>
> >> # tc filter add dev vxlan0 ingress \
> >>       matchall action skip_sw \
> >>               mirred egress redirect dev lo \
> >>               mirred egress redirect dev sw1np0  
> >
> > For above we probably need something like this (untested):
> >
> > diff --git a/net/sched/act_api.c b/net/sched/act_api.c
> > index 3f4cf930f809..71e5eebec31a 100644
> > --- a/net/sched/act_api.c
> > +++ b/net/sched/act_api.c
> > @@ -1511,7 +1511,7 @@ int tc_setup_cb_egdev_call(const struct net_device *dev,
> >         struct tcf_action_egdev *egdev = tcf_action_egdev_lookup(dev);
> >
> >         if (!egdev)
> > -               return 0;
> > +               return err_stop ? -EOPNOTSUPP : 0;
> >         return tcf_action_egdev_cb_call(egdev, type, type_data, err_stop);
> >  }
> >  EXPORT_SYMBOL_GPL(tc_setup_cb_egdev_call);  
> 
> Jakub,
> 
> We will test it out and let you know

That's probably insufficient, because the second action should be
skipped completely.  But an improvement nonetheless :)

> > But the correct fix is to remove egdev crutch completely IMO.  
> 
> Not against it, sometimes designs should change and be replaced
> with better ones, happens

Cool, I'm not sure when we'll get to it, but perhaps we can go over 
the ideas I presented at netconf on switchdev call next week (if it's
still active)?

^ permalink raw reply

* Re: [PATCH net] net: sched: cls: Fix offloading when ingress dev is vxlan
From: Jakub Kicinski @ 2018-06-06 15:46 UTC (permalink / raw)
  To: Or Gerlitz
  Cc: David Miller, Paul Blakey, Jiri Pirko, Cong Wang,
	Jamal Hadi Salim, Linux Netdev List, Yevgeny Kliteynik, Roi Dayan,
	Shahar Klein, Mark Bloch, Or Gerlitz
In-Reply-To: <CAJ3xEMh5TffWyD1V1w_EWGgnQuhA26O1BX8kZrdkHu3bB9+7-g@mail.gmail.com>

On Wed, 6 Jun 2018 08:15:27 +0300, Or Gerlitz wrote:
> On Wed, Jun 6, 2018 at 12:27 AM, Jakub Kicinski <kubakici@wp.pl> wrote:
> > On Tue, 05 Jun 2018 15:06:40 -0400 (EDT), David Miller wrote:  
> >> From: Jakub Kicinski <kubakici@wp.pl>
> >> Date: Tue, 5 Jun 2018 11:57:47 -0700
> >>  
> >> > Do we still care about correctness and not breaking backward
> >> > compatibility?  
> >>
> >> Jakub let me know if you want me to revert this change.  
> >
> > Yes, I think this patch introduces a regression when block is shared
> > between offload capable and in-capable device, therefore it should be
> > reverted.  Shared blocks went through a number of review cycles to
> > ensure such cases are handled correctly.
> >
> >
> > Longer story about the actual issue which is never explained in the
> > commit message is this: in kernels 4.10 - 4.14 skip_sw flag was
> > supported on tunnels in cls_flower only:
> >
> > static int fl_hw_replace_filter(struct tcf_proto *tp,
> > [...]
> >         if (!tc_can_offload(dev)) {
> >                 if (tcf_exts_get_dev(dev, &f->exts, &f->hw_dev) ||
> >                     (f->hw_dev && !tc_can_offload(f->hw_dev))) {
> >                         f->hw_dev = dev;
> >                         return tc_skip_sw(f->flags) ? -EINVAL : 0;
> >                 }
> >                 dev = f->hw_dev;
> >                 cls_flower.egress_dev = true;
> >         } else {
> >                 f->hw_dev = dev;
> >         }
> >
> >
> > In 4.15 - 4.17 with addition of shared blocks egdev mechanism was
> > promoted to a generic TC thing supported on all filters but it no
> > longer works with skip_sw.
> >
> > I'd argue skip_sw is incorrect for tunnels, because the rule will only
> > apply to traffic ingressing on ASIC ports, not all traffic which hits
> > the tunnel netdev.  
> 
> This argument makes sense, however, skip_sw for tunnel decap rules
>  **is** allowed since 4.10 and we have some sort of regression here (turns
> out that before and after the patch..)

As I said it was allowed in 4 releases, which was a mistake, in last 3
it wasn't.  I understand your use case, but the semantics of skip_sw
are not preserved here so we should find a different solution.

> > Therefore we should keep the 4.15 - 4.17 behaviour.
> > But that's a side note, I don't think we should be breaking offload on
> > shared blocks whether we decide to support skip_sw on tunnels or not.  
> 
> skip_sw on tunnels was there before shared-block, newer features should
> take care not to break existing ones.

Oh, I agree we shouldn't break existing use cases so please don't break
the use case I mentioned above.  I want to set up shared block between
a LAG and its members.  Now since the bond will not be offload-capable
TC will not even make an attempt to offload to members.

I'm gonna test that my reading of the code is correct and send a revert
later today, sorry.

^ permalink raw reply

* [PATCH v2 net-next] net/sched: add skbprio scheduler
From: Nishanth Devarajan @ 2018-06-06 15:40 UTC (permalink / raw)
  To: xiyou.wangcong, jhs, jiri, davem; +Cc: netdev, doucette, michel

net/sched: add skbprio scheduler

Skbprio (SKB Priority Queue) is a queuing discipline that prioritizes IPv4
and IPv6 packets according to their skb->priority field. Although Skbprio can
be employed in any scenario in which a higher skb->priority field means a
higher priority packet, Skbprio was concieved as a solution for
denial-of-service defenses that need to route packets with different priorities.

v2:
*Use skb->priority field rather than DS field. Rename queueing discipline as
SKB Priority Queue (previously Gatekeeper Priority Queue).

*Queueing discipline is made classful to expose Skbprio's internal priority
queues.

Signed-off-by: Nishanth Devarajan <ndev2021@gmail.com>
Reviewed-by: Sachin Paryani <sachin.paryani@gmail.com>
Reviewed-by: Cody Doucette <doucette@bu.edu>
Reviewed-by: Michel Machado <michel@digirati.com.br>
---
 include/uapi/linux/pkt_sched.h |  15 ++
 net/sched/Kconfig              |  13 ++
 net/sched/Makefile             |   1 +
 net/sched/sch_skbprio.c        | 347 +++++++++++++++++++++++++++++++++++++++++
 4 files changed, 376 insertions(+)
 create mode 100644 net/sched/sch_skbprio.c

diff --git a/include/uapi/linux/pkt_sched.h b/include/uapi/linux/pkt_sched.h
index 37b5096..6fd07e8 100644
--- a/include/uapi/linux/pkt_sched.h
+++ b/include/uapi/linux/pkt_sched.h
@@ -124,6 +124,21 @@ struct tc_fifo_qopt {
 	__u32	limit;	/* Queue length: bytes for bfifo, packets for pfifo */
 };
 
+/* SKBPRIO section */
+
+/*
+ * Priorities go from zero to (SKBPRIO_MAX_PRIORITY - 1).
+ * SKBPRIO_MAX_PRIORITY should be at least 64 in order for skbprio to be able
+ * to map one to one the DS field of IPV4 and IPV6 headers.
+ * Memory allocation grows linearly with SKBPRIO_MAX_PRIORITY.
+ */
+
+#define SKBPRIO_MAX_PRIORITY 64
+
+struct tc_skbprio_qopt {
+	__u32	limit; 	    	/* Queue length in packets. */
+};
+
 /* PRIO section */
 
 #define TCQ_PRIO_BANDS	16
diff --git a/net/sched/Kconfig b/net/sched/Kconfig
index a01169f..9ac4b53 100644
--- a/net/sched/Kconfig
+++ b/net/sched/Kconfig
@@ -240,6 +240,19 @@ config NET_SCH_MQPRIO
 
 	  If unsure, say N.
 
+config NET_SCH_SKBPRIO
+	tristate "SKB priority queue scheduler (SKBPRIO)"
+	help
+	  Say Y here if you want to use the SKB priority queue
+	  scheduler. This schedules packets according to skb->priority,
+	  which is useful for request packets in DoS mitigation systems such
+	  as Gatekeeper.
+
+	  To compile this driver as a module, choose M here: the module will
+	  be called sch_skbprio.
+
+	  If unsure, say N.
+
 config NET_SCH_CHOKE
 	tristate "CHOose and Keep responsive flow scheduler (CHOKE)"
 	help
diff --git a/net/sched/Makefile b/net/sched/Makefile
index 8811d38..a4d8893 100644
--- a/net/sched/Makefile
+++ b/net/sched/Makefile
@@ -46,6 +46,7 @@ obj-$(CONFIG_NET_SCH_NETEM)	+= sch_netem.o
 obj-$(CONFIG_NET_SCH_DRR)	+= sch_drr.o
 obj-$(CONFIG_NET_SCH_PLUG)	+= sch_plug.o
 obj-$(CONFIG_NET_SCH_MQPRIO)	+= sch_mqprio.o
+obj-$(CONFIG_NET_SCH_SKBPRIO)	+= sch_skbprio.o
 obj-$(CONFIG_NET_SCH_CHOKE)	+= sch_choke.o
 obj-$(CONFIG_NET_SCH_QFQ)	+= sch_qfq.o
 obj-$(CONFIG_NET_SCH_CODEL)	+= sch_codel.o
diff --git a/net/sched/sch_skbprio.c b/net/sched/sch_skbprio.c
new file mode 100644
index 0000000..f73ad62
--- /dev/null
+++ b/net/sched/sch_skbprio.c
@@ -0,0 +1,347 @@
+/*
+ * net/sched/sch_skbprio.c  SKB Priority Queue.
+ *
+ *		This program is free software; you can redistribute it and/or
+ *		modify it under the terms of the GNU General Public License
+ *		as published by the Free Software Foundation; either version
+ *		2 of the License, or (at your option) any later version.
+ *
+ * Authors:	Nishanth Devarajan, <ndev2021@gmail.com>
+ *		Cody Doucette, <doucette@bu.edu>
+ *	        original idea by Michel Machado, Cody Doucette, and Qiaobin Fu
+ */
+
+#include <linux/string.h>
+#include <linux/module.h>
+#include <linux/slab.h>
+#include <linux/types.h>
+#include <linux/kernel.h>
+#include <linux/errno.h>
+#include <linux/skbuff.h>
+#include <net/pkt_sched.h>
+#include <net/sch_generic.h>
+#include <net/inet_ecn.h>
+
+
+/*	  SKB Priority Queue
+ *	=================================
+ *
+ * This qdisc schedules a packet according to skb->priority, where a higher
+ * value places the packet closer to the exit of the queue. When the queue is
+ * full, the lowest priority packet in the queue is dropped to make room for
+ * the packet to be added if it has higher priority. If the packet to be added
+ * has lower priority than all packets in the queue, it is dropped.
+ *
+ * Without the SKB priority queue, queue length limits must be imposed
+ * for individual queues, and there is no easy way to enforce a global queue
+ * length limit across all priorities. With the SKBprio queue, a global
+ * queue length limit can be enforced while not restricting the queue lengths
+ * of individual priorities.
+ *
+ * This is especially useful for a denial-of-service defense system like
+ * Gatekeeper, which prioritizes packets in flows that demonstrate expected
+ * behavior of legitimate users. The queue is flexible to allow any number
+ * of packets of any priority up to the global limit of the scheduler
+ * without risking resource overconsumption by a flood of low priority packets.
+ *
+ * The Gatekeeper codebase is found here:
+ *
+ *		https://github.com/AltraMayor/gatekeeper
+ */
+
+struct skbprio_sched_data {
+	/* Parameters. */
+	u32 max_limit;
+
+	/* Queue state. */
+	struct sk_buff_head qdiscs[SKBPRIO_MAX_PRIORITY];
+	struct gnet_stats_queue qstats[SKBPRIO_MAX_PRIORITY];
+	u16 highest_prio;
+	u16 lowest_prio;
+};
+
+static u16 calc_new_high_prio(const struct skbprio_sched_data *q)
+{
+	int prio;
+
+	for (prio = q->highest_prio - 1; prio >= q->lowest_prio; prio--) {
+		if (!skb_queue_empty(&q->qdiscs[prio]))
+			return prio;
+	}
+
+	/* SKB queue is empty, return 0 (default highest priority setting). */
+	return 0;
+}
+
+static u16 calc_new_low_prio(const struct skbprio_sched_data *q)
+{
+	int prio;
+
+	for (prio = q->lowest_prio + 1; prio <= q->highest_prio; prio++) {
+		if (!skb_queue_empty(&q->qdiscs[prio]))
+			return prio;
+	}
+
+	/* SKB queue is empty, return SKBPRIO_MAX_PRIORITY - 1
+	 * (default lowest priority setting).
+	 */
+	return SKBPRIO_MAX_PRIORITY - 1;
+}
+
+static int skbprio_enqueue(struct sk_buff *skb, struct Qdisc *sch,
+			  struct sk_buff **to_free)
+{
+	struct skbprio_sched_data *q = qdisc_priv(sch);
+	struct sk_buff_head *qdisc;
+	struct sk_buff_head *lp_qdisc;
+	struct sk_buff *to_drop;
+	const unsigned int max_priority = SKBPRIO_MAX_PRIORITY - 1;
+	u16 prio, lp;
+
+	/* Obtain the priority of @skb. */
+	prio = min(skb->priority, max_priority);
+
+	qdisc = &q->qdiscs[prio];
+	if (sch->q.qlen < q->max_limit) {
+		__skb_queue_tail(qdisc, skb);
+		qdisc_qstats_backlog_inc(sch, skb);
+		q->qstats[prio].backlog += qdisc_pkt_len(skb);
+
+		/* Check to update highest and lowest priorities. */
+		if (prio > q->highest_prio)
+			q->highest_prio = prio;
+
+		if (prio < q->lowest_prio)
+			q->lowest_prio = prio;
+
+		sch->q.qlen++;
+		return NET_XMIT_SUCCESS;
+	}
+
+	/* If this packet has the lowest priority, drop it. */
+	lp = q->lowest_prio;
+	if (prio <= lp) {
+		q->qstats[prio].drops++;
+		return qdisc_drop(skb, sch, to_free);
+	}
+
+	/* Drop the packet at the tail of the lowest priority qdisc. */
+	lp_qdisc = &q->qdiscs[lp];
+	to_drop = __skb_dequeue_tail(lp_qdisc);
+	BUG_ON(!to_drop);
+	qdisc_qstats_backlog_dec(sch, to_drop);
+	qdisc_drop(to_drop, sch, to_free);
+
+	q->qstats[lp].backlog -= qdisc_pkt_len(to_drop);
+	q->qstats[lp].drops++;
+
+	__skb_queue_tail(qdisc, skb);
+	qdisc_qstats_backlog_inc(sch, skb);
+	q->qstats[prio].backlog += qdisc_pkt_len(skb);
+
+	/* Check to update highest and lowest priorities. */
+	if (skb_queue_empty(lp_qdisc)) {
+		if (q->lowest_prio == q->highest_prio) {
+			BUG_ON(sch->q.qlen);
+			q->lowest_prio = prio;
+			q->highest_prio = prio;
+		} else {
+			q->lowest_prio = calc_new_low_prio(q);
+		}
+	}
+
+	if (prio > q->highest_prio)
+		q->highest_prio = prio;
+
+	return NET_XMIT_CN;
+}
+
+static struct sk_buff *skbprio_dequeue(struct Qdisc *sch)
+{
+	struct skbprio_sched_data *q = qdisc_priv(sch);
+	struct sk_buff_head *hpq = &q->qdiscs[q->highest_prio];
+	struct sk_buff *skb = __skb_dequeue(hpq);
+
+	if (unlikely(!skb))
+		return NULL;
+
+	sch->q.qlen--;
+	qdisc_qstats_backlog_dec(sch, skb);
+	qdisc_bstats_update(sch, skb);
+
+	q->qstats[q->highest_prio].backlog -= qdisc_pkt_len(skb);
+
+	/* Update highest priority field. */
+	if (skb_queue_empty(hpq)) {
+		if (q->lowest_prio == q->highest_prio) {
+			BUG_ON(sch->q.qlen);
+			q->highest_prio = 0;
+			q->lowest_prio = SKBPRIO_MAX_PRIORITY - 1;
+		} else {
+			q->highest_prio = calc_new_high_prio(q);
+		}
+	}
+	return skb;
+}
+
+static int skbprio_change(struct Qdisc *sch, struct nlattr *opt,
+			struct netlink_ext_ack *extack)
+{
+	struct skbprio_sched_data *q = qdisc_priv(sch);
+	struct tc_skbprio_qopt *ctl = nla_data(opt);
+	const unsigned int min_limit = 1;
+
+	if (ctl->limit == (typeof(ctl->limit))-1)
+		q->max_limit = max(qdisc_dev(sch)->tx_queue_len, min_limit);
+	else if (ctl->limit < min_limit ||
+		ctl->limit > qdisc_dev(sch)->tx_queue_len)
+		return -EINVAL;
+	else
+		q->max_limit = ctl->limit;
+
+	return 0;
+}
+
+static int skbprio_init(struct Qdisc *sch, struct nlattr *opt,
+			struct netlink_ext_ack *extack)
+{
+	struct skbprio_sched_data *q = qdisc_priv(sch);
+	int prio;
+	const unsigned int min_limit = 1;
+
+	/* Initialise all queues, one for each possible priority. */
+	for (prio = 0; prio < SKBPRIO_MAX_PRIORITY; prio++)
+		__skb_queue_head_init(&q->qdiscs[prio]);
+
+	memset(&q->qstats, 0, sizeof(q->qstats));
+	q->highest_prio = 0;
+	q->lowest_prio = SKBPRIO_MAX_PRIORITY - 1;
+	if (!opt) {
+		q->max_limit = max(qdisc_dev(sch)->tx_queue_len, min_limit);
+		return 0;
+	}
+	return skbprio_change(sch, opt, extack);
+}
+
+static int skbprio_dump(struct Qdisc *sch, struct sk_buff *skb)
+{
+	struct skbprio_sched_data *q = qdisc_priv(sch);
+	struct tc_skbprio_qopt opt;
+
+	opt.limit = q->max_limit;
+
+	if (nla_put(skb, TCA_OPTIONS, sizeof(opt), &opt))
+		return -1;
+
+	return skb->len;
+}
+
+static void skbprio_reset(struct Qdisc *sch)
+{
+	struct skbprio_sched_data *q = qdisc_priv(sch);
+	int prio;
+
+	sch->qstats.backlog = 0;
+	sch->q.qlen = 0;
+
+	for (prio = 0; prio < SKBPRIO_MAX_PRIORITY; prio++)
+		__skb_queue_purge(&q->qdiscs[prio]);
+
+	memset(&q->qstats, 0, sizeof(q->qstats));
+	q->highest_prio = 0;
+	q->lowest_prio = SKBPRIO_MAX_PRIORITY - 1;
+}
+
+static void skbprio_destroy(struct Qdisc *sch)
+{
+	struct skbprio_sched_data *q = qdisc_priv(sch);
+	int prio;
+
+	for (prio = 0; prio < SKBPRIO_MAX_PRIORITY; prio++)
+		__skb_queue_purge(&q->qdiscs[prio]);
+}
+
+static struct Qdisc *skbprio_leaf(struct Qdisc *sch, unsigned long arg)
+{
+	return NULL;
+}
+
+static unsigned long skbprio_find(struct Qdisc *sch, u32 classid)
+{
+	return 0;
+}
+
+static int skbprio_dump_class(struct Qdisc *sch, unsigned long cl,
+			     struct sk_buff *skb, struct tcmsg *tcm)
+{
+	tcm->tcm_handle |= TC_H_MIN(cl);
+	return 0;
+}
+
+static int skbprio_dump_class_stats(struct Qdisc *sch, unsigned long cl,
+				   struct gnet_dump *d)
+{
+	struct skbprio_sched_data *q = qdisc_priv(sch);
+	if (gnet_stats_copy_queue(d, NULL, &q->qstats[cl - 1],
+		q->qstats[cl - 1].qlen) < 0)
+		return -1;
+	return 0;
+}
+
+static void skbprio_walk(struct Qdisc *sch, struct qdisc_walker *arg)
+{
+	unsigned int i;
+
+	if (arg->stop)
+		return;
+
+	for (i = 0; i < SKBPRIO_MAX_PRIORITY; i++) {
+		if (arg->count < arg->skip) {
+			arg->count++;
+			continue;
+		}
+		if (arg->fn(sch, i + 1, arg) < 0) {
+			arg->stop = 1;
+			break;
+		}
+		arg->count++;
+	}
+}
+
+static const struct Qdisc_class_ops skbprio_class_ops = {
+	.leaf		=	skbprio_leaf,
+	.find		=	skbprio_find,
+	.dump		=	skbprio_dump_class,
+	.dump_stats	=	skbprio_dump_class_stats,
+	.walk		=	skbprio_walk,
+};
+
+static struct Qdisc_ops skbprio_qdisc_ops __read_mostly = {
+	.cl_ops		=	&skbprio_class_ops,
+	.id		=	"skbprio",
+	.priv_size	=	sizeof(struct skbprio_sched_data),
+	.enqueue	=	skbprio_enqueue,
+	.dequeue	=	skbprio_dequeue,
+	.peek		=	qdisc_peek_dequeued,
+	.init		=	skbprio_init,
+	.reset		=	skbprio_reset,
+	.change		=	skbprio_change,
+	.dump		=	skbprio_dump,
+	.destroy	=	skbprio_destroy,
+	.owner		=	THIS_MODULE,
+};
+
+static int __init skbprio_module_init(void)
+{
+	return register_qdisc(&skbprio_qdisc_ops);
+}
+
+static void __exit skbprio_module_exit(void)
+{
+	unregister_qdisc(&skbprio_qdisc_ops);
+}
+
+module_init(skbprio_module_init)
+module_exit(skbprio_module_exit)
+
+MODULE_LICENSE("GPL");
-- 
1.9.1

^ permalink raw reply related

* [PATCH net] net: in virtio_net_hdr only add VLAN_HLEN to csum_start if payload holds vlan
From: Willem de Bruijn @ 2018-06-06 15:23 UTC (permalink / raw)
  To: netdev; +Cc: davem, rppt, herbert, anton.ivanov, Willem de Bruijn

From: Willem de Bruijn <willemb@google.com>

Tun, tap, virtio, packet and uml vector all use struct virtio_net_hdr
to communicate packet metadata to userspace.

For skbuffs with vlan, the first two return the packet as it may have
existed on the wire, inserting the VLAN tag in the user buffer.  Then
virtio_net_hdr.csum_start needs to be adjusted by VLAN_HLEN bytes.

Commit f09e2249c4f5 ("macvtap: restore vlan header on user read")
added this feature to macvtap. Commit 3ce9b20f1971 ("macvtap: Fix
csum_start when VLAN tags are present") then fixed up csum_start.

Virtio, packet and uml do not insert the vlan header in the user
buffer.

When introducing virtio_net_hdr_from_skb to deduplicate filling in
the virtio_net_hdr, the variant from macvtap which adds VLAN_HLEN was
applied uniformly, breaking csum offset for packets with vlan on
virtio and packet.

Make insertion of VLAN_HLEN optional. Convert the callers to pass it
when needed.

Fixes: e858fae2b0b8f4 ("virtio_net: use common code for virtio_net_hdr and skb GSO conversion")
Fixes: 1276f24eeef2 ("packet: use common code for virtio_net_hdr and skb GSO conversion")
Signed-off-by: Willem de Bruijn <willemb@google.com>
---
 arch/um/drivers/vector_transports.c |  3 ++-
 drivers/net/tap.c                   |  5 ++++-
 drivers/net/tun.c                   |  3 ++-
 drivers/net/virtio_net.c            |  3 ++-
 include/linux/virtio_net.h          | 11 ++++-------
 net/packet/af_packet.c              |  4 ++--
 6 files changed, 16 insertions(+), 13 deletions(-)

diff --git a/arch/um/drivers/vector_transports.c b/arch/um/drivers/vector_transports.c
index 9065047f844b..77e4ebc206ae 100644
--- a/arch/um/drivers/vector_transports.c
+++ b/arch/um/drivers/vector_transports.c
@@ -120,7 +120,8 @@ static int raw_form_header(uint8_t *header,
 		skb,
 		vheader,
 		virtio_legacy_is_little_endian(),
-		false
+		false,
+		0
 	);
 
 	return 0;
diff --git a/drivers/net/tap.c b/drivers/net/tap.c
index 9b6cb780affe..f0f7cd977667 100644
--- a/drivers/net/tap.c
+++ b/drivers/net/tap.c
@@ -774,13 +774,16 @@ static ssize_t tap_put_user(struct tap_queue *q,
 	int total;
 
 	if (q->flags & IFF_VNET_HDR) {
+		int vlan_hlen = skb_vlan_tag_present(skb) ? VLAN_HLEN : 0;
 		struct virtio_net_hdr vnet_hdr;
+
 		vnet_hdr_len = READ_ONCE(q->vnet_hdr_sz);
 		if (iov_iter_count(iter) < vnet_hdr_len)
 			return -EINVAL;
 
 		if (virtio_net_hdr_from_skb(skb, &vnet_hdr,
-					    tap_is_little_endian(q), true))
+					    tap_is_little_endian(q), true,
+					    vlan_hlen))
 			BUG();
 
 		if (copy_to_iter(&vnet_hdr, sizeof(vnet_hdr), iter) !=
diff --git a/drivers/net/tun.c b/drivers/net/tun.c
index 23e9eb66197f..409eb8b74740 100644
--- a/drivers/net/tun.c
+++ b/drivers/net/tun.c
@@ -2078,7 +2078,8 @@ static ssize_t tun_put_user(struct tun_struct *tun,
 			return -EINVAL;
 
 		if (virtio_net_hdr_from_skb(skb, &gso,
-					    tun_is_little_endian(tun), true)) {
+					    tun_is_little_endian(tun), true,
+					    vlan_hlen)) {
 			struct skb_shared_info *sinfo = skb_shinfo(skb);
 			pr_err("unexpected GSO type: "
 			       "0x%x, gso_size %d, hdr_len %d\n",
diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio_net.c
index 032e1ac10a30..8c7207535179 100644
--- a/drivers/net/virtio_net.c
+++ b/drivers/net/virtio_net.c
@@ -1358,7 +1358,8 @@ static int xmit_skb(struct send_queue *sq, struct sk_buff *skb)
 		hdr = skb_vnet_hdr(skb);
 
 	if (virtio_net_hdr_from_skb(skb, &hdr->hdr,
-				    virtio_is_little_endian(vi->vdev), false))
+				    virtio_is_little_endian(vi->vdev), false,
+				    0))
 		BUG();
 
 	if (vi->mergeable_rx_bufs)
diff --git a/include/linux/virtio_net.h b/include/linux/virtio_net.h
index f144216febc6..9397628a1967 100644
--- a/include/linux/virtio_net.h
+++ b/include/linux/virtio_net.h
@@ -58,7 +58,8 @@ static inline int virtio_net_hdr_to_skb(struct sk_buff *skb,
 static inline int virtio_net_hdr_from_skb(const struct sk_buff *skb,
 					  struct virtio_net_hdr *hdr,
 					  bool little_endian,
-					  bool has_data_valid)
+					  bool has_data_valid,
+					  int vlan_hlen)
 {
 	memset(hdr, 0, sizeof(*hdr));   /* no info leak */
 
@@ -83,12 +84,8 @@ static inline int virtio_net_hdr_from_skb(const struct sk_buff *skb,
 
 	if (skb->ip_summed == CHECKSUM_PARTIAL) {
 		hdr->flags = VIRTIO_NET_HDR_F_NEEDS_CSUM;
-		if (skb_vlan_tag_present(skb))
-			hdr->csum_start = __cpu_to_virtio16(little_endian,
-				skb_checksum_start_offset(skb) + VLAN_HLEN);
-		else
-			hdr->csum_start = __cpu_to_virtio16(little_endian,
-				skb_checksum_start_offset(skb));
+		hdr->csum_start = __cpu_to_virtio16(little_endian,
+			skb_checksum_start_offset(skb) + vlan_hlen);
 		hdr->csum_offset = __cpu_to_virtio16(little_endian,
 				skb->csum_offset);
 	} else if (has_data_valid &&
diff --git a/net/packet/af_packet.c b/net/packet/af_packet.c
index acb7b86574cd..9198997b39d1 100644
--- a/net/packet/af_packet.c
+++ b/net/packet/af_packet.c
@@ -2037,7 +2037,7 @@ static int packet_rcv_vnet(struct msghdr *msg, const struct sk_buff *skb,
 		return -EINVAL;
 	*len -= sizeof(vnet_hdr);
 
-	if (virtio_net_hdr_from_skb(skb, &vnet_hdr, vio_le(), true))
+	if (virtio_net_hdr_from_skb(skb, &vnet_hdr, vio_le(), true, 0))
 		return -EINVAL;
 
 	return memcpy_to_msg(msg, (void *)&vnet_hdr, sizeof(vnet_hdr));
@@ -2304,7 +2304,7 @@ static int tpacket_rcv(struct sk_buff *skb, struct net_device *dev,
 	if (do_vnet) {
 		if (virtio_net_hdr_from_skb(skb, h.raw + macoff -
 					    sizeof(struct virtio_net_hdr),
-					    vio_le(), true)) {
+					    vio_le(), true, 0)) {
 			spin_lock(&sk->sk_receive_queue.lock);
 			goto drop_n_account;
 		}
-- 
2.17.1.1185.g55be947832-goog

^ permalink raw reply related

* Re: [PATCH net-next 1/6] net: hippi: use pci_zalloc_consistent
From: kbuild test robot @ 2018-06-06 15:04 UTC (permalink / raw)
  To: YueHaibing
  Cc: kbuild-all, davem, netdev, linux-kernel, jcliburn, chris.snook,
	benve, jdmason, chessman, jes, rahul.verma, YueHaibing
In-Reply-To: <20180605122851.23912-2-yuehaibing@huawei.com>

[-- Attachment #1: Type: text/plain, Size: 5410 bytes --]

Hi YueHaibing,

Thank you for the patch! Yet something to improve:

[auto build test ERROR on net-next/master]

url:    https://github.com/0day-ci/linux/commits/YueHaibing/use-pci_zalloc_consistent/20180606-205531
config: x86_64-allyesdebian (attached as .config)
compiler: gcc-7 (Debian 7.3.0-16) 7.3.0
reproduce:
        # save the attached .config to linux build tree
        make ARCH=x86_64 

All errors (new ones prefixed by >>):

   drivers/net/hippi/rrunner.c: In function 'rr_init_one':
>> drivers/net/hippi/rrunner.c:176:7: error: 'trrpriv' undeclared (first use in this function); did you mean 'rrpriv'?
     if (!trrpriv->evt_ring) {
          ^~~~~~~
          rrpriv
   drivers/net/hippi/rrunner.c:176:7: note: each undeclared identifier is reported only once for each function it appears in

vim +176 drivers/net/hippi/rrunner.c

    74	
    75	/*
    76	 * Implementation notes:
    77	 *
    78	 * The DMA engine only allows for DMA within physical 64KB chunks of
    79	 * memory. The current approach of the driver (and stack) is to use
    80	 * linear blocks of memory for the skbuffs. However, as the data block
    81	 * is always the first part of the skb and skbs are 2^n aligned so we
    82	 * are guarantted to get the whole block within one 64KB align 64KB
    83	 * chunk.
    84	 *
    85	 * On the long term, relying on being able to allocate 64KB linear
    86	 * chunks of memory is not feasible and the skb handling code and the
    87	 * stack will need to know about I/O vectors or something similar.
    88	 */
    89	
    90	static int rr_init_one(struct pci_dev *pdev, const struct pci_device_id *ent)
    91	{
    92		struct net_device *dev;
    93		static int version_disp;
    94		u8 pci_latency;
    95		struct rr_private *rrpriv;
    96		dma_addr_t ring_dma;
    97		int ret = -ENOMEM;
    98	
    99		dev = alloc_hippi_dev(sizeof(struct rr_private));
   100		if (!dev)
   101			goto out3;
   102	
   103		ret = pci_enable_device(pdev);
   104		if (ret) {
   105			ret = -ENODEV;
   106			goto out2;
   107		}
   108	
   109		rrpriv = netdev_priv(dev);
   110	
   111		SET_NETDEV_DEV(dev, &pdev->dev);
   112	
   113		ret = pci_request_regions(pdev, "rrunner");
   114		if (ret < 0)
   115			goto out;
   116	
   117		pci_set_drvdata(pdev, dev);
   118	
   119		rrpriv->pci_dev = pdev;
   120	
   121		spin_lock_init(&rrpriv->lock);
   122	
   123		dev->netdev_ops = &rr_netdev_ops;
   124	
   125		/* display version info if adapter is found */
   126		if (!version_disp) {
   127			/* set display flag to TRUE so that */
   128			/* we only display this string ONCE */
   129			version_disp = 1;
   130			printk(version);
   131		}
   132	
   133		pci_read_config_byte(pdev, PCI_LATENCY_TIMER, &pci_latency);
   134		if (pci_latency <= 0x58){
   135			pci_latency = 0x58;
   136			pci_write_config_byte(pdev, PCI_LATENCY_TIMER, pci_latency);
   137		}
   138	
   139		pci_set_master(pdev);
   140	
   141		printk(KERN_INFO "%s: Essential RoadRunner serial HIPPI "
   142		       "at 0x%llx, irq %i, PCI latency %i\n", dev->name,
   143		       (unsigned long long)pci_resource_start(pdev, 0),
   144		       pdev->irq, pci_latency);
   145	
   146		/*
   147		 * Remap the MMIO regs into kernel space.
   148		 */
   149		rrpriv->regs = pci_iomap(pdev, 0, 0x1000);
   150		if (!rrpriv->regs) {
   151			printk(KERN_ERR "%s:  Unable to map I/O register, "
   152				"RoadRunner will be disabled.\n", dev->name);
   153			ret = -EIO;
   154			goto out;
   155		}
   156	
   157		rrpriv->tx_ring = pci_alloc_consistent(pdev, TX_TOTAL_SIZE, &ring_dma);
   158		rrpriv->tx_ring_dma = ring_dma;
   159	
   160		if (!rrpriv->tx_ring) {
   161			ret = -ENOMEM;
   162			goto out;
   163		}
   164	
   165		rrpriv->rx_ring = pci_alloc_consistent(pdev, RX_TOTAL_SIZE, &ring_dma);
   166		rrpriv->rx_ring_dma = ring_dma;
   167	
   168		if (!rrpriv->rx_ring) {
   169			ret = -ENOMEM;
   170			goto out;
   171		}
   172	
   173		rrpriv->evt_ring = pci_alloc_consistent(pdev, EVT_RING_SIZE, &ring_dma);
   174		rrpriv->evt_ring_dma = ring_dma;
   175	
 > 176		if (!trrpriv->evt_ring) {
   177			ret = -ENOMEM;
   178			goto out;
   179		}
   180	
   181		/*
   182		 * Don't access any register before this point!
   183		 */
   184	#ifdef __BIG_ENDIAN
   185		writel(readl(&rrpriv->regs->HostCtrl) | NO_SWAP,
   186			&rrpriv->regs->HostCtrl);
   187	#endif
   188		/*
   189		 * Need to add a case for little-endian 64-bit hosts here.
   190		 */
   191	
   192		rr_init(dev);
   193	
   194		ret = register_netdev(dev);
   195		if (ret)
   196			goto out;
   197		return 0;
   198	
   199	 out:
   200		if (rrpriv->evt_ring)
   201			pci_free_consistent(pdev, EVT_RING_SIZE, rrpriv->evt_ring,
   202					    rrpriv->evt_ring_dma);
   203		if (rrpriv->rx_ring)
   204			pci_free_consistent(pdev, RX_TOTAL_SIZE, rrpriv->rx_ring,
   205					    rrpriv->rx_ring_dma);
   206		if (rrpriv->tx_ring)
   207			pci_free_consistent(pdev, TX_TOTAL_SIZE, rrpriv->tx_ring,
   208					    rrpriv->tx_ring_dma);
   209		if (rrpriv->regs)
   210			pci_iounmap(pdev, rrpriv->regs);
   211		if (pdev)
   212			pci_release_regions(pdev);
   213	 out2:
   214		free_netdev(dev);
   215	 out3:
   216		return ret;
   217	}
   218	

---
0-DAY kernel test infrastructure                Open Source Technology Center
https://lists.01.org/pipermail/kbuild-all                   Intel Corporation

[-- Attachment #2: .config.gz --]
[-- Type: application/gzip, Size: 39429 bytes --]

^ permalink raw reply

* Re: [PATCH net] ipv4: igmp: hold wakelock to prevent delayed reports
From: Tejaswi Tanikella @ 2018-06-06 14:38 UTC (permalink / raw)
  To: David Miller; +Cc: netdev, f.fainelli, andrew, edumazet
In-Reply-To: <20180604.110640.184022152967598302.davem@davemloft.net>

On Mon, Jun 04, 2018 at 11:06:40AM -0400, David Miller wrote:
> From: Tejaswi Tanikella <tejaswit@codeaurora.org>
> Date: Fri, 1 Jun 2018 19:35:41 +0530
> 
> > On receiving a IGMPv2/v3 query, based on max_delay set in the header a
> > timer is started to send out a response after a random time within
> > max_delay. If the system then moves into suspend state, Report is
> > delayed until system wakes up.
> > 
> > In one reported scenario, on arm64 devices, max_delay was set to 10s,
> > Reports were consistantly delayed if the timer is scheduled after 5 plus
> > seconds.
> > 
> > Hold wakelock while starting the timer to prevent moving into suspend
> > state.
> > 
> > Signed-off-by: Tejaswi Tanikella <tejaswit@codeaurora.org>
> 
> As Florian stated, this won't be the only networking facility to hit
> a problem like this.  So, if we go down this route, we probably want
> to generically solve this.
> 
> But I have a deeper concern.
> 
> Do we really want every timer based querying mechanism to prevent a
> system from being able to suspend?
> 
> We get to the point where external entities can generate traffic which
> can prevent a remote system from entering suspend state.

Like you suggested maybe aquiring a wakelock will unnecessarily stop the
system from suspending.
Would using alarmtimer be better? It seems similar to a hrtimer but can
wake the system up from suspend.

This should allow the system to suspend till the timer expires. But might
cause repeated wake-ups and suspends.

^ permalink raw reply

* [PATCH net] rxrpc: Fix terminal retransmission connection ID to include the channel
From: David Howells @ 2018-06-06 13:59 UTC (permalink / raw)
  To: netdev; +Cc: dhowells, linux-afs, linux-kernel

When retransmitting the final ACK or ABORT packet for a call, the cid field
in the packet header is set to the connection's cid, but this is incorrect
as it also needs to include the channel number on that connection that the
call was made on.

Fix this by OR'ing in the channel number.

Note that this fixes the bug that:

	commit 1a025028d400b23477341aa7ec2ce55f8b39b554
	rxrpc: Fix handling of call quietly cancelled out on server

works around.  I'm not intending to revert that as it will help protect
against problems that might occur on the server.

Fixes: 3136ef49a14c ("rxrpc: Delay terminal ACK transmission on a client call")
Signed-off-by: David Howells <dhowells@redhat.com>
---

 net/rxrpc/conn_event.c |    2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/net/rxrpc/conn_event.c b/net/rxrpc/conn_event.c
index 1350f1be8037..8229a52c2acd 100644
--- a/net/rxrpc/conn_event.c
+++ b/net/rxrpc/conn_event.c
@@ -70,7 +70,7 @@ static void rxrpc_conn_retransmit_call(struct rxrpc_connection *conn,
 	iov[2].iov_len	= sizeof(ack_info);
 
 	pkt.whdr.epoch		= htonl(conn->proto.epoch);
-	pkt.whdr.cid		= htonl(conn->proto.cid);
+	pkt.whdr.cid		= htonl(conn->proto.cid | channel);
 	pkt.whdr.callNumber	= htonl(call_id);
 	pkt.whdr.seq		= 0;
 	pkt.whdr.type		= chan->last_type;

^ permalink raw reply related

* [less-CONFIG_NET 5/7] seccomp: cut off functions not required
From: Norbert Manthey @ 2018-06-06 13:53 UTC (permalink / raw)
  Cc: Norbert Manthey, Alexei Starovoitov, Daniel Borkmann,
	David S. Miller, John Crispin, Simon Horman, Jakub Kicinski,
	Tom Herbert, Eric Dumazet, Sven Eckelmann, WANG Cong, David Ahern,
	Jon Maloy, netdev, linux-kernel
In-Reply-To: <1528293206-24298-1-git-send-email-nmanthey@amazon.de>

When using CONFIG_SECCOMP_FILTER, not all functions of filter.c and
flow_dissector.c are required. To not pull in more dependencies,
guard the functions that are not required with CONFIG_NET defines.
This way, these functions are enabled in case the file is compiled
because of CONFIG_NET, but they are not present when the file is
compiled because of other configurations.

Signed-off-by: Norbert Manthey <nmanthey@amazon.de>
---
 net/core/filter.c         | 2 ++
 net/core/flow_dissector.c | 2 ++
 2 files changed, 4 insertions(+)

diff --git a/net/core/filter.c b/net/core/filter.c
index 0d980e9..4ddacb7 100644
--- a/net/core/filter.c
+++ b/net/core/filter.c
@@ -1063,6 +1063,7 @@ void bpf_prog_destroy(struct bpf_prog *fp)
 }
 EXPORT_SYMBOL_GPL(bpf_prog_destroy);
 
+#if defined(CONFIG_NET)
 /**
  *	sk_filter_trim_cap - run a packet through a socket filter
  *	@sk: sock associated with &sk_buff
@@ -5657,3 +5658,4 @@ int sk_get_filter(struct sock *sk, struct sock_filter __user *ubuf,
 	release_sock(sk);
 	return ret;
 }
+#endif  // CONFIG_NET
diff --git a/net/core/flow_dissector.c b/net/core/flow_dissector.c
index 70e0679..0903444 100644
--- a/net/core/flow_dissector.c
+++ b/net/core/flow_dissector.c
@@ -1219,6 +1219,7 @@ u32 skb_get_poff(const struct sk_buff *skb)
 	return __skb_get_poff(skb, skb->data, &keys, skb_headlen(skb));
 }
 
+#if defined(CONFIG_NET)
 __be32 flow_get_u32_src(const struct flow_keys *flow)
 {
 	switch (flow->control.addr_type) {
@@ -1340,6 +1341,7 @@ __u32 __get_hash_from_flowi6(const struct flowi6 *fl6, struct flow_keys *keys)
 	return flow_hash_from_keys(keys);
 }
 EXPORT_SYMBOL(__get_hash_from_flowi6);
+#endif  // CONFIG_NET
 
 static const struct flow_dissector_key flow_keys_dissector_keys[] = {
 	{
-- 
2.7.4

Amazon Development Center Germany GmbH
Berlin - Dresden - Aachen
main office: Krausenstr. 38, 10117 Berlin
Geschaeftsfuehrer: Dr. Ralf Herbrich, Christian Schlaeger
Ust-ID: DE289237879
Eingetragen am Amtsgericht Charlottenburg HRB 149173 B

^ permalink raw reply related

* [less-CONFIG_NET 3/7] seccomp: include net and bpf files
From: Norbert Manthey @ 2018-06-06 13:53 UTC (permalink / raw)
  Cc: Norbert Manthey, Alexei Starovoitov, Daniel Borkmann,
	David S. Miller, netdev, linux-kernel
In-Reply-To: <1528293206-24298-1-git-send-email-nmanthey@amazon.de>

When we want to use CONFIG_SECCOMP_FILTER without CONFIG_NET, we have
to ensure that the required files that would be pulled in via
CONFIG_NET are compiled when dropping CONFIG_NET.

Signed-off-by: Norbert Manthey <nmanthey@amazon.de>
---
 kernel/bpf/Makefile | 3 ++-
 net/Makefile        | 5 +++++
 2 files changed, 7 insertions(+), 1 deletion(-)

diff --git a/kernel/bpf/Makefile b/kernel/bpf/Makefile
index a713fd2..5d13269 100644
--- a/kernel/bpf/Makefile
+++ b/kernel/bpf/Makefile
@@ -4,7 +4,8 @@ obj-y := core.o
 obj-$(CONFIG_BPF_SYSCALL) += syscall.o verifier.o inode.o helpers.o tnum.o
 obj-$(CONFIG_BPF_SYSCALL) += hashtab.o arraymap.o percpu_freelist.o bpf_lru_list.o lpm_trie.o map_in_map.o
 obj-$(CONFIG_BPF_SYSCALL) += disasm.o
-ifeq ($(CONFIG_NET),y)
+
+ifneq ($(filter y,$(CONFIG_NET) $(CONFIG_SECCOMP_FILTER)),)
 obj-$(CONFIG_BPF_SYSCALL) += devmap.o
 obj-$(CONFIG_BPF_SYSCALL) += cpumap.o
 obj-$(CONFIG_BPF_SYSCALL) += offload.o
diff --git a/net/Makefile b/net/Makefile
index a6147c6..08f1875 100644
--- a/net/Makefile
+++ b/net/Makefile
@@ -11,6 +11,11 @@ obj-$(CONFIG_NET)		:= socket.o core/
 tmp-$(CONFIG_COMPAT) 		:= compat.o
 obj-$(CONFIG_NET)		+= $(tmp-y)
 
+ifneq ($(CONFIG_NET),y)
+obj-$(CONFIG_SECCOMP_FILTER)    += core/filter.o
+obj-$(CONFIG_SECCOMP_FILTER)    += core/flow_dissector.o
+endif
+
 # LLC has to be linked before the files in net/802/
 obj-$(CONFIG_LLC)		+= llc/
 obj-$(CONFIG_NET)		+= ethernet/ 802/ sched/ netlink/ bpf/
-- 
2.7.4

Amazon Development Center Germany GmbH
Berlin - Dresden - Aachen
main office: Krausenstr. 38, 10117 Berlin
Geschaeftsfuehrer: Dr. Ralf Herbrich, Christian Schlaeger
Ust-ID: DE289237879
Eingetragen am Amtsgericht Charlottenburg HRB 149173 B

^ permalink raw reply related

* [less-CONFIG_NET 2/7] net: reorder flow_dissector
From: Norbert Manthey @ 2018-06-06 13:53 UTC (permalink / raw)
  Cc: Norbert Manthey, David S. Miller, Simon Horman, Andrew Lunn,
	Jakub Kicinski, Tom Herbert, John Crispin, Eric Dumazet,
	Sven Eckelmann, WANG Cong, David Ahern, Jon Maloy, netdev,
	linux-kernel
In-Reply-To: <1528293206-24298-1-git-send-email-nmanthey@amazon.de>

This commit reorders the definitions, such that in the next step we
can easily cut the file into a commonly used part, as well as a part
that is only required in case CONFIG_NET is used.

This is part of the effort to split CONFIG_SECCOMP_FILTER and
CONFIG_NET.

Signed-off-by: Norbert Manthey <nmanthey@amazon.de>
---
 net/core/flow_dissector.c | 206 +++++++++++++++++++++++-----------------------
 1 file changed, 103 insertions(+), 103 deletions(-)

diff --git a/net/core/flow_dissector.c b/net/core/flow_dissector.c
index d29f09b..70e0679 100644
--- a/net/core/flow_dissector.c
+++ b/net/core/flow_dissector.c
@@ -1085,36 +1085,6 @@ static inline size_t flow_keys_hash_length(const struct flow_keys *flow)
 	return (sizeof(*flow) - diff) / sizeof(u32);
 }
 
-__be32 flow_get_u32_src(const struct flow_keys *flow)
-{
-	switch (flow->control.addr_type) {
-	case FLOW_DISSECTOR_KEY_IPV4_ADDRS:
-		return flow->addrs.v4addrs.src;
-	case FLOW_DISSECTOR_KEY_IPV6_ADDRS:
-		return (__force __be32)ipv6_addr_hash(
-			&flow->addrs.v6addrs.src);
-	case FLOW_DISSECTOR_KEY_TIPC:
-		return flow->addrs.tipckey.key;
-	default:
-		return 0;
-	}
-}
-EXPORT_SYMBOL(flow_get_u32_src);
-
-__be32 flow_get_u32_dst(const struct flow_keys *flow)
-{
-	switch (flow->control.addr_type) {
-	case FLOW_DISSECTOR_KEY_IPV4_ADDRS:
-		return flow->addrs.v4addrs.dst;
-	case FLOW_DISSECTOR_KEY_IPV6_ADDRS:
-		return (__force __be32)ipv6_addr_hash(
-			&flow->addrs.v6addrs.dst);
-	default:
-		return 0;
-	}
-}
-EXPORT_SYMBOL(flow_get_u32_dst);
-
 static inline void __flow_hash_consistentify(struct flow_keys *keys)
 {
 	int addr_diff, i;
@@ -1162,49 +1132,6 @@ static inline u32 __flow_hash_from_keys(struct flow_keys *keys, u32 keyval)
 	return hash;
 }
 
-u32 flow_hash_from_keys(struct flow_keys *keys)
-{
-	__flow_hash_secret_init();
-	return __flow_hash_from_keys(keys, hashrnd);
-}
-EXPORT_SYMBOL(flow_hash_from_keys);
-
-static inline u32 ___skb_get_hash(const struct sk_buff *skb,
-				  struct flow_keys *keys, u32 keyval)
-{
-	skb_flow_dissect_flow_keys(skb, keys,
-				   FLOW_DISSECTOR_F_STOP_AT_FLOW_LABEL);
-
-	return __flow_hash_from_keys(keys, keyval);
-}
-
-struct _flow_keys_digest_data {
-	__be16	n_proto;
-	u8	ip_proto;
-	u8	padding;
-	__be32	ports;
-	__be32	src;
-	__be32	dst;
-};
-
-void make_flow_keys_digest(struct flow_keys_digest *digest,
-			   const struct flow_keys *flow)
-{
-	struct _flow_keys_digest_data *data =
-	    (struct _flow_keys_digest_data *)digest;
-
-	BUILD_BUG_ON(sizeof(*data) > sizeof(*digest));
-
-	memset(digest, 0, sizeof(*digest));
-
-	data->n_proto = flow->basic.n_proto;
-	data->ip_proto = flow->basic.ip_proto;
-	data->ports = flow->ports.ports;
-	data->src = flow->addrs.v4addrs.src;
-	data->dst = flow->addrs.v4addrs.dst;
-}
-EXPORT_SYMBOL(make_flow_keys_digest);
-
 static struct flow_dissector flow_keys_dissector_symmetric __read_mostly;
 
 u32 __skb_get_hash_symmetric(const struct sk_buff *skb)
@@ -1222,36 +1149,6 @@ u32 __skb_get_hash_symmetric(const struct sk_buff *skb)
 }
 EXPORT_SYMBOL_GPL(__skb_get_hash_symmetric);
 
-/**
- * __skb_get_hash: calculate a flow hash
- * @skb: sk_buff to calculate flow hash from
- *
- * This function calculates a flow hash based on src/dst addresses
- * and src/dst port numbers.  Sets hash in skb to non-zero hash value
- * on success, zero indicates no valid hash.  Also, sets l4_hash in skb
- * if hash is a canonical 4-tuple hash over transport ports.
- */
-void __skb_get_hash(struct sk_buff *skb)
-{
-	struct flow_keys keys;
-	u32 hash;
-
-	__flow_hash_secret_init();
-
-	hash = ___skb_get_hash(skb, &keys, hashrnd);
-
-	__skb_set_sw_hash(skb, hash, flow_keys_have_l4(&keys));
-}
-EXPORT_SYMBOL(__skb_get_hash);
-
-__u32 skb_get_hash_perturb(const struct sk_buff *skb, u32 perturb)
-{
-	struct flow_keys keys;
-
-	return ___skb_get_hash(skb, &keys, perturb);
-}
-EXPORT_SYMBOL(skb_get_hash_perturb);
-
 u32 __skb_get_poff(const struct sk_buff *skb, void *data,
 		   const struct flow_keys *keys, int hlen)
 {
@@ -1322,6 +1219,109 @@ u32 skb_get_poff(const struct sk_buff *skb)
 	return __skb_get_poff(skb, skb->data, &keys, skb_headlen(skb));
 }
 
+__be32 flow_get_u32_src(const struct flow_keys *flow)
+{
+	switch (flow->control.addr_type) {
+	case FLOW_DISSECTOR_KEY_IPV4_ADDRS:
+		return flow->addrs.v4addrs.src;
+	case FLOW_DISSECTOR_KEY_IPV6_ADDRS:
+		return (__force __be32)ipv6_addr_hash(
+			&flow->addrs.v6addrs.src);
+	case FLOW_DISSECTOR_KEY_TIPC:
+		return flow->addrs.tipckey.key;
+	default:
+		return 0;
+	}
+}
+EXPORT_SYMBOL(flow_get_u32_src);
+
+__be32 flow_get_u32_dst(const struct flow_keys *flow)
+{
+	switch (flow->control.addr_type) {
+	case FLOW_DISSECTOR_KEY_IPV4_ADDRS:
+		return flow->addrs.v4addrs.dst;
+	case FLOW_DISSECTOR_KEY_IPV6_ADDRS:
+		return (__force __be32)ipv6_addr_hash(
+			&flow->addrs.v6addrs.dst);
+	default:
+		return 0;
+	}
+}
+EXPORT_SYMBOL(flow_get_u32_dst);
+
+u32 flow_hash_from_keys(struct flow_keys *keys)
+{
+	__flow_hash_secret_init();
+	return __flow_hash_from_keys(keys, hashrnd);
+}
+EXPORT_SYMBOL(flow_hash_from_keys);
+
+static inline u32 ___skb_get_hash(const struct sk_buff *skb,
+				  struct flow_keys *keys, u32 keyval)
+{
+	skb_flow_dissect_flow_keys(skb, keys,
+				   FLOW_DISSECTOR_F_STOP_AT_FLOW_LABEL);
+
+	return __flow_hash_from_keys(keys, keyval);
+}
+
+struct _flow_keys_digest_data {
+	__be16	n_proto;
+	u8	ip_proto;
+	u8	padding;
+	__be32	ports;
+	__be32	src;
+	__be32	dst;
+};
+
+void make_flow_keys_digest(struct flow_keys_digest *digest,
+			   const struct flow_keys *flow)
+{
+	struct _flow_keys_digest_data *data =
+	    (struct _flow_keys_digest_data *)digest;
+
+	BUILD_BUG_ON(sizeof(*data) > sizeof(*digest));
+
+	memset(digest, 0, sizeof(*digest));
+
+	data->n_proto = flow->basic.n_proto;
+	data->ip_proto = flow->basic.ip_proto;
+	data->ports = flow->ports.ports;
+	data->src = flow->addrs.v4addrs.src;
+	data->dst = flow->addrs.v4addrs.dst;
+}
+EXPORT_SYMBOL(make_flow_keys_digest);
+
+/**
+ * __skb_get_hash: calculate a flow hash
+ * @skb: sk_buff to calculate flow hash from
+ *
+ * This function calculates a flow hash based on src/dst addresses
+ * and src/dst port numbers.  Sets hash in skb to non-zero hash value
+ * on success, zero indicates no valid hash.  Also, sets l4_hash in skb
+ * if hash is a canonical 4-tuple hash over transport ports.
+ */
+void __skb_get_hash(struct sk_buff *skb)
+{
+	struct flow_keys keys;
+	u32 hash;
+
+	__flow_hash_secret_init();
+
+	hash = ___skb_get_hash(skb, &keys, hashrnd);
+
+	__skb_set_sw_hash(skb, hash, flow_keys_have_l4(&keys));
+}
+EXPORT_SYMBOL(__skb_get_hash);
+
+__u32 skb_get_hash_perturb(const struct sk_buff *skb, u32 perturb)
+{
+	struct flow_keys keys;
+
+	return ___skb_get_hash(skb, &keys, perturb);
+}
+EXPORT_SYMBOL(skb_get_hash_perturb);
+
 __u32 __get_hash_from_flowi6(const struct flowi6 *fl6, struct flow_keys *keys)
 {
 	memset(keys, 0, sizeof(*keys));
-- 
2.7.4

Amazon Development Center Germany GmbH
Berlin - Dresden - Aachen
main office: Krausenstr. 38, 10117 Berlin
Geschaeftsfuehrer: Dr. Ralf Herbrich, Christian Schlaeger
Ust-ID: DE289237879
Eingetragen am Amtsgericht Charlottenburg HRB 149173 B

^ permalink raw reply related

* [less-CONFIG_NET 1/7] net: reorder filter code
From: Norbert Manthey @ 2018-06-06 13:53 UTC (permalink / raw)
  Cc: Norbert Manthey, Alexei Starovoitov, Daniel Borkmann,
	David S. Miller, netdev, linux-kernel
In-Reply-To: <1528293127-23825-1-git-send-email-nmanthey@amazon.de>

This commit reorders the definition of functions and struct in the
file filter.c, such that in the next step we can easily cut the file
into a commonly used part, as well as a part that is only required in
case CONFIG_NET is actually set.

This is part of the effort to split CONFIG_SECCOMP_FILTER and
CONFIG_NET.

Signed-off-by: Norbert Manthey <nmanthey@amazon.de>
---
 net/core/filter.c | 330 +++++++++++++++++++++++++++---------------------------
 1 file changed, 165 insertions(+), 165 deletions(-)

diff --git a/net/core/filter.c b/net/core/filter.c
index 201ff36b..0d980e9 100644
--- a/net/core/filter.c
+++ b/net/core/filter.c
@@ -59,58 +59,6 @@
 #include <net/tcp.h>
 #include <linux/bpf_trace.h>
 
-/**
- *	sk_filter_trim_cap - run a packet through a socket filter
- *	@sk: sock associated with &sk_buff
- *	@skb: buffer to filter
- *	@cap: limit on how short the eBPF program may trim the packet
- *
- * Run the eBPF program and then cut skb->data to correct size returned by
- * the program. If pkt_len is 0 we toss packet. If skb->len is smaller
- * than pkt_len we keep whole skb->data. This is the socket level
- * wrapper to BPF_PROG_RUN. It returns 0 if the packet should
- * be accepted or -EPERM if the packet should be tossed.
- *
- */
-int sk_filter_trim_cap(struct sock *sk, struct sk_buff *skb, unsigned int cap)
-{
-	int err;
-	struct sk_filter *filter;
-
-	/*
-	 * If the skb was allocated from pfmemalloc reserves, only
-	 * allow SOCK_MEMALLOC sockets to use it as this socket is
-	 * helping free memory
-	 */
-	if (skb_pfmemalloc(skb) && !sock_flag(sk, SOCK_MEMALLOC)) {
-		NET_INC_STATS(sock_net(sk), LINUX_MIB_PFMEMALLOCDROP);
-		return -ENOMEM;
-	}
-	err = BPF_CGROUP_RUN_PROG_INET_INGRESS(sk, skb);
-	if (err)
-		return err;
-
-	err = security_sock_rcv_skb(sk, skb);
-	if (err)
-		return err;
-
-	rcu_read_lock();
-	filter = rcu_dereference(sk->sk_filter);
-	if (filter) {
-		struct sock *save_sk = skb->sk;
-		unsigned int pkt_len;
-
-		skb->sk = sk;
-		pkt_len = bpf_prog_run_save_cb(filter->prog, skb);
-		skb->sk = save_sk;
-		err = pkt_len ? pskb_trim(skb, max(cap, pkt_len)) : -EPERM;
-	}
-	rcu_read_unlock();
-
-	return err;
-}
-EXPORT_SYMBOL(sk_filter_trim_cap);
-
 BPF_CALL_1(__skb_get_pay_offset, struct sk_buff *, skb)
 {
 	return skb_get_poff(skb);
@@ -165,12 +113,6 @@ BPF_CALL_0(__get_raw_cpu_id)
 	return raw_smp_processor_id();
 }
 
-static const struct bpf_func_proto bpf_get_raw_smp_processor_id_proto = {
-	.func		= __get_raw_cpu_id,
-	.gpl_only	= false,
-	.ret_type	= RET_INTEGER,
-};
-
 static u32 convert_skb_access(int skb_field, int dst_reg, int src_reg,
 			      struct bpf_insn *insn_buf)
 {
@@ -954,71 +896,6 @@ static void __bpf_prog_release(struct bpf_prog *prog)
 	}
 }
 
-static void __sk_filter_release(struct sk_filter *fp)
-{
-	__bpf_prog_release(fp->prog);
-	kfree(fp);
-}
-
-/**
- * 	sk_filter_release_rcu - Release a socket filter by rcu_head
- *	@rcu: rcu_head that contains the sk_filter to free
- */
-static void sk_filter_release_rcu(struct rcu_head *rcu)
-{
-	struct sk_filter *fp = container_of(rcu, struct sk_filter, rcu);
-
-	__sk_filter_release(fp);
-}
-
-/**
- *	sk_filter_release - release a socket filter
- *	@fp: filter to remove
- *
- *	Remove a filter from a socket and release its resources.
- */
-static void sk_filter_release(struct sk_filter *fp)
-{
-	if (refcount_dec_and_test(&fp->refcnt))
-		call_rcu(&fp->rcu, sk_filter_release_rcu);
-}
-
-void sk_filter_uncharge(struct sock *sk, struct sk_filter *fp)
-{
-	u32 filter_size = bpf_prog_size(fp->prog->len);
-
-	atomic_sub(filter_size, &sk->sk_omem_alloc);
-	sk_filter_release(fp);
-}
-
-/* try to charge the socket memory if there is space available
- * return true on success
- */
-static bool __sk_filter_charge(struct sock *sk, struct sk_filter *fp)
-{
-	u32 filter_size = bpf_prog_size(fp->prog->len);
-
-	/* same check as in sock_kmalloc() */
-	if (filter_size <= sysctl_optmem_max &&
-	    atomic_read(&sk->sk_omem_alloc) + filter_size < sysctl_optmem_max) {
-		atomic_add(filter_size, &sk->sk_omem_alloc);
-		return true;
-	}
-	return false;
-}
-
-bool sk_filter_charge(struct sock *sk, struct sk_filter *fp)
-{
-	if (!refcount_inc_not_zero(&fp->refcnt))
-		return false;
-
-	if (!__sk_filter_charge(sk, fp)) {
-		sk_filter_release(fp);
-		return false;
-	}
-	return true;
-}
-
 static struct bpf_prog *bpf_migrate_filter(struct bpf_prog *fp)
 {
 	struct sock_filter *old_prog;
@@ -1127,19 +1004,22 @@ static struct bpf_prog *bpf_prepare_filter(struct bpf_prog *fp,
 }
 
 /**
- *	bpf_prog_create - create an unattached filter
+ *	bpf_prog_create_from_user - create an unattached filter from user buffer
  *	@pfp: the unattached filter that is created
  *	@fprog: the filter program
+ *	@trans: post-classic verifier transformation handler
+ *	@save_orig: save classic BPF program
  *
- * Create a filter independent of any socket. We first run some
- * sanity checks on it to make sure it does not explode on us later.
- * If an error occurs or there is insufficient memory for the filter
- * a negative errno code is returned. On success the return is zero.
+ * This function effectively does the same as bpf_prog_create(), only
+ * that it builds up its insns buffer from user space provided buffer.
+ * It also allows for passing a bpf_aux_classic_check_t handler.
  */
-int bpf_prog_create(struct bpf_prog **pfp, struct sock_fprog_kern *fprog)
+int bpf_prog_create_from_user(struct bpf_prog **pfp, struct sock_fprog *fprog,
+			      bpf_aux_classic_check_t trans, bool save_orig)
 {
 	unsigned int fsize = bpf_classic_proglen(fprog);
 	struct bpf_prog *fp;
+	int err;
 
 	/* Make sure new filter is there and in the right amounts. */
 	if (!bpf_check_basics_ok(fprog->filter, fprog->len))
@@ -1149,44 +1029,177 @@ int bpf_prog_create(struct bpf_prog **pfp, struct sock_fprog_kern *fprog)
 	if (!fp)
 		return -ENOMEM;
 
-	memcpy(fp->insns, fprog->filter, fsize);
+	if (copy_from_user(fp->insns, fprog->filter, fsize)) {
+		__bpf_prog_free(fp);
+		return -EFAULT;
+	}
 
 	fp->len = fprog->len;
-	/* Since unattached filters are not copied back to user
-	 * space through sk_get_filter(), we do not need to hold
-	 * a copy here, and can spare us the work.
-	 */
 	fp->orig_prog = NULL;
 
+	if (save_orig) {
+		err = bpf_prog_store_orig_filter(fp, fprog);
+		if (err) {
+			__bpf_prog_free(fp);
+			return -ENOMEM;
+		}
+	}
+
 	/* bpf_prepare_filter() already takes care of freeing
 	 * memory in case something goes wrong.
 	 */
-	fp = bpf_prepare_filter(fp, NULL);
+	fp = bpf_prepare_filter(fp, trans);
 	if (IS_ERR(fp))
 		return PTR_ERR(fp);
 
 	*pfp = fp;
 	return 0;
 }
-EXPORT_SYMBOL_GPL(bpf_prog_create);
+EXPORT_SYMBOL_GPL(bpf_prog_create_from_user);
+
+void bpf_prog_destroy(struct bpf_prog *fp)
+{
+	__bpf_prog_release(fp);
+}
+EXPORT_SYMBOL_GPL(bpf_prog_destroy);
 
 /**
- *	bpf_prog_create_from_user - create an unattached filter from user buffer
+ *	sk_filter_trim_cap - run a packet through a socket filter
+ *	@sk: sock associated with &sk_buff
+ *	@skb: buffer to filter
+ *	@cap: limit on how short the eBPF program may trim the packet
+ *
+ * Run the eBPF program and then cut skb->data to correct size returned by
+ * the program. If pkt_len is 0 we toss packet. If skb->len is smaller
+ * than pkt_len we keep whole skb->data. This is the socket level
+ * wrapper to BPF_PROG_RUN. It returns 0 if the packet should
+ * be accepted or -EPERM if the packet should be tossed.
+ *
+ */
+int sk_filter_trim_cap(struct sock *sk, struct sk_buff *skb, unsigned int cap)
+{
+	int err;
+	struct sk_filter *filter;
+
+	/*
+	 * If the skb was allocated from pfmemalloc reserves, only
+	 * allow SOCK_MEMALLOC sockets to use it as this socket is
+	 * helping free memory
+	 */
+	if (skb_pfmemalloc(skb) && !sock_flag(sk, SOCK_MEMALLOC)) {
+		NET_INC_STATS(sock_net(sk), LINUX_MIB_PFMEMALLOCDROP);
+		return -ENOMEM;
+	}
+	err = BPF_CGROUP_RUN_PROG_INET_INGRESS(sk, skb);
+	if (err)
+		return err;
+
+	err = security_sock_rcv_skb(sk, skb);
+	if (err)
+		return err;
+
+	rcu_read_lock();
+	filter = rcu_dereference(sk->sk_filter);
+	if (filter) {
+		struct sock *save_sk = skb->sk;
+		unsigned int pkt_len;
+
+		skb->sk = sk;
+		pkt_len = bpf_prog_run_save_cb(filter->prog, skb);
+		skb->sk = save_sk;
+		err = pkt_len ? pskb_trim(skb, max(cap, pkt_len)) : -EPERM;
+	}
+	rcu_read_unlock();
+
+	return err;
+}
+EXPORT_SYMBOL(sk_filter_trim_cap);
+
+static const struct bpf_func_proto bpf_get_raw_smp_processor_id_proto = {
+	.func		= __get_raw_cpu_id,
+	.gpl_only	= false,
+	.ret_type	= RET_INTEGER,
+};
+
+static void __sk_filter_release(struct sk_filter *fp)
+{
+	__bpf_prog_release(fp->prog);
+	kfree(fp);
+}
+
+/**
+ * 	sk_filter_release_rcu - Release a socket filter by rcu_head
+ *	@rcu: rcu_head that contains the sk_filter to free
+ */
+static void sk_filter_release_rcu(struct rcu_head *rcu)
+{
+	struct sk_filter *fp = container_of(rcu, struct sk_filter, rcu);
+
+	__sk_filter_release(fp);
+}
+
+/**
+ *	sk_filter_release - release a socket filter
+ *	@fp: filter to remove
+ *
+ *	Remove a filter from a socket and release its resources.
+ */
+static void sk_filter_release(struct sk_filter *fp)
+{
+	if (refcount_dec_and_test(&fp->refcnt))
+		call_rcu(&fp->rcu, sk_filter_release_rcu);
+}
+
+void sk_filter_uncharge(struct sock *sk, struct sk_filter *fp)
+{
+	u32 filter_size = bpf_prog_size(fp->prog->len);
+
+	atomic_sub(filter_size, &sk->sk_omem_alloc);
+	sk_filter_release(fp);
+}
+
+/* try to charge the socket memory if there is space available
+ * return true on success
+ */
+static bool __sk_filter_charge(struct sock *sk, struct sk_filter *fp)
+{
+	u32 filter_size = bpf_prog_size(fp->prog->len);
+
+	/* same check as in sock_kmalloc() */
+	if (filter_size <= sysctl_optmem_max &&
+	    atomic_read(&sk->sk_omem_alloc) + filter_size < sysctl_optmem_max) {
+		atomic_add(filter_size, &sk->sk_omem_alloc);
+		return true;
+	}
+	return false;
+}
+
+bool sk_filter_charge(struct sock *sk, struct sk_filter *fp)
+{
+	if (!refcount_inc_not_zero(&fp->refcnt))
+		return false;
+
+	if (!__sk_filter_charge(sk, fp)) {
+		sk_filter_release(fp);
+		return false;
+	}
+	return true;
+}
+
+/**
+ *	bpf_prog_create - create an unattached filter
  *	@pfp: the unattached filter that is created
  *	@fprog: the filter program
- *	@trans: post-classic verifier transformation handler
- *	@save_orig: save classic BPF program
  *
- * This function effectively does the same as bpf_prog_create(), only
- * that it builds up its insns buffer from user space provided buffer.
- * It also allows for passing a bpf_aux_classic_check_t handler.
+ * Create a filter independent of any socket. We first run some
+ * sanity checks on it to make sure it does not explode on us later.
+ * If an error occurs or there is insufficient memory for the filter
+ * a negative errno code is returned. On success the return is zero.
  */
-int bpf_prog_create_from_user(struct bpf_prog **pfp, struct sock_fprog *fprog,
-			      bpf_aux_classic_check_t trans, bool save_orig)
+int bpf_prog_create(struct bpf_prog **pfp, struct sock_fprog_kern *fprog)
 {
 	unsigned int fsize = bpf_classic_proglen(fprog);
 	struct bpf_prog *fp;
-	int err;
 
 	/* Make sure new filter is there and in the right amounts. */
 	if (!bpf_check_basics_ok(fprog->filter, fprog->len))
@@ -1196,39 +1209,26 @@ int bpf_prog_create_from_user(struct bpf_prog **pfp, struct sock_fprog *fprog,
 	if (!fp)
 		return -ENOMEM;
 
-	if (copy_from_user(fp->insns, fprog->filter, fsize)) {
-		__bpf_prog_free(fp);
-		return -EFAULT;
-	}
+	memcpy(fp->insns, fprog->filter, fsize);
 
 	fp->len = fprog->len;
+	/* Since unattached filters are not copied back to user
+	 * space through sk_get_filter(), we do not need to hold
+	 * a copy here, and can spare us the work.
+	 */
 	fp->orig_prog = NULL;
 
-	if (save_orig) {
-		err = bpf_prog_store_orig_filter(fp, fprog);
-		if (err) {
-			__bpf_prog_free(fp);
-			return -ENOMEM;
-		}
-	}
-
 	/* bpf_prepare_filter() already takes care of freeing
 	 * memory in case something goes wrong.
 	 */
-	fp = bpf_prepare_filter(fp, trans);
+	fp = bpf_prepare_filter(fp, NULL);
 	if (IS_ERR(fp))
 		return PTR_ERR(fp);
 
 	*pfp = fp;
 	return 0;
 }
-EXPORT_SYMBOL_GPL(bpf_prog_create_from_user);
-
-void bpf_prog_destroy(struct bpf_prog *fp)
-{
-	__bpf_prog_release(fp);
-}
-EXPORT_SYMBOL_GPL(bpf_prog_destroy);
+EXPORT_SYMBOL_GPL(bpf_prog_create);
 
 static int __sk_attach_prog(struct bpf_prog *prog, struct sock *sk)
 {
-- 
2.7.4

Amazon Development Center Germany GmbH
Berlin - Dresden - Aachen
main office: Krausenstr. 38, 10117 Berlin
Geschaeftsfuehrer: Dr. Ralf Herbrich, Christian Schlaeger
Ust-ID: DE289237879
Eingetragen am Amtsgericht Charlottenburg HRB 149173 B

^ permalink raw reply related

* Make CONFIG_NET and CONFIG_SECCOMP_FILTER independent of CONFIG_NET
From: Norbert Manthey @ 2018-06-06 13:52 UTC (permalink / raw)
  To: linux-kernel, kvm, netdev, x86

Dear all,

currently, KVM and SECCOMP rely on functionality of CONFIG_NET, and hence the
latter has to be enabled when building the kernel for the first two
configurations. However, there exists scenarios where the system does not need
networking, but KVM and SECCOMP filters. To reduce the kernel image size for
these scenarios, and to be able to drop active code, this commit series allows
to enable CONFIG_KVM and CONFIG_SECCOMP_FILTER without using CONFIG_NET.

The functionality that is required for seccomp filters is kept in the same
files and - after reordering the source code - is guarded with a single ifdef
per file.

I hope these changes are useful for other scenarios than the one I currently
face.

Best,
Norbert

Amazon Development Center Germany GmbH
Berlin - Dresden - Aachen
main office: Krausenstr. 38, 10117 Berlin
Geschaeftsfuehrer: Dr. Ralf Herbrich, Christian Schlaeger
Ust-ID: DE289237879
Eingetragen am Amtsgericht Charlottenburg HRB 149173 B

^ permalink raw reply

* Re: [PATCH net] kcm: fix races on sk_receive_queue
From: Paolo Abeni @ 2018-06-06 13:46 UTC (permalink / raw)
  To: Kirill Tkhai, netdev; +Cc: David S. Miller, Tom Herbert
In-Reply-To: <f79f8b6f-c137-1eb7-1161-2264c270c025@virtuozzo.com>

On Wed, 2018-06-06 at 16:28 +0300, Kirill Tkhai wrote:
> On 06.06.2018 16:16, Paolo Abeni wrote:
> > KCM removes the packets from sk_receive_queue in requeue_rx_msgs()
> > 
> > without acquiring any lock. Moreover, in R() when the MSG_PEEK
> > flag is not present, the skb is peeked and dequeued with two
> > separate, non-atomic, calls.
> > 
> > The above create room for races, which SYZBOT has been able to
> > exploit, causing list corruption and kernel oops:
> > 
> > kasan: CONFIG_KASAN_INLINE enabled
> > kasan: GPF could be caused by NULL-ptr deref or user memory access
> > general protection fault: 0000 [#1] SMP KASAN
> > Dumping ftrace buffer:
> >     (ftrace buffer empty)
> > Modules linked in:
> > CPU: 0 PID: 8484 Comm: syz-executor919 Not tainted 4.17.0-rc7+ #74
> > Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS
> > Google 01/01/2011
> > RIP: 0010:__skb_unlink include/linux/skbuff.h:1844 [inline]
> > RIP: 0010:skb_unlink+0xc1/0x160 net/core/skbuff.c:2921
> > RSP: 0018:ffff8801d012f6f0 EFLAGS: 00010002
> > RAX: 0000000000000286 RBX: ffff8801d6e073c0 RCX: 0000000000000001
> > RDX: dffffc0000000000 RSI: 0000000000000004 RDI: 0000000000000008
> > RBP: ffff8801d012f718 R08: ffffed0038bb3b6d R09: ffffed0038bb3b6c
> > R10: ffffed0038bb3b6c R11: ffff8801c5d9db63 R12: 0000000000000000
> > R13: 0000000000000000 R14: ffff8801c5d9db60 R15: ffff8801d012fce0
> > FS:  0000000000ab7880(0000) GS:ffff8801dae00000(0000) knlGS:0000000000000000
> > CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> > CR2: 0000000020e5b000 CR3: 00000001c31fb000 CR4: 00000000001406f0
> > DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
> > DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
> > Call Trace:
> >   kcm_recvmsg+0x48d/0x590 net/kcm/kcmsock.c:1160
> >   sock_recvmsg_nosec+0x8c/0xb0 net/socket.c:802
> >   ___sys_recvmsg+0x2b6/0x680 net/socket.c:2279
> >   __sys_recvmmsg+0x2f9/0xb80 net/socket.c:2391
> >   do_sys_recvmmsg+0xe4/0x190 net/socket.c:2472
> >   __do_sys_recvmmsg net/socket.c:2485 [inline]
> >   __se_sys_recvmmsg net/socket.c:2481 [inline]
> >   __x64_sys_recvmmsg+0xbe/0x150 net/socket.c:2481
> >   do_syscall_64+0x1b1/0x800 arch/x86/entry/common.c:287
> >   entry_SYSCALL_64_after_hwframe+0x49/0xbe
> > RIP: 0033:0x4417a9
> > RSP: 002b:00007ffe27282838 EFLAGS: 00000206 ORIG_RAX: 000000000000012b
> > RAX: ffffffffffffffda RBX: 0000000000000000 RCX: 00000000004417a9
> > RDX: 00000000040000f7 RSI: 00000000200002c0 RDI: 0000000000000006
> > RBP: 0000000000000000 R08: 0000000020000200 R09: 00007ffe272829f8
> > R10: 0000000000000060 R11: 0000000000000206 R12: 00000000000001f3
> > R13: 000000000001f871 R14: 0000000000000000 R15: 0000000000000000
> > Code: 00 00 00 49 8d 7d 08 4c 8b 63 08 48 ba 00 00 00 00 00 fc ff df 48 c7
> > 43 08 00 00 00 00 48 89 f9 48 c7 03 00 00 00 00 48 c1 e9 03 <80> 3c 11 00
> > 75 5b 4c 89 e1 4d 89 65 08 48 ba 00 00 00 00 00 fc
> > RIP: __skb_unlink include/linux/skbuff.h:1844 [inline] RSP: ffff8801d012f6f0
> > RIP: skb_unlink+0xc1/0x160 net/core/skbuff.c:2921 RSP: ffff8801d012f6f0
> > 
> > To fix the above, we need to use the locked version of the socket dequeue
> > helper in requeue_rx_msgs() and kcm_wait_data is changed to dequeue
> > the available skb when not peeking.
> > 
> > RFC -> v1:
> >  - use skb_dequeue(), as suggested by Tom
> >  - explicitly close the race between skb_peek and skb_unlink
> > 
> > Fixes: ab7ac4eb9832 ("kcm: Kernel Connection Multiplexor module")
> > Reported-and-tested-by: syzbot+278279efdd2730dd14bf@syzkaller.appspotmail.com
> > Signed-off-by: Paolo Abeni <pabeni@redhat.com>
> > ---
> > This is an RFC, since I'm really new to this area, anyway the syzport
> > reported success in testing the proposed fix.
> > This is very likely a scenario where the upcoming skb->prev,next -> list_head
> > conversion would have helped a lot, thanks to list poisoning and list debug
> > ---
> >  net/kcm/kcmsock.c | 19 ++++++++++++-------
> >  1 file changed, 12 insertions(+), 7 deletions(-)
> > 
> > diff --git a/net/kcm/kcmsock.c b/net/kcm/kcmsock.c
> > index d3601d421571..dd2d02bb35ae 100644
> > --- a/net/kcm/kcmsock.c
> > +++ b/net/kcm/kcmsock.c
> > @@ -223,7 +223,7 @@ static void requeue_rx_msgs(struct kcm_mux *mux, struct sk_buff_head *head)
> >  	struct sk_buff *skb;
> >  	struct kcm_sock *kcm;
> >  
> > -	while ((skb = __skb_dequeue(head))) {
> > +	while ((skb = skb_dequeue(head))) {
> 
> I try to find how the patch protects against the following race:
> 
> requeue_rx_msgs()         kcm_recvmsg()
>   skb = skb_dequeue()       skb = kcm_wait_data(peek = true)
>   ...                       ...
> free skb                    ...
> ...                         skb_copy_datagram_msg(skb) <--- Use after free?
> 
> Isn't there possible a use-after-free?

You are right, this patch does not fix the above race: is addressing a
different one, when recvmsg() is not peeking.

The race itself is not introduced by this code, and I think a separate
patch for the the above would be better (we probably need to increment
the skb reference count while peeking and consume the skb after the
copy)

Cheers,

Paolo

^ permalink raw reply

* Re: [PATCH] netfilter: provide udp*_lib_lookup for nf_tproxy
From: Pablo Neira Ayuso @ 2018-06-06 13:33 UTC (permalink / raw)
  To: David Miller
  Cc: arnd, kuznet, yoshfuji, ecklm94, pabeni, willemb, edumazet,
	dsahern, kafai, netdev, linux-kernel
In-Reply-To: <20180605.105453.339908802413146875.davem@davemloft.net>

On Tue, Jun 05, 2018 at 10:54:53AM -0400, David Miller wrote:
> From: Arnd Bergmann <arnd@arndb.de>
> Date: Tue,  5 Jun 2018 13:40:34 +0200
> 
> > It is now possible to enable the libified nf_tproxy modules without
> > also enabling NETFILTER_XT_TARGET_TPROXY, which throws off the
> > ifdef logic in the udp core code:
> > 
> > net/ipv6/netfilter/nf_tproxy_ipv6.o: In function `nf_tproxy_get_sock_v6':
> > nf_tproxy_ipv6.c:(.text+0x1a8): undefined reference to `udp6_lib_lookup'
> > net/ipv4/netfilter/nf_tproxy_ipv4.o: In function `nf_tproxy_get_sock_v4':
> > nf_tproxy_ipv4.c:(.text+0x3d0): undefined reference to `udp4_lib_lookup'
> > 
> > We can actually simplify the conditions now to provide the two functions
> > exactly when they are needed.
> > 
> > Fixes: 45ca4e0cf273 ("netfilter: Libify xt_TPROXY")
> > Signed-off-by: Arnd Bergmann <arnd@arndb.de>
> 
> Pablo, I'm going to apply this directly to fix the link failure.

BTW, could you also pick this fix for net-next?

http://patchwork.ozlabs.org/patch/924706/

Thanks.

^ permalink raw reply

* WARNING: refcount bug in smc_tcp_listen_work
From: syzbot @ 2018-06-06 13:32 UTC (permalink / raw)
  To: davem, linux-kernel, linux-s390, netdev, syzkaller-bugs, ubraun

Hello,

syzbot found the following crash on:

HEAD commit:    75d4e704fa8d netdev-FAQ: clarify DaveM's position for stab..
git tree:       net-next
console output: https://syzkaller.appspot.com/x/log.txt?x=1632a6f7800000
kernel config:  https://syzkaller.appspot.com/x/.config?x=7b76ec81cd4aad4b
dashboard link: https://syzkaller.appspot.com/bug?extid=9e60d2428a42049a592a
compiler:       gcc (GCC) 8.0.1 20180413 (experimental)

Unfortunately, I don't have any reproducer for this crash yet.

IMPORTANT: if you fix the bug, please add the following tag to the commit:
Reported-by: syzbot+9e60d2428a42049a592a@syzkaller.appspotmail.com

A link change request failed with some changes committed already. Interface  
bridge0 may have been left with an inconsistent configuration, please check.
------------[ cut here ]------------
refcount_t: underflow; use-after-free.
WARNING: CPU: 0 PID: 23870 at lib/refcount.c:187  
refcount_sub_and_test+0x2d3/0x330 lib/refcount.c:187
Kernel panic - not syncing: panic_on_warn set ...

CPU: 0 PID: 23870 Comm: kworker/0:4 Not tainted 4.17.0-rc7+ #79
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS  
Google 01/01/2011
Workqueue: events smc_tcp_listen_work
Call Trace:
  __dump_stack lib/dump_stack.c:77 [inline]
  dump_stack+0x1b9/0x294 lib/dump_stack.c:113
  panic+0x22f/0x4de kernel/panic.c:184
  __warn.cold.8+0x163/0x1b3 kernel/panic.c:536
  report_bug+0x252/0x2d0 lib/bug.c:186
  fixup_bug arch/x86/kernel/traps.c:178 [inline]
  do_error_trap+0x1de/0x490 arch/x86/kernel/traps.c:296
  do_invalid_op+0x1b/0x20 arch/x86/kernel/traps.c:315
  invalid_op+0x14/0x20 arch/x86/entry/entry_64.S:992
RIP: 0010:refcount_sub_and_test+0x2d3/0x330 lib/refcount.c:187
RSP: 0018:ffff880194aaf4f8 EFLAGS: 00010286
RAX: 0000000000000026 RBX: 0000000000000000 RCX: ffffffff8160b8ad
RDX: 0000000000000000 RSI: ffffffff81610561 RDI: ffff880194aaf058
RBP: ffff880194aaf5e0 R08: ffff88014148e500 R09: 0000000000000006
R10: ffff88014148e500 R11: 0000000000000000 R12: 00000000ffffffff
R13: ffff880194aaf5b8 R14: 0000000000000001 R15: ffff880194aaf728
  refcount_dec_and_test+0x1a/0x20 lib/refcount.c:212
  sock_put include/net/sock.h:1668 [inline]
  smc_tcp_listen_work+0xb94/0xec0 net/smc/af_smc.c:1073
  process_one_work+0xc1e/0x1b50 kernel/workqueue.c:2145
  worker_thread+0x1cc/0x1440 kernel/workqueue.c:2279
  kthread+0x345/0x410 kernel/kthread.c:240
  ret_from_fork+0x3a/0x50 arch/x86/entry/entry_64.S:412
Dumping ftrace buffer:
    (ftrace buffer empty)
Kernel Offset: disabled
Rebooting in 86400 seconds..


---
This bug is generated by a bot. It may contain errors.
See https://goo.gl/tpsmEJ for more information about syzbot.
syzbot engineers can be reached at syzkaller@googlegroups.com.

syzbot will keep track of this bug report. See:
https://goo.gl/tpsmEJ#bug-status-tracking for how to communicate with  
syzbot.

^ permalink raw reply

* Re: [PATCH net] kcm: fix races on sk_receive_queue
From: Kirill Tkhai @ 2018-06-06 13:28 UTC (permalink / raw)
  To: Paolo Abeni, netdev; +Cc: David S. Miller, Tom Herbert
In-Reply-To: <628e0398546aefabd68669450621909d269e1ba8.1528289745.git.pabeni@redhat.com>

On 06.06.2018 16:16, Paolo Abeni wrote:
> KCM removes the packets from sk_receive_queue in requeue_rx_msgs()
> 
> without acquiring any lock. Moreover, in R() when the MSG_PEEK
> flag is not present, the skb is peeked and dequeued with two
> separate, non-atomic, calls.
> 
> The above create room for races, which SYZBOT has been able to
> exploit, causing list corruption and kernel oops:
> 
> kasan: CONFIG_KASAN_INLINE enabled
> kasan: GPF could be caused by NULL-ptr deref or user memory access
> general protection fault: 0000 [#1] SMP KASAN
> Dumping ftrace buffer:
>     (ftrace buffer empty)
> Modules linked in:
> CPU: 0 PID: 8484 Comm: syz-executor919 Not tainted 4.17.0-rc7+ #74
> Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS
> Google 01/01/2011
> RIP: 0010:__skb_unlink include/linux/skbuff.h:1844 [inline]
> RIP: 0010:skb_unlink+0xc1/0x160 net/core/skbuff.c:2921
> RSP: 0018:ffff8801d012f6f0 EFLAGS: 00010002
> RAX: 0000000000000286 RBX: ffff8801d6e073c0 RCX: 0000000000000001
> RDX: dffffc0000000000 RSI: 0000000000000004 RDI: 0000000000000008
> RBP: ffff8801d012f718 R08: ffffed0038bb3b6d R09: ffffed0038bb3b6c
> R10: ffffed0038bb3b6c R11: ffff8801c5d9db63 R12: 0000000000000000
> R13: 0000000000000000 R14: ffff8801c5d9db60 R15: ffff8801d012fce0
> FS:  0000000000ab7880(0000) GS:ffff8801dae00000(0000) knlGS:0000000000000000
> CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> CR2: 0000000020e5b000 CR3: 00000001c31fb000 CR4: 00000000001406f0
> DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
> DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
> Call Trace:
>   kcm_recvmsg+0x48d/0x590 net/kcm/kcmsock.c:1160
>   sock_recvmsg_nosec+0x8c/0xb0 net/socket.c:802
>   ___sys_recvmsg+0x2b6/0x680 net/socket.c:2279
>   __sys_recvmmsg+0x2f9/0xb80 net/socket.c:2391
>   do_sys_recvmmsg+0xe4/0x190 net/socket.c:2472
>   __do_sys_recvmmsg net/socket.c:2485 [inline]
>   __se_sys_recvmmsg net/socket.c:2481 [inline]
>   __x64_sys_recvmmsg+0xbe/0x150 net/socket.c:2481
>   do_syscall_64+0x1b1/0x800 arch/x86/entry/common.c:287
>   entry_SYSCALL_64_after_hwframe+0x49/0xbe
> RIP: 0033:0x4417a9
> RSP: 002b:00007ffe27282838 EFLAGS: 00000206 ORIG_RAX: 000000000000012b
> RAX: ffffffffffffffda RBX: 0000000000000000 RCX: 00000000004417a9
> RDX: 00000000040000f7 RSI: 00000000200002c0 RDI: 0000000000000006
> RBP: 0000000000000000 R08: 0000000020000200 R09: 00007ffe272829f8
> R10: 0000000000000060 R11: 0000000000000206 R12: 00000000000001f3
> R13: 000000000001f871 R14: 0000000000000000 R15: 0000000000000000
> Code: 00 00 00 49 8d 7d 08 4c 8b 63 08 48 ba 00 00 00 00 00 fc ff df 48 c7
> 43 08 00 00 00 00 48 89 f9 48 c7 03 00 00 00 00 48 c1 e9 03 <80> 3c 11 00
> 75 5b 4c 89 e1 4d 89 65 08 48 ba 00 00 00 00 00 fc
> RIP: __skb_unlink include/linux/skbuff.h:1844 [inline] RSP: ffff8801d012f6f0
> RIP: skb_unlink+0xc1/0x160 net/core/skbuff.c:2921 RSP: ffff8801d012f6f0
> 
> To fix the above, we need to use the locked version of the socket dequeue
> helper in requeue_rx_msgs() and kcm_wait_data is changed to dequeue
> the available skb when not peeking.
> 
> RFC -> v1:
>  - use skb_dequeue(), as suggested by Tom
>  - explicitly close the race between skb_peek and skb_unlink
> 
> Fixes: ab7ac4eb9832 ("kcm: Kernel Connection Multiplexor module")
> Reported-and-tested-by: syzbot+278279efdd2730dd14bf@syzkaller.appspotmail.com
> Signed-off-by: Paolo Abeni <pabeni@redhat.com>
> ---
> This is an RFC, since I'm really new to this area, anyway the syzport
> reported success in testing the proposed fix.
> This is very likely a scenario where the upcoming skb->prev,next -> list_head
> conversion would have helped a lot, thanks to list poisoning and list debug
> ---
>  net/kcm/kcmsock.c | 19 ++++++++++++-------
>  1 file changed, 12 insertions(+), 7 deletions(-)
> 
> diff --git a/net/kcm/kcmsock.c b/net/kcm/kcmsock.c
> index d3601d421571..dd2d02bb35ae 100644
> --- a/net/kcm/kcmsock.c
> +++ b/net/kcm/kcmsock.c
> @@ -223,7 +223,7 @@ static void requeue_rx_msgs(struct kcm_mux *mux, struct sk_buff_head *head)
>  	struct sk_buff *skb;
>  	struct kcm_sock *kcm;
>  
> -	while ((skb = __skb_dequeue(head))) {
> +	while ((skb = skb_dequeue(head))) {

I try to find how the patch protects against the following race:

requeue_rx_msgs()         kcm_recvmsg()
  skb = skb_dequeue()       skb = kcm_wait_data(peek = true)
  ...                       ...
free skb                    ...
...                         skb_copy_datagram_msg(skb) <--- Use after free?

Isn't there possible a use-after-free?

Thanks,
Kirill

>  		/* Reset destructor to avoid calling kcm_rcv_ready */
>  		skb->destructor = sock_rfree;
>  		skb_orphan(skb);
> @@ -1080,12 +1080,17 @@ static int kcm_sendmsg(struct socket *sock, struct msghdr *msg, size_t len)
>  	return err;
>  }
>  
> -static struct sk_buff *kcm_wait_data(struct sock *sk, int flags,
> +static struct sk_buff *kcm_wait_data(struct sock *sk, int flags, bool peek,
>  				     long timeo, int *err)
>  {
>  	struct sk_buff *skb;
>  
> -	while (!(skb = skb_peek(&sk->sk_receive_queue))) {
> +	for (;; ) {
> +		skb = peek ? skb_peek(&sk->sk_receive_queue) :
> +			     skb_dequeue(&sk->sk_receive_queue);
> +		if (skb)
> +			break;
> +
>  		if (sk->sk_err) {
>  			*err = sock_error(sk);
>  			return NULL;
> @@ -1116,6 +1121,7 @@ static int kcm_recvmsg(struct socket *sock, struct msghdr *msg,
>  {
>  	struct sock *sk = sock->sk;
>  	struct kcm_sock *kcm = kcm_sk(sk);
> +	bool peek = flags & MSG_PEEK;
>  	int err = 0;
>  	long timeo;
>  	struct strp_msg *stm;
> @@ -1126,7 +1132,7 @@ static int kcm_recvmsg(struct socket *sock, struct msghdr *msg,
>  
>  	lock_sock(sk);
>  
> -	skb = kcm_wait_data(sk, flags, timeo, &err);
> +	skb = kcm_wait_data(sk, flags, peek, timeo, &err);
>  	if (!skb)
>  		goto out;
>  
> @@ -1142,7 +1148,7 @@ static int kcm_recvmsg(struct socket *sock, struct msghdr *msg,
>  		goto out;
>  
>  	copied = len;
> -	if (likely(!(flags & MSG_PEEK))) {
> +	if (likely(!peek)) {
>  		KCM_STATS_ADD(kcm->stats.rx_bytes, copied);
>  		if (copied < stm->full_len) {
>  			if (sock->type == SOCK_DGRAM) {
> @@ -1157,7 +1163,6 @@ static int kcm_recvmsg(struct socket *sock, struct msghdr *msg,
>  			/* Finished with message */
>  			msg->msg_flags |= MSG_EOR;
>  			KCM_STATS_INCR(kcm->stats.rx_msgs);
> -			skb_unlink(skb, &sk->sk_receive_queue);
>  			kfree_skb(skb);
>  		}
>  	}
> @@ -1186,7 +1191,7 @@ static ssize_t kcm_splice_read(struct socket *sock, loff_t *ppos,
>  
>  	lock_sock(sk);
>  
> -	skb = kcm_wait_data(sk, flags, timeo, &err);
> +	skb = kcm_wait_data(sk, flags, true, timeo, &err);
>  	if (!skb)
>  		goto err_out;
>  
> 

^ permalink raw reply

* [PATCH net] kcm: fix races on sk_receive_queue
From: Paolo Abeni @ 2018-06-06 13:16 UTC (permalink / raw)
  To: netdev; +Cc: David S. Miller, Tom Herbert, Kirill Tkhai

KCM removes the packets from sk_receive_queue in requeue_rx_msgs()

without acquiring any lock. Moreover, in R() when the MSG_PEEK
flag is not present, the skb is peeked and dequeued with two
separate, non-atomic, calls.

The above create room for races, which SYZBOT has been able to
exploit, causing list corruption and kernel oops:

kasan: CONFIG_KASAN_INLINE enabled
kasan: GPF could be caused by NULL-ptr deref or user memory access
general protection fault: 0000 [#1] SMP KASAN
Dumping ftrace buffer:
    (ftrace buffer empty)
Modules linked in:
CPU: 0 PID: 8484 Comm: syz-executor919 Not tainted 4.17.0-rc7+ #74
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS
Google 01/01/2011
RIP: 0010:__skb_unlink include/linux/skbuff.h:1844 [inline]
RIP: 0010:skb_unlink+0xc1/0x160 net/core/skbuff.c:2921
RSP: 0018:ffff8801d012f6f0 EFLAGS: 00010002
RAX: 0000000000000286 RBX: ffff8801d6e073c0 RCX: 0000000000000001
RDX: dffffc0000000000 RSI: 0000000000000004 RDI: 0000000000000008
RBP: ffff8801d012f718 R08: ffffed0038bb3b6d R09: ffffed0038bb3b6c
R10: ffffed0038bb3b6c R11: ffff8801c5d9db63 R12: 0000000000000000
R13: 0000000000000000 R14: ffff8801c5d9db60 R15: ffff8801d012fce0
FS:  0000000000ab7880(0000) GS:ffff8801dae00000(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 0000000020e5b000 CR3: 00000001c31fb000 CR4: 00000000001406f0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
Call Trace:
  kcm_recvmsg+0x48d/0x590 net/kcm/kcmsock.c:1160
  sock_recvmsg_nosec+0x8c/0xb0 net/socket.c:802
  ___sys_recvmsg+0x2b6/0x680 net/socket.c:2279
  __sys_recvmmsg+0x2f9/0xb80 net/socket.c:2391
  do_sys_recvmmsg+0xe4/0x190 net/socket.c:2472
  __do_sys_recvmmsg net/socket.c:2485 [inline]
  __se_sys_recvmmsg net/socket.c:2481 [inline]
  __x64_sys_recvmmsg+0xbe/0x150 net/socket.c:2481
  do_syscall_64+0x1b1/0x800 arch/x86/entry/common.c:287
  entry_SYSCALL_64_after_hwframe+0x49/0xbe
RIP: 0033:0x4417a9
RSP: 002b:00007ffe27282838 EFLAGS: 00000206 ORIG_RAX: 000000000000012b
RAX: ffffffffffffffda RBX: 0000000000000000 RCX: 00000000004417a9
RDX: 00000000040000f7 RSI: 00000000200002c0 RDI: 0000000000000006
RBP: 0000000000000000 R08: 0000000020000200 R09: 00007ffe272829f8
R10: 0000000000000060 R11: 0000000000000206 R12: 00000000000001f3
R13: 000000000001f871 R14: 0000000000000000 R15: 0000000000000000
Code: 00 00 00 49 8d 7d 08 4c 8b 63 08 48 ba 00 00 00 00 00 fc ff df 48 c7
43 08 00 00 00 00 48 89 f9 48 c7 03 00 00 00 00 48 c1 e9 03 <80> 3c 11 00
75 5b 4c 89 e1 4d 89 65 08 48 ba 00 00 00 00 00 fc
RIP: __skb_unlink include/linux/skbuff.h:1844 [inline] RSP: ffff8801d012f6f0
RIP: skb_unlink+0xc1/0x160 net/core/skbuff.c:2921 RSP: ffff8801d012f6f0

To fix the above, we need to use the locked version of the socket dequeue
helper in requeue_rx_msgs() and kcm_wait_data is changed to dequeue
the available skb when not peeking.

RFC -> v1:
 - use skb_dequeue(), as suggested by Tom
 - explicitly close the race between skb_peek and skb_unlink

Fixes: ab7ac4eb9832 ("kcm: Kernel Connection Multiplexor module")
Reported-and-tested-by: syzbot+278279efdd2730dd14bf@syzkaller.appspotmail.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
---
This is an RFC, since I'm really new to this area, anyway the syzport
reported success in testing the proposed fix.
This is very likely a scenario where the upcoming skb->prev,next -> list_head
conversion would have helped a lot, thanks to list poisoning and list debug
---
 net/kcm/kcmsock.c | 19 ++++++++++++-------
 1 file changed, 12 insertions(+), 7 deletions(-)

diff --git a/net/kcm/kcmsock.c b/net/kcm/kcmsock.c
index d3601d421571..dd2d02bb35ae 100644
--- a/net/kcm/kcmsock.c
+++ b/net/kcm/kcmsock.c
@@ -223,7 +223,7 @@ static void requeue_rx_msgs(struct kcm_mux *mux, struct sk_buff_head *head)
 	struct sk_buff *skb;
 	struct kcm_sock *kcm;
 
-	while ((skb = __skb_dequeue(head))) {
+	while ((skb = skb_dequeue(head))) {
 		/* Reset destructor to avoid calling kcm_rcv_ready */
 		skb->destructor = sock_rfree;
 		skb_orphan(skb);
@@ -1080,12 +1080,17 @@ static int kcm_sendmsg(struct socket *sock, struct msghdr *msg, size_t len)
 	return err;
 }
 
-static struct sk_buff *kcm_wait_data(struct sock *sk, int flags,
+static struct sk_buff *kcm_wait_data(struct sock *sk, int flags, bool peek,
 				     long timeo, int *err)
 {
 	struct sk_buff *skb;
 
-	while (!(skb = skb_peek(&sk->sk_receive_queue))) {
+	for (;; ) {
+		skb = peek ? skb_peek(&sk->sk_receive_queue) :
+			     skb_dequeue(&sk->sk_receive_queue);
+		if (skb)
+			break;
+
 		if (sk->sk_err) {
 			*err = sock_error(sk);
 			return NULL;
@@ -1116,6 +1121,7 @@ static int kcm_recvmsg(struct socket *sock, struct msghdr *msg,
 {
 	struct sock *sk = sock->sk;
 	struct kcm_sock *kcm = kcm_sk(sk);
+	bool peek = flags & MSG_PEEK;
 	int err = 0;
 	long timeo;
 	struct strp_msg *stm;
@@ -1126,7 +1132,7 @@ static int kcm_recvmsg(struct socket *sock, struct msghdr *msg,
 
 	lock_sock(sk);
 
-	skb = kcm_wait_data(sk, flags, timeo, &err);
+	skb = kcm_wait_data(sk, flags, peek, timeo, &err);
 	if (!skb)
 		goto out;
 
@@ -1142,7 +1148,7 @@ static int kcm_recvmsg(struct socket *sock, struct msghdr *msg,
 		goto out;
 
 	copied = len;
-	if (likely(!(flags & MSG_PEEK))) {
+	if (likely(!peek)) {
 		KCM_STATS_ADD(kcm->stats.rx_bytes, copied);
 		if (copied < stm->full_len) {
 			if (sock->type == SOCK_DGRAM) {
@@ -1157,7 +1163,6 @@ static int kcm_recvmsg(struct socket *sock, struct msghdr *msg,
 			/* Finished with message */
 			msg->msg_flags |= MSG_EOR;
 			KCM_STATS_INCR(kcm->stats.rx_msgs);
-			skb_unlink(skb, &sk->sk_receive_queue);
 			kfree_skb(skb);
 		}
 	}
@@ -1186,7 +1191,7 @@ static ssize_t kcm_splice_read(struct socket *sock, loff_t *ppos,
 
 	lock_sock(sk);
 
-	skb = kcm_wait_data(sk, flags, timeo, &err);
+	skb = kcm_wait_data(sk, flags, true, timeo, &err);
 	if (!skb)
 		goto err_out;
 
-- 
2.17.1

^ permalink raw reply related

* RE: [PATCH net-next 2/3] net: hns3: Fix for VF mailbox receiving unknown message
From: Salil Mehta @ 2018-06-06 13:10 UTC (permalink / raw)
  To: David Miller
  Cc: Zhuangyuzeng (Yisen), lipeng (Y), mehta.salil@opnsrc.net,
	netdev@vger.kernel.org, linux-kernel@vger.kernel.org, Linuxarm,
	wangxi (M)
In-Reply-To: <20180605.144046.766561700879427040.davem@davemloft.net>

Hi Dave,
Sorry about this. I have fixed this in the V2 patch floated just now.

Best regards
Salil.
> -----Original Message-----
> From: David Miller [mailto:davem@davemloft.net]
> Sent: Tuesday, June 05, 2018 7:41 PM
> To: Salil Mehta <salil.mehta@huawei.com>
> Cc: Zhuangyuzeng (Yisen) <yisen.zhuang@huawei.com>; lipeng (Y)
> <lipeng321@huawei.com>; mehta.salil@opnsrc.net; netdev@vger.kernel.org;
> linux-kernel@vger.kernel.org; Linuxarm <linuxarm@huawei.com>; wangxi
> (M) <wangxi11@huawei.com>
> Subject: Re: [PATCH net-next 2/3] net: hns3: Fix for VF mailbox
> receiving unknown message
> 
> From: Salil Mehta <salil.mehta@huawei.com>
> Date: Tue, 5 Jun 2018 12:42:00 +0100
> 
> > +		if (unlikely(!hnae3_get_bit(flag,
> HCLGEVF_CMDQ_RX_OUTVLD_B))) {
> 
> This breaks the build, there is no such symbol named hnae3_get_bit().

^ permalink raw reply

* [PATCH V2 net-next 3/3] net: hns3: Optimize PF CMDQ interrupt switching process
From: Salil Mehta @ 2018-06-06 13:07 UTC (permalink / raw)
  To: davem
  Cc: salil.mehta, yisen.zhuang, lipeng321, mehta.salil, netdev,
	linux-kernel, linuxarm, Xi Wang
In-Reply-To: <20180606130753.54428-1-salil.mehta@huawei.com>

From: Xi Wang <wangxi11@huawei.com>

When the PF frequently switches the CMDQ interrupt, if the CMDQ_SRC is
not cleared before the hardware interrupt is generated, the new interrupt
will not be reported.

This patch optimizes this problem by clearing CMDQ_SRC and RESET_STS
before enabling interrupt and syncing pending IRQ handlers after disabling
interrupt.

Fixes: 466b0c00391b ("net: hns3: Add support for misc interrupt")
Signed-off-by: Xi Wang <wangxi11@huawei.com>
Signed-off-by: Peng Li <lipeng321@huawei.com>
Signed-off-by: Salil Mehta <salil.mehta@huawei.com>
---
 drivers/net/ethernet/hisilicon/hns3/hns3pf/hclge_main.c | 13 +++++++++++++
 1 file changed, 13 insertions(+)

diff --git a/drivers/net/ethernet/hisilicon/hns3/hns3pf/hclge_main.c b/drivers/net/ethernet/hisilicon/hns3/hns3pf/hclge_main.c
index 2a80134..d318d35 100644
--- a/drivers/net/ethernet/hisilicon/hns3/hns3pf/hclge_main.c
+++ b/drivers/net/ethernet/hisilicon/hns3/hns3pf/hclge_main.c
@@ -2557,6 +2557,15 @@ static void hclge_clear_event_cause(struct hclge_dev *hdev, u32 event_type,
 	}
 }
 
+static void hclge_clear_all_event_cause(struct hclge_dev *hdev)
+{
+	hclge_clear_event_cause(hdev, HCLGE_VECTOR0_EVENT_RST,
+				BIT(HCLGE_VECTOR0_GLOBALRESET_INT_B) |
+				BIT(HCLGE_VECTOR0_CORERESET_INT_B) |
+				BIT(HCLGE_VECTOR0_IMPRESET_INT_B));
+	hclge_clear_event_cause(hdev, HCLGE_VECTOR0_EVENT_MBX, 0);
+}
+
 static void hclge_enable_vector(struct hclge_misc_vector *vector, bool enable)
 {
 	writel(enable ? 1 : 0, vector->addr);
@@ -5688,6 +5697,8 @@ static int hclge_init_ae_dev(struct hnae3_ae_dev *ae_dev)
 	INIT_WORK(&hdev->rst_service_task, hclge_reset_service_task);
 	INIT_WORK(&hdev->mbx_service_task, hclge_mailbox_service_task);
 
+	hclge_clear_all_event_cause(hdev);
+
 	/* Enable MISC vector(vector0) */
 	hclge_enable_vector(&hdev->misc_vector, true);
 
@@ -5817,6 +5828,8 @@ static void hclge_uninit_ae_dev(struct hnae3_ae_dev *ae_dev)
 
 	/* Disable MISC vector(vector0) */
 	hclge_enable_vector(&hdev->misc_vector, false);
+	synchronize_irq(hdev->misc_vector.vector_irq);
+
 	hclge_destroy_cmd_queue(&hdev->hw);
 	hclge_misc_irq_uninit(hdev);
 	hclge_pci_uninit(hdev);
-- 
2.7.4

^ permalink raw reply related

* [PATCH V2 net-next 2/3] net: hns3: Fix for VF mailbox receiving unknown message
From: Salil Mehta @ 2018-06-06 13:07 UTC (permalink / raw)
  To: davem
  Cc: salil.mehta, yisen.zhuang, lipeng321, mehta.salil, netdev,
	linux-kernel, linuxarm, Xi Wang
In-Reply-To: <20180606130753.54428-1-salil.mehta@huawei.com>

From: Xi Wang <wangxi11@huawei.com>

Before the firmware updates the crq's tail pointer, if the VF driver
reads the data in the crq, the data may be incomplete at this time,
which will lead to the driver read an unknown message.

This patch fixes it by checking if crq is empty before reading the
message.

Fixes: b11a0bb231f3 ("net: hns3: Add mailbox support to VF driver")
Signed-off-by: Xi Wang <wangxi11@huawei.com>
Signed-off-by: Peng Li <lipeng321@huawei.com>
Signed-off-by: Salil Mehta <salil.mehta@huawei.com>
---
Patch V2: Fixes the compilation break reported David Miller & Kbuild
   Link: https://lkml.org/lkml/2018/6/5/866
         https://lkml.org/lkml/2018/6/6/147
Patch V1: Initial Submit
---
 .../ethernet/hisilicon/hns3/hns3vf/hclgevf_mbx.c   | 23 +++++++++++++++++++---
 1 file changed, 20 insertions(+), 3 deletions(-)

diff --git a/drivers/net/ethernet/hisilicon/hns3/hns3vf/hclgevf_mbx.c b/drivers/net/ethernet/hisilicon/hns3/hns3vf/hclgevf_mbx.c
index a286184..b598c06 100644
--- a/drivers/net/ethernet/hisilicon/hns3/hns3vf/hclgevf_mbx.c
+++ b/drivers/net/ethernet/hisilicon/hns3/hns3vf/hclgevf_mbx.c
@@ -126,6 +126,13 @@ int hclgevf_send_mbx_msg(struct hclgevf_dev *hdev, u16 code, u16 subcode,
 	return status;
 }
 
+static bool hclgevf_cmd_crq_empty(struct hclgevf_hw *hw)
+{
+	u32 tail = hclgevf_read_dev(hw, HCLGEVF_NIC_CRQ_TAIL_REG);
+
+	return tail == hw->cmq.crq.next_to_use;
+}
+
 void hclgevf_mbx_handler(struct hclgevf_dev *hdev)
 {
 	struct hclgevf_mbx_resp_status *resp;
@@ -140,11 +147,22 @@ void hclgevf_mbx_handler(struct hclgevf_dev *hdev)
 	resp = &hdev->mbx_resp;
 	crq = &hdev->hw.cmq.crq;
 
-	flag = le16_to_cpu(crq->desc[crq->next_to_use].flag);
-	while (hnae_get_bit(flag, HCLGEVF_CMDQ_RX_OUTVLD_B)) {
+	while (!hclgevf_cmd_crq_empty(&hdev->hw)) {
 		desc = &crq->desc[crq->next_to_use];
 		req = (struct hclge_mbx_pf_to_vf_cmd *)desc->data;
 
+		flag = le16_to_cpu(crq->desc[crq->next_to_use].flag);
+		if (unlikely(!hnae_get_bit(flag, HCLGEVF_CMDQ_RX_OUTVLD_B))) {
+			dev_warn(&hdev->pdev->dev,
+				 "dropped invalid mailbox message, code = %d\n",
+				 req->msg[0]);
+
+			/* dropping/not processing this invalid message */
+			crq->desc[crq->next_to_use].flag = 0;
+			hclge_mbx_ring_ptr_move_crq(crq);
+			continue;
+		}
+
 		/* synchronous messages are time critical and need preferential
 		 * treatment. Therefore, we need to acknowledge all the sync
 		 * responses as quickly as possible so that waiting tasks do not
@@ -205,7 +223,6 @@ void hclgevf_mbx_handler(struct hclgevf_dev *hdev)
 		}
 		crq->desc[crq->next_to_use].flag = 0;
 		hclge_mbx_ring_ptr_move_crq(crq);
-		flag = le16_to_cpu(crq->desc[crq->next_to_use].flag);
 	}
 
 	/* Write back CMDQ_RQ header pointer, M7 need this pointer */
-- 
2.7.4

^ permalink raw reply related

* [PATCH V2 net-next 1/3] net: hns3: Fix for VF mailbox cannot receiving PF response
From: Salil Mehta @ 2018-06-06 13:07 UTC (permalink / raw)
  To: davem
  Cc: salil.mehta, yisen.zhuang, lipeng321, mehta.salil, netdev,
	linux-kernel, linuxarm, Xi Wang
In-Reply-To: <20180606130753.54428-1-salil.mehta@huawei.com>

From: Xi Wang <wangxi11@huawei.com>

When the VF frequently switches the CMDQ interrupt, if the CMDQ_SRC is not
cleared, the VF will not receive the new PF response after the interrupt
is re-enabled, the corresponding log is as follows:

[  317.482222] hns3 0000:00:03.0: VF could not get mbx resp(=0) from PF
in 500 tries
[  317.483137] hns3 0000:00:03.0: VF request to get tqp info from PF
failed -5

This patch fixes this problem by clearing CMDQ_SRC before enabling
interrupt and syncing pending IRQ handlers after disabling interrupt.

Fixes: e2cb1dec9779 ("net: hns3: Add HNS3 VF HCL(Hardware Compatibility Layer) Support")
Signed-off-by: Xi Wang <wangxi11@huawei.com>
Signed-off-by: Peng Li <lipeng321@huawei.com>
Signed-off-by: Salil Mehta <salil.mehta@huawei.com>
---
 drivers/net/ethernet/hisilicon/hns3/hns3vf/hclgevf_main.c | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/drivers/net/ethernet/hisilicon/hns3/hns3vf/hclgevf_main.c b/drivers/net/ethernet/hisilicon/hns3/hns3vf/hclgevf_main.c
index dd8e8e6..d55ee9c 100644
--- a/drivers/net/ethernet/hisilicon/hns3/hns3vf/hclgevf_main.c
+++ b/drivers/net/ethernet/hisilicon/hns3/hns3vf/hclgevf_main.c
@@ -1576,6 +1576,8 @@ static int hclgevf_misc_irq_init(struct hclgevf_dev *hdev)
 		return ret;
 	}
 
+	hclgevf_clear_event_cause(hdev, 0);
+
 	/* enable misc. vector(vector 0) */
 	hclgevf_enable_vector(&hdev->misc_vector, true);
 
@@ -1586,6 +1588,7 @@ static void hclgevf_misc_irq_uninit(struct hclgevf_dev *hdev)
 {
 	/* disable misc vector(vector 0) */
 	hclgevf_enable_vector(&hdev->misc_vector, false);
+	synchronize_irq(hdev->misc_vector.vector_irq);
 	free_irq(hdev->misc_vector.vector_irq, hdev);
 	hclgevf_free_vector(hdev, 0);
 }
-- 
2.7.4

^ permalink raw reply related

* [PATCH V2 net-next 0/3] Bug fixes & optimization for HNS3 Driver
From: Salil Mehta @ 2018-06-06 13:07 UTC (permalink / raw)
  To: davem
  Cc: salil.mehta, yisen.zhuang, lipeng321, mehta.salil, netdev,
	linux-kernel, linuxarm

This patch-set presents miscellaneous bug fixes and an optimization
for HNS3 driver

V1->V2:
	* Fixes the compilation break reported by David Miller & Kbuild

Xi Wang (3):
  net: hns3: Fix for VF mailbox cannot receiving PF response
  net: hns3: Fix for VF mailbox receiving unknown message
  net: hns3: Optimize PF CMDQ interrupt switching process

 .../ethernet/hisilicon/hns3/hns3pf/hclge_main.c    | 13 ++++++++++++
 .../ethernet/hisilicon/hns3/hns3vf/hclgevf_main.c  |  3 +++
 .../ethernet/hisilicon/hns3/hns3vf/hclgevf_mbx.c   | 23 +++++++++++++++++++---
 3 files changed, 36 insertions(+), 3 deletions(-)

-- 
2.7.4

^ permalink raw reply

* [PATCH] bnx2x: use the right constant
From: Julia Lawall @ 2018-06-06 13:03 UTC (permalink / raw)
  To: Ariel Elior
  Cc: kernel-janitors, everest-linux-l2, David S. Miller, netdev,
	linux-kernel

Nearby code that also tests port suggests that the P0 constant should be
used when port is zero.

The semantic match that finds this problem is as follows:
(http://coccinelle.lip6.fr/)

// <smpl>
@@
expression e,e1;
@@

* e ? e1 : e1
// </smpl>


Fixes: 6c3218c6f7e5 ("bnx2x: Adjust ETS to 578xx")
Signed-off-by: Julia Lawall <Julia.Lawall@lip6.fr>

---
 drivers/net/ethernet/broadcom/bnx2x/bnx2x_link.c |    2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/net/ethernet/broadcom/bnx2x/bnx2x_link.c b/drivers/net/ethernet/broadcom/bnx2x/bnx2x_link.c
index 7dd83d0..22243c4 100644
--- a/drivers/net/ethernet/broadcom/bnx2x/bnx2x_link.c
+++ b/drivers/net/ethernet/broadcom/bnx2x/bnx2x_link.c
@@ -588,7 +588,7 @@ static void bnx2x_ets_e3b0_nig_disabled(const struct link_params *params,
 	 * slots for the highest priority.
 	 */
 	REG_WR(bp, (port) ? NIG_REG_P1_TX_ARB_NUM_STRICT_ARB_SLOTS :
-		   NIG_REG_P1_TX_ARB_NUM_STRICT_ARB_SLOTS, 0x100);
+		   NIG_REG_P0_TX_ARB_NUM_STRICT_ARB_SLOTS, 0x100);
 	/* Mapping between the CREDIT_WEIGHT registers and actual client
 	 * numbers
 	 */

^ permalink raw reply related

* [PATCH bpf-next v5 1/2] trace_helpers.c: Add helpers to poll multiple perf FDs for events
From: Toke Høiland-Jørgensen @ 2018-06-06 12:43 UTC (permalink / raw)
  To: netdev; +Cc: Jesper Dangaard Brouer

Add two new helper functions to trace_helpers that supports polling
multiple perf file descriptors for events. These are used to the XDP
perf_event_output example, which needs to work with one perf fd per CPU.

Signed-off-by: Toke Høiland-Jørgensen <toke@toke.dk>
---
 tools/testing/selftests/bpf/trace_helpers.c |   49 ++++++++++++++++++++++++++-
 tools/testing/selftests/bpf/trace_helpers.h |    4 ++
 2 files changed, 51 insertions(+), 2 deletions(-)

diff --git a/tools/testing/selftests/bpf/trace_helpers.c b/tools/testing/selftests/bpf/trace_helpers.c
index 3868dcb63420..0673e8840cc8 100644
--- a/tools/testing/selftests/bpf/trace_helpers.c
+++ b/tools/testing/selftests/bpf/trace_helpers.c
@@ -88,7 +88,7 @@ static int page_size;
 static int page_cnt = 8;
 static struct perf_event_mmap_page *header;
 
-int perf_event_mmap(int fd)
+int perf_event_mmap_header(int fd, struct perf_event_mmap_page **header)
 {
 	void *base;
 	int mmap_size;
@@ -102,10 +102,15 @@ int perf_event_mmap(int fd)
 		return -1;
 	}
 
-	header = base;
+	*header = base;
 	return 0;
 }
 
+int perf_event_mmap(int fd)
+{
+	return perf_event_mmap_header(fd, &header);
+}
+
 static int perf_event_poll(int fd)
 {
 	struct pollfd pfd = { .fd = fd, .events = POLLIN };
@@ -163,3 +168,43 @@ int perf_event_poller(int fd, perf_event_print_fn output_fn)
 
 	return ret;
 }
+
+int perf_event_poller_multi(int *fds, struct perf_event_mmap_page **headers,
+			    int num_fds, perf_event_print_fn output_fn)
+{
+	enum bpf_perf_event_ret ret;
+	struct pollfd *pfds;
+	void *buf = NULL;
+	size_t len = 0;
+	int i;
+
+	pfds = malloc(sizeof(*pfds) * num_fds);
+	if (!pfds)
+		return LIBBPF_PERF_EVENT_ERROR;
+
+	memset(pfds, 0, sizeof(*pfds) * num_fds);
+	for (i = 0; i < num_fds; i++) {
+		pfds[i].fd = fds[i];
+		pfds[i].events = POLLIN;
+	}
+
+	for (;;) {
+		poll(pfds, num_fds, 1000);
+		for (i = 0; i < num_fds; i++) {
+			if (!pfds[i].revents)
+				continue;
+
+			ret = bpf_perf_event_read_simple(headers[i],
+							 page_cnt * page_size,
+							 page_size, &buf, &len,
+							 bpf_perf_event_print,
+							 output_fn);
+			if (ret != LIBBPF_PERF_EVENT_CONT)
+				break;
+		}
+	}
+	free(buf);
+	free(pfds);
+
+	return ret;
+}
diff --git a/tools/testing/selftests/bpf/trace_helpers.h b/tools/testing/selftests/bpf/trace_helpers.h
index 3b4bcf7f5084..18924f23db1b 100644
--- a/tools/testing/selftests/bpf/trace_helpers.h
+++ b/tools/testing/selftests/bpf/trace_helpers.h
@@ -3,6 +3,7 @@
 #define __TRACE_HELPER_H
 
 #include <libbpf.h>
+#include <linux/perf_event.h>
 
 struct ksym {
 	long addr;
@@ -16,6 +17,9 @@ long ksym_get_addr(const char *name);
 typedef enum bpf_perf_event_ret (*perf_event_print_fn)(void *data, int size);
 
 int perf_event_mmap(int fd);
+int perf_event_mmap_header(int fd, struct perf_event_mmap_page **header);
 /* return LIBBPF_PERF_EVENT_DONE or LIBBPF_PERF_EVENT_ERROR */
 int perf_event_poller(int fd, perf_event_print_fn output_fn);
+int perf_event_poller_multi(int *fds, struct perf_event_mmap_page **headers,
+			    int num_fds, perf_event_print_fn output_fn);
 #endif

^ permalink raw reply related

* [PATCH bpf-next v5 2/2] samples/bpf: Add xdp_sample_pkts example
From: Toke Høiland-Jørgensen @ 2018-06-06 12:43 UTC (permalink / raw)
  To: netdev; +Cc: Jesper Dangaard Brouer
In-Reply-To: <152828901936.9599.10143519677183418086.stgit@alrua-kau>

Add an example program showing how to sample packets from XDP using the
perf event buffer. The example userspace program just prints the ethernet
header for every packet sampled.

Signed-off-by: Toke Høiland-Jørgensen <toke@toke.dk>
---
 samples/bpf/Makefile               |    4 +
 samples/bpf/xdp_sample_pkts_kern.c |   66 ++++++++++++++
 samples/bpf/xdp_sample_pkts_user.c |  176 ++++++++++++++++++++++++++++++++++++
 3 files changed, 246 insertions(+)
 create mode 100644 samples/bpf/xdp_sample_pkts_kern.c
 create mode 100644 samples/bpf/xdp_sample_pkts_user.c

diff --git a/samples/bpf/Makefile b/samples/bpf/Makefile
index 1303af10e54d..9ea2f7b64869 100644
--- a/samples/bpf/Makefile
+++ b/samples/bpf/Makefile
@@ -52,6 +52,7 @@ hostprogs-y += xdp_adjust_tail
 hostprogs-y += xdpsock
 hostprogs-y += xdp_fwd
 hostprogs-y += task_fd_query
+hostprogs-y += xdp_sample_pkts
 
 # Libbpf dependencies
 LIBBPF = $(TOOLS_PATH)/lib/bpf/libbpf.a
@@ -107,6 +108,7 @@ xdp_adjust_tail-objs := xdp_adjust_tail_user.o
 xdpsock-objs := bpf_load.o xdpsock_user.o
 xdp_fwd-objs := bpf_load.o xdp_fwd_user.o
 task_fd_query-objs := bpf_load.o task_fd_query_user.o $(TRACE_HELPERS)
+xdp_sample_pkts-objs := xdp_sample_pkts_user.o $(TRACE_HELPERS)
 
 # Tell kbuild to always build the programs
 always := $(hostprogs-y)
@@ -163,6 +165,7 @@ always += xdp_adjust_tail_kern.o
 always += xdpsock_kern.o
 always += xdp_fwd_kern.o
 always += task_fd_query_kern.o
+always += xdp_sample_pkts_kern.o
 
 HOSTCFLAGS += -I$(objtree)/usr/include
 HOSTCFLAGS += -I$(srctree)/tools/lib/
@@ -179,6 +182,7 @@ HOSTCFLAGS_spintest_user.o += -I$(srctree)/tools/lib/bpf/
 HOSTCFLAGS_trace_event_user.o += -I$(srctree)/tools/lib/bpf/
 HOSTCFLAGS_sampleip_user.o += -I$(srctree)/tools/lib/bpf/
 HOSTCFLAGS_task_fd_query_user.o += -I$(srctree)/tools/lib/bpf/
+HOSTCFLAGS_xdp_sample_pkts_user.o += -I$(srctree)/tools/lib/bpf/
 
 HOST_LOADLIBES		+= $(LIBBPF) -lelf
 HOSTLOADLIBES_tracex4		+= -lrt
diff --git a/samples/bpf/xdp_sample_pkts_kern.c b/samples/bpf/xdp_sample_pkts_kern.c
new file mode 100644
index 000000000000..f7ca8b850978
--- /dev/null
+++ b/samples/bpf/xdp_sample_pkts_kern.c
@@ -0,0 +1,66 @@
+// SPDX-License-Identifier: GPL-2.0
+#include <linux/ptrace.h>
+#include <linux/version.h>
+#include <uapi/linux/bpf.h>
+#include "bpf_helpers.h"
+
+#define SAMPLE_SIZE 64ul
+#define MAX_CPUS 128
+
+#define bpf_printk(fmt, ...)					\
+({								\
+	       char ____fmt[] = fmt;				\
+	       bpf_trace_printk(____fmt, sizeof(____fmt),	\
+				##__VA_ARGS__);			\
+})
+
+struct bpf_map_def SEC("maps") my_map = {
+	.type = BPF_MAP_TYPE_PERF_EVENT_ARRAY,
+	.key_size = sizeof(int),
+	.value_size = sizeof(u32),
+	.max_entries = MAX_CPUS,
+};
+
+SEC("xdp_sample")
+int xdp_sample_prog(struct xdp_md *ctx)
+{
+	void *data_end = (void *)(long)ctx->data_end;
+	void *data = (void *)(long)ctx->data;
+
+	/* Metadata will be in the perf event before the packet data. */
+	struct S {
+		u16 cookie;
+		u16 pkt_len;
+	} __packed metadata;
+
+	if (data < data_end) {
+		/* The XDP perf_event_output handler will use the upper 32 bits
+		 * of the flags argument as a number of bytes to include of the
+		 * packet payload in the event data. If the size is too big, the
+		 * call to bpf_perf_event_output will fail and return -EFAULT.
+		 *
+		 * See bpf_xdp_event_output in net/core/filter.c.
+		 *
+		 * The BPF_F_CURRENT_CPU flag means that the event output fd
+		 * will be indexed by the CPU number in the event map.
+		 */
+		u64 flags = BPF_F_CURRENT_CPU;
+		u16 sample_size;
+		int ret;
+
+		metadata.cookie = 0xdead;
+		metadata.pkt_len = (u16)(data_end - data);
+		sample_size = min(metadata.pkt_len, SAMPLE_SIZE);
+		flags |= (u64)sample_size << 32;
+
+		ret = bpf_perf_event_output(ctx, &my_map, flags,
+					    &metadata, sizeof(metadata));
+		if (ret)
+			bpf_printk("perf_event_output failed: %d\n", ret);
+	}
+
+	return XDP_PASS;
+}
+
+char _license[] SEC("license") = "GPL";
+u32 _version SEC("version") = LINUX_VERSION_CODE;
diff --git a/samples/bpf/xdp_sample_pkts_user.c b/samples/bpf/xdp_sample_pkts_user.c
new file mode 100644
index 000000000000..944ff7512d2b
--- /dev/null
+++ b/samples/bpf/xdp_sample_pkts_user.c
@@ -0,0 +1,176 @@
+// SPDX-License-Identifier: GPL-2.0
+#include <stdio.h>
+#include <unistd.h>
+#include <stdlib.h>
+#include <stdbool.h>
+#include <string.h>
+#include <fcntl.h>
+#include <poll.h>
+#include <linux/perf_event.h>
+#include <linux/bpf.h>
+#include <net/if.h>
+#include <errno.h>
+#include <assert.h>
+#include <sys/sysinfo.h>
+#include <sys/syscall.h>
+#include <sys/ioctl.h>
+#include <sys/mman.h>
+#include <time.h>
+#include <signal.h>
+#include <libbpf.h>
+#include <bpf/bpf.h>
+
+#include "perf-sys.h"
+#include "trace_helpers.h"
+
+#define MAX_CPUS 128
+static int pmu_fds[MAX_CPUS], if_idx;
+static struct perf_event_mmap_page *headers[MAX_CPUS];
+static char *if_name;
+
+static int do_attach(int idx, int fd, const char *name)
+{
+	int err;
+
+	err = bpf_set_link_xdp_fd(idx, fd, 0);
+	if (err < 0)
+		printf("ERROR: failed to attach program to %s\n", name);
+
+	return err;
+}
+
+static int do_detach(int idx, const char *name)
+{
+	int err;
+
+	err = bpf_set_link_xdp_fd(idx, -1, 0);
+	if (err < 0)
+		printf("ERROR: failed to detach program from %s\n", name);
+
+	return err;
+}
+
+#define SAMPLE_SIZE 64
+
+static int print_bpf_output(void *data, int size)
+{
+	struct {
+		__u16 cookie;
+		__u16 pkt_len;
+		__u8  pkt_data[SAMPLE_SIZE];
+	} __packed *e = data;
+	int i;
+
+	if (e->cookie != 0xdead) {
+		printf("BUG cookie %x sized %d\n",
+		       e->cookie, size);
+		return LIBBPF_PERF_EVENT_ERROR;
+	}
+
+	printf("Pkt len: %-5d bytes. Ethernet hdr: ", e->pkt_len);
+	for (i = 0; i < 14 && i < e->pkt_len; i++)
+		printf("%02x ", e->pkt_data[i]);
+	printf("\n");
+
+	return LIBBPF_PERF_EVENT_CONT;
+}
+
+static void test_bpf_perf_event(int map_fd, int num)
+{
+	struct perf_event_attr attr = {
+		.sample_type = PERF_SAMPLE_RAW,
+		.type = PERF_TYPE_SOFTWARE,
+		.config = PERF_COUNT_SW_BPF_OUTPUT,
+		.wakeup_events = 1, /* get an fd notification for every event */
+	};
+	int i;
+
+	for (i = 0; i < num; i++) {
+		int key = i;
+
+		pmu_fds[i] = sys_perf_event_open(&attr, -1/*pid*/, i/*cpu*/,
+						 -1/*group_fd*/, 0);
+
+		assert(pmu_fds[i] >= 0);
+		assert(bpf_map_update_elem(map_fd, &key,
+					   &pmu_fds[i], BPF_ANY) == 0);
+		ioctl(pmu_fds[i], PERF_EVENT_IOC_ENABLE, 0);
+	}
+}
+
+static void sig_handler(int signo)
+{
+	do_detach(if_idx, if_name);
+	exit(0);
+}
+
+int main(int argc, char **argv)
+{
+	struct bpf_prog_load_attr prog_load_attr = {
+		.prog_type	= BPF_PROG_TYPE_XDP,
+	};
+	struct bpf_object *obj;
+	struct bpf_map *map;
+	int prog_fd, map_fd;
+	char filename[256];
+	int ret, err, i;
+	int numcpus;
+
+	if (argc < 2) {
+		printf("Usage: %s <ifname>\n", argv[0]);
+		return 1;
+	}
+
+	numcpus = get_nprocs();
+	if (numcpus > MAX_CPUS)
+		numcpus = MAX_CPUS;
+
+	snprintf(filename, sizeof(filename), "%s_kern.o", argv[0]);
+	prog_load_attr.file = filename;
+
+	if (bpf_prog_load_xattr(&prog_load_attr, &obj, &prog_fd))
+		return 1;
+
+	if (!prog_fd) {
+		printf("load_bpf_file: %s\n", strerror(errno));
+		return 1;
+	}
+
+	map = bpf_map__next(NULL, obj);
+	if (!map) {
+		printf("finding a map in obj file failed\n");
+		return 1;
+	}
+	map_fd = bpf_map__fd(map);
+
+	if_idx = if_nametoindex(argv[1]);
+	if (!if_idx)
+		if_idx = strtoul(argv[1], NULL, 0);
+
+	if (!if_idx) {
+		fprintf(stderr, "Invalid ifname\n");
+		return 1;
+	}
+	if_name = argv[1];
+	err = do_attach(if_idx, prog_fd, argv[1]);
+	if (err)
+		return err;
+
+	if (signal(SIGINT, sig_handler) ||
+	    signal(SIGHUP, sig_handler) ||
+	    signal(SIGTERM, sig_handler)) {
+		perror("signal");
+		return 1;
+	}
+
+	test_bpf_perf_event(map_fd, numcpus);
+
+	for (i = 0; i < numcpus; i++)
+		if (perf_event_mmap_header(pmu_fds[i], &headers[i]) < 0)
+			return 1;
+
+	ret = perf_event_poller_multi(pmu_fds, headers, numcpus,
+				      print_bpf_output);
+	kill(0, SIGINT);
+	return ret;
+}

^ permalink raw reply related

page: next (older) | prev (newer) | latest
- recent:[subjects (threaded)|topics (new)|topics (active)]

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox