Netdev List

* Re: [PATCH ethtool v2 0/3] ethtool: Wake-on-LAN using filters
From: John W. Linville @ 2018-08-16 18:32 UTC
  To: Florian Fainelli; +Cc: netdev, davem, andrew
In-Reply-To: <20180809180402.19430-1-f.fainelli@gmail.com>

On Thu, Aug 09, 2018 at 11:03:59AM -0700, Florian Fainelli wrote:
> Hi John,
> 
> This patch series syncs up ethtool-copy.h to get the new definitions
> required for supporting wake-on-LAN using filters: WAKE_FILTER and
> RX_CLS_FLOW_WAKE and then updates the rxclass.c code to allow us to
> specify action -2 (RX_CLS_FLOW_WAKE).
> 
> Let me know if you would like this to be done differently.
> 
> Thanks!
> 
> Changes in v2:
> 
> - properly put the man page hunk describing action -2 into patch #3
> 
> Florian Fainelli (3):
>   ethtool-copy.h: sync with net-next
>   ethtool: Add support for WAKE_FILTER (WoL using filters)
>   ethtool: Add support for action value -2 (wake-up filter)
> 
>  ethtool-copy.h | 15 +++++++++++----
>  ethtool.8.in   |  4 +++-
>  ethtool.c      |  5 +++++
>  rxclass.c      |  8 +++++---
>  4 files changed, 24 insertions(+), 8 deletions(-)

Thanks, Florian -- LGTM!

Patches merged and pushed-out, queued for next release (probably next week)...

John
-- 
John W. Linville		Someday the world will need a hero, and you
linville@tuxdriver.com			might be all we have.  Be ready.
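
A rough illustration of why the command-line action value -2 can select the
wake-up filter: the flow spec carries the action in an unsigned 64-bit ring
cookie, so a negative action ends up as a value near the top of the range.
The two constants below are local copies, assumed to match the kernel UAPI
header, so that the snippet builds without it. The helper names are
hypothetical and are not taken from ethtool's rxclass.c.

```c
#include <stdint.h>

/* Local copies of the special ring_cookie values (assumed to match
 * include/uapi/linux/ethtool.h; redefined here to stay self-contained). */
#define SKETCH_RX_CLS_FLOW_DISC 0xffffffffffffffffULL	/* action -1: drop */
#define SKETCH_RX_CLS_FLOW_WAKE 0xfffffffffffffffeULL	/* action -2: wake */

/* Hypothetical helper: turn the signed "action" argument into the unsigned
 * 64-bit cookie. Non-negative values are plain RX queue indices. */
uint64_t action_to_ring_cookie(long long action)
{
	/* Signed-to-unsigned conversion is modulo 2^64: -2 -> 0xff..fe */
	return (uint64_t)action;
}

/* Returns 1 if the cookie asks for wake-up rather than a queue or a drop. */
int cookie_is_wake(uint64_t cookie)
{
	return cookie == SKETCH_RX_CLS_FLOW_WAKE;
}
```

With these definitions, action_to_ring_cookie(-2) compares equal to the
wake-up constant, and any non-negative action passes through unchanged.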

* Re: [bpf-next RFC 1/3] flow_dissector: implements flow dissector BPF hook
From: Edward Cree @ 2018-08-16 18:34 UTC
  To: Petar Penkov, netdev
  Cc: davem, ast, daniel, simon.horman, Petar Penkov, Willem de Bruijn
In-Reply-To: <20180816164423.14368-2-peterpenkov96@gmail.com>

On 16/08/18 17:44, Petar Penkov wrote:
> From: Petar Penkov <ppenkov@google.com>
>
> Adds a hook for programs of type BPF_PROG_TYPE_FLOW_DISSECTOR and
> attach type BPF_FLOW_DISSECTOR that is executed in the flow dissector
> path. The BPF program is kept as a global variable so it is
> accessible to all flow dissectors.
>
> Signed-off-by: Petar Penkov <ppenkov@google.com>
> Signed-off-by: Willem de Bruijn <willemb@google.com>
> ---

This looks really great.

> +int skb_flow_dissector_bpf_prog_detach(const union bpf_attr *attr)
> +{
> +	struct bpf_prog *attached;
> +
> +	mutex_lock(&flow_dissector_mutex);
> +	attached = rcu_dereference_protected(flow_dissector_prog,
> +					     lockdep_is_held(&flow_dissector_mutex));
> +	if (!flow_dissector_prog) {
> +		mutex_unlock(&flow_dissector_mutex);
> +		return -EINVAL;
Wouldn't -ENOENT be more usual here (as the counterpart to -EEXIST in
 the skb_flow_dissector_bpf_prog_attach() version just above)?

-Ed
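
A minimal userspace sketch of the errno pairing suggested above, assuming
(as the review comment states) that the attach path reports -EEXIST when a
program is already installed. The single global slot and both function names
are hypothetical stand-ins; the mutex and RCU handling of the real patch are
left out on purpose.

```c
#include <errno.h>
#include <stddef.h>

/* Hypothetical stand-in for the single global program slot. */
static const void *attached_prog;

/* Attach fails with -EEXIST when the slot is already taken. */
int sketch_prog_attach(const void *prog)
{
	if (attached_prog)
		return -EEXIST;
	attached_prog = prog;
	return 0;
}

/* Detach fails with -ENOENT when there is nothing to remove. */
int sketch_prog_detach(void)
{
	if (!attached_prog)
		return -ENOENT;
	attached_prog = NULL;
	return 0;
}
```

A caller can then tell "nothing was attached" (-ENOENT) apart from a
malformed request, which is what -EINVAL would normally signal.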

* Re: [PATCH mlx5-next] RDMA/mlx5: Don't use cached IRQ affinity mask
From: Steve Wise @ 2018-08-16 18:32 UTC
  To: Sagi Grimberg, Max Gurtovoy, Jason Gunthorpe
  Cc: 'Leon Romanovsky', 'Doug Ledford',
	'RDMA mailing list', 'Saeed Mahameed',
	'linux-netdev'
In-Reply-To: <4a13541c-db48-beca-4ee7-932528b22986@grimberg.me>



On 8/16/2018 1:26 PM, Sagi Grimberg wrote:
>
>> Let me know if you want me to try this or any particular fix.
>
> Steve, can you test this one?

Yes!  I'll try it out tomorrow. 

Stevo

> -- 
> [PATCH rfc] block: fix rdma queue mapping
>
> nvme-rdma attempts to map queues based on irq vector affinity.
> However, for some devices, completion vector irq affinity is
> configurable by the user which can break the existing assumption
> that irq vectors are optimally arranged over the host cpu cores.
>
> So we map queues in two stages:
> First, map each queue according to its completion vector's IRQ
> affinity, taking the first cpu in the vector affinity map.
> If the current irq affinity is arranged such that a vector is not
> assigned to any distinct cpu, we map it to a cpu that is on the same
> node. If numa affinity cannot be satisfied, we map it to any unmapped
> cpu we can find. Then, map the remaining cpus in the possible cpumap
> naively.
>
> Signed-off-by: Sagi Grimberg <sagi@grimberg.me>
> ---
> Steve, can you test out this patch?
>  block/blk-mq-cpumap.c  | 39 +++++++++++++-----------
>  block/blk-mq-rdma.c    | 80 +++++++++++++++++++++++++++++++++++++++++++-------
>  include/linux/blk-mq.h |  1 +
>  3 files changed, 93 insertions(+), 27 deletions(-)
>
> diff --git a/block/blk-mq-cpumap.c b/block/blk-mq-cpumap.c
> index 3eb169f15842..34811db8cba9 100644
> --- a/block/blk-mq-cpumap.c
> +++ b/block/blk-mq-cpumap.c
> @@ -30,30 +30,35 @@ static int get_first_sibling(unsigned int cpu)
>         return cpu;
>  }
>
> -int blk_mq_map_queues(struct blk_mq_tag_set *set)
> +void blk_mq_map_queue_cpu(struct blk_mq_tag_set *set, unsigned int cpu)
>  {
>         unsigned int *map = set->mq_map;
>         unsigned int nr_queues = set->nr_hw_queues;
> -       unsigned int cpu, first_sibling;
> +       unsigned int first_sibling;
>
> -       for_each_possible_cpu(cpu) {
> -               /*
> -                * First do sequential mapping between CPUs and queues.
> -                * In case we still have CPUs to map, and we have some number of
> -                * threads per cores then map sibling threads to the same queue for
> -                * performace optimizations.
> -                */
> -               if (cpu < nr_queues) {
> +       /*
> +        * First do sequential mapping between CPUs and queues.
> +        * In case we still have CPUs to map, and we have some number of
> +        * threads per cores then map sibling threads to the same queue for
> +        * performace optimizations.
> +        */
> +       if (cpu < nr_queues) {
> +               map[cpu] = cpu_to_queue_index(nr_queues, cpu);
> +       } else {
> +               first_sibling = get_first_sibling(cpu);
> +               if (first_sibling == cpu)
>                         map[cpu] = cpu_to_queue_index(nr_queues, cpu);
> -               } else {
> -                       first_sibling = get_first_sibling(cpu);
> -                       if (first_sibling == cpu)
> -                               map[cpu] = cpu_to_queue_index(nr_queues, cpu);
> -                       else
> -                               map[cpu] = map[first_sibling];
> -               }
> +               else
> +                       map[cpu] = map[first_sibling];
>         }
> +}
> +
> +int blk_mq_map_queues(struct blk_mq_tag_set *set)
> +{
> +       unsigned int cpu;
>
> +       for_each_possible_cpu(cpu)
> +               blk_mq_map_queue_cpu(set, cpu);
>         return 0;
>  }
>  EXPORT_SYMBOL_GPL(blk_mq_map_queues);
> diff --git a/block/blk-mq-rdma.c b/block/blk-mq-rdma.c
> index 996167f1de18..d04cbb1925f5 100644
> --- a/block/blk-mq-rdma.c
> +++ b/block/blk-mq-rdma.c
> @@ -14,6 +14,61 @@
>  #include <linux/blk-mq-rdma.h>
>  #include <rdma/ib_verbs.h>
>
> +static int blk_mq_rdma_map_queue(struct blk_mq_tag_set *set,
> +               struct ib_device *dev, int first_vec, unsigned int queue)
> +{
> +       const struct cpumask *mask;
> +       unsigned int cpu;
> +       bool mapped = false;
> +
> +       mask = ib_get_vector_affinity(dev, first_vec + queue);
> +       if (!mask)
> +               return -ENOTSUPP;
> +
> +       /* map with an unmapped cpu according to affinity mask */
> +       for_each_cpu(cpu, mask) {
> +               if (set->mq_map[cpu] == UINT_MAX) {
> +                       set->mq_map[cpu] = queue;
> +                       mapped = true;
> +                       break;
> +               }
> +       }
> +
> +       if (!mapped) {
> +               int n;
> +
> +               /* map with an unmapped cpu in the same numa node */
> +               for_each_node(n) {
> +                       const struct cpumask *node_cpumask = cpumask_of_node(n);
> +
> +                       if (!cpumask_intersects(mask, node_cpumask))
> +                               continue;
> +
> +                       for_each_cpu(cpu, node_cpumask) {
> +                               if (set->mq_map[cpu] == UINT_MAX) {
> +                                       set->mq_map[cpu] = queue;
> +                                       mapped = true;
> +                                       break;
> +                               }
> +                       }
> +               }
> +       }
> +
> +       if (!mapped) {
> +               /* map with any unmapped cpu we can find */
> +               for_each_possible_cpu(cpu) {
> +                       if (set->mq_map[cpu] == UINT_MAX) {
> +                               set->mq_map[cpu] = queue;
> +                               mapped = true;
> +                               break;
> +                       }
> +               }
> +       }
> +
> +       WARN_ON_ONCE(!mapped);
> +       return 0;
> +}
> +
>  /**
>   * blk_mq_rdma_map_queues - provide a default queue mapping for rdma device
>   * @set:       tagset to provide the mapping for
> @@ -21,31 +76,36 @@
>   * @first_vec: first interrupt vectors to use for queues (usually 0)
>   *
>   * This function assumes the rdma device @dev has at least as many available
> - * interrupt vetors as @set has queues.  It will then query it's affinity mask
> - * and built queue mapping that maps a queue to the CPUs that have irq affinity
> - * for the corresponding vector.
> + * interrupt vetors as @set has queues.  It will then query vector affinity mask
> + * and attempt to build irq affinity aware queue mappings. If optimal affinity
> + * aware mapping cannot be acheived for a given queue, we look for any unmapped
> + * cpu to map it. Lastly, we map naively all other unmapped cpus in the mq_map.
>   *
>   * In case either the driver passed a @dev with less vectors than
>   * @set->nr_hw_queues, or @dev does not provide an affinity mask for a
>   * vector, we fallback to the naive mapping.
>   */
>  int blk_mq_rdma_map_queues(struct blk_mq_tag_set *set,
> -               struct ib_device *dev, int first_vec)
> +                struct ib_device *dev, int first_vec)
>  {
> -       const struct cpumask *mask;
>         unsigned int queue, cpu;
>
> +       /* reset cpu mapping */
> +       for_each_possible_cpu(cpu)
> +               set->mq_map[cpu] = UINT_MAX;
> +
>         for (queue = 0; queue < set->nr_hw_queues; queue++) {
> -               mask = ib_get_vector_affinity(dev, first_vec + queue);
> -               if (!mask)
> +               if (blk_mq_rdma_map_queue(set, dev, first_vec, queue))
>                         goto fallback;
> +       }
>
> -               for_each_cpu(cpu, mask)
> -                       set->mq_map[cpu] = queue;
> +       /* map any remaining unmapped cpus */
> +       for_each_possible_cpu(cpu) {
> +               if (set->mq_map[cpu] == UINT_MAX)
> +                       blk_mq_map_queue_cpu(set, cpu);
>         }
>
>         return 0;
> -
>  fallback:
>         return blk_mq_map_queues(set);
>  }
> diff --git a/include/linux/blk-mq.h b/include/linux/blk-mq.h
> index d710e92874cc..6eb09c4de34f 100644
> --- a/include/linux/blk-mq.h
> +++ b/include/linux/blk-mq.h
> @@ -285,6 +285,7 @@ int blk_mq_freeze_queue_wait_timeout(struct request_queue *q,
>                                      unsigned long timeout);
>
>  int blk_mq_map_queues(struct blk_mq_tag_set *set);
> +void blk_mq_map_queue_cpu(struct blk_mq_tag_set *set, unsigned int cpu);
>  void blk_mq_update_nr_hw_queues(struct blk_mq_tag_set *set, int nr_hw_queues);
>
>  void blk_mq_quiesce_queue_nowait(struct request_queue *q);
> -- 

* Re: [PATCH mlx5-next] RDMA/mlx5: Don't use cached IRQ affinity mask
From: Sagi Grimberg @ 2018-08-16 18:26 UTC
  To: Steve Wise, Max Gurtovoy, Jason Gunthorpe
  Cc: 'Leon Romanovsky', 'Doug Ledford',
	'RDMA mailing list', 'Saeed Mahameed',
	'linux-netdev'
In-Reply-To: <47178d4d-f730-6e59-5c19-58331cc3864a@opengridcomputing.com>


> Let me know if you want me to try this or any particular fix.

Steve, can you test this one?
--
[PATCH rfc] block: fix rdma queue mapping

nvme-rdma attempts to map queues based on irq vector affinity.
However, for some devices, completion vector irq affinity is
configurable by the user which can break the existing assumption
that irq vectors are optimally arranged over the host cpu cores.

So we map queues in two stages:
First, map each queue according to its completion vector's IRQ
affinity, taking the first cpu in the vector affinity map.
If the current irq affinity is arranged such that a vector is not
assigned to any distinct cpu, we map it to a cpu that is on the same
node. If numa affinity cannot be satisfied, we map it to any unmapped
cpu we can find. Then, map the remaining cpus in the possible cpumap
naively.

Signed-off-by: Sagi Grimberg <sagi@grimberg.me>
---
Steve, can you test out this patch?
  block/blk-mq-cpumap.c  | 39 +++++++++++++-----------
  block/blk-mq-rdma.c    | 80 +++++++++++++++++++++++++++++++++++++++++++-------
  include/linux/blk-mq.h |  1 +
  3 files changed, 93 insertions(+), 27 deletions(-)

diff --git a/block/blk-mq-cpumap.c b/block/blk-mq-cpumap.c
index 3eb169f15842..34811db8cba9 100644
--- a/block/blk-mq-cpumap.c
+++ b/block/blk-mq-cpumap.c
@@ -30,30 +30,35 @@ static int get_first_sibling(unsigned int cpu)
         return cpu;
  }

-int blk_mq_map_queues(struct blk_mq_tag_set *set)
+void blk_mq_map_queue_cpu(struct blk_mq_tag_set *set, unsigned int cpu)
  {
         unsigned int *map = set->mq_map;
         unsigned int nr_queues = set->nr_hw_queues;
-       unsigned int cpu, first_sibling;
+       unsigned int first_sibling;

-       for_each_possible_cpu(cpu) {
-               /*
-                * First do sequential mapping between CPUs and queues.
-                * In case we still have CPUs to map, and we have some number of
-                * threads per cores then map sibling threads to the same queue for
-                * performace optimizations.
-                */
-               if (cpu < nr_queues) {
+       /*
+        * First do sequential mapping between CPUs and queues.
+        * In case we still have CPUs to map, and we have some number of
+        * threads per cores then map sibling threads to the same queue for
+        * performace optimizations.
+        */
+       if (cpu < nr_queues) {
+               map[cpu] = cpu_to_queue_index(nr_queues, cpu);
+       } else {
+               first_sibling = get_first_sibling(cpu);
+               if (first_sibling == cpu)
                         map[cpu] = cpu_to_queue_index(nr_queues, cpu);
-               } else {
-                       first_sibling = get_first_sibling(cpu);
-                       if (first_sibling == cpu)
-                               map[cpu] = cpu_to_queue_index(nr_queues, cpu);
-                       else
-                               map[cpu] = map[first_sibling];
-               }
+               else
+                       map[cpu] = map[first_sibling];
         }
+}
+
+int blk_mq_map_queues(struct blk_mq_tag_set *set)
+{
+       unsigned int cpu;

+       for_each_possible_cpu(cpu)
+               blk_mq_map_queue_cpu(set, cpu);
         return 0;
  }
  EXPORT_SYMBOL_GPL(blk_mq_map_queues);
diff --git a/block/blk-mq-rdma.c b/block/blk-mq-rdma.c
index 996167f1de18..d04cbb1925f5 100644
--- a/block/blk-mq-rdma.c
+++ b/block/blk-mq-rdma.c
@@ -14,6 +14,61 @@
  #include <linux/blk-mq-rdma.h>
  #include <rdma/ib_verbs.h>

+static int blk_mq_rdma_map_queue(struct blk_mq_tag_set *set,
+               struct ib_device *dev, int first_vec, unsigned int queue)
+{
+       const struct cpumask *mask;
+       unsigned int cpu;
+       bool mapped = false;
+
+       mask = ib_get_vector_affinity(dev, first_vec + queue);
+       if (!mask)
+               return -ENOTSUPP;
+
+       /* map with an unmapped cpu according to affinity mask */
+       for_each_cpu(cpu, mask) {
+               if (set->mq_map[cpu] == UINT_MAX) {
+                       set->mq_map[cpu] = queue;
+                       mapped = true;
+                       break;
+               }
+       }
+
+       if (!mapped) {
+               int n;
+
+               /* map with an unmapped cpu in the same numa node */
+               for_each_node(n) {
+                       const struct cpumask *node_cpumask = cpumask_of_node(n);
+
+                       if (!cpumask_intersects(mask, node_cpumask))
+                               continue;
+
+                       for_each_cpu(cpu, node_cpumask) {
+                               if (set->mq_map[cpu] == UINT_MAX) {
+                                       set->mq_map[cpu] = queue;
+                                       mapped = true;
+                                       break;
+                               }
+                       }
+               }
+       }
+
+       if (!mapped) {
+               /* map with any unmapped cpu we can find */
+               for_each_possible_cpu(cpu) {
+                       if (set->mq_map[cpu] == UINT_MAX) {
+                               set->mq_map[cpu] = queue;
+                               mapped = true;
+                               break;
+                       }
+               }
+       }
+
+       WARN_ON_ONCE(!mapped);
+       return 0;
+}
+
  /**
   * blk_mq_rdma_map_queues - provide a default queue mapping for rdma device
   * @set:       tagset to provide the mapping for
@@ -21,31 +76,36 @@
   * @first_vec: first interrupt vectors to use for queues (usually 0)
   *
   * This function assumes the rdma device @dev has at least as many available
- * interrupt vetors as @set has queues.  It will then query it's affinity mask
- * and built queue mapping that maps a queue to the CPUs that have irq affinity
- * for the corresponding vector.
+ * interrupt vetors as @set has queues.  It will then query vector affinity mask
+ * and attempt to build irq affinity aware queue mappings. If optimal affinity
+ * aware mapping cannot be acheived for a given queue, we look for any unmapped
+ * cpu to map it. Lastly, we map naively all other unmapped cpus in the mq_map.
   *
   * In case either the driver passed a @dev with less vectors than
   * @set->nr_hw_queues, or @dev does not provide an affinity mask for a
   * vector, we fallback to the naive mapping.
   */
  int blk_mq_rdma_map_queues(struct blk_mq_tag_set *set,
-               struct ib_device *dev, int first_vec)
+                struct ib_device *dev, int first_vec)
  {
-       const struct cpumask *mask;
         unsigned int queue, cpu;

+       /* reset cpu mapping */
+       for_each_possible_cpu(cpu)
+               set->mq_map[cpu] = UINT_MAX;
+
         for (queue = 0; queue < set->nr_hw_queues; queue++) {
-               mask = ib_get_vector_affinity(dev, first_vec + queue);
-               if (!mask)
+               if (blk_mq_rdma_map_queue(set, dev, first_vec, queue))
                         goto fallback;
+       }

-               for_each_cpu(cpu, mask)
-                       set->mq_map[cpu] = queue;
+       /* map any remaining unmapped cpus */
+       for_each_possible_cpu(cpu) {
+               if (set->mq_map[cpu] == UINT_MAX)
+                       blk_mq_map_queue_cpu(set, cpu);
         }

         return 0;
-
  fallback:
         return blk_mq_map_queues(set);
  }
diff --git a/include/linux/blk-mq.h b/include/linux/blk-mq.h
index d710e92874cc..6eb09c4de34f 100644
--- a/include/linux/blk-mq.h
+++ b/include/linux/blk-mq.h
@@ -285,6 +285,7 @@ int blk_mq_freeze_queue_wait_timeout(struct request_queue *q,
                                      unsigned long timeout);

  int blk_mq_map_queues(struct blk_mq_tag_set *set);
+void blk_mq_map_queue_cpu(struct blk_mq_tag_set *set, unsigned int cpu);
  void blk_mq_update_nr_hw_queues(struct blk_mq_tag_set *set, int nr_hw_queues);

  void blk_mq_quiesce_queue_nowait(struct request_queue *q);
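
The mapping order described in the commit message can be tried out in
userspace with a small simulation. This is only a sketch under several
assumptions: CPU sets are plain bitmasks instead of struct cpumask, the CPU
count is fixed at eight, and leftover CPUs are spread with a simple modulo
in place of blk_mq_map_queue_cpu(). All names are hypothetical. One
deliberate difference: the sketch stops scanning nodes as soon as a CPU has
been claimed, whereas in the hunk as posted the inner break leaves only the
per-node CPU loop.

```c
#include <limits.h>

#define SKETCH_NR_CPUS 8

/* Claim the lowest unmapped CPU whose bit is set in cpus; 1 on success. */
static int claim_first_unmapped(unsigned int *map, unsigned int cpus,
				unsigned int queue)
{
	unsigned int cpu;

	for (cpu = 0; cpu < SKETCH_NR_CPUS; cpu++) {
		if ((cpus & (1u << cpu)) && map[cpu] == UINT_MAX) {
			map[cpu] = queue;
			return 1;
		}
	}
	return 0;
}

/*
 * affinity[q]: bitmask of CPUs with IRQ affinity for queue q's vector.
 * node_mask[n]: bitmask of CPUs that belong to NUMA node n.
 */
void sketch_map_queues(unsigned int *map, unsigned int nr_queues,
		       const unsigned int *affinity,
		       const unsigned int *node_mask, unsigned int nr_nodes)
{
	unsigned int cpu, queue, n;

	for (cpu = 0; cpu < SKETCH_NR_CPUS; cpu++)
		map[cpu] = UINT_MAX;	/* reset cpu mapping */

	for (queue = 0; queue < nr_queues; queue++) {
		/* 1) an unmapped CPU from the vector's affinity mask */
		int mapped = claim_first_unmapped(map, affinity[queue], queue);

		/* 2) an unmapped CPU on a node that intersects the mask */
		for (n = 0; !mapped && n < nr_nodes; n++) {
			if (affinity[queue] & node_mask[n])
				mapped = claim_first_unmapped(map, node_mask[n],
							      queue);
		}

		/* 3) any unmapped CPU at all */
		if (!mapped)
			claim_first_unmapped(map, ~0u, queue);
	}

	/* Leftover CPUs: naive modulo spread (simplified stand-in) */
	for (cpu = 0; cpu < SKETCH_NR_CPUS; cpu++)
		if (map[cpu] == UINT_MAX)
			map[cpu] = cpu % nr_queues;
}
```

For example, with two nodes (CPUs 0-3 and 4-7) and four vectors that were
all pinned to CPU 0, queue 0 takes CPU 0 and queues 1 to 3 fall through to
CPUs 1 to 3 on the same node; CPUs 4 to 7 are then filled by the modulo
step.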

* Re: [PATCH RFC net-next] openvswitch: Queue upcalls to userspace in per-port round-robin order
From: Stefano Brivio @ 2018-08-16 21:07 UTC
  To: Pravin Shelar; +Cc: ovs dev, Justin Pettit, netdev, Jiri Benc
In-Reply-To: <CAOrHB_DaA-+J=jzNOdQiUYrA7RJi30HmRESjsmGs7_z1ffpVOA-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>

Pravin,

On Wed, 15 Aug 2018 00:19:39 -0700
Pravin Shelar <pshelar-LZ6Gd1LRuIk@public.gmane.org> wrote:

> My argument is not about proposed fairness algorithm. It is about cost
> of the fairness and I do not see it is addressed in any of the follow
> ups.

We are still working on it (especially on the points that you mentioned
already), that's why there hasn't been any follow-up yet.

After all, we marked this patch as RFC because we thought we needed to
gather some feedback before we'd reach a solution that's good enough :)

> I revisited the original patch, here is what I see in term of added
> cost to existing upcall processing:

Thanks for the review and the detailed summary, some answers follow:

> 1. one "kzalloc(sizeof(*upcall), GFP_ATOMIC);" This involve allocate
> and initialize memory

We would use kmem_cache_alloc() here, the queue is rather small.
Initialisation, we don't really need it, we can drop it.

> 2. copy flow key which is more than 1 KB (upcall->key = *key)

The current idea here is to find a way to safely hold the pointer to the
flow key. Do you have any suggestion?

> 3. Acquire spin_lock_bh dp->upcalls.lock, which would disable bottom
> half processing on CPU while waiting for the global lock.

A double list, whose pointer is swapped when we start dequeuing
packets (same as it's done e.g. for the flow table on rehashing), would
avoid the need for this spinlock. We're trying that out.
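
The double-list idea can be pictured roughly as follows. This is a
single-threaded sketch only: the struct and function names are hypothetical,
and the synchronisation a real datapath would need around the flip (for
instance an RCU-style pointer update, as with the flow table) is omitted.

```c
#include <stddef.h>

struct sketch_upcall {
	int port;
	struct sketch_upcall *next;
};

struct sketch_list {
	struct sketch_upcall *head, *tail;
};

struct sketch_dp {
	struct sketch_list lists[2];
	int active;	/* index of the list producers append to */
};

/* Producer side: append to whichever list is currently active. */
void sketch_enqueue(struct sketch_dp *dp, struct sketch_upcall *u)
{
	struct sketch_list *l = &dp->lists[dp->active];

	u->next = NULL;
	if (l->tail)
		l->tail->next = u;
	else
		l->head = u;
	l->tail = u;
}

/* Consumer side: flip the active index, then take the old list whole. */
struct sketch_upcall *sketch_swap_and_take(struct sketch_dp *dp)
{
	struct sketch_list *old = &dp->lists[dp->active];
	struct sketch_upcall *head = old->head;

	dp->active ^= 1;	/* new arrivals now go to the other list */
	old->head = old->tail = NULL;
	return head;
}
```

The point of the arrangement is that the consumer walks a list no producer
touches any more, so draining it does not have to hold a lock shared with
the enqueue path.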

> 4. Iterate list of queued upcalls, one of objective it is to avoid out
> of order packet. But I do not see point of ordering packets from
> different streams.

Please note, though, that we also have packets from the same stream.
Actually, the whole point of this exercise is to get packets from
different streams out of order, while maintaining order for a given
stream.
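
To make the ordering goal concrete, here is a toy model, assuming one FIFO
per port as the unit of fairness: packets of one port always leave in
arrival order, while different ports are served in turn. Names and sizes are
hypothetical, packet ids are assumed to be non-negative so that -1 can mean
"empty", and nothing here reflects the actual data structures of the RFC
patch.

```c
#define SKETCH_PORTS 3
#define SKETCH_DEPTH 8

struct sketch_rr {
	int pkt[SKETCH_PORTS][SKETCH_DEPTH];	/* per-port FIFO of packet ids */
	int head[SKETCH_PORTS], len[SKETCH_PORTS];
	int next_port;				/* where the next scan starts */
};

/* Queue a packet id on its port; returns -1 if that port's FIFO is full. */
int sketch_rr_push(struct sketch_rr *rr, int port, int id)
{
	if (rr->len[port] == SKETCH_DEPTH)
		return -1;
	rr->pkt[port][(rr->head[port] + rr->len[port]) % SKETCH_DEPTH] = id;
	rr->len[port]++;
	return 0;
}

/* Take one packet from the next non-empty port; -1 if all are empty. */
int sketch_rr_pop(struct sketch_rr *rr)
{
	int i;

	for (i = 0; i < SKETCH_PORTS; i++) {
		int port = (rr->next_port + i) % SKETCH_PORTS;

		if (rr->len[port]) {
			int id = rr->pkt[port][rr->head[port]];

			rr->head[port] = (rr->head[port] + 1) % SKETCH_DEPTH;
			rr->len[port]--;
			rr->next_port = (port + 1) % SKETCH_PORTS;
			return id;
		}
	}
	return -1;
}
```

If port 0 queues packets 10, 11, 12 and port 1 queues packet 20, the pop
order is 10, 20, 11, 12: port 1 is not starved by the burst on port 0, and
port 0's packets still come out in order.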

> 5. signal upcall thread after delay ovs_dp_upcall_delay(). This adds
> further to the latency.

The idea behind ovs_dp_upcall_delay() is to schedule without delay if
we don't currently have a storm of upcalls.

But if we do, we're probably introducing less latency by doing this than
by letting ovs-vswitchd handle them. It's also a fundamental requirement
to have fairness: we need to schedule upcalls, and to schedule we need
some (small, in the overall picture) delay. This is another point where
we need to show some detailed measurements, I guess.

> 6. upcall is then handed over to different thread (context switch),
> likely on different CPU.
> 8. the upcall object is freed on remote CPU.

The solution could be to use cmwq instead and have per-CPU workers and
queues. But I wonder what would be the drawbacks of having per-CPU
fairness. I think this depends a lot on how ovs-vswitchd handles the
upcalls. We could check how that performs. Any thoughts?

> 9. single lock essentially means OVS kernel datapath upcall processing
> is single threaded no matter number of cores in system.

This should also be solved by keeping two queues.

> I would be interested in how are we going to address these issues.
> 
> In example you were talking about netlink fd issue on server with 48
> core, how does this solution works when there are 5K ports each
> triggering upcall ? Can you benchmark your patch? Do you have
> performance numbers for TCP_CRR with and without this patch ? Also
> publish latency numbers for this patch. Please turn off megaflow to
> exercise upcall handling.

We just ran some tests that show that fairness is maintained with a
much lower number of ports, but we have no performance numbers at the
moment -- other than the consideration that when flooding with upcalls,
ovs-vswitchd is the bottleneck. We'll run proper performance tests,
focusing especially on latency (which we kind of ignored so far).

> I understand fairness has cost, but we need to find right balance
> between performance and fairness. Current fairness scheme is a
> lockless algorithm without much computational overhead, did you try to
> improve current algorithm so that it uses less number of ports.

We tried with one socket per thread, it just doesn't work. We can
definitely try a bit harder. The problem I see here is that the current
mechanism is not actually a fairness scheme. It kind of works for most
workloads, but if a port happens to be flooding with a given timing, I
don't see how fairness can be guaranteed.

-- 
Stefano

* Re: samples don't build on v4.18
From: Joel Fernandes @ 2018-08-16 20:55 UTC
  To: Jakub Kicinski
  Cc: LKML, wangnan0, open list:BPF (Safe dynamic programs and tools),
	Alexei Starovoitov, acme, Chenbo Feng
In-Reply-To: <20180815123418.2765f5b6@cakuba.netronome.com>

On Wed, Aug 15, 2018 at 12:34 PM, Jakub Kicinski
<jakub.kicinski@netronome.com> wrote:
> On Tue, 14 Aug 2018 20:01:32 -0700, Joel Fernandes wrote:
>> On Tue, Aug 14, 2018 at 06:22:21PM -0700, Joel Fernandes wrote:
>> > Forgot to add the patch author, doing so now. thanks
>> >
>> > On Tue, Aug 14, 2018 at 6:20 PM, Joel Fernandes <joelaf@google.com> wrote:
>> > >
>> > > Hi,
>> > > When building BPF samples on v4.18, I get the following errors:
>> > >
>> > > $ cd samples/bpf/
>> > > $ make
>> > >
>> > > Auto-detecting system features:
>> > > ...                        libelf: [ OFF ]
>> > > ...                           bpf: [ OFF ]
>> > >
>> > > No libelf found
>> > > Makefile:213: recipe for target 'elfdep' failed
>> > > -----------
>> > >
>> > > I bissected it down to commit 5f9380572b4bb24f60cd492b1
>> > >
>> > > Author: Jakub Kicinski <jakub.kicinski@netronome.com>
>> > > Date:   Thu May 10 10:24:39 2018 -0700
>> > >
>> > >     samples: bpf: compile and link against full libbpf
>> > > ---------
>> > >
>> > > Checking out a kernel before this commit makes the samples build. Also I do
>> > > have libelf on my system.
>> > >
>> > > Any thoughts on this issue?
>>
>> There is some weirdness going on with my kernel tree. If I do a fresh clone
>> of v4.18 and build samples, everything works.
>>
>> However if I take my existing checkout, do a:
>> git clean -f -d
>> make mrproper
>>
>> Then I try to build the samples, I get the "No libelf found".
>>
>> Obviously the existing checked out kernel tree is in some weird state that I
>> am not yet able to fix. But atleast if I blow the whole tree and clone again,
>> I'm able to build...
>>
>> Is this related to the intermittent "No libelf found" issues that were
>> recently discussed?
>
> Can't reproduce, could you provide all exact commands you run to see
> this, including the initial clone?

Not sure if you saw that I replied to my own email. As I was saying,
doing a fresh clone and build of the kernel tree makes things work for
me. The problematic kernel tree which I cloned many months ago was the
one I was using when I reported the issue.

On the problematic tree, the steps I used to reproduce the issue were:
git clean -f -d
make mrproper
make x86_64_defconfig
cd samples/bpf/
make

I have since moved on to using the freshly cloned tree, since that's working for me.

I will let you know if I run into this again. Thanks for your time!

 - Joel

* Re: [offlist] Re: Crash in netlink/sk_filter_trim_cap on ARMv7 on 4.18rc1
From: Marc Haber @ 2018-08-16 20:35 UTC
  To: Peter Robinson
  Cc: linux-arm-kernel, netdev, labbott, Eric Dumazet, Daniel Borkmann
In-Reply-To: <CALeDE9OOrZUnaNpzkYPU30iN=4HFQaqEomjf14EO5EtcnHu8OQ@mail.gmail.com>

On Mon, Jun 25, 2018 at 05:41:27PM +0100, Peter Robinson wrote:
> So with that and the other fix there was no improvement, with those
> and the BPF JIT disabled it works, I'm not sure if the two patches
> have any effect with the JIT disabled though.

I can confirm the crash with the released 4.18.1 on Banana Pi, and I can
also confirm that disabling BPF JIT makes the Banana Pi work again.

Greetings
Marc
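
As a reading aid for the first oops in the log below: the parenthesised word
on its "Code:" line, e584900c, is the faulting instruction. Decoded by hand
as an A32 immediate-offset load/store it comes out as a word store,
str r9, [r4, #12], and with r4 shown as 00000000 in the register dump, base
plus offset gives the reported fault address 0000000c. The decoder below is
a sketch that handles only this one encoding form; its struct and function
names are made up for illustration.

```c
#include <stdint.h>

struct sketch_ldst {
	int is_single_xfer;	/* bits 27:26 == 01, immediate offset form */
	int is_load;		/* L bit: 1 = LDR, 0 = STR */
	int add_offset;		/* U bit: 1 = base + offset */
	unsigned int rn, rd;	/* base and source/destination registers */
	unsigned int imm12;	/* unsigned 12-bit offset */
};

/* Split an A32 "LDR/STR Rd, [Rn, #imm12]" word into its fields. */
struct sketch_ldst sketch_decode_ldst(uint32_t insn)
{
	struct sketch_ldst d;

	d.is_single_xfer = ((insn >> 26) & 0x3) == 0x1 && !((insn >> 25) & 1);
	d.is_load = (insn >> 20) & 1;
	d.add_offset = (insn >> 23) & 1;
	d.rn = (insn >> 16) & 0xf;
	d.rd = (insn >> 12) & 0xf;
	d.imm12 = insn & 0xfff;
	return d;
}
```

Feeding 0xe584900c to sketch_decode_ldst() yields a store with rn = 4,
rd = 9 and an offset of 12. That matches a write through a NULL structure
pointer at field offset 0xc, but it says nothing by itself about why the
pointer is NULL.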

[    0.004930] /cpus/cpu@0 missing clock-frequency property
[    0.004965] /cpus/cpu@1 missing clock-frequency property
[    4.959858] zswap: default zpool zbud not available
[    4.964820] zswap: pool creation failed
  WARNING: Failed to connect to lvmetad. Falling back to device scanning.
  WARNING: Failed to connect to lvmetad. Falling back to device scanning.
[   10.721077] Unable to handle kernel NULL pointer dereference at virtual address 0000000c
[   10.722949] Unable to handle kernel NULL pointer dereference at virtual address 0000000c
[   10.729288] pgd = (ptrval)
[   10.729299] [0000000c] *pgd=6dc65003, *pmd=00000000
[   10.737464] pgd = (ptrval)
[   10.740176] Internal error: Oops: a06 [#1] SMP ARM
[   10.745056] [0000000c] *pgd=6e72a003
[   10.747742] Modules linked in: ip_tables x_tables autofs4 btrfs
[   10.752561] , *pmd=00000000
[   10.756113]  libcrc32c crc32c_generic xor zstd_decompress zstd_compress xxhash
[   10.764833]  zlib_deflate raid6_pq dm_mod dax axp20x_regulator realtek ahci_sunxi dwmac_sunxi stmmac_platform libahci_platform stmmac i2c_mv64xxx libahci libata scsi_mod ohci_platform ohci_hcd ehci_platform ehci_hcd phy_sun4i_usb sunxi_mmc
[   10.793306] CPU: 1 PID: 238 Comm: systemd-udevd Not tainted 4.18.1-zgbpi-armmp-lpae #3
[   10.801212] Hardware name: Allwinner sun7i (A20) Family
[   10.806448] PC is at sk_filter_trim_cap+0xa0/0x1d4
[   10.811238] LR is at   (null)
[   10.814205] pc : [<c06de388>]    lr : [<00000000>]    psr: 600f0013
[   10.820466] sp : edc7dcf8  ip : 00000000  fp : edc7dd34
[   10.825686] r10: 00000000  r9 : 00000000  r8 : 00000000
[   10.830907] r7 : 00000001  r6 : f0e96000  r5 : c0e04cc8  r4 : 00000000
[   10.837428] r3 : 00000007  r2 : fb5e2d70  r1 : 00000000  r0 : 00000000
[   10.843952] Flags: nZCv  IRQs on  FIQs on  Mode SVC_32  ISA ARM  Segment user
[   10.851081] Control: 30c5387d  Table: 6e6c7580  DAC: 2c983336
[   10.856822] Process systemd-udevd (pid: 238, stack limit = 0x(ptrval))
[   10.863344] Stack: (0xedc7dcf8 to 0xedc7e000)
[   10.867700] dce0:                                                       edc7dd1c edc7dd08
[   10.875873] dd00: c06a41dc c06a4048 ee7d39c0 fb5e2d70 ee479800 ee6c2400 edc33840 c0e6aac0
[   10.884046] dd20: 00000000 00000001 edc7dd8c edc7dd38 c0705884 c06de2f4 edc7de24 00000001
[   10.892219] dd40: c0ec649c ee479864 00000000 00000000 ee7d39c0 00000000 00000000 00000002
[   10.900391] dd60: 00000000 edc7df44 c0e04cc8 ee7d39c0 ee6c2400 00000000 0000008c 00000002
[   10.908565] dd80: edc7ddf4 edc7dd90 c0705ee0 c0705610 006000c0 00000000 00000000 fb5e2d70
[   10.916737] dda0: 00000008 00000000 00000000 ef357c80 00000000 000000ee 00000000 00000000
[   10.924910] ddc0: 00000000 fb5e2d70 0000008c edc7df44 eef08700 00000040 00000000 eef08700
[   10.933083] dde0: 00000000 edc7dedc edc7de0c edc7ddf8 c069b948 c0705b78 edc7df44 c0e04cc8
[   10.941256] de00: edc7df2c edc7de10 c069c2f8 c069b910 c0e04cc8 edc7dec0 00000000 be8dcfac
[   10.949428] de20: 00000028 0186a660 00000064 bf387954 edc7df48 be8dcf80 00000000 00000000
[   10.957602] de40: be8dcf80 b6f19ce8 00000128 40000028 b6e01346 00000000 0000000e 00000010
[   10.965774] de60: 00000000 00000002 00000000 00000000 00000000 00000000 be8dcf80 00000000
[   10.973948] de80: b6f19ce8 00000000 00000000 fb5e2d70 edc7deb4 ffffe000 00000000 c0e04cc8
[   10.982120] dea0: 00000128 c0201204 00000000 00000080 edc7df6c edc7dec0 c02f5e2c c02f5c18
[   10.990293] dec0: 00000000 fb5e2d70 edc7def4 a0010013 c9f1e000 c03f986c edc7df50 00000000
[   10.998466] dee0: 0000000e 00004000 edc7df3c fb5e2d70 c0409c98 c0409d34 edc7df14 fb5e2d70
[   11.006639] df00: c0409d34 c0e04cc8 be8dcf80 00000000 eef08700 c0201204 edc7c000 00000128
[   11.014812] df20: edc7df94 edc7df30 c069d818 c069c0a0 00000000 00000000 c0e04cc8 00000000
[   11.022984] df40: fffffff7 edc7de5c 0000000c 00000001 00000000 00000000 edc7de2c 00000000
[   11.031156] df60: edc7df7c 00000000 00000000 00000040 00000000 fb5e2d70 be8dcf80 b6f19ce8
[   11.039329] df80: 01878670 00000128 edc7dfa4 edc7df98 c069d870 c069d7c4 00000000 edc7dfa8
[   11.047502] dfa0: c02011cc c069d860 be8dcf80 b6f19ce8 0000000e be8dcf80 00000000 00000000
[   11.055675] dfc0: be8dcf80 b6f19ce8 01878670 00000128 00000000 00000064 01878e80 00000000
[   11.063848] dfe0: 00000128 be8dcf50 b6e003e3 b6e01346 200f0030 0000000e 00000000 00000000
[   11.072038] [<c06de388>] (sk_filter_trim_cap) from [<c0705884>] (netlink_broadcast_filtered+0x280/0x460)
[   11.081517] [<c0705884>] (netlink_broadcast_filtered) from [<c0705ee0>] (netlink_sendmsg+0x374/0x3b0)
[   11.090734] [<c0705ee0>] (netlink_sendmsg) from [<c069b948>] (sock_sendmsg+0x44/0x54)
[   11.098567] [<c069b948>] (sock_sendmsg) from [<c069c2f8>] (___sys_sendmsg+0x264/0x278)
[   11.106485] [<c069c2f8>] (___sys_sendmsg) from [<c069d818>] (__sys_sendmsg+0x60/0x9c)
[   11.114315] [<c069d818>] (__sys_sendmsg) from [<c069d870>] (sys_sendmsg+0x1c/0x20)
[   11.121886] [<c069d870>] (sys_sendmsg) from [<c02011cc>] (__sys_trace_return+0x0/0x10)
[   11.129793] Exception stack(0xedc7dfa8 to 0xedc7dff0)
[   11.134845] dfa0:                   be8dcf80 b6f19ce8 0000000e be8dcf80 00000000 00000000
[   11.143019] dfc0: be8dcf80 b6f19ce8 01878670 00000128 00000000 00000064 01878e80 00000000
[   11.151188] dfe0: 00000128 be8dcf50 b6e003e3 b6e01346
[   11.156243] Code: e3130010 e1a0c000 1a000030 e35c0000 (e584900c) 
[   11.162340] Internal error: Oops: a06 [#2] SMP ARM
[   11.162559] ---[ end trace 1b60255ae59ac006 ]---
[   11.167129] Modules linked in: ip_tables x_tables autofs4 btrfs libcrc32c crc32c_generic xor zstd_decompress zstd_compress xxhash zlib_deflate raid6_pq dm_mod dax axp20x_regulator realtek ahci_sunxi dwmac_sunxi stmmac_platform libahci_platform stmmac i2c_mv64xxx libahci libata scsi_mod ohci_platform ohci_hcd ehci_platform ehci_hcd phy_sun4i_usb sunxi_mmc
[   11.185005] Unable to handle kernel NULL pointer dereference at virtual address 0000000c
[   11.203186] CPU: 0 PID: 237 Comm: systemd-udevd Tainted: G      D           4.18.1-zgbpi-armmp-lpae #3
[   11.203191] Hardware name: Allwinner sun7i (A20) Family
[   11.203216] PC is at sk_filter_trim_cap+0xa0/0x1d4
[   11.203223] LR is at   (null)
[   11.203229] pc : [<c06de388>]    lr : [<00000000>]    psr: 600f0013
[   11.203234] sp : edc41cf8  ip : 00000000  fp : edc41d34
[   11.203239] r10: 00000000  r9 : 00000000  r8 : 00000000
[   11.203245] r7 : 00000001  r6 : f0e96000  r5 : c0e04cc8  r4 : 00000000
[   11.203250] r3 : 00000007  r2 : fb5e2d70  r1 : 00000000  r0 : 00000000
[   11.203258] Flags: nZCv  IRQs on  FIQs on  Mode SVC_32  ISA ARM  Segment user
[   11.203264] Control: 30c5387d  Table: 6e6c84c0  DAC: fffffffd
[   11.203270] Process systemd-udevd (pid: 237, stack limit = 0x(ptrval))
[   11.203276] Stack: (0xedc41cf8 to 0xedc42000)
[   11.203288] 1ce0:                                                       edc41d1c edc41d08
[   11.211398] pgd = (ptrval)
[   11.220660] 1d00: c06a41dc c06a4048 c9c16cc0 fb5e2d70 ee479800 ee6c6400 c9c16240 c0e6aac0
[   11.220670] 1d20: 00000000 00000001 edc41d8c edc41d38 c0705884 c06de2f4 edc41e24 00000001
[   11.220680] 1d40: c0ec649c ee479864 00000000 00000000 c9c16cc0 00000000 00000000 00000002
[   11.220693] 1d60: 00000000 edc41f44 c0e04cc8 c9c16cc0 ee6c6400 00000000 00000085 00000002
[   11.226034] [0000000c] *pgd=6dc79003
[   11.230697] 1d80: edc41df4 edc41d90 c0705ee0 c0705610 006000c0 00000000 00000000 fb5e2d70
[   11.230707] 1da0: 00000008 00000000 00000000 ef357300 00000000 000000ed 00000000 00000000
[   11.230717] 1dc0: 00000000 fb5e2d70 00000085 edc41f44 ee0591c0 00000040 00000000 ee0591c0
[   11.230730] 1de0: 00000000 edc41edc edc41e0c edc41df8 c069b948 c0705b78 edc41f44 c0e04cc8
[   11.233692] , *pmd=00000000
[   11.239953] 1e00: edc41f2c edc41e10 c069c2f8 c069b910 c0e04cc8 edc41ec0 00000000 be8dcfac
[   11.239963] 1e20: 00000028 0186a660 0000005d bf387954 edc41f48 be8dcf80 00000000 00000000
[   11.239973] 1e40: be8dcf80 b6f19ce8 00000128 40000028 b6e01346 00000000 0000000d 00000010
[   11.239982] 1e60: 00000000 00000002 00000000 00000000 00000000 00000000 be8dcf80 00000000
[   11.239992] 1e80: b6f19ce8 00000000 00000000 fb5e2d70 edc41eb4 ffffe000 00000000 c0e04cc8
[   11.240002] 1ea0: 00000128 c0201204 00000000 00000080 edc41f6c edc41ec0 c02f5e2c c02f5c18
[   11.250433] 1ec0: 00000000 fb5e2d70 edc41ef4 a0010013 c9def000 c03f986c edc41f50 00000000
[   11.250443] 1ee0: 0000000d 00004000 edc41f3c fb5e2d70 c0409c98 c0409d34 edc41f14 fb5e2d70
[   11.250454] 1f00: c0409d34 c0e04cc8 be8dcf80 00000000 ee0591c0 c0201204 edc40000 00000128
[   11.250463] 1f20: edc41f94 edc41f30 c069d818 c069c0a0 00000000 00000000 c0e04cc8 00000000
[   11.451342] 1f40: fffffff7 edc41e5c 0000000c 00000001 00000000 00000000 edc41e2c 00000000
[   11.459515] 1f60: edc41f7c 00000000 00000000 00000040 00000000 fb5e2d70 be8dcf80 b6f19ce8
[   11.467688] 1f80: 0186d740 00000128 edc41fa4 edc41f98 c069d870 c069d7c4 00000000 edc41fa8
[   11.475861] 1fa0: c02011cc c069d860 be8dcf80 b6f19ce8 0000000d be8dcf80 00000000 00000000
[   11.484034] 1fc0: be8dcf80 b6f19ce8 0186d740 00000128 00000000 0000005d 018776c0 00000000
[   11.492207] 1fe0: 00000128 be8dcf50 b6e003e3 b6e01346 200f0030 0000000d 00000000 00000000
[   11.500397] [<c06de388>] (sk_filter_trim_cap) from [<c0705884>] (netlink_broadcast_filtered+0x280/0x460)
[   11.509876] [<c0705884>] (netlink_broadcast_filtered) from [<c0705ee0>] (netlink_sendmsg+0x374/0x3b0)
[   11.519093] [<c0705ee0>] (netlink_sendmsg) from [<c069b948>] (sock_sendmsg+0x44/0x54)
[   11.526925] [<c069b948>] (sock_sendmsg) from [<c069c2f8>] (___sys_sendmsg+0x264/0x278)
[   11.534842] [<c069c2f8>] (___sys_sendmsg) from [<c069d818>] (__sys_sendmsg+0x60/0x9c)
[   11.542673] [<c069d818>] (__sys_sendmsg) from [<c069d870>] (sys_sendmsg+0x1c/0x20)
[   11.550244] [<c069d870>] (sys_sendmsg) from [<c02011cc>] (__sys_trace_return+0x0/0x10)
[   11.558151] Exception stack(0xedc41fa8 to 0xedc41ff0)
[   11.563202] 1fa0:                   be8dcf80 b6f19ce8 0000000d be8dcf80 00000000 00000000
[   11.571375] 1fc0: be8dcf80 b6f19ce8 0186d740 00000128 00000000 0000005d 018776c0 00000000
[   11.579544] 1fe0: 00000128 be8dcf50 b6e003e3 b6e01346
[   11.584600] Code: e3130010 e1a0c000 1a000030 e35c0000 (e584900c) 
[   11.590702] Internal error: Oops: a06 [#3] SMP ARM
[   11.590859] ---[ end trace 1b60255ae59ac007 ]---
[   11.595493] Modules linked in: ip_tables x_tables autofs4 btrfs libcrc32c crc32c_generic xor zstd_decompress zstd_compress xxhash zlib_deflate raid6_pq dm_mod dax axp20x_regulator realtek ahci_sunxi dwmac_sunxi stmmac_platform libahci_platform stmmac i2c_mv64xxx libahci libata scsi_mod ohci_platform ohci_hcd ehci_platform ehci_hcd phy_sun4i_usb sunxi_mmc
[   11.602116] Unable to handle kernel NULL pointer dereference at virtual address 0000000c
[   11.631550] CPU: 1 PID: 240 Comm: systemd-udevd Tainted: G      D           4.18.1-zgbpi-armmp-lpae #3
[   11.631555] Hardware name: Allwinner sun7i (A20) Family
[   11.631576] PC is at sk_filter_trim_cap+0xa0/0x1d4
[   11.631582] LR is at   (null)
[   11.631593] pc : [<c06de388>]    lr : [<00000000>]    psr: 600f0013
[   11.639693] pgd = (ptrval)
[   11.648959] sp : edc81cf8  ip : 00000000  fp : edc81d34
[   11.648964] r10: 00000000  r9 : 00000000  r8 : 00000000
[   11.648970] r7 : 00000001  r6 : f0e96000  r5 : c0e04cc8  r4 : 00000000
[   11.648976] r3 : 00000007  r2 : fb5e2d70  r1 : 00000000  r0 : 00000000
[   11.648983] Flags: nZCv  IRQs on  FIQs on  Mode SVC_32  ISA ARM  Segment user
[   11.648990] Control: 30c5387d  Table: 6e71a180  DAC: 2c983336
[   11.654224] [0000000c] *pgd=6dc6e003
[   11.658989] Process systemd-udevd (pid: 240, stack limit = 0x(ptrval))
[   11.658995] Stack: (0xedc81cf8 to 0xedc82000)
[   11.659002] 1ce0:                                                       edc81d1c edc81d08
[   11.659013] 1d00: c06a41dc c06a4048 ee7d36c0 fb5e2d70 ee479800 edc77800 ee7d3d80 c0e6aac0
[   11.661987] , *pmd=00000000
[   11.668231] 1d20: 00000000 00000001 edc81d8c edc81d38 c0705884 c06de2f4 edc81e24 00000001
[   11.668241] 1d40: c0ec649c ee479864 00000000 00000000 ee7d36c0 00000000 00000000 00000002
[   11.668251] 1d60: 00000000 edc81f44 c0e04cc8 ee7d36c0 edc77800 00000000 0000008a 00000002
[   11.676168] 1d80: edc81df4 edc81d90 c0705ee0 c0705610 006000c0 00000000 00000000 fb5e2d70
[   11.676178] 1da0: 00000008 00000000 00000000 ef0ea980 00000000 000000f0 00000000 00000000
[   11.676188] 1dc0: 00000000 fb5e2d70 0000008a edc81f44 ee059a80 00000040 00000000 ee059a80
[   11.789803] 1de0: 00000000 edc81edc edc81e0c edc81df8 c069b948 c0705b78 edc81f44 c0e04cc8
[   11.797977] 1e00: edc81f2c edc81e10 c069c2f8 c069b910 c0e04cc8 edc81ec0 00000000 be8dcfac
[   11.806149] 1e20: 00000028 0186ade8 00000062 bf387954 edc81f48 be8dcf80 00000000 00000000
[   11.814322] 1e40: be8dcf80 b6f19ce8 00000128 40000028 b6e01346 00000000 0000000e 00000010
[   11.822494] 1e60: 00000000 00000002 00000000 00000000 00000000 00000000 be8dcf80 00000000
[   11.830667] 1e80: b6f19ce8 00000000 00000000 fb5e2d70 edc81eb4 ffffe000 00000000 c0e04cc8
[   11.838840] 1ea0: 00000128 c0201204 00000000 00000080 edc81f6c edc81ec0 c02f5e2c c02f5c18
[   11.847013] 1ec0: 00000000 fb5e2d70 edc81ef4 a00b0013 ef3c3000 c03f986c edc81f50 00000000
[   11.855186] 1ee0: 0000000e 00004000 edc81f3c fb5e2d70 c0409c98 c0409d34 edc81f14 fb5e2d70
[   11.863359] 1f00: c0409d34 c0e04cc8 be8dcf80 00000000 ee059a80 c0201204 edc80000 00000128
[   11.871532] 1f20: edc81f94 edc81f30 c069d818 c069c0a0 00000000 00000000 c0e04cc8 00000000
[   11.879705] 1f40: fffffff7 edc81e5c 0000000c 00000001 00000000 00000000 edc81e2c 00000000
[   11.887877] 1f60: edc81f7c 00000000 00000000 00000040 00000000 fb5e2d70 be8dcf80 b6f19ce8
[   11.896051] 1f80: 0186aea0 00000128 edc81fa4 edc81f98 c069d870 c069d7c4 00000000 edc81fa8
[   11.904223] 1fa0: c02011cc c069d860 be8dcf80 b6f19ce8 0000000e be8dcf80 00000000 00000000
[   11.912397] 1fc0: be8dcf80 b6f19ce8 0186aea0 00000128 00000000 00000062 0186b6e8 00000000
[   11.920569] 1fe0: 00000128 be8dcf50 b6e003e3 b6e01346 200f0030 0000000e 00000000 00000000
[   11.928757] [<c06de388>] (sk_filter_trim_cap) from [<c0705884>] (netlink_broadcast_filtered+0x280/0x460)
[   11.938235] [<c0705884>] (netlink_broadcast_filtered) from [<c0705ee0>] (netlink_sendmsg+0x374/0x3b0)
[   11.947452] [<c0705ee0>] (netlink_sendmsg) from [<c069b948>] (sock_sendmsg+0x44/0x54)
[   11.955284] [<c069b948>] (sock_sendmsg) from [<c069c2f8>] (___sys_sendmsg+0x264/0x278)
[   11.963201] [<c069c2f8>] (___sys_sendmsg) from [<c069d818>] (__sys_sendmsg+0x60/0x9c)
[   11.971031] [<c069d818>] (__sys_sendmsg) from [<c069d870>] (sys_sendmsg+0x1c/0x20)
[   11.978602] [<c069d870>] (sys_sendmsg) from [<c02011cc>] (__sys_trace_return+0x0/0x10)
[   11.986509] Exception stack(0xedc81fa8 to 0xedc81ff0)
[   11.991560] 1fa0:                   be8dcf80 b6f19ce8 0000000e be8dcf80 00000000 00000000
[   11.999732] 1fc0: be8dcf80 b6f19ce8 0186aea0 00000128 00000000 00000062 0186b6e8 00000000
[   12.007902] 1fe0: 00000128 be8dcf50 b6e003e3 b6e01346
[   12.012957] Code: e3130010 e1a0c000 1a000030 e35c0000 (e584900c) 
[   12.019056] Internal error: Oops: a06 [#4] SMP ARM
[   12.019171] ---[ end trace 1b60255ae59ac008 ]---


-- 
-----------------------------------------------------------------------------
Marc Haber         | "I don't trust Computers. They | Mailadresse im Header
Leimen, Germany    |  lose things."    Winona Ryder | Fon: *49 6224 1600402
Nordisch by Nature |  How to make an American Quilt | Fax: *49 6224 1600421


* Re: [iproute PATCH 00/10] Review help texts and man pages
From: Stephen Hemminger @ 2018-08-16 17:29 UTC (permalink / raw)
  To: Phil Sutter; +Cc: netdev
In-Reply-To: <20180816102802.31782-1-phil@nwl.cc>

On Thu, 16 Aug 2018 12:27:52 +0200
Phil Sutter <phil@nwl.cc> wrote:

> This series fixes a number of issues identified by an automated scan
> over man pages and help texts.
> 
> Phil Sutter (10):
>   man: bridge.8: Document -oneline option
>   bridge: trivial: Make help text consistent
>   devlink: trivial: Make help text consistent
>   man: devlink.8: Document -verbose option
>   genl: Fix help text
>   man: ifstat.8: Document --json and --pretty options
>   ip: Add missing -M flag to help text
>   man: rtacct.8: Fix nstat options
>   rtmon: List options in help text
>   man: ss.8: Describe --events option
> 
>  bridge/bridge.c    |  2 +-
>  devlink/devlink.c  |  2 +-
>  genl/genl.c        |  4 ++--
>  ip/ip.c            |  2 +-
>  ip/rtmon.c         |  4 +++-
>  man/man8/bridge.8  | 15 ++++++++++++++-
>  man/man8/devlink.8 |  4 ++++
>  man/man8/ifstat.8  |  8 ++++++++
>  man/man8/rtacct.8  | 14 +++++++++-----
>  man/man8/ss.8      |  3 +++
>  10 files changed, 46 insertions(+), 12 deletions(-)
> 

Sure, applied. Had to do one bit of whitespace cleanup on genl.c


* Under what conditions is phy_device "adjust_link()" called?
From: rpjday @ 2018-08-16 17:26 UTC (permalink / raw)
  To: netdev

I can see from the documentation that the callback adjust_link() is invoked
"for the enet controller to respond to changes in the link state." Is there
a specific list of the events that would generate such a change? Are we
talking initially opening the device, ifup/ifdown, physically unplugging
from the port, some or all of the above?

Not a network expert (yet), so I'm still digging through the code. Thanks.

rday


* ANNOUNCE: pahole v1.12 (BTF edition)
From: Arnaldo Carvalho de Melo @ 2018-08-16 20:09 UTC (permalink / raw)
  To: dwarves
  Cc: Alexei Starovoitov, Martin KaFai Lau, Daniel Borkmann,
	kernel-team, Wang Nan, Jiri Olsa, Yonghong Song, Martin Cermak,
	Jan Engelhardt, Thomas Girard, Domenico Andreoli,
	Matthias Schwarzott, David Seifert, Pavel Borzenkov,
	Steven Rostedt, Matthew Wilcox, Namhyung Kim, Ingo Molnar,
	Borislav Petkov, Eric Blake, Eduardo Habkost, Thiago Macieira

	After a long time without announcements, here is pahole 1.12,
available at:

	https://fedorapeople.org/~acme/dwarves/dwarves-1.12.tar.bz2

	git://git.kernel.org/pub/scm/devel/pahole/pahole.git	

	Some distros haven't picked up 1.11, which comes with several
goodies; my bad for not having announced it more widely at the time.
The most interesting changes in it are listed at the end of this message.

	Please report any problems to me, I'll try and get problems
fixed and implement any nice suggestion you guys may have, time
permitting 8-)

	Thanks a lot to all who reported problems and provided
suggestions over the years; that is really appreciated and is what keeps
these tools useful.

	Now let's try to get the packages in the distros updated...

Regards,

- Arnaldo

The changes in 1.12 are the following:

- Add a BTF encoder (Martin KaFai Lau)

	BTF (BPF Type Format) is the metadata format that describes
the data types of a BPF program/map.  Hence, it basically focuses on the C
programming language, which modern BPF primarily uses.  The first
use case is to provide a generic pretty-print capability for a BPF map.

	BTF has its roots in CTF (Compact C-Type Format).

- Add Documentation on how to use the BTF encoder: (Arnaldo Carvalho de Melo)

	Using the Linux 'perf' tools integration with BPF/llvm/clang to
show how to generate an object file that then gets its DWARF info used
to create a .BTF ELF section with this new BTF format. That augmented
eBPF ELF object file is then loaded while 'perf ftrace -g *bpf*' is used
to show the kernel BTF validation process.

- Initial support for DW_TAG_partial_unit (Arnaldo Carvalho de Melo)

	Just by treating these sections as DW_TAG_compile_unit, which is
enough for the structs that don't contain cross-section type references
to be correctly loaded and pretty-printed with pahole.

	This doesn't affect the kernel or modules, where such DWARF
compression techniques are not used so far.

- Print cacheline boundaries in multiple union members (Arnaldo Carvalho de Melo)

	We were showing it only for the first inner union member,
as if the union were a struct; now we restart the cacheline boundaries
when moving on to print the next inner struct.

	As an example, look at 'struct audit_context' where the only
cacheline boundary printed for the following unnamed union was the first
one, for the 'socketcall' struct member, now that cacheline boundary
appears in each of the union member inner structs:

    struct audit_context {
    <SNIP>
            union {
                    struct {
                            int        nargs;                /*   824     4 */

                            /* XXX 4 bytes hole, try to pack */

                            /* --- cacheline 13 boundary (832 bytes) --- */
                            long int   args[6];              /*   832    48 */
                    } socketcall;                            /*   824    56 */
                    struct {
                            kuid_t     uid;                  /*   824     4 */
                            kgid_t     gid;                  /*   828     4 */
                            /* --- cacheline 13 boundary (832 bytes) --- */
                            umode_t    mode;                 /*   832     2 */

                            /* XXX 2 bytes hole, try to pack */

                            u32        osid;                 /*   836     4 */
                            int        has_perm;             /*   840     4 */
                            uid_t      perm_uid;             /*   844     4 */
                            gid_t      perm_gid;             /*   848     4 */
                            umode_t    perm_mode;            /*   852     2 */

                            /* XXX 2 bytes hole, try to pack */

                            long unsigned int qbytes;        /*   856     8 */
                    } ipc;                                   /*   824    40 */
                    struct {
                            mqd_t      mqdes;                /*   824     4 */

                            /* XXX 4 bytes hole, try to pack */

                            /* --- cacheline 13 boundary (832 bytes) --- */
                            struct mq_attr mqstat;           /*   832    64 */
                    } mq_getsetattr;                         /*   824    72 */
                    struct {
                            mqd_t      mqdes;                /*   824     4 */
                            int        sigev_signo;          /*   828     4 */
                    } mq_notify;                             /*   824     8 */
                    struct {
                            mqd_t      mqdes;                /*   824     4 */

                            /* XXX 4 bytes hole, try to pack */

                            /* --- cacheline 13 boundary (832 bytes) --- */
                            size_t     msg_len;              /*   832     8 */
                            unsigned int msg_prio;           /*   840     4 */

                            /* XXX 4 bytes hole, try to pack */

                            struct timespec64 abs_timeout;   /*   848    16 */
                    } mq_sendrecv;                           /*   824    40 */
                    struct {
                            int        oflag;                /*   824     4 */
                            umode_t    mode;                 /*   828     2 */

                            /* XXX 2 bytes hole, try to pack */

                            /* --- cacheline 13 boundary (832 bytes) --- */
                            struct mq_attr attr;             /*   832    64 */
                    } mq_open;                               /*   824    72 */
                    struct {
                            pid_t      pid;                  /*   824     4 */
                            struct audit_cap_data cap;       /*   828    32 */
                    } capset;                                /*   824    36 */
                    struct {
                            int        fd;                   /*   824     4 */
                            int        flags;                /*   828     4 */
                    } mmap;                                  /*   824     8 */
                    struct {
                            int        argc;                 /*   824     4 */
                    } execve;                                /*   824     4 */
                    struct {
                            char *     name;                 /*   824     8 */
                    } module;                                /*   824     8 */
            };                                               /*   824    72 */
            /* --- cacheline 14 boundary (896 bytes) --- */
            int                        fds[2];               /*   896     8 */
            struct audit_proctitle     proctitle;            /*   904    16 */

            /* size: 920, cachelines: 15, members: 46 */
            /* sum members: 912, holes: 2, sum holes: 8 */
            /* last cacheline: 24 bytes */
    };
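
The boundary comments and the summary lines in the output above follow from simple arithmetic on offsets and sizes. A minimal sketch of that arithmetic, assuming 64-byte cachelines (which matches the boundaries shown); the function names are made up for illustration and are not pahole's own code:

```c
#include <assert.h>
#include <stddef.h>

/* Assumption: 64-byte cachelines, as implied by "cacheline 13 boundary (832 bytes)". */
#define CACHELINE_SIZE 64u

/* Index of the cacheline that contains the byte at `offset`. */
static inline size_t cacheline_index(size_t offset)
{
	return offset / CACHELINE_SIZE;
}

/* True if a member at `offset` starts exactly on a cacheline boundary. */
static inline int starts_on_boundary(size_t offset)
{
	return offset % CACHELINE_SIZE == 0;
}

/* Number of cachelines a struct of `size` bytes spans, rounded up. */
static inline size_t cachelines_spanned(size_t size)
{
	return (size + CACHELINE_SIZE - 1) / CACHELINE_SIZE;
}

/* Bytes used in the last, partially filled cacheline (0 if the struct ends on a boundary). */
static inline size_t last_cacheline_bytes(size_t size)
{
	return size % CACHELINE_SIZE;
}
```

For the 920-byte struct above, cachelines_spanned(920) gives 15 and last_cacheline_bytes(920) gives 24, in line with the "cachelines: 15" and "last cacheline: 24 bytes" comments.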

- Show where a struct was used, e.g.

      $ pahole -I vmlinux
    <SNIP>
      /* Used at: /home/acme/git/perf/init/main.c */
      /* <1f4a5> /home/acme/git/perf/arch/x86/include/asm/orc_types.h:85 */
      struct orc_entry {
              s16                        sp_offset;            /*     0     2 */
              s16                        bp_offset;            /*     2     2 */
     <SNIP>

- Show offsets at union members (Arnaldo Carvalho de Melo, suggested by Matthew Wilcox):

	In complex structs with multiple complex unions, figuring out the
offset of a given union member is difficult, as one needs to find the
enclosing union and then go to its end to see the offset.

	This way, for instance, the Linux kernel's 'struct page' is now shown as:

    struct page {
            long unsigned int          flags;                /*     0     8 */
            union {
                    struct address_space * mapping;          /*     8     8 */
                    void *             s_mem;                /*     8     8 */
                    atomic_t           compound_mapcount;    /*     8     4 */
            };                                               /*     8     8 */
            union {
                    long unsigned int  index;                /*    16     8 */
                    void *             freelist;             /*    16     8 */
            };                                               /*    16     8 */
            union {
                    long unsigned int  counters;             /*    24     8 */
                    struct {
                            union {
                                    atomic_t _mapcount;      /*    24     4 */
                                    unsigned int active;     /*    24     4 */
                                    struct {
                                            unsigned int inuse:16; /*    24:16  4 */
                                            unsigned int objects:15; /*    24: 1  4 */
                                            unsigned int frozen:1; /*    24: 0  4 */
                                    };                       /*    24     4 */
                                    int units;               /*    24     4 */
                            };                               /*    24     4 */
                            atomic_t   _refcount;            /*    28     4 */
                    };                                       /*    24     8 */
            };                                               /*    24     8 */
            union {
                    struct list_head   lru;                  /*    32    16 */
                    struct dev_pagemap * pgmap;              /*    32     8 */
                    struct {
                            struct page * next;              /*    32     8 */
                            int        pages;                /*    40     4 */
                            int        pobjects;             /*    44     4 */
                    };                                       /*    32    16 */
                    struct callback_head callback_head;      /*    32    16 */
                    struct {
                            long unsigned int compound_head; /*    32     8 */
                            unsigned int compound_dtor;      /*    40     4 */
                            unsigned int compound_order;     /*    44     4 */
                    };                                       /*    32    16 */
                    struct {
                            long unsigned int __pad;         /*    32     8 */
                            pgtable_t  pmd_huge_pte;         /*    40     8 */
                    };                                       /*    32    16 */
            };                                               /*    32    16 */
            union {
                    long unsigned int  private;              /*    48     8 */
                    spinlock_t         ptl;                  /*    48     4 */
                    struct kmem_cache * slab_cache;          /*    48     8 */
            };                                               /*    48     8 */
            struct mem_cgroup *        mem_cgroup;           /*    56     8 */

            /* size: 64, cachelines: 1, members: 7 */
    };
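
The per-member offsets printed above can be cross-checked with offsetof(): every member of a union starts at the union's own offset. A small sketch, using a hypothetical 'struct demo_page' that only loosely mirrors the first members of 'struct page' (it is not the kernel definition):

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical struct for illustration only, not the kernel's 'struct page'. */
struct demo_page {
	unsigned long flags;
	union {				/* anonymous union (C11) */
		void *mapping;
		void *s_mem;
		int   compound_mapcount;
	};
	union {
		unsigned long index;
		void         *freelist;
	};
};

/* Returns 1 when all members of each union report the same offset. */
static inline int union_members_share_offset(void)
{
	return offsetof(struct demo_page, mapping) == offsetof(struct demo_page, s_mem) &&
	       offsetof(struct demo_page, s_mem) == offsetof(struct demo_page, compound_mapcount) &&
	       offsetof(struct demo_page, index) == offsetof(struct demo_page, freelist);
}
```

The absolute values depend on the ABI, so the sketch only compares offsets with each other instead of hard-coding numbers such as 8 or 16.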

- Search and use running kernel vmlinux when no file is passed (Arnaldo Carvalho de Melo)

	Now it is possible to use it just as:

    $ pahole -C sk_buff_head
    struct sk_buff_head {
            struct sk_buff *           next;                 /*     0     8 */
            struct sk_buff *           prev;                 /*     8     8 */
            __u32                      qlen;                 /*    16     4 */
            spinlock_t                 lock;                 /*    20     4 */

            /* size: 24, cachelines: 1, members: 4 */
            /* last cacheline: 24 bytes */
    };
    $

	This will look at /sys/kernel/notes, find the running kernel
build-id, and then search the usual locations (vmlinux,
/lib/modules/`uname -r`/build/vmlinux, the debuginfo package paths, etc)
to find the matching vmlinux with the DWARF info to use. Build-ids are
now ubiquitous, so this shortens the command line for the most commonly
used binary.
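
To give an idea of what "find the running kernel build-id" involves: the data in /sys/kernel/notes is a sequence of ELF notes, each with a header of three 32-bit words (name size, descriptor size, type) followed by the name and the descriptor, both padded to 4 bytes. The build-id is the descriptor of the note named "GNU" with type NT_GNU_BUILD_ID (3). The sketch below scans an in-memory buffer for it; the function names are invented here, put_note() exists only to build sample data, and none of this is taken from the pahole sources:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define NOTE_TYPE_GNU_BUILD_ID 3u	/* value of NT_GNU_BUILD_ID in <elf.h> */
#define NOTE_HDR_SIZE 12u		/* three 32-bit words: namesz, descsz, type */

/* Round up to the 4-byte alignment used for note names and descriptors. */
static inline size_t note_align(size_t n)
{
	return (n + 3u) & ~(size_t)3u;
}

/*
 * Scan `len` bytes of ELF notes (native byte order) for the GNU build-id.
 * Returns a pointer to the build-id bytes inside `buf` and stores their
 * length in `*id_len`, or returns NULL if no well-formed build-id is found.
 */
static inline const unsigned char *find_build_id(const unsigned char *buf,
						 size_t len, size_t *id_len)
{
	size_t pos = 0;

	while (pos <= len && len - pos >= NOTE_HDR_SIZE) {
		uint32_t hdr[3];
		size_t name_off, desc_off;

		memcpy(hdr, buf + pos, sizeof(hdr));
		name_off = pos + NOTE_HDR_SIZE;
		if (hdr[0] > len - name_off)
			return NULL;	/* name runs past the buffer */
		desc_off = name_off + note_align(hdr[0]);
		if (desc_off > len || hdr[1] > len - desc_off)
			return NULL;	/* descriptor runs past the buffer */

		if (hdr[2] == NOTE_TYPE_GNU_BUILD_ID && hdr[0] == 4 &&
		    memcmp(buf + name_off, "GNU", 4) == 0) {
			*id_len = hdr[1];
			return buf + desc_off;
		}
		pos = desc_off + note_align(hdr[1]);
	}
	return NULL;
}

/* Demo helper: append one note at `pos`; the caller guarantees enough room. */
static inline size_t put_note(unsigned char *buf, size_t pos, const char *name,
			      uint32_t type, const void *desc, uint32_t descsz)
{
	uint32_t namesz = (uint32_t)strlen(name) + 1;
	uint32_t hdr[3] = { namesz, descsz, type };

	memcpy(buf + pos, hdr, sizeof(hdr));
	pos += sizeof(hdr);
	memset(buf + pos, 0, note_align(namesz));
	memcpy(buf + pos, name, namesz);
	pos += note_align(namesz);
	memset(buf + pos, 0, note_align(descsz));
	memcpy(buf + pos, desc, descsz);
	return pos + note_align(descsz);
}
```

In real use the buffer would be filled by reading /sys/kernel/notes, and the resulting bytes compared against the build-id note of each candidate vmlinux.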

- Document 'pahole --hex' in the man page (Arnaldo Carvalho de Melo)

	This option shows offsets and sizes in hexadecimal, helping to
correlate with reports using that notation.

	E.g.:

    $ pahole --hex -C sk_buff_head
    struct sk_buff_head {
            struct sk_buff *           next;                 /*     0   0x8 */
            struct sk_buff *           prev;                 /*   0x8   0x8 */
            __u32                      qlen;                 /*  0x10   0x4 */
            spinlock_t                 lock;                 /*  0x14   0x4 */

            /* size: 24, cachelines: 1, members: 4 */
            /* last cacheline: 24 bytes */
    };
    $

Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>

-------------------------------------------------------------------------------

Notable changes for v1.11:

    dwarves_fprintf: Find holes when expanding types
    
    When --expand_types/-E is used we go on expanding internal types, and
    when doing that for structs we were not looking for holes in them, only
    in the main struct; fix it.
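
For readers new to the tool: a "hole" is the padding between the end of one member and the start of the next. A rough sketch of that computation, ignoring bitfields and unions; 'struct member_layout' and both functions are hypothetical names for illustration, not the dwarves data structures:

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical description of one member: where it starts and how big it is. */
struct member_layout {
	size_t offset;
	size_t size;
};

/* Padding bytes between the end of `prev` and the start of `next`. */
static inline size_t hole_after(struct member_layout prev, struct member_layout next)
{
	size_t end = prev.offset + prev.size;

	return next.offset > end ? next.offset - end : 0;
}

/*
 * Sum of all holes between consecutive members, which must be sorted by
 * offset. The number of holes is stored in `*nr_holes` when it is non-NULL.
 */
static inline size_t sum_holes(const struct member_layout *m, size_t n, size_t *nr_holes)
{
	size_t total = 0, count = 0;

	for (size_t i = 1; i < n; i++) {
		size_t hole = hole_after(m[i - 1], m[i]);

		if (hole) {
			total += hole;
			count++;
		}
	}
	if (nr_holes)
		*nr_holes = count;
	return total;
}
```

Feeding it a 4-byte member at offset 184 followed by a member at offset 192 reports a 4-byte hole, the same situation as 'on_rq' and 'exec_start' in the expanded 'struct task_struct'.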
    
    With that we can see these extra holes in an expanded Linux kernel's
    'struct task_struct':
    
    @@ -46,6 +46,9 @@
                            struct list_head * prev;                                         /*   176     8 */
                    } group_node; /*   168    16 */
                    unsigned int       on_rq;                                                /*   184     4 */
    +
    +               /* XXX 4 bytes hole, try to pack */
    +
                    /* --- cacheline 3 boundary (192 bytes) --- */
                    /* typedef u64 */ long long unsigned int exec_start;                     /*   192     8 */
                    /* typedef u64 */ long long unsigned int sum_exec_runtime;               /*   200     8 */
    @@ -86,9 +89,15 @@
                    } statistics; /*   232   216 */
                    /* --- cacheline 7 boundary (448 bytes) --- */
                    int                depth;                                                /*   448     4 */
    +
    +               /* XXX 4 bytes hole, try to pack */
    +
                    struct sched_entity * parent;                                            /*   456     8 */
                    struct cfs_rq *    cfs_rq;                                               /*   464     8 */
                    struct cfs_rq *    my_q;                                                 /*   472     8 */
    +
    +               /* XXX 32 bytes hole, try to pack */
    +
                    /* --- cacheline 8 boundary (512 bytes) --- */
                    struct sched_avg {
                            /* typedef u64 */ long long unsigned int last_update_time;       /*   512     8 */
    @@ -153,6 +162,9 @@
                            struct hrtimer_clock_base * base;                                /*   768     8 */
                            /* typedef u8 */ unsigned char state;                            /*   776     1 */
                            /* typedef u8 */ unsigned char is_rel;                           /*   777     1 */
    +
    +                       /* XXX 2 bytes hole, try to pack */
    +
                            int        start_pid;                                            /*   780     4 */
                            void *     start_site;                                           /*   784     8 */
                            char       start_comm[16];                                       /*   792    16 */
    @@ -197,6 +209,9 @@
            } tasks; /*   912    16 */
            struct plist_node {
                    int                prio;                                                 /*   928     4 */
    +
    +               /* XXX 4 bytes hole, try to pack */
    +
                    struct list_head {
                            struct list_head * next;                                         /*   936     8 */
                            struct list_head * prev;                                         /*   944     8 */
    @@ -258,12 +273,18 @@
                                    /* typedef u32 */ unsigned int val;                      /*  1136     4 */
                                    /* typedef u32 */ unsigned int flags;                    /*  1140     4 */
                                    /* typedef u32 */ unsigned int bitset;                   /*  1144     4 */
    +
    +                               /* XXX 4 bytes hole, try to pack */
    +
                                    /* --- cacheline 18 boundary (1152 bytes) --- */
                                    /* typedef u64 */ long long unsigned int time;           /*  1152     8 */
                                    u32 * uaddr2;                                            /*  1160     8 */
                            } futex;                                                         /*          40 */
                            struct {
                                    /* typedef clockid_t -> __kernel_clockid_t */ int clockid; /*  1128     4 */
    +
    +                               /* XXX 4 bytes hole, try to pack */
    +
                                    struct timespec * rmtp;                                  /*  1136     8 */
                                    struct compat_timespec * compat_rmtp;                    /*  1144     8 */
                                    /* typedef u64 */ long long unsigned int expires;        /*  1152     8 */
    @@ -426,6 +447,9 @@
            unsigned int               sessionid;                                            /*  1804     4 */
            struct seccomp {
                    int                mode;                                                 /*  1808     4 */
    +
    +               /* XXX 4 bytes hole, try to pack */
    +
                    struct seccomp_filter * filter;                                          /*  1816     8 */
            } seccomp; /*  1808    16 */
            /* typedef u32 */ unsigned int               parent_exec_id;                     /*  1824     4 */
    @@ -602,6 +626,9 @@
                    long unsigned int  backtrace[12];                                        /*  2472    96 */
                    /* --- cacheline 40 boundary (2560 bytes) was 8 bytes ago --- */
                    unsigned int       count;                                                /*  2568     4 */
    +
    +               /* XXX 4 bytes hole, try to pack */
    +
                    long unsigned int  time;                                                 /*  2576     8 */
                    long unsigned int  max;                                                  /*  2584     8 */
            } latency_record[32]; /*  2472  3840 */
    @@ -686,12 +713,18 @@
                    long unsigned int * io_bitmap_ptr;                                       /*  6600     8 */
                    long unsigned int  iopl;                                                 /*  6608     8 */
                    unsigned int       io_bitmap_max;                                        /*  6616     4 */
    +
    +               /* XXX 36 bytes hole, try to pack */
    +
                    /* --- cacheline 104 boundary (6656 bytes) --- */
                    struct fpu {
                            unsigned int last_cpu;                                           /*  6656     4 */
                            unsigned char fpstate_active;                                    /*  6660     1 */
                            unsigned char fpregs_active;                                     /*  6661     1 */
                            unsigned char counter;                                           /*  6662     1 */
    +
    +                       /* XXX 57 bytes hole, try to pack */
    +
                            /* --- cacheline 105 boundary (6720 bytes) --- */
                            union fpregs_state {
                                    struct fregs_state {
    @@ -751,6 +784,9 @@
                                            /* typedef u8 */ unsigned char no_update;        /*  6831     1 */
                                            /* typedef u8 */ unsigned char rm;               /*  6832     1 */
                                            /* typedef u8 */ unsigned char alimit;           /*  6833     1 */
    +
    +                                       /* XXX 6 bytes hole, try to pack */
    +
                                            struct math_emu_info * info;                     /*  6840     8 */
                                            /* typedef u32 */ unsigned int entry_eip;        /*  6848     4 */
                                    } soft; /*         136 */
    
-------------------------------------------------------------------------------------------------------

    dwarves_fprintf: Find holes on structs embedded in other structs
    
    Take 'struct task_struct' in the Linux kernel, these fields:
    
            /* --- cacheline 2 boundary (128 bytes) --- */
            struct sched_entity        se;                   /*   128   448 */
    
            /* XXX last struct has 24 bytes of padding */
    
            /* --- cacheline 9 boundary (576 bytes) --- */
            struct sched_rt_entity     rt;                   /*   576    48 */
    
    The sched_entity struct has 24 bytes of padding, and that info would
    only appear when printing 'struct task_struct' if class__find_holes()
    had previously been run on 'struct sched_entity' which wasn't always the
    case, make sure that happens.
    
    This results in this extra stat being printed for 'struct task_struct':
    
            /* paddings: 4, sum paddings: 38 */
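
To illustrate the kind of padding being counted here, below is a minimal sketch in plain C. It is not pahole code: `struct inner`, `struct outer` and the `TAIL_PADDING` macro are hypothetical names made up for this example, and the exact byte counts depend on the target ABI.

```c
#include <stddef.h>

/* Hypothetical stand-in for an embedded struct that ends in padding. */
struct inner {
	long a;
	char b;		/* the compiler pads after this up to the alignment of 'long' */
};

/* Hypothetical outer struct embedding 'inner', as task_struct embeds 'se'. */
struct outer {
	struct inner in;
	long c;
};

/* Bytes between the end of the last member and the end of the struct. */
#define TAIL_PADDING(type, last) \
	(sizeof(type) - (offsetof(type, last) + sizeof(((type *)0)->last)))

/* Tail padding of the embedded struct; only visible if 'inner' itself is analyzed. */
static size_t inner_tail_padding(void)
{
	return TAIL_PADDING(struct inner, b);
}
```

The padding belongs to `struct inner`, not to `struct outer`, which is why it is only reported for the outer struct if the embedded struct has been analyzed first.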

-------------------------------------------------------------------------------------------------------

    dwarves_fprintf: Fixup cacheline boundary printing on expanded structs
    
    A diff for 'pahole -EC task_struct vmlinux' should clarify what this fixes:
    
      [acme@jouet linux]$ diff -u /tmp/before.c /tmp/after.c | head -30
      --- /tmp/before.c     2016-06-29 17:00:38.082647281 -0300
      +++ /tmp/a.c  2016-06-29 17:03:36.913124779 -0300
      @@ -43,8 +43,8 @@
                            struct list_head * prev;                                         /*   176     8 */
                    } group_node; /*   168    16 */
                    unsigned int       on_rq;                                                /*   184     4 */
      +             /* --- cacheline 3 boundary (192 bytes) --- */
                    /* typedef u64 */ long long unsigned int exec_start;                     /*   192     8 */
      -             /* --- cacheline 1 boundary (64 bytes) was 4 bytes ago --- */
                    /* typedef u64 */ long long unsigned int sum_exec_runtime;               /*   200     8 */
                    /* typedef u64 */ long long unsigned int vruntime;                       /*   208     8 */
                    /* typedef u64 */ long long unsigned int prev_sum_exec_runtime;          /*   216     8 */
      @@ -53,40 +53,40 @@
                            /* typedef u64 */ long long unsigned int wait_start;             /*   232     8 */
                            /* typedef u64 */ long long unsigned int wait_max;               /*   240     8 */
                            /* typedef u64 */ long long unsigned int wait_count;             /*   248     8 */
      +                     /* --- cacheline 4 boundary (256 bytes) --- */
                            /* typedef u64 */ long long unsigned int wait_sum;               /*   256     8 */
                            /* typedef u64 */ long long unsigned int iowait_count;           /*   264     8 */
                            /* typedef u64 */ long long unsigned int iowait_sum;             /*   272     8 */
                            /* typedef u64 */ long long unsigned int sleep_start;            /*   280     8 */
                            /* typedef u64 */ long long unsigned int sleep_max;              /*   288     8 */
      -                     /* --- cacheline 1 boundary (64 bytes) --- */
                            /* typedef s64 */ long long int sum_sleep_runtime;               /*   296     8 */
                            /* typedef u64 */ long long unsigned int block_start;            /*   304     8 */
                            /* typedef u64 */ long long unsigned int block_max;              /*   312     8 */
      +                     /* --- cacheline 5 boundary (320 bytes) --- */
                            /* typedef u64 */ long long unsigned int exec_max;               /*   320     8 */
                            /* typedef u64 */ long long unsigned int slice_max;              /*   328     8 */
                            /* typedef u64 */ long long unsigned int nr_migrations_cold;     /*   336     8 */
      [acme@jouet linux]$
    
    I.e. the boundary detection was being reset at each expanded struct, do the math globally,
    using the member offset, that was already done globally and correctly.
    
    Reported-and-Tested-by: Peter Zijlstra <peterz@infradead.org>
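
The "do the math globally" idea can be sketched as follows. This is only an illustration under stated assumptions, not the dwarves_fprintf implementation: the function names are hypothetical, and the 64-byte cacheline size is assumed from the output quoted above.

```c
#include <stddef.h>

/* Assumption: 64-byte cachelines, matching the pahole output above. */
#define CACHELINE_BYTES 64u

/* Cacheline index for an offset counted from the start of the outermost struct. */
static size_t cacheline_index(size_t global_offset)
{
	return global_offset / CACHELINE_BYTES;
}

/* How far past the most recent cacheline boundary the offset lies. */
static size_t bytes_past_boundary(size_t global_offset)
{
	return global_offset % CACHELINE_BYTES;
}
```

With the global offsets from the diff, 192, 256 and 320 land exactly on the boundaries of cachelines 3, 4 and 5, regardless of how deeply the member is nested.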

-------------------------------------------------------------------------------------------------------

^ permalink raw reply

* Re: [PATCH] r8169: don't use MSI-X on RTL8106e
From: Heiner Kallweit @ 2018-08-16 19:50 UTC (permalink / raw)
  To: David Miller; +Cc: jian-hong, nic_swsd, netdev, linux-kernel, linux
In-Reply-To: <20180816.123958.750435252621963789.davem@davemloft.net>

On 16.08.2018 21:39, David Miller wrote:
> From: Heiner Kallweit <hkallweit1@gmail.com>
> Date: Thu, 16 Aug 2018 21:37:31 +0200
> 
>> On 16.08.2018 21:21, David Miller wrote:
>>> From: <jian-hong@endlessm.com>
>>> Date: Wed, 15 Aug 2018 14:21:10 +0800
>>>
>>>> Found the ethernet network on ASUS X441UAR doesn't come back on resume
>>>> from suspend when using MSI-X.  The chip is RTL8106e - version 39.
>>>
>>> Heiner, please take a look at this.
>>>
>>> You recently disabled MSI-X on RTL8168g for similar reasons.
>>>
>>> Now that we've seen two chips like this, maybe there is some other
>>> problem afoot.
>>>
>> Thanks for the hint. I saw it already and just contacted Realtek
>> whether they are aware of any MSI-X issues with particular chip
>> versions. With the chip versions I have access to MSI-X works fine.
>>
>> There's also the theoretical option that the issues are caused by
>> broken BIOS's. But so far only chip versions have been reported
>> which are very similar, at least with regard to version number
>> (2x VER_40, 1x VER_39). So they may share some buggy component.
>>
>> Let's see whether Realtek can provide some hint.
>> If more chip versions are reported having problems with MSI-X,
>> then we could switch to a whitelist or disable MSI-X in general.
> 
> It could be that we need to reprogram some register(s) on resume,
> which normally might not be needed, and that is what is causing the
> problem with some chips.
> 
Indeed. That's what I'm checking with Realtek.
In the register list in the r8169 driver there's one entry which
seems to indicate that there are MSI-X specific settings.
However this register isn't used, and the r8168 vendor driver
uses only MSI. And there are no public datasheets.

^ permalink raw reply

* Re: [PATCH v1 2/3] zinc: Introduce minimal cryptography library
From: Eric Biggers @ 2018-08-16 19:46 UTC (permalink / raw)
  To: D. J. Bernstein
  Cc: Jason A. Donenfeld, Eric Biggers, Linux Crypto Mailing List, LKML,
	Netdev, David Miller, Andrew Lutomirski, Greg Kroah-Hartman,
	Samuel Neves, Tanja Lange, Jean-Philippe Aumasson,
	Karthikeyan Bhargavan
In-Reply-To: <20180816042454.15529.qmail@cr.yp.to>

Hi Dan,

(I reordered your responses slightly to group together similar topics)

On Thu, Aug 16, 2018 at 04:24:54AM -0000, D. J. Bernstein wrote:
> Eric Biggers writes:
> > You'd probably attract more contributors if you followed established
> > open source conventions.
> 
> SUPERCOP already has thousands of implementations from hundreds of
> contributors. New speed records are more likely to appear in SUPERCOP
> than in any other cryptographic software collection. The API is shared
> by state-of-the-art benchmarks, state-of-the-art tests, three ongoing
> competitions, and increasingly popular production libraries.
[...]
> > there doesn't appear to be an official git repository for SUPERCOP,
> > nor is there any mention of how to send patches, nor is there any
> > COPYING or LICENSE file, nor even a README file.
> 
> https://bench.cr.yp.to/call-stream.html explains the API and submission
> procedure for stream ciphers. There are similar pages for other types of
> cryptographic primitives. https://bench.cr.yp.to/tips.html explains the
> develop-test cycle and various useful options.
> 
> Licenses vary across implementations. There's a minimum requirement of
> public distribution for verifiability of benchmark results, but it's up
> to individual implementors to decide what they'll allow beyond that.
> Patent status also varies; constant-time status varies; verification
> status varies; code quality varies; cryptographic security varies; etc.
> As I mentioned, SUPERCOP includes MD5 and Speck and RSA-512.

Many people may have contributed to SUPERCOP already, but that doesn't mean
there aren't things you could do to make it more appealing to contributors and
more of a community project, especially since you're suggesting it to the Linux
kernel community.  There should be a git repository with the development
history, including all documentation files so they are available even if your
website goes offline (i.e. the "bus factor" should be > 1), and there should be
a way to contribute by a conventional means that invites public review, e.g.
sending a patch to a mailing list or opening a Github pull request.

The apparent lack of a license for the SUPERCOP benchmarking framework itself is
also problematic.  Though you've marked some individual files as "Public
domain", not all files are so marked (e.g. crypto_stream/measure.c), so AFAIK
there is no explicit permission given to redistribute them.  In some people's
view, license-less projects like this aren't really free software.  So Linux
distributions may not want to take on the legal risk of distributing it, nor may
companies want to take on the risk of contributing.  It may seem silly, but some
people do take these things *very* seriously!

Fortunately, it's easy for you to fix the licensing: just add a standard license
like the MIT license as a file named COPYING in the top-level directory.

> Am I correctly gathering from this thread that someone adding a new
> implementation of a crypto primitive to the kernel has to worry about
> checking the architecture and CPU features to figure out whether the
> implementation will run? Wouldn't it make more sense to take this
> error-prone work away from the implementor and have a robust automated
> central testing mechanism, as in SUPERCOP?
> 
> Am I also correctly gathering that adding an extra implementation to the
> kernel can hurt performance, unless the implementor goes to extra effort
> to check for the CPUs where the previous implementation is faster---or
> to build some ad-hoc timing mechanism ("raid6: using algorithm avx2x4
> gen() 31737 MB/s")? Wouldn't it make more sense to take this error-prone
> work away from the implementor and have a robust automated central
> timing mechanism, as in SUPERCOP?

If you're talking about things like "don't use the AVX2 implementation if the
CPU doesn't support AVX2", then of course that has to be checked, but that's
straightforward.

If (more likely) you're talking about things like "use this NEON implementation
on Cortex-A7 but this other NEON implementation on Cortex-A53", it's up to the
developers and community to test different CPUs and make appropriate decisions,
and yes it can be very useful to have external benchmarks like SUPERCOP to refer
to, and I appreciate your work in that area.  Note that in practice, the
priority order in Linux's crypto API has actually tended to be straightforward,
like generic -> SSE2 -> SSE3 -> AVX2, since historically people haven't cared
quite enough about crypto performance in the kernel to microoptimize it for
individual CPU microarchitectures, nor have there been many weird cases like ARM
NEON where scalar instructions are free when interleaved with vector
instructions on some CPUs but not others.  Maybe that's changing as more people
need optimal crypto performance in the kernel.
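
As a rough userspace sketch of that kind of straightforward priority ordering (this is not the crypto API itself; the feature flags, table and pick_impl() are hypothetical, and the numbers only loosely mimic the role of cra_priority):

```c
#include <stddef.h>

/* Hypothetical CPU feature bits, for illustration only. */
enum { FEAT_SSE2 = 1 << 0, FEAT_SSE3 = 1 << 1, FEAT_AVX2 = 1 << 2 };

struct impl {
	const char *name;
	unsigned int required;	/* feature bits this implementation needs */
	int priority;		/* higher wins among usable implementations */
};

/* generic -> SSE2 -> SSE3 -> AVX2, in increasing priority. */
static const struct impl impls[] = {
	{ "generic", 0,         100 },
	{ "sse2",    FEAT_SSE2, 200 },
	{ "sse3",    FEAT_SSE3, 300 },
	{ "avx2",    FEAT_AVX2, 400 },
};

/* Pick the highest-priority implementation whose requirements the CPU meets. */
static const struct impl *pick_impl(unsigned int cpu_features)
{
	const struct impl *best = NULL;
	size_t i;

	for (i = 0; i < sizeof(impls) / sizeof(impls[0]); i++) {
		const struct impl *p = &impls[i];

		if ((p->required & cpu_features) != p->required)
			continue;	/* CPU lacks a required feature */
		if (!best || p->priority > best->priority)
			best = p;
	}
	return best;
}
```

A per-microarchitecture choice like the Cortex-A7 vs. Cortex-A53 case would need more than a single static priority number, which is the part that benchmarks can help with.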

> I also didn't notice anyone disputing Jason's comment about the "general
> clunkiness" of the kernel's internal crypto API---but is there really no
> consensus as to what the replacement API is supposed to be? Someone who
> simply wants to implement some primitives has to decide on function-call
> details, argue about the software location, add configuration options,
> etc.? Wouldn't it make more sense to do this centrally, as in SUPERCOP?

Not yet; that's the purpose of code review, so that a consensus can be reached.
But if/when the new APIs land, the next contributor who wants to do things
similarly will have a much easier time.

As a side note, are you certain you've received and read all responses to this
thread?  I haven't always bothered replying to your "qsecretary notice", and I
suspect that many others do the same.  (Imagine if everyone used that, so
everyone had to reply to 10 "qsecretary notices" on threads like this!)

> And then there's the bigger question of how the community is organizing
> ongoing work on accelerating---and auditing, and fixing, and hopefully
> verifying---implementations of cryptographic primitives. Does it really
> make sense that people looking for what's already been done have to go
> poking around a bunch of separate libraries? Wouldn't it make more sense
> to have one central collection of code, as in SUPERCOP? Is there any
> fundamental obstacle to having libraries share code for primitives?

A lot of code can be shared, but in practice different environments have
different constraints, and kernel programming in particular has some distinct
differences from userspace programming.  For example, you cannot just use the
FPU (including SSE, AVX, NEON, etc.) registers whenever you want to, since on
most architectures they can't be used in some contexts such as hardirq context,
and even when they *can* be used you have to run special code before and after
which does things like saving all the FPU registers to the task_struct,
disabling preemption, and/or enabling the FPU.  But disabling preemption for
long periods of time hurts responsiveness, so it's also desirable to yield the
processor occasionally, which means that assembly implementations should be
incremental rather than having a single entry point that does everything.
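
Here is a minimal userspace sketch of that incremental pattern. Everything in it is a stand-in: fake_fpu_begin()/fake_fpu_end() are hypothetical placeholders for what kernel_fpu_begin()/kernel_fpu_end() do on x86, the XOR loop stands in for a real SIMD routine, and the 4096-byte chunk size is an arbitrary assumption.

```c
#include <stddef.h>
#include <stdint.h>

/* Assumption: arbitrary upper bound on work done per FPU section. */
#define CHUNK_BYTES 4096u

/* Counts FPU sections entered, purely so the example can be inspected. */
static int fpu_sections;

/* Hypothetical placeholders for saving/restoring FPU state around SIMD use. */
static void fake_fpu_begin(void) { fpu_sections++; }
static void fake_fpu_end(void) { }

/* Stand-in for an assembly routine that handles one bounded chunk. */
static void xor_chunk(uint8_t *dst, const uint8_t *src, size_t len, uint8_t key)
{
	size_t i;

	for (i = 0; i < len; i++)
		dst[i] = src[i] ^ key;
}

/* Process the buffer in bounded chunks instead of one long FPU section. */
static void process_incrementally(uint8_t *dst, const uint8_t *src,
				  size_t len, uint8_t key)
{
	while (len) {
		size_t n = len < CHUNK_BYTES ? len : CHUNK_BYTES;

		fake_fpu_begin();
		xor_chunk(dst, src, n, key);
		fake_fpu_end();	/* between chunks the task could be preempted */

		dst += n;
		src += n;
		len -= n;
	}
}
```

The point is only the shape of the loop: the routine that touches vector registers takes a length and can be called repeatedly, so the caller decides how long each non-preemptible section lasts.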

There are also many other rules that must be followed in kernel code, like not
being free to use the %rbp register on x86 for anything other than the frame
base pointer, or having to make indirect calls in asm code through a special
macro that mitigates the Spectre vulnerability.

So yes, crypto code can sometimes be shared, but changes often do need to be
made for the kernel.

> For comparison, where can I find an explanation of how to test kernel
> crypto patches, and how fast is the develop-test cycle? Okay, I don't
> have a kernel crypto patch, but I did write a kernel patch recently that
> (I think) fixes some recent Lenovo ACPI stupidity:
> 
>    https://marc.info/?l=qubes-users&m=153308905514481
> 
> I'd propose this for review and upstream adoption _if_ it survives
> enough tests---but what's the right test procedure? I see superficial
> documentation of where to submit a patch for review, but am I really
> supposed to do this before serious testing? The patch works on my
> laptop, and several other people say it works, but obviously this is
> missing the big question of whether the patch breaks _other_ laptops.
> I see an online framework for testing, but using it looks awfully
> complicated, and the level of coverage is unclear to me. Has anyone
> tried to virtualize kernel testing---to capture hardware data from many
> machines and then centrally simulate kernels running on those machines,
> for example to check that those machines don't take certain code paths?
> I suppose that people who work with the kernel all the time would know
> what to do, but for me the lack of information was enough of a deterrent
> that I switched to doing something else.

The process for submitting Linux kernel patches is very well documented.  If
you're interested in contributing, you're welcome to read
Documentation/process/submitting-patches.rst and the other documentation.

As for testing, things are different for different parts of the kernel.  Some
parts are covered well by automated tests, others rely more on manual tests, and
others rely more on a large community running the kernel, especially -rc
(release candidate) kernels, and reporting any problems.  For the crypto API
specifically, there are correctness self-tests that are automatically run when
an option is set in the kernel .config.  Developers should always run them, but
even if they don't, other people are running them too on various hardware.

ACPI workarounds for firmware bugs are a somewhat different story...  AFAIK,
those types of patches rely on testing and review done by the developer, the
subsystem maintainers, and the broader community; as well as the community
reporting any problems they experience in -rc kernels.  Many people, such as
myself, run -rc kernels on their computer(s) and will report any problems found.
Yes, the standards of correctness for those types of things are not as high as
for crypto algorithms, which is a shame; but when you're talking about hacking
around bugs in specific firmware versions in specific laptops, there's probably
not much more that can be done in practice...  So I encourage you to still send
out your ACPI patch!

Thanks,

- Eric

^ permalink raw reply

* [bpf-next RFC 3/3] selftests/bpf: test bpf flow dissection
From: Petar Penkov @ 2018-08-16 16:44 UTC (permalink / raw)
  To: netdev; +Cc: davem, ast, daniel, simon.horman, Petar Penkov, Willem de Bruijn
In-Reply-To: <20180816164423.14368-1-peterpenkov96@gmail.com>

From: Petar Penkov <ppenkov@google.com>

Adds a test that sends different types of packets over multiple
tunnels and verifies that valid packets are dissected correctly.  To do
so, a tc-flower rule is added to drop packets on UDP src port 9, and
packets are sent from ports 8, 9, and 10. Only the packets on port 9
should be dropped. Because tc-flower relies on the flow dissector to
match flows, correct classification demonstrates correct dissection.

Also add support logic to load the BPF program and to inject the test
packets.

Signed-off-by: Petar Penkov <ppenkov@google.com>
Signed-off-by: Willem de Bruijn <willemb@google.com>
---
 tools/testing/selftests/bpf/.gitignore        |   2 +
 tools/testing/selftests/bpf/Makefile          |   6 +-
 tools/testing/selftests/bpf/config            |   1 +
 .../selftests/bpf/flow_dissector_load.c       | 140 ++++
 .../selftests/bpf/test_flow_dissector.c       | 782 ++++++++++++++++++
 .../selftests/bpf/test_flow_dissector.sh      | 115 +++
 tools/testing/selftests/bpf/with_addr.sh      |  54 ++
 tools/testing/selftests/bpf/with_tunnels.sh   |  36 +
 8 files changed, 1134 insertions(+), 2 deletions(-)
 create mode 100644 tools/testing/selftests/bpf/flow_dissector_load.c
 create mode 100644 tools/testing/selftests/bpf/test_flow_dissector.c
 create mode 100755 tools/testing/selftests/bpf/test_flow_dissector.sh
 create mode 100755 tools/testing/selftests/bpf/with_addr.sh
 create mode 100755 tools/testing/selftests/bpf/with_tunnels.sh

diff --git a/tools/testing/selftests/bpf/.gitignore b/tools/testing/selftests/bpf/.gitignore
index 49938d72cf63..e61a85ac4b79 100644
--- a/tools/testing/selftests/bpf/.gitignore
+++ b/tools/testing/selftests/bpf/.gitignore
@@ -19,3 +19,5 @@ test_btf
 test_sockmap
 test_lirc_mode2_user
 get_cgroup_id_user
+test_flow_dissector
+flow_dissector_load
diff --git a/tools/testing/selftests/bpf/Makefile b/tools/testing/selftests/bpf/Makefile
index e65f50f9185e..fd3851d5c079 100644
--- a/tools/testing/selftests/bpf/Makefile
+++ b/tools/testing/selftests/bpf/Makefile
@@ -47,10 +47,12 @@ TEST_PROGS := test_kmod.sh \
 	test_tunnel.sh \
 	test_lwt_seg6local.sh \
 	test_lirc_mode2.sh \
-	test_skb_cgroup_id.sh
+	test_skb_cgroup_id.sh \
+	test_flow_dissector.sh
 
 # Compile but not part of 'make run_tests'
-TEST_GEN_PROGS_EXTENDED = test_libbpf_open test_sock_addr test_skb_cgroup_id_user
+TEST_GEN_PROGS_EXTENDED = test_libbpf_open test_sock_addr test_skb_cgroup_id_user \
+	flow_dissector_load test_flow_dissector
 
 include ../lib.mk
 
diff --git a/tools/testing/selftests/bpf/config b/tools/testing/selftests/bpf/config
index b4994a94968b..3655508f95fd 100644
--- a/tools/testing/selftests/bpf/config
+++ b/tools/testing/selftests/bpf/config
@@ -18,3 +18,4 @@ CONFIG_CRYPTO_HMAC=m
 CONFIG_CRYPTO_SHA256=m
 CONFIG_VXLAN=y
 CONFIG_GENEVE=y
+CONFIG_NET_CLS_FLOWER=m
diff --git a/tools/testing/selftests/bpf/flow_dissector_load.c b/tools/testing/selftests/bpf/flow_dissector_load.c
new file mode 100644
index 000000000000..d3273b5b3173
--- /dev/null
+++ b/tools/testing/selftests/bpf/flow_dissector_load.c
@@ -0,0 +1,140 @@
+// SPDX-License-Identifier: GPL-2.0
+#include <error.h>
+#include <errno.h>
+#include <getopt.h>
+#include <stdio.h>
+#include <stdlib.h>
+#include <string.h>
+#include <sys/stat.h>
+#include <fcntl.h>
+#include <unistd.h>
+#include <bpf/bpf.h>
+#include <bpf/libbpf.h>
+
+const char *cfg_pin_path = "/sys/fs/bpf/flow_dissector";
+const char *cfg_map_name = "jmp_table";
+bool cfg_attach = true;
+char *cfg_section_name;
+char *cfg_path_name;
+
+static void load_and_attach_program(void)
+{
+	struct bpf_program *prog, *main_prog;
+	struct bpf_map *prog_array;
+	int i, fd, prog_fd, ret;
+	struct bpf_object *obj;
+	int prog_array_fd;
+
+	ret = bpf_prog_load(cfg_path_name, BPF_PROG_TYPE_FLOW_DISSECTOR, &obj,
+			    &prog_fd);
+	if (ret)
+		error(1, 0, "bpf_prog_load %s", cfg_path_name);
+
+	main_prog = bpf_object__find_program_by_title(obj, cfg_section_name);
+	if (!main_prog)
+		error(1, 0, "bpf_object__find_program_by_title %s",
+		      cfg_section_name);
+
+	prog_fd = bpf_program__fd(main_prog);
+	if (prog_fd < 0)
+		error(1, 0, "bpf_program__fd");
+
+	prog_array = bpf_object__find_map_by_name(obj, cfg_map_name);
+	if (!prog_array)
+		error(1, 0, "bpf_object__find_map_by_name %s", cfg_map_name);
+
+	prog_array_fd = bpf_map__fd(prog_array);
+	if (prog_array_fd < 0)
+		error(1, 0, "bpf_map__fd %s", cfg_map_name);
+
+	i = 0;
+	bpf_object__for_each_program(prog, obj) {
+		fd = bpf_program__fd(prog);
+		if (fd < 0)
+			error(1, 0, "bpf_program__fd");
+
+		if (fd != prog_fd) {
+			printf("%d: %s\n", i, bpf_program__title(prog, false));
+			bpf_map_update_elem(prog_array_fd, &i, &fd, BPF_ANY);
+			++i;
+		}
+	}
+
+	ret = bpf_prog_attach(prog_fd, 0 /* Ignore */, BPF_FLOW_DISSECTOR, 0);
+	if (ret)
+		error(1, 0, "bpf_prog_attach %s", cfg_path_name);
+
+	ret = bpf_object__pin(obj, cfg_pin_path);
+	if (ret)
+		error(1, 0, "bpf_object__pin %s", cfg_pin_path);
+
+}
+
+static void detach_program(void)
+{
+	char command[64];
+	int ret;
+
+	ret = bpf_prog_detach(0, BPF_FLOW_DISSECTOR);
+	if (ret)
+		error(1, 0, "bpf_prog_detach");
+
+	/* To unpin, it is necessary and sufficient to just remove this dir */
+	snprintf(command, sizeof(command), "rm -r %s", cfg_pin_path);
+	ret = system(command);
+	if (ret)
+		error(1, errno, "%s", command);
+}
+
+static void parse_opts(int argc, char **argv)
+{
+	bool attach = false;
+	bool detach = false;
+	int c;
+
+	while ((c = getopt(argc, argv, "adp:s:")) != -1) {
+		switch (c) {
+		case 'a':
+			if (detach)
+				error(1, 0, "attach/detach are exclusive");
+			attach = true;
+			break;
+		case 'd':
+			if (attach)
+				error(1, 0, "attach/detach are exclusive");
+			detach = true;
+			break;
+		case 'p':
+			if (cfg_path_name)
+				error(1, 0, "only one prog name can be given");
+
+			cfg_path_name = optarg;
+			break;
+		case 's':
+			if (cfg_section_name)
+				error(1, 0, "only one section can be given");
+
+			cfg_section_name = optarg;
+			break;
+		}
+	}
+
+	if (detach)
+		cfg_attach = false;
+
+	if (cfg_attach && !cfg_path_name)
+		error(1, 0, "must provide a path to the BPF program");
+
+	if (cfg_attach && !cfg_section_name)
+		error(1, 0, "must provide a section name");
+}
+
+int main(int argc, char **argv)
+{
+	parse_opts(argc, argv);
+	if (cfg_attach)
+		load_and_attach_program();
+	else
+		detach_program();
+	return 0;
+}
diff --git a/tools/testing/selftests/bpf/test_flow_dissector.c b/tools/testing/selftests/bpf/test_flow_dissector.c
new file mode 100644
index 000000000000..12b784afba31
--- /dev/null
+++ b/tools/testing/selftests/bpf/test_flow_dissector.c
@@ -0,0 +1,782 @@
+// SPDX-License-Identifier: GPL-2.0
+/*
+ * Inject packets with all sorts of encapsulation into the kernel.
+ *
+ * IPv4/IPv6	outer layer 3
+ * GRE/GUE/BARE outer layer 4, where bare is IPIP/SIT/IPv4-in-IPv6/..
+ * IPv4/IPv6    inner layer 3
+ */
+
+#define _GNU_SOURCE
+
+#include <stddef.h>
+#include <arpa/inet.h>
+#include <asm/byteorder.h>
+#include <error.h>
+#include <errno.h>
+#include <linux/if_packet.h>
+#include <linux/if_ether.h>
+#include <linux/if_packet.h>
+#include <linux/ipv6.h>
+#include <netinet/ip.h>
+#include <netinet/in.h>
+#include <netinet/udp.h>
+#include <poll.h>
+#include <stdbool.h>
+#include <stdlib.h>
+#include <stdio.h>
+#include <stdlib.h>
+#include <string.h>
+#include <sys/ioctl.h>
+#include <sys/socket.h>
+#include <sys/stat.h>
+#include <sys/time.h>
+#include <sys/types.h>
+#include <unistd.h>
+
+#define CFG_PORT_INNER	8000
+
+/* Add some protocol definitions that do not exist in userspace */
+
+struct grehdr {
+	uint16_t unused;
+	uint16_t protocol;
+} __attribute__((packed));
+
+struct guehdr {
+	union {
+		struct {
+#if defined(__LITTLE_ENDIAN_BITFIELD)
+			__u8	hlen:5,
+				control:1,
+				version:2;
+#elif defined (__BIG_ENDIAN_BITFIELD)
+			__u8	version:2,
+				control:1,
+				hlen:5;
+#else
+#error  "Please fix <asm/byteorder.h>"
+#endif
+			__u8	proto_ctype;
+			__be16	flags;
+		};
+		__be32	word;
+	};
+};
+
+static uint8_t	cfg_dsfield_inner;
+static uint8_t	cfg_dsfield_outer;
+static uint8_t	cfg_encap_proto;
+static bool	cfg_expect_failure = false;
+static int	cfg_l3_extra = AF_UNSPEC;	/* optional SIT prefix */
+static int	cfg_l3_inner = AF_UNSPEC;
+static int	cfg_l3_outer = AF_UNSPEC;
+static int	cfg_num_pkt = 10;
+static int	cfg_num_secs = 0;
+static char	cfg_payload_char = 'a';
+static int	cfg_payload_len = 100;
+static int	cfg_port_gue = 6080;
+static bool	cfg_only_rx;
+static bool	cfg_only_tx;
+static int	cfg_src_port = 9;
+
+static char	buf[ETH_DATA_LEN];
+
+#define INIT_ADDR4(name, addr4, port)				\
+	static struct sockaddr_in name = {			\
+		.sin_family = AF_INET,				\
+		.sin_port = __constant_htons(port),		\
+		.sin_addr.s_addr = __constant_htonl(addr4),	\
+	};
+
+#define INIT_ADDR6(name, addr6, port)				\
+	static struct sockaddr_in6 name = {			\
+		.sin6_family = AF_INET6,			\
+		.sin6_port = __constant_htons(port),		\
+		.sin6_addr = addr6,				\
+	};
+
+INIT_ADDR4(in_daddr4, INADDR_LOOPBACK, CFG_PORT_INNER)
+INIT_ADDR4(in_saddr4, INADDR_LOOPBACK + 2, 0)
+INIT_ADDR4(out_daddr4, INADDR_LOOPBACK, 0)
+INIT_ADDR4(out_saddr4, INADDR_LOOPBACK + 1, 0)
+INIT_ADDR4(extra_daddr4, INADDR_LOOPBACK, 0)
+INIT_ADDR4(extra_saddr4, INADDR_LOOPBACK + 1, 0)
+
+INIT_ADDR6(in_daddr6, IN6ADDR_LOOPBACK_INIT, CFG_PORT_INNER)
+INIT_ADDR6(in_saddr6, IN6ADDR_LOOPBACK_INIT, 0)
+INIT_ADDR6(out_daddr6, IN6ADDR_LOOPBACK_INIT, 0)
+INIT_ADDR6(out_saddr6, IN6ADDR_LOOPBACK_INIT, 0)
+INIT_ADDR6(extra_daddr6, IN6ADDR_LOOPBACK_INIT, 0)
+INIT_ADDR6(extra_saddr6, IN6ADDR_LOOPBACK_INIT, 0)
+
+static unsigned long util_gettime(void)
+{
+	struct timeval tv;
+
+	gettimeofday(&tv, NULL);
+	return (tv.tv_sec * 1000) + (tv.tv_usec / 1000);
+}
+
+static void util_printaddr(const char *msg, struct sockaddr *addr)
+{
+	unsigned long off = 0;
+	char nbuf[INET6_ADDRSTRLEN];
+
+	switch (addr->sa_family) {
+	case PF_INET:
+		off = __builtin_offsetof(struct sockaddr_in, sin_addr);
+		break;
+	case PF_INET6:
+		off = __builtin_offsetof(struct sockaddr_in6, sin6_addr);
+		break;
+	default:
+		error(1, 0, "printaddr: unsupported family %u\n",
+		      addr->sa_family);
+	}
+
+	if (!inet_ntop(addr->sa_family, ((void *) addr) + off, nbuf,
+		       sizeof(nbuf)))
+		error(1, errno, "inet_ntop");
+
+	fprintf(stderr, "%s: %s\n", msg, nbuf);
+}
+
+static unsigned long add_csum_hword(const uint16_t *start, int num_u16)
+{
+	unsigned long sum = 0;
+	int i;
+
+	for (i = 0; i < num_u16; i++)
+		sum += start[i];
+
+	return sum;
+}
+
+static uint16_t build_ip_csum(const uint16_t *start, int num_u16,
+			      unsigned long sum)
+{
+	sum += add_csum_hword(start, num_u16);
+
+	while (sum >> 16)
+		sum = (sum & 0xffff) + (sum >> 16);
+
+	return ~sum;
+}
+
+static void build_ipv4_header(void *header, uint8_t proto,
+			      uint32_t src, uint32_t dst,
+			      int payload_len, uint8_t tos)
+{
+	struct iphdr *iph = header;
+
+	iph->ihl = 5;
+	iph->version = 4;
+	iph->tos = tos;
+	iph->ttl = 8;
+	iph->tot_len = htons(sizeof(*iph) + payload_len);
+	iph->id = htons(1337);
+	iph->protocol = proto;
+	iph->saddr = src;
+	iph->daddr = dst;
+	iph->check = build_ip_csum((void *) iph, iph->ihl << 1, 0);
+}
+
+static void ipv6_set_dsfield(struct ipv6hdr *ip6h, uint8_t dsfield)
+{
+	uint16_t val, *ptr = (uint16_t *)ip6h;
+
+	val = ntohs(*ptr);
+	val &= 0xF00F;
+	val |= ((uint16_t) dsfield) << 4;
+	*ptr = htons(val);
+}
+
+static void build_ipv6_header(void *header, uint8_t proto,
+			      struct sockaddr_in6 *src,
+			      struct sockaddr_in6 *dst,
+			      int payload_len, uint8_t dsfield)
+{
+	struct ipv6hdr *ip6h = header;
+
+	ip6h->version = 6;
+	ip6h->payload_len = htons(payload_len);
+	ip6h->nexthdr = proto;
+	ip6h->hop_limit = 8;
+	ipv6_set_dsfield(ip6h, dsfield);
+
+	memcpy(&ip6h->saddr, &src->sin6_addr, sizeof(ip6h->saddr));
+	memcpy(&ip6h->daddr, &dst->sin6_addr, sizeof(ip6h->daddr));
+}
+
+static uint16_t build_udp_v4_csum(const struct iphdr *iph,
+				  const struct udphdr *udph,
+				  int num_words)
+{
+	unsigned long pseudo_sum;
+	int num_u16 = sizeof(iph->saddr);	/* halfwords in saddr + daddr */
+
+	pseudo_sum = add_csum_hword((void *) &iph->saddr, num_u16);
+	pseudo_sum += htons(IPPROTO_UDP);
+	pseudo_sum += udph->len;
+	return build_ip_csum((void *) udph, num_words, pseudo_sum);
+}
+
+static uint16_t build_udp_v6_csum(const struct ipv6hdr *ip6h,
+				  const struct udphdr *udph,
+				  int num_words)
+{
+	unsigned long pseudo_sum;
+	int num_u16 = sizeof(ip6h->saddr);	/* halfwords in saddr + daddr */
+
+	pseudo_sum = add_csum_hword((void *) &ip6h->saddr, num_u16);
+	pseudo_sum += htons(ip6h->nexthdr);
+	pseudo_sum += ip6h->payload_len;
+	return build_ip_csum((void *) udph, num_words, pseudo_sum);
+}
+
+static void build_udp_header(void *header, int payload_len,
+			     uint16_t dport, int family)
+{
+	struct udphdr *udph = header;
+	int len = sizeof(*udph) + payload_len;
+
+	udph->source = htons(cfg_src_port);
+	udph->dest = htons(dport);
+	udph->len = htons(len);
+	udph->check = 0;
+	if (family == AF_INET)
+		udph->check = build_udp_v4_csum(header - sizeof(struct iphdr),
+						udph, len >> 1);
+	else
+		udph->check = build_udp_v6_csum(header - sizeof(struct ipv6hdr),
+						udph, len >> 1);
+}
+
+static void build_gue_header(void *header, uint8_t proto)
+{
+	struct guehdr *gueh = header;
+
+	gueh->proto_ctype = proto;
+}
+
+static void build_gre_header(void *header, uint16_t proto)
+{
+	struct grehdr *greh = header;
+
+	greh->protocol = htons(proto);
+}
+
+static int l3_length(int family)
+{
+	if (family == AF_INET)
+		return sizeof(struct iphdr);
+	else
+		return sizeof(struct ipv6hdr);
+}
+
+static int build_packet(void)
+{
+	int ol3_len = 0, ol4_len = 0, il3_len = 0, il4_len = 0;
+	int el3_len = 0;
+
+	if (cfg_l3_extra)
+		el3_len = l3_length(cfg_l3_extra);
+
+	/* calculate header offsets */
+	if (cfg_encap_proto) {
+		ol3_len = l3_length(cfg_l3_outer);
+
+		if (cfg_encap_proto == IPPROTO_GRE)
+			ol4_len = sizeof(struct grehdr);
+		else if (cfg_encap_proto == IPPROTO_UDP)
+			ol4_len = sizeof(struct udphdr) + sizeof(struct guehdr);
+	}
+
+	il3_len = l3_length(cfg_l3_inner);
+	il4_len = sizeof(struct udphdr);
+
+	if (el3_len + ol3_len + ol4_len + il3_len + il4_len + cfg_payload_len >=
+	    sizeof(buf))
+		error(1, 0, "packet too large\n");
+
+	/*
+	 * Fill packet from inside out, to calculate correct checksums.
+	 * But create ip before udp headers, as udp uses ip for pseudo-sum.
+	 */
+	memset(buf + el3_len + ol3_len + ol4_len + il3_len + il4_len,
+	       cfg_payload_char, cfg_payload_len);
+
+	/* add zero byte for udp csum padding */
+	buf[el3_len + ol3_len + ol4_len + il3_len + il4_len + cfg_payload_len] = 0;
+
+	switch (cfg_l3_inner) {
+	case PF_INET:
+		build_ipv4_header(buf + el3_len + ol3_len + ol4_len,
+				  IPPROTO_UDP,
+				  in_saddr4.sin_addr.s_addr,
+				  in_daddr4.sin_addr.s_addr,
+				  il4_len + cfg_payload_len,
+				  cfg_dsfield_inner);
+		break;
+	case PF_INET6:
+		build_ipv6_header(buf + el3_len + ol3_len + ol4_len,
+				  IPPROTO_UDP,
+				  &in_saddr6, &in_daddr6,
+				  il4_len + cfg_payload_len,
+				  cfg_dsfield_inner);
+		break;
+	}
+
+	build_udp_header(buf + el3_len + ol3_len + ol4_len + il3_len,
+			 cfg_payload_len, CFG_PORT_INNER, cfg_l3_inner);
+
+	if (!cfg_encap_proto)
+		return il3_len + il4_len + cfg_payload_len;
+
+	switch (cfg_l3_outer) {
+	case PF_INET:
+		build_ipv4_header(buf + el3_len, cfg_encap_proto,
+				  out_saddr4.sin_addr.s_addr,
+				  out_daddr4.sin_addr.s_addr,
+				  ol4_len + il3_len + il4_len + cfg_payload_len,
+				  cfg_dsfield_outer);
+		break;
+	case PF_INET6:
+		build_ipv6_header(buf + el3_len, cfg_encap_proto,
+				  &out_saddr6, &out_daddr6,
+				  ol4_len + il3_len + il4_len + cfg_payload_len,
+				  cfg_dsfield_outer);
+		break;
+	}
+
+	switch (cfg_encap_proto) {
+	case IPPROTO_UDP:
+		build_gue_header(buf + el3_len + ol3_len + ol4_len -
+				 sizeof(struct guehdr),
+				 cfg_l3_inner == PF_INET ? IPPROTO_IPIP
+							 : IPPROTO_IPV6);
+		build_udp_header(buf + el3_len + ol3_len,
+				 sizeof(struct guehdr) + il3_len + il4_len +
+				 cfg_payload_len,
+				 cfg_port_gue, cfg_l3_outer);
+		break;
+	case IPPROTO_GRE:
+		build_gre_header(buf + el3_len + ol3_len,
+				 cfg_l3_inner == PF_INET ? ETH_P_IP
+							 : ETH_P_IPV6);
+		break;
+	}
+
+	switch (cfg_l3_extra) {
+	case PF_INET:
+		build_ipv4_header(buf,
+				  cfg_l3_outer == PF_INET ? IPPROTO_IPIP
+							  : IPPROTO_IPV6,
+				  extra_saddr4.sin_addr.s_addr,
+				  extra_daddr4.sin_addr.s_addr,
+				  ol3_len + ol4_len + il3_len + il4_len +
+				  cfg_payload_len, 0);
+		break;
+	case PF_INET6:
+		build_ipv6_header(buf,
+				  cfg_l3_outer == PF_INET ? IPPROTO_IPIP
+							  : IPPROTO_IPV6,
+				  &extra_saddr6, &extra_daddr6,
+				  ol3_len + ol4_len + il3_len + il4_len +
+				  cfg_payload_len, 0);
+		break;
+	}
+
+	return el3_len + ol3_len + ol4_len + il3_len + il4_len +
+	       cfg_payload_len;
+}
+
+/* sender transmits encapsulated over RAW or unencap'd over UDP */
+static int setup_tx(void)
+{
+	int family, fd, ret;
+
+	if (cfg_l3_extra)
+		family = cfg_l3_extra;
+	else if (cfg_l3_outer)
+		family = cfg_l3_outer;
+	else
+		family = cfg_l3_inner;
+
+	fd = socket(family, SOCK_RAW, IPPROTO_RAW);
+	if (fd == -1)
+		error(1, errno, "socket tx");
+
+	if (cfg_l3_extra) {
+		if (cfg_l3_extra == PF_INET)
+			ret = connect(fd, (void *) &extra_daddr4,
+				      sizeof(extra_daddr4));
+		else
+			ret = connect(fd, (void *) &extra_daddr6,
+				      sizeof(extra_daddr6));
+		if (ret)
+			error(1, errno, "connect tx");
+	} else if (cfg_l3_outer) {
+		/* connect to outer destination when encapsulated */
+		if (cfg_l3_outer == PF_INET)
+			ret = connect(fd, (void *) &out_daddr4,
+				      sizeof(out_daddr4));
+		else
+			ret = connect(fd, (void *) &out_daddr6,
+				      sizeof(out_daddr6));
+		if (ret)
+			error(1, errno, "connect tx");
+	} else {
+		/* otherwise using loopback */
+		if (cfg_l3_inner == PF_INET)
+			ret = connect(fd, (void *) &in_daddr4,
+				      sizeof(in_daddr4));
+		else
+			ret = connect(fd, (void *) &in_daddr6,
+				      sizeof(in_daddr6));
+		if (ret)
+			error(1, errno, "connect tx");
+	}
+
+	return fd;
+}
+
+/* receiver reads unencapsulated UDP */
+static int setup_rx(void)
+{
+	int fd, ret;
+
+	fd = socket(cfg_l3_inner, SOCK_DGRAM, 0);
+	if (fd == -1)
+		error(1, errno, "socket rx");
+
+	if (cfg_l3_inner == PF_INET)
+		ret = bind(fd, (void *) &in_daddr4, sizeof(in_daddr4));
+	else
+		ret = bind(fd, (void *) &in_daddr6, sizeof(in_daddr6));
+	if (ret)
+		error(1, errno, "bind rx");
+
+	return fd;
+}
+
+static int do_tx(int fd, const char *pkt, int len)
+{
+	int ret;
+
+	ret = write(fd, pkt, len);
+	if (ret == -1)
+		error(1, errno, "send");
+	if (ret != len)
+		error(1, 0, "send: len (%d < %d)\n", ret, len);
+
+	return 1;
+}
+
+static int do_poll(int fd, short events, int timeout)
+{
+	struct pollfd pfd;
+	int ret;
+
+	pfd.fd = fd;
+	pfd.events = events;
+
+	ret = poll(&pfd, 1, timeout);
+	if (ret == -1)
+		error(1, errno, "poll");
+	if (ret && !(pfd.revents & POLLIN))
+		error(1, 0, "poll: unexpected event 0x%x\n", pfd.revents);
+
+	return ret;
+}
+
+static int do_rx(int fd)
+{
+	char rbuf;
+	int ret, num = 0;
+
+	while (1) {
+		ret = recv(fd, &rbuf, 1, MSG_DONTWAIT);
+		if (ret == -1 && errno == EAGAIN)
+			break;
+		if (ret == -1)
+			error(1, errno, "recv");
+		if (rbuf != cfg_payload_char)
+			error(1, 0, "recv: payload mismatch");
+		num++;
+	}
+
+	return num;
+}
+
+static int do_main(void)
+{
+	unsigned long tstop, treport, tcur;
+	int fdt = -1, fdr = -1, len, tx = 0, rx = 0;
+
+	if (!cfg_only_tx)
+		fdr = setup_rx();
+	if (!cfg_only_rx)
+		fdt = setup_tx();
+
+	len = build_packet();
+
+	tcur = util_gettime();
+	treport = tcur + 1000;
+	tstop = tcur + (cfg_num_secs * 1000);
+
+	while (1) {
+		if (!cfg_only_rx)
+			tx += do_tx(fdt, buf, len);
+
+		if (!cfg_only_tx)
+			rx += do_rx(fdr);
+
+		if (cfg_num_secs) {
+			tcur = util_gettime();
+			if (tcur >= tstop)
+				break;
+			if (tcur >= treport) {
+				fprintf(stderr, "pkts: tx=%d rx=%d\n", tx, rx);
+				tx = 0;
+				rx = 0;
+				treport = tcur + 1000;
+			}
+		} else {
+			if (tx == cfg_num_pkt)
+				break;
+		}
+	}
+
+	/* read straggler packets, if any */
+	if (rx < tx) {
+		tstop = util_gettime() + 100;
+		while (rx < tx) {
+			tcur = util_gettime();
+			if (tcur >= tstop)
+				break;
+
+			do_poll(fdr, POLLIN, tstop - tcur);
+			rx += do_rx(fdr);
+		}
+	}
+
+	fprintf(stderr, "pkts: tx=%d rx=%d\n", tx, rx);
+
+	if (fdr != -1 && close(fdr))
+		error(1, errno, "close rx");
+	if (fdt != -1 && close(fdt))
+		error(1, errno, "close tx");
+
+	/*
+	 * success (== 0) only if received all packets
+	 * unless failure is expected, in which case none must arrive.
+	 */
+	if (cfg_expect_failure)
+		return rx != 0;
+	else
+		return rx != tx;
+}
+
+
+static void __attribute__((noreturn)) usage(const char *filepath)
+{
+	fprintf(stderr, "Usage: %s [-e gre|gue|bare|none] [-i 4|6] [-l len] "
+			"[-O 4|6] [-o 4|6] [-n num] [-t secs] [-R] [-T] "
+			"[-s <osrc>] [-d <odst>] [-S <isrc>] [-D <idst>] "
+			"[-x <otos>] [-X <itos>] [-f <isport>] [-F]\n",
+		filepath);
+	exit(1);
+}
+
+static void parse_addr(int family, void *addr, const char *optarg)
+{
+	int ret;
+
+	ret = inet_pton(family, optarg, addr);
+	if (ret == -1)
+		error(1, errno, "inet_pton");
+	if (ret == 0)
+		error(1, 0, "inet_pton: bad string");
+}
+
+static void parse_addr4(struct sockaddr_in *addr, const char *optarg)
+{
+	parse_addr(AF_INET, &addr->sin_addr, optarg);
+}
+
+static void parse_addr6(struct sockaddr_in6 *addr, const char *optarg)
+{
+	parse_addr(AF_INET6, &addr->sin6_addr, optarg);
+}
+
+static int parse_protocol_family(const char *filepath, const char *optarg)
+{
+	if (!strcmp(optarg, "4"))
+		return PF_INET;
+	if (!strcmp(optarg, "6"))
+		return PF_INET6;
+
+	usage(filepath);
+}
+
+static void parse_opts(int argc, char **argv)
+{
+	int c;
+
+	while ((c = getopt(argc, argv, "d:D:e:f:Fhi:l:n:o:O:Rs:S:t:Tx:X:")) != -1) {
+		switch (c) {
+		case 'd':
+			if (cfg_l3_outer == AF_UNSPEC)
+				error(1, 0, "-d must be preceded by -o");
+			if (cfg_l3_outer == AF_INET)
+				parse_addr4(&out_daddr4, optarg);
+			else
+				parse_addr6(&out_daddr6, optarg);
+			break;
+		case 'D':
+			if (cfg_l3_inner == AF_UNSPEC)
+				error(1, 0, "-D must be preceded by -i");
+			if (cfg_l3_inner == AF_INET)
+				parse_addr4(&in_daddr4, optarg);
+			else
+				parse_addr6(&in_daddr6, optarg);
+			break;
+		case 'e':
+			if (!strcmp(optarg, "gre"))
+				cfg_encap_proto = IPPROTO_GRE;
+			else if (!strcmp(optarg, "gue"))
+				cfg_encap_proto = IPPROTO_UDP;
+			else if (!strcmp(optarg, "bare"))
+				cfg_encap_proto = IPPROTO_IPIP;
+			else if (!strcmp(optarg, "none"))
+				cfg_encap_proto = IPPROTO_IP;	/* == 0 */
+			else
+				usage(argv[0]);
+			break;
+		case 'f':
+			cfg_src_port = strtol(optarg, NULL, 0);
+			break;
+		case 'F':
+			cfg_expect_failure = true;
+			break;
+		case 'h':
+			usage(argv[0]);
+			break;
+		case 'i':
+			if (!strcmp(optarg, "4"))
+				cfg_l3_inner = PF_INET;
+			else if (!strcmp(optarg, "6"))
+				cfg_l3_inner = PF_INET6;
+			else
+				usage(argv[0]);
+			break;
+		case 'l':
+			cfg_payload_len = strtol(optarg, NULL, 0);
+			break;
+		case 'n':
+			cfg_num_pkt = strtol(optarg, NULL, 0);
+			break;
+		case 'o':
+			cfg_l3_outer = parse_protocol_family(argv[0], optarg);
+			break;
+		case 'O':
+			cfg_l3_extra = parse_protocol_family(argv[0], optarg);
+			break;
+		case 'R':
+			cfg_only_rx = true;
+			break;
+		case 's':
+			if (cfg_l3_outer == AF_INET)
+				parse_addr4(&out_saddr4, optarg);
+			else
+				parse_addr6(&out_saddr6, optarg);
+			break;
+		case 'S':
+			if (cfg_l3_inner == AF_INET)
+				parse_addr4(&in_saddr4, optarg);
+			else
+				parse_addr6(&in_saddr6, optarg);
+			break;
+		case 't':
+			cfg_num_secs = strtol(optarg, NULL, 0);
+			break;
+		case 'T':
+			cfg_only_tx = true;
+			break;
+		case 'x':
+			cfg_dsfield_outer = strtol(optarg, NULL, 0);
+			break;
+		case 'X':
+			cfg_dsfield_inner = strtol(optarg, NULL, 0);
+			break;
+		}
+	}
+
+	if (cfg_only_rx && cfg_only_tx)
+		error(1, 0, "options: cannot combine rx-only and tx-only");
+
+	if (cfg_encap_proto && cfg_l3_outer == AF_UNSPEC)
+		error(1, 0, "options: must specify outer with encap");
+	else if ((!cfg_encap_proto) && cfg_l3_outer != AF_UNSPEC)
+		error(1, 0, "options: cannot combine no-encap and outer");
+	else if ((!cfg_encap_proto) && cfg_l3_extra != AF_UNSPEC)
+		error(1, 0, "options: cannot combine no-encap and extra");
+
+	if (cfg_l3_inner == AF_UNSPEC)
+		cfg_l3_inner = AF_INET6;
+	if (cfg_l3_inner == AF_INET6 && cfg_encap_proto == IPPROTO_IPIP)
+		cfg_encap_proto = IPPROTO_IPV6;
+
+	/* RFC 6040 4.2:
+	 *   on decap, if outer encountered congestion (CE == 0x3),
+	 *   but inner cannot encode ECN (NoECT == 0x0), then drop packet.
+	 */
+	if (((cfg_dsfield_outer & 0x3) == 0x3) &&
+	    ((cfg_dsfield_inner & 0x3) == 0x0))
+		cfg_expect_failure = true;
+}
+
+static void print_opts(void)
+{
+	if (cfg_l3_inner == PF_INET6) {
+		util_printaddr("inner.dest6", (void *) &in_daddr6);
+		util_printaddr("inner.source6", (void *) &in_saddr6);
+	} else {
+		util_printaddr("inner.dest4", (void *) &in_daddr4);
+		util_printaddr("inner.source4", (void *) &in_saddr4);
+	}
+
+	if (!cfg_l3_outer)
+		return;
+
+	fprintf(stderr, "encap proto:   %u\n", cfg_encap_proto);
+
+	if (cfg_l3_outer == PF_INET6) {
+		util_printaddr("outer.dest6", (void *) &out_daddr6);
+		util_printaddr("outer.source6", (void *) &out_saddr6);
+	} else {
+		util_printaddr("outer.dest4", (void *) &out_daddr4);
+		util_printaddr("outer.source4", (void *) &out_saddr4);
+	}
+
+	if (!cfg_l3_extra)
+		return;
+
+	if (cfg_l3_extra == PF_INET6) {
+		util_printaddr("extra.dest6", (void *) &extra_daddr6);
+		util_printaddr("extra.source6", (void *) &extra_saddr6);
+	} else {
+		util_printaddr("extra.dest4", (void *) &extra_daddr4);
+		util_printaddr("extra.source4", (void *) &extra_saddr4);
+	}
+
+}
+
+int main(int argc, char **argv)
+{
+	parse_opts(argc, argv);
+	print_opts();
+	return do_main();
+}
diff --git a/tools/testing/selftests/bpf/test_flow_dissector.sh b/tools/testing/selftests/bpf/test_flow_dissector.sh
new file mode 100755
index 000000000000..c0fb073b5eab
--- /dev/null
+++ b/tools/testing/selftests/bpf/test_flow_dissector.sh
@@ -0,0 +1,115 @@
+#!/bin/bash
+# SPDX-License-Identifier: GPL-2.0
+#
+# Load BPF flow dissector and verify it correctly dissects traffic
+export TESTNAME=test_flow_dissector
+unmount=0
+
+# Kselftest framework requirement - SKIP code is 4.
+ksft_skip=4
+
+msg="skip all tests:"
+if [ $UID != 0 ]; then
+	echo $msg please run this as root >&2
+	exit $ksft_skip
+fi
+
+# This test needs to be run in a network namespace with in_netns.sh. Check if
+# this is the case and run it with in_netns.sh if it is being run in the root
+# namespace.
+if [[ -z $(ip netns identify $$) ]]; then
+	../net/in_netns.sh "$0" "$@"
+	exit $?
+fi
+
+# Determine selftest success via shell exit code
+exit_handler()
+{
+	if (( $? == 0 )); then
+		echo "selftests: $TESTNAME [PASS]";
+	else
+		echo "selftests: $TESTNAME [FAILED]";
+	fi
+
+	set +e
+
+	# Cleanup
+	tc filter del dev lo ingress pref 1337 2> /dev/null
+	tc qdisc del dev lo ingress 2> /dev/null
+	./flow_dissector_load -d 2> /dev/null
+	if [ $unmount -ne 0 ]; then
+		umount bpffs 2> /dev/null
+	fi
+}
+
+# Exit script immediately (caught by the trap handler) if any
+# program/thing exits with a non-zero status.
+set -e
+
+# (Use 'trap -l' to list meaning of numbers)
+trap exit_handler 0 2 3 6 9
+
+# Mount BPF file system
+if /bin/mount | grep /sys/fs/bpf > /dev/null; then
+	echo "bpffs already mounted"
+else
+	echo "bpffs not mounted. Mounting..."
+	unmount=1
+	/bin/mount bpffs /sys/fs/bpf -t bpf
+fi
+
+# Attach BPF program
+./flow_dissector_load -p bpf_flow.o -s dissect
+
+# Setup
+tc qdisc add dev lo ingress
+
+echo "Testing IPv4..."
+# Drops all IP/UDP packets coming from port 9
+tc filter add dev lo parent ffff: protocol ip pref 1337 flower ip_proto \
+	udp src_port 9 action drop
+
+# Send 10 IPv4/UDP packets from port 8. Filter should not drop any.
+./test_flow_dissector -i 4 -f 8
+# Send 10 IPv4/UDP packets from port 9. Filter should drop all.
+./test_flow_dissector -i 4 -f 9 -F
+# Send 10 IPv4/UDP packets from port 10. Filter should not drop any.
+./test_flow_dissector -i 4 -f 10
+
+echo "Testing IPIP..."
+# Send 10 IPv4/IPv4/UDP packets from port 8. Filter should not drop any.
+./with_addr.sh ./with_tunnels.sh ./test_flow_dissector -o 4 -e bare -i 4 \
+	-D 192.168.0.1 -S 1.1.1.1 -f 8
+# Send 10 IPv4/IPv4/UDP packets from port 9. Filter should drop all.
+./with_addr.sh ./with_tunnels.sh ./test_flow_dissector -o 4 -e bare -i 4 \
+	-D 192.168.0.1 -S 1.1.1.1 -f 9 -F
+# Send 10 IPv4/IPv4/UDP packets from port 10. Filter should not drop any.
+./with_addr.sh ./with_tunnels.sh ./test_flow_dissector -o 4 -e bare -i 4 \
+	-D 192.168.0.1 -S 1.1.1.1 -f 10
+
+echo "Testing IPv4 + GRE..."
+# Send 10 IPv4/GRE/IPv4/UDP packets from port 8. Filter should not drop any.
+./with_addr.sh ./with_tunnels.sh ./test_flow_dissector -o 4 -e gre -i 4 \
+	-D 192.168.0.1 -S 1.1.1.1 -f 8
+# Send 10 IPv4/GRE/IPv4/UDP packets from port 9. Filter should drop all.
+./with_addr.sh ./with_tunnels.sh ./test_flow_dissector -o 4 -e gre -i 4 \
+	-D 192.168.0.1 -S 1.1.1.1 -f 9 -F
+# Send 10 IPv4/GRE/IPv4/UDP packets from port 10. Filter should not drop any.
+./with_addr.sh ./with_tunnels.sh ./test_flow_dissector -o 4 -e gre -i 4 \
+	-D 192.168.0.1 -S 1.1.1.1 -f 10
+
+tc filter del dev lo ingress pref 1337
+
+echo "Testing IPv6..."
+# Drops all IPv6/UDP packets coming from port 9
+tc filter add dev lo parent ffff: protocol ipv6 pref 1337 flower ip_proto \
+	udp src_port 9 action drop
+
+# Send 10 IPv6/UDP packets from port 8. Filter should not drop any.
+./test_flow_dissector -i 6 -f 8
+# Send 10 IPv6/UDP packets from port 9. Filter should drop all.
+./test_flow_dissector -i 6 -f 9 -F
+# Send 10 IPv6/UDP packets from port 10. Filter should not drop any.
+./test_flow_dissector -i 6 -f 10
+
+exit 0
diff --git a/tools/testing/selftests/bpf/with_addr.sh b/tools/testing/selftests/bpf/with_addr.sh
new file mode 100755
index 000000000000..ffcd3953f94c
--- /dev/null
+++ b/tools/testing/selftests/bpf/with_addr.sh
@@ -0,0 +1,54 @@
+#!/bin/bash
+# SPDX-License-Identifier: GPL-2.0
+#
+# add private ipv4 and ipv6 addresses to loopback
+
+readonly V6_INNER='100::a/128'
+readonly V4_INNER='192.168.0.1/32'
+
+if getopts ":s" opt; then
+  readonly SIT_DEV_NAME='sixtofourtest0'
+  readonly V6_SIT='2::/64'
+  readonly V4_SIT='172.17.0.1/32'
+  shift
+fi
+
+fail() {
+  echo "error: $*" 1>&2
+  exit 1
+}
+
+setup() {
+  ip -6 addr add "${V6_INNER}" dev lo || fail 'failed to setup v6 address'
+  ip -4 addr add "${V4_INNER}" dev lo || fail 'failed to setup v4 address'
+
+  if [[ -n "${V6_SIT}" ]]; then
+    ip link add "${SIT_DEV_NAME}" type sit remote any local any \
+	    || fail 'failed to add sit'
+    ip link set dev "${SIT_DEV_NAME}" up \
+	    || fail 'failed to bring sit device up'
+    ip -6 addr add "${V6_SIT}" dev "${SIT_DEV_NAME}" \
+	    || fail 'failed to setup v6 SIT address'
+    ip -4 addr add "${V4_SIT}" dev "${SIT_DEV_NAME}" \
+	    || fail 'failed to setup v4 SIT address'
+  fi
+
+  sleep 2	# avoid race causing bind to fail
+}
+
+cleanup() {
+  if [[ -n "${V6_SIT}" ]]; then
+    ip -4 addr del "${V4_SIT}" dev "${SIT_DEV_NAME}"
+    ip -6 addr del "${V6_SIT}" dev "${SIT_DEV_NAME}"
+    ip link del "${SIT_DEV_NAME}"
+  fi
+
+  ip -4 addr del "${V4_INNER}" dev lo
+  ip -6 addr del "${V6_INNER}" dev lo
+}
+
+trap cleanup EXIT
+
+setup
+"$@"
+exit "$?"
diff --git a/tools/testing/selftests/bpf/with_tunnels.sh b/tools/testing/selftests/bpf/with_tunnels.sh
new file mode 100755
index 000000000000..e24949ed3a20
--- /dev/null
+++ b/tools/testing/selftests/bpf/with_tunnels.sh
@@ -0,0 +1,36 @@
+#!/bin/bash
+# SPDX-License-Identifier: GPL-2.0
+#
+# setup tunnels for flow dissection test
+
+readonly SUFFIX="test_$(mktemp -u XXXX)"
+CONFIG="remote 127.0.0.2 local 127.0.0.1 dev lo"
+
+setup() {
+  ip link add "ipip_${SUFFIX}" type ipip ${CONFIG}
+  ip link add "gre_${SUFFIX}" type gre ${CONFIG}
+  ip link add "sit_${SUFFIX}" type sit ${CONFIG}
+
+  echo "tunnels before test:"
+  ip tunnel show
+
+  ip link set "ipip_${SUFFIX}" up
+  ip link set "gre_${SUFFIX}" up
+  ip link set "sit_${SUFFIX}" up
+}
+
+
+cleanup() {
+  ip tunnel del "ipip_${SUFFIX}"
+  ip tunnel del "gre_${SUFFIX}"
+  ip tunnel del "sit_${SUFFIX}"
+
+  echo "tunnels after test:"
+  ip tunnel show
+}
+
+trap cleanup EXIT
+
+setup
+"$@"
+exit "$?"
-- 
2.18.0.865.gffc8e1a3cd6-goog

^ permalink raw reply related

* [bpf-next RFC 2/3] flow_dissector: implements eBPF parser
From: Petar Penkov @ 2018-08-16 16:44 UTC (permalink / raw)
  To: netdev; +Cc: davem, ast, daniel, simon.horman, Petar Penkov, Willem de Bruijn
In-Reply-To: <20180816164423.14368-1-peterpenkov96@gmail.com>

From: Petar Penkov <ppenkov@google.com>

This eBPF program extracts the basic, control, IP address, and port keys
from incoming packets. It supports recursive parsing of IP encapsulation,
MPLS, GUE, and VLAN headers, along with IPv4, IPv6, and IPv6 extension
headers. It is meant to show how flow dissection and key extraction can be
done in eBPF.
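
As a rough illustration of the recursive parsing described above, here is a
minimal userspace sketch in plain C. It is not the BPF program from this
patch: it only steps over nested IPv4 headers (IP-in-IP) and reads the L4
ports. The names flow_ports, dissect_ipv4, load_be16 and the SKETCH_*
constants are hypothetical and exist only for this example. The actual
program uses tail calls and bpf_flow_dissector_write_keys() rather than a
loop and a result struct.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical names for this sketch only; values match IPPROTO_*. */
enum {
	SKETCH_PROTO_IPIP = 4,
	SKETCH_PROTO_TCP  = 6,
	SKETCH_PROTO_UDP  = 17,
	SKETCH_MAX_DEPTH  = 4,	/* arbitrary bound on nested IPv4 headers */
};

/* Hypothetical result type; the patch writes dissector keys instead. */
struct flow_ports {
	uint16_t src;	/* host byte order */
	uint16_t dst;	/* host byte order */
	int encap;	/* set once an IP-in-IP layer has been stepped over */
};

static uint16_t load_be16(const uint8_t *p)
{
	return (uint16_t)((p[0] << 8) | p[1]);
}

/*
 * Walk nested IPv4 headers starting at pkt[0] and report the ports of the
 * innermost TCP or UDP header. Returns 0 on success and -1 if the packet
 * is truncated, is a non-first fragment, or uses an unsupported protocol.
 */
static int dissect_ipv4(const uint8_t *pkt, size_t len, struct flow_ports *out)
{
	size_t nhoff = 0;	/* invariant: nhoff <= len */
	int depth;

	assert(pkt != NULL && out != NULL);
	memset(out, 0, sizeof(*out));

	for (depth = 0; depth < SKETCH_MAX_DEPTH; depth++) {
		uint8_t proto;
		size_t ihl;

		if (len - nhoff < 20 || (pkt[nhoff] >> 4) != 4)
			return -1;

		/* IP header cannot be smaller than 20 bytes */
		ihl = (size_t)(pkt[nhoff] & 0x0f) * 4;
		if (ihl < 20 || ihl > len - nhoff)
			return -1;

		/* non-zero fragment offset: no L4 header to parse */
		if (load_be16(pkt + nhoff + 6) & 0x1fff)
			return -1;

		proto = pkt[nhoff + 9];
		nhoff += ihl;

		if (proto == SKETCH_PROTO_IPIP) {
			out->encap = 1;
			continue;	/* parse the inner IPv4 header next */
		}

		if (proto == SKETCH_PROTO_TCP || proto == SKETCH_PROTO_UDP) {
			/* both protocols start with source and dest port */
			if (len - nhoff < 4)
				return -1;
			out->src = load_be16(pkt + nhoff);
			out->dst = load_be16(pkt + nhoff + 2);
			return 0;
		}

		return -1;
	}

	return -1;	/* nested deeper than this sketch allows */
}
```

A caller passes a pointer to the first IPv4 header and the number of bytes
available. For an IPv4/IPv4/UDP packet the function reports the inner UDP
ports and sets encap, which loosely mirrors FLOW_DIS_ENCAPSULATION.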

It is initially meant to be used for demonstration rather than as a
complete replacement of the existing flow dissector.

The program also parses GUE and MPLS payloads. That cannot be enabled in
production in general, because GUE tunnels and MPLS payloads cannot be
detected unambiguously.

In closed environments, however, it can be enabled. This is another example
of how the programmability of BPF aids flow dissection.

Link: http://vger.kernel.org/netconf2017_files/rx_hardening_and_udp_gso.pdf
Signed-off-by: Petar Penkov <ppenkov@google.com>
Signed-off-by: Willem de Bruijn <willemb@google.com>
---
 tools/testing/selftests/bpf/Makefile   |   2 +-
 tools/testing/selftests/bpf/bpf_flow.c | 542 +++++++++++++++++++++++++
 2 files changed, 543 insertions(+), 1 deletion(-)
 create mode 100644 tools/testing/selftests/bpf/bpf_flow.c

diff --git a/tools/testing/selftests/bpf/Makefile b/tools/testing/selftests/bpf/Makefile
index fff7fb1285fc..e65f50f9185e 100644
--- a/tools/testing/selftests/bpf/Makefile
+++ b/tools/testing/selftests/bpf/Makefile
@@ -35,7 +35,7 @@ TEST_GEN_FILES = test_pkt_access.o test_xdp.o test_l4lb.o test_tcp_estats.o test
 	test_get_stack_rawtp.o test_sockmap_kern.o test_sockhash_kern.o \
 	test_lwt_seg6local.o sendmsg4_prog.o sendmsg6_prog.o test_lirc_mode2_kern.o \
 	get_cgroup_id_kern.o socket_cookie_prog.o test_select_reuseport_kern.o \
-	test_skb_cgroup_id_kern.o
+	test_skb_cgroup_id_kern.o bpf_flow.o
 
 # Order correspond to 'make run_tests' order
 TEST_PROGS := test_kmod.sh \
diff --git a/tools/testing/selftests/bpf/bpf_flow.c b/tools/testing/selftests/bpf/bpf_flow.c
new file mode 100644
index 000000000000..9c11c644b713
--- /dev/null
+++ b/tools/testing/selftests/bpf/bpf_flow.c
@@ -0,0 +1,542 @@
+// SPDX-License-Identifier: GPL-2.0
+#include <stddef.h>
+#include <stdbool.h>
+#include <string.h>
+#include <linux/pkt_cls.h>
+#include <linux/bpf.h>
+#include <linux/in.h>
+#include <linux/if_ether.h>
+#include <linux/icmp.h>
+#include <linux/ip.h>
+#include <linux/ipv6.h>
+#include <linux/tcp.h>
+#include <linux/udp.h>
+#include <linux/if_packet.h>
+#include <sys/socket.h>
+#include <linux/if_tunnel.h>
+#include <linux/mpls.h>
+#include "bpf_helpers.h"
+#include "bpf_endian.h"
+
+int _version SEC("version") = 1;
+#define PROG(F) SEC(#F) int bpf_func_##F
+
+/* These are the identifiers of the BPF programs that will be used in tail
+ * calls. Names are limited to 16 characters. After the terminating character
+ * and the bpf_func_ prefix above, only 6 remain; anything longer is cropped.
+ */
+enum {
+	IP,
+	IPV6,
+	IPV6OP,	/* Destination/Hop-by-Hop Options IPv6 Extension header */
+	IPV6FR,	/* Fragmentation IPv6 Extension Header */
+	MPLS,
+	VLAN,
+	GUE,
+};
+
+#define IP_MF		0x2000
+#define IP_OFFSET	0x1FFF
+#define IP6_MF		0x0001
+#define IP6_OFFSET	0xFFF8
+
+struct vlan_hdr {
+	__be16 h_vlan_TCI;
+	__be16 h_vlan_encapsulated_proto;
+};
+
+struct gre_hdr {
+	__be16 flags;
+	__be16 proto;
+};
+
+#define GUE_PORT 6080
+/* Taken from include/net/gue.h. Move that to uapi, instead? */
+struct guehdr {
+	union {
+		struct {
+#if defined(__LITTLE_ENDIAN_BITFIELD)
+			__u8	hlen:5,
+				control:1,
+				version:2;
+#elif defined (__BIG_ENDIAN_BITFIELD)
+			__u8	version:2,
+				control:1,
+				hlen:5;
+#else
+#error  "Please fix <asm/byteorder.h>"
+#endif
+			__u8	proto_ctype;
+			__be16	flags;
+		};
+		__be32	word;
+	};
+};
+
+enum flow_dissector_key_id {
+	FLOW_DISSECTOR_KEY_CONTROL, /* struct flow_dissector_key_control */
+	FLOW_DISSECTOR_KEY_BASIC, /* struct flow_dissector_key_basic */
+	FLOW_DISSECTOR_KEY_IPV4_ADDRS, /* struct flow_dissector_key_ipv4_addrs */
+	FLOW_DISSECTOR_KEY_IPV6_ADDRS, /* struct flow_dissector_key_ipv6_addrs */
+	FLOW_DISSECTOR_KEY_PORTS, /* struct flow_dissector_key_ports */
+	FLOW_DISSECTOR_KEY_ICMP, /* struct flow_dissector_key_icmp */
+	FLOW_DISSECTOR_KEY_ETH_ADDRS, /* struct flow_dissector_key_eth_addrs */
+	FLOW_DISSECTOR_KEY_TIPC, /* struct flow_dissector_key_tipc */
+	FLOW_DISSECTOR_KEY_ARP, /* struct flow_dissector_key_arp */
+	FLOW_DISSECTOR_KEY_VLAN, /* struct flow_dissector_key_flow_vlan */
+	FLOW_DISSECTOR_KEY_FLOW_LABEL, /* struct flow_dissector_key_flow_tags */
+	FLOW_DISSECTOR_KEY_GRE_KEYID, /* struct flow_dissector_key_keyid */
+	FLOW_DISSECTOR_KEY_MPLS_ENTROPY, /* struct flow_dissector_key_keyid */
+	FLOW_DISSECTOR_KEY_ENC_KEYID, /* struct flow_dissector_key_keyid */
+	FLOW_DISSECTOR_KEY_ENC_IPV4_ADDRS, /* struct flow_dissector_key_ipv4_addrs */
+	FLOW_DISSECTOR_KEY_ENC_IPV6_ADDRS, /* struct flow_dissector_key_ipv6_addrs */
+	FLOW_DISSECTOR_KEY_ENC_CONTROL, /* struct flow_dissector_key_control */
+	FLOW_DISSECTOR_KEY_ENC_PORTS, /* struct flow_dissector_key_ports */
+	FLOW_DISSECTOR_KEY_MPLS, /* struct flow_dissector_key_mpls */
+	FLOW_DISSECTOR_KEY_TCP, /* struct flow_dissector_key_tcp */
+	FLOW_DISSECTOR_KEY_IP, /* struct flow_dissector_key_ip */
+	FLOW_DISSECTOR_KEY_CVLAN, /* struct flow_dissector_key_flow_vlan */
+
+	FLOW_DISSECTOR_KEY_MAX,
+};
+
+struct flow_dissector_key_control {
+	__u16	thoff;
+	__u16	addr_type;
+	__u32	flags;
+};
+
+#define FLOW_DIS_IS_FRAGMENT	(1 << 0)
+#define FLOW_DIS_FIRST_FRAG	(1 << 1)
+#define FLOW_DIS_ENCAPSULATION	(1 << 2)
+
+struct flow_dissector_key_basic {
+	__be16	n_proto;
+	__u8	ip_proto;
+	__u8	padding;
+};
+
+struct flow_dissector_key_ipv4_addrs {
+	__be32 src;
+	__be32 dst;
+};
+
+struct flow_dissector_key_ipv6_addrs {
+	struct in6_addr src;
+	struct in6_addr dst;
+};
+
+struct flow_dissector_key_addrs {
+	union {
+		struct flow_dissector_key_ipv4_addrs v4addrs;
+		struct flow_dissector_key_ipv6_addrs v6addrs;
+	};
+};
+
+struct flow_dissector_key_ports {
+	union {
+		__be32 ports;
+		struct {
+			__be16 src;
+			__be16 dst;
+		};
+	};
+};
+
+struct bpf_map_def SEC("maps") jmp_table = {
+	.type = BPF_MAP_TYPE_PROG_ARRAY,
+	.key_size = sizeof(__u32),
+	.value_size = sizeof(__u32),
+	.max_entries = 8
+};
+
+struct bpf_dissect_cb {
+	__u16 nhoff;
+	__u16 flags;
+};
+
+/* Dispatches on ETHERTYPE */
+static __always_inline int parse_eth_proto(struct __sk_buff *skb, __be16 proto)
+{
+	switch (proto) {
+	case bpf_htons(ETH_P_IP):
+		bpf_tail_call(skb, &jmp_table, IP);
+		break;
+	case bpf_htons(ETH_P_IPV6):
+		bpf_tail_call(skb, &jmp_table, IPV6);
+		break;
+	case bpf_htons(ETH_P_MPLS_MC):
+	case bpf_htons(ETH_P_MPLS_UC):
+		bpf_tail_call(skb, &jmp_table, MPLS);
+		break;
+	case bpf_htons(ETH_P_8021Q):
+	case bpf_htons(ETH_P_8021AD):
+		bpf_tail_call(skb, &jmp_table, VLAN);
+		break;
+	default:
+		/* Protocol not supported */
+		return BPF_DROP;
+	}
+
+	return BPF_DROP;
+}
+
+static __always_inline int write_ports(struct __sk_buff *skb, __u8 proto)
+{
+	struct bpf_dissect_cb *cb = (struct bpf_dissect_cb *)(skb->cb);
+	struct flow_dissector_key_ports ports;
+
+	/* The supported protocols always start with the ports */
+	if (bpf_skb_load_bytes(skb, cb->nhoff, &ports, sizeof(ports)))
+		return BPF_DROP;
+
+	if (proto == IPPROTO_UDP && ports.dst == bpf_htons(GUE_PORT)) {
+		/* GUE encapsulation */
+		cb->nhoff += sizeof(struct udphdr);
+		bpf_tail_call(skb, &jmp_table, GUE);
+		return BPF_DROP;
+	}
+
+	if (bpf_flow_dissector_write_keys(skb, &ports, sizeof(ports),
+					  FLOW_DISSECTOR_KEY_PORTS))
+		return BPF_DROP;
+
+	return BPF_OK;
+}
+
+SEC("dissect")
+int dissect(struct __sk_buff *skb)
+{
+	if (!skb->vlan_present)
+		return parse_eth_proto(skb, skb->protocol);
+	else
+		return parse_eth_proto(skb, skb->vlan_proto);
+}
+
+/* Parses on IPPROTO_* */
+static __always_inline int parse_ip_proto(struct __sk_buff *skb, __u8 proto)
+{
+	struct bpf_dissect_cb *cb = (struct bpf_dissect_cb *)(skb->cb);
+	__u8 *data_end = (__u8 *)(long)skb->data_end;
+	__u8 *data = (__u8 *)(long)skb->data;
+	__u32 data_len = data_end - data;
+	struct gre_hdr gre;
+	struct ethhdr eth;
+	struct tcphdr tcp;
+
+	switch (proto) {
+	case IPPROTO_ICMP:
+		if (cb->nhoff + sizeof(struct icmphdr) > data_len)
+			return BPF_DROP;
+		return BPF_OK;
+	case IPPROTO_IPIP:
+		cb->flags |= FLOW_DIS_ENCAPSULATION;
+		bpf_tail_call(skb, &jmp_table, IP);
+		break;
+	case IPPROTO_IPV6:
+		cb->flags |= FLOW_DIS_ENCAPSULATION;
+		bpf_tail_call(skb, &jmp_table, IPV6);
+		break;
+	case IPPROTO_GRE:
+		if (bpf_skb_load_bytes(skb, cb->nhoff, &gre, sizeof(gre)))
+			return BPF_DROP;
+
+		if (bpf_htons(gre.flags & GRE_VERSION))
+			/* Only inspect standard GRE packets with version 0 */
+			return BPF_OK;
+
+		cb->nhoff += sizeof(gre); /* Step over GRE Flags and Protocol */
+		if (GRE_IS_CSUM(gre.flags))
+			cb->nhoff += 4; /* Step over chksum and Padding */
+		if (GRE_IS_KEY(gre.flags))
+			cb->nhoff += 4; /* Step over key */
+		if (GRE_IS_SEQ(gre.flags))
+			cb->nhoff += 4; /* Step over sequence number */
+
+		cb->flags |= FLOW_DIS_ENCAPSULATION;
+
+		if (gre.proto == bpf_htons(ETH_P_TEB)) {
+			if (bpf_skb_load_bytes(skb, cb->nhoff, &eth,
+					       sizeof(eth)))
+				return BPF_DROP;
+
+			cb->nhoff += sizeof(eth);
+
+			return parse_eth_proto(skb, eth.h_proto);
+		} else {
+			return parse_eth_proto(skb, gre.proto);
+		}
+
+	case IPPROTO_TCP:
+		if (cb->nhoff + sizeof(struct tcphdr) > data_len)
+			return BPF_DROP;
+
+		if (bpf_skb_load_bytes(skb, cb->nhoff, &tcp, sizeof(tcp)))
+			return BPF_DROP;
+
+		if (tcp.doff < 5)
+			return BPF_DROP;
+
+		if (cb->nhoff + (tcp.doff << 2) > data_len)
+			return BPF_DROP;
+
+		return write_ports(skb, proto);
+	case IPPROTO_UDP:
+	case IPPROTO_UDPLITE:
+		if (cb->nhoff + sizeof(struct udphdr) > data_len)
+			return BPF_DROP;
+
+		return write_ports(skb, proto);
+	default:
+		return BPF_DROP;
+	}
+
+	return BPF_DROP;
+}
+
+static __always_inline int parse_ipv6_proto(struct __sk_buff *skb, __u8 nexthdr)
+{
+	struct bpf_dissect_cb *cb = (struct bpf_dissect_cb *)(skb->cb);
+	struct flow_dissector_key_control control;
+	struct flow_dissector_key_basic basic;
+
+	switch (nexthdr) {
+	case IPPROTO_HOPOPTS:
+	case IPPROTO_DSTOPTS:
+		bpf_tail_call(skb, &jmp_table, IPV6OP);
+		break;
+	case IPPROTO_FRAGMENT:
+		bpf_tail_call(skb, &jmp_table, IPV6FR);
+		break;
+	default:
+		control.thoff = cb->nhoff;
+		control.addr_type = FLOW_DISSECTOR_KEY_IPV6_ADDRS;
+		control.flags = cb->flags;
+		if (bpf_flow_dissector_write_keys(skb, &control,
+						  sizeof(control),
+						  FLOW_DISSECTOR_KEY_CONTROL))
+			return BPF_DROP;
+
+		memset(&basic, 0, sizeof(basic));
+		basic.n_proto = bpf_htons(ETH_P_IPV6);
+		basic.ip_proto = nexthdr;
+		if (bpf_flow_dissector_write_keys(skb, &basic, sizeof(basic),
+					      FLOW_DISSECTOR_KEY_BASIC))
+			return BPF_DROP;
+
+		return parse_ip_proto(skb, nexthdr);
+	}
+
+	return BPF_DROP;
+}
+
+PROG(IP)(struct __sk_buff *skb)
+{
+	struct bpf_dissect_cb *cb = (struct bpf_dissect_cb *)(skb->cb);
+	__u8 *data_end = (__u8 *)(long)skb->data_end;
+	struct flow_dissector_key_control control;
+	struct flow_dissector_key_addrs addrs;
+	struct flow_dissector_key_basic basic;
+	__u8 *data = (__u8 *)(long)skb->data;
+	__u32 data_len = data_end - data;
+	bool done = false;
+	struct iphdr iph;
+
+	if (bpf_skb_load_bytes(skb, cb->nhoff, &iph, sizeof(iph)))
+		return BPF_DROP;
+
+	/* IP header cannot be smaller than 20 bytes */
+	if (iph.ihl < 5)
+		return BPF_DROP;
+
+	addrs.v4addrs.src = iph.saddr;
+	addrs.v4addrs.dst = iph.daddr;
+	if (bpf_flow_dissector_write_keys(skb, &addrs, sizeof(addrs.v4addrs),
+				      FLOW_DISSECTOR_KEY_IPV4_ADDRS))
+		return BPF_DROP;
+
+	cb->nhoff += iph.ihl << 2;
+	if (cb->nhoff > data_len)
+		return BPF_DROP;
+
+	if (iph.frag_off & bpf_htons(IP_MF | IP_OFFSET)) {
+		cb->flags |= FLOW_DIS_IS_FRAGMENT;
+		if (iph.frag_off & bpf_htons(IP_OFFSET))
+			/* From second fragment on, packets do not have headers
+			 * we can parse.
+			 */
+			done = true;
+		else
+			cb->flags |= FLOW_DIS_FIRST_FRAG;
+	}
+
+
+	control.thoff = cb->nhoff;
+	control.addr_type = FLOW_DISSECTOR_KEY_IPV4_ADDRS;
+	control.flags = cb->flags;
+	if (bpf_flow_dissector_write_keys(skb, &control, sizeof(control),
+					  FLOW_DISSECTOR_KEY_CONTROL))
+		return BPF_DROP;
+
+	memset(&basic, 0, sizeof(basic));
+	basic.n_proto = bpf_htons(ETH_P_IP);
+	basic.ip_proto = iph.protocol;
+	if (bpf_flow_dissector_write_keys(skb, &basic, sizeof(basic),
+				      FLOW_DISSECTOR_KEY_BASIC))
+		return BPF_DROP;
+
+	if (done)
+		return BPF_OK;
+
+	return parse_ip_proto(skb, iph.protocol);
+}
+
+PROG(IPV6)(struct __sk_buff *skb)
+{
+	struct bpf_dissect_cb *cb = (struct bpf_dissect_cb *)(skb->cb);
+	struct flow_dissector_key_addrs addrs;
+	struct ipv6hdr ip6h;
+
+	if (bpf_skb_load_bytes(skb, cb->nhoff, &ip6h, sizeof(ip6h)))
+		return BPF_DROP;
+
+	addrs.v6addrs.src = ip6h.saddr;
+	addrs.v6addrs.dst = ip6h.daddr;
+	if (bpf_flow_dissector_write_keys(skb, &addrs, sizeof(addrs.v6addrs),
+				      FLOW_DISSECTOR_KEY_IPV6_ADDRS))
+		return BPF_DROP;
+
+	cb->nhoff += sizeof(struct ipv6hdr);
+
+	return parse_ipv6_proto(skb, ip6h.nexthdr);
+}
+
+PROG(IPV6OP)(struct __sk_buff *skb)
+{
+	struct bpf_dissect_cb *cb = (struct bpf_dissect_cb *)(skb->cb);
+	__u8 proto;
+	__u8 hlen;
+
+	if (bpf_skb_load_bytes(skb, cb->nhoff, &proto, sizeof(proto)))
+		return BPF_DROP;
+
+	if (bpf_skb_load_bytes(skb, cb->nhoff + sizeof(proto), &hlen,
+			       sizeof(hlen)))
+		return BPF_DROP;
+	/* hlen is in units of 8 octets and does not include the first 8
+	 * bytes of the header
+	 */
+	cb->nhoff += (1 + hlen) << 3;
+
+	return parse_ipv6_proto(skb, proto);
+}
+
+PROG(IPV6FR)(struct __sk_buff *skb)
+{
+	struct bpf_dissect_cb *cb = (struct bpf_dissect_cb *)(skb->cb);
+	__be16 frag_off;
+	__u8 proto;
+
+	if (bpf_skb_load_bytes(skb, cb->nhoff, &proto, sizeof(proto)))
+		return BPF_DROP;
+
+	if (bpf_skb_load_bytes(skb, cb->nhoff + 2, &frag_off, sizeof(frag_off)))
+		return BPF_DROP;
+
+	cb->nhoff += 8;
+	cb->flags |= FLOW_DIS_IS_FRAGMENT;
+	if (!(frag_off & bpf_htons(IP6_OFFSET)))
+		cb->flags |= FLOW_DIS_FIRST_FRAG;
+
+	return parse_ipv6_proto(skb, proto);
+}
+
+PROG(MPLS)(struct __sk_buff *skb)
+{
+	struct bpf_dissect_cb *cb = (struct bpf_dissect_cb *)(skb->cb);
+	struct mpls_label mpls;
+
+	if (bpf_skb_load_bytes(skb, cb->nhoff, &mpls, sizeof(mpls)))
+		return BPF_DROP;
+
+	cb->nhoff += sizeof(mpls);
+
+	if (mpls.entry & MPLS_LS_S_MASK) {
+		/* This is the last MPLS header. The network layer packet always
+		 * follows the MPLS header. Peek forward and dispatch based on
+		 * that.
+		 */
+		__u8 version;
+
+		if (bpf_skb_load_bytes(skb, cb->nhoff, &version,
+				       sizeof(version)))
+			return BPF_DROP;
+
+		/* IP version is always the first 4 bits of the header */
+		switch (version >> 4) {
+		case 4:
+			bpf_tail_call(skb, &jmp_table, IP);
+			break;
+		case 6:
+			bpf_tail_call(skb, &jmp_table, IPV6);
+			break;
+		default:
+			return BPF_DROP;
+		}
+	} else {
+		bpf_tail_call(skb, &jmp_table, MPLS);
+	}
+
+	return BPF_DROP;
+}
+
+PROG(VLAN)(struct __sk_buff *skb)
+{
+	struct bpf_dissect_cb *cb = (struct bpf_dissect_cb *)(skb->cb);
+	struct vlan_hdr vlan;
+	__be16 proto;
+
+	/* Peek back to see if single or double-tagging */
+	if (bpf_skb_load_bytes(skb, cb->nhoff - sizeof(proto), &proto,
+			       sizeof(proto)))
+		return BPF_DROP;
+
+	/* Account for double-tagging */
+	if (proto == bpf_htons(ETH_P_8021AD)) {
+		if (bpf_skb_load_bytes(skb, cb->nhoff, &vlan, sizeof(vlan)))
+			return BPF_DROP;
+
+		if (vlan.h_vlan_encapsulated_proto != bpf_htons(ETH_P_8021Q))
+			return BPF_DROP;
+
+		cb->nhoff += sizeof(vlan);
+	}
+
+	if (bpf_skb_load_bytes(skb, cb->nhoff, &vlan, sizeof(vlan)))
+		return BPF_DROP;
+
+	cb->nhoff += sizeof(vlan);
+	/* Only allow 8021AD + 8021Q double tagging and no triple tagging. */
+	if (vlan.h_vlan_encapsulated_proto == bpf_htons(ETH_P_8021AD) ||
+	    vlan.h_vlan_encapsulated_proto == bpf_htons(ETH_P_8021Q))
+		return BPF_DROP;
+
+	return parse_eth_proto(skb, vlan.h_vlan_encapsulated_proto);
+}
+
+PROG(GUE)(struct __sk_buff *skb)
+{
+	struct bpf_dissect_cb *cb = (struct bpf_dissect_cb *)(skb->cb);
+	struct guehdr gue;
+
+	if (bpf_skb_load_bytes(skb, cb->nhoff, &gue, sizeof(gue)))
+		return BPF_DROP;
+
+	cb->nhoff += sizeof(gue);
+	cb->nhoff += gue.hlen << 2;
+
+	cb->flags |= FLOW_DIS_ENCAPSULATION;
+	return parse_ip_proto(skb, gue.proto_ctype);
+}
+
+char __license[] SEC("license") = "GPL";
-- 
2.18.0.865.gffc8e1a3cd6-goog
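
A note on the header-length arithmetic in the parser above: the GRE branch
steps over a 4-byte base header plus 4 bytes for each of the checksum, key,
and sequence fields announced by the flags, and the IPV6OP program advances
by (1 + hlen) << 3 bytes. What follows is a minimal userspace sketch of just
that arithmetic, not part of the patch. The sketch_* names and the host-order
flag constants are assumptions made for illustration; the patch itself tests
the flags in network byte order through the GRE_IS_* macros.

```c
#include <assert.h>
#include <stdint.h>

/* Host-order GRE flag bits per RFC 2784 and RFC 2890 (illustrative names). */
#define SKETCH_GRE_CSUM		0x8000u
#define SKETCH_GRE_KEY		0x2000u
#define SKETCH_GRE_SEQ		0x1000u
#define SKETCH_GRE_VERSION	0x0007u

/* Length in bytes of a version 0 GRE header, or -1 for any other version. */
static int sketch_gre_hdr_len(uint16_t flags)
{
	int len = 4;	/* flags (2 bytes) + protocol (2 bytes) */

	if (flags & SKETCH_GRE_VERSION)
		return -1;
	if (flags & SKETCH_GRE_CSUM)
		len += 4;	/* checksum + reserved */
	if (flags & SKETCH_GRE_KEY)
		len += 4;	/* key */
	if (flags & SKETCH_GRE_SEQ)
		len += 4;	/* sequence number */

	/* Base header plus three optional words is the maximum. */
	assert(len <= 16);
	return len;
}

/* Length in bytes of an IPv6 hop-by-hop or destination options header.
 * hdr_ext_len counts 8-octet units beyond the first 8 octets.
 */
static unsigned int sketch_ipv6_opt_len(uint8_t hdr_ext_len)
{
	return (1u + hdr_ext_len) << 3;
}
```

With these definitions, a GRE header carrying only a key is 8 bytes long, and
an options header with hdr_ext_len of 2 occupies 24 bytes.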

^ permalink raw reply related

* [bpf-next RFC 1/3] flow_dissector: implements flow dissector BPF hook
From: Petar Penkov @ 2018-08-16 16:44 UTC (permalink / raw)
  To: netdev; +Cc: davem, ast, daniel, simon.horman, Petar Penkov, Willem de Bruijn
In-Reply-To: <20180816164423.14368-1-peterpenkov96@gmail.com>

From: Petar Penkov <ppenkov@google.com>

Adds a hook for programs of type BPF_PROG_TYPE_FLOW_DISSECTOR and
attach type BPF_FLOW_DISSECTOR that is executed in the flow dissector
path. The BPF program is kept as a global variable so it is
accessible to all flow dissectors.

Signed-off-by: Petar Penkov <ppenkov@google.com>
Signed-off-by: Willem de Bruijn <willemb@google.com>
---
 include/linux/bpf_types.h                 |   1 +
 include/linux/skbuff.h                    |   7 +
 include/net/flow_dissector.h              |  16 +++
 include/uapi/linux/bpf.h                  |  14 +-
 kernel/bpf/syscall.c                      |   8 ++
 kernel/bpf/verifier.c                     |   2 +
 net/core/filter.c                         | 157 ++++++++++++++++++++++
 net/core/flow_dissector.c                 |  76 +++++++++++
 tools/bpf/bpftool/prog.c                  |   1 +
 tools/include/uapi/linux/bpf.h            |   5 +-
 tools/lib/bpf/libbpf.c                    |   2 +
 tools/testing/selftests/bpf/bpf_helpers.h |   3 +
 12 files changed, 290 insertions(+), 2 deletions(-)

diff --git a/include/linux/bpf_types.h b/include/linux/bpf_types.h
index cd26c090e7c0..22083712dd18 100644
--- a/include/linux/bpf_types.h
+++ b/include/linux/bpf_types.h
@@ -32,6 +32,7 @@ BPF_PROG_TYPE(BPF_PROG_TYPE_LIRC_MODE2, lirc_mode2)
 #ifdef CONFIG_INET
 BPF_PROG_TYPE(BPF_PROG_TYPE_SK_REUSEPORT, sk_reuseport)
 #endif
+BPF_PROG_TYPE(BPF_PROG_TYPE_FLOW_DISSECTOR, flow_dissector)
 
 BPF_MAP_TYPE(BPF_MAP_TYPE_ARRAY, array_map_ops)
 BPF_MAP_TYPE(BPF_MAP_TYPE_PERCPU_ARRAY, percpu_array_map_ops)
diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h
index 17a13e4785fc..ce0e863f02a2 100644
--- a/include/linux/skbuff.h
+++ b/include/linux/skbuff.h
@@ -243,6 +243,8 @@ struct scatterlist;
 struct pipe_inode_info;
 struct iov_iter;
 struct napi_struct;
+struct bpf_prog;
+union bpf_attr;
 
 #if defined(CONFIG_NF_CONNTRACK) || defined(CONFIG_NF_CONNTRACK_MODULE)
 struct nf_conntrack {
@@ -1192,6 +1194,11 @@ void skb_flow_dissector_init(struct flow_dissector *flow_dissector,
 			     const struct flow_dissector_key *key,
 			     unsigned int key_count);
 
+int skb_flow_dissector_bpf_prog_attach(const union bpf_attr *attr,
+				       struct bpf_prog *prog);
+
+int skb_flow_dissector_bpf_prog_detach(const union bpf_attr *attr);
+
 bool __skb_flow_dissect(const struct sk_buff *skb,
 			struct flow_dissector *flow_dissector,
 			void *target_container,
diff --git a/include/net/flow_dissector.h b/include/net/flow_dissector.h
index 6a4586dcdede..edb919d320c1 100644
--- a/include/net/flow_dissector.h
+++ b/include/net/flow_dissector.h
@@ -270,6 +270,22 @@ __be32 flow_get_u32_dst(const struct flow_keys *flow);
 extern struct flow_dissector flow_keys_dissector;
 extern struct flow_dissector flow_keys_basic_dissector;
 
+/* struct bpf_flow_dissect_cb:
+ *
+ * This struct is used to pass parameters to BPF programs of type
+ * BPF_PROG_TYPE_FLOW_DISSECTOR. Before such a program is run, the caller sets
+ * the control block of the skb to be a struct of this type. The first field
+ * communicates the next header offset between the BPF programs; its initial
+ * value is passed from the kernel. The last two fields are used for writing
+ * out flow keys.
+ */
+struct bpf_flow_dissect_cb {
+	u16 nhoff;
+	u16 unused;
+	void *target_container;
+	struct flow_dissector *flow_dissector;
+};
+
 /* struct flow_keys_digest:
  *
  * This structure is used to hold a digest of the full flow keys. This is a
diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
index 66917a4eba27..8bc0fdab685d 100644
--- a/include/uapi/linux/bpf.h
+++ b/include/uapi/linux/bpf.h
@@ -152,6 +152,7 @@ enum bpf_prog_type {
 	BPF_PROG_TYPE_LWT_SEG6LOCAL,
 	BPF_PROG_TYPE_LIRC_MODE2,
 	BPF_PROG_TYPE_SK_REUSEPORT,
+	BPF_PROG_TYPE_FLOW_DISSECTOR,
 };
 
 enum bpf_attach_type {
@@ -172,6 +173,7 @@ enum bpf_attach_type {
 	BPF_CGROUP_UDP4_SENDMSG,
 	BPF_CGROUP_UDP6_SENDMSG,
 	BPF_LIRC_MODE2,
+	BPF_FLOW_DISSECTOR,
 	__MAX_BPF_ATTACH_TYPE
 };
 
@@ -2141,6 +2143,15 @@ union bpf_attr {
  *		request in the skb.
  *	Return
  *		0 on success, or a negative error in case of failure.
+ *
+ * int bpf_flow_dissector_write_keys(const struct sk_buff *skb, const void *from, u32 len, enum flow_dissector_key_id key_id)
+ *	Description
+ *		Try to write *len* bytes from the source pointer into the offset
+ *		of the key with id *key_id*. If *len* is different from the
+ *		size of the key, an error is returned. If the key is not used,
+ *		this function exits with no effect and code 0.
+ *	Return
+ *		0 on success, negative error in case of failure.
  */
 #define __BPF_FUNC_MAPPER(FN)		\
 	FN(unspec),			\
@@ -2226,7 +2237,8 @@ union bpf_attr {
 	FN(get_current_cgroup_id),	\
 	FN(get_local_storage),		\
 	FN(sk_select_reuseport),	\
-	FN(skb_ancestor_cgroup_id),
+	FN(skb_ancestor_cgroup_id),	\
+	FN(flow_dissector_write_keys),
 
 /* integer value in 'imm' field of BPF_CALL instruction selects which helper
  * function eBPF program intends to call
diff --git a/kernel/bpf/syscall.c b/kernel/bpf/syscall.c
index 43727ed0d94a..a06568841a92 100644
--- a/kernel/bpf/syscall.c
+++ b/kernel/bpf/syscall.c
@@ -1616,6 +1616,9 @@ static int bpf_prog_attach(const union bpf_attr *attr)
 	case BPF_LIRC_MODE2:
 		ptype = BPF_PROG_TYPE_LIRC_MODE2;
 		break;
+	case BPF_FLOW_DISSECTOR:
+		ptype = BPF_PROG_TYPE_FLOW_DISSECTOR;
+		break;
 	default:
 		return -EINVAL;
 	}
@@ -1637,6 +1640,9 @@ static int bpf_prog_attach(const union bpf_attr *attr)
 	case BPF_PROG_TYPE_LIRC_MODE2:
 		ret = lirc_prog_attach(attr, prog);
 		break;
+	case BPF_PROG_TYPE_FLOW_DISSECTOR:
+		ret = skb_flow_dissector_bpf_prog_attach(attr, prog);
+		break;
 	default:
 		ret = cgroup_bpf_prog_attach(attr, ptype, prog);
 	}
@@ -1689,6 +1695,8 @@ static int bpf_prog_detach(const union bpf_attr *attr)
 		return sockmap_get_from_fd(attr, BPF_PROG_TYPE_SK_SKB, NULL);
 	case BPF_LIRC_MODE2:
 		return lirc_prog_detach(attr);
+	case BPF_FLOW_DISSECTOR:
+		return skb_flow_dissector_bpf_prog_detach(attr);
 	default:
 		return -EINVAL;
 	}
diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c
index ca90679a7fe5..6d3f268fa8e0 100644
--- a/kernel/bpf/verifier.c
+++ b/kernel/bpf/verifier.c
@@ -1321,6 +1321,7 @@ static bool may_access_direct_pkt_data(struct bpf_verifier_env *env,
 	case BPF_PROG_TYPE_LWT_XMIT:
 	case BPF_PROG_TYPE_SK_SKB:
 	case BPF_PROG_TYPE_SK_MSG:
+	case BPF_PROG_TYPE_FLOW_DISSECTOR:
 		if (meta)
 			return meta->pkt_access;
 
@@ -3976,6 +3977,7 @@ static bool may_access_skb(enum bpf_prog_type type)
 	case BPF_PROG_TYPE_SOCKET_FILTER:
 	case BPF_PROG_TYPE_SCHED_CLS:
 	case BPF_PROG_TYPE_SCHED_ACT:
+	case BPF_PROG_TYPE_FLOW_DISSECTOR:
 		return true;
 	default:
 		return false;
diff --git a/net/core/filter.c b/net/core/filter.c
index fd423ce3da34..03d3037e6508 100644
--- a/net/core/filter.c
+++ b/net/core/filter.c
@@ -4820,6 +4820,111 @@ bool bpf_helper_changes_pkt_data(void *func)
 	return false;
 }
 
+BPF_CALL_4(bpf_flow_dissector_write_keys, const struct sk_buff *, skb,
+	   const void *, from, u32, len, enum flow_dissector_key_id, key_id)
+{
+	struct bpf_flow_dissect_cb *cb;
+	void *dest;
+
+	cb = (struct bpf_flow_dissect_cb *)bpf_skb_cb(skb);
+
+	/* Make sure the dissector actually uses the key. It is not an error if
+	 * it does not, but we should not continue past this point in that case
+	 */
+	if (!dissector_uses_key(cb->flow_dissector, key_id))
+		return 0;
+
+	/* Make sure the length is correct */
+	switch (key_id) {
+	case FLOW_DISSECTOR_KEY_CONTROL:
+	case FLOW_DISSECTOR_KEY_ENC_CONTROL:
+		if (len != sizeof(struct flow_dissector_key_control))
+			return -EINVAL;
+		break;
+	case FLOW_DISSECTOR_KEY_BASIC:
+		if (len != sizeof(struct flow_dissector_key_basic))
+			return -EINVAL;
+		break;
+	case FLOW_DISSECTOR_KEY_IPV4_ADDRS:
+	case FLOW_DISSECTOR_KEY_ENC_IPV4_ADDRS:
+		if (len != sizeof(struct flow_dissector_key_ipv4_addrs))
+			return -EINVAL;
+		break;
+	case FLOW_DISSECTOR_KEY_IPV6_ADDRS:
+	case FLOW_DISSECTOR_KEY_ENC_IPV6_ADDRS:
+		if (len != sizeof(struct flow_dissector_key_ipv6_addrs))
+			return -EINVAL;
+		break;
+	case FLOW_DISSECTOR_KEY_ICMP:
+		if (len != sizeof(struct flow_dissector_key_icmp))
+			return -EINVAL;
+		break;
+	case FLOW_DISSECTOR_KEY_PORTS:
+	case FLOW_DISSECTOR_KEY_ENC_PORTS:
+		if (len != sizeof(struct flow_dissector_key_ports))
+			return -EINVAL;
+		break;
+	case FLOW_DISSECTOR_KEY_ETH_ADDRS:
+		if (len != sizeof(struct flow_dissector_key_eth_addrs))
+			return -EINVAL;
+		break;
+	case FLOW_DISSECTOR_KEY_TIPC:
+		if (len != sizeof(struct flow_dissector_key_tipc))
+			return -EINVAL;
+		break;
+	case FLOW_DISSECTOR_KEY_ARP:
+		if (len != sizeof(struct flow_dissector_key_arp))
+			return -EINVAL;
+		break;
+	case FLOW_DISSECTOR_KEY_VLAN:
+	case FLOW_DISSECTOR_KEY_CVLAN:
+		if (len != sizeof(struct flow_dissector_key_vlan))
+			return -EINVAL;
+		break;
+	case FLOW_DISSECTOR_KEY_FLOW_LABEL:
+		if (len != sizeof(struct flow_dissector_key_tags))
+			return -EINVAL;
+		break;
+	case FLOW_DISSECTOR_KEY_GRE_KEYID:
+	case FLOW_DISSECTOR_KEY_ENC_KEYID:
+	case FLOW_DISSECTOR_KEY_MPLS_ENTROPY:
+		if (len != sizeof(struct flow_dissector_key_keyid))
+			return -EINVAL;
+		break;
+	case FLOW_DISSECTOR_KEY_MPLS:
+		if (len != sizeof(struct flow_dissector_key_mpls))
+			return -EINVAL;
+		break;
+	case FLOW_DISSECTOR_KEY_TCP:
+		if (len != sizeof(struct flow_dissector_key_tcp))
+			return -EINVAL;
+		break;
+	case FLOW_DISSECTOR_KEY_IP:
+	case FLOW_DISSECTOR_KEY_ENC_IP:
+		if (len != sizeof(struct flow_dissector_key_ip))
+			return -EINVAL;
+		break;
+	default:
+		return -EINVAL;
+	}
+
+	dest = skb_flow_dissector_target(cb->flow_dissector, key_id,
+					 cb->target_container);
+
+	memcpy(dest, from, len);
+	return 0;
+}
+
+static const struct bpf_func_proto bpf_flow_dissector_write_keys_proto = {
+	.func		= bpf_flow_dissector_write_keys,
+	.gpl_only	= false,
+	.ret_type	= RET_INTEGER,
+	.arg1_type	= ARG_PTR_TO_CTX,
+	.arg2_type	= ARG_PTR_TO_MEM,
+	.arg3_type	= ARG_CONST_SIZE,
+	.arg4_type	= ARG_ANYTHING,
+};
+
 static const struct bpf_func_proto *
 bpf_base_func_proto(enum bpf_func_id func_id)
 {
@@ -5100,6 +5205,19 @@ sk_skb_func_proto(enum bpf_func_id func_id, const struct bpf_prog *prog)
 	}
 }
 
+static const struct bpf_func_proto *
+flow_dissector_func_proto(enum bpf_func_id func_id, const struct bpf_prog *prog)
+{
+	switch (func_id) {
+	case BPF_FUNC_skb_load_bytes:
+		return &bpf_skb_load_bytes_proto;
+	case BPF_FUNC_flow_dissector_write_keys:
+		return &bpf_flow_dissector_write_keys_proto;
+	default:
+		return bpf_base_func_proto(func_id);
+	}
+}
+
 static const struct bpf_func_proto *
 lwt_out_func_proto(enum bpf_func_id func_id, const struct bpf_prog *prog)
 {
@@ -5738,6 +5856,35 @@ static bool sk_msg_is_valid_access(int off, int size,
 	return true;
 }
 
+static bool flow_dissector_is_valid_access(int off, int size,
+					   enum bpf_access_type type,
+					   const struct bpf_prog *prog,
+					   struct bpf_insn_access_aux *info)
+{
+	if (type == BPF_WRITE) {
+		switch (off) {
+		case bpf_ctx_range(struct __sk_buff, cb[0]):
+			break;
+		default:
+			return false;
+		}
+	}
+
+	switch (off) {
+	case bpf_ctx_range(struct __sk_buff, data):
+		info->reg_type = PTR_TO_PACKET;
+		break;
+	case bpf_ctx_range(struct __sk_buff, data_end):
+		info->reg_type = PTR_TO_PACKET_END;
+		break;
+	case bpf_ctx_range_till(struct __sk_buff, family, local_port):
+	case bpf_ctx_range_till(struct __sk_buff, cb[1], cb[4]):
+		return false;
+	}
+
+	return bpf_skb_is_valid_access(off, size, type, prog, info);
+}
+
 static u32 bpf_convert_ctx_access(enum bpf_access_type type,
 				  const struct bpf_insn *si,
 				  struct bpf_insn *insn_buf,
@@ -6995,6 +7142,16 @@ const struct bpf_verifier_ops sk_msg_verifier_ops = {
 const struct bpf_prog_ops sk_msg_prog_ops = {
 };
 
+const struct bpf_verifier_ops flow_dissector_verifier_ops = {
+	.get_func_proto		= flow_dissector_func_proto,
+	.is_valid_access	= flow_dissector_is_valid_access,
+	.convert_ctx_access	= bpf_convert_ctx_access,
+	.gen_ld_abs		= bpf_gen_ld_abs,
+};
+
+const struct bpf_prog_ops flow_dissector_prog_ops = {
+};
+
 int sk_detach_filter(struct sock *sk)
 {
 	int ret = -ENOENT;
diff --git a/net/core/flow_dissector.c b/net/core/flow_dissector.c
index ce9eeeb7c024..767daa231f04 100644
--- a/net/core/flow_dissector.c
+++ b/net/core/flow_dissector.c
@@ -25,6 +25,11 @@
 #include <net/flow_dissector.h>
 #include <scsi/fc/fc_fcoe.h>
 #include <uapi/linux/batadv_packet.h>
+#include <linux/bpf.h>
+
+/* BPF program accessible by all flow dissectors */
+static struct bpf_prog __rcu *flow_dissector_prog;
+static DEFINE_MUTEX(flow_dissector_mutex);
 
 static void dissector_set_key(struct flow_dissector *flow_dissector,
 			      enum flow_dissector_key_id key_id)
@@ -62,6 +67,40 @@ void skb_flow_dissector_init(struct flow_dissector *flow_dissector,
 }
 EXPORT_SYMBOL(skb_flow_dissector_init);
 
+int skb_flow_dissector_bpf_prog_attach(const union bpf_attr *attr,
+				       struct bpf_prog *prog)
+{
+	struct bpf_prog *attached;
+
+	mutex_lock(&flow_dissector_mutex);
+	attached = rcu_dereference_protected(flow_dissector_prog,
+					     lockdep_is_held(&flow_dissector_mutex));
+	if (attached) {
+		/* Only one BPF program can be attached at a time */
+		mutex_unlock(&flow_dissector_mutex);
+		return -EEXIST;
+	}
+	rcu_assign_pointer(flow_dissector_prog, prog);
+	mutex_unlock(&flow_dissector_mutex);
+	return 0;
+}
+
+int skb_flow_dissector_bpf_prog_detach(const union bpf_attr *attr)
+{
+	struct bpf_prog *attached;
+
+	mutex_lock(&flow_dissector_mutex);
+	attached = rcu_dereference_protected(flow_dissector_prog,
+					     lockdep_is_held(&flow_dissector_mutex));
+	if (!attached) {
+		mutex_unlock(&flow_dissector_mutex);
+		return -EINVAL;
+	}
+	bpf_prog_put(attached);
+	RCU_INIT_POINTER(flow_dissector_prog, NULL);
+	mutex_unlock(&flow_dissector_mutex);
+	return 0;
+}
 /**
  * skb_flow_get_be16 - extract be16 entity
  * @skb: sk_buff to extract from
@@ -619,6 +658,7 @@ bool __skb_flow_dissect(const struct sk_buff *skb,
 	struct flow_dissector_key_vlan *key_vlan;
 	enum flow_dissect_ret fdret;
 	enum flow_dissector_key_id dissector_vlan = FLOW_DISSECTOR_KEY_MAX;
+	struct bpf_prog *attached;
 	int num_hdrs = 0;
 	u8 ip_proto = 0;
 	bool ret;
@@ -658,6 +698,42 @@ bool __skb_flow_dissect(const struct sk_buff *skb,
 					      FLOW_DISSECTOR_KEY_BASIC,
 					      target_container);
 
+	rcu_read_lock();
+	attached = rcu_dereference(flow_dissector_prog);
+	if (attached) {
+		/* Note that even though the const qualifier is discarded
+		 * throughout the execution of the BPF program, all changes
+		 * (the control block) are reverted after the program returns.
+		 * Therefore, __skb_flow_dissect does not alter the skb.
+		 */
+		struct bpf_flow_dissect_cb *cb;
+		u8 cb_saved[BPF_SKB_CB_LEN];
+		u32 result;
+
+		cb = (struct bpf_flow_dissect_cb *)(bpf_skb_cb((struct sk_buff *)skb));
+
+		/* Save Control Block */
+		memcpy(cb_saved, cb, sizeof(cb_saved));
+		memset(cb, 0, sizeof(cb_saved));
+
+		/* Pass parameters to the BPF program */
+		cb->nhoff = nhoff;
+		cb->target_container = target_container;
+		cb->flow_dissector = flow_dissector;
+
+		bpf_compute_data_pointers((struct sk_buff *)skb);
+		result = BPF_PROG_RUN(attached, skb);
+
+		/* Restore state */
+		memcpy(cb, cb_saved, sizeof(cb_saved));
+
+		key_control->thoff = min_t(u16, key_control->thoff,
+					   skb ? skb->len : hlen);
+		rcu_read_unlock();
+		return result == BPF_OK;
+	}
+	rcu_read_unlock();
+
 	if (dissector_uses_key(flow_dissector,
 			       FLOW_DISSECTOR_KEY_ETH_ADDRS)) {
 		struct ethhdr *eth = eth_hdr(skb);
diff --git a/tools/bpf/bpftool/prog.c b/tools/bpf/bpftool/prog.c
index dce960d22106..b1cd3bc8db70 100644
--- a/tools/bpf/bpftool/prog.c
+++ b/tools/bpf/bpftool/prog.c
@@ -74,6 +74,7 @@ static const char * const prog_type_name[] = {
 	[BPF_PROG_TYPE_RAW_TRACEPOINT]	= "raw_tracepoint",
 	[BPF_PROG_TYPE_CGROUP_SOCK_ADDR] = "cgroup_sock_addr",
 	[BPF_PROG_TYPE_LIRC_MODE2]	= "lirc_mode2",
+	[BPF_PROG_TYPE_FLOW_DISSECTOR]	= "flow_dissector",
 };
 
 static void print_boot_time(__u64 nsecs, char *buf, unsigned int size)
diff --git a/tools/include/uapi/linux/bpf.h b/tools/include/uapi/linux/bpf.h
index 66917a4eba27..acd74a0dd063 100644
--- a/tools/include/uapi/linux/bpf.h
+++ b/tools/include/uapi/linux/bpf.h
@@ -152,6 +152,7 @@ enum bpf_prog_type {
 	BPF_PROG_TYPE_LWT_SEG6LOCAL,
 	BPF_PROG_TYPE_LIRC_MODE2,
 	BPF_PROG_TYPE_SK_REUSEPORT,
+	BPF_PROG_TYPE_FLOW_DISSECTOR,
 };
 
 enum bpf_attach_type {
@@ -172,6 +173,7 @@ enum bpf_attach_type {
 	BPF_CGROUP_UDP4_SENDMSG,
 	BPF_CGROUP_UDP6_SENDMSG,
 	BPF_LIRC_MODE2,
+	BPF_FLOW_DISSECTOR,
 	__MAX_BPF_ATTACH_TYPE
 };
 
@@ -2226,7 +2228,8 @@ union bpf_attr {
 	FN(get_current_cgroup_id),	\
 	FN(get_local_storage),		\
 	FN(sk_select_reuseport),	\
-	FN(skb_ancestor_cgroup_id),
+	FN(skb_ancestor_cgroup_id),	\
+	FN(flow_dissector_write_keys),
 
 /* integer value in 'imm' field of BPF_CALL instruction selects which helper
  * function eBPF program intends to call
diff --git a/tools/lib/bpf/libbpf.c b/tools/lib/bpf/libbpf.c
index 2abd0f112627..0c749ce1b717 100644
--- a/tools/lib/bpf/libbpf.c
+++ b/tools/lib/bpf/libbpf.c
@@ -1502,6 +1502,7 @@ static bool bpf_prog_type__needs_kver(enum bpf_prog_type type)
 	case BPF_PROG_TYPE_CGROUP_SOCK_ADDR:
 	case BPF_PROG_TYPE_LIRC_MODE2:
 	case BPF_PROG_TYPE_SK_REUSEPORT:
+	case BPF_PROG_TYPE_FLOW_DISSECTOR:
 		return false;
 	case BPF_PROG_TYPE_UNSPEC:
 	case BPF_PROG_TYPE_KPROBE:
@@ -2121,6 +2122,7 @@ static const struct {
 	BPF_PROG_SEC("sk_skb",		BPF_PROG_TYPE_SK_SKB),
 	BPF_PROG_SEC("sk_msg",		BPF_PROG_TYPE_SK_MSG),
 	BPF_PROG_SEC("lirc_mode2",	BPF_PROG_TYPE_LIRC_MODE2),
+	BPF_PROG_SEC("flow_dissector",	BPF_PROG_TYPE_FLOW_DISSECTOR),
 	BPF_SA_PROG_SEC("cgroup/bind4",	BPF_CGROUP_INET4_BIND),
 	BPF_SA_PROG_SEC("cgroup/bind6",	BPF_CGROUP_INET6_BIND),
 	BPF_SA_PROG_SEC("cgroup/connect4", BPF_CGROUP_INET4_CONNECT),
diff --git a/tools/testing/selftests/bpf/bpf_helpers.h b/tools/testing/selftests/bpf/bpf_helpers.h
index e4be7730222d..4204c496a04f 100644
--- a/tools/testing/selftests/bpf/bpf_helpers.h
+++ b/tools/testing/selftests/bpf/bpf_helpers.h
@@ -143,6 +143,9 @@ static unsigned long long (*bpf_skb_cgroup_id)(void *ctx) =
 	(void *) BPF_FUNC_skb_cgroup_id;
 static unsigned long long (*bpf_skb_ancestor_cgroup_id)(void *ctx, int level) =
 	(void *) BPF_FUNC_skb_ancestor_cgroup_id;
+static int (*bpf_flow_dissector_write_keys)(void *ctx, void *src, int len,
+					    int key) =
+	(void *) BPF_FUNC_flow_dissector_write_keys;
 
 /* llvm builtin functions that eBPF C program may use to
  * emit BPF_LD_ABS and BPF_LD_IND instructions
-- 
2.18.0.865.gffc8e1a3cd6-goog
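
For readers who want the contract of bpf_flow_dissector_write_keys in
isolation: per the helper description above, a key the dissector does not
use is skipped with return code 0, a length that differs from the key size
is an error, and otherwise the bytes land at that key's offset in the target
container. The following is a rough userspace model of that contract, not
the kernel code. Every sketch_* name, the two-entry key table, and the key
sizes are assumptions made for illustration. Unlike the patch, the sketch
also range-checks the key id before testing the used-keys bitmap.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical key ids; the kernel has many more. */
enum sketch_key_id { SKETCH_KEY_BASIC, SKETCH_KEY_PORTS, SKETCH_KEY_MAX };

struct sketch_dissector {
	uint32_t used_keys;			/* bit i set: key i is wanted */
	uint16_t offset[SKETCH_KEY_MAX];	/* offset of key i in target */
};

/* Assumed key sizes in bytes, standing in for sizeof(struct ...). */
static const size_t sketch_key_size[SKETCH_KEY_MAX] = {
	[SKETCH_KEY_BASIC] = 4,
	[SKETCH_KEY_PORTS] = 4,
};

static int sketch_write_key(const struct sketch_dissector *d, void *target,
			    const void *from, size_t len,
			    enum sketch_key_id id)
{
	assert(d && target && from);

	if ((unsigned int)id >= SKETCH_KEY_MAX)
		return -EINVAL;

	/* Not an error if the key is unused, but nothing is written. */
	if (!(d->used_keys & (1u << id)))
		return 0;

	/* The caller must pass exactly the size of the key. */
	if (len != sketch_key_size[id])
		return -EINVAL;

	memcpy((char *)target + d->offset[id], from, len);
	return 0;
}
```

A caller that only asked for ports would therefore see a write of the basic
key succeed without touching the container, while a 3-byte write of the
ports key fails with -EINVAL.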

^ permalink raw reply related

* [bpf-next RFC 0/3] Introduce eBPF flow dissector
From: Petar Penkov @ 2018-08-16 16:44 UTC (permalink / raw)
  To: netdev; +Cc: davem, ast, daniel, simon.horman, Petar Penkov

From: Petar Penkov <ppenkov@google.com>

This patch series hardens the RX stack by allowing flow dissection in BPF,
as previously discussed [1]. Because of the rigorous checks of the BPF
verifier, this provides significant security guarantees. In particular, the
BPF flow dissector cannot enter an infinite loop, as happened with
CVE-2013-4348, because BPF programs are guaranteed to terminate. It cannot
read outside of packet bounds, because all memory accesses are checked.
Also, with BPF the administrator can decide which protocols to support,
reducing potential attack surface. Rarely encountered protocols can be
excluded from dissection and the program can be updated without kernel
recompile or reboot if a bug is discovered.

Patch 1 adds infrastructure to execute a BPF program in __skb_flow_dissect.
This includes a new BPF program type and a new attach type.

Patch 2 adds a flow dissector program in BPF. It parses most of the
protocols that __skb_flow_dissect handles, for a subset of flow keys
(basic, control, ports, and address types).

Patch 3 adds a selftest that attaches the BPF program to the flow dissector
and sends traffic with different levels of encapsulation.

This RFC patchset exposes a few design considerations:

1/ Because the flow dissector key definitions live in
include/net/flow_dissector.h, they are not visible from userspace,
and the flow key definitions need to be copied into the BPF program.

2/ An alternative to adding a new hook would have been to attach flow
dissection programs at the XDP hook. Because this hook is executed before
GRO, it would have to execute on every MSS, which would be more
computationally expensive. Furthermore, the XDP hook is executed before an
SKB has been allocated and there is no clear way to move the dissected keys
into the SKB after it has been allocated. Eventually, perhaps a single pass
can implement both GRO and flow dissection -- but napi_gro_cb shows that a
lot more flow state would need to be parsed for this.

3/ The BPF program cannot use direct packet access everywhere because it
uses an offset, initially supplied by the flow dissector.  Because the
initial value of this non-constant offset comes from outside of the
program, the verifier does not know what its value is, and it cannot verify
that it is within packet bounds. Therefore, direct packet access programs
get rejected.

4/ Loading and attaching the BPF program requires capable(), as opposed to
ns_capable(), because a malicious program might be able to return bad
values that would trigger bugs in the kernel, such as the nhoff value bug
fixed in commit 324f8305e59b ("net-backports: flow_dissector: properly cap
thoff field").
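
To make points 3/ and 4/ concrete, here is a small userspace sketch of the
two checks involved. With an offset that is only known at run time, each
read has to be bounds-checked when it happens, which is the role the
bpf_skb_load_bytes helper plays for the BPF program; and an offset that the
program hands back has to be clamped to the packet length before the kernel
trusts it, as the hook does for thoff. The sketch_* names are hypothetical
and the error code is simply the one chosen for this sketch. Note that the
clamp compares against the full length, whereas the patch uses
min_t(u16, ...), which casts both operands to 16 bits first.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Copy len bytes at offset off out of the packet, refusing any range that
 * is not fully inside the packet. The comparison is arranged so that
 * off + len is never computed and so cannot overflow.
 */
static int sketch_load_bytes(const uint8_t *pkt, size_t pkt_len, size_t off,
			     void *dst, size_t len)
{
	assert(pkt && dst);

	if (off > pkt_len || len > pkt_len - off)
		return -EFAULT;

	memcpy(dst, pkt + off, len);
	return 0;
}

/* Clamp a transport header offset reported by the program to the packet
 * length. When the clamp applies, pkt_len <= thoff, so it fits in 16 bits.
 */
static uint16_t sketch_cap_thoff(uint16_t thoff, size_t pkt_len)
{
	return thoff < pkt_len ? thoff : (uint16_t)pkt_len;
}
```

For an 8-byte packet, a 4-byte read at offset 4 succeeds and the same read
at offset 6 is refused; a reported thoff of 100 on a 60-byte packet is
clamped to 60.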

[1] http://vger.kernel.org/netconf2017_files/rx_hardening_and_udp_gso.pdf

Petar Penkov (3):
  flow_dissector: implements flow dissector BPF hook
  flow_dissector: implements eBPF parser
  selftests/bpf: test bpf flow dissection

 include/linux/bpf_types.h                     |   1 +
 include/linux/skbuff.h                        |   7 +
 include/net/flow_dissector.h                  |  16 +
 include/uapi/linux/bpf.h                      |  14 +-
 kernel/bpf/syscall.c                          |   8 +
 kernel/bpf/verifier.c                         |   2 +
 net/core/filter.c                             | 157 ++++
 net/core/flow_dissector.c                     |  76 ++
 tools/bpf/bpftool/prog.c                      |   1 +
 tools/include/uapi/linux/bpf.h                |   5 +-
 tools/lib/bpf/libbpf.c                        |   2 +
 tools/testing/selftests/bpf/.gitignore        |   2 +
 tools/testing/selftests/bpf/Makefile          |   8 +-
 tools/testing/selftests/bpf/bpf_flow.c        | 542 ++++++++++++
 tools/testing/selftests/bpf/bpf_helpers.h     |   3 +
 tools/testing/selftests/bpf/config            |   1 +
 .../selftests/bpf/flow_dissector_load.c       | 140 ++++
 .../selftests/bpf/test_flow_dissector.c       | 782 ++++++++++++++++++
 .../selftests/bpf/test_flow_dissector.sh      | 115 +++
 tools/testing/selftests/bpf/with_addr.sh      |  54 ++
 tools/testing/selftests/bpf/with_tunnels.sh   |  36 +
 21 files changed, 1967 insertions(+), 5 deletions(-)
 create mode 100644 tools/testing/selftests/bpf/bpf_flow.c
 create mode 100644 tools/testing/selftests/bpf/flow_dissector_load.c
 create mode 100644 tools/testing/selftests/bpf/test_flow_dissector.c
 create mode 100755 tools/testing/selftests/bpf/test_flow_dissector.sh
 create mode 100755 tools/testing/selftests/bpf/with_addr.sh
 create mode 100755 tools/testing/selftests/bpf/with_tunnels.sh

-- 
2.18.0.865.gffc8e1a3cd6-goog

^ permalink raw reply

* Re: [PATCH] r8169: don't use MSI-X on RTL8106e
From: David Miller @ 2018-08-16 19:39 UTC (permalink / raw)
  To: hkallweit1; +Cc: jian-hong, nic_swsd, netdev, linux-kernel, linux
In-Reply-To: <458efbf9-5971-653a-e7cd-8c56ba055648@gmail.com>

From: Heiner Kallweit <hkallweit1@gmail.com>
Date: Thu, 16 Aug 2018 21:37:31 +0200

> On 16.08.2018 21:21, David Miller wrote:
>> From: <jian-hong@endlessm.com>
>> Date: Wed, 15 Aug 2018 14:21:10 +0800
>> 
>>> Found the ethernet network on ASUS X441UAR doesn't come back on resume
>>> from suspend when using MSI-X.  The chip is RTL8106e - version 39.
>> 
>> Heiner, please take a look at this.
>> 
>> You recently disabled MSI-X on RTL8168g for similar reasons.
>> 
>> Now that we've seen two chips like this, maybe there is some other
>> problem afoot.
>> 
> Thanks for the hint. I saw it already and just contacted Realtek
> whether they are aware of any MSI-X issues with particular chip
> versions. With the chip versions I have access to MSI-X works fine.
> 
> There's also the theoretical option that the issues are caused by
> broken BIOS's. But so far only chip versions have been reported
> which are very similar, at least with regard to version number
> (2x VER_40, 1x VER_39). So they may share some buggy component.
> 
> Let's see whether Realtek can provide some hint.
> If more chip versions are reported having problems with MSI-X,
> then we could switch to a whitelist or disable MSI-X in general.

It could be that we need to reprogram some register(s) on resume,
which normally might not be needed, and that is what is causing the
problem with some chips.

^ permalink raw reply

* Re: [PATCH] r8169: don't use MSI-X on RTL8106e
From: Heiner Kallweit @ 2018-08-16 19:37 UTC (permalink / raw)
  To: David Miller, jian-hong; +Cc: nic_swsd, netdev, linux-kernel, linux
In-Reply-To: <20180816.122131.604270853620318143.davem@davemloft.net>

On 16.08.2018 21:21, David Miller wrote:
> From: <jian-hong@endlessm.com>
> Date: Wed, 15 Aug 2018 14:21:10 +0800
> 
>> Found the ethernet network on ASUS X441UAR doesn't come back on resume
>> from suspend when using MSI-X.  The chip is RTL8106e - version 39.
> 
> Heiner, please take a look at this.
> 
> You recently disabled MSI-X on RTL8168g for similar reasons.
> 
> Now that we've seen two chips like this, maybe there is some other
> problem afoot.
> 
Thanks for the hint. I saw it already and just contacted Realtek
to ask whether they are aware of any MSI-X issues with particular
chip versions. With the chip versions I have access to, MSI-X works
fine.

There's also the theoretical option that the issues are caused by
broken BIOSes. But so far the only chip versions reported are
very similar, at least with regard to version number
(2x VER_40, 1x VER_39). So they may share some buggy component.

Let's see whether Realtek can provide some hint.
If more chip versions are reported having problems with MSI-X,
then we could switch to a whitelist or disable MSI-X in general.
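
To make the two options concrete, here is a minimal sketch of a
per-chip-version MSI-X decision. It is not the r8169 code: the enum,
the table, and the function name are hypothetical, and the listed
versions are only the ones mentioned in this thread (VER_39, VER_40).
A whitelist would invert the test and list the known-good versions.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical chip version identifiers, for illustration only. */
enum chip_ver {
	CHIP_VER_38 = 38,
	CHIP_VER_39 = 39,
	CHIP_VER_40 = 40,
	CHIP_VER_41 = 41,
};

/* Versions reported in this thread as failing to resume with MSI-X. */
static const enum chip_ver msix_broken[] = { CHIP_VER_39, CHIP_VER_40 };

/* Return true unless the version is on the list of reported-broken chips. */
bool chip_may_use_msix(enum chip_ver ver)
{
	size_t i;

	for (i = 0; i < sizeof(msix_broken) / sizeof(msix_broken[0]); i++) {
		if (msix_broken[i] == ver)
			return false;
	}
	return true;
}

/* Small self-check showing the intended behaviour. */
void chip_msix_self_check(void)
{
	assert(!chip_may_use_msix(CHIP_VER_39));
	assert(!chip_may_use_msix(CHIP_VER_40));
	assert(chip_may_use_msix(CHIP_VER_41));
}
```

A list of broken versions keeps MSI-X on for everything not yet
reported, while a whitelist is the safer default if more reports
come in.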

Heiner

> Thanks.
> 

^ permalink raw reply

* Re: [PATCH 1/1] tap: comment fix
From: David Miller @ 2018-08-16 19:30 UTC (permalink / raw)
  To: jianjian.wang1
  Cc: jasowang, girish.moodalbail, mst, willemb, viro, wexu, netdev,
	linux-kernel
In-Reply-To: <CAP4sYWUDb8wtNVvgPO+3a5MWoiBGuafQpvG6KR30RWinf36Msw@mail.gmail.com>

From: Wang Jian <jianjian.wang1@gmail.com>
Date: Thu, 16 Aug 2018 21:01:27 +0800

> The tap_queue and the "tap_dev" are loosely coupled, not "macvlan_dev".
> 
> I also moved one rcu_read_lock() call, which seems to shrink the RCU
> critical section a little.
> 
> Signed-off-by: Wang Jian <jianjian.wang1@gmail.com>

This patch was corrupted by your email client, for example it turned
TAB characters into sequences of spaces.

Please fix this, email a test patch to yourself, and do not resend the
patch to this mailing list until you can successfully extract and
cleanly apply the test patch you email to yourself.

Thank you.

^ permalink raw reply

* Re: [PATCH net-next] Documentation: networking: ti-cpsw: correct cbs parameters for Eth1 100Mb
From: David Miller @ 2018-08-16 19:27 UTC (permalink / raw)
  To: ivan.khoronzhuk
  Cc: grygorii.strashko, corbet, netdev, linux-doc, linux-kernel
In-Reply-To: <20180815202953.10137-1-ivan.khoronzhuk@linaro.org>

From: Ivan Khoronzhuk <ivan.khoronzhuk@linaro.org>
Date: Wed, 15 Aug 2018 23:29:53 +0300

> If cbs parameters calculated for 1000Mb are used on a 100Mb port
> w/o h/w offload (for cpsw offload it doesn't matter), they work
> incorrectly. According to the example and the testing board, the
> second port is a 100Mb interface. Replace them with values
> recalculated for a 100Mb interface. This allows the same command to
> be used for the CBS software implementation on the board in the
> example.
> 
> Signed-off-by: Ivan Khoronzhuk <ivan.khoronzhuk@linaro.org>

Applied, thanks.
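
For readers wondering why the port rate matters for the software CBS
path, here is a rough sketch. It assumes the relations described in
tc-cbs(8) as I recall them (sendslope = idleslope - port rate, and
credits scaled by slope over port rate). The struct, the function name,
and the sample numbers are illustrative only and are not taken from
the patch; rounding is simplified to plain integer division.

```c
#include <assert.h>

/* Illustrative container; slopes in kbit/s, credits in bytes. */
struct cbs_params {
	long long idleslope;
	long long sendslope;
	long long hicredit;
	long long locredit;
};

/*
 * Derive CBS parameters for one traffic class from the port rate.
 * Assumed relations (see lead-in): sendslope = idleslope - port_rate,
 * hicredit = max_interference * idleslope / port_rate,
 * locredit = max_frame * sendslope / port_rate.
 */
struct cbs_params cbs_compute(long long port_rate_kbps,
			      long long idleslope_kbps,
			      long long max_frame_bytes,
			      long long max_interference_bytes)
{
	struct cbs_params p;

	p.idleslope = idleslope_kbps;
	p.sendslope = idleslope_kbps - port_rate_kbps;
	p.hicredit = max_interference_bytes * idleslope_kbps / port_rate_kbps;
	p.locredit = max_frame_bytes * p.sendslope / port_rate_kbps;
	return p;
}

/* Same reserved bandwidth, two port speeds: the results differ. */
void cbs_self_check(void)
{
	struct cbs_params fast = cbs_compute(1000000, 20000, 1500, 1500);
	struct cbs_params slow = cbs_compute(100000, 20000, 1500, 1500);

	assert(fast.sendslope == -980000 && slow.sendslope == -80000);
	assert(fast.hicredit == 30 && slow.hicredit == 300);
	assert(fast.locredit == -1470 && slow.locredit == -1200);
}
```

With the same idleslope, every derived value changes with the port
rate, so parameters computed for a 1000Mb port do not describe a 100Mb
port when the shaping is done in software.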

^ permalink raw reply

* Re: [PATCH] isdn: Disable IIOCDBGVAR
From: David Miller @ 2018-08-16 19:26 UTC (permalink / raw)
  To: keescook; +Cc: viro, isdn, linux-kernel, netdev
In-Reply-To: <20180815191405.GA29528@beast>

From: Kees Cook <keescook@chromium.org>
Date: Wed, 15 Aug 2018 12:14:05 -0700

> It was possible to directly leak the kernel address where the isdn_dev
> structure pointer was stored. This is a kernel ASLR bypass for anyone
> with access to the ioctl. The code had been present since the beginning
> of git history, though this shouldn't ever be needed for normal operation,
> therefore remove it.
> 
> Reported-by: Al Viro <viro@zeniv.linux.org.uk>
> Cc: Karsten Keil <isdn@linux-pingi.de>
> Signed-off-by: Kees Cook <keescook@chromium.org>
> ---
> netdev doesn't like explicit stable markings, so I'll just ask here that it
> get included in -stable please. :)

Applied and queued up for -stable, thanks :)

^ permalink raw reply

* Re: [PATCH] r8169: don't use MSI-X on RTL8106e
From: David Miller @ 2018-08-16 19:21 UTC (permalink / raw)
  To: jian-hong; +Cc: nic_swsd, netdev, hkallweit1, linux-kernel, linux
In-Reply-To: <20180815062110.16155-1-jian-hong@endlessm.com>

From: <jian-hong@endlessm.com>
Date: Wed, 15 Aug 2018 14:21:10 +0800

> Found the ethernet network on ASUS X441UAR doesn't come back on resume
> from suspend when using MSI-X.  The chip is RTL8106e - version 39.

Heiner, please take a look at this.

You recently disabled MSI-X on RTL8168g for similar reasons.

Now that we've seen two chips like this, maybe there is some other
problem afoot.

Thanks.

^ permalink raw reply

* Re: [PATCH] dt-bindings: net: ravb: Add support for r8a774a1 SoC
From: David Miller @ 2018-08-16 19:09 UTC (permalink / raw)
  To: fabrizio.castro
  Cc: robh+dt, mark.rutland, sergei.shtylyov, geert+renesas,
	horms+renesas, biju.das, yoshihiro.shimoda.uh, netdev,
	linux-renesas-soc, devicetree, linux-kernel, horms,
	Chris.Paterson2
In-Reply-To: <1534250017-15725-1-git-send-email-fabrizio.castro@bp.renesas.com>

From: Fabrizio Castro <fabrizio.castro@bp.renesas.com>
Date: Tue, 14 Aug 2018 13:33:37 +0100

> Document RZ/G2M (R8A774A1) SoC bindings.
> 
> Signed-off-by: Fabrizio Castro <fabrizio.castro@bp.renesas.com>
> Reviewed-by: Biju Das <biju.das@bp.renesas.com>

Applied, thanks.

^ permalink raw reply

* RE: TJA1100 100Base-T1 PHY features via ethtool?
From: Woojung.Huh @ 2018-08-16 15:58 UTC (permalink / raw)
  To: mgr, f.fainelli; +Cc: davem, netdev, kernel
In-Reply-To: <20180814071332.smqescmspo73zq6l@pengutronix.de>

Hi Florian & Michael,

> > ethtool is being converted to netlink, and that will be a much more
> > flexible interface to work with since it is basically easily extensible
> > (unlike the current ethtool + ioctl approach).
> 
> Yes, netlink sounds absolutely more useful here.
Is ethtool + netlink expected to be merged in net-next soon?
Couldn't find anything on the web except some experimental information.

Thanks.
Woojung

^ permalink raw reply

* [PATCH 4.18 20/22] Bluetooth: hidp: buffer overflow in hidp_process_report
From: Greg Kroah-Hartman @ 2018-08-16 18:45 UTC (permalink / raw)
  To: linux-kernel
  Cc: Greg Kroah-Hartman, stable, Mark Salyzyn, Marcel Holtmann,
	Johan Hedberg, David S. Miller, Kees Cook, Benjamin Tissoires,
	linux-bluetooth, netdev, security, kernel-team
In-Reply-To: <20180816171556.502583508@linuxfoundation.org>

4.18-stable review patch.  If anyone has any objections, please let me know.

------------------

From: Mark Salyzyn <salyzyn@android.com>

commit 7992c18810e568b95c869b227137a2215702a805 upstream.

CVE-2018-9363

The buffer length is unsigned at all layers, but gets cast to int and
checked in hidp_process_report and can lead to a buffer overflow.
Switch len parameter to unsigned int to resolve issue.

This affects 3.18 and newer kernels.

Signed-off-by: Mark Salyzyn <salyzyn@android.com>
Fixes: a4b1b5877b514b276f0f31efe02388a9c2836728 ("HID: Bluetooth: hidp: make sure input buffers are big enough")
Cc: Marcel Holtmann <marcel@holtmann.org>
Cc: Johan Hedberg <johan.hedberg@gmail.com>
Cc: "David S. Miller" <davem@davemloft.net>
Cc: Kees Cook <keescook@chromium.org>
Cc: Benjamin Tissoires <benjamin.tissoires@redhat.com>
Cc: linux-bluetooth@vger.kernel.org
Cc: netdev@vger.kernel.org
Cc: linux-kernel@vger.kernel.org
Cc: security@kernel.org
Cc: kernel-team@android.com
Acked-by: Kees Cook <keescook@chromium.org>
Signed-off-by: Marcel Holtmann <marcel@holtmann.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>

---
 net/bluetooth/hidp/core.c |    4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

--- a/net/bluetooth/hidp/core.c
+++ b/net/bluetooth/hidp/core.c
@@ -431,8 +431,8 @@ static void hidp_del_timer(struct hidp_s
 		del_timer(&session->timer);
 }

-static void hidp_process_report(struct hidp_session *session,
-				int type, const u8 *data, int len, int intr)
+static void hidp_process_report(struct hidp_session *session, int type,
+				const u8 *data, unsigned int len, int intr)
 {
 	if (len > HID_MAX_BUFFER_SIZE)
 		len = HID_MAX_BUFFER_SIZE;
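
As a standalone illustration of why the type change matters, the
sketch below reproduces only the clamp from the hunk above, outside the
kernel. MAX_BUF stands in for HID_MAX_BUFFER_SIZE and its value here is
an assumption; the function names are hypothetical.

```c
#include <assert.h>

#define MAX_BUF 4096	/* stand-in for HID_MAX_BUFFER_SIZE; value assumed */

/* Pre-patch shape: a negative len is never above the limit, so it is kept. */
int clamp_len_signed(int len)
{
	if (len > MAX_BUF)
		len = MAX_BUF;
	return len;
}

/* Post-patch shape: the same bits read as a large unsigned value and are clamped. */
unsigned int clamp_len_unsigned(unsigned int len)
{
	if (len > MAX_BUF)
		len = MAX_BUF;
	return len;
}

/* Small self-check contrasting the two behaviours. */
void clamp_self_check(void)
{
	assert(clamp_len_signed(-1) == -1);
	assert(clamp_len_unsigned((unsigned int)-1) == MAX_BUF);
	assert(clamp_len_unsigned(10) == 10);
}
```

With a signed parameter, a length that has gone negative slips past
the upper-bound check unchanged; with an unsigned parameter the single
comparison covers that case as well.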

^ permalink raw reply
