Netdev List
 help / color / mirror / Atom feed
* Re: [PATCH nf-next] netfilter: xt_CHECKSUM: avoid bad offload warnings on GSO packets
From: Florian Westphal @ 2017-08-25  9:40 UTC (permalink / raw)
  To: Michal Kubecek
  Cc: Florian Westphal, Davide Caratti, Pablo Neira Ayuso,
	Jozsef Kadlecsik, netfilter-devel, coreteam, netdev, linux-kernel,
	Michael S. Tsirkin, Markos Chandras
In-Reply-To: <20170825092819.hjgazfcvvrlibkd4@unicorn.suse.cz>

Michal Kubecek <mkubecek@suse.cz> wrote:
> On Thu, Aug 24, 2017 at 03:17:22PM +0200, Florian Westphal wrote:
> > Davide Caratti <dcaratti@redhat.com> wrote:
> > > Small nit: may I suggest you to call skb_csum_hwoffload_help() instead of
> > > skb_checksum_help(), so that we avoid corrupting SCTP packets in case they
> > > hit xt_CHECKSUM target?
> > 
> > Alternatively we could restrict the target to udp only.
> > 
> > AFAIU the only reason this thing exists is to fix up udp checksum
> > for old dhcp clients that use AF_PACKET without evaluating the extra
> > metadata that indicates when a 'bad' checksum is in fact ok because it
> > is supposed to be filled in by hardware later.
> > 
> > This can happen in virtual environemnt when such skb is directly passed
> > to vm.
> 
> Based on what the documentation and the commit message of the commit
> introducing xt_CHECKSUM module say, it seems so. But I must admit I'm
> not sure where is the target is used and how (and why). In particular,
> our issue was most likely result of
> 
>   https://github.com/openstack/openstack-ansible-tests/blob/master/test-prepare-host.yml#L196-L197

Sigh.  Ok, that pretty much leaves your patch as the only viable option,
however, I still think the warning isn't useful.

Can you send a v2 with gso check but without warning?

Thanks!

^ permalink raw reply

* Re: [Patch net-next v2 3/4] net_sched: remove tc class reference counting
From: Jiri Pirko @ 2017-08-25  9:36 UTC (permalink / raw)
  To: Florian Westphal; +Cc: Cong Wang, netdev, jhs
In-Reply-To: <20170825091850.GI15739@breakpoint.cc>

Fri, Aug 25, 2017 at 11:18:50AM CEST, fw@strlen.de wrote:
>Jiri Pirko <jiri@resnulli.us> wrote:
>> Fri, Aug 25, 2017 at 01:51:29AM CEST, xiyou.wangcong@gmail.com wrote:
>> >For TC classes, their ->get() and ->put() are always paired, and the
>> >reference counting is completely useless, because:
>> >
>> >1) For class modification and dumping paths, we already hold RTNL lock,
>> >   so all of these ->get(),->change(),->put() are atomic.
>> 
>> There is ongoing initiative by Florian to avoid taking RTNL for some
>> rtnetlink calls. I think that for dumping it could be done in tc as well.
>> Don't we need the refcnt then?
>
>Dumping is a problem at this time because several places depend on RTNL
>to ensure we get a consistent state, even "simple" functions like
>rtnl_fill_ifinfo, see e.g.  2907c35ff64708065e5a7fd54e8ded8263eb3074
>(net: hold rtnl again in dump callbacks).
>
>So for these places we already need some other way (e.g. seqlock)
>to ensure we don't put garbage in netlink skb.
>
>At this time I think that it is better if Congs patches go in
>(Unless there are other problems of course) as they simplify
>things quite a bit, and I am not sure that we need refcount.
>
>It might be enough to use rcu and detect when the class we just read
>from might have been in inconsistent state (so we can retry).
>
>Does that make sense to you?

It does. Thanks!

^ permalink raw reply

* Re: [PATCH nf-next] netfilter: xt_CHECKSUM: avoid bad offload warnings on GSO packets
From: Michal Kubecek @ 2017-08-25  9:28 UTC (permalink / raw)
  To: Florian Westphal
  Cc: Davide Caratti, Pablo Neira Ayuso, Jozsef Kadlecsik,
	netfilter-devel, coreteam, netdev, linux-kernel,
	Michael S. Tsirkin, Markos Chandras
In-Reply-To: <20170824131722.GB15739@breakpoint.cc>

On Thu, Aug 24, 2017 at 03:17:22PM +0200, Florian Westphal wrote:
> Davide Caratti <dcaratti@redhat.com> wrote:
> > Small nit: may I suggest you to call skb_csum_hwoffload_help() instead of
> > skb_checksum_help(), so that we avoid corrupting SCTP packets in case they
> > hit xt_CHECKSUM target?
> 
> Alternatively we could restrict the target to udp only.
> 
> AFAIU the only reason this thing exists is to fix up udp checksum
> for old dhcp clients that use AF_PACKET without evaluating the extra
> metadata that indicates when a 'bad' checksum is in fact ok because it
> is supposed to be filled in by hardware later.
> 
> This can happen in virtual environemnt when such skb is directly passed
> to vm.

Based on what the documentation and the commit message of the commit
introducing xt_CHECKSUM module say, it seems so. But I must admit I'm
not sure where is the target is used and how (and why). In particular,
our issue was most likely result of

  https://github.com/openstack/openstack-ansible-tests/blob/master/test-prepare-host.yml#L196-L197

where they explicitely confine it to TCP packets. Unfortunately these
lines come from "Initial testing commit" so it's hard to say what the
intention behind that was.

                                                       Michal Kubecek

^ permalink raw reply

* Re: [patch net-next 11/12] mlxsw: spectrum_dpipe: Add support for IPv4 host table dump
From: Arkadi Sharshevsky @ 2017-08-25  9:26 UTC (permalink / raw)
  To: David Ahern, Jiri Pirko, netdev; +Cc: davem, idosch, mlxsw
In-Reply-To: <e5af97b0-c17b-5360-69f7-8d1428960258@gmail.com>



On 08/24/2017 10:26 PM, David Ahern wrote:
> On 8/23/17 11:40 PM, Jiri Pirko wrote:
>> +static int
>> +mlxsw_sp_dpipe_table_host_entries_get(struct mlxsw_sp *mlxsw_sp,
>> +				      struct devlink_dpipe_entry *entry,
>> +				      bool counters_enabled,
>> +				      struct devlink_dpipe_dump_ctx *dump_ctx,
>> +				      int type)
>> +{
>> +	int rif_neigh_count = 0;
>> +	int rif_neigh_skip = 0;
>> +	int neigh_count = 0;
>> +	int rif_count;
>> +	int i, j;
>> +	int err;
>> +
>> +	rtnl_lock();
> 
> Why does a h/w driver dumping its tables need the rtnl lock?
> 

This table represents the hw IPv4 arp table, and the
driver depends on rtnl to be held.

^ permalink raw reply

* [PATCH] pktgen: add a new sample script for 40G and above link testing
From: Robert Hoo @ 2017-08-25  9:26 UTC (permalink / raw)
  To: davem, tariqt, brouer, kyle.leet; +Cc: netdev, robert.hu, Robert Hoo

(Sorry for yesterday's wrong sending, I finally fixed my MTA and git
send-email settings.)

It's hard to benchmark 40G+ network bandwidth using ordinary
tools like iperf, netperf (see reference 1). 
Pktgen, packet generator from Kernel sapce, shall be a candidate.
I then tried with pktgen multiqueue sample scripts, but still
cannot reach line rate.
I then derived this NUMA awared irq affinity sample script from
multi-queue sample one, successfully benchmarked 40G link. I think this can
also be useful for 100G reference, though I haven't got device to test yet.

This script simply does:
Detect $DEV's NUMA node belonging.
Bind each thread (processor from that NUMA node) with each $DEV queue's
irq affinity, 1:1 mapping.
How many '-t' threads input determines how many queues will be
utilized.

Tested with Intel XL710 NIC with Cisco 3172 switch.

It would be even slightly better if the irqbalance service is turned
off outside.

Referrences:
https://people.netfilter.org/hawk/presentations/LCA2015/net_stack_challenges_100G_LCA2015.pdf
http://www.intel.cn/content/dam/www/public/us/en/documents/reference-guides/xl710-x710-performance-tuning-linux-guide.pdf

Signed-off-by: Robert Hoo <robert.hu@linux.intel.com>
---
 ...tgen_sample06_numa_awared_queue_irq_affinity.sh | 132 +++++++++++++++++++++
 1 file changed, 132 insertions(+)
 create mode 100755 samples/pktgen/pktgen_sample06_numa_awared_queue_irq_affinity.sh

diff --git a/samples/pktgen/pktgen_sample06_numa_awared_queue_irq_affinity.sh b/samples/pktgen/pktgen_sample06_numa_awared_queue_irq_affinity.sh
new file mode 100755
index 0000000..f0ee25c
--- /dev/null
+++ b/samples/pktgen/pktgen_sample06_numa_awared_queue_irq_affinity.sh
@@ -0,0 +1,132 @@
+#!/bin/bash
+#
+# Multiqueue: Using pktgen threads for sending on multiple CPUs
+#  * adding devices to kernel threads which are in the same NUMA node
+#  * bound devices queue's irq affinity to the threads, 1:1 mapping
+#  * notice the naming scheme for keeping device names unique
+#  * nameing scheme: dev@thread_number
+#  * flow variation via random UDP source port
+#
+basedir=`dirname $0`
+source ${basedir}/functions.sh
+root_check_run_with_sudo "$@"
+#
+# Required param: -i dev in $DEV
+source ${basedir}/parameters.sh
+
+get_iface_node()
+{
+	echo `cat /sys/class/net/$1/device/numa_node`
+}
+
+get_iface_irqs()
+{
+	local IFACE=$1
+	local queues="${IFACE}-.*TxRx"
+
+	irqs=$(grep "$queues" /proc/interrupts | cut -f1 -d:)
+	[ -z "$irqs" ] && irqs=$(grep $IFACE /proc/interrupts | cut -f1 -d:)
+	[ -z "$irqs" ] && irqs=$(for i in `ls -Ux /sys/class/net/$IFACE/device/msi_irqs` ;\
+		do grep "$i:.*TxRx" /proc/interrupts | grep -v fdir | cut -f 1 -d : ;\
+	    done)
+	[ -z "$irqs" ] && echo "Error: Could not find interrupts for $IFACE"
+
+	echo $irqs
+}
+
+get_node_cpus()
+{
+	local node=$1
+	local node_cpu_list
+	local node_cpu_range_list=`cut -f1- -d, --output-delimiter=" " \
+			/sys/devices/system/node/node$node/cpulist`
+
+	for cpu_range in $node_cpu_range_list
+	do
+		node_cpu_list="$node_cpu_list "`seq -s " " ${cpu_range//-/ }`
+	done
+
+	echo $node_cpu_list
+}
+
+
+# Base Config
+DELAY="0"        # Zero means max speed
+COUNT="20000000"   # Zero means indefinitely
+[ -z "$CLONE_SKB" ] && CLONE_SKB="0"
+
+# Flow variation random source port between min and max
+UDP_MIN=9
+UDP_MAX=109
+
+node=`get_iface_node $DEV`
+irq_array=(`get_iface_irqs $DEV`)
+cpu_array=(`get_node_cpus $node`)
+
+[ $THREADS -gt ${#irq_array[*]} -o $THREADS -gt ${#cpu_array[*]}  ] && \
+	err 1 "Thread number $THREADS exceeds: min (${#irq_array[*]},${#cpu_array[*]})"
+
+# (example of setting default params in your script)
+if [ -z "$DEST_IP" ]; then
+    [ -z "$IP6" ] && DEST_IP="198.18.0.42" || DEST_IP="FD00::1"
+fi
+[ -z "$DST_MAC" ] && DST_MAC="90:e2:ba:ff:ff:ff"
+
+# General cleanup everything since last run
+pg_ctrl "reset"
+
+# Threads are specified with parameter -t value in $THREADS
+for ((i = 0; i < $THREADS; i++)); do
+    # The device name is extended with @name, using thread number to
+    # make then unique, but any name will do.
+    # Set the queue's irq affinity to this $thread (processor)
+    thread=${cpu_array[$i]}
+    dev=${DEV}@${thread}
+    echo $thread > /proc/irq/${irq_array[$i]}/smp_affinity_list
+    echo "irq ${irq_array[$i]} is set affinity to `cat /proc/irq/${irq_array[$i]}/smp_affinity_list`"
+
+    # Add remove all other devices and add_device $dev to thread
+    pg_thread $thread "rem_device_all"
+    pg_thread $thread "add_device" $dev
+
+    # select queue and bind the queue and $dev in 1:1 relationship
+    queue_num=$i
+    echo "queue number is $queue_num"
+    pg_set $dev "queue_map_min $queue_num"
+    pg_set $dev "queue_map_max $queue_num"
+
+    # Notice config queue to map to cpu (mirrors smp_processor_id())
+    # It is beneficial to map IRQ /proc/irq/*/smp_affinity 1:1 to CPU number
+    pg_set $dev "flag QUEUE_MAP_CPU"
+
+    # Base config of dev
+    pg_set $dev "count $COUNT"
+    pg_set $dev "clone_skb $CLONE_SKB"
+    pg_set $dev "pkt_size $PKT_SIZE"
+    pg_set $dev "delay $DELAY"
+
+    # Flag example disabling timestamping
+    pg_set $dev "flag NO_TIMESTAMP"
+
+    # Destination
+    pg_set $dev "dst_mac $DST_MAC"
+    pg_set $dev "dst$IP6 $DEST_IP"
+
+    # Setup random UDP port src range
+    pg_set $dev "flag UDPSRC_RND"
+    pg_set $dev "udp_src_min $UDP_MIN"
+    pg_set $dev "udp_src_max $UDP_MAX"
+done
+
+# start_run
+echo "Running... ctrl^C to stop" >&2
+pg_ctrl "start"
+echo "Done" >&2
+
+# Print results
+for ((i = 0; i < $THREADS; i++)); do
+    thread=${cpu_array[$i]}
+    dev=${DEV}@${thread}
+    echo "Device: $dev"
+    cat /proc/net/pktgen/$dev | grep -A2 "Result:"
+done
-- 
1.8.3.1

^ permalink raw reply related

* Re: [PATCH nf-next] netfilter: xt_CHECKSUM: avoid bad offload warnings on GSO packets
From: Michal Kubecek @ 2017-08-25  9:21 UTC (permalink / raw)
  To: Davide Caratti
  Cc: Florian Westphal, Pablo Neira Ayuso, Jozsef Kadlecsik,
	netfilter-devel, coreteam, netdev, linux-kernel,
	Michael S. Tsirkin, Markos Chandras
In-Reply-To: <1503580122.2958.37.camel@redhat.com>

On Thu, Aug 24, 2017 at 03:08:42PM +0200, Davide Caratti wrote:
> On Thu, 2017-08-24 at 13:07 +0200, Michal Kubecek wrote:
> > On Thu, Aug 24, 2017 at 12:51:18PM +0200, Florian Westphal wrote:
> > > Michal Kubecek <mkubecek@suse.cz> wrote:
> > > > When --checksum_fill action is applied to a GSO packet, checksum_tg() calls
> > > > skb_checksum_help() which is only meant to be applied to non-GSO packets so
> > > > that it issues a warning.
> > > > 
> > > > This can be easily triggered by using e.g.
> > > > 
> > > >   iptables -t mangle -A OUTPUT -j CHECKSUM --checksum-fill
> > > > 
> > > > and sending TCP stream via a device with GSO enabled.
> > > > 
> > > > While this can be considered a misconfiguration, I believe the bad offload
> > > > warning is supposed to catch bugs in drivers and networking stack, not
> > > > misconfigured firewalls. So let's ignore such packets and only issue a one
> > > > time warning with pr_warn_once() rather than a WARN with stack trace and
> > > > tainted kernel.
> > > 
> > > Why issue a warning at all?
> > > What kind of action should be taken upon seeing such warning?
> > 
> > Check and fix the configuration. The reason why I left at least some
> > kind of warning is that the module does something that is unexpected as
> > the checksum is not calculated (this module is often used in
> > virtualization environments where "hardware checksum offload" in fact
> > means the checksum is not computed at all).
> > 
> 
> hello Michal,
> 
> GSO should be capable of computing the checksum on individual segments
> later, so I also think the warning can be removed.
> 
> Small nit: may I suggest you to call skb_csum_hwoffload_help() instead of
> skb_checksum_help(), so that we avoid corrupting SCTP packets in case they
> hit xt_CHECKSUM target?

That sounds like an independent issue so it should be probably handled
by a separate patch.

Michal Kubecek

^ permalink raw reply

* Re: [Patch net-next v2 3/4] net_sched: remove tc class reference counting
From: Florian Westphal @ 2017-08-25  9:18 UTC (permalink / raw)
  To: Jiri Pirko; +Cc: Cong Wang, netdev, jhs, fw
In-Reply-To: <20170825090021.GA4829@nanopsycho>

Jiri Pirko <jiri@resnulli.us> wrote:
> Fri, Aug 25, 2017 at 01:51:29AM CEST, xiyou.wangcong@gmail.com wrote:
> >For TC classes, their ->get() and ->put() are always paired, and the
> >reference counting is completely useless, because:
> >
> >1) For class modification and dumping paths, we already hold RTNL lock,
> >   so all of these ->get(),->change(),->put() are atomic.
> 
> There is ongoing initiative by Florian to avoid taking RTNL for some
> rtnetlink calls. I think that for dumping it could be done in tc as well.
> Don't we need the refcnt then?

Dumping is a problem at this time because several places depend on RTNL
to ensure we get a consistent state, even "simple" functions like
rtnl_fill_ifinfo, see e.g.  2907c35ff64708065e5a7fd54e8ded8263eb3074
(net: hold rtnl again in dump callbacks).

So for these places we already need some other way (e.g. seqlock)
to ensure we don't put garbage in netlink skb.

At this time I think that it is better if Congs patches go in
(Unless there are other problems of course) as they simplify
things quite a bit, and I am not sure that we need refcount.

It might be enough to use rcu and detect when the class we just read
from might have been in inconsistent state (so we can retry).

Does that make sense to you?

^ permalink raw reply

* Re: [PATCH] pktgen: add a new sample script for 40G and above link testing
From: Jesper Dangaard Brouer @ 2017-08-25  9:19 UTC (permalink / raw)
  To: Robert Hoo; +Cc: brouer, robert.hu, netdev@vger.kernel.org
In-Reply-To: <1503127531-134546-1-git-send-email-robert.hu@intel.com>


(please don't use BCC on the netdev list, replies might miss the list in cc)

Comments inlined below:

On Fri, 25 Aug 2017 10:24:30 +0800 Robert Hoo <robert.hu@intel.com> wrote:

> From: Robert Ho <robert.hu@intel.com>
> 
> It's hard to benchmark 40G+ network bandwidth using ordinary
> tools like iperf, netperf. I then tried with pktgen multiqueue sample
> scripts, but still cannot reach line rate.

The pktgen_sample02_multiqueue.sh does not use burst or skb_cloning.
Thus, the performance will suffer.

See the samples that use the burst feature:
  pktgen_sample03_burst_single_flow.sh
  pktgen_sample05_flow_per_thread.sh

With the pktgen "burst" feature, I can easily generate 40G.  Generating
100G is also possible, but often you will hit some HW limits before the
pktgen limit.  I experienced hitting both (1) PCIe Gen3 x8 limit, and (2)
memory bandwidth limit.


> I then derived this NUMA awared irq affinity sample script from
> multi-queue sample one, successfully benchmarked 40G link. I think this can
> also be useful for 100G reference, though I haven't got device to test.

Okay, so your issue was really related to NUMA irq affinity.  I do feel
that IRQ tuning lives outside the realm of the pktgen scripts, but
looking closer at your script, I it doesn't look like you change the
IRQ setting which is good.  

You introduce some helper functions take makes it possible to extract
NUMA information in the shell script code, really cool.  I would like
to see these functions being integrated into the function.sh file.

 
> This script simply does:
> Detect $DEV's NUMA node belonging.
> Bind each thread (processor from that NUMA node) with each $DEV queue's
> irq affinity, 1:1 mapping.
> How many '-t' threads input determines how many queues will be
> utilized.
> 
> Tested with Intel XL710 NIC with Cisco 3172 switch.
> 
> It would be even slightly better if the irqbalance service is turned
> off outside.

Yes, if you don't turn-off (kill) irqbalance it will move around the
IRQs behind your back...

 
> Referrences:
> https://people.netfilter.org/hawk/presentations/LCA2015/net_stack_challenges_100G_LCA2015.pdf
> http://www.intel.cn/content/dam/www/public/us/en/documents/reference-guides/xl710-x710-performance-tuning-linux-guide.pdf
> 
> Signed-off-by: Robert Hoo <robert.hu@intel.com>
> ---
>  ...tgen_sample06_numa_awared_queue_irq_affinity.sh | 132 +++++++++++++++++++++
>  1 file changed, 132 insertions(+)
>  create mode 100755 samples/pktgen/pktgen_sample06_numa_awared_queue_irq_affinity.sh
> 
> diff --git a/samples/pktgen/pktgen_sample06_numa_awared_queue_irq_affinity.sh b/samples/pktgen/pktgen_sample06_numa_awared_queue_irq_affinity.sh
> new file mode 100755
> index 0000000..f0ee25c
> --- /dev/null
> +++ b/samples/pktgen/pktgen_sample06_numa_awared_queue_irq_affinity.sh
> @@ -0,0 +1,132 @@
> +#!/bin/bash
> +#
> +# Multiqueue: Using pktgen threads for sending on multiple CPUs
> +#  * adding devices to kernel threads which are in the same NUMA node
> +#  * bound devices queue's irq affinity to the threads, 1:1 mapping
> +#  * notice the naming scheme for keeping device names unique
> +#  * nameing scheme: dev@thread_number
> +#  * flow variation via random UDP source port
> +#
> +basedir=`dirname $0`
> +source ${basedir}/functions.sh
> +root_check_run_with_sudo "$@"
> +#
> +# Required param: -i dev in $DEV
> +source ${basedir}/parameters.sh
> +
> +get_iface_node()
> +{
> +	echo `cat /sys/class/net/$1/device/numa_node`

Here you could use the following shell trick to avoid using "cat":

 echo $(</sys/class/net/$1/device/numa_node)

It looks like you don't handle the case of -1, which indicate non-NUMA
system.  You need to use something like::

get_iface_node()
{
    local node=$(</sys/class/net/$1/device/numa_node)
    if [[ $node == -1 ]]; then
	echo 0
    else
	echo $node
    fi
}


> +}
> +
> +get_iface_irqs()
> +{
> +	local IFACE=$1
> +	local queues="${IFACE}-.*TxRx"
> +
> +	irqs=$(grep "$queues" /proc/interrupts | cut -f1 -d:)
> +	[ -z "$irqs" ] && irqs=$(grep $IFACE /proc/interrupts | cut -f1 -d:)
> +	[ -z "$irqs" ] && irqs=$(for i in `ls -Ux /sys/class/net/$IFACE/device/msi_irqs` ;\
> +		do grep "$i:.*TxRx" /proc/interrupts | grep -v fdir | cut -f 1 -d : ;\
> +	    done)

Nice that you handle all these different methods.  I personally look
in /proc/irq/*/$IFACE*/../smp_affinity_list , like (copy-paste):

echo " --- Align IRQs ---"
# I've named my NICs ixgbe1 + ixgbe2
for F in /proc/irq/*/ixgbe*-TxRx-*/../smp_affinity_list; do
   # Extract irqname e.g. "ixgbe2-TxRx-2"
   irqname=$(basename $(dirname $(dirname $F))) ;
   # Substring pattern removal
   hwq_nr=${irqname#*-*-}
   echo $hwq_nr > $F
   #grep . -H $F;
done
grep -H . /proc/irq/*/ixgbe*/../smp_affinity_list

Maybe I should switch to use:
   /sys/class/net/$IFACE/device/msi_irqs/*
 

> +	[ -z "$irqs" ] && echo "Error: Could not find interrupts for $IFACE"

In the error case you should let the script die.  There is a helper
function for this called "err" (where first arg is the exitcode, which
is useful to detect the reason your script failed).


> +	echo $irqs
> +}

> +get_node_cpus()
> +{
> +	local node=$1
> +	local node_cpu_list
> +	local node_cpu_range_list=`cut -f1- -d, --output-delimiter=" " \
> +			/sys/devices/system/node/node$node/cpulist`
> +
> +	for cpu_range in $node_cpu_range_list
> +	do
> +		node_cpu_list="$node_cpu_list "`seq -s " " ${cpu_range//-/ }`
> +	done
> +
> +	echo $node_cpu_list
> +}
> +
> +
> +# Base Config
> +DELAY="0"        # Zero means max speed
> +COUNT="20000000"   # Zero means indefinitely
> +[ -z "$CLONE_SKB" ] && CLONE_SKB="0"
> +
> +# Flow variation random source port between min and max
> +UDP_MIN=9
> +UDP_MAX=109
> +
> +node=`get_iface_node $DEV`
> +irq_array=(`get_iface_irqs $DEV`)
> +cpu_array=(`get_node_cpus $node`)

Nice trick to generate an array.

> +
> +[ $THREADS -gt ${#irq_array[*]} -o $THREADS -gt ${#cpu_array[*]}  ] && \
> +	err 1 "Thread number $THREADS exceeds: min (${#irq_array[*]},${#cpu_array[*]})"
> +
> +# (example of setting default params in your script)
> +if [ -z "$DEST_IP" ]; then
> +    [ -z "$IP6" ] && DEST_IP="198.18.0.42" || DEST_IP="FD00::1"
> +fi
> +[ -z "$DST_MAC" ] && DST_MAC="90:e2:ba:ff:ff:ff"
> +
> +# General cleanup everything since last run
> +pg_ctrl "reset"
> +
> +# Threads are specified with parameter -t value in $THREADS
> +for ((i = 0; i < $THREADS; i++)); do
> +    # The device name is extended with @name, using thread number to
> +    # make then unique, but any name will do.
> +    # Set the queue's irq affinity to this $thread (processor)
> +    thread=${cpu_array[$i]}
> +    dev=${DEV}@${thread}
> +    echo $thread > /proc/irq/${irq_array[$i]}/smp_affinity_list
> +    echo "irq ${irq_array[$i]} is set affinity to `cat /proc/irq/${irq_array[$i]}/smp_affinity_list`"
> +
> +    # Add remove all other devices and add_device $dev to thread
> +    pg_thread $thread "rem_device_all"
> +    pg_thread $thread "add_device" $dev
> +
> +    # select queue and bind the queue and $dev in 1:1 relationship
> +    queue_num=$i
> +    echo "queue number is $queue_num"
> +    pg_set $dev "queue_map_min $queue_num"
> +    pg_set $dev "queue_map_max $queue_num"
> +
> +    # Notice config queue to map to cpu (mirrors smp_processor_id())
> +    # It is beneficial to map IRQ /proc/irq/*/smp_affinity 1:1 to CPU number
> +    pg_set $dev "flag QUEUE_MAP_CPU"
> +
> +    # Base config of dev
> +    pg_set $dev "count $COUNT"
> +    pg_set $dev "clone_skb $CLONE_SKB"
> +    pg_set $dev "pkt_size $PKT_SIZE"
> +    pg_set $dev "delay $DELAY"
> +
> +    # Flag example disabling timestamping
> +    pg_set $dev "flag NO_TIMESTAMP"
> +
> +    # Destination
> +    pg_set $dev "dst_mac $DST_MAC"
> +    pg_set $dev "dst$IP6 $DEST_IP"
> +
> +    # Setup random UDP port src range
> +    pg_set $dev "flag UDPSRC_RND"
> +    pg_set $dev "udp_src_min $UDP_MIN"
> +    pg_set $dev "udp_src_max $UDP_MAX"
> +done
> +
> +# start_run
> +echo "Running... ctrl^C to stop" >&2
> +pg_ctrl "start"
> +echo "Done" >&2
> +
> +# Print results
> +for ((i = 0; i < $THREADS; i++)); do
> +    thread=${cpu_array[$i]}
> +    dev=${DEV}@${thread}
> +    echo "Device: $dev"
> +    cat /proc/net/pktgen/$dev | grep -A2 "Result:"
> +done



-- 
Best regards,
  Jesper Dangaard Brouer
  MSc.CS, Principal Kernel Engineer at Red Hat
  LinkedIn: http://www.linkedin.com/in/brouer

^ permalink raw reply

* Re: [Patch net-next v2 4/4] net_sched: kill u32_node pointer in Qdisc
From: Jiri Pirko @ 2017-08-25  9:06 UTC (permalink / raw)
  To: Cong Wang; +Cc: netdev, jhs
In-Reply-To: <20170824235130.28503-5-xiyou.wangcong@gmail.com>

Fri, Aug 25, 2017 at 01:51:30AM CEST, xiyou.wangcong@gmail.com wrote:
>It is ugly to hide a u32-filter-specific pointer inside Qdisc,
>this breaks the TC layers:
>
>1. Qdisc is a generic representation, should not have any specific
>   data of any type
>
>2. Qdisc layer is above filter layer, should only save filters in
>   the list of struct tcf_proto.
>
>This pointer is used as the head of the chain of u32 hash tables,
>that is struct tc_u_hnode, because u32 filter is very special,
>it allows to create multiple hash tables within one qdisc and
>across multiple u32 filters.
>
>Instead of using this ugly pointer, we can just save it in a global
>hash table key'ed by (dev ifindex, qdisc handle), therefore we can
>still treat it as a per qdisc basis data structure conceptually.
>
>Of course, because of network namespaces, this key is not unique
>at all, but it is fine as we already have a pointer to Qdisc in
>struct tc_u_common, we can just compare the pointers when collision.
>
>And this only affects slow paths, has no impact to fast path,
>thanks to the pointer ->tp_c.
>
>Cc: Jamal Hadi Salim <jhs@mojatatu.com>
>Cc: Jiri Pirko <jiri@resnulli.us>
>Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>

Acked-by: Jiri Pirko <jiri@mellanox.com>

^ permalink raw reply

* Re: [PATCH net] ipv6: Fix may be used uninitialized warning in rt6_check
From: Stefano Brivio @ 2017-08-25  9:02 UTC (permalink / raw)
  To: Steffen Klassert
  Cc: David S. Miller, Wei Wang, Eric Dumazet, Martin KaFai Lau, netdev
In-Reply-To: <20170825075217.GW31224@secunet.com>

On Fri, 25 Aug 2017 09:52:17 +0200
Steffen Klassert <steffen.klassert@secunet.com> wrote:

> On Fri, Aug 25, 2017 at 09:05:42AM +0200, Steffen Klassert wrote:
> > rt_cookie might be used uninitialized, fix this by
> > initializing it.
> > 
> > Fixes: c5cff8561d2d ("ipv6: add rcu grace period before freeing fib6_node")
> > Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
> > ---
> >  net/ipv6/route.c | 2 +-
> >  1 file changed, 1 insertion(+), 1 deletion(-)
> > 
> > diff --git a/net/ipv6/route.c b/net/ipv6/route.c
> > index a9d3564..48c8c92 100644
> > --- a/net/ipv6/route.c
> > +++ b/net/ipv6/route.c
> > @@ -1289,7 +1289,7 @@ static void rt6_dst_from_metrics_check(struct rt6_info *rt)
> >  
> >  static struct dst_entry *rt6_check(struct rt6_info *rt, u32 cookie)
> >  {
> > -	u32 rt_cookie;
> > +	u32 rt_cookie = 0;
> >  
> >  	if (!rt6_get_cookie_safe(rt, &rt_cookie) || rt_cookie != cookie)
> >  		return NULL;  
> 
> The compiler warning seems to be a false positive, as
> rt_cookie != cookie is only checked if rt6_get_cookie_safe
> returns true in which case rt_cookie is initialized.
> 
> Please disregard this patch.

...or not? I was thinking of sending a similar patch with
uninitialized_var(rt_cookie), but it seems we have similar cases
where we just initialize to zero instead.

I wonder which approach is considered the most acceptable nowadays. I
would be in favour of uninitialized_var() as it doesn't change the
binary output, but https://lwn.net/Articles/529954/ also contains some
valid criticism. Ideas?

^ permalink raw reply

* Re: [Patch net-next v2 3/4] net_sched: remove tc class reference counting
From: Jiri Pirko @ 2017-08-25  9:00 UTC (permalink / raw)
  To: Cong Wang; +Cc: netdev, jhs, fw
In-Reply-To: <20170824235130.28503-4-xiyou.wangcong@gmail.com>

Fri, Aug 25, 2017 at 01:51:29AM CEST, xiyou.wangcong@gmail.com wrote:
>For TC classes, their ->get() and ->put() are always paired, and the
>reference counting is completely useless, because:
>
>1) For class modification and dumping paths, we already hold RTNL lock,
>   so all of these ->get(),->change(),->put() are atomic.

There is ongoing initiative by Florian to avoid taking RTNL for some
rtnetlink calls. I think that for dumping it could be done in tc as well.
Don't we need the refcnt then?


>
>2) For filter bindiing/unbinding, we use other reference counter than
>   this one, and they should have RTNL lock too.
>
>3) For ->qlen_notify(), it is special because it is called on ->enqueue()
>   path, but we already hold qdisc tree lock there, and we hold this
>   tree lock when graft or delete the class too, so it should not be gone
>   or changed until we release the tree lock.
>
>Therefore, this patch removes ->get() and ->put(), but:
>
>1) Adds a new ->find() to find the pointer to a class by classid, no
>   refcnt.
>
>2) Move the original class destroy upon the last refcnt into ->delete(),
>   right after releasing tree lock. This is fine because the class is
>   already removed from hash when holding the lock.
>
>For those who also use ->put() as ->unbind(), just rename them to reflect
>this change.

^ permalink raw reply

* Re: [Patch net-next v2 2/4] net_sched: introduce tclass_del_notify()
From: Jiri Pirko @ 2017-08-25  8:54 UTC (permalink / raw)
  To: Cong Wang; +Cc: netdev, jhs
In-Reply-To: <20170824235130.28503-3-xiyou.wangcong@gmail.com>

Fri, Aug 25, 2017 at 01:51:28AM CEST, xiyou.wangcong@gmail.com wrote:
>Like for TC actions, ->delete() is a special case,
>we have to prepare and fill the notification before delete
>otherwise would get use-after-free after we remove the
>reference count.
>
>Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
>---

[...]


>+static int tclass_del_notify(struct net *net,
>+			     const struct Qdisc_class_ops *cops,
>+			     struct sk_buff *oskb, struct nlmsghdr *n,
>+			     struct Qdisc *q, unsigned long cl)
>+{
>+	u32 portid = oskb ? NETLINK_CB(oskb).portid : 0;
>+	struct sk_buff *skb;
>+	int err = 0;
>+
>+	if (!cops->delete)
>+		return -EOPNOTSUPP;
>+
>+	skb = alloc_skb(NLMSG_GOODSIZE, GFP_KERNEL);
>+	if (!skb)
>+		return -ENOBUFS;
>+
>+	if (tc_fill_tclass(skb, q, cl, portid, n->nlmsg_seq, 0,
>+			   RTM_DELTCLASS) < 0) {
>+		kfree_skb(skb);
>+		return -EINVAL;
>+	}
>+
>+	err = cops->delete(q, cl);
>+	if (err) {
>+		kfree_skb(skb);
>+		return err;
>+	}
>+
>+	return rtnetlink_send(skb, net, portid, RTNLGRP_TC,
>+			      n->nlmsg_flags & NLM_F_ECHO);

There is a lot of code duplication with tclass_notify function. Don't
you rather want to split tclass_notify into tclass_notify_prepare and
tclass_notify_send and use these 2 from both tclass_notify and
tclass_del_notify?

^ permalink raw reply

* Re: [Patch net-next v2 1/4] net_sched: get rid of more forward declarations
From: Jiri Pirko @ 2017-08-25  8:45 UTC (permalink / raw)
  To: Cong Wang; +Cc: netdev, jhs
In-Reply-To: <20170824235130.28503-2-xiyou.wangcong@gmail.com>

Fri, Aug 25, 2017 at 01:51:27AM CEST, xiyou.wangcong@gmail.com wrote:
>This is not needed if we move them up properly.
>
>Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>

Acked-by: Jiri Pirko <jiri@mellanox.com>

^ permalink raw reply

* [PATCH] hinic: skb_pad() frees on error
From: Dan Carpenter @ 2017-08-25  8:24 UTC (permalink / raw)
  To: Aviad Krawczyk; +Cc: netdev, kernel-janitors

The skb_pad() function frees the skb on error, so this code has a double
free.

Fixes: 00e57a6d4ad3 ("net-next/hinic: Add Tx operation")
Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>

diff --git a/drivers/net/ethernet/huawei/hinic/hinic_tx.c b/drivers/net/ethernet/huawei/hinic/hinic_tx.c
index 5bf6a32faa46..abe3e38cd342 100644
--- a/drivers/net/ethernet/huawei/hinic/hinic_tx.c
+++ b/drivers/net/ethernet/huawei/hinic/hinic_tx.c
@@ -192,7 +192,7 @@ netdev_tx_t hinic_xmit_frame(struct sk_buff *skb, struct net_device *netdev)
 	if (skb->len < MIN_SKB_LEN) {
 		if (skb_pad(skb, MIN_SKB_LEN - skb->len)) {
 			netdev_err(netdev, "Failed to pad skb\n");
-			goto skb_error;
+			goto update_error_stats;
 		}
 
 		skb->len = MIN_SKB_LEN;
@@ -237,6 +237,7 @@ netdev_tx_t hinic_xmit_frame(struct sk_buff *skb, struct net_device *netdev)
 skb_error:
 	dev_kfree_skb_any(skb);
 
+update_error_stats:
 	u64_stats_update_begin(&txq->txq_stats.syncp);
 	txq->txq_stats.tx_dropped++;
 	u64_stats_update_end(&txq->txq_stats.syncp);

^ permalink raw reply related

* Re: [EXT] Re: [PATCH net-next 09/13] net: mvpp2: dynamic reconfiguration of the PHY mode
From: Russell King - ARM Linux @ 2017-08-25  8:19 UTC (permalink / raw)
  To: Antoine Tenart
  Cc: Stefan Chulski, Andrew Lunn, davem@davemloft.net, kishon@ti.com,
	jason@lakedaemon.net, sebastian.hesselbarth@gmail.com,
	gregory.clement@free-electrons.com,
	thomas.petazzoni@free-electrons.com, Nadav Haklai,
	linux-kernel@vger.kernel.org, mw@semihalf.com,
	miquel.raynal@free-electrons.com, netdev@vger.kernel.org
In-Reply-To: <20170824171418.GI28443@kwain>

On Thu, Aug 24, 2017 at 07:14:18PM +0200, Antoine Tenart wrote:
> On Thu, Aug 24, 2017 at 05:08:29PM +0000, Stefan Chulski wrote:
> > > > Imagine phylib is using the copper Ethernet PHY, but the MAC is using
> > > > the SFP port. Somebody pulls out the copper cable, phylib says the
> > > > link is down, turns the carrier off and calls the callback. Not good,
> > > > since your SFP cable is still plugged in...  Ethtool is
> > > > returning/setting stuff in the Copper Ethernet PHY, when in fact you
> > > > intend to be setting SFP settings.
> > > 
> > > I see what could be the issue but I do not understand one aspect though:
> > > how could we switch from one PHY to another, as there's only one output
> > > between the SoC (and so a given GoP#) and the board. So if a given PHY can
> > > handle multiple modes I see, but in the other case a muxing somewhere would
> > > be needed? Or did I miss something?
> > 
> > I think PHY name and PHY mode struct that describe here both MAC to
> > PHY and PHY to PHY connection create confusion...  Serdes IP lane
> > doesn't care if connector is SFP, RJ45 or direct attached cable.
> > mvpp22_comphy_init only configures MAC to PHY
> > connection. SFI for 10G(KR in mainline), SGMII for 1G and HS_SGMII for
> > 2.5G.
> 
> So maybe one confusion was to name them PHY_MODE_10GKR and
> PHY_MODE_SGMII. It could be PHY_MODE_10G and PHY_MODE_1G instead.

SGMII mode supports 100M and 10M as well using data repetition, so 1G
makes it look like those speeds are not supported.

-- 
RMK's Patch system: http://www.armlinux.org.uk/developer/patches/
FTTC broadband for 0.8mile line in suburbia: sync at 8.8Mbps down 630kbps up
According to speedtest.net: 8.21Mbps down 510kbps up

^ permalink raw reply

* Re: [PATCH net-next v7 08/10] bpf: Add a Landlock sandbox example
From: Mickaël Salaün @ 2017-08-25  8:17 UTC (permalink / raw)
  To: Alexei Starovoitov
  Cc: linux-kernel, Alexei Starovoitov, Andy Lutomirski,
	Arnaldo Carvalho de Melo, Casey Schaufler, Daniel Borkmann,
	David Drysdale, David S . Miller, Eric W . Biederman,
	James Morris, Jann Horn, Jonathan Corbet, Matthew Garrett,
	Michael Kerrisk, Kees Cook, Paul Moore, Sargun Dhillon,
	Serge E . Hallyn, Shuah Khan, Tejun Heo, Thomas Graf
In-Reply-To: <20170824025901.cpppy4nn5xv2ao24@ast-mbp>


[-- Attachment #1.1: Type: text/plain, Size: 3333 bytes --]



On 24/08/2017 04:59, Alexei Starovoitov wrote:
> On Mon, Aug 21, 2017 at 02:09:31AM +0200, Mickaël Salaün wrote:
>> Add a basic sandbox tool to create a process isolated from some part of
>> the system. This sandbox create a read-only environment. It is only
>> allowed to write to a character device such as a TTY:
>>
>>   # :> X
>>   # echo $?
>>   0
>>   # ./samples/bpf/landlock1 /bin/sh -i
>>   Launching a new sandboxed process.
>>   # :> Y
>>   cannot create Y: Operation not permitted
>>
>> Signed-off-by: Mickaël Salaün <mic@digikod.net>
> 
> ...
> 
>> +SEC("landlock1")
>> +static int landlock_fs_prog1(struct landlock_context *ctx)
>> +{
>> +	char fmt_error_mode[] = "landlock1: error: get_mode:%lld\n";
>> +	char fmt_error_access[] = "landlock1: error: access denied\n";
>> +	long long ret;
>> +
>> +	/*
>> +	 * The argument ctx->arg2 contains bitflags of actions for which the
>> +	 * rule is run.  The flag LANDLOCK_ACTION_FS_WRITE means that a write
>> +	 * is requested by one of the userspace processes restricted by this
>> +	 * rule. The following test allows any actions which does not include a
>> +	 * write.
>> +	 */
>> +	if (!(ctx->arg2 & LANDLOCK_ACTION_FS_WRITE))
>> +		return 0;
>> +
>> +	/*
>> +	 * The argument ctx->arg1 is a file handle for which the process want
>> +	 * to access. The function bpf_handle_fs_get_mode() return the mode of
>> +	 * a file (e.g. S_IFBLK, S_IFDIR, S_IFREG...). If there is an error,
>> +	 * for example if the argument is not a file handle, then an
>> +	 * -errno value is returned. Otherwise the caller get the file mode as
>> +	 *  with stat(2).
>> +	 */
>> +	ret = bpf_handle_fs_get_mode((void *)ctx->arg1);
>> +	if (ret < 0) {
>> +
>> +		/*
>> +		 * The bpf_trace_printk() function enable to write in the
>> +		 * kernel eBPF debug log, accessible through
>> +		 * /sys/kernel/debug/tracing/trace_pipe . To be allowed to call
>> +		 * this function, a Landlock rule must have the
>> +		 * LANDLOCK_SUBTYPE_ABILITY_DEBUG ability, which is only
>> +		 * allowed for CAP_SYS_ADMIN.
>> +		 */
>> +		bpf_trace_printk(fmt_error_mode, sizeof(fmt_error_mode), ret);
>> +		return 1;
>> +	}
>> +
>> +	/*
>> +	 * This check allows the action on the file if it is a directory or a
>> +	 * pipe. Otherwise, a message is printed to the eBPF log.
>> +	 */
>> +	if (S_ISCHR(ret) || S_ISFIFO(ret))
>> +		return 0;
>> +	bpf_trace_printk(fmt_error_access, sizeof(fmt_error_access));
>> +	return 1;
>> +}
>> +
>> +/*
>> + * This subtype enable to set the ABI, which ensure that the eBPF context and
>> + * program behavior will be compatible with this Landlock rule.
>> + */
>> +SEC("subtype")
>> +static const union bpf_prog_subtype _subtype = {
>> +	.landlock_rule = {
>> +		.abi = 1,
>> +		.event = LANDLOCK_SUBTYPE_EVENT_FS,
>> +		.ability = LANDLOCK_SUBTYPE_ABILITY_DEBUG,
>> +	}
>> +};
> 
> from rule writer perspective can you somehow merge subtype definition
> with the program? It seems they go hand in hand.
> Like section name of the program can be:
> SEC("landlock_rule1/event=fs/ability=debug")
> static int landlock_fs_prog1(struct landlock_context *ctx)...
> and the loader can parse this string and prepare appropriate
> data structures for the kernel.

Right, I'll try that.


[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 488 bytes --]

^ permalink raw reply

* Re: [PATCH net-next v7 05/10] landlock: Add LSM hooks related to filesystem
From: Mickaël Salaün @ 2017-08-25  8:16 UTC (permalink / raw)
  To: Alexei Starovoitov
  Cc: linux-kernel, Alexei Starovoitov, Andy Lutomirski,
	Arnaldo Carvalho de Melo, Casey Schaufler, Daniel Borkmann,
	David Drysdale, David S . Miller, Eric W . Biederman,
	James Morris, Jann Horn, Jonathan Corbet, Matthew Garrett,
	Michael Kerrisk, Kees Cook, Paul Moore, Sargun Dhillon,
	Serge E . Hallyn, Shuah Khan, Tejun Heo, Thomas Graf
In-Reply-To: <20170824025030.sxl2hkpcbzipb47y@ast-mbp>


[-- Attachment #1.1: Type: text/plain, Size: 2700 bytes --]


On 24/08/2017 04:50, Alexei Starovoitov wrote:
> On Mon, Aug 21, 2017 at 02:09:28AM +0200, Mickaël Salaün wrote:
>> Handle 33 filesystem-related LSM hooks for the Landlock filesystem
>> event: LANDLOCK_SUBTYPE_EVENT_FS.
>>
>> A Landlock event wrap LSM hooks for similar kernel object types (e.g.
>> struct file, struct path...). Multiple LSM hooks can trigger the same
>> Landlock event.
>>
>> Landlock handle nine coarse-grained actions: read, write, execute, new,
>> get, remove, ioctl, lock and fcntl. Each of them abstract LSM hook
>> access control in a way that can be extended in the future.
>>
>> The Landlock LSM hook registration is done after other LSM to only run
>> actions from user-space, via eBPF programs, if the access was granted by
>> major (privileged) LSMs.
>>
>> Signed-off-by: Mickaël Salaün <mic@digikod.net>
> 
> ...
> 
>> +/* WRAP_ARG_SB */
>> +#define WRAP_ARG_SB_TYPE	WRAP_TYPE_FS
>> +#define WRAP_ARG_SB_DEC(arg)					\
>> +	EXPAND_C(WRAP_TYPE_FS) wrap_##arg =			\
>> +	{ .type = BPF_HANDLE_FS_TYPE_DENTRY, .dentry = arg->s_root };
>> +#define WRAP_ARG_SB_VAL(arg)	((uintptr_t)&wrap_##arg)
>> +#define WRAP_ARG_SB_OK(arg)	(arg && arg->s_root)
> ...
> 
>> +HOOK_NEW_FS(sb_remount, 2,
>> +	struct super_block *, sb,
>> +	void *, data,
>> +	WRAP_ARG_SB, sb,
>> +	WRAP_ARG_RAW, LANDLOCK_ACTION_FS_WRITE
>> +);
> 
> this looks wrong. casting super_block to dentry?

This is called when remounting a block device. The WRAP_ARG_SB take the
sb->s_root as a dentry, it is not a cast. What do you expect from this hook?

> 
>> +/* a directory inode contains only one dentry */
>> +HOOK_NEW_FS(inode_create, 3,
>> +	struct inode *, dir,
>> +	struct dentry *, dentry,
>> +	umode_t, mode,
>> +	WRAP_ARG_INODE, dir,
>> +	WRAP_ARG_RAW, LANDLOCK_ACTION_FS_WRITE
>> +);
> 
> more general question: why you're not wrapping all useful
> arguments? Like in the above dentry can be acted upon
> by the landlock rule and it's readily available...

The context used for the FS event must have the exact same types for all
calls. This event is meant to be generic but we can add more specific
ones if needed, like I do with FS_IOCTL.

The idea is to enable people to write simple rules, while being able to
write fine grain rules for special cases (e.g. IOCTL) if needed.

> 
> The limitation of only 2 args looks odd.
> Is it a hard limitation ? how hard to extend?

It's not a hard limit at all. Actually, the FS_FNCTL event should have
three arguments (I'll add them in the next series): FS handle, FCNTL
command and FCNTL argument. I made sure that it's really easy to add
more arguments to the context of an event.


[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 488 bytes --]

^ permalink raw reply

* Re: [PATCH net-next v7 03/10] bpf,landlock: Define an eBPF program type for a Landlock rule
From: Mickaël Salaün @ 2017-08-25  8:02 UTC (permalink / raw)
  To: Alexei Starovoitov
  Cc: linux-kernel, Alexei Starovoitov, Andy Lutomirski,
	Arnaldo Carvalho de Melo, Casey Schaufler, Daniel Borkmann,
	David Drysdale, David S . Miller, Eric W . Biederman,
	James Morris, Jann Horn, Jonathan Corbet, Matthew Garrett,
	Michael Kerrisk, Kees Cook, Paul Moore, Sargun Dhillon,
	Serge E . Hallyn, Shuah Khan, Tejun Heo, Thomas Graf
In-Reply-To: <20170824022853.qbgiubmslaaeclxf@ast-mbp>


[-- Attachment #1.1: Type: text/plain, Size: 9633 bytes --]



On 24/08/2017 04:28, Alexei Starovoitov wrote:
> On Mon, Aug 21, 2017 at 02:09:26AM +0200, Mickaël Salaün wrote:
>> Add a new type of eBPF program used by Landlock rules.
>>
>> This new BPF program type will be registered with the Landlock LSM
>> initialization.
>>
>> Add an initial Landlock Kconfig.
>>
>> Signed-off-by: Mickaël Salaün <mic@digikod.net>
>> Cc: Alexei Starovoitov <ast@kernel.org>
>> Cc: Andy Lutomirski <luto@amacapital.net>
>> Cc: Daniel Borkmann <daniel@iogearbox.net>
>> Cc: David S. Miller <davem@davemloft.net>
>> Cc: James Morris <james.l.morris@oracle.com>
>> Cc: Kees Cook <keescook@chromium.org>
>> Cc: Serge E. Hallyn <serge@hallyn.com>
>> ---
>>
>> Changes since v6:
>> * add 3 more sub-events: IOCTL, LOCK, FCNTL
>>   https://lkml.kernel.org/r/2fbc99a6-f190-f335-bd14-04bdeed35571@digikod.net
>> * rename LANDLOCK_VERSION to LANDLOCK_ABI to better reflect its purpose,
>>   and move it from landlock.h to common.h
>> * rename BPF_PROG_TYPE_LANDLOCK to BPF_PROG_TYPE_LANDLOCK_RULE: an eBPF
>>   program could be used for something else than a rule
>> * simplify struct landlock_context by removing the arch and syscall_nr fields
>> * remove all eBPF map functions call, remove ABILITY_WRITE
>> * refactor bpf_landlock_func_proto() (suggested by Kees Cook)
>> * constify pointers
>> * fix doc inclusion
>>
>> Changes since v5:
>> * rename file hooks.c to init.c
>> * fix spelling
>>
>> Changes since v4:
>> * merge a minimal (not enabled) LSM code and Kconfig in this commit
>>
>> Changes since v3:
>> * split commit
>> * revamp the landlock_context:
>>   * add arch, syscall_nr and syscall_cmd (ioctl, fcntl…) to be able to
>>     cross-check action with the event type
>>   * replace args array with dedicated fields to ease the addition of new
>>     fields
>> ---
>>  include/linux/bpf_types.h      |  3 ++
>>  include/uapi/linux/bpf.h       | 97 +++++++++++++++++++++++++++++++++++++++++
>>  security/Kconfig               |  1 +
>>  security/Makefile              |  2 +
>>  security/landlock/Kconfig      | 18 ++++++++
>>  security/landlock/Makefile     |  3 ++
>>  security/landlock/common.h     | 21 +++++++++
>>  security/landlock/init.c       | 98 ++++++++++++++++++++++++++++++++++++++++++
>>  tools/include/uapi/linux/bpf.h | 97 +++++++++++++++++++++++++++++++++++++++++
>>  9 files changed, 340 insertions(+)
>>  create mode 100644 security/landlock/Kconfig
>>  create mode 100644 security/landlock/Makefile
>>  create mode 100644 security/landlock/common.h
>>  create mode 100644 security/landlock/init.c
>>
>> diff --git a/include/linux/bpf_types.h b/include/linux/bpf_types.h
>> index 6f1a567667b8..8bac93970a47 100644
>> --- a/include/linux/bpf_types.h
>> +++ b/include/linux/bpf_types.h
>> @@ -18,6 +18,9 @@ BPF_PROG_TYPE(BPF_PROG_TYPE_KPROBE, kprobe_prog_ops)
>>  BPF_PROG_TYPE(BPF_PROG_TYPE_TRACEPOINT, tracepoint_prog_ops)
>>  BPF_PROG_TYPE(BPF_PROG_TYPE_PERF_EVENT, perf_event_prog_ops)
>>  #endif
>> +#ifdef CONFIG_SECURITY_LANDLOCK
>> +BPF_PROG_TYPE(BPF_PROG_TYPE_LANDLOCK_RULE, bpf_landlock_ops)
>> +#endif
>>  
>>  BPF_MAP_TYPE(BPF_MAP_TYPE_ARRAY, array_map_ops)
>>  BPF_MAP_TYPE(BPF_MAP_TYPE_PERCPU_ARRAY, percpu_array_map_ops)
>> diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
>> index 8541ab85e432..20da634da941 100644
>> --- a/include/uapi/linux/bpf.h
>> +++ b/include/uapi/linux/bpf.h
>> @@ -129,6 +129,7 @@ enum bpf_prog_type {
>>  	BPF_PROG_TYPE_LWT_XMIT,
>>  	BPF_PROG_TYPE_SOCK_OPS,
>>  	BPF_PROG_TYPE_SK_SKB,
>> +	BPF_PROG_TYPE_LANDLOCK_RULE,
>>  };
>>  
>>  enum bpf_attach_type {
>> @@ -879,4 +880,100 @@ enum {
>>  #define TCP_BPF_IW		1001	/* Set TCP initial congestion window */
>>  #define TCP_BPF_SNDCWND_CLAMP	1002	/* Set sndcwnd_clamp */
>>  
>> +/**
>> + * enum landlock_subtype_event - event occurring when an action is performed on
>> + * a particular kernel object
>> + *
>> + * An event is a policy decision point which exposes the same context type
>> + * (especially the same arg[0-9] field types) for each rule execution.
>> + *
>> + * @LANDLOCK_SUBTYPE_EVENT_UNSPEC: invalid value
>> + * @LANDLOCK_SUBTYPE_EVENT_FS: generic filesystem event
>> + * @LANDLOCK_SUBTYPE_EVENT_FS_IOCTL: custom IOCTL sub-event
>> + * @LANDLOCK_SUBTYPE_EVENT_FS_LOCK: custom LOCK sub-event
>> + * @LANDLOCK_SUBTYPE_EVENT_FS_FCNTL: custom FCNTL sub-event
>> + */
>> +enum landlock_subtype_event {
>> +	LANDLOCK_SUBTYPE_EVENT_UNSPEC,
>> +	LANDLOCK_SUBTYPE_EVENT_FS,
>> +	LANDLOCK_SUBTYPE_EVENT_FS_IOCTL,
>> +	LANDLOCK_SUBTYPE_EVENT_FS_LOCK,
>> +	LANDLOCK_SUBTYPE_EVENT_FS_FCNTL,
>> +};
>> +#define _LANDLOCK_SUBTYPE_EVENT_LAST LANDLOCK_SUBTYPE_EVENT_FS_FCNTL
>> +
>> +/**
>> + * DOC: landlock_subtype_ability
>> + *
>> + * eBPF context and functions allowed for a rule
>> + *
>> + * - LANDLOCK_SUBTYPE_ABILITY_DEBUG: allows to do debug actions (e.g. writing
>> + *   logs), which may be dangerous and should only be used for rule testing
>> + */
>> +#define LANDLOCK_SUBTYPE_ABILITY_DEBUG		(1ULL << 0)
>> +#define _LANDLOCK_SUBTYPE_ABILITY_NB		1
>> +#define _LANDLOCK_SUBTYPE_ABILITY_MASK		((1ULL << _LANDLOCK_SUBTYPE_ABILITY_NB) - 1)
> 
> can you move the last two macros out of uapi?
> There is really no need for the implementation details to
> leak into uapi.

Right, I'll move them.

> 
>> +
>> +/*
>> + * Future options for a Landlock rule (e.g. run even if a previous rule denied
>> + * an action).
>> + */
>> +#define _LANDLOCK_SUBTYPE_OPTION_NB		0
>> +#define _LANDLOCK_SUBTYPE_OPTION_MASK		((1ULL << _LANDLOCK_SUBTYPE_OPTION_NB) - 1)
> 
> same here
> 
>> +
>> +/*
>> + * Status visible in the @status field of a context (e.g. already called in
>> + * this syscall session, with same args...).
>> + *
>> + * The @status field exposed to a rule shall depend on the rule version.
>> + */
>> +#define _LANDLOCK_SUBTYPE_STATUS_NB		0
>> +#define _LANDLOCK_SUBTYPE_STATUS_MASK		((1ULL << _LANDLOCK_SUBTYPE_STATUS_NB) - 1)
> 
> and here
> 
>> +
>> +/**
>> + * DOC: landlock_action_fs
>> + *
>> + * - %LANDLOCK_ACTION_FS_EXEC: execute a file or walk through a directory
>> + * - %LANDLOCK_ACTION_FS_WRITE: modify a file or a directory view (which
>> + *   include mount actions)
>> + * - %LANDLOCK_ACTION_FS_READ: read a file or a directory
>> + * - %LANDLOCK_ACTION_FS_NEW: create a file or a directory
>> + * - %LANDLOCK_ACTION_FS_GET: open or receive a file
>> + * - %LANDLOCK_ACTION_FS_REMOVE: unlink a file or remove a directory
>> + *
>> + * Each of the following actions are specific to syscall multiplexers. Each of
>> + * them trigger a dedicated Landlock event where their command can be read.
>> + *
>> + * - %LANDLOCK_ACTION_FS_IOCTL: ioctl command
>> + * - %LANDLOCK_ACTION_FS_LOCK: flock or fcntl lock command
>> + * - %LANDLOCK_ACTION_FS_FCNTL: fcntl command
>> + */
>> +#define LANDLOCK_ACTION_FS_EXEC			(1ULL << 0)
>> +#define LANDLOCK_ACTION_FS_WRITE		(1ULL << 1)
>> +#define LANDLOCK_ACTION_FS_READ			(1ULL << 2)
>> +#define LANDLOCK_ACTION_FS_NEW			(1ULL << 3)
>> +#define LANDLOCK_ACTION_FS_GET			(1ULL << 4)
>> +#define LANDLOCK_ACTION_FS_REMOVE		(1ULL << 5)
>> +#define LANDLOCK_ACTION_FS_IOCTL		(1ULL << 6)
>> +#define LANDLOCK_ACTION_FS_LOCK			(1ULL << 7)
>> +#define LANDLOCK_ACTION_FS_FCNTL		(1ULL << 8)
>> +#define _LANDLOCK_ACTION_FS_NB			9
>> +#define _LANDLOCK_ACTION_FS_MASK		((1ULL << _LANDLOCK_ACTION_FS_NB) - 1)
> 
> and here
> 
>> +
>> +
>> +/**
>> + * struct landlock_context - context accessible to a Landlock rule
>> + *
>> + * @status: bitfield for future use (LANDLOCK_SUBTYPE_STATUS_*)
>> + * @event: event type (&enum landlock_subtype_event)
>> + * @arg1: event's first optional argument
>> + * @arg2: event's second optional argument
>> + */
>> +struct landlock_context {
>> +	__u64 status;
>> +	__u64 event;
>> +	__u64 arg1;
>> +	__u64 arg2;
>> +};
> 
> looking at all the uapi additions.. probably worth moving them
> into separate .h like we did with bpf_perf_event.h
> How about include/uapi/linux/landlock.h ?
> It can include bpf.h first thing and then the rest of
> landlock related bits.
> so all, but BPF_PROG_TYPE_LANDLOCK_RULE will go there.

Sounds good.


> 
>> +++ b/security/landlock/Kconfig
>> @@ -0,0 +1,18 @@
>> +config SECURITY_LANDLOCK
>> +	bool "Landlock sandbox support"
>> +	depends on SECURITY
>> +	depends on BPF_SYSCALL
>> +	depends on SECCOMP_FILTER
>> +	default y
> 
> all new features need to start with default n

OK

> 
>> +static inline const struct bpf_func_proto *bpf_landlock_func_proto(
>> +		enum bpf_func_id func_id,
>> +		const union bpf_prog_subtype *prog_subtype)
>> +{
>> +	/* generic functions */
>> +	if (prog_subtype->landlock_rule.ability &
>> +			LANDLOCK_SUBTYPE_ABILITY_DEBUG) {
>> +		switch (func_id) {
>> +		case BPF_FUNC_get_current_comm:
>> +			return &bpf_get_current_comm_proto;
>> +		case BPF_FUNC_get_current_pid_tgid:
>> +			return &bpf_get_current_pid_tgid_proto;
>> +		case BPF_FUNC_get_current_uid_gid:
>> +			return &bpf_get_current_uid_gid_proto;
> 
> why current_*() helpers are 'debug' only?

This helpers should not be needed for access control. It sounds like a
really bad idea to rely on a command name, a PID or even the UID,
especially for a kind of hierarchy-based access control like Landlock. I
want to avoid people from using these features except for debug
purposes. This could be changed in the future if there is a legitimate
use case of some of these features, though.


[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 488 bytes --]

^ permalink raw reply

* Re: [PATCH net-next v7 01/10] selftest: Enhance kselftest_harness.h with a step mechanism
From: Mickaël Salaün @ 2017-08-25  7:58 UTC (permalink / raw)
  To: Alexei Starovoitov, Shuah Khan
  Cc: linux-kernel, Alexei Starovoitov, Andy Lutomirski,
	Arnaldo Carvalho de Melo, Casey Schaufler, Daniel Borkmann,
	David Drysdale, David S . Miller, Eric W . Biederman,
	James Morris, Jann Horn, Jonathan Corbet, Matthew Garrett,
	Michael Kerrisk, Kees Cook, Paul Moore, Sargun Dhillon,
	Serge E . Hallyn, Tejun Heo, Thomas Graf, Will Drewry
In-Reply-To: <20170824023105.gt6lyn2snhczryik@ast-mbp>


[-- Attachment #1.1: Type: text/plain, Size: 1563 bytes --]



On 24/08/2017 04:31, Alexei Starovoitov wrote:
> On Mon, Aug 21, 2017 at 02:09:24AM +0200, Mickaël Salaün wrote:
>> This step mechanism may be useful to return an information about the
>> error without being able to write to TH_LOG_STREAM.
>>
>> Set _metadata->no_print to true to print this counter.
>>
>> Signed-off-by: Mickaël Salaün <mic@digikod.net>
>> Cc: Andy Lutomirski <luto@amacapital.net>
>> Cc: Arnaldo Carvalho de Melo <acme@kernel.org>
>> Cc: Kees Cook <keescook@chromium.org>
>> Cc: Shuah Khan <shuah@kernel.org>
>> Cc: Will Drewry <wad@chromium.org>
>> Link: https://lkml.kernel.org/r/CAGXu5j+D-FP8Kt9unNOqKrQJP4DYTpmgkJxWykZyrYiVPz3Y3Q@mail.gmail.com
>> ---
>>
>> This patch is intended to the kselftest tree:
>> https://lkml.kernel.org/r/20170806232337.4191-1-mic@digikod.net
>>
>> Changes since v6:
>> * add the step counter in assert/expect macros and use _metadata to
>>   enable the counter (suggested by Kees Cook)
>> ---
>>  tools/testing/selftests/kselftest_harness.h   | 31 ++++++++++++++++++++++-----
>>  tools/testing/selftests/seccomp/seccomp_bpf.c |  2 +-
>>  2 files changed, 27 insertions(+), 6 deletions(-)
> 
> is there a dependency on this in patches 2+ ?
> if not, I would send this patch to selftests right away.
> 
> 

The Landlock tests [patch 9/10] rely on it for now.

I sent it three weeks ago:
https://lkml.kernel.org/r/20170806232337.4191-1-mic@digikod.net

Anyway, until this patch is merged in the kselftest tree and then
available to net-next, I'll have to keep it here.


[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 488 bytes --]

^ permalink raw reply

* [PATCH net-next v2 5/5] ipv6: sr: implement additional seg6local actions
From: David Lebrun @ 2017-08-25  7:58 UTC (permalink / raw)
  To: netdev; +Cc: David Lebrun
In-Reply-To: <20170825075648.5061-1-david.lebrun@uclouvain.be>

This patch implements the following seg6local actions.

- SEG6_LOCAL_ACTION_END_T: regular SRH processing and forward to the
  next-hop looked up in the specified routing table.

- SEG6_LOCAL_ACTION_END_DX2: decapsulate an L2 frame and forward it to
  the specified network interface.

- SEG6_LOCAL_ACTION_END_DX4: decapsulate an IPv4 packet and forward it,
  possibly to the specified next-hop.

- SEG6_LOCAL_ACTION_END_DT6: decapsulate an IPv6 packet and forward it
  to the next-hop looked up in the specified routing table.

Signed-off-by: David Lebrun <david.lebrun@uclouvain.be>
---
 net/ipv6/seg6_local.c | 147 ++++++++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 147 insertions(+)

diff --git a/net/ipv6/seg6_local.c b/net/ipv6/seg6_local.c
index 26db4d3e..9c1a885 100644
--- a/net/ipv6/seg6_local.c
+++ b/net/ipv6/seg6_local.c
@@ -30,6 +30,7 @@
 #ifdef CONFIG_IPV6_SEG6_HMAC
 #include <net/seg6_hmac.h>
 #endif
+#include <linux/etherdevice.h>
 
 struct seg6_local_lwt;
 
@@ -226,6 +227,82 @@ static int input_action_end_x(struct sk_buff *skb, struct seg6_local_lwt *slwt)
 	return -EINVAL;
 }
 
+static int input_action_end_t(struct sk_buff *skb, struct seg6_local_lwt *slwt)
+{
+	struct ipv6_sr_hdr *srh;
+
+	srh = get_and_validate_srh(skb);
+	if (!srh)
+		goto drop;
+
+	advance_nextseg(srh, &ipv6_hdr(skb)->daddr);
+
+	lookup_nexthop(skb, NULL, slwt->table);
+
+	return dst_input(skb);
+
+drop:
+	kfree_skb(skb);
+	return -EINVAL;
+}
+
+/* decapsulate and forward inner L2 frame on specified interface */
+static int input_action_end_dx2(struct sk_buff *skb,
+				struct seg6_local_lwt *slwt)
+{
+	struct net *net = dev_net(skb->dev);
+	struct net_device *odev;
+	struct ethhdr *eth;
+
+	if (!decap_and_validate(skb, NEXTHDR_NONE))
+		goto drop;
+
+	if (!pskb_may_pull(skb, ETH_HLEN))
+		goto drop;
+
+	skb_reset_mac_header(skb);
+	eth = (struct ethhdr *)skb->data;
+
+	/* To determine the frame's protocol, we assume it is 802.3. This avoids
+	 * a call to eth_type_trans(), which is not really relevant for our
+	 * use case.
+	 */
+	if (!eth_proto_is_802_3(eth->h_proto))
+		goto drop;
+
+	odev = dev_get_by_index_rcu(net, slwt->oif);
+	if (!odev)
+		goto drop;
+
+	/* As we accept Ethernet frames, make sure the egress device is of
+	 * the correct type.
+	 */
+	if (odev->type != ARPHRD_ETHER)
+		goto drop;
+
+	if (!(odev->flags & IFF_UP) || !netif_carrier_ok(odev))
+		goto drop;
+
+	skb_orphan(skb);
+
+	if (skb_warn_if_lro(skb))
+		goto drop;
+
+	skb_forward_csum(skb);
+
+	if (skb->len - ETH_HLEN > odev->mtu)
+		goto drop;
+
+	skb->dev = odev;
+	skb->protocol = eth->h_proto;
+
+	return dev_queue_xmit(skb);
+
+drop:
+	kfree_skb(skb);
+	return -EINVAL;
+}
+
 /* decapsulate and forward to specified nexthop */
 static int input_action_end_dx6(struct sk_buff *skb,
 				struct seg6_local_lwt *slwt)
@@ -260,6 +337,56 @@ static int input_action_end_dx6(struct sk_buff *skb,
 	return -EINVAL;
 }
 
+static int input_action_end_dx4(struct sk_buff *skb,
+				struct seg6_local_lwt *slwt)
+{
+	struct iphdr *iph;
+	__be32 nhaddr;
+	int err;
+
+	if (!decap_and_validate(skb, IPPROTO_IPIP))
+		goto drop;
+
+	if (!pskb_may_pull(skb, sizeof(struct iphdr)))
+		goto drop;
+
+	skb->protocol = htons(ETH_P_IP);
+
+	iph = ip_hdr(skb);
+
+	nhaddr = slwt->nh4.s_addr ?: iph->daddr;
+
+	skb_dst_drop(skb);
+
+	err = ip_route_input(skb, nhaddr, iph->saddr, 0, skb->dev);
+	if (err)
+		goto drop;
+
+	return dst_input(skb);
+
+drop:
+	kfree_skb(skb);
+	return -EINVAL;
+}
+
+static int input_action_end_dt6(struct sk_buff *skb,
+				struct seg6_local_lwt *slwt)
+{
+	if (!decap_and_validate(skb, IPPROTO_IPV6))
+		goto drop;
+
+	if (!pskb_may_pull(skb, sizeof(struct ipv6hdr)))
+		goto drop;
+
+	lookup_nexthop(skb, NULL, slwt->table);
+
+	return dst_input(skb);
+
+drop:
+	kfree_skb(skb);
+	return -EINVAL;
+}
+
 /* push an SRH on top of the current one */
 static int input_action_end_b6(struct sk_buff *skb, struct seg6_local_lwt *slwt)
 {
@@ -330,11 +457,31 @@ static struct seg6_action_desc seg6_action_table[] = {
 		.input		= input_action_end_x,
 	},
 	{
+		.action		= SEG6_LOCAL_ACTION_END_T,
+		.attrs		= (1 << SEG6_LOCAL_TABLE),
+		.input		= input_action_end_t,
+	},
+	{
+		.action		= SEG6_LOCAL_ACTION_END_DX2,
+		.attrs		= (1 << SEG6_LOCAL_OIF),
+		.input		= input_action_end_dx2,
+	},
+	{
 		.action		= SEG6_LOCAL_ACTION_END_DX6,
 		.attrs		= (1 << SEG6_LOCAL_NH6),
 		.input		= input_action_end_dx6,
 	},
 	{
+		.action		= SEG6_LOCAL_ACTION_END_DX4,
+		.attrs		= (1 << SEG6_LOCAL_NH4),
+		.input		= input_action_end_dx4,
+	},
+	{
+		.action		= SEG6_LOCAL_ACTION_END_DT6,
+		.attrs		= (1 << SEG6_LOCAL_TABLE),
+		.input		= input_action_end_dt6,
+	},
+	{
 		.action		= SEG6_LOCAL_ACTION_END_B6,
 		.attrs		= (1 << SEG6_LOCAL_SRH),
 		.input		= input_action_end_b6,
-- 
2.10.2

^ permalink raw reply related

* [PATCH net-next v2 1/5] ipv6: sr: add support for ip4ip6 encapsulation
From: David Lebrun @ 2017-08-25  7:56 UTC (permalink / raw)
  To: netdev; +Cc: David Lebrun
In-Reply-To: <20170825075648.5061-1-david.lebrun@uclouvain.be>

This patch enables the SRv6 encapsulation mode to carry an IPv4 payload.
All the infrastructure was already present, I just had to add a parameter
to seg6_do_srh_encap() to specify the inner packet protocol, and perform
some additional checks.

Usage example:
ip route add 1.2.3.4 encap seg6 mode encap segs fc00::1,fc00::2 dev eth0

Signed-off-by: David Lebrun <david.lebrun@uclouvain.be>
---
 include/net/seg6.h       |  3 ++-
 net/ipv6/seg6_iptunnel.c | 47 +++++++++++++++++++++++++++++++++++++----------
 net/ipv6/seg6_local.c    |  2 +-
 3 files changed, 40 insertions(+), 12 deletions(-)

diff --git a/include/net/seg6.h b/include/net/seg6.h
index 5379f55..099bad5 100644
--- a/include/net/seg6.h
+++ b/include/net/seg6.h
@@ -60,7 +60,8 @@ extern int seg6_local_init(void);
 extern void seg6_local_exit(void);
 
 extern bool seg6_validate_srh(struct ipv6_sr_hdr *srh, int len);
-extern int seg6_do_srh_encap(struct sk_buff *skb, struct ipv6_sr_hdr *osrh);
+extern int seg6_do_srh_encap(struct sk_buff *skb, struct ipv6_sr_hdr *osrh,
+			     int proto);
 extern int seg6_do_srh_inline(struct sk_buff *skb, struct ipv6_sr_hdr *osrh);
 
 #endif
diff --git a/net/ipv6/seg6_iptunnel.c b/net/ipv6/seg6_iptunnel.c
index 5012330..5bec781 100644
--- a/net/ipv6/seg6_iptunnel.c
+++ b/net/ipv6/seg6_iptunnel.c
@@ -91,7 +91,7 @@ static void set_tun_src(struct net *net, struct net_device *dev,
 }
 
 /* encapsulate an IPv6 packet within an outer IPv6 header with a given SRH */
-int seg6_do_srh_encap(struct sk_buff *skb, struct ipv6_sr_hdr *osrh)
+int seg6_do_srh_encap(struct sk_buff *skb, struct ipv6_sr_hdr *osrh, int proto)
 {
 	struct net *net = dev_net(skb_dst(skb)->dev);
 	struct ipv6hdr *hdr, *inner_hdr;
@@ -116,15 +116,22 @@ int seg6_do_srh_encap(struct sk_buff *skb, struct ipv6_sr_hdr *osrh)
 	 * hlim will be decremented in ip6_forward() afterwards and
 	 * decapsulation will overwrite inner hlim with outer hlim
 	 */
-	ip6_flow_hdr(hdr, ip6_tclass(ip6_flowinfo(inner_hdr)),
-		     ip6_flowlabel(inner_hdr));
-	hdr->hop_limit = inner_hdr->hop_limit;
+
+	if (skb->protocol == htons(ETH_P_IPV6)) {
+		ip6_flow_hdr(hdr, ip6_tclass(ip6_flowinfo(inner_hdr)),
+			     ip6_flowlabel(inner_hdr));
+		hdr->hop_limit = inner_hdr->hop_limit;
+	} else {
+		ip6_flow_hdr(hdr, 0, 0);
+		hdr->hop_limit = ip6_dst_hoplimit(skb_dst(skb));
+	}
+
 	hdr->nexthdr = NEXTHDR_ROUTING;
 
 	isrh = (void *)hdr + sizeof(*hdr);
 	memcpy(isrh, osrh, hdrlen);
 
-	isrh->nexthdr = NEXTHDR_IPV6;
+	isrh->nexthdr = proto;
 
 	hdr->daddr = isrh->segments[isrh->first_segment];
 	set_tun_src(net, skb->dev, &hdr->daddr, &hdr->saddr);
@@ -199,7 +206,7 @@ static int seg6_do_srh(struct sk_buff *skb)
 {
 	struct dst_entry *dst = skb_dst(skb);
 	struct seg6_iptunnel_encap *tinfo;
-	int err = 0;
+	int proto, err = 0;
 
 	tinfo = seg6_encap_lwtunnel(dst->lwtstate);
 
@@ -210,17 +217,31 @@ static int seg6_do_srh(struct sk_buff *skb)
 
 	switch (tinfo->mode) {
 	case SEG6_IPTUN_MODE_INLINE:
+		if (skb->protocol != htons(ETH_P_IPV6))
+			return -EINVAL;
+
 		err = seg6_do_srh_inline(skb, tinfo->srh);
+		if (err)
+			return err;
+
 		skb_reset_inner_headers(skb);
 		break;
 	case SEG6_IPTUN_MODE_ENCAP:
-		err = seg6_do_srh_encap(skb, tinfo->srh);
+		if (skb->protocol == htons(ETH_P_IPV6))
+			proto = IPPROTO_IPV6;
+		else if (skb->protocol == htons(ETH_P_IP))
+			proto = IPPROTO_IPIP;
+		else
+			return -EINVAL;
+
+		err = seg6_do_srh_encap(skb, tinfo->srh, proto);
+		if (err)
+			return err;
+
+		skb->protocol = htons(ETH_P_IPV6);
 		break;
 	}
 
-	if (err)
-		return err;
-
 	ipv6_hdr(skb)->payload_len = htons(skb->len - sizeof(struct ipv6hdr));
 	skb_set_transport_header(skb, sizeof(struct ipv6hdr));
 
@@ -334,6 +355,9 @@ static int seg6_build_state(struct nlattr *nla,
 	struct seg6_lwt *slwt;
 	int err;
 
+	if (family != AF_INET && family != AF_INET6)
+		return -EINVAL;
+
 	err = nla_parse_nested(tb, SEG6_IPTUNNEL_MAX, nla,
 			       seg6_iptunnel_policy, extack);
 
@@ -356,6 +380,9 @@ static int seg6_build_state(struct nlattr *nla,
 
 	switch (tuninfo->mode) {
 	case SEG6_IPTUN_MODE_INLINE:
+		if (family != AF_INET6)
+			return -EINVAL;
+
 		break;
 	case SEG6_IPTUN_MODE_ENCAP:
 		break;
diff --git a/net/ipv6/seg6_local.c b/net/ipv6/seg6_local.c
index 147680e..609b94e 100644
--- a/net/ipv6/seg6_local.c
+++ b/net/ipv6/seg6_local.c
@@ -290,7 +290,7 @@ static int input_action_end_b6_encap(struct sk_buff *skb,
 	skb_reset_inner_headers(skb);
 	skb->encapsulation = 1;
 
-	err = seg6_do_srh_encap(skb, slwt->srh);
+	err = seg6_do_srh_encap(skb, slwt->srh, IPPROTO_IPV6);
 	if (err)
 		goto drop;
 
-- 
2.10.2

^ permalink raw reply related

* [PATCH net-next v2 3/5] ipv6: sr: enforce IPv6 packets for seg6local lwt
From: David Lebrun @ 2017-08-25  7:56 UTC (permalink / raw)
  To: netdev; +Cc: David Lebrun
In-Reply-To: <20170825075648.5061-1-david.lebrun@uclouvain.be>

This patch ensures that the seg6local lightweight tunnel is used solely
with IPv6 routes and processes only IPv6 packets.

Signed-off-by: David Lebrun <david.lebrun@uclouvain.be>
---
 net/ipv6/seg6_local.c | 8 ++++++++
 1 file changed, 8 insertions(+)

diff --git a/net/ipv6/seg6_local.c b/net/ipv6/seg6_local.c
index 609b94e..c626325 100644
--- a/net/ipv6/seg6_local.c
+++ b/net/ipv6/seg6_local.c
@@ -357,6 +357,11 @@ static int seg6_local_input(struct sk_buff *skb)
 	struct seg6_action_desc *desc;
 	struct seg6_local_lwt *slwt;
 
+	if (skb->protocol != htons(ETH_P_IPV6)) {
+		kfree_skb(skb);
+		return -EINVAL;
+	}
+
 	slwt = seg6_local_lwtunnel(orig_dst->lwtstate);
 	desc = slwt->desc;
 
@@ -623,6 +628,9 @@ static int seg6_local_build_state(struct nlattr *nla, unsigned int family,
 	struct seg6_local_lwt *slwt;
 	int err;
 
+	if (family != AF_INET6)
+		return -EINVAL;
+
 	err = nla_parse_nested(tb, SEG6_LOCAL_MAX, nla, seg6_local_policy,
 			       extack);
 
-- 
2.10.2

^ permalink raw reply related

* [PATCH net-next v2 4/5] ipv6: sr: add helper functions for seg6local
From: David Lebrun @ 2017-08-25  7:56 UTC (permalink / raw)
  To: netdev; +Cc: David Lebrun
In-Reply-To: <20170825075648.5061-1-david.lebrun@uclouvain.be>

This patch adds three helper functions to be used with the seg6local packet
processing actions.

The decap_and_validate() function will be used by the End.D* actions, that
decapsulate an SR-enabled packet.

The advance_nextseg() function applies the fundamental operations to update
an SRH for the next segment.

The lookup_nexthop() function helps select the next-hop for the processed
SR packets. It supports an optional next-hop address to route the packet
specifically through it, and an optional routing table to use.

Signed-off-by: David Lebrun <david.lebrun@uclouvain.be>
---
 net/ipv6/Kconfig      |   1 +
 net/ipv6/seg6_local.c | 189 ++++++++++++++++++++++++++------------------------
 2 files changed, 101 insertions(+), 89 deletions(-)

diff --git a/net/ipv6/Kconfig b/net/ipv6/Kconfig
index 0d72239..ea71e4b 100644
--- a/net/ipv6/Kconfig
+++ b/net/ipv6/Kconfig
@@ -308,6 +308,7 @@ config IPV6_SEG6_LWTUNNEL
 	depends on IPV6
 	select LWTUNNEL
 	select DST_CACHE
+	select IPV6_MULTIPLE_TABLES
 	---help---
 	  Support for encapsulation of packets within an outer IPv6
 	  header and a Segment Routing Header using the lightweight
diff --git a/net/ipv6/seg6_local.c b/net/ipv6/seg6_local.c
index c626325..26db4d3e 100644
--- a/net/ipv6/seg6_local.c
+++ b/net/ipv6/seg6_local.c
@@ -99,23 +99,105 @@ static struct ipv6_sr_hdr *get_and_validate_srh(struct sk_buff *skb)
 	return srh;
 }
 
+static bool decap_and_validate(struct sk_buff *skb, int proto)
+{
+	struct ipv6_sr_hdr *srh;
+	unsigned int off = 0;
+
+	srh = get_srh(skb);
+	if (srh && srh->segments_left > 0)
+		return false;
+
+#ifdef CONFIG_IPV6_SEG6_HMAC
+	if (srh && !seg6_hmac_validate_skb(skb))
+		return false;
+#endif
+
+	if (ipv6_find_hdr(skb, &off, proto, NULL, NULL) < 0)
+		return false;
+
+	if (!pskb_pull(skb, off))
+		return false;
+
+	skb_postpull_rcsum(skb, skb_network_header(skb), off);
+
+	skb_reset_network_header(skb);
+	skb_reset_transport_header(skb);
+	skb->encapsulation = 0;
+
+	return true;
+}
+
+static void advance_nextseg(struct ipv6_sr_hdr *srh, struct in6_addr *daddr)
+{
+	struct in6_addr *addr;
+
+	srh->segments_left--;
+	addr = srh->segments + srh->segments_left;
+	*daddr = *addr;
+}
+
+static void lookup_nexthop(struct sk_buff *skb, struct in6_addr *nhaddr,
+			   u32 tbl_id)
+{
+	struct net *net = dev_net(skb->dev);
+	struct ipv6hdr *hdr = ipv6_hdr(skb);
+	int flags = RT6_LOOKUP_F_HAS_SADDR;
+	struct dst_entry *dst = NULL;
+	struct rt6_info *rt;
+	struct flowi6 fl6;
+
+	fl6.flowi6_iif = skb->dev->ifindex;
+	fl6.daddr = nhaddr ? *nhaddr : hdr->daddr;
+	fl6.saddr = hdr->saddr;
+	fl6.flowlabel = ip6_flowinfo(hdr);
+	fl6.flowi6_mark = skb->mark;
+	fl6.flowi6_proto = hdr->nexthdr;
+
+	if (nhaddr)
+		fl6.flowi6_flags = FLOWI_FLAG_KNOWN_NH;
+
+	if (!tbl_id) {
+		dst = ip6_route_input_lookup(net, skb->dev, &fl6, flags);
+	} else {
+		struct fib6_table *table;
+
+		table = fib6_get_table(net, tbl_id);
+		if (!table)
+			goto out;
+
+		rt = ip6_pol_route(net, table, 0, &fl6, flags);
+		dst = &rt->dst;
+	}
+
+	if (dst && dst->dev->flags & IFF_LOOPBACK && !dst->error) {
+		dst_release(dst);
+		dst = NULL;
+	}
+
+out:
+	if (!dst) {
+		rt = net->ipv6.ip6_blk_hole_entry;
+		dst = &rt->dst;
+		dst_hold(dst);
+	}
+
+	skb_dst_drop(skb);
+	skb_dst_set(skb, dst);
+}
+
 /* regular endpoint function */
 static int input_action_end(struct sk_buff *skb, struct seg6_local_lwt *slwt)
 {
 	struct ipv6_sr_hdr *srh;
-	struct in6_addr *addr;
 
 	srh = get_and_validate_srh(skb);
 	if (!srh)
 		goto drop;
 
-	srh->segments_left--;
-	addr = srh->segments + srh->segments_left;
-
-	ipv6_hdr(skb)->daddr = *addr;
+	advance_nextseg(srh, &ipv6_hdr(skb)->daddr);
 
-	skb_dst_drop(skb);
-	ip6_route_input(skb);
+	lookup_nexthop(skb, NULL, 0);
 
 	return dst_input(skb);
 
@@ -127,41 +209,15 @@ static int input_action_end(struct sk_buff *skb, struct seg6_local_lwt *slwt)
 /* regular endpoint, and forward to specified nexthop */
 static int input_action_end_x(struct sk_buff *skb, struct seg6_local_lwt *slwt)
 {
-	struct net *net = dev_net(skb->dev);
 	struct ipv6_sr_hdr *srh;
-	struct dst_entry *dst;
-	struct in6_addr *addr;
-	struct ipv6hdr *hdr;
-	struct flowi6 fl6;
-	int flags;
 
 	srh = get_and_validate_srh(skb);
 	if (!srh)
 		goto drop;
 
-	srh->segments_left--;
-	addr = srh->segments + srh->segments_left;
-
-	hdr = ipv6_hdr(skb);
-	hdr->daddr = *addr;
-
-	skb_dst_drop(skb);
-
-	fl6.flowi6_iif = skb->dev->ifindex;
-	fl6.daddr = slwt->nh6;
-	fl6.saddr = hdr->saddr;
-	fl6.flowlabel = ip6_flowinfo(hdr);
-	fl6.flowi6_mark = skb->mark;
-	fl6.flowi6_proto = hdr->nexthdr;
-
-	flags = RT6_LOOKUP_F_HAS_SADDR | RT6_LOOKUP_F_IFACE |
-		RT6_LOOKUP_F_REACHABLE;
+	advance_nextseg(srh, &ipv6_hdr(skb)->daddr);
 
-	dst = ip6_route_input_lookup(net, skb->dev, &fl6, flags);
-	if (dst->dev->flags & IFF_LOOPBACK)
-		goto drop;
-
-	skb_dst_set(skb, dst);
+	lookup_nexthop(skb, &slwt->nh6, 0);
 
 	return dst_input(skb);
 
@@ -174,42 +230,18 @@ static int input_action_end_x(struct sk_buff *skb, struct seg6_local_lwt *slwt)
 static int input_action_end_dx6(struct sk_buff *skb,
 				struct seg6_local_lwt *slwt)
 {
-	struct net *net = dev_net(skb->dev);
-	struct ipv6hdr *inner_hdr;
-	struct ipv6_sr_hdr *srh;
-	struct dst_entry *dst;
-	unsigned int off = 0;
-	struct flowi6 fl6;
-	bool use_nh;
-	int flags;
+	struct in6_addr *nhaddr = NULL;
 
 	/* this function accepts IPv6 encapsulated packets, with either
 	 * an SRH with SL=0, or no SRH.
 	 */
 
-	srh = get_srh(skb);
-	if (srh && srh->segments_left > 0)
-		goto drop;
-
-#ifdef CONFIG_IPV6_SEG6_HMAC
-	if (srh && !seg6_hmac_validate_skb(skb))
+	if (!decap_and_validate(skb, IPPROTO_IPV6))
 		goto drop;
-#endif
 
-	if (ipv6_find_hdr(skb, &off, IPPROTO_IPV6, NULL, NULL) < 0)
-		goto drop;
-
-	if (!pskb_pull(skb, off))
+	if (!pskb_may_pull(skb, sizeof(struct ipv6hdr)))
 		goto drop;
 
-	skb_postpull_rcsum(skb, skb_network_header(skb), off);
-
-	skb_reset_network_header(skb);
-	skb_reset_transport_header(skb);
-	skb->encapsulation = 0;
-
-	inner_hdr = ipv6_hdr(skb);
-
 	/* The inner packet is not associated to any local interface,
 	 * so we do not call netif_rx().
 	 *
@@ -217,26 +249,10 @@ static int input_action_end_dx6(struct sk_buff *skb,
 	 * inner packet's DA. Otherwise, use the specified nexthop.
 	 */
 
-	use_nh = !ipv6_addr_any(&slwt->nh6);
+	if (!ipv6_addr_any(&slwt->nh6))
+		nhaddr = &slwt->nh6;
 
-	skb_dst_drop(skb);
-
-	fl6.flowi6_iif = skb->dev->ifindex;
-	fl6.daddr = use_nh ? slwt->nh6 : inner_hdr->daddr;
-	fl6.saddr = inner_hdr->saddr;
-	fl6.flowlabel = ip6_flowinfo(inner_hdr);
-	fl6.flowi6_mark = skb->mark;
-	fl6.flowi6_proto = inner_hdr->nexthdr;
-
-	flags = RT6_LOOKUP_F_HAS_SADDR | RT6_LOOKUP_F_REACHABLE;
-	if (use_nh)
-		flags |= RT6_LOOKUP_F_IFACE;
-
-	dst = ip6_route_input_lookup(net, skb->dev, &fl6, flags);
-	if (dst->dev->flags & IFF_LOOPBACK)
-		goto drop;
-
-	skb_dst_set(skb, dst);
+	lookup_nexthop(skb, nhaddr, 0);
 
 	return dst_input(skb);
 drop:
@@ -261,8 +277,7 @@ static int input_action_end_b6(struct sk_buff *skb, struct seg6_local_lwt *slwt)
 	ipv6_hdr(skb)->payload_len = htons(skb->len - sizeof(struct ipv6hdr));
 	skb_set_transport_header(skb, sizeof(struct ipv6hdr));
 
-	skb_dst_drop(skb);
-	ip6_route_input(skb);
+	lookup_nexthop(skb, NULL, 0);
 
 	return dst_input(skb);
 
@@ -276,16 +291,13 @@ static int input_action_end_b6_encap(struct sk_buff *skb,
 				     struct seg6_local_lwt *slwt)
 {
 	struct ipv6_sr_hdr *srh;
-	struct in6_addr *addr;
 	int err = -EINVAL;
 
 	srh = get_and_validate_srh(skb);
 	if (!srh)
 		goto drop;
 
-	srh->segments_left--;
-	addr = srh->segments + srh->segments_left;
-	ipv6_hdr(skb)->daddr = *addr;
+	advance_nextseg(srh, &ipv6_hdr(skb)->daddr);
 
 	skb_reset_inner_headers(skb);
 	skb->encapsulation = 1;
@@ -297,8 +309,7 @@ static int input_action_end_b6_encap(struct sk_buff *skb,
 	ipv6_hdr(skb)->payload_len = htons(skb->len - sizeof(struct ipv6hdr));
 	skb_set_transport_header(skb, sizeof(struct ipv6hdr));
 
-	skb_dst_drop(skb);
-	ip6_route_input(skb);
+	lookup_nexthop(skb, NULL, 0);
 
 	return dst_input(skb);
 
-- 
2.10.2

^ permalink raw reply related

* [PATCH net-next v2 0/5] net: updates for IPv6 Segment Routing
From: David Lebrun @ 2017-08-25  7:56 UTC (permalink / raw)
  To: netdev; +Cc: David Lebrun

v2: seg6_lwt_headroom() is not relevant for lwtunnel_input_redirect()
    use cases, and L2ENCAP only uses this redirection. Fix incoherence
    between arbitrary MAC header size support and fixed headroom
    computation by setting only LWTUNNEL_STATE_INPUT_REDIRECT for L2ENCAP
    mode.

This patch series provides several updates for the SRv6 implementation. The
first patch leverages the existing infrastructure to support encapsulation
of IPv4 packets. The second patch implements the T.Encaps.L2 SR function,
enabling to encapsulate an L2 Ethernet frame within an IPv6+SRH packet.
The last three patches update the seg6local lightweight tunnel, and mainly
implement four new actions: End.T, End.DX2, End.DX4 and End.DT6.

David Lebrun (5):
  ipv6: sr: add support for ip4ip6 encapsulation
  ipv6: sr: add support for encapsulation of L2 frames
  ipv6: sr: enforce IPv6 packets for seg6local lwt
  ipv6: sr: add helper functions for seg6local
  ipv6: sr: implement additional seg6local actions

 include/net/seg6.h                 |   3 +-
 include/uapi/linux/seg6_iptunnel.h |  18 ++-
 net/ipv6/Kconfig                   |   1 +
 net/ipv6/seg6_iptunnel.c           |  72 +++++++--
 net/ipv6/seg6_local.c              | 314 ++++++++++++++++++++++++++++---------
 5 files changed, 317 insertions(+), 91 deletions(-)

-- 
2.10.2

^ permalink raw reply

* [PATCH net-next v2 2/5] ipv6: sr: add support for encapsulation of L2 frames
From: David Lebrun @ 2017-08-25  7:56 UTC (permalink / raw)
  To: netdev; +Cc: David Lebrun
In-Reply-To: <20170825075648.5061-1-david.lebrun@uclouvain.be>

This patch implements the L2 frame encapsulation mechanism, referred to
as T.Encaps.L2 in the SRv6 specifications [1].

A new type of SRv6 tunnel mode is added (SEG6_IPTUN_MODE_L2ENCAP). It only
accepts packets with an existing MAC header (i.e., it will not work for
locally generated packets). The resulting packet looks like IPv6 -> SRH ->
Ethernet -> original L3 payload. The next header field of the SRH is set to
NEXTHDR_NONE.

[1] https://tools.ietf.org/html/draft-filsfils-spring-srv6-network-programming-01

Signed-off-by: David Lebrun <david.lebrun@uclouvain.be>
---
 include/uapi/linux/seg6_iptunnel.h | 18 ++++++++++++++----
 net/ipv6/seg6_iptunnel.c           | 25 +++++++++++++++++++++++--
 2 files changed, 37 insertions(+), 6 deletions(-)

diff --git a/include/uapi/linux/seg6_iptunnel.h b/include/uapi/linux/seg6_iptunnel.h
index b6e5a0a..b23df9f 100644
--- a/include/uapi/linux/seg6_iptunnel.h
+++ b/include/uapi/linux/seg6_iptunnel.h
@@ -33,16 +33,26 @@ struct seg6_iptunnel_encap {
 enum {
 	SEG6_IPTUN_MODE_INLINE,
 	SEG6_IPTUN_MODE_ENCAP,
+	SEG6_IPTUN_MODE_L2ENCAP,
 };
 
 #ifdef __KERNEL__
 
 static inline size_t seg6_lwt_headroom(struct seg6_iptunnel_encap *tuninfo)
 {
-	int encap = (tuninfo->mode == SEG6_IPTUN_MODE_ENCAP);
-
-	return ((tuninfo->srh->hdrlen + 1) << 3) +
-	       (encap * sizeof(struct ipv6hdr));
+	int head = 0;
+
+	switch (tuninfo->mode) {
+	case SEG6_IPTUN_MODE_INLINE:
+		break;
+	case SEG6_IPTUN_MODE_ENCAP:
+		head = sizeof(struct ipv6hdr);
+		break;
+	case SEG6_IPTUN_MODE_L2ENCAP:
+		return 0;
+	}
+
+	return ((tuninfo->srh->hdrlen + 1) << 3) + head;
 }
 
 #endif
diff --git a/net/ipv6/seg6_iptunnel.c b/net/ipv6/seg6_iptunnel.c
index 5bec781..bd6cc68 100644
--- a/net/ipv6/seg6_iptunnel.c
+++ b/net/ipv6/seg6_iptunnel.c
@@ -240,6 +240,22 @@ static int seg6_do_srh(struct sk_buff *skb)
 
 		skb->protocol = htons(ETH_P_IPV6);
 		break;
+	case SEG6_IPTUN_MODE_L2ENCAP:
+		if (!skb_mac_header_was_set(skb))
+			return -EINVAL;
+
+		if (pskb_expand_head(skb, skb->mac_len, 0, GFP_ATOMIC) < 0)
+			return -ENOMEM;
+
+		skb_mac_header_rebuild(skb);
+		skb_push(skb, skb->mac_len);
+
+		err = seg6_do_srh_encap(skb, tinfo->srh, NEXTHDR_NONE);
+		if (err)
+			return err;
+
+		skb->protocol = htons(ETH_P_IPV6);
+		break;
 	}
 
 	ipv6_hdr(skb)->payload_len = htons(skb->len - sizeof(struct ipv6hdr));
@@ -386,6 +402,8 @@ static int seg6_build_state(struct nlattr *nla,
 		break;
 	case SEG6_IPTUN_MODE_ENCAP:
 		break;
+	case SEG6_IPTUN_MODE_L2ENCAP:
+		break;
 	default:
 		return -EINVAL;
 	}
@@ -409,8 +427,11 @@ static int seg6_build_state(struct nlattr *nla,
 	memcpy(&slwt->tuninfo, tuninfo, tuninfo_len);
 
 	newts->type = LWTUNNEL_ENCAP_SEG6;
-	newts->flags |= LWTUNNEL_STATE_OUTPUT_REDIRECT |
-			LWTUNNEL_STATE_INPUT_REDIRECT;
+	newts->flags |= LWTUNNEL_STATE_INPUT_REDIRECT;
+
+	if (tuninfo->mode != SEG6_IPTUN_MODE_L2ENCAP)
+		newts->flags |= LWTUNNEL_STATE_OUTPUT_REDIRECT;
+
 	newts->headroom = seg6_lwt_headroom(tuninfo);
 
 	*ts = newts;
-- 
2.10.2

^ permalink raw reply related


This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox