Netdev List
* Re: [PATCH]: iproute action: typo nat fix
From: Stephen Hemminger @ 2013-10-01  4:37 UTC (permalink / raw)
  To: Jamal Hadi Salim; +Cc: netdev@vger.kernel.org, herbert
In-Reply-To: <5248116B.7080305@mojatatu.com>

On Sun, 29 Sep 2013 07:39:23 -0400
Jamal Hadi Salim <jhs@mojatatu.com> wrote:

> 
> attached.
> 
> cheers,
> jamal

Both applied


* [PATCH net]  tc: export tc_defact.h to userspace
From: Stephen Hemminger @ 2013-10-01  4:30 UTC (permalink / raw)
  To: David Miller, Jamal Hadi Salim; +Cc: netdev

Jamal sent a patch to add tc simple actions to iproute2, but the
required header was not being exported to userspace.

Signed-off-by: Stephen Hemminger <stephen@networkplumber.org>

---
 include/linux/tc_act/tc_defact.h      |   19 -------------------
 include/uapi/linux/tc_act/Kbuild      |    1 +
 include/uapi/linux/tc_act/tc_defact.h |   19 +++++++++++++++++++
 3 files changed, 20 insertions(+), 19 deletions(-)
 delete mode 100644 include/linux/tc_act/tc_defact.h
 create mode 100644 include/uapi/linux/tc_act/tc_defact.h

diff --git a/include/linux/tc_act/tc_defact.h b/include/linux/tc_act/tc_defact.h
deleted file mode 100644
index 6f65d07..0000000
--- a/include/linux/tc_act/tc_defact.h
+++ /dev/null
@@ -1,19 +0,0 @@
-#ifndef __LINUX_TC_DEF_H
-#define __LINUX_TC_DEF_H
-
-#include <linux/pkt_cls.h>
-
-struct tc_defact {
-	tc_gen;
-};
-                                                                                
-enum {
-	TCA_DEF_UNSPEC,
-	TCA_DEF_TM,
-	TCA_DEF_PARMS,
-	TCA_DEF_DATA,
-	__TCA_DEF_MAX
-};
-#define TCA_DEF_MAX (__TCA_DEF_MAX - 1)
-
-#endif
diff --git a/include/uapi/linux/tc_act/Kbuild b/include/uapi/linux/tc_act/Kbuild
index 0623ec4..56f1216 100644
--- a/include/uapi/linux/tc_act/Kbuild
+++ b/include/uapi/linux/tc_act/Kbuild
@@ -1,5 +1,6 @@
 # UAPI Header export list
 header-y += tc_csum.h
+header-y += tc_defact.h
 header-y += tc_gact.h
 header-y += tc_ipt.h
 header-y += tc_mirred.h
diff --git a/include/uapi/linux/tc_act/tc_defact.h b/include/uapi/linux/tc_act/tc_defact.h
new file mode 100644
index 0000000..6f65d07
--- /dev/null
+++ b/include/uapi/linux/tc_act/tc_defact.h
@@ -0,0 +1,19 @@
+#ifndef __LINUX_TC_DEF_H
+#define __LINUX_TC_DEF_H
+
+#include <linux/pkt_cls.h>
+
+struct tc_defact {
+	tc_gen;
+};
+                                                                                
+enum {
+	TCA_DEF_UNSPEC,
+	TCA_DEF_TM,
+	TCA_DEF_PARMS,
+	TCA_DEF_DATA,
+	__TCA_DEF_MAX
+};
+#define TCA_DEF_MAX (__TCA_DEF_MAX - 1)
+
+#endif
-- 
1.7.10.4


* Re: [PATCH] ll_temac: Reset dma descriptors on ndo_open
From: David Miller @ 2013-10-01  4:21 UTC (permalink / raw)
  To: ricardo.ribalda; +Cc: joe, jg1.han, gregkh, wfp5p, netdev, linux-kernel
In-Reply-To: <1380281068-13269-1-git-send-email-ricardo.ribalda@gmail.com>

From: Ricardo Ribalda Delgado <ricardo.ribalda@gmail.com>
Date: Fri, 27 Sep 2013 13:24:28 +0200

> The DMA descriptors are only initialized in the probe function.
> 
> If a packet is in the buffer when temac_stop is called, the DMA
> descriptors can be left in an incorrect state where no other packet can
> be sent.
> 
> So an interface could be left in an unusable state after ifdown/ifup.
> 
> This patch makes sure that the descriptors are in a proper status when
> the device is started.
> 
> Signed-off-by: Ricardo Ribalda Delgado <ricardo.ribalda@gmail.com>

This analysis is not correct.

In the current driver, the descriptors are allocated and initialized
in the open function, not the probe function.

I'm not applying this patch.


* Re: [PATCH] iproute2: bridge: Close file with bridge monitor file
From: Stephen Hemminger @ 2013-10-01  4:11 UTC (permalink / raw)
  To: Petr Písař; +Cc: netdev
In-Reply-To: <1380095145-6618-1-git-send-email-ppisar@redhat.com>

On Wed, 25 Sep 2013 09:45:45 +0200
Petr Písař <ppisar@redhat.com> wrote:

> The `bridge monitor file FILENAME' reads dumped netlink messages from
> a file. But it forgot to close the file after using it. This patch
> fixes it.
> 
> Signed-off-by: Petr Písař <ppisar@redhat.com>

Applied


* Re: [PATCH] {iproute2, xfrm}: Use memcpy to suppress gcc phony buffer overflow warning
From: Stephen Hemminger @ 2013-10-01  4:10 UTC (permalink / raw)
  To: Fan Du; +Cc: Sohny Thomas, David Laight, netdev
In-Reply-To: <52478710.702@windriver.com>


> diff --git a/ip/xfrm_state.c b/ip/xfrm_state.c
> index 0d98e78..5cc87d3 100644
> --- a/ip/xfrm_state.c
> +++ b/ip/xfrm_state.c
> @@ -159,7 +159,7 @@ static int xfrm_algo_parse(struct xfrm_algo *alg, enum xfrm_attr_type_t type,
>   			if (len > max)
>   				invarg("\"ALGO-KEY\" makes buffer overflow\n", key);
> 
> -			strncpy(buf, key, len);
> +			memcpy(buf, key, len);
>   		}
>   	}
> 

Applied this patch, with some edits to make the commit message logical.
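
For readers following the diff above, here is a minimal sketch of the
length-checked copy pattern. The helper name copy_key() is hypothetical and
is not from ip/xfrm_state.c; it only illustrates that once len has been
validated against the buffer size, memcpy() copies the same bytes that
strncpy(buf, key, len) would.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper (not from iproute2): copy a key into buf after
 * checking that it fits. Returns the number of bytes copied, or -1. */
int copy_key(char *buf, size_t max, const char *key)
{
	size_t len;

	assert(buf != NULL && key != NULL);
	len = strlen(key);

	if (len > max)
		return -1;	/* caller reports the overflow */

	/* len is already validated; like strncpy(buf, key, len), this
	 * copies exactly len bytes and writes no terminating NUL */
	memcpy(buf, key, len);
	return (int)len;
}
```

Neither variant NUL-terminates buf, so a caller that needs a C string has to
add the terminator itself.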


* Re: [PATCH] tcp: TSQ can use a dynamic limit
From: David Miller @ 2013-10-01  3:52 UTC (permalink / raw)
  To: eric.dumazet; +Cc: xiyou.wangcong, wei.liu2, netdev, ycheng, ncardwell
In-Reply-To: <1380277734.30872.25.camel@edumazet-glaptop.roam.corp.google.com>

From: Eric Dumazet <eric.dumazet@gmail.com>
Date: Fri, 27 Sep 2013 03:28:54 -0700

> From: Eric Dumazet <edumazet@google.com>
> 
> When TCP Small Queues was added, we used a sysctl to limit the number of
> packets queued on Qdisc/device queues for a given TCP flow.
> 
> The problem is that this limit is either too big for low rates, or too
> small for high rates.
> 
> Now that the TCP stack has rate estimation in sk->sk_pacing_rate, and TSO
> auto sizing, it can better control the number of packets in Qdisc/device
> queues.
> 
> New limit is two packets or at least 1 to 2 ms worth of packets.
> 
> Low-rate flows benefit from this patch by having an even smaller
> number of packets in queues, allowing for faster recovery and
> better RTT estimations.
> 
> High-rate flows benefit from this patch by allowing more than 2 packets
> in flight, as we had reports this was a limiting factor to reach line
> rate. [ In particular if TX completion is delayed because of coalescing
> parameters ]
> 
> Example for a single flow on a 10Gbps link controlled by FQ/pacing
> 
> 14 packets in flight instead of 2
> 
> $ tc -s -d qd
> qdisc fq 8001: dev eth0 root refcnt 32 limit 10000p flow_limit 100p
> buckets 1024 quantum 3028 initial_quantum 15140 
>  Sent 1168459366606 bytes 771822841 pkt (dropped 0, overlimits 0
> requeues 6822476) 
>  rate 9346Mbit 771713pps backlog 953820b 14p requeues 6822476 
>   2047 flow, 2046 inactive, 1 throttled, delay 15673 ns
>   2372 gc, 0 highprio, 0 retrans, 9739249 throttled, 0 flows_plimit
> 
> Note that sk_pacing_rate is currently set to twice the actual rate, but
> this might be refined in the future when a flow is in congestion
> avoidance.
> 
> Additional change : skb->destructor should be set to tcp_wfree().
> 
> A future patch (for linux 3.13+) might remove tcp_limit_output_bytes
> 
> Signed-off-by: Eric Dumazet <edumazet@google.com>

Applied, thanks Eric.
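
The limit described in the quoted commit message ("two packets or at least
1 to 2 ms worth of packets") can be sketched as follows. This is an
illustration only, not the code from the patch: the function name and
parameters are hypothetical, and the pacing rate is assumed to be given in
bytes per second.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper: bytes one flow may keep queued below the TCP stack.
 * pkt_truesize: memory footprint of one packet, in bytes
 * pacing_rate:  pacing rate of the flow, in bytes per second (assumed unit) */
uint64_t tsq_limit_bytes(uint64_t pkt_truesize, uint64_t pacing_rate)
{
	uint64_t two_pkts = 2 * pkt_truesize;
	uint64_t one_ms = pacing_rate / 1000;	/* bytes sent in 1 ms */

	assert(pkt_truesize > 0);

	/* low rates: floor of two packets; high rates: ~1 ms of data */
	return two_pkts > one_ms ? two_pkts : one_ms;
}
```

With a 2048-byte packet, a flow paced at 1 MB/s stays at the two-packet
floor, while a flow paced at 1.25 GB/s is allowed about 1.25 MB in the
queues.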


* Re: [net-next PATCH V2] virtio-net: switch to use XPS to choose txq
From: Rusty Russell @ 2013-10-01  1:54 UTC (permalink / raw)
  To: Michael S. Tsirkin, Jason Wang; +Cc: netdev, linux-kernel, virtualization
In-Reply-To: <20130930090402.GB20291@redhat.com>

"Michael S. Tsirkin" <mst@redhat.com> writes:
> On Mon, Sep 30, 2013 at 03:37:17PM +0800, Jason Wang wrote:
>> We used to use a percpu structure vq_index to record the cpu to queue
>> mapping. This is suboptimal since it duplicates the work of XPS and
>> loses all other XPS functionality, such as allowing users to configure
>> their own transmission steering strategy.
>> 
>> So this patch switches to XPS and suggests a default mapping when
>> the number of cpus is equal to the number of queues. With XPS support,
>> there's no need to keep the per-cpu vq_index and .ndo_select_queue(),
>> so they were removed as well.
>> 
>> Cc: Rusty Russell <rusty@rustcorp.com.au>
>> Cc: Michael S. Tsirkin <mst@redhat.com>
>> Signed-off-by: Jason Wang <jasowang@redhat.com>
>
> Acked-by: Michael S. Tsirkin <mst@redhat.com>

Acked-by: Rusty Russell <rusty@rustcorp.com.au>

Dave, please apply.

Cheers,
Rusty.


* [PATCH net-next] extended BPF
From: Alexei Starovoitov @ 2013-10-01  1:04 UTC (permalink / raw)
  To: David S. Miller
  Cc: netdev, Eric Dumazet, Alexey Kuznetsov, James Morris,
	Hideaki YOSHIFUJI, Patrick McHardy, Thomas Gleixner, Ingo Molnar,
	H. Peter Anvin, Daniel Borkmann, Paul E. McKenney, Xi Wang,
	David Howells, Cong Wang, Jesse Gross, Thomas Graf,
	Willem de Bruijn, Stephen Hemminger, Pablo Neira Ayuso,
	Pavel Emelyanov, Will Drewry, linux-kernel, x86

Q: What is BPF?
A: Safe dynamically loadable 32-bit program that can access skb->data via
sk_load_byte/half/word calls or seccomp_data. Can be attached to sockets,
to netfilter xtables, seccomp. In case of sockets/xtables input is skb.
In case of seccomp input is struct seccomp_data.

Q: What is extended BPF?
A: Safe dynamically loadable 64-bit program that can call fixed set
of kernel functions and takes generic bpf_context as an input.
BPF program is a glue between kernel functions and bpf_context.
Different kernel subsystems can define their own set of available functions
and alter BPF machinery for specific use case.

Example 1:
when function set is {bpf_load_byte/half/word} and bpf_context=skb
the extended BPF is equivalent to original BPF (w/o negative offset extensions),
since any such extended BPF program will only be able to load data from skb
and interpret it.

Example 2:
when function set is {empty} and bpf_context=seccomp_data,
the extended BPF is equivalent to original seccomp BPF with simpler programs
and can immediately take advantage of extended BPF-JIT.
(original BPF-JIT doesn't work for seccomp)

when function set is {bpf_table_lookup} and bpf_context=seccomp_data,
the extended BPF can do much more interesting analysis of syscalls.

Example 3:
when function set is {bpf_load_byte/half/word} and bpf_context=skb+flow_keys
the extended BPF can be used to implement dynamically loadable flow_dissector
for RPS, bonding, etc with the performance close to native.
The BPF program can be customized for a specific user environment and network
traffic without polluting the kernel with all possible ways to dissect.

Example 4:
when function set is {bpf_load_xxx + bpf_table_lookup} and bpf_context=skb
the extended BPF can be used to implement network analytics in tcpdump.
Like counting all tcp flows through the dev or filtering for specific
set of IP addresses.

Example 5:
when function set is {load_xxx + store_xxx + table_lookup + forward} and
bpf_context=skb, the extended BPF can be used to implement packet parsing,
mac/ip/flow table lookups and actions, which is similar concept to OVS
(flow_parse+lookup+action), but programmability is done via BPF program
that connects table_lookup result to action instead of hard-coded OVS rules.

Probably there are many other use cases in iptables/nftable/TC. Please suggest.

Extended Instruction Set was designed with these goals:
- to be able to write programs in restricted C and compile into BPF with GCC
- to be able to JIT to modern 64-bit CPU with minimal performance overhead
  over two steps: C -> BPF -> native code
- to be able to guarantee termination and safety of BPF program in kernel
  with simple algorithm

As much as I like tcpdump, writing filters in tcpdump syntax is difficult.
The same filter written in C is easier to understand.
At the same time, having GCC-bpf is not a requirement. One can code BPF
the same way it was done for original BPF: with macros from filter.h

Minimal performance overhead is achieved by having one to one mapping
between BPF insns and native insns, and one to one mapping between BPF
registers and native registers on 64-bit CPUs

Extended BPF allows jump forward and backward for two reasons:
to reduce branch mispredict penalty GCC moves cold basic blocks out of
fall-through path and to reduce code duplication that would be unavoidable
if only jump forward was available.
To guarantee termination simple non-recursive depth-first-search verifies
that there are no back-edges (no loops in the program), program is a DAG
with root at the first insn, all branches end at the last RET insn and
all instructions are reachable.
(Original BPF actually allows unreachable insns, but that's a bug)
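
A minimal sketch of such a check, assuming a toy instruction format with at
most two successors per insn. The names toy_insn and check_cfg are
hypothetical and are not taken from bpf_check.c, and the requirement that
all branches end at the last RET insn is left out for brevity.

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

#define MAX_INSNS 64

/* Hypothetical toy instruction: successor indices, negative = none.
 * A RET insn has no successors. */
struct toy_insn {
	int next;	/* fall-through successor */
	int target;	/* jump target */
};

enum { WHITE, GREY, BLACK };	/* unvisited, on the DFS stack, finished */

/* Non-recursive DFS from insn 0. Returns true if there are no back-edges
 * (the program is a DAG) and every insn is reachable. */
bool check_cfg(const struct toy_insn *insns, int cnt)
{
	int color[MAX_INSNS], stack[MAX_INSNS], edge[MAX_INSNS];
	int sp = 0, i;

	assert(insns != NULL);
	if (cnt <= 0 || cnt > MAX_INSNS)
		return false;
	memset(color, 0, sizeof(color));	/* all WHITE */
	memset(edge, 0, sizeof(edge));		/* no successor tried yet */

	stack[sp++] = 0;
	color[0] = GREY;
	while (sp > 0) {
		int cur = stack[sp - 1];
		int succ = -1;

		/* pick the next untried successor of cur, if any */
		while (edge[cur] < 2 && succ < 0) {
			succ = edge[cur] == 0 ? insns[cur].next
					      : insns[cur].target;
			edge[cur]++;
		}
		if (succ < 0) {			/* all successors done */
			color[cur] = BLACK;
			sp--;
			continue;
		}
		if (succ >= cnt)
			return false;		/* jump out of range */
		if (color[succ] == GREY)
			return false;		/* back-edge: a loop */
		if (color[succ] == WHITE) {
			color[succ] = GREY;
			stack[sp++] = succ;
		}
	}
	for (i = 0; i < cnt; i++)
		if (color[i] != BLACK)
			return false;		/* unreachable insn */
	return true;
}
```

An edge to a GREY node points back into the current DFS path, which is
exactly a loop; an edge to a BLACK node is a forward or cross edge and is
harmless in a DAG.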

Original BPF has two registers (A and X) and hidden frame pointer.
Extended BPF has ten registers and read-only frame pointer.
Since 64-bit CPUs are passing arguments to the functions via registers
the number of args from BPF program to in-kernel function is restricted to 5
and one register is used to accept return value from in-kernel function.
x86_64 passes first 6 arguments in registers.
aarch64/sparcv9/mips64 have 7-8 registers for arguments.
x86_64 has 6 callee saved registers.
aarch64/sparcv9/mips64 have 11 or more callee saved registers.

Therefore extended BPF calling convention is defined as:
R0 - return value from in-kernel function
R1-R5 - arguments from BPF program to in-kernel function
R6-R9 - callee saved registers that in-kernel function will preserve
R10 - read-only frame pointer to access stack

so that all BPF registers map one to one to HW registers on x86_64,aarch64,etc
and BPF calling convention maps directly to ABIs used by kernel on 64-bit
architectures.
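
As an illustration of the convention above, the following sketch pairs each
BPF register with the x86_64 register named in the comments of the reg2hex[]
table in this patch. The enum and helper names are hypothetical and exist
only for this example.

```c
#include <assert.h>
#include <stdbool.h>

enum bpf_reg { R0, R1, R2, R3, R4, R5, R6, R7, R8, R9, R10, NR_REGS };

/* x86_64 register used for each BPF register, following the comments
 * in reg2hex[] (R10 is the read-only frame pointer) */
static const char *const x86_64_name[NR_REGS] = {
	[R0] = "rax",	/* return value */
	[R1] = "rdi",	/* arguments 1-5 */
	[R2] = "rsi",
	[R3] = "rdx",
	[R4] = "rcx",
	[R5] = "r8",
	[R6] = "rbx",	/* callee saved */
	[R7] = "r13",
	[R8] = "r14",
	[R9] = "r15",
	[R10] = "rbp",	/* frame pointer */
};

const char *bpf_reg_to_x86_64(enum bpf_reg reg)
{
	assert(reg < NR_REGS);
	return x86_64_name[reg];
}

/* R0-R5 are scratch: a call to an in-kernel function may clobber them */
bool bpf_reg_is_scratch(enum bpf_reg reg)
{
	return reg <= R5;
}

/* R6-R9 keep their value across a call */
bool bpf_reg_is_callee_saved(enum bpf_reg reg)
{
	return reg >= R6 && reg <= R9;
}
```

A program that needs a value from R1-R5 after a call would first move it to
one of R6-R9 or spill it to the stack via R10.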

R0-R5 are scratch registers and a BPF program needs to spill/fill them if
necessary across calls.
Note that there is only one BPF program == one BPF function and it cannot call
other BPF functions. It can only call predefined in-kernel functions.

All BPF registers are 64-bit without subregs, which makes JITed x86 code
less optimal, but matches sparc/mips architectures.
Adding 32-bit subregs was considered, since JIT can map them to x86 and aarch64
nicely, but read-modify-write overhead for sparc/mips is not worth the gains.

Original BPF and extended BPF are two operand instructions, which helps
to do one-to-one mapping between BPF insn and x86 insn during JIT.

Extended BPF has no pre-defined endianness, so as not to favor one
architecture over another. Therefore a bswap insn was introduced.
Original BPF doesn't have such insn and does bswap as part of sk_load_word call
which is often unnecessary if we want to compare the value with the constant.
Restricted C code might be written differently depending on endianness
and GCC-bpf will take an endianness flag.

32-bit architectures run 64-bit extended BPF programs via interpreter

Q: Why is extended BPF 64-bit? Can't we live with 32-bit?
A: On 64-bit architectures, pointers are 64-bit and we want to pass 64-bit
values in and out of kernel functions. 32-bit BPF registers would require
defining a register-pair ABI, there would be no direct BPF register to HW
register mapping, and the JIT would need to do combine/split/move operations
for every register in and out of the function, which is complex, bug prone
and slow.
Another reason is counters. To use a 64-bit counter, a BPF program would need
to do complex math, which is again bug prone and not atomic.

Q: How was it tested?
A: Extended BPF was tested on x86_64 and i386 with lockdep,kmemleak,rcu,sleep
debugging and various stress tests over the last year.

Q: What is the performance difference between optimized x86 and optimized BPF?
A: As a performance test, skb_flow_dissect() was re-written in BPF,
since it's one of the hottest functions in the networking stack. The
performance results are the following:
x86_64 skb_flow_dissect() same skb (all cached)          -  42 nsec per call
x86_64 skb_flow_dissect() different skbs (cache misses)  - 141 nsec per call
bpf_jit skb_flow_dissect() same skb (all cached)         -  51 nsec per call
bpf_jit skb_flow_dissect() different skbs (cache misses) - 135 nsec per call

C->BPF->x86_64 is obviously slower than C->x86_64 when all data is in cache,
but the presence of cache misses hides the extra insns and in this particular
case makes it faster, since the C for BPF, though it does the same packet
parsing, has static branch prediction markings for performance. Similar
optimizations cannot be done to the in-kernel skb_flow_dissect, since it has
no knowledge of traffic and has to be generic in all cases. A loadable BPF
program can be customized by the user.

Q: Original BPF is safe, deterministic and kernel can easily prove that.
   Does extended BPF keep these properties?
A: Yes. The safety of the program is determined in two steps.
First step does depth-first-search to disallow loops and other CFG validation.
Second step starts from the first insn and descends all possible paths.
It simulates execution of every insn and observes the state change of
registers and stack.
At the start of the program the register R1 contains a pointer to bpf_context
and has type PTR_TO_CTX. If checker sees an insn that does R2=R1, then R2 has
now type PTR_TO_CTX as well and can be used on right hand side of expression.
If R1=PTR_TO_CTX and insn is R2=R1+1, then R2=INVALID_PTR and it is readable.
If register was never written to, it's not readable.
After kernel function call, R1-R5 are reset to unreadable and R0 has a return
type of the function. Since R6-R9 are callee saved, their state is preserved
across the call.
load/store instructions are allowed only with registers of valid types, which
are PTR_TO_CTX, PTR_TO_TABLE, PTR_TO_STACK. They are bounds and alignment
checked.
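
The register tracking rules above can be sketched as a small state machine.
This is an illustration under stated assumptions, not the code in
bpf_check.c: the toy_* names are hypothetical, R10 is assumed to start as
PTR_TO_STACK, and any immediate add to a pointer is assumed to invalidate it.

```c
#include <assert.h>
#include <stdbool.h>

enum reg_type { NOT_INIT, INTEGER, INVALID_PTR,
		PTR_TO_CTX, PTR_TO_STACK, PTR_TO_TABLE };

#define NR_REGS 11	/* R0..R10 */

struct toy_state {
	enum reg_type regs[NR_REGS];
};

/* Program entry: only R1 (context) and R10 (frame pointer) are readable */
void toy_init(struct toy_state *st)
{
	int i;

	for (i = 0; i < NR_REGS; i++)
		st->regs[i] = NOT_INIT;
	st->regs[1] = PTR_TO_CTX;
	st->regs[10] = PTR_TO_STACK;	/* assumption for this sketch */
}

/* dst = src: fails if src was never written; the type is propagated */
bool toy_mov(struct toy_state *st, int dst, int src)
{
	assert(dst >= 0 && dst < 10 && src >= 0 && src < NR_REGS);
	if (st->regs[src] == NOT_INIT)
		return false;
	st->regs[dst] = st->regs[src];
	return true;
}

/* dst += imm: a pointer becomes INVALID_PTR (readable, not dereferencable) */
bool toy_add_imm(struct toy_state *st, int dst)
{
	assert(dst >= 0 && dst < 10);
	if (st->regs[dst] == NOT_INIT)
		return false;
	if (st->regs[dst] != INTEGER)
		st->regs[dst] = INVALID_PTR;
	return true;
}

/* Kernel function call: R1-R5 become unreadable, R0 gets the return type,
 * R6-R9 are left untouched because they are callee saved */
void toy_call(struct toy_state *st, enum reg_type ret_type)
{
	int i;

	for (i = 1; i <= 5; i++)
		st->regs[i] = NOT_INIT;
	st->regs[0] = ret_type;
}

/* load/store is allowed only through these pointer types */
bool toy_can_deref(enum reg_type t)
{
	return t == PTR_TO_CTX || t == PTR_TO_TABLE || t == PTR_TO_STACK;
}
```

Copying R1 into R6 before a call keeps the context pointer usable
afterwards, which is how a program would preserve it across calls.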

bpf_context structure is generic. Its contents are defined by specific use case.
For seccomp it can be seccomp_data and through get_context_access callback
BPF checker is customized, so that BPF program can only access certain fields
of bpf_context with specified size and alignment.
For example, the following insn:
  BPF_INSN_LD(BPF_W, R0, R6, 8)
intends to load word from address R6 + 8 and store it into R0
If R6=PTR_TO_CTX, then get_context_access callback should let the checker know
that offset 8 of size 4 bytes can be accessed for reading, otherwise the checker
will reject the program.
If R6=PTR_TO_STACK, then access should be aligned and be within stack bounds,
which are hard coded to [-480, 0]. In this example offset is 8, so it will fail
verification.
The checker will allow BPF program to read data from stack only after it wrote
into it.
Pointer register spill/fill is tracked as well, since four (R6-R9) callee saved
registers may not be enough for some programs.
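
The stack rules above (aligned access, bounds of [-480, 0], read only after
write) can be sketched as follows. The names are hypothetical and the exact
alignment rule of the real checker is an assumption here: an access of
size N is taken to require an offset that is a multiple of N.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define TOY_STACK_SIZE 480	/* valid offsets from the frame pointer */

struct toy_stack {
	bool written[TOY_STACK_SIZE];	/* one flag per stack byte */
};

/* Is [off, off + size) inside [-480, 0) and naturally aligned? */
bool stack_access_ok(int off, int size)
{
	if (size != 1 && size != 2 && size != 4 && size != 8)
		return false;
	if (off % size != 0)
		return false;
	if (off < -TOY_STACK_SIZE || off + size > 0)
		return false;
	return true;
}

/* Record a store; returns false if the access itself is invalid */
bool stack_write(struct toy_stack *st, int off, int size)
{
	int i;

	assert(st != NULL);
	if (!stack_access_ok(off, size))
		return false;
	for (i = 0; i < size; i++)
		st->written[off + TOY_STACK_SIZE + i] = true;
	return true;
}

/* A load is valid only if every byte was written before */
bool stack_read_ok(const struct toy_stack *st, int off, int size)
{
	int i;

	assert(st != NULL);
	if (!stack_access_ok(off, size))
		return false;
	for (i = 0; i < size; i++)
		if (!st->written[off + TOY_STACK_SIZE + i])
			return false;
	return true;
}
```

With these rules, the example insn with offset 8 is rejected because the
access would lie above the frame pointer.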

Allowed function calls are customized via get_func_proto callback.
For example:
  u64 bpf_load_byte(struct bpf_context *ctx, u32 offset);
function will have the following definition:
  [FUNC_bpf_load_byte] = {RET_INTEGER, PTR_TO_CTX}
and BPF checker will verify that bpf_load_byte is always called with first
argument being a valid pointer to bpf_context. After the call BPF register R0
will be set to readable state, so that BPF program can access it.

One of the useful functions that can be made available to BPF program
is bpf_table_lookup.
Consider a tcpdump filter that needs to filter packets from 10 IP addresses.
One can write 10 'if' statements in a program, but it's much more efficient
to do one table lookup.
Therefore extended BPF program consists of instructions and tables.
From a BPF program the table is identified by a constant table_id
and access to a table in C looks like:
elem = bpf_table_lookup(ctx, table_id, key);

BPF checker matches 'table_id' against known tables, verifies that 'key' points
to stack and table->key_size bytes are initialized.
From there on bpf_table_lookup() is a normal kernel function. It needs to do
a lookup by whatever means and return either valid pointer to the element
or NULL. BPF checker will verify that the program accesses the pointer only
after comparing it to NULL. That's the meaning of PTR_TO_TABLE_CONDITIONAL and
PTR_TO_TABLE register types in bpf_check.c
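
A sketch of what such a filter could look like in restricted C. Since
bpf_table_lookup() is an in-kernel function, the example uses a userspace
stand-in, toy_table_lookup(), with a hypothetical signature and a
hypothetical three-entry table; only the call shape (ctx, table_id, key) and
the NULL check before dereference follow the description above.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

struct bpf_context;	/* opaque in this sketch */

/* Hypothetical table element: an IPv4 address and a hit counter */
struct toy_elem {
	uint32_t ip;
	uint64_t hits;
};

#define TOY_TABLE_ID 1

static struct toy_elem toy_table[] = {
	{ 0x0a000001, 0 },	/* 10.0.0.1 */
	{ 0x0a000002, 0 },	/* 10.0.0.2 */
	{ 0xc0a80101, 0 },	/* 192.168.1.1 */
};

/* Userspace stand-in for bpf_table_lookup(ctx, table_id, key):
 * returns a pointer to the matching element, or NULL */
struct toy_elem *toy_table_lookup(struct bpf_context *ctx, int table_id,
				  const uint32_t *key)
{
	size_t i;

	(void)ctx;
	assert(key != NULL);
	if (table_id != TOY_TABLE_ID)
		return NULL;
	for (i = 0; i < sizeof(toy_table) / sizeof(toy_table[0]); i++)
		if (toy_table[i].ip == *key)
			return &toy_table[i];
	return NULL;
}

/* Filter: return 1 for packets whose source address is in the table */
int toy_filter(struct bpf_context *ctx, uint32_t src_ip)
{
	uint32_t key = src_ip;	/* key lives on the stack, fully initialized */
	struct toy_elem *elem;

	elem = toy_table_lookup(ctx, TOY_TABLE_ID, &key);
	if (elem == NULL)	/* the pointer is unusable until compared to NULL */
		return 0;
	elem->hits++;		/* safe: elem is known to be valid here */
	return 1;
}
```

Dropping the NULL comparison is the kind of program the checker is described
as rejecting, since the lookup result would still be in its conditional
state when dereferenced.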

In example 3 the flow_dissector doesn't need a table_lookup, so it doesn't
implement it and doesn't provide it, so any BPF program that is written
to be loaded as flow_dissector will not be able to do lookups.

If a kernel subsystem wants to use this BPF machinery and decides to implement
bpf_table_lookup, the checker will guarantee that argument 'ctx' is a valid
pointer to bpf_context, 'table_id' is valid table_id and table->key_size bytes
can be read from the pointer 'key'. It's up to implementation to decide how it
wants to do the lookup and what is the key.

Going back to the example BPF insn:
  BPF_INSN_LD(BPF_W, R0, R6, 8)
if R6=PTR_TO_TABLE, then offset and size of access must be within
[0, table->elem_size] which is determined by constant table_id that was passed
into bpf_table_lookup call prior to this insn.

Just like original, extended BPF is limited to 4096 insns, which means that any
program will terminate quickly and will call fixed number of kernel functions.
Earlier implementation of the checker had a precise calculation of worst case
number of insns, but it was removed to simplify the code, since the worst number
is always less than the number of insns in a program anyway (because it's a DAG).

For the use case #5 the programs so far were in 200-1000 insn range, but if
someone has a use case for larger programs, it's trivial to increase the limit,
since the checker was tested with artificially large programs all the way
to 64k insns.

Since register/stack state tracking simulates execution of all insns in all
possible branches, it will explode if not bounded. There are two bounds.
verifier_state stack is limited to 1k, therefore BPF program cannot have
more than 1k jump insns.
Total number of insns to be analyzed is limited to 32k, which means that
checker will either prove correctness or reject the program in a few
milliseconds on an average x86 cpu. Valid programs take microseconds to verify.

GCC backend for BPF is available at https://github.com/iovisor/bpf_gcc
It needs simulator and proper testsuite before sending to GCC mailing list.

Summary:
extended BPF is a set of pseudo instructions that stitch kernel provided
data in the form of bpf_context with kernel provided set of functions in a safe
and deterministic way with minimal performance overhead vs native code.

This patch provides core BPF framework. Subsequent patches build upon it by
customizing BPF checker for specific use case.

Tested on x86_64 and i386

Some of the use cases above were suggested by Eric Dumazet

Signed-off-by: Alexei Starovoitov <ast@plumgrid.com>
---
 arch/x86/net/Makefile        |    2 +-
 arch/x86/net/bpf2_jit_comp.c |  617 +++++++++++++++++++++++++
 arch/x86/net/bpf_jit_comp.c  |   41 +-
 arch/x86/net/bpf_jit_comp.h  |   36 ++
 include/linux/filter.h       |   83 +++-
 include/uapi/linux/filter.h  |  130 +++++-
 net/core/Makefile            |    2 +-
 net/core/bpf_check.c         | 1049 ++++++++++++++++++++++++++++++++++++++++++
 net/core/bpf_run.c           |  422 +++++++++++++++++
 9 files changed, 2345 insertions(+), 37 deletions(-)
 create mode 100644 arch/x86/net/bpf2_jit_comp.c
 create mode 100644 arch/x86/net/bpf_jit_comp.h
 create mode 100644 net/core/bpf_check.c
 create mode 100644 net/core/bpf_run.c

diff --git a/arch/x86/net/Makefile b/arch/x86/net/Makefile
index 90568c3..54f57c9 100644
--- a/arch/x86/net/Makefile
+++ b/arch/x86/net/Makefile
@@ -1,4 +1,4 @@
 #
 # Arch-specific network modules
 #
-obj-$(CONFIG_BPF_JIT) += bpf_jit.o bpf_jit_comp.o
+obj-$(CONFIG_BPF_JIT) += bpf_jit.o bpf_jit_comp.o bpf2_jit_comp.o
diff --git a/arch/x86/net/bpf2_jit_comp.c b/arch/x86/net/bpf2_jit_comp.c
new file mode 100644
index 0000000..c7e08e2
--- /dev/null
+++ b/arch/x86/net/bpf2_jit_comp.c
@@ -0,0 +1,617 @@
+/*
+ * Copyright (c) 2011-2013 PLUMgrid, http://plumgrid.com
+ *
+ * This program is free software; you can redistribute it and/or
+ * modify it under the terms of version 2 of the GNU General Public
+ * License as published by the Free Software Foundation.
+ *
+ * This program is distributed in the hope that it will be useful, but
+ * WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU
+ * General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public License
+ * along with this program; if not, write to the Free Software
+ * Foundation, Inc., 51 Franklin Street, Fifth Floor, Boston, MA
+ * 02110-1301, USA
+ */
+#include <linux/kernel.h>
+#include <linux/types.h>
+#include <linux/slab.h>
+#include <linux/filter.h>
+#include <linux/moduleloader.h>
+#include "bpf_jit_comp.h"
+
+static inline u8 *emit_code(u8 *ptr, u32 bytes, unsigned int len)
+{
+	if (len == 1)
+		*ptr = bytes;
+	else if (len == 2)
+		*(u16 *)ptr = bytes;
+	else
+		*(u32 *)ptr = bytes;
+	return ptr + len;
+}
+
+#define EMIT(bytes, len) (prog = emit_code(prog, (bytes), (len)))
+
+#define EMIT1(b1)		EMIT(b1, 1)
+#define EMIT2(b1, b2)		EMIT((b1) + ((b2) << 8), 2)
+#define EMIT3(b1, b2, b3)	EMIT((b1) + ((b2) << 8) + ((b3) << 16), 3)
+#define EMIT4(b1, b2, b3, b4)	EMIT((b1) + ((b2) << 8) + ((b3) << 16) + \
+				     ((b4) << 24), 4)
+/* imm32 is sign extended by cpu */
+#define EMIT1_off32(b1, off) \
+	do {EMIT1(b1); EMIT(off, 4); } while (0)
+#define EMIT2_off32(b1, b2, off) \
+	do {EMIT2(b1, b2); EMIT(off, 4); } while (0)
+#define EMIT3_off32(b1, b2, b3, off) \
+	do {EMIT3(b1, b2, b3); EMIT(off, 4); } while (0)
+#define EMIT4_off32(b1, b2, b3, b4, off) \
+	do {EMIT4(b1, b2, b3, b4); EMIT(off, 4); } while (0)
+
+/* mov A, X */
+#define EMIT_mov(A, X) \
+	EMIT3(add_2mod(0x48, A, X), 0x89, add_2reg(0xC0, A, X))
+
+#define X86_JAE 0x73
+#define X86_JE  0x74
+#define X86_JNE 0x75
+#define X86_JA  0x77
+#define X86_JGE 0x7D
+#define X86_JG  0x7F
+
+static inline bool is_imm8(__s32 value)
+{
+	return value <= 127 && value >= -128;
+}
+
+static inline bool is_simm32(__s64 value)
+{
+	return value == (__s64)(__s32)value;
+}
+
+static int bpf_size_to_x86_bytes(int bpf_size)
+{
+	if (bpf_size == BPF_W)
+		return 4;
+	else if (bpf_size == BPF_H)
+		return 2;
+	else if (bpf_size == BPF_B)
+		return 1;
+	else if (bpf_size == BPF_DW)
+		return 4; /* imm32 */
+	else
+		return 0;
+}
+
+#define AUX_REG 32
+
+/* avoid x86-64 R12 which if used as base address in memory access
+ * always needs an extra byte for index */
+static const int reg2hex[] = {
+	[R0] = 0, /* rax */
+	[R1] = 7, /* rdi */
+	[R2] = 6, /* rsi */
+	[R3] = 2, /* rdx */
+	[R4] = 1, /* rcx */
+	[R5] = 0, /* r8 */
+	[R6] = 3, /* rbx callee saved */
+	[R7] = 5, /* r13 callee saved */
+	[R8] = 6, /* r14 callee saved */
+	[R9] = 7, /* r15 callee saved */
+	[__fp__] = 5, /* rbp readonly */
+	[AUX_REG] = 1, /* r9 temp register */
+};
+
+/* is_ereg() == true if r8 <= reg <= r15,
+ * rax,rcx,...,rbp don't need extra byte of encoding */
+static inline bool is_ereg(u32 reg)
+{
+	if (reg == R5 || (reg >= R7 && reg <= R9) || reg == AUX_REG)
+		return true;
+	else
+		return false;
+}
+
+static inline u8 add_1mod(u8 byte, u32 reg)
+{
+	if (is_ereg(reg))
+		byte |= 1;
+	return byte;
+}
+static inline u8 add_2mod(u8 byte, u32 r1, u32 r2)
+{
+	if (is_ereg(r1))
+		byte |= 1;
+	if (is_ereg(r2))
+		byte |= 4;
+	return byte;
+}
+
+static inline u8 add_1reg(u8 byte, u32 a_reg)
+{
+	return byte + reg2hex[a_reg];
+}
+static inline u8 add_2reg(u8 byte, u32 a_reg, u32 x_reg)
+{
+	return byte + reg2hex[a_reg] + (reg2hex[x_reg] << 3);
+}
+
+static u8 *select_bpf_func(struct bpf_program *prog, int id)
+{
+	if (id < 0 || id >= FUNC_bpf_max_id)
+		return NULL;
+	return prog->cb->jit_select_func(id);
+}
+
+static int do_jit(struct bpf_program *bpf_prog, int *addrs, u8 *image,
+		  int oldproglen)
+{
+	struct bpf_insn *insn = bpf_prog->insns;
+	int insn_cnt = bpf_prog->insn_cnt;
+	u8 temp[64];
+	int i;
+	int proglen = 0;
+	u8 *prog = temp;
+	int stacksize = 512;
+
+	EMIT1(0x55); /* push rbp */
+	EMIT3(0x48, 0x89, 0xE5); /* mov rbp,rsp */
+
+	/* sub rsp, stacksize */
+	EMIT3_off32(0x48, 0x81, 0xEC, stacksize);
+	/* mov qword ptr [rbp-X],rbx */
+	EMIT3_off32(0x48, 0x89, 0x9D, -stacksize);
+	/* mov qword ptr [rbp-X],r13 */
+	EMIT3_off32(0x4C, 0x89, 0xAD, -stacksize + 8);
+	/* mov qword ptr [rbp-X],r14 */
+	EMIT3_off32(0x4C, 0x89, 0xB5, -stacksize + 16);
+	/* mov qword ptr [rbp-X],r15 */
+	EMIT3_off32(0x4C, 0x89, 0xBD, -stacksize + 24);
+
+	for (i = 0; i < insn_cnt; i++, insn++) {
+		const __s32 K = insn->imm;
+		__u32 a_reg = insn->a_reg;
+		__u32 x_reg = insn->x_reg;
+		u8 b1 = 0, b2 = 0, b3 = 0;
+		u8 jmp_cond;
+		__s64 jmp_offset;
+		int ilen;
+		u8 *func;
+
+		switch (insn->code) {
+			/* ALU */
+		case BPF_ALU | BPF_ADD | BPF_X:
+		case BPF_ALU | BPF_SUB | BPF_X:
+		case BPF_ALU | BPF_AND | BPF_X:
+		case BPF_ALU | BPF_OR | BPF_X:
+		case BPF_ALU | BPF_XOR | BPF_X:
+			b1 = 0x48;
+			b3 = 0xC0;
+			switch (BPF_OP(insn->code)) {
+			case BPF_ADD: b2 = 0x01; break;
+			case BPF_SUB: b2 = 0x29; break;
+			case BPF_AND: b2 = 0x21; break;
+			case BPF_OR: b2 = 0x09; break;
+			case BPF_XOR: b2 = 0x31; break;
+			}
+			EMIT3(add_2mod(b1, a_reg, x_reg), b2,
+			      add_2reg(b3, a_reg, x_reg));
+			break;
+
+			/* mov A, X */
+		case BPF_ALU | BPF_MOV | BPF_X:
+			EMIT_mov(a_reg, x_reg);
+			break;
+
+			/* neg A */
+		case BPF_ALU | BPF_NEG | BPF_X:
+			EMIT3(add_1mod(0x48, a_reg), 0xF7,
+			      add_1reg(0xD8, a_reg));
+			break;
+
+		case BPF_ALU | BPF_ADD | BPF_K:
+		case BPF_ALU | BPF_SUB | BPF_K:
+		case BPF_ALU | BPF_AND | BPF_K:
+		case BPF_ALU | BPF_OR | BPF_K:
+			b1 = add_1mod(0x48, a_reg);
+
+			switch (BPF_OP(insn->code)) {
+			case BPF_ADD: b3 = 0xC0; break;
+			case BPF_SUB: b3 = 0xE8; break;
+			case BPF_AND: b3 = 0xE0; break;
+			case BPF_OR: b3 = 0xC8; break;
+			}
+
+			if (is_imm8(K))
+				EMIT4(b1, 0x83, add_1reg(b3, a_reg), K);
+			else
+				EMIT3_off32(b1, 0x81, add_1reg(b3, a_reg), K);
+			break;
+
+		case BPF_ALU | BPF_MOV | BPF_K:
+			/* 'mov rax, imm32' sign extends imm32.
+			 * possible optimization: if imm32 is positive,
+			 * use 'mov eax, imm32' (which zero-extends imm32)
+			 * to save 2 bytes */
+			b1 = add_1mod(0x48, a_reg);
+			b2 = 0xC7;
+			b3 = 0xC0;
+			EMIT3_off32(b1, b2, add_1reg(b3, a_reg), K);
+			break;
+
+			/* A %= X
+			 * A /= X */
+		case BPF_ALU | BPF_MOD | BPF_X:
+		case BPF_ALU | BPF_DIV | BPF_X:
+			EMIT1(0x50); /* push rax */
+			EMIT1(0x52); /* push rdx */
+
+			/* mov r9, X */
+			EMIT_mov(AUX_REG, x_reg);
+
+			/* mov rax, A */
+			EMIT_mov(R0, a_reg);
+
+			/* xor rdx, rdx */
+			EMIT3(0x48, 0x31, 0xd2);
+
+			/* if X==0, skip divide, make A=0 */
+
+			/* cmp r9, 0 */
+			EMIT4(0x49, 0x83, 0xF9, 0x00);
+
+			/* je .+3 */
+			EMIT2(X86_JE, 3);
+
+			/* div r9 */
+			EMIT3(0x49, 0xF7, 0xF1);
+
+			if (BPF_OP(insn->code) == BPF_MOD) {
+				/* mov r9, rdx */
+				EMIT3(0x49, 0x89, 0xD1);
+			} else {
+				/* mov r9, rax */
+				EMIT3(0x49, 0x89, 0xC1);
+			}
+
+			EMIT1(0x5A); /* pop rdx */
+			EMIT1(0x58); /* pop rax */
+
+			/* mov A, r9 */
+			EMIT_mov(a_reg, AUX_REG);
+			break;
+
+			/* shifts */
+		case BPF_ALU | BPF_LSH | BPF_K:
+		case BPF_ALU | BPF_RSH | BPF_K:
+		case BPF_ALU | BPF_ARSH | BPF_K:
+			b1 = add_1mod(0x48, a_reg);
+			switch (BPF_OP(insn->code)) {
+			case BPF_LSH: b3 = 0xE0; break;
+			case BPF_RSH: b3 = 0xE8; break;
+			case BPF_ARSH: b3 = 0xF8; break;
+			}
+			EMIT4(b1, 0xC1, add_1reg(b3, a_reg), K);
+			break;
+
+		case BPF_ALU | BPF_BSWAP32 | BPF_X:
+			/* emit 'bswap eax' to swap lower 4-bytes */
+			if (is_ereg(a_reg))
+				EMIT2(0x41, 0x0F);
+			else
+				EMIT1(0x0F);
+			EMIT1(add_1reg(0xC8, a_reg));
+			break;
+
+		case BPF_ALU | BPF_BSWAP64 | BPF_X:
+			/* emit 'bswap rax' to swap 8-bytes */
+			EMIT3(add_1mod(0x48, a_reg), 0x0F,
+			      add_1reg(0xC8, a_reg));
+			break;
+
+			/* ST: *(u8*)(a_reg + off) = imm */
+		case BPF_ST | BPF_REL | BPF_B:
+			if (is_ereg(a_reg))
+				EMIT2(0x41, 0xC6);
+			else
+				EMIT1(0xC6);
+			goto st;
+		case BPF_ST | BPF_REL | BPF_H:
+			if (is_ereg(a_reg))
+				EMIT3(0x66, 0x41, 0xC7);
+			else
+				EMIT2(0x66, 0xC7);
+			goto st;
+		case BPF_ST | BPF_REL | BPF_W:
+			if (is_ereg(a_reg))
+				EMIT2(0x41, 0xC7);
+			else
+				EMIT1(0xC7);
+			goto st;
+		case BPF_ST | BPF_REL | BPF_DW:
+			EMIT2(add_1mod(0x48, a_reg), 0xC7);
+
+st:			if (is_imm8(insn->off))
+				EMIT2(add_1reg(0x40, a_reg), insn->off);
+			else
+				EMIT1_off32(add_1reg(0x80, a_reg), insn->off);
+
+			EMIT(K, bpf_size_to_x86_bytes(BPF_SIZE(insn->code)));
+			break;
+
+			/* STX: *(u8*)(a_reg + off) = x_reg */
+		case BPF_STX | BPF_REL | BPF_B:
+			/* emit 'mov byte ptr [rax + off], al' */
+			if (is_ereg(a_reg) || is_ereg(x_reg) ||
+			    /* have to add extra byte for x86 SIL, DIL regs */
+			    x_reg == R1 || x_reg == R2)
+				EMIT2(add_2mod(0x40, a_reg, x_reg), 0x88);
+			else
+				EMIT1(0x88);
+			goto stx;
+		case BPF_STX | BPF_REL | BPF_H:
+			if (is_ereg(a_reg) || is_ereg(x_reg))
+				EMIT3(0x66, add_2mod(0x40, a_reg, x_reg), 0x89);
+			else
+				EMIT2(0x66, 0x89);
+			goto stx;
+		case BPF_STX | BPF_REL | BPF_W:
+			if (is_ereg(a_reg) || is_ereg(x_reg))
+				EMIT2(add_2mod(0x40, a_reg, x_reg), 0x89);
+			else
+				EMIT1(0x89);
+			goto stx;
+		case BPF_STX | BPF_REL | BPF_DW:
+			EMIT2(add_2mod(0x48, a_reg, x_reg), 0x89);
+stx:			if (is_imm8(insn->off))
+				EMIT2(add_2reg(0x40, a_reg, x_reg), insn->off);
+			else
+				EMIT1_off32(add_2reg(0x80, a_reg, x_reg),
+					    insn->off);
+			break;
+
+			/* LDX: a_reg = *(u8*)(x_reg + off) */
+		case BPF_LDX | BPF_REL | BPF_B:
+			/* emit 'movzx rax, byte ptr [rax + off]' */
+			EMIT3(add_2mod(0x48, x_reg, a_reg), 0x0F, 0xB6);
+			goto ldx;
+		case BPF_LDX | BPF_REL | BPF_H:
+			/* emit 'movzx rax, word ptr [rax + off]' */
+			EMIT3(add_2mod(0x48, x_reg, a_reg), 0x0F, 0xB7);
+			goto ldx;
+		case BPF_LDX | BPF_REL | BPF_W:
+			/* emit 'mov eax, dword ptr [rax+0x14]' */
+			if (is_ereg(a_reg) || is_ereg(x_reg))
+				EMIT2(add_2mod(0x40, x_reg, a_reg), 0x8B);
+			else
+				EMIT1(0x8B);
+			goto ldx;
+		case BPF_LDX | BPF_REL | BPF_DW:
+			/* emit 'mov rax, qword ptr [rax+0x14]' */
+			EMIT2(add_2mod(0x48, x_reg, a_reg), 0x8B);
+ldx:			/* if insn->off == 0 we can save one extra byte, but
+			 * special case of x86 R13 which always needs an offset
+			 * is not worth the pain */
+			if (is_imm8(insn->off))
+				EMIT2(add_2reg(0x40, x_reg, a_reg), insn->off);
+			else
+				EMIT1_off32(add_2reg(0x80, x_reg, a_reg),
+					    insn->off);
+			break;
+
+			/* STX XADD: lock *(u8*)(a_reg + off) += x_reg */
+		case BPF_STX | BPF_XADD | BPF_B:
+			/* emit 'lock add byte ptr [rax + off], al' */
+			if (is_ereg(a_reg) || is_ereg(x_reg) ||
+			    /* have to add extra byte for x86 SIL, DIL regs */
+			    x_reg == R1 || x_reg == R2)
+				EMIT3(0xF0, add_2mod(0x40, a_reg, x_reg), 0x00);
+			else
+				EMIT2(0xF0, 0x00);
+			goto xadd;
+		case BPF_STX | BPF_XADD | BPF_H:
+			if (is_ereg(a_reg) || is_ereg(x_reg))
+				EMIT4(0x66, 0xF0, add_2mod(0x40, a_reg, x_reg),
+				      0x01);
+			else
+				EMIT3(0x66, 0xF0, 0x01);
+			goto xadd;
+		case BPF_STX | BPF_XADD | BPF_W:
+			if (is_ereg(a_reg) || is_ereg(x_reg))
+				EMIT3(0xF0, add_2mod(0x40, a_reg, x_reg), 0x01);
+			else
+				EMIT2(0xF0, 0x01);
+			goto xadd;
+		case BPF_STX | BPF_XADD | BPF_DW:
+			EMIT3(0xF0, add_2mod(0x48, a_reg, x_reg), 0x01);
+xadd:			if (is_imm8(insn->off))
+				EMIT2(add_2reg(0x40, a_reg, x_reg), insn->off);
+			else
+				EMIT1_off32(add_2reg(0x80, a_reg, x_reg),
+					    insn->off);
+			break;
+
+			/* call */
+		case BPF_JMP | BPF_CALL:
+			func = select_bpf_func(bpf_prog, K);
+			jmp_offset = func - (image + addrs[i]);
+			if (!func || !is_simm32(jmp_offset)) {
+				pr_err("unsupported bpf func %d addr %p image %p\n",
+				       K, func, image);
+				return -EINVAL;
+			}
+			EMIT1_off32(0xE8, jmp_offset);
+			break;
+
+			/* cond jump */
+		case BPF_JMP | BPF_JEQ | BPF_X:
+		case BPF_JMP | BPF_JNE | BPF_X:
+		case BPF_JMP | BPF_JGT | BPF_X:
+		case BPF_JMP | BPF_JGE | BPF_X:
+		case BPF_JMP | BPF_JSGT | BPF_X:
+		case BPF_JMP | BPF_JSGE | BPF_X:
+			/* emit 'cmp a_reg, x_reg' insn */
+			b1 = 0x48;
+			b2 = 0x39;
+			b3 = 0xC0;
+			EMIT3(add_2mod(b1, a_reg, x_reg), b2,
+			      add_2reg(b3, a_reg, x_reg));
+			goto emit_jump;
+		case BPF_JMP | BPF_JEQ | BPF_K:
+		case BPF_JMP | BPF_JNE | BPF_K:
+		case BPF_JMP | BPF_JGT | BPF_K:
+		case BPF_JMP | BPF_JGE | BPF_K:
+		case BPF_JMP | BPF_JSGT | BPF_K:
+		case BPF_JMP | BPF_JSGE | BPF_K:
+			/* emit 'cmp a_reg, imm8/32' */
+			EMIT1(add_1mod(0x48, a_reg));
+
+			if (is_imm8(K))
+				EMIT3(0x83, add_1reg(0xF8, a_reg), K);
+			else
+				EMIT2_off32(0x81, add_1reg(0xF8, a_reg), K);
+
+emit_jump:		/* convert BPF opcode to x86 */
+			switch (BPF_OP(insn->code)) {
+			case BPF_JEQ:
+				jmp_cond = X86_JE;
+				break;
+			case BPF_JNE:
+				jmp_cond = X86_JNE;
+				break;
+			case BPF_JGT:
+				/* GT is unsigned '>', JA in x86 */
+				jmp_cond = X86_JA;
+				break;
+			case BPF_JGE:
+				/* GE is unsigned '>=', JAE in x86 */
+				jmp_cond = X86_JAE;
+				break;
+			case BPF_JSGT:
+				/* signed '>', GT in x86 */
+				jmp_cond = X86_JG;
+				break;
+			case BPF_JSGE:
+				/* signed '>=', GE in x86 */
+				jmp_cond = X86_JGE;
+				break;
+			default: /* to silence gcc warning */
+				return -EFAULT;
+			}
+			jmp_offset = addrs[i + insn->off] - addrs[i];
+			if (is_imm8(jmp_offset)) {
+				EMIT2(jmp_cond, jmp_offset);
+			} else if (is_simm32(jmp_offset)) {
+				EMIT2_off32(0x0F, jmp_cond + 0x10, jmp_offset);
+			} else {
+				pr_err("cond_jmp gen bug %llx\n", jmp_offset);
+				return -EFAULT;
+			}
+
+			break;
+
+		case BPF_JMP | BPF_JA | BPF_X:
+			jmp_offset = addrs[i + insn->off] - addrs[i];
+			if (is_imm8(jmp_offset)) {
+				EMIT2(0xEB, jmp_offset);
+			} else if (is_simm32(jmp_offset)) {
+				EMIT1_off32(0xE9, jmp_offset);
+			} else {
+				pr_err("jmp gen bug %llx\n", jmp_offset);
+				return -EFAULT;
+			}
+
+			break;
+
+		case BPF_RET | BPF_K:
+			/* mov rbx, qword ptr [rbp-X] */
+			EMIT3_off32(0x48, 0x8B, 0x9D, -stacksize);
+			/* mov r13, qword ptr [rbp-X] */
+			EMIT3_off32(0x4C, 0x8B, 0xAD, -stacksize + 8);
+			/* mov r14, qword ptr [rbp-X] */
+			EMIT3_off32(0x4C, 0x8B, 0xB5, -stacksize + 16);
+			/* mov r15, qword ptr [rbp-X] */
+			EMIT3_off32(0x4C, 0x8B, 0xBD, -stacksize + 24);
+
+			EMIT1(0xC9); /* leave */
+			EMIT1(0xC3); /* ret */
+			break;
+
+		default:
+			/*pr_debug_bpf_insn(insn, NULL);*/
+			pr_err("bpf_jit: unknown opcode %02x\n", insn->code);
+			return -EINVAL;
+		}
+
+		ilen = prog - temp;
+		if (image) {
+			if (proglen + ilen > oldproglen)
+				return -2;
+			memcpy(image + proglen, temp, ilen);
+		}
+		proglen += ilen;
+		addrs[i] = proglen;
+		prog = temp;
+	}
+	return proglen;
+}
+
+void bpf2_jit_compile(struct bpf_program *prog)
+{
+	struct bpf_binary_header *header = NULL;
+	int proglen, oldproglen = 0;
+	int *addrs;
+	u8 *image = NULL;
+	int pass;
+	int i;
+
+	if (!prog || !prog->cb || !prog->cb->jit_select_func)
+		return;
+
+	addrs = kmalloc(prog->insn_cnt * sizeof(*addrs), GFP_KERNEL);
+	if (!addrs)
+		return;
+
+	for (proglen = 0, i = 0; i < prog->insn_cnt; i++) {
+		proglen += 64;
+		addrs[i] = proglen;
+	}
+	for (pass = 0; pass < 10; pass++) {
+		proglen = do_jit(prog, addrs, image, oldproglen);
+		if (proglen <= 0) {
+			image = NULL;
+			goto out;
+		}
+		if (image) {
+			if (proglen != oldproglen)
+				pr_err("bpf_jit: proglen=%d != oldproglen=%d\n",
+				       proglen, oldproglen);
+			break;
+		}
+		if (proglen == oldproglen) {
+			header = bpf_alloc_binary(proglen, &image);
+			if (!header)
+				goto out;
+		}
+		oldproglen = proglen;
+	}
+
+	if (image) {
+		bpf_flush_icache(header, image + proglen);
+		set_memory_ro((unsigned long)header, header->pages);
+	}
+out:
+	kfree(addrs);
+	prog->jit_image = (void (*)(struct bpf_context *ctx))image;
+	return;
+}
+EXPORT_SYMBOL(bpf2_jit_compile);
+
+void bpf2_jit_free(struct bpf_program *prog)
+{
+	if (prog->jit_image)
+		bpf_free_binary(prog->jit_image);
+}
+EXPORT_SYMBOL(bpf2_jit_free);
diff --git a/arch/x86/net/bpf_jit_comp.c b/arch/x86/net/bpf_jit_comp.c
index 79c216a..37ebea8 100644
--- a/arch/x86/net/bpf_jit_comp.c
+++ b/arch/x86/net/bpf_jit_comp.c
@@ -13,6 +13,7 @@
 #include <linux/filter.h>
 #include <linux/if_vlan.h>
 #include <linux/random.h>
+#include "bpf_jit_comp.h"
 
 /*
  * Conventions :
@@ -112,16 +113,6 @@ do {								\
 #define SEEN_XREG    2 /* ebx is used */
 #define SEEN_MEM     4 /* use mem[] for temporary storage */
 
-static inline void bpf_flush_icache(void *start, void *end)
-{
-	mm_segment_t old_fs = get_fs();
-
-	set_fs(KERNEL_DS);
-	smp_wmb();
-	flush_icache_range((unsigned long)start, (unsigned long)end);
-	set_fs(old_fs);
-}
-
 #define CHOOSE_LOAD_FUNC(K, func) \
 	((int)K < 0 ? ((int)K >= SKF_LL_OFF ? func##_negative_offset : func) : func##_positive_offset)
 
@@ -145,16 +136,8 @@ static int pkt_type_offset(void)
 	return -1;
 }
 
-struct bpf_binary_header {
-	unsigned int	pages;
-	/* Note : for security reasons, bpf code will follow a randomly
-	 * sized amount of int3 instructions
-	 */
-	u8		image[];
-};
-
-static struct bpf_binary_header *bpf_alloc_binary(unsigned int proglen,
-						  u8 **image_ptr)
+struct bpf_binary_header *bpf_alloc_binary(unsigned int proglen,
+					   u8 **image_ptr)
 {
 	unsigned int sz, hole;
 	struct bpf_binary_header *header;
@@ -772,13 +755,17 @@ out:
 	return;
 }
 
-void bpf_jit_free(struct sk_filter *fp)
+void bpf_free_binary(void *bpf_func)
 {
-	if (fp->bpf_func != sk_run_filter) {
-		unsigned long addr = (unsigned long)fp->bpf_func & PAGE_MASK;
-		struct bpf_binary_header *header = (void *)addr;
+	unsigned long addr = (unsigned long)bpf_func & PAGE_MASK;
+	struct bpf_binary_header *header = (void *)addr;
 
-		set_memory_rw(addr, header->pages);
-		module_free(NULL, header);
-	}
+	set_memory_rw(addr, header->pages);
+	module_free(NULL, header);
+}
+
+void bpf_jit_free(struct sk_filter *fp)
+{
+	if (fp->bpf_func != sk_run_filter)
+		bpf_free_binary(fp->bpf_func);
 }
diff --git a/arch/x86/net/bpf_jit_comp.h b/arch/x86/net/bpf_jit_comp.h
new file mode 100644
index 0000000..2172261
--- /dev/null
+++ b/arch/x86/net/bpf_jit_comp.h
@@ -0,0 +1,36 @@
+/* bpf_jit_comp.h : BPF filter alloc/free routines
+ *
+ * This program is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU General Public License
+ * as published by the Free Software Foundation; version 2
+ * of the License.
+ */
+#ifndef __BPF_JIT_COMP_H
+#define __BPF_JIT_COMP_H
+
+#include <linux/uaccess.h>
+#include <asm/cacheflush.h>
+
+struct bpf_binary_header {
+	unsigned int	pages;
+	/* Note : for security reasons, bpf code will follow a randomly
+	 * sized amount of int3 instructions
+	 */
+	u8		image[];
+};
+
+static inline void bpf_flush_icache(void *start, void *end)
+{
+	mm_segment_t old_fs = get_fs();
+
+	set_fs(KERNEL_DS);
+	smp_wmb();
+	flush_icache_range((unsigned long)start, (unsigned long)end);
+	set_fs(old_fs);
+}
+
+struct bpf_binary_header *bpf_alloc_binary(unsigned int proglen,
+					   u8 **image_ptr);
+void bpf_free_binary(void *image_ptr);
+
+#endif
diff --git a/include/linux/filter.h b/include/linux/filter.h
index a6ac848..d45ad80 100644
--- a/include/linux/filter.h
+++ b/include/linux/filter.h
@@ -48,13 +48,86 @@ extern int sk_chk_filter(struct sock_filter *filter, unsigned int flen);
 extern int sk_get_filter(struct sock *sk, struct sock_filter __user *filter, unsigned len);
 extern void sk_decode_filter(struct sock_filter *filt, struct sock_filter *to);
 
+/* type of value stored in a BPF register or
+ * passed into function as an argument or
+ * returned from the function */
+enum bpf_reg_type {
+	INVALID_PTR,  /* reg doesn't contain a valid pointer */
+	PTR_TO_CTX,   /* reg points to bpf_context */
+	PTR_TO_TABLE, /* reg points to table element */
+	PTR_TO_TABLE_CONDITIONAL, /* points to table element or NULL */
+	PTR_TO_STACK,     /* reg == frame_pointer */
+	PTR_TO_STACK_IMM, /* reg == frame_pointer + imm */
+	RET_INTEGER, /* function returns integer */
+	RET_VOID,    /* function returns void */
+	CONST_ARG    /* function expects integer constant argument */
+};
+
+/* BPF function prototype */
+struct bpf_func_proto {
+	enum bpf_reg_type ret_type;
+	enum bpf_reg_type arg1_type;
+	enum bpf_reg_type arg2_type;
+	enum bpf_reg_type arg3_type;
+	enum bpf_reg_type arg4_type;
+};
+
+/* struct bpf_context access type */
+enum bpf_access_type {
+	BPF_READ = 1,
+	BPF_WRITE = 2
+};
+
+struct bpf_context_access {
+	int size;
+	enum bpf_access_type type;
+};
+
+struct bpf_callbacks {
+	/* execute BPF func_id with given registers */
+	void (*execute_func)(int id, u64 *regs);
+
+	/* return address of func_id suitable to be called from JITed program */
+	void *(*jit_select_func)(int id);
+
+	/* return BPF function prototype for verification */
+	const struct bpf_func_proto* (*get_func_proto)(int id);
+
+	/* return expected bpf_context access size and permissions
+	 * for given byte offset within bpf_context */
+	const struct bpf_context_access *(*get_context_access)(int off);
+};
+
+struct bpf_program {
+	u16   insn_cnt;
+	u16   table_cnt;
+	struct bpf_insn *insns;
+	struct bpf_table *tables;
+	struct bpf_callbacks *cb;
+	void (*jit_image)(struct bpf_context *ctx);
+};
+/* load BPF program from user space, setup callback extensions
+ * and run through verifier */
+int bpf_load(struct bpf_image *image, struct bpf_callbacks *cb,
+	     struct bpf_program **prog);
+/* free BPF program */
+void bpf_free(struct bpf_program *prog);
+/* execute BPF program */
+void bpf_run(struct bpf_program *prog, struct bpf_context *ctx);
+/* verify correctness of BPF program */
+int bpf_check(struct bpf_program *prog);
+/* pr_debug one BPF instruction and registers */
+void pr_debug_bpf_insn(struct bpf_insn *insn, u64 *regs);
+
 #ifdef CONFIG_BPF_JIT
 #include <stdarg.h>
 #include <linux/linkage.h>
 #include <linux/printk.h>
 
-extern void bpf_jit_compile(struct sk_filter *fp);
-extern void bpf_jit_free(struct sk_filter *fp);
+void bpf_jit_compile(struct sk_filter *fp);
+void bpf_jit_free(struct sk_filter *fp);
+void bpf2_jit_compile(struct bpf_program *prog);
+void bpf2_jit_free(struct bpf_program *prog);
 
 static inline void bpf_jit_dump(unsigned int flen, unsigned int proglen,
 				u32 pass, void *image)
@@ -73,6 +146,12 @@ static inline void bpf_jit_compile(struct sk_filter *fp)
 static inline void bpf_jit_free(struct sk_filter *fp)
 {
 }
+static inline void bpf2_jit_compile(struct bpf_program *prog)
+{
+}
+static inline void bpf2_jit_free(struct bpf_program *prog)
+{
+}
 #define SK_RUN_FILTER(FILTER, SKB) sk_run_filter(SKB, FILTER->insns)
 #endif
 
diff --git a/include/uapi/linux/filter.h b/include/uapi/linux/filter.h
index 8eb9cca..be244e6 100644
--- a/include/uapi/linux/filter.h
+++ b/include/uapi/linux/filter.h
@@ -1,3 +1,4 @@
+/* extended BPF is Copyright (c) 2011-2013, PLUMgrid, http://plumgrid.com */
 /*
  * Linux Socket Filter Data Structures
  */
@@ -19,7 +20,7 @@
  *	Try and keep these values and structures similar to BSD, especially
  *	the BPF code definitions which need to match so you can share filters
  */
- 
+
 struct sock_filter {	/* Filter block */
 	__u16	code;   /* Actual filter code */
 	__u8	jt;	/* Jump true */
@@ -46,11 +47,93 @@ struct sock_fprog {	/* Required for SO_ATTACH_FILTER. */
 #define         BPF_RET         0x06
 #define         BPF_MISC        0x07
 
+struct bpf_insn {
+	__u8	code;    /* opcode */
+	__u8    a_reg:4; /* dest register */
+	__u8    x_reg:4; /* source register */
+	__s16	off;     /* signed offset */
+	__s32	imm;     /* signed immediate constant */
+};
+
+struct bpf_table {
+	__u32   id;
+	__u32   type;
+	__u32   key_size;
+	__u32   elem_size;
+	__u32   max_entries;
+	__u32   param1;         /* meaning is table-dependent */
+};
+
+enum bpf_table_type {
+	BPF_TABLE_HASH = 1,
+	BPF_TABLE_LPM
+};
+
+struct bpf_image {
+	/* version > 4096 to be binary compatible with original bpf */
+	__u16   version;
+	__u16   rsvd;
+	__u16   insn_cnt;
+	__u16   table_cnt;
+	struct bpf_insn __user  *insns;
+	struct bpf_table __user *tables;
+};
+
+/* maximum number of insns and tables in a BPF program */
+#define MAX_BPF_INSNS 4096
+#define MAX_BPF_TABLES 64
+
+/* pointer to bpf_context is the first and only argument to BPF program
+ * its definition is use-case specific */
+struct bpf_context;
+
+/* bpf_add|sub|...: a += x
+ *         bpf_mov: a = x
+ *       bpf_bswap: bswap a */
+#define BPF_INSN_ALU(op, a, x) \
+	(struct bpf_insn){BPF_ALU|BPF_OP(op)|BPF_X, a, x, 0, 0}
+
+/* bpf_add|sub|...: a += imm
+ *         bpf_mov: a = imm */
+#define BPF_INSN_ALU_IMM(op, a, imm) \
+	(struct bpf_insn){BPF_ALU|BPF_OP(op)|BPF_K, a, 0, 0, imm}
+
+/* a = *(uint *) (x + off) */
+#define BPF_INSN_LD(size, a, x, off) \
+	(struct bpf_insn){BPF_LDX|BPF_SIZE(size)|BPF_REL, a, x, off, 0}
+
+/* *(uint *) (a + off) = x */
+#define BPF_INSN_ST(size, a, off, x) \
+	(struct bpf_insn){BPF_STX|BPF_SIZE(size)|BPF_REL, a, x, off, 0}
+
+/* *(uint *) (a + off) = imm */
+#define BPF_INSN_ST_IMM(size, a, off, imm) \
+	(struct bpf_insn){BPF_ST|BPF_SIZE(size)|BPF_REL, a, 0, off, imm}
+
+/* lock *(uint *) (a + off) += x */
+#define BPF_INSN_XADD(size, a, off, x) \
+	(struct bpf_insn){BPF_STX|BPF_SIZE(size)|BPF_XADD, a, x, off, 0}
+
+/* if (a 'op' x) pc += off else fall through */
+#define BPF_INSN_JUMP(op, a, x, off) \
+	(struct bpf_insn){BPF_JMP|BPF_OP(op)|BPF_X, a, x, off, 0}
+
+/* if (a 'op' imm) pc += off else fall through */
+#define BPF_INSN_JUMP_IMM(op, a, imm, off) \
+	(struct bpf_insn){BPF_JMP|BPF_OP(op)|BPF_K, a, 0, off, imm}
+
+#define BPF_INSN_RET() \
+	(struct bpf_insn){BPF_RET|BPF_K, 0, 0, 0, 0}
+
+#define BPF_INSN_CALL(fn_code) \
+	(struct bpf_insn){BPF_JMP|BPF_CALL, 0, 0, 0, fn_code}
+
 /* ld/ldx fields */
 #define BPF_SIZE(code)  ((code) & 0x18)
 #define         BPF_W           0x00
 #define         BPF_H           0x08
 #define         BPF_B           0x10
+#define         BPF_DW          0x18
 #define BPF_MODE(code)  ((code) & 0xe0)
 #define         BPF_IMM         0x00
 #define         BPF_ABS         0x20
@@ -58,6 +141,8 @@ struct sock_fprog {	/* Required for SO_ATTACH_FILTER. */
 #define         BPF_MEM         0x60
 #define         BPF_LEN         0x80
 #define         BPF_MSH         0xa0
+#define         BPF_REL         0xc0
+#define         BPF_XADD        0xe0 /* exclusive add */
 
 /* alu/jmp fields */
 #define BPF_OP(code)    ((code) & 0xf0)
@@ -68,20 +153,54 @@ struct sock_fprog {	/* Required for SO_ATTACH_FILTER. */
 #define         BPF_OR          0x40
 #define         BPF_AND         0x50
 #define         BPF_LSH         0x60
-#define         BPF_RSH         0x70
+#define         BPF_RSH         0x70 /* logical shift right */
 #define         BPF_NEG         0x80
 #define		BPF_MOD		0x90
 #define		BPF_XOR		0xa0
+#define		BPF_MOV		0xb0 /* mov reg to reg */
+#define		BPF_ARSH	0xc0 /* sign extending arithmetic shift right */
+#define		BPF_BSWAP32	0xd0 /* swap lower 4 bytes of 64-bit register */
+#define		BPF_BSWAP64	0xe0 /* swap all 8 bytes of 64-bit register */
 
 #define         BPF_JA          0x00
-#define         BPF_JEQ         0x10
-#define         BPF_JGT         0x20
-#define         BPF_JGE         0x30
+#define         BPF_JEQ         0x10 /* jump == */
+#define         BPF_JGT         0x20 /* GT is unsigned '>', JA in x86 */
+#define         BPF_JGE         0x30 /* GE is unsigned '>=', JAE in x86 */
 #define         BPF_JSET        0x40
+#define         BPF_JNE         0x50 /* jump != */
+#define         BPF_JSGT        0x60 /* SGT is signed '>', GT in x86 */
+#define         BPF_JSGE        0x70 /* SGE is signed '>=', GE in x86 */
+#define         BPF_CALL        0x80 /* function call */
 #define BPF_SRC(code)   ((code) & 0x08)
 #define         BPF_K           0x00
 #define         BPF_X           0x08
 
+/* 64-bit registers */
+#define         R0              0
+#define         R1              1
+#define         R2              2
+#define         R3              3
+#define         R4              4
+#define         R5              5
+#define         R6              6
+#define         R7              7
+#define         R8              8
+#define         R9              9
+#define         __fp__          10
+
+/* all types of BPF programs support at least two functions:
+ * bpf_table_lookup() and bpf_table_update()
+ * contents of bpf_context are use-case specific
+ * BPF engine can be extended with additional functions */
+enum {
+	FUNC_bpf_table_lookup = 1,
+	FUNC_bpf_table_update = 2,
+	FUNC_bpf_max_id = 1024
+};
+void *bpf_table_lookup(struct bpf_context *ctx, int table_id, const void *key);
+int bpf_table_update(struct bpf_context *ctx, int table_id, const void *key,
+		     const void *leaf);
+
 /* ret - BPF_K and BPF_X also apply */
 #define BPF_RVAL(code)  ((code) & 0x18)
 #define         BPF_A           0x10
@@ -134,5 +253,4 @@ struct sock_fprog {	/* Required for SO_ATTACH_FILTER. */
 #define SKF_NET_OFF   (-0x100000)
 #define SKF_LL_OFF    (-0x200000)
 
-
 #endif /* _UAPI__LINUX_FILTER_H__ */
diff --git a/net/core/Makefile b/net/core/Makefile
index b33b996..f04e016 100644
--- a/net/core/Makefile
+++ b/net/core/Makefile
@@ -9,7 +9,7 @@ obj-$(CONFIG_SYSCTL) += sysctl_net_core.o
 
 obj-y		     += dev.o ethtool.o dev_addr_lists.o dst.o netevent.o \
 			neighbour.o rtnetlink.o utils.o link_watch.o filter.o \
-			sock_diag.o dev_ioctl.o
+			sock_diag.o dev_ioctl.o bpf_run.o bpf_check.o
 
 obj-$(CONFIG_XFRM) += flow.o
 obj-y += net-sysfs.o
diff --git a/net/core/bpf_check.c b/net/core/bpf_check.c
new file mode 100644
index 0000000..ed99cd2
--- /dev/null
+++ b/net/core/bpf_check.c
@@ -0,0 +1,1049 @@
+/* Copyright (c) 2011-2013 PLUMgrid, http://plumgrid.com
+ *
+ * This program is free software; you can redistribute it and/or
+ * modify it under the terms of version 2 of the GNU General Public
+ * License as published by the Free Software Foundation.
+ *
+ * This program is distributed in the hope that it will be useful, but
+ * WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU
+ * General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public License
+ * along with this program; if not, write to the Free Software
+ * Foundation, Inc., 51 Franklin Street, Fifth Floor, Boston, MA
+ * 02110-1301, USA
+ */
+#include <linux/kernel.h>
+#include <linux/types.h>
+#include <linux/slab.h>
+#include <linux/filter.h>
+
+/* bpf_check() is a static code analyzer that walks the BPF program
+ * instruction by instruction and updates register/stack state.
+ * All paths of conditional branches are analyzed until 'ret' insn.
+ *
+ * At the first pass depth-first-search verifies that the BPF program is a DAG.
+ * It rejects the following programs:
+ * - larger than 4K insns or 64 tables
+ * - if loop is present (detected via back-edge)
+ * - unreachable insns exist (shouldn't be a forest. program = one function)
+ * - more than one ret insn
+ * - ret insn is not the last insn
+ * - out of bounds or malformed jumps
+ * The second pass descends all possible paths from the 1st insn.
+ * Conditional branch target insns keep a linked list of verifier states.
+ * If the state was already visited, this path can be pruned.
+ * If it wasn't a DAG, such state pruning would be incorrect, since it would
+ * skip cycles. Since it's analyzing all paths through the program,
+ * the length of the analysis is limited to 32k insn, which may be hit even
+ * if insn_cnt < 4K, when there are too many branches that change stack/regs.
+ * Number of 'branches to be analyzed' is limited to 1k
+ *
+ * All registers are 64-bit (even on 32-bit arch)
+ * R0 - return register
+ * R1-R5 argument passing registers
+ * R6-R9 callee saved registers
+ * R10 - frame pointer read-only
+ *
+ * At the start of BPF program the register R1 contains a pointer to bpf_context
+ * and has type PTR_TO_CTX.
+ *
+ * bpf_table_lookup() function returns either pointer to table value or NULL
+ * which is type PTR_TO_TABLE_CONDITIONAL. Once it passes through !=0 insn
+ * the register holding that pointer in the true branch changes state to
+ * PTR_TO_TABLE and the same register changes state to INVALID_PTR in the false
+ * branch. See check_cond_jmp_op()
+ *
+ * R10 has type PTR_TO_STACK. The sequence 'mov Rx, R10; add Rx, imm' changes
+ * Rx state to PTR_TO_STACK_IMM and immediate constant is saved for further
+ * stack bounds checking
+ *
+ * registers used to pass pointers to function calls are verified against
+ * function prototypes
+ * Ex: before the call to bpf_table_lookup(), R1 must have type PTR_TO_CTX
+ * R2 must contain integer constant and R3 PTR_TO_STACK_IMM
+ * Integer constant in R2 is a table_id. It's checked that 0 <= R2 < table_cnt
+ * and corresponding table_info->key_size fetched to check that
+ * [R3, R3 + table_info->key_size) are within stack limits and all that stack
+ * memory was initialized earlier by BPF program.
+ * After bpf_table_lookup() call insn, R0 is set to PTR_TO_TABLE_CONDITIONAL
+ * R1-R5 are cleared and no longer readable (but still writeable).
+ *
+ * load/store alignment is checked
+ * Ex: stx [Rx + 3], (u32)Ry is rejected
+ *
+ * load/store to stack bounds checked and register spill is tracked
+ * Ex: stx [R10 + 0], (u8)Rx is rejected
+ *
+ * load/store to table bounds checked and table_id provides table size
+ * Ex: stx [Rx + 8], (u16)Ry is ok, if Rx is PTR_TO_TABLE and
+ * 8 + sizeof(u16) <= table_info->elem_size
+ *
+ * load/store to bpf_context checked against known fields
+ *
+ * Future improvements:
+ * stack size is hardcoded to 512 bytes maximum per program, relax it
+ */
+#define _(OP) ({ int ret = OP; if (ret < 0) return ret; })
+
+/* JITed code allocates 512 bytes and uses the bottom 4 slots
+ * to save R6-R9
+ */
+#define MAX_BPF_STACK (512 - 4 * 8)
+
+struct reg_state {
+	enum bpf_reg_type ptr;
+	bool read_ok;
+	int imm;
+};
+
+#define MAX_REG 11
+
+enum bpf_stack_slot_type {
+	STACK_INVALID,    /* nothing was stored in this stack slot */
+	STACK_SPILL,      /* 1st byte of register spilled into stack */
+	STACK_SPILL_PART, /* other 7 bytes of register spill */
+	STACK_MISC	  /* BPF program wrote some data into this slot */
+};
+
+struct bpf_stack_slot {
+	enum bpf_stack_slot_type type;
+	enum bpf_reg_type ptr;
+	int imm;
+};
+
+/* state of the program:
+ * type of all registers and stack info
+ */
+struct verifier_state {
+	struct reg_state regs[MAX_REG];
+	struct bpf_stack_slot stack[MAX_BPF_STACK];
+};
+
+/* linked list of verifier states
+ * used to prune search
+ */
+struct verifier_state_list {
+	struct verifier_state state;
+	struct verifier_state_list *next;
+};
+
+/* verifier_state + insn_idx are pushed to stack
+ * when branch is encountered
+ */
+struct verifier_stack_elem {
+	struct verifier_state st;
+	int insn_idx; /* at insn 'insn_idx' the program state is 'st' */
+	struct verifier_stack_elem *next;
+};
+
+/* single container for all structs
+ * one verifier_env per bpf_check() call
+ */
+struct verifier_env {
+	struct bpf_table *tables;
+	int table_cnt;
+	struct verifier_stack_elem *head;
+	int stack_size;
+	struct verifier_state cur_state;
+	struct verifier_state_list **branch_landing;
+	const struct bpf_func_proto* (*get_func_proto)(int id);
+	const struct bpf_context_access *(*get_context_access)(int off);
+};
+
+static int pop_stack(struct verifier_env *env)
+{
+	int insn_idx;
+	struct verifier_stack_elem *elem;
+	if (env->head == NULL)
+		return -1;
+	memcpy(&env->cur_state, &env->head->st, sizeof(env->cur_state));
+	insn_idx = env->head->insn_idx;
+	elem = env->head->next;
+	kfree(env->head);
+	env->head = elem;
+	env->stack_size--;
+	return insn_idx;
+}
+
+static struct verifier_state *push_stack(struct verifier_env *env, int insn_idx)
+{
+	struct verifier_stack_elem *elem;
+	elem = kmalloc(sizeof(struct verifier_stack_elem), GFP_KERNEL);
+	if (!elem)
+		goto err;
+	memcpy(&elem->st, &env->cur_state, sizeof(env->cur_state));
+	elem->insn_idx = insn_idx;
+	elem->next = env->head;
+	env->head = elem;
+	env->stack_size++;
+	if (env->stack_size > 1024) {
+		pr_err("BPF program is too complex\n");
+		goto err;
+	}
+	return &elem->st;
+err:
+	/* pop all elements and return */
+	while (pop_stack(env) >= 0);
+	return NULL;
+}
+
+#define CALLER_SAVED_REGS 6
+static const int caller_saved[CALLER_SAVED_REGS] = { R0, R1, R2, R3, R4, R5 };
+
+static void init_reg_state(struct reg_state *regs)
+{
+	struct reg_state *reg;
+	int i;
+	for (i = 0; i < MAX_REG; i++) {
+		regs[i].ptr = INVALID_PTR;
+		regs[i].read_ok = false;
+		regs[i].imm = 0xbadbad;
+	}
+	reg = regs + __fp__;
+	reg->ptr = PTR_TO_STACK;
+	reg->read_ok = true;
+
+	reg = regs + R1;	/* 1st arg to a function */
+	reg->ptr = PTR_TO_CTX;
+	reg->read_ok = true;
+}
+
+static void mark_reg_no_ptr(struct reg_state *regs, int regno)
+{
+	regs[regno].ptr = INVALID_PTR;
+	regs[regno].imm = 0xbadbad;
+	regs[regno].read_ok = true;
+}
+
+static int check_reg_arg(struct reg_state *regs, int regno, bool is_src)
+{
+	if (is_src) {
+		if (!regs[regno].read_ok) {
+			pr_err("R%d !read_ok\n", regno);
+			return -EACCES;
+		}
+	} else {
+		if (regno == __fp__)
+			/* frame pointer is read only */
+			return -EACCES;
+		mark_reg_no_ptr(regs, regno);
+	}
+	return 0;
+}
+
+static int bpf_size_to_bytes(int bpf_size)
+{
+	if (bpf_size == BPF_W)
+		return 4;
+	else if (bpf_size == BPF_H)
+		return 2;
+	else if (bpf_size == BPF_B)
+		return 1;
+	else if (bpf_size == BPF_DW)
+		return 8;
+	else
+		return -EACCES;
+}
+
+static int check_stack_write(struct verifier_state *state, int off, int size,
+			     int value_regno)
+{
+	int i;
+	struct bpf_stack_slot *slot;
+	if (value_regno >= 0 &&
+	    (state->regs[value_regno].ptr == PTR_TO_TABLE ||
+	     state->regs[value_regno].ptr == PTR_TO_CTX)) {
+
+		/* register containing pointer is being spilled into stack */
+		if (size != 8) {
+			pr_err("invalid size of register spill\n");
+			return -EACCES;
+		}
+
+		slot = &state->stack[MAX_BPF_STACK + off];
+		slot->type = STACK_SPILL;
+		/* save register state */
+		slot->ptr = state->regs[value_regno].ptr;
+		slot->imm = state->regs[value_regno].imm;
+		for (i = 1; i < 8; i++) {
+			slot = &state->stack[MAX_BPF_STACK + off + i];
+			slot->type = STACK_SPILL_PART;
+		}
+	} else {
+
+		/* regular write of data into stack */
+		for (i = 0; i < size; i++) {
+			slot = &state->stack[MAX_BPF_STACK + off + i];
+			slot->type = STACK_MISC;
+		}
+	}
+	return 0;
+}
+
+static int check_stack_read(struct verifier_state *state, int off, int size,
+			    int value_regno)
+{
+	int i;
+	struct bpf_stack_slot *slot;
+
+	slot = &state->stack[MAX_BPF_STACK + off];
+
+	if (slot->type == STACK_SPILL) {
+		if (size != 8) {
+			pr_err("invalid size of register spill\n");
+			return -EACCES;
+		}
+		for (i = 1; i < 8; i++) {
+			if (state->stack[MAX_BPF_STACK + off + i].type !=
+			    STACK_SPILL_PART) {
+				pr_err("corrupted spill memory\n");
+				return -EACCES;
+			}
+		}
+
+		/* restore register state from stack */
+		state->regs[value_regno].ptr = slot->ptr;
+		state->regs[value_regno].imm = slot->imm;
+		state->regs[value_regno].read_ok = true;
+		return 0;
+	} else {
+		for (i = 0; i < size; i++) {
+			if (state->stack[MAX_BPF_STACK + off + i].type !=
+			    STACK_MISC) {
+				pr_err("invalid read from stack off %d+%d size %d\n",
+				       off, i, size);
+				return -EACCES;
+			}
+		}
+		/* have read misc data from the stack */
+		mark_reg_no_ptr(state->regs, value_regno);
+		return 0;
+	}
+}
+
+static int get_table_info(struct verifier_env *env, int table_id,
+			  struct bpf_table **table)
+{
+	/* if BPF program contains bpf_table_lookup(ctx, 1024, key)
+	 * the incorrect table_id will be caught here
+	 */
+	if (table_id < 0 || table_id >= env->table_cnt) {
+		pr_err("invalid access to table_id=%d max_tables=%d\n",
+		       table_id, env->table_cnt);
+		return -EACCES;
+	}
+	*table = &env->tables[table_id];
+	return 0;
+}
+
+/* check read/write into table element returned by bpf_table_lookup() */
+static int check_table_access(struct verifier_env *env, int regno, int off,
+			      int size)
+{
+	struct bpf_table *table;
+	int table_id = env->cur_state.regs[regno].imm;
+
+	_(get_table_info(env, table_id, &table));
+
+	if (off < 0 || off + size > table->elem_size) {
+		pr_err("invalid access to table_id=%d leaf_size=%d off=%d size=%d\n",
+		       table_id, table->elem_size, off, size);
+		return -EACCES;
+	}
+	return 0;
+}
+
+/* check access to 'struct bpf_context' fields */
+static int check_ctx_access(struct verifier_env *env, int off, int size,
+			    enum bpf_access_type t)
+{
+	const struct bpf_context_access *access;
+
+	if (off < 0 || off >= 32768/* struct bpf_context shouldn't be huge */)
+		goto error;
+
+	access = env->get_context_access(off);
+	if (!access)
+		goto error;
+
+	if (access->size == size && (access->type & t))
+		return 0;
+error:
+	pr_err("invalid bpf_context access off=%d size=%d\n", off, size);
+	return -EACCES;
+}
+
+static int check_mem_access(struct verifier_env *env, int regno, int off,
+			    int bpf_size, enum bpf_access_type t,
+			    int value_regno)
+{
+	struct verifier_state *state = &env->cur_state;
+	int size;
+	_(size = bpf_size_to_bytes(bpf_size));
+
+	if (off % size != 0) {
+		pr_err("misaligned access off %d size %d\n", off, size);
+		return -EACCES;
+	}
+
+	if (state->regs[regno].ptr == PTR_TO_TABLE) {
+		_(check_table_access(env, regno, off, size));
+		if (t == BPF_READ)
+			mark_reg_no_ptr(state->regs, value_regno);
+	} else if (state->regs[regno].ptr == PTR_TO_CTX) {
+		_(check_ctx_access(env, off, size, t));
+		if (t == BPF_READ)
+			mark_reg_no_ptr(state->regs, value_regno);
+	} else if (state->regs[regno].ptr == PTR_TO_STACK) {
+		if (off >= 0 || off < -MAX_BPF_STACK) {
+			pr_err("invalid stack off=%d size=%d\n", off, size);
+			return -EACCES;
+		}
+		if (t == BPF_WRITE)
+			_(check_stack_write(state, off, size, value_regno));
+		else
+			_(check_stack_read(state, off, size, value_regno));
+	} else {
+		pr_err("invalid mem access %d\n", state->regs[regno].ptr);
+		return -EACCES;
+	}
+	return 0;
+}
+
+static const struct bpf_func_proto funcs[] = {
+	[FUNC_bpf_table_lookup] = {PTR_TO_TABLE_CONDITIONAL, PTR_TO_CTX,
+				   CONST_ARG, PTR_TO_STACK_IMM},
+	[FUNC_bpf_table_update] = {RET_INTEGER, PTR_TO_CTX, CONST_ARG,
+				   PTR_TO_STACK_IMM, PTR_TO_STACK_IMM},
+};
+
+static int check_func_arg(struct reg_state *regs, int regno,
+			  enum bpf_reg_type expected_type, int *reg_values)
+{
+	struct reg_state *reg = regs + regno;
+	if (expected_type == INVALID_PTR)
+		return 0;
+
+	if (!reg->read_ok) {
+		pr_err("R%d !read_ok\n", regno);
+		return -EACCES;
+	}
+
+	if (reg->ptr != expected_type) {
+		pr_err("R%d ptr=%d expected=%d\n", regno, reg->ptr,
+		       expected_type);
+		return -EACCES;
+	} else if (expected_type == CONST_ARG) {
+		reg_values[regno] = reg->imm;
+	}
+
+	return 0;
+}
+
+/* when register 'regno' is passed into function that will read 'access_size'
+ * bytes from that pointer, make sure that it's within stack boundary
+ * and all elements of stack are initialized
+ */
+static int check_stack_boundary(struct verifier_state *state,
+				struct reg_state *regs, int regno,
+				int access_size)
+{
+	int off, i;
+
+	if (regs[regno].ptr != PTR_TO_STACK_IMM)
+		return -EACCES;
+
+	off = regs[regno].imm;
+	if (off >= 0 || off < -MAX_BPF_STACK || off + access_size > 0 ||
+	    access_size <= 0) {
+		pr_err("invalid stack ptr R%d off=%d access_size=%d\n",
+		       regno, off, access_size);
+		return -EACCES;
+	}
+
+	for (i = 0; i < access_size; i++) {
+		if (state->stack[MAX_BPF_STACK + off + i].type != STACK_MISC) {
+			pr_err("invalid indirect read from stack off %d+%d size %d\n",
+			       off, i, access_size);
+			return -EACCES;
+		}
+	}
+	return 0;
+}
+
+static int check_call(struct verifier_env *env, int func_id)
+{
+	int reg_values[MAX_REG] = {-1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1};
+	struct verifier_state *state = &env->cur_state;
+	const struct bpf_func_proto *fn = NULL;
+	struct reg_state *regs = state->regs;
+	struct reg_state *reg;
+	int i;
+
+	/* find function prototype */
+	if (func_id < 0 || func_id >= FUNC_bpf_max_id) {
+		pr_err("invalid func %d\n", func_id);
+		return -EINVAL;
+	}
+
+	if (func_id == FUNC_bpf_table_lookup ||
+	    func_id == FUNC_bpf_table_update) {
+		fn = &funcs[func_id];
+	} else {
+		if (env->get_func_proto)
+			fn = env->get_func_proto(func_id);
+		if (!fn || (fn->ret_type != RET_INTEGER &&
+			    fn->ret_type != RET_VOID)) {
+			pr_err("unknown func %d\n", func_id);
+			return -EINVAL;
+		}
+	}
+
+	/* check args */
+	_(check_func_arg(regs, R1, fn->arg1_type, reg_values));
+	_(check_func_arg(regs, R2, fn->arg2_type, reg_values));
+	_(check_func_arg(regs, R3, fn->arg3_type, reg_values));
+	_(check_func_arg(regs, R4, fn->arg4_type, reg_values));
+
+	if (func_id == FUNC_bpf_table_lookup) {
+		struct bpf_table *table;
+		int table_id = reg_values[R2];
+
+		_(get_table_info(env, table_id, &table));
+
+		/* bpf_table_lookup(ctx, table_id, key) call: check that
+		 * [key, key + table_info->key_size) are within stack limits
+		 * and initialized
+		 */
+		_(check_stack_boundary(state, regs, R3, table->key_size));
+
+	} else if (func_id == FUNC_bpf_table_update) {
+		struct bpf_table *table;
+		int table_id = reg_values[R2];
+
+		_(get_table_info(env, table_id, &table));
+
+		/* bpf_table_update(ctx, table_id, key, value) check
+		 * that key and value are valid
+		 */
+		_(check_stack_boundary(state, regs, R3, table->key_size));
+		_(check_stack_boundary(state, regs, R4, table->elem_size));
+
+	} else if (fn->arg1_type == PTR_TO_STACK_IMM) {
+		/* bpf_xxx(buf, len) call will access 'len' bytes
+		 * from stack pointer 'buf'. Check it
+		 */
+		_(check_stack_boundary(state, regs, R1, reg_values[R2]));
+
+	} else if (fn->arg2_type == PTR_TO_STACK_IMM) {
+		/* bpf_yyy(arg1, buf, len) call will access 'len' bytes
+		 * from stack pointer 'buf'. Check it
+		 */
+		_(check_stack_boundary(state, regs, R2, reg_values[R3]));
+
+	} else if (fn->arg3_type == PTR_TO_STACK_IMM) {
+		/* bpf_zzz(arg1, arg2, buf, len) call will access 'len' bytes
+		 * from stack pointer 'buf'. Check it
+		 */
+		_(check_stack_boundary(state, regs, R3, reg_values[R4]));
+	}
+
+	/* reset caller saved regs */
+	for (i = 0; i < CALLER_SAVED_REGS; i++) {
+		reg = regs + caller_saved[i];
+		reg->read_ok = false;
+		reg->ptr = INVALID_PTR;
+		reg->imm = 0xbadbad;
+	}
+
+	/* update return register */
+	reg = regs + R0;
+	if (fn->ret_type == RET_INTEGER) {
+		reg->read_ok = true;
+		reg->ptr = INVALID_PTR;
+	} else if (fn->ret_type != RET_VOID) {
+		reg->read_ok = true;
+		reg->ptr = fn->ret_type;
+		if (func_id == FUNC_bpf_table_lookup)
+			/* when ret_type == PTR_TO_TABLE_CONDITIONAL
+			 * remember table_id, so that check_table_access()
+			 * can check 'elem_size' boundary of memory access
+			 * to table element returned from bpf_table_lookup()
+			 */
+			reg->imm = reg_values[R2];
+	}
+	return 0;
+}
+
+static int check_alu_op(struct reg_state *regs, struct bpf_insn *insn)
+{
+	u16 opcode = BPF_OP(insn->code);
+
+	if (opcode == BPF_BSWAP32 || opcode == BPF_BSWAP64 ||
+	    opcode == BPF_NEG) {
+		if (BPF_SRC(insn->code) != BPF_X)
+			return -EINVAL;
+		/* check src operand */
+		_(check_reg_arg(regs, insn->a_reg, 1));
+
+		/* check dest operand */
+		_(check_reg_arg(regs, insn->a_reg, 0));
+
+	} else if (opcode == BPF_MOV) {
+
+		if (BPF_SRC(insn->code) == BPF_X)
+			/* check src operand */
+			_(check_reg_arg(regs, insn->x_reg, 1));
+
+		/* check dest operand */
+		_(check_reg_arg(regs, insn->a_reg, 0));
+
+		if (BPF_SRC(insn->code) == BPF_X) {
+			/* case: R1 = R2
+			 * copy register state to dest reg
+			 */
+			regs[insn->a_reg].ptr = regs[insn->x_reg].ptr;
+			regs[insn->a_reg].imm = regs[insn->x_reg].imm;
+		} else {
+			/* case: R = imm
+			 * remember the value we stored into this reg
+			 */
+			regs[insn->a_reg].ptr = CONST_ARG;
+			regs[insn->a_reg].imm = insn->imm;
+		}
+
+	} else {	/* all other ALU ops: and, sub, xor, add, ... */
+
+		int stack_relative = 0;
+
+		if (BPF_SRC(insn->code) == BPF_X)
+			/* check src1 operand */
+			_(check_reg_arg(regs, insn->x_reg, 1));
+
+		/* check src2 operand */
+		_(check_reg_arg(regs, insn->a_reg, 1));
+
+		if (opcode == BPF_ADD &&
+		    regs[insn->a_reg].ptr == PTR_TO_STACK &&
+		    BPF_SRC(insn->code) == BPF_K)
+			stack_relative = 1;
+
+		/* check dest operand */
+		_(check_reg_arg(regs, insn->a_reg, 0));
+
+		if (stack_relative) {
+			regs[insn->a_reg].ptr = PTR_TO_STACK_IMM;
+			regs[insn->a_reg].imm = insn->imm;
+		}
+	}
+
+	return 0;
+}
+
+static int check_cond_jmp_op(struct verifier_env *env, struct bpf_insn *insn,
+			     int insn_idx)
+{
+	struct reg_state *regs = env->cur_state.regs;
+	struct verifier_state *other_branch;
+	u16 opcode = BPF_OP(insn->code);
+
+	if (BPF_SRC(insn->code) == BPF_X)
+		/* check src1 operand */
+		_(check_reg_arg(regs, insn->x_reg, 1));
+
+	/* check src2 operand */
+	_(check_reg_arg(regs, insn->a_reg, 1));
+
+	other_branch = push_stack(env, insn_idx + insn->off + 1);
+	if (!other_branch)
+		return -EFAULT;
+
+	/* detect if R == 0 where R is returned value from table_lookup() */
+	if (BPF_SRC(insn->code) == BPF_K &&
+	    insn->imm == 0 && (opcode == BPF_JEQ ||
+			       opcode == BPF_JNE) &&
+	    regs[insn->a_reg].ptr == PTR_TO_TABLE_CONDITIONAL) {
+		if (opcode == BPF_JEQ) {
+			/* next fallthrough insn can access memory via
+			 * this register
+			 */
+			regs[insn->a_reg].ptr = PTR_TO_TABLE;
+			/* branch target cannot access it, since reg == 0 */
+			other_branch->regs[insn->a_reg].ptr = INVALID_PTR;
+		} else {
+			other_branch->regs[insn->a_reg].ptr = PTR_TO_TABLE;
+			regs[insn->a_reg].ptr = INVALID_PTR;
+		}
+	}
+	return 0;
+}
+
+
+/* non-recursive DFS pseudo code
+ * 1  procedure DFS-iterative(G,v):
+ * 2      label v as discovered
+ * 3      let S be a stack
+ * 4      S.push(v)
+ * 5      while S is not empty
+ * 6            t <- S.pop()
+ * 7            if t is what we're looking for:
+ * 8                return t
+ * 9            for all edges e in G.adjacentEdges(t) do
+ * 10               if edge e is already labelled
+ * 11                   continue with the next edge
+ * 12               w <- G.adjacentVertex(t,e)
+ * 13               if vertex w is not discovered and not explored
+ * 14                   label e as tree-edge
+ * 15                   label w as discovered
+ * 16                   S.push(w)
+ * 17                   continue at 5
+ * 18               else if vertex w is discovered
+ * 19                   label e as back-edge
+ * 20               else
+ * 21                   // vertex w is explored
+ * 22                   label e as forward- or cross-edge
+ * 23           label t as explored
+ * 24           S.pop()
+ *
+ * convention:
+ * 1 - discovered
+ * 2 - discovered and 1st branch labelled
+ * 3 - discovered and 1st and 2nd branch labelled
+ * 4 - explored
+ */
+
+#define STATE_END ((struct verifier_state_list *)-1)
+
+#define PUSH_INT(I) \
+	do { \
+		if (cur_stack >= insn_cnt) { \
+			ret = -E2BIG; \
+			goto free_st; \
+		} \
+		stack[cur_stack++] = I; \
+	} while (0)
+
+#define PEAK_INT() \
+	({ \
+		int _ret; \
+		if (cur_stack == 0) \
+			_ret = -1; \
+		else \
+			_ret = stack[cur_stack - 1]; \
+		_ret; \
+	 })
+
+#define POP_INT() \
+	({ \
+		int _ret; \
+		if (cur_stack == 0) \
+			_ret = -1; \
+		else \
+			_ret = stack[--cur_stack]; \
+		_ret; \
+	 })
+
+#define PUSH_INSN(T, W, E) \
+	do { \
+		int w = W; \
+		if (E == 1 && st[T] >= 2) \
+			break; \
+		if (E == 2 && st[T] >= 3) \
+			break; \
+		if (w >= insn_cnt) { \
+			ret = -EACCES; \
+			goto free_st; \
+		} \
+		if (E == 2) \
+			/* mark branch target for state pruning */ \
+			env->branch_landing[w] = STATE_END; \
+		if (st[w] == 0) { \
+			/* tree-edge */ \
+			st[T] = 1 + E; \
+			st[w] = 1; /* discovered */ \
+			PUSH_INT(w); \
+			goto peak_stack; \
+		} else if (st[w] == 1 || st[w] == 2 || st[w] == 3) { \
+			pr_err("back-edge from insn %d to %d\n", t, w); \
+			ret = -EINVAL; \
+			goto free_st; \
+		} else if (st[w] == 4) { \
+			/* forward- or cross-edge */ \
+			st[T] = 1 + E; \
+		} else { \
+			pr_err("insn state internal bug\n"); \
+			ret = -EFAULT; \
+			goto free_st; \
+		} \
+	} while (0)
+
+/* non-recursive depth-first-search to detect loops in BPF program
+ * loop == back-edge in directed graph
+ */
+static int check_cfg(struct verifier_env *env, struct bpf_insn *insns,
+		     int insn_cnt)
+{
+	int cur_stack = 0;
+	int *stack;
+	int ret = 0;
+	int *st;
+	int i, t;
+
+	if (insns[insn_cnt - 1].code != (BPF_RET | BPF_K)) {
+		pr_err("last insn is not a 'ret'\n");
+		return -EINVAL;
+	}
+
+	st = kzalloc(sizeof(int) * insn_cnt, GFP_KERNEL);
+	if (!st)
+		return -ENOMEM;
+
+	stack = kzalloc(sizeof(int) * insn_cnt, GFP_KERNEL);
+	if (!stack) {
+		kfree(st);
+		return -ENOMEM;
+	}
+
+	st[0] = 1; /* mark 1st insn as discovered */
+	PUSH_INT(0);
+
+peak_stack:
+	while ((t = PEAK_INT()) != -1) {
+		if (t == insn_cnt - 1)
+			goto mark_explored;
+
+		if (BPF_CLASS(insns[t].code) == BPF_RET) {
+			pr_err("extraneous 'ret'\n");
+			ret = -EINVAL;
+			goto free_st;
+		}
+
+		if (BPF_CLASS(insns[t].code) == BPF_JMP) {
+			u16 opcode = BPF_OP(insns[t].code);
+			if (opcode == BPF_CALL) {
+				PUSH_INSN(t, t + 1, 1);
+			} else if (opcode == BPF_JA) {
+				if (BPF_SRC(insns[t].code) != BPF_X) {
+					ret = -EINVAL;
+					goto free_st;
+				}
+				PUSH_INSN(t, t + insns[t].off + 1, 1);
+			} else {
+				PUSH_INSN(t, t + 1, 1);
+				PUSH_INSN(t, t + insns[t].off + 1, 2);
+			}
+		} else {
+			PUSH_INSN(t, t + 1, 1);
+		}
+
+mark_explored:
+		st[t] = 4; /* explored */
+		if (POP_INT() == -1) {
+			pr_err("pop_int internal bug\n");
+			ret = -EFAULT;
+			goto free_st;
+		}
+	}
+
+
+	for (i = 0; i < insn_cnt; i++) {
+		if (st[i] != 4) {
+			pr_err("unreachable insn %d\n", i);
+			ret = -EINVAL;
+			goto free_st;
+		}
+	}
+
+free_st:
+	kfree(st);
+	kfree(stack);
+	return ret;
+}
+
+static int is_state_visited(struct verifier_env *env, int insn_idx)
+{
+	struct verifier_state_list *sl;
+	struct verifier_state_list *new_sl;
+	sl = env->branch_landing[insn_idx];
+	if (!sl)
+		/* no branch jump to this insn, ignore it */
+		return 0;
+
+	while (sl != STATE_END) {
+		if (memcmp(&sl->state, &env->cur_state,
+			   sizeof(env->cur_state)) == 0)
+			/* reached the same register/stack state,
+			 * prune the search
+			 */
+			return 1;
+		sl = sl->next;
+	}
+	new_sl = kmalloc(sizeof(struct verifier_state_list), GFP_KERNEL);
+
+	if (!new_sl)
+		/* ignore kmalloc error, since it's rare and doesn't affect
+		 * correctness of algorithm
+		 */
+		return 0;
+	/* add new state to the head of linked list */
+	memcpy(&new_sl->state, &env->cur_state, sizeof(env->cur_state));
+	new_sl->next = env->branch_landing[insn_idx];
+	env->branch_landing[insn_idx] = new_sl;
+	return 0;
+}
+
+static int __bpf_check(struct verifier_env *env, struct bpf_insn *insns,
+		       int insn_cnt)
+{
+	int insn_idx;
+	int insn_processed = 0;
+	struct verifier_state *state = &env->cur_state;
+	struct reg_state *regs = state->regs;
+
+	init_reg_state(regs);
+	insn_idx = 0;
+	for (;;) {
+		struct bpf_insn *insn;
+		u16 class;
+
+		if (insn_idx >= insn_cnt) {
+			pr_err("invalid insn idx %d insn_cnt %d\n",
+			       insn_idx, insn_cnt);
+			return -EFAULT;
+		}
+
+		insn = &insns[insn_idx];
+		class = BPF_CLASS(insn->code);
+
+		if (++insn_processed > 32768) {
+			pr_err("BPF program is too large. Processed %d insn\n",
+			       insn_processed);
+			return -E2BIG;
+		}
+
+		/* pr_debug_bpf_insn(insn, NULL); */
+
+		if (is_state_visited(env, insn_idx))
+			goto process_ret;
+
+		if (class == BPF_ALU) {
+			_(check_alu_op(regs, insn));
+
+		} else if (class == BPF_LDX) {
+			if (BPF_MODE(insn->code) != BPF_REL)
+				return -EINVAL;
+
+			/* check src operand */
+			_(check_reg_arg(regs, insn->x_reg, 1));
+
+			_(check_mem_access(env, insn->x_reg, insn->off,
+					   BPF_SIZE(insn->code), BPF_READ,
+					   insn->a_reg));
+
+			/* dest reg state will be updated by mem_access */
+
+		} else if (class == BPF_STX) {
+			/* check src1 operand */
+			_(check_reg_arg(regs, insn->x_reg, 1));
+			/* check src2 operand */
+			_(check_reg_arg(regs, insn->a_reg, 1));
+			_(check_mem_access(env, insn->a_reg, insn->off,
+					   BPF_SIZE(insn->code), BPF_WRITE,
+					   insn->x_reg));
+
+		} else if (class == BPF_ST) {
+			if (BPF_MODE(insn->code) != BPF_REL)
+				return -EINVAL;
+			/* check src operand */
+			_(check_reg_arg(regs, insn->a_reg, 1));
+			_(check_mem_access(env, insn->a_reg, insn->off,
+					   BPF_SIZE(insn->code), BPF_WRITE,
+					   -1));
+
+		} else if (class == BPF_JMP) {
+			u16 opcode = BPF_OP(insn->code);
+			if (opcode == BPF_CALL) {
+				_(check_call(env, insn->imm));
+			} else if (opcode == BPF_JA) {
+				if (BPF_SRC(insn->code) != BPF_X)
+					return -EINVAL;
+				insn_idx += insn->off + 1;
+				continue;
+			} else {
+				_(check_cond_jmp_op(env, insn, insn_idx));
+			}
+
+		} else if (class == BPF_RET) {
+process_ret:
+			insn_idx = pop_stack(env);
+			if (insn_idx < 0)
+				break;
+			else
+				continue;
+		}
+
+		insn_idx++;
+	}
+
+	/* pr_debug("insn_processed %d\n", insn_processed); */
+	return 0;
+}
+
+static void free_states(struct verifier_env *env, int insn_cnt)
+{
+	int i;
+
+	for (i = 0; i < insn_cnt; i++) {
+		struct verifier_state_list *sl = env->branch_landing[i];
+		if (sl)
+			while (sl != STATE_END) {
+				struct verifier_state_list *sln = sl->next;
+				kfree(sl);
+				sl = sln;
+			}
+	}
+
+	kfree(env->branch_landing);
+}
+
+int bpf_check(struct bpf_program *prog)
+{
+	int ret;
+	struct verifier_env *env;
+
+	if (prog->insn_cnt <= 0 || prog->insn_cnt > MAX_BPF_INSNS ||
+	    prog->table_cnt < 0 || prog->table_cnt > MAX_BPF_TABLES) {
+		pr_err("BPF program has %d insn and %d tables. Max is %d/%d\n",
+		       prog->insn_cnt, prog->table_cnt,
+		       MAX_BPF_INSNS, MAX_BPF_TABLES);
+		return -E2BIG;
+	}
+
+	env = kzalloc(sizeof(struct verifier_env), GFP_KERNEL);
+	if (!env)
+		return -ENOMEM;
+
+	env->tables = prog->tables;
+	env->table_cnt = prog->table_cnt;
+	env->get_func_proto = prog->cb->get_func_proto;
+	env->get_context_access = prog->cb->get_context_access;
+	env->branch_landing = kzalloc(sizeof(struct verifier_state_list *) *
+				      prog->insn_cnt, GFP_KERNEL);
+
+	if (!env->branch_landing) {
+		kfree(env);
+		return -ENOMEM;
+	}
+
+	ret = check_cfg(env, prog->insns, prog->insn_cnt);
+	if (ret)
+		goto free_env;
+	ret = __bpf_check(env, prog->insns, prog->insn_cnt);
+free_env:
+	free_states(env, prog->insn_cnt);
+	kfree(env);
+	return ret;
+}
+EXPORT_SYMBOL(bpf_check);
diff --git a/net/core/bpf_run.c b/net/core/bpf_run.c
new file mode 100644
index 0000000..c47a362
--- /dev/null
+++ b/net/core/bpf_run.c
@@ -0,0 +1,422 @@
+/* Copyright (c) 2011-2013 PLUMgrid, http://plumgrid.com
+ *
+ * This program is free software; you can redistribute it and/or
+ * modify it under the terms of version 2 of the GNU General Public
+ * License as published by the Free Software Foundation.
+ *
+ * This program is distributed in the hope that it will be useful, but
+ * WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU
+ * General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public License
+ * along with this program; if not, write to the Free Software
+ * Foundation, Inc., 51 Franklin Street, Fifth Floor, Boston, MA
+ * 02110-1301, USA
+ */
+#include <linux/kernel.h>
+#include <linux/types.h>
+#include <linux/slab.h>
+#include <linux/uaccess.h>
+#include <linux/filter.h>
+
+static const char *const bpf_class_string[] = {
+	"ld", "ldx", "st", "stx", "alu", "jmp", "ret", "misc"
+};
+
+static const char *const bpf_alu_string[] = {
+	"+=", "-=", "*=", "/=", "|=", "&=", "<<=", ">>=", "neg",
+	"%=", "^=", "=", "s>>=", "bswap32", "bswap64", "BUG"
+};
+
+static const char *const bpf_ldst_string[] = {
+	"u32", "u16", "u8", "u64"
+};
+
+static const char *const bpf_jmp_string[] = {
+	"jmp", "==", ">", ">=", "&", "!=", "s>", "s>=", "call"
+};
+
+static const char *debug_reg(int regno, u64 *regs)
+{
+	static char reg_value[16][32];
+	if (!regs)
+		return "";
+	snprintf(reg_value[regno], sizeof(reg_value[regno]), "(0x%llx)",
+		 regs[regno]);
+	return reg_value[regno];
+}
+
+#define R(regno) debug_reg(regno, regs)
+
+void pr_debug_bpf_insn(struct bpf_insn *insn, u64 *regs)
+{
+	u16 class = BPF_CLASS(insn->code);
+	if (class == BPF_ALU) {
+		if (BPF_SRC(insn->code) == BPF_X)
+			pr_debug("code_%02x r%d%s %s r%d%s\n",
+				 insn->code, insn->a_reg, R(insn->a_reg),
+				 bpf_alu_string[BPF_OP(insn->code) >> 4],
+				 insn->x_reg, R(insn->x_reg));
+		else
+			pr_debug("code_%02x r%d%s %s %d\n",
+				 insn->code, insn->a_reg, R(insn->a_reg),
+				 bpf_alu_string[BPF_OP(insn->code) >> 4],
+				 insn->imm);
+	} else if (class == BPF_STX) {
+		if (BPF_MODE(insn->code) == BPF_REL)
+			pr_debug("code_%02x *(%s *)(r%d%s %+d) = r%d%s\n",
+				 insn->code,
+				 bpf_ldst_string[BPF_SIZE(insn->code) >> 3],
+				 insn->a_reg, R(insn->a_reg),
+				 insn->off, insn->x_reg, R(insn->x_reg));
+		else if (BPF_MODE(insn->code) == BPF_XADD)
+			pr_debug("code_%02x lock *(%s *)(r%d%s %+d) += r%d%s\n",
+				 insn->code,
+				 bpf_ldst_string[BPF_SIZE(insn->code) >> 3],
+				 insn->a_reg, R(insn->a_reg), insn->off,
+				 insn->x_reg, R(insn->x_reg));
+		else
+			pr_debug("BUG_%02x\n", insn->code);
+	} else if (class == BPF_ST) {
+		if (BPF_MODE(insn->code) != BPF_REL) {
+			pr_debug("BUG_st_%02x\n", insn->code);
+			return;
+		}
+		pr_debug("code_%02x *(%s *)(r%d%s %+d) = %d\n",
+			 insn->code,
+			 bpf_ldst_string[BPF_SIZE(insn->code) >> 3],
+			 insn->a_reg, R(insn->a_reg),
+			 insn->off, insn->imm);
+	} else if (class == BPF_LDX) {
+		if (BPF_MODE(insn->code) != BPF_REL) {
+			pr_debug("BUG_ldx_%02x\n", insn->code);
+			return;
+		}
+		pr_debug("code_%02x r%d = *(%s *)(r%d%s %+d)\n",
+			 insn->code, insn->a_reg,
+			 bpf_ldst_string[BPF_SIZE(insn->code) >> 3],
+			 insn->x_reg, R(insn->x_reg), insn->off);
+	} else if (class == BPF_JMP) {
+		u16 opcode = BPF_OP(insn->code);
+		if (opcode == BPF_CALL) {
+			pr_debug("code_%02x call %d\n", insn->code, insn->imm);
+		} else if (insn->code == (BPF_JMP | BPF_JA | BPF_X)) {
+			pr_debug("code_%02x goto pc%+d\n",
+				 insn->code, insn->off);
+		} else if (BPF_SRC(insn->code) == BPF_X) {
+			pr_debug("code_%02x if r%d%s %s r%d%s goto pc%+d\n",
+				 insn->code, insn->a_reg, R(insn->a_reg),
+				 bpf_jmp_string[BPF_OP(insn->code) >> 4],
+				 insn->x_reg, R(insn->x_reg), insn->off);
+		} else {
+			pr_debug("code_%02x if r%d%s %s 0x%x goto pc%+d\n",
+				 insn->code, insn->a_reg, R(insn->a_reg),
+				 bpf_jmp_string[BPF_OP(insn->code) >> 4],
+				 insn->imm, insn->off);
+		}
+	} else {
+		pr_debug("code_%02x %s\n", insn->code, bpf_class_string[class]);
+	}
+}
+
+void bpf_run(struct bpf_program *prog, struct bpf_context *ctx)
+{
+	struct bpf_insn *insn = prog->insns;
+	u64 stack[64];
+	u64 regs[16] = { };
+	regs[__fp__] = (u64)(ulong)&stack[64];
+	regs[R1] = (u64)(ulong)ctx;
+
+	for (;; insn++) {
+		const s32 K = insn->imm;
+		u64 tmp;
+		u64 *a_reg = &regs[insn->a_reg];
+		u64 *x_reg = &regs[insn->x_reg];
+#define A (*a_reg)
+#define X (*x_reg)
+		/*pr_debug_bpf_insn(insn, regs);*/
+		switch (insn->code) {
+			/* ALU */
+		case BPF_ALU | BPF_ADD | BPF_X:
+			A += X;
+			continue;
+		case BPF_ALU | BPF_ADD | BPF_K:
+			A += K;
+			continue;
+		case BPF_ALU | BPF_SUB | BPF_X:
+			A -= X;
+			continue;
+		case BPF_ALU | BPF_SUB | BPF_K:
+			A -= K;
+			continue;
+		case BPF_ALU | BPF_AND | BPF_X:
+			A &= X;
+			continue;
+		case BPF_ALU | BPF_AND | BPF_K:
+			A &= K;
+			continue;
+		case BPF_ALU | BPF_OR | BPF_X:
+			A |= X;
+			continue;
+		case BPF_ALU | BPF_OR | BPF_K:
+			A |= K;
+			continue;
+		case BPF_ALU | BPF_LSH | BPF_X:
+			A <<= X;
+			continue;
+		case BPF_ALU | BPF_LSH | BPF_K:
+			A <<= K;
+			continue;
+		case BPF_ALU | BPF_RSH | BPF_X:
+			A >>= X;
+			continue;
+		case BPF_ALU | BPF_RSH | BPF_K:
+			A >>= K;
+			continue;
+		case BPF_ALU | BPF_MOV | BPF_X:
+			A = X;
+			continue;
+		case BPF_ALU | BPF_MOV | BPF_K:
+			A = K;
+			continue;
+		case BPF_ALU | BPF_ARSH | BPF_X:
+			(*(s64 *) &A) >>= X;
+			continue;
+		case BPF_ALU | BPF_ARSH | BPF_K:
+			(*(s64 *) &A) >>= K;
+			continue;
+		case BPF_ALU | BPF_BSWAP32 | BPF_X:
+			A = __builtin_bswap32(A);
+			continue;
+		case BPF_ALU | BPF_BSWAP64 | BPF_X:
+			A = __builtin_bswap64(A);
+			continue;
+		case BPF_ALU | BPF_MOD | BPF_X:
+			tmp = A;
+			if (X)
+				A = do_div(tmp, X);
+			continue;
+		case BPF_ALU | BPF_MOD | BPF_K:
+			tmp = A;
+			if (K)
+				A = do_div(tmp, K);
+			continue;
+
+			/* CALL */
+		case BPF_JMP | BPF_CALL:
+			prog->cb->execute_func(K, regs);
+			continue;
+
+			/* JMP */
+		case BPF_JMP | BPF_JA | BPF_X:
+			insn += insn->off;
+			continue;
+		case BPF_JMP | BPF_JEQ | BPF_X:
+			if (A == X)
+				insn += insn->off;
+			continue;
+		case BPF_JMP | BPF_JEQ | BPF_K:
+			if (A == K)
+				insn += insn->off;
+			continue;
+		case BPF_JMP | BPF_JNE | BPF_X:
+			if (A != X)
+				insn += insn->off;
+			continue;
+		case BPF_JMP | BPF_JNE | BPF_K:
+			if (A != K)
+				insn += insn->off;
+			continue;
+		case BPF_JMP | BPF_JGT | BPF_X:
+			if (A > X)
+				insn += insn->off;
+			continue;
+		case BPF_JMP | BPF_JGT | BPF_K:
+			if (A > K)
+				insn += insn->off;
+			continue;
+		case BPF_JMP | BPF_JGE | BPF_X:
+			if (A >= X)
+				insn += insn->off;
+			continue;
+		case BPF_JMP | BPF_JGE | BPF_K:
+			if (A >= K)
+				insn += insn->off;
+			continue;
+		case BPF_JMP | BPF_JSGT | BPF_X:
+			if (((s64)A) > ((s64)X))
+				insn += insn->off;
+			continue;
+		case BPF_JMP | BPF_JSGT | BPF_K:
+			if (((s64)A) > ((s64)K))
+				insn += insn->off;
+			continue;
+		case BPF_JMP | BPF_JSGE | BPF_X:
+			if (((s64)A) >= ((s64)X))
+				insn += insn->off;
+			continue;
+		case BPF_JMP | BPF_JSGE | BPF_K:
+			if (((s64)A) >= ((s64)K))
+				insn += insn->off;
+			continue;
+
+			/* STX */
+		case BPF_STX | BPF_REL | BPF_B:
+			*(u8 *)(ulong)(A + insn->off) = X;
+			continue;
+		case BPF_STX | BPF_REL | BPF_H:
+			*(u16 *)(ulong)(A + insn->off) = X;
+			continue;
+		case BPF_STX | BPF_REL | BPF_W:
+			*(u32 *)(ulong)(A + insn->off) = X;
+			continue;
+		case BPF_STX | BPF_REL | BPF_DW:
+			*(u64 *)(ulong)(A + insn->off) = X;
+			continue;
+
+			/* ST */
+		case BPF_ST | BPF_REL | BPF_B:
+			*(u8 *)(ulong)(A + insn->off) = K;
+			continue;
+		case BPF_ST | BPF_REL | BPF_H:
+			*(u16 *)(ulong)(A + insn->off) = K;
+			continue;
+		case BPF_ST | BPF_REL | BPF_W:
+			*(u32 *)(ulong)(A + insn->off) = K;
+			continue;
+		case BPF_ST | BPF_REL | BPF_DW:
+			*(u64 *)(ulong)(A + insn->off) = K;
+			continue;
+
+			/* LDX */
+		case BPF_LDX | BPF_REL | BPF_B:
+			A = *(u8 *)(ulong)(X + insn->off);
+			continue;
+		case BPF_LDX | BPF_REL | BPF_H:
+			A = *(u16 *)(ulong)(X + insn->off);
+			continue;
+		case BPF_LDX | BPF_REL | BPF_W:
+			A = *(u32 *)(ulong)(X + insn->off);
+			continue;
+		case BPF_LDX | BPF_REL | BPF_DW:
+			A = *(u64 *)(ulong)(X + insn->off);
+			continue;
+
+			/* STX XADD */
+		case BPF_STX | BPF_XADD | BPF_B:
+			__sync_fetch_and_add((u8 *)(ulong)(A + insn->off),
+					     (u8)X);
+			continue;
+		case BPF_STX | BPF_XADD | BPF_H:
+			__sync_fetch_and_add((u16 *)(ulong)(A + insn->off),
+					     (u16)X);
+			continue;
+		case BPF_STX | BPF_XADD | BPF_W:
+			__sync_fetch_and_add((u32 *)(ulong)(A + insn->off),
+					     (u32)X);
+			continue;
+		case BPF_STX | BPF_XADD | BPF_DW:
+			__sync_fetch_and_add((u64 *)(ulong)(A + insn->off),
+					     (u64)X);
+			continue;
+
+			/* RET */
+		case BPF_RET | BPF_K:
+			return;
+		default:
+			/* bpf_check() will guarantee that
+			 * we never reach here
+			 */
+			pr_err("unknown opcode %02x\n", insn->code);
+			return;
+		}
+	}
+}
+EXPORT_SYMBOL(bpf_run);
+
+int bpf_load(struct bpf_image *image, struct bpf_callbacks *cb,
+	     struct bpf_program **p_prog)
+{
+	struct bpf_program *prog;
+	int ret;
+
+	if (!image || !cb || !cb->execute_func || !cb->get_func_proto ||
+	    !cb->get_context_access)
+		return -EINVAL;
+
+	if (image->insn_cnt <= 0 || image->insn_cnt > MAX_BPF_INSNS ||
+	    image->table_cnt < 0 || image->table_cnt > MAX_BPF_TABLES) {
+		pr_err("BPF program has %d insn and %d tables. Max is %d/%d\n",
+		       image->insn_cnt, image->table_cnt,
+		       MAX_BPF_INSNS, MAX_BPF_TABLES);
+		return -E2BIG;
+	}
+
+	prog = kzalloc(sizeof(struct bpf_program), GFP_KERNEL);
+	if (!prog)
+		return -ENOMEM;
+
+	prog->insn_cnt = image->insn_cnt;
+	prog->table_cnt = image->table_cnt;
+	prog->cb = cb;
+
+	prog->insns = kmalloc(sizeof(struct bpf_insn) * prog->insn_cnt,
+			      GFP_KERNEL);
+	if (!prog->insns) {
+		ret = -ENOMEM;
+		goto free_prog;
+	}
+
+	prog->tables = kmalloc(sizeof(struct bpf_table) * prog->table_cnt,
+			       GFP_KERNEL);
+	if (!prog->tables) {
+		ret = -ENOMEM;
+		goto free_insns;
+	}
+
+	if (copy_from_user(prog->insns, image->insns,
+			   sizeof(struct bpf_insn) * prog->insn_cnt)) {
+		ret = -EFAULT;
+		goto free_tables;
+	}
+
+	if (copy_from_user(prog->tables, image->tables,
+			   sizeof(struct bpf_table) * prog->table_cnt)) {
+		ret = -EFAULT;
+		goto free_tables;
+	}
+
+	/* verify BPF program */
+	ret = bpf_check(prog);
+	if (ret)
+		goto free_tables;
+
+	/* JIT it */
+	bpf2_jit_compile(prog);
+
+	*p_prog = prog;
+
+	return 0;
+
+free_tables:
+	kfree(prog->tables);
+free_insns:
+	kfree(prog->insns);
+free_prog:
+	kfree(prog);
+	return ret;
+}
+EXPORT_SYMBOL(bpf_load);
+
+void bpf_free(struct bpf_program *prog)
+{
+	if (!prog)
+		return;
+	bpf2_jit_free(prog);
+	kfree(prog->tables);
+	kfree(prog->insns);
+	kfree(prog);
+}
+EXPORT_SYMBOL(bpf_free);
+
-- 
1.7.9.5

* Re: [PATCH] dm9601: fix IFF_ALLMULTI handling
From: David Miller @ 2013-09-30 23:49 UTC (permalink / raw)
  To: peter; +Cc: netdev, joseph_chang
In-Reply-To: <1380576500-8531-1-git-send-email-peter@korsgaard.com>

From: Peter Korsgaard <peter@korsgaard.com>
Date: Mon, 30 Sep 2013 23:28:20 +0200

> Pass-all-multicast is controlled by bit 3 in RX control, not bit 2
> (pass undersized frames).
> 
> Reported-by: Joseph Chang <joseph_chang@davicom.com.tw>
> Signed-off-by: Peter Korsgaard <peter@korsgaard.com>

Applied, thanks.

It would be so much better if these register values were all
properly documented, one by one, with macros.
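
As a rough illustration of that suggestion, named bits for the RX control
register might look like the sketch below. Only the two bit positions come
from the patch description above; the macro names and the helper are
hypothetical and are not taken from the driver.

```c
#include <stdint.h>

/* Hypothetical macro names; bit positions as stated in the patch description. */
#define RX_CTL_PASS_RUNT	(1u << 2)	/* pass undersized frames */
#define RX_CTL_PASS_ALLMULTI	(1u << 3)	/* pass all multicast */

/* Hypothetical helper: return rx_ctl with the pass-all-multicast bit
 * set or cleared, leaving every other bit untouched.
 */
static uint8_t rx_ctl_set_allmulti(uint8_t rx_ctl, int enable)
{
	if (enable)
		return (uint8_t)(rx_ctl | RX_CTL_PASS_ALLMULTI);
	return (uint8_t)(rx_ctl & ~RX_CTL_PASS_ALLMULTI);
}
```

With names like these, confusing bit 2 with bit 3 would show up as using the
wrong macro name rather than a wrong magic number.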

* [PATCH V3 net-next] fib_trie: avoid a redundant bit judgement in inflate
From: baker.kernel @ 2013-09-30 23:45 UTC (permalink / raw)
  To: davem; +Cc: kuznet, jmorris, yoshfuji, kaber, netdev, linux-kernel,
	baker.zhang

From: "baker.zhang" <baker.kernel@gmail.com>

Because 'node' is the i-th child of 'oldtnode',
'i' here equals
tkey_extract_bits(node->key, oldtnode->pos, oldtnode->bits)

We just extract 1 more bit,
and need not test the value of that bit.

I apologize for the mistake.

I generated the patch on a branch version
and did not notice that put_child had been changed.

I have redone the test on the HEAD version with my patch.

Two cases are used.
case 1: inflate a node which has a leaf child node.
case 2: inflate a node which has a child node with skipped bits.

test env:
  ip link set eth0 up
  ip a add dev eth0 192.168.11.1/32
Here we focus only on the main route table (MAIN),
so I use a "192.168.11.1/32" address to simplify the test case.

call trace:
+ fib_insert_node
+ + trie_rebalance
+ + + resize
+ + + + inflate

Test case 1:  inflate a node which has a leaf child node.
===========================================================
step 1. prepare a fib trie
------------------------------------------
  ip r a 192.168.0.0/24 via 192.168.11.1
  ip r a 192.168.1.0/24 via 192.168.11.1

we get a fib trie.
root@baker:~# cat /proc/net/fib_trie
Main:
  +-- 192.168.0.0/23 1 0 0
   |-- 192.168.0.0
    /24 universe UNICAST
   |-- 192.168.1.0
    /24 universe UNICAST
Local:
.....

step 2. Add the third route
------------------------------------------
root@baker:~# ip r a 192.168.2.0/24 via 192.168.11.1

A fib_trie leaf will be inserted in fib_insert_node before trie_rebalance.

In function 'inflate':
'inflate' is called with the following trie.
  +-- 192.168.0.0/22 1 1 0 <=== tn node
    +-- 192.168.0.0/23 1 0 0    <== node a
        |-- 192.168.0.0
          /24 universe UNICAST
        |-- 192.168.1.0
          /24 universe UNICAST
      |-- 192.168.2.0          <== leaf(node b)

When processing node b, which is a leaf, we have:
i is 1,
node key "192.168.2.0"
oldtnode is (pos:22, bits:1)

unpatched source:
tkey_extract_bits(node->key, oldtnode->pos + oldtnode->bits, 1)
it equals:
tkey_extract_bits("192.168.2.0", 22 + 1, 1)

thus we get 0, and call put_child(tn, 2*i, node); <== 2*i=2.

patched source:
tkey_extract_bits(node->key, oldtnode->pos, oldtnode->bits + 1),
tkey_extract_bits("192.168.2.0", 22, 1 + 1)  <== get 2.

Test case 2:  inflate a node which has a child node with skipped bits
==========================================================================
step 1. prepare a fib trie.
  ip link set eth0 up
  ip a add dev eth0 192.168.11.1/32
  ip r a 192.168.128.0/24 via 192.168.11.1
  ip r a 192.168.0.0/24  via 192.168.11.1
  ip r a 192.168.16.0/24   via 192.168.11.1
  ip r a 192.168.32.0/24  via 192.168.11.1
  ip r a 192.168.48.0/24  via 192.168.11.1
  ip r a 192.168.144.0/24   via 192.168.11.1
  ip r a 192.168.160.0/24   via 192.168.11.1
  ip r a 192.168.176.0/24   via 192.168.11.1

check:
root@baker:~# cat /proc/net/fib_trie
Main:
  +-- 192.168.0.0/16 1 0 0
     +-- 192.168.0.0/18 2 0 0
        |-- 192.168.0.0
           /24 universe UNICAST
        |-- 192.168.16.0
           /24 universe UNICAST
        |-- 192.168.32.0
           /24 universe UNICAST
        |-- 192.168.48.0
           /24 universe UNICAST
     +-- 192.168.128.0/18 2 0 0
        |-- 192.168.128.0
           /24 universe UNICAST
        |-- 192.168.144.0
           /24 universe UNICAST
        |-- 192.168.160.0
           /24 universe UNICAST
        |-- 192.168.176.0
           /24 universe UNICAST
Local:
  ...

step 2. add a route to trigger inflate.
  ip r a 192.168.96.0/24   via 192.168.11.1

This command will call inflate several times.
The first time, the fib_trie is:
________________________
+-- 192.168.128.0/(16, 1) <== tn node
 +-- 192.168.0.0/(17, 1)  <== node a
  +-- 192.168.0.0/(18, 2)
   |-- 192.168.0.0
   |-- 192.168.16.0
   |-- 192.168.32.0
   |-- 192.168.48.0
  |-- 192.168.96.0
 +-- 192.168.128.0/(18, 2) <== node b.
  |-- 192.168.128.0
  |-- 192.168.144.0
  |-- 192.168.160.0
  |-- 192.168.176.0

NOTE: node b is an internal node with skipped bits.
here,
i:1,
node->key "192.168.128.0",
oldtnode:(pos:16, bits:1)
so
tkey_extract_bits(node->key, oldtnode->pos + oldtnode->bits, 1)
it equals:
tkey_extract_bits("192.168.128.0", 16 + 1, 1) <=== 0

tkey_extract_bits(node->key, oldtnode->pos, oldtnode->bits + 1)
it equals:
tkey_extract_bits("192.168.128.0", 16, 1+1) <=== 2

2*i + 0 == 2, so the result is the same.
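
The equivalence argued in both test cases can also be checked in userspace.
This is a minimal sketch, assuming a 32-bit key and a tkey_extract_bits()
modeled on the helper in fib_trie.c (offset counted from the most significant
bit). child_index_old() and child_index_new() are hypothetical names used only
for this illustration, and 'bits' is assumed to be at least 1.

```c
#include <stdint.h>

#define KEYLENGTH 32u

typedef uint32_t t_key;

/* Extract 'bits' bits of 'a', starting 'offset' bits from the MSB. */
static t_key tkey_extract_bits(t_key a, unsigned int offset, unsigned int bits)
{
	if (offset < KEYLENGTH)
		return (t_key)(a << offset) >> (KEYLENGTH - bits);
	return 0;
}

/* Unpatched logic: take the old index i, then branch on the next bit. */
static t_key child_index_old(t_key key, unsigned int pos, unsigned int bits)
{
	t_key i = tkey_extract_bits(key, pos, bits);

	if (tkey_extract_bits(key, pos + bits, 1) == 0)
		return 2 * i;
	return 2 * i + 1;
}

/* Patched logic: extract bits + 1 bits in one step. */
static t_key child_index_new(t_key key, unsigned int pos, unsigned int bits)
{
	return tkey_extract_bits(key, pos, bits + 1);
}
```

For 192.168.2.0 (0xC0A80200) with pos 22, bits 1, and for 192.168.128.0
(0xC0A88000) with pos 16, bits 1, both functions return 2, matching the
numbers in the two test cases.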

Signed-off-by: baker.zhang <baker.kernel@gmail.com>
---
 net/ipv4/fib_trie.c | 9 +++------
 1 file changed, 3 insertions(+), 6 deletions(-)

diff --git a/net/ipv4/fib_trie.c b/net/ipv4/fib_trie.c
index 3df6d3e..45c74ba 100644
--- a/net/ipv4/fib_trie.c
+++ b/net/ipv4/fib_trie.c
@@ -762,12 +762,9 @@ static struct tnode *inflate(struct trie *t, struct tnode *tn)
 
 		if (IS_LEAF(node) || ((struct tnode *) node)->pos >
 		   tn->pos + tn->bits - 1) {
-			if (tkey_extract_bits(node->key,
-					      oldtnode->pos + oldtnode->bits,
-					      1) == 0)
-				put_child(tn, 2*i, node);
-			else
-				put_child(tn, 2*i+1, node);
+			put_child(tn,
+				tkey_extract_bits(node->key, oldtnode->pos, oldtnode->bits + 1),
+				node);
 			continue;
 		}
 
-- 
1.8.1.2

^ permalink raw reply related

* [PATCH] dm9601: fix IFF_ALLMULTI handling
From: Peter Korsgaard @ 2013-09-30 21:28 UTC (permalink / raw)
  To: netdev, davem; +Cc: joseph_chang, Peter Korsgaard

Pass-all-multicast is controlled by bit 3 in RX control, not bit 2
(pass undersized frames).

Reported-by: Joseph Chang <joseph_chang@davicom.com.tw>
Signed-off-by: Peter Korsgaard <peter@korsgaard.com>
---
 drivers/net/usb/dm9601.c |    2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/net/usb/dm9601.c b/drivers/net/usb/dm9601.c
index 2dbb946..c6867f9 100644
--- a/drivers/net/usb/dm9601.c
+++ b/drivers/net/usb/dm9601.c
@@ -303,7 +303,7 @@ static void dm9601_set_multicast(struct net_device *net)
 		rx_ctl |= 0x02;
 	} else if (net->flags & IFF_ALLMULTI ||
 		   netdev_mc_count(net) > DM_MAX_MCAST) {
-		rx_ctl |= 0x04;
+		rx_ctl |= 0x08;
 	} else if (!netdev_mc_empty(net)) {
 		struct netdev_hw_addr *ha;
 
-- 
1.7.10.4

^ permalink raw reply related

* Re: Question regarding failure utilizing bonding mode 5 (balance-tlb)
From: Veaceslav Falico @ 2013-09-30 21:24 UTC (permalink / raw)
  To: Yuval Mintz; +Cc: Jay Vosburgh, netdev@vger.kernel.org, Ariel Elior
In-Reply-To: <979A8436335E3744ADCD3A9F2A2B68A52ADF1DE5@SJEXCHMB10.corp.ad.broadcom.com>

On Mon, Sep 30, 2013 at 11:30:40AM +0000, Yuval Mintz wrote:
>> > >Again, I think the permanent address is restored only when the bond
>> > >releases the slave, which I don't think happens when the slave is unloaded.
>> >
>> > 	Ah, ok, I was understanding "unloaded" to mean "remove from the
>> > bond."  I think you actually mean "set administratively down," e.g., "ip
>> > link set dev slave down" or the like.  I don't think mere loss of
>> > carrier would trigger the sequence of events, because that won't go
>> > through a dev_close / dev_open cycle.
>> >
>> > 	Doing that (an admin down / up bounce) would, indeed, cause a
>> > failover, but the bond will not reprogram the MAC on the slave (it
>> > presumes that a fail / recovery will not disrupt the MAC address, which
>> > is apparently not true in this instance).
>> >
>> > 	I'll have to look at the code a bit, but for now can you confirm
>> > that what you actually mean is, essentially:
>> >
>> > 	Given a bond0 with two slaves, eth0 and eth1, in tlb mode, eth0
>> > being the active,
>> >
>> > 	1) "ip link set dev eth0 down" which will fail over to eth1
>> > 		(swapping the contents of their dev_addr fields).
>> >
>> > 	2) "ip link set dev eth0 up" eth0 comes back up, reprograms its
>> > 		MAC to the wrong thing (what was in dev_addr).
>> >
>> > 	3) repeat steps 1 and 2 for eth1
>> >
>> > 	Is this correct?
>> >
>>
>> Yes, sorry for the earlier confusion.
>> I think in the case described `alb_swap_mac_addr()' will be called,
>> replacing eth0 and eth1's dev_addr, causing eth0 to have dev_addr
>> which defers from  the bond device's. Once eth0 reloads, it will use
>> the different MAC address for configuring FW/HW.
>
>Hi,
>
>Did you by any chance had the time to look at this issue?

Hi Yuval,

Sorry for getting into the discussion - but I've tried to understand the
problem and, possibly, find a fix.

I'm not sure that I completely understand it, and I don't have currently
hardware on which to test it (though I might have it in the nearest
future), so, again, I really am not sure that I won't suggest something
completely stupid.

Anyway, that being said, I hope that the following patch might fix the
problem. I've described the bug and the fix in the changelog, and the code
is pretty self-explanatory.

And even if the patch fixes the issue - I'm not sure that it's the proper
and correct way to do it. But it's definitely worth a try... So, if it's
possible, could you please test this patch and see if it fixes it?

Warning: I've just compile-tested it.

So, FWIW...

 From 87e6c584b0ae0f0261610d60cf83778feb9c1edb Mon Sep 17 00:00:00 2001
From: Veaceslav Falico <vfalico@redhat.com>
Date: Mon, 30 Sep 2013 23:14:23 +0200
Subject: [PATCH] bonding: ensure that TLB mode's active slave has correct mac filter

Currently, in TLB mode we change mac addresses only by memcpy-ing them to
net_device->dev_addr, without actually setting them via
dev_set_mac_address(). This permits us to receive all the traffic always on
one mac address.

However, in case the interface flips, some drivers might enforce the
mac filtering for its FW/HW based on current ->dev_addr, and thus we won't
be able to receive traffic on that interface, in case it will be selected
as active in TLB mode.

Fix it by setting the mac address forcefully on every new active slave that
we select in TLB mode.

CC: Jay Vosburgh <fubar@us.ibm.com>
CC: Andy Gospodarek <andy@greyhouse.net>
Signed-off-by: Veaceslav Falico <vfalico@redhat.com>
---
  drivers/net/bonding/bond_alb.c | 17 +++++++++++++++++
  1 file changed, 17 insertions(+)

diff --git a/drivers/net/bonding/bond_alb.c b/drivers/net/bonding/bond_alb.c
index e960418..576ccea 100644
--- a/drivers/net/bonding/bond_alb.c
+++ b/drivers/net/bonding/bond_alb.c
@@ -1699,6 +1699,23 @@ void bond_alb_handle_active_change(struct bonding *bond, struct slave *new_slave
  
  	ASSERT_RTNL();
  
+	/* in TLB mode, the slave might flip down/up with the old dev_addr,
+	 * and thus filter bond->dev_addr's packets, so force bond's mac
+	 */
+	if (bond->params.mode == BOND_MODE_TLB) {
+		struct sockaddr sa;
+		u8 tmp_addr[ETH_ALEN];
+
+		memcpy(tmp_addr, new_slave->dev->dev_addr, ETH_ALEN);
+
+		memcpy(sa.sa_data, bond->dev->dev_addr, bond->dev->addr_len);
+		sa.sa_family = bond->dev->type;
+		/* we don't care if it can't change its mac, best effort */
+		dev_set_mac_address(new_slave->dev, &sa);
+
+		memcpy(new_slave->dev->dev_addr, tmp_addr, ETH_ALEN);
+	}
+
  	/* curr_active_slave must be set before calling alb_swap_mac_addr */
  	if (swap_slave) {
  		/* swap mac address */
-- 
1.8.4


>
>Thanks,
>Yuval
>
>--
>To unsubscribe from this list: send the line "unsubscribe netdev" in
>the body of a message to majordomo@vger.kernel.org
>More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply related

* Re: [patch net] udp6: respect IPV6_DONTFRAG sockopt in case there are pending frames
From: Hannes Frederic Sowa @ 2013-09-30 21:21 UTC (permalink / raw)
  To: Jiri Pirko, netdev, davem, kuznet, jmorris, kaber, yoshfuji
In-Reply-To: <20130930175640.GF10771@order.stressinduktion.org>

On Mon, Sep 30, 2013 at 07:56:40PM +0200, Hannes Frederic Sowa wrote:
> Hmm, I wonder if we need the same change in ipv6/raw.c.

Nope, ipv6/raw.c is fine.

> Looks good at first sight, but I need to do some more tests.

Strange, loopback traffic is not bound by the frag_size. This fixes it but
should not be necessary. Don't know the reason, yet.

diff --git a/net/ipv6/ip6_output.c b/net/ipv6/ip6_output.c
index a54c45c..9dd136e 100644
--- a/net/ipv6/ip6_output.c
+++ b/net/ipv6/ip6_output.c
@@ -124,7 +124,14 @@ static int ip6_finish_output2(struct sk_buff *skb)
 
 static int ip6_finish_output(struct sk_buff *skb)
 {
-	if ((skb->len > ip6_skb_dst_mtu(skb) && !skb_is_gso(skb)) ||
+	int mtu;
+	struct ipv6_pinfo *np = skb->sk ? inet6_sk(skb->sk) : NULL;
+
+	mtu = ip6_skb_dst_mtu(skb);
+	if (np && np->frag_size && np->frag_size < mtu)
+		mtu = np->frag_size;
+
+	if ((skb->len > mtu && !skb_is_gso(skb)) ||
 	    dst_allfrag(skb_dst(skb)))
 		return ip6_fragment(skb, ip6_finish_output2);
 	else

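The clamp in the diff above boils down to the following. This is only a
sketch with plain integers standing in for the skb and socket state;
effective_mtu() and needs_fragmentation() are hypothetical names, and the
dst_allfrag() part of the real condition is left out.

```c
/* Use the path MTU, lowered to frag_size when that is set and smaller. */
static int effective_mtu(int dst_mtu, int frag_size)
{
	if (frag_size && frag_size < dst_mtu)
		return frag_size;
	return dst_mtu;
}

/* Non-GSO packets longer than the effective MTU get fragmented. */
static int needs_fragmentation(int skb_len, int dst_mtu, int frag_size,
			       int is_gso)
{
	return skb_len > effective_mtu(dst_mtu, frag_size) && !is_gso;
}
```

With a loopback-sized MTU of 65536 and frag_size 1280, a 2000 byte packet is
fragmented only once the clamp is applied; with frag_size 0 it is not.
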

Regarding your patch, this is fine by me:

Acked-by: Hannes Frederic Sowa <hannes@stressinduktion.org>

Thanks,

  Hannes

^ permalink raw reply related

* [PATCH] ip: make -resolve addr to print names rather than addresses
From: Sami Kerola @ 2013-09-30 21:01 UTC (permalink / raw)
  To: netdev; +Cc: kerolasa

As a system admin I occasionally want to be able to check that all
interfaces have a name in DNS or /etc/hosts file.

Signed-off-by: Sami Kerola <kerolasa@iki.fi>
---
 ip/ipaddress.c | 8 ++++----
 1 file changed, 4 insertions(+), 4 deletions(-)

diff --git a/ip/ipaddress.c b/ip/ipaddress.c
index 1c3e4da..d02eaaf 100644
--- a/ip/ipaddress.c
+++ b/ip/ipaddress.c
@@ -636,7 +636,7 @@ int print_addrinfo(const struct sockaddr_nl *who, struct nlmsghdr *n,
 		fprintf(fp, "    family %d ", ifa->ifa_family);
 
 	if (rta_tb[IFA_LOCAL]) {
-		fprintf(fp, "%s", rt_addr_n2a(ifa->ifa_family,
+		fprintf(fp, "%s", format_host(ifa->ifa_family,
 					      RTA_PAYLOAD(rta_tb[IFA_LOCAL]),
 					      RTA_DATA(rta_tb[IFA_LOCAL]),
 					      abuf, sizeof(abuf)));
@@ -647,7 +647,7 @@ int print_addrinfo(const struct sockaddr_nl *who, struct nlmsghdr *n,
 			fprintf(fp, "/%d ", ifa->ifa_prefixlen);
 		} else {
 			fprintf(fp, " peer %s/%d ",
-				rt_addr_n2a(ifa->ifa_family,
+				format_host(ifa->ifa_family,
 					    RTA_PAYLOAD(rta_tb[IFA_ADDRESS]),
 					    RTA_DATA(rta_tb[IFA_ADDRESS]),
 					    abuf, sizeof(abuf)),
@@ -657,14 +657,14 @@ int print_addrinfo(const struct sockaddr_nl *who, struct nlmsghdr *n,
 
 	if (rta_tb[IFA_BROADCAST]) {
 		fprintf(fp, "brd %s ",
-			rt_addr_n2a(ifa->ifa_family,
+			format_host(ifa->ifa_family,
 				    RTA_PAYLOAD(rta_tb[IFA_BROADCAST]),
 				    RTA_DATA(rta_tb[IFA_BROADCAST]),
 				    abuf, sizeof(abuf)));
 	}
 	if (rta_tb[IFA_ANYCAST]) {
 		fprintf(fp, "any %s ",
-			rt_addr_n2a(ifa->ifa_family,
+			format_host(ifa->ifa_family,
 				    RTA_PAYLOAD(rta_tb[IFA_ANYCAST]),
 				    RTA_DATA(rta_tb[IFA_ANYCAST]),
 				    abuf, sizeof(abuf)));
-- 
1.8.4

^ permalink raw reply related

* [PATCH 10/11] tcp: Always set options to 0 before calling tcp_established_options
From: Andi Kleen @ 2013-09-30 20:29 UTC (permalink / raw)
  To: linux-kernel; +Cc: Andi Kleen, netdev
In-Reply-To: <1380572952-30729-1-git-send-email-andi@firstfloor.org>

From: Andi Kleen <ak@linux.intel.com>

tcp_established_options assumes opts->options is 0 before it is called,
as it read-modify-writes it.

For the tcp_current_mss() case the opts structure is not zeroed,
so this can be done with uninitialized values.

This is ok, because ->options is not read in this path.
But it's still better to avoid the operation on the uninitialized
field. This shuts up a static code analyzer, and presumably
may help the optimizer.

Cc: netdev@vger.kernel.org
Signed-off-by: Andi Kleen <ak@linux.intel.com>
---
 net/ipv4/tcp_output.c | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/net/ipv4/tcp_output.c b/net/ipv4/tcp_output.c
index 7c83cb8..f3ed78d 100644
--- a/net/ipv4/tcp_output.c
+++ b/net/ipv4/tcp_output.c
@@ -637,6 +637,8 @@ static unsigned int tcp_established_options(struct sock *sk, struct sk_buff *skb
 	unsigned int size = 0;
 	unsigned int eff_sacks;
 
+	opts->options = 0;
+
 #ifdef CONFIG_TCP_MD5SIG
 	*md5 = tp->af_specific->md5_lookup(sk, sk);
 	if (unlikely(*md5)) {
-- 
1.8.3.1

^ permalink raw reply related

* [PATCH 07/11] igb: Avoid uninitialized advertised variable in eee_set_cur
From: Andi Kleen @ 2013-09-30 20:29 UTC (permalink / raw)
  To: linux-kernel; +Cc: Andi Kleen, jeffrey.t.kirsher, netdev
In-Reply-To: <1380572952-30729-1-git-send-email-andi@firstfloor.org>

From: Andi Kleen <ak@linux.intel.com>

eee_get_cur assumes that the output data is already zeroed. It can
read-modify-write the advertised field:

		if (ipcnfg & E1000_IPCNFG_EEE_100M_AN)
			edata->advertised |= ADVERTISED_100baseT_Full;

This is ok for the normal ethtool eee_get call, which always
zeroes the input data before.

But eee_set_cur also calls eee_get_cur and it did not zero the input
field. Later on it then compares against the field, which can contain partial
stack garbage.

Zero the input field in eee_set_cur() too.
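
The pattern can be shown in isolation. This is a minimal userspace sketch,
not the igb code: struct eee_data, ADV_100_FULL, get_eee() and
eee_advertised_changed() are hypothetical stand-ins. The getter only ORs bits
in, so a caller that compares against the result has to zero the struct first.

```c
#include <stdint.h>
#include <string.h>

#define ADV_100_FULL 0x1u	/* hypothetical advertisement bit */

struct eee_data {
	uint32_t advertised;
};

/* Read-modify-write getter: assumes the caller zeroed *e. */
static void get_eee(struct eee_data *e, int hw_advertises_100m)
{
	if (hw_advertises_100m)
		e->advertised |= ADV_100_FULL;
}

/* Setter-side check: zero the local copy before asking the getter. */
static int eee_advertised_changed(const struct eee_data *requested,
				  int hw_advertises_100m)
{
	struct eee_data cur;

	memset(&cur, 0, sizeof(cur));	/* without this, cur holds stack garbage */
	get_eee(&cur, hw_advertises_100m);
	return cur.advertised != requested->advertised;
}
```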

Cc: jeffrey.t.kirsher@intel.com
Cc: netdev@vger.kernel.org
Signed-off-by: Andi Kleen <ak@linux.intel.com>
---
 drivers/net/ethernet/intel/igb/igb_ethtool.c | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/drivers/net/ethernet/intel/igb/igb_ethtool.c b/drivers/net/ethernet/intel/igb/igb_ethtool.c
index 48cbc83..41e37ff 100644
--- a/drivers/net/ethernet/intel/igb/igb_ethtool.c
+++ b/drivers/net/ethernet/intel/igb/igb_ethtool.c
@@ -2652,6 +2652,8 @@ static int igb_set_eee(struct net_device *netdev,
 	    (hw->phy.media_type != e1000_media_type_copper))
 		return -EOPNOTSUPP;
 
+	memset(&eee_curr, 0, sizeof(struct ethtool_eee));
+
 	ret_val = igb_get_eee(netdev, &eee_curr);
 	if (ret_val)
 		return ret_val;
-- 
1.8.3.1

^ permalink raw reply related

* [RFC PATCH] net: calxedaxgmac: add mac address learning
From: Rob Herring @ 2013-09-30 20:17 UTC (permalink / raw)
  To: linux-kernel, netdev; +Cc: Rob Herring

From: Rob Herring <rob.herring@calxeda.com>

The Calxeda xgmac must learn secondary mac addresses such as those
behind a bridge in order to properly receive packets with those mac
addresses. Add mac address learning on transmit with aging of entries.
The mac addresses are added to the driver's unicast address list, so
they are visible to user via "bridge fdb show" command.
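
The learn-and-age scheme can be sketched outside the driver as follows. This
is an illustration under stated assumptions, not the patch itself: the table
size, the explicit 'used' flag (the patch uses a zero timestamp as the
empty-slot marker) and the caller-supplied 'now' value standing in for jiffies
are all choices made for this example, and every name here is hypothetical.

```c
#include <stdint.h>
#include <string.h>

#define MAC_LEN		6
#define TABLE_SIZE	4	/* hypothetical; the patch sizes by max_macs */
#define AGING_TIMEOUT	120	/* in the same units as 'now' */

struct mac_entry {
	uint8_t addr[MAC_LEN];
	unsigned long time;
	int used;
};

struct mac_table {
	struct mac_entry e[TABLE_SIZE];
	unsigned long drops;	/* counts addresses dropped on a full table */
};

/* Returns 1 if newly learned, 0 if refreshed, -1 if the table is full. */
static int mac_learn(struct mac_table *t, const uint8_t *src, unsigned long now)
{
	int i, slot = -1;

	for (i = 0; i < TABLE_SIZE; i++) {
		if (t->e[i].used && memcmp(t->e[i].addr, src, MAC_LEN) == 0) {
			t->e[i].time = now;	/* already known: refresh */
			return 0;
		}
		if (slot < 0 && !t->e[i].used)
			slot = i;		/* remember first free slot */
	}
	if (slot < 0) {
		t->drops++;
		return -1;
	}
	memcpy(t->e[slot].addr, src, MAC_LEN);
	t->e[slot].time = now;
	t->e[slot].used = 1;
	return 1;
}

/* Drop entries not refreshed within AGING_TIMEOUT; returns how many. */
static int mac_age(struct mac_table *t, unsigned long now)
{
	int i, expired = 0;

	for (i = 0; i < TABLE_SIZE; i++) {
		if (!t->e[i].used || now - t->e[i].time < AGING_TIMEOUT)
			continue;
		memset(&t->e[i], 0, sizeof(t->e[i]));
		expired++;
	}
	return expired;
}
```

In the driver the learn step would additionally add the address to the
unicast list and the age step would remove it, as the patch below does.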

Signed-off-by: Rob Herring <rob.herring@calxeda.com>
---
 drivers/net/ethernet/calxeda/xgmac.c | 84 ++++++++++++++++++++++++++++++++++++
 1 file changed, 84 insertions(+)

diff --git a/drivers/net/ethernet/calxeda/xgmac.c b/drivers/net/ethernet/calxeda/xgmac.c
index 48f5288..2afe8b7 100644
--- a/drivers/net/ethernet/calxeda/xgmac.c
+++ b/drivers/net/ethernet/calxeda/xgmac.c
@@ -360,6 +360,12 @@ struct xgmac_extra_stats {
 	unsigned long rx_process_stopped;
 	unsigned long tx_early;
 	unsigned long fatal_bus_error;
+	unsigned long learning_drop;
+};
+
+struct xgmac_mac {
+	char mac_addr[ETH_ALEN];
+	unsigned long time;
 };
 
 struct xgmac_priv {
@@ -384,6 +390,8 @@ struct xgmac_priv {
 	struct napi_struct napi;
 
 	int max_macs;
+	struct xgmac_mac mac_list[32];
+
 	struct xgmac_extra_stats xstats;
 
 	spinlock_t stats_lock;
@@ -392,8 +400,11 @@ struct xgmac_priv {
 	char tx_pause;
 	int wolopts;
 	struct work_struct tx_timeout_work;
+	struct delayed_work mac_aging_work;
 };
 
+#define XGMAC_AGING_TIMEOUT	(120 * HZ)
+
 /* XGMAC Configuration Settings */
 #define MAX_MTU			9000
 #define PAUSE_TIME		0x400
@@ -1047,6 +1058,8 @@ static int xgmac_open(struct net_device *dev)
 	writel(DMA_INTR_DEFAULT_MASK, ioaddr + XGMAC_DMA_STATUS);
 	writel(DMA_INTR_DEFAULT_MASK, ioaddr + XGMAC_DMA_INTR_ENA);
 
+	schedule_delayed_work(&priv->mac_aging_work, 5 * HZ);
+
 	return 0;
 }
 
@@ -1059,6 +1072,7 @@ static int xgmac_open(struct net_device *dev)
 static int xgmac_stop(struct net_device *dev)
 {
 	struct xgmac_priv *priv = netdev_priv(dev);
+	int i;
 
 	netif_stop_queue(dev);
 
@@ -1073,9 +1087,74 @@ static int xgmac_stop(struct net_device *dev)
 	/* Release and free the Rx/Tx resources */
 	xgmac_free_dma_desc_rings(priv);
 
+	cancel_delayed_work_sync(&priv->mac_aging_work);
+	for (i = 0; i < priv->max_macs; i++) {
+		if (!is_valid_ether_addr(priv->mac_list[i].mac_addr))
+			continue;
+		priv->mac_list[i].time = 0;
+		dev_uc_del(dev, priv->mac_list[i].mac_addr);
+		memset(priv->mac_list[i].mac_addr, 0, ETH_ALEN);
+	}
+
 	return 0;
 }
 
+static void xgmac_mac_aging_work(struct work_struct *work)
+{
+	int i;
+	struct xgmac_priv *priv =
+		container_of(work, struct xgmac_priv, mac_aging_work.work);
+	struct net_device *dev = priv->dev;
+
+	for (i = 0; i < priv->max_macs; i++) {
+		if (time_is_after_jiffies(priv->mac_list[i].time + XGMAC_AGING_TIMEOUT))
+			continue;
+		if (!is_valid_ether_addr(priv->mac_list[i].mac_addr))
+			continue;
+
+		priv->mac_list[i].time = 0;
+		printk("unlearned addr %pM\n", priv->mac_list[i].mac_addr);
+		dev_uc_del(dev, priv->mac_list[i].mac_addr);
+		memset(priv->mac_list[i].mac_addr, 0, ETH_ALEN);
+	}
+
+	schedule_delayed_work(to_delayed_work(work), 5 * HZ);
+}
+
+static void xgmac_learn_mac(struct xgmac_priv *priv, struct sk_buff *skb)
+{
+	struct net_device *dev = priv->dev;
+	struct ethhdr *phdr = (struct ethhdr *)(skb->data);
+	char *src_addr = phdr->h_source;
+	int i, slot = -1;
+
+	if (ether_addr_equal(src_addr, dev->dev_addr) ||
+	    !is_valid_ether_addr(src_addr))
+		return;
+
+	for (i = 0; i < priv->max_macs; i++) {
+		/* update timestamp if we already learned the address */
+		if (ether_addr_equal(priv->mac_list[i].mac_addr, src_addr)) {
+			priv->mac_list[i].time = jiffies;
+			return;
+		}
+		/* find empty slot */
+		if ((slot < 0) && !priv->mac_list[i].time)
+			slot = i;
+	}
+
+	/* Check if we've already filled all slots */
+	if (slot < 0) {
+		priv->xstats.learning_drop++;
+		return;
+	}
+
+	printk("learned addr %pM\n", src_addr);
+	priv->mac_list[slot].time = jiffies;
+	memcpy(priv->mac_list[slot].mac_addr, src_addr, ETH_ALEN);
+	dev_uc_add_excl(dev, src_addr);
+}
+
 /**
  *  xgmac_xmit:
  *  @skb : the socket buffer
@@ -1155,6 +1234,9 @@ static netdev_tx_t xgmac_xmit(struct sk_buff *skb, struct net_device *dev)
 		if (tx_dma_ring_space(priv) > MAX_SKB_FRAGS)
 			netif_start_queue(dev);
 	}
+
+	xgmac_learn_mac(priv, skb);
+
 	return NETDEV_TX_OK;
 
 dma_err:
@@ -1605,6 +1687,7 @@ static const struct xgmac_stats xgmac_gstrings_stats[] = {
 	XGMAC_STAT(rx_ip_header_error),
 	XGMAC_STAT(rx_da_filter_fail),
 	XGMAC_STAT(fatal_bus_error),
+	XGMAC_STAT(learning_drop),
 	XGMAC_HW_STAT(rx_watchdog, XGMAC_MMC_RXWATCHDOG),
 	XGMAC_HW_STAT(tx_vlan, XGMAC_MMC_TXVLANFRAME),
 	XGMAC_HW_STAT(rx_vlan, XGMAC_MMC_RXVLANFRAME),
@@ -1743,6 +1826,7 @@ static int xgmac_probe(struct platform_device *pdev)
 	SET_ETHTOOL_OPS(ndev, &xgmac_ethtool_ops);
 	spin_lock_init(&priv->stats_lock);
 	INIT_WORK(&priv->tx_timeout_work, xgmac_tx_timeout_work);
+	INIT_DELAYED_WORK(&priv->mac_aging_work, xgmac_mac_aging_work);
 
 	priv->device = &pdev->dev;
 	priv->dev = ndev;
-- 
1.8.1.2

^ permalink raw reply related

* [PATCH 3/3] net: calxedaxgmac: determine number of address filters at runtime
From: Rob Herring @ 2013-09-30 20:12 UTC (permalink / raw)
  To: linux-kernel, netdev; +Cc: Rob Herring
In-Reply-To: <1380571937-23439-1-git-send-email-robherring2@gmail.com>

From: Rob Herring <rob.herring@calxeda.com>

Highbank and Midway xgmac h/w have different numbers of MAC address filter
registers: 7 and 31, respectively. Highbank has been wrong, so fix it
and detect the number of filter registers at run-time. Unfortunately,
the version register is the same on both SOCs, so simply test whether a
write to the last filter register takes effect. It always reads as 0 if not.
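
The write-and-read-back probe can be modeled with a fake register file. This
is a sketch only: struct fake_mac and its accessors are hypothetical stand-ins
for writel()/readl() on XGMAC_ADDR_HIGH(n), with unimplemented registers
ignoring writes and reading as 0 as described above.

```c
#include <stdint.h>

#define MAX_REGS 32

/* Fake device: filter registers 0..implemented exist, the rest do not. */
struct fake_mac {
	uint32_t addr_high[MAX_REGS];
	int implemented;
};

static void reg_write(struct fake_mac *m, int idx, uint32_t val)
{
	if (idx <= m->implemented)
		m->addr_high[idx] = val;	/* writes elsewhere are dropped */
}

static uint32_t reg_read(const struct fake_mac *m, int idx)
{
	return idx <= m->implemented ? m->addr_high[idx] : 0;
}

/* Probe as in the patch: if register 31 holds a value, there are 31. */
static int probe_max_macs(struct fake_mac *m)
{
	reg_write(m, 31, 1);
	return reg_read(m, 31) == 1 ? 31 : 7;
}
```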

Signed-off-by: Rob Herring <rob.herring@calxeda.com>
---
 drivers/net/ethernet/calxeda/xgmac.c | 15 +++++++++++----
 1 file changed, 11 insertions(+), 4 deletions(-)

diff --git a/drivers/net/ethernet/calxeda/xgmac.c b/drivers/net/ethernet/calxeda/xgmac.c
index 35da09b..48f5288 100644
--- a/drivers/net/ethernet/calxeda/xgmac.c
+++ b/drivers/net/ethernet/calxeda/xgmac.c
@@ -106,7 +106,6 @@
 #define XGMAC_DMA_HW_FEATURE	0x00000f58	/* Enabled Hardware Features */
 
 #define XGMAC_ADDR_AE		0x80000000
-#define XGMAC_MAX_FILTER_ADDR	31
 
 /* PMT Control and Status */
 #define XGMAC_PMT_POINTER_RESET	0x80000000
@@ -384,6 +383,7 @@ struct xgmac_priv {
 	struct device *device;
 	struct napi_struct napi;
 
+	int max_macs;
 	struct xgmac_extra_stats xstats;
 
 	spinlock_t stats_lock;
@@ -1296,7 +1296,7 @@ static void xgmac_set_rx_mode(struct net_device *dev)
 
 	memset(hash_filter, 0, sizeof(hash_filter));
 
-	if (netdev_uc_count(dev) > XGMAC_MAX_FILTER_ADDR) {
+	if (netdev_uc_count(dev) > priv->max_macs) {
 		use_hash = true;
 		value |= XGMAC_FRAME_FILTER_HUC | XGMAC_FRAME_FILTER_HPF;
 	}
@@ -1319,7 +1319,7 @@ static void xgmac_set_rx_mode(struct net_device *dev)
 		goto out;
 	}
 
-	if ((netdev_mc_count(dev) + reg - 1) > XGMAC_MAX_FILTER_ADDR) {
+	if ((netdev_mc_count(dev) + reg - 1) > priv->max_macs) {
 		use_hash = true;
 		value |= XGMAC_FRAME_FILTER_HMC | XGMAC_FRAME_FILTER_HPF;
 	} else {
@@ -1340,7 +1340,7 @@ static void xgmac_set_rx_mode(struct net_device *dev)
 	}
 
 out:
-	for (i = reg; i <= XGMAC_MAX_FILTER_ADDR; i++)
+	for (i = reg; i <= priv->max_macs; i++)
 		xgmac_set_mac_addr(ioaddr, NULL, i);
 	for (i = 0; i < XGMAC_NUM_HASH; i++)
 		writel(hash_filter[i], ioaddr + XGMAC_HASH(i));
@@ -1759,6 +1759,13 @@ static int xgmac_probe(struct platform_device *pdev)
 	uid = readl(priv->base + XGMAC_VERSION);
 	netdev_info(ndev, "h/w version is 0x%x\n", uid);
 
+	/* Figure out how many valid mac address filter registers we have */
+	writel(1, priv->base + XGMAC_ADDR_HIGH(31));
+	if (readl(priv->base + XGMAC_ADDR_HIGH(31)) == 1)
+		priv->max_macs = 31;
+	else
+		priv->max_macs = 7;
+
 	writel(0, priv->base + XGMAC_DMA_INTR_ENA);
 	ndev->irq = platform_get_irq(pdev, 0);
 	if (ndev->irq == -ENXIO) {
-- 
1.8.1.2

^ permalink raw reply related

* [PATCH 2/3] net: calxedaxgmac: add uc and mc filter addresses in promiscuous mode
From: Rob Herring @ 2013-09-30 20:12 UTC (permalink / raw)
  To: linux-kernel, netdev; +Cc: Rob Herring
In-Reply-To: <1380571937-23439-1-git-send-email-robherring2@gmail.com>

From: Rob Herring <rob.herring@calxeda.com>

Even in promiscuous mode, we need to add filter addresses for correct
operation. This fixes silent failures when using a bridge and adding
addresses using the "bridge fdb add" command.

Signed-off-by: Rob Herring <rob.herring@calxeda.com>
---
 drivers/net/ethernet/calxeda/xgmac.c | 6 ++----
 1 file changed, 2 insertions(+), 4 deletions(-)

diff --git a/drivers/net/ethernet/calxeda/xgmac.c b/drivers/net/ethernet/calxeda/xgmac.c
index 94358d2..35da09b 100644
--- a/drivers/net/ethernet/calxeda/xgmac.c
+++ b/drivers/net/ethernet/calxeda/xgmac.c
@@ -1291,10 +1291,8 @@ static void xgmac_set_rx_mode(struct net_device *dev)
 	netdev_dbg(priv->dev, "# mcasts %d, # unicast %d\n",
 		 netdev_mc_count(dev), netdev_uc_count(dev));
 
-	if (dev->flags & IFF_PROMISC) {
-		writel(XGMAC_FRAME_FILTER_PR, ioaddr + XGMAC_FRAME_FILTER);
-		return;
-	}
+	if (dev->flags & IFF_PROMISC)
+		value |= XGMAC_FRAME_FILTER_PR;
 
 	memset(hash_filter, 0, sizeof(hash_filter));
 
-- 
1.8.1.2

^ permalink raw reply related

* [PATCH 1/3] net: calxedaxgmac: fix clearing of old filter addresses
From: Rob Herring @ 2013-09-30 20:12 UTC (permalink / raw)
  To: linux-kernel, netdev; +Cc: Rob Herring
In-Reply-To: <1380571937-23439-1-git-send-email-robherring2@gmail.com>

From: Rob Herring <rob.herring@calxeda.com>

In commit 2ee68f621af280 (net: calxedaxgmac: fix various errors in
xgmac_set_rx_mode), a fix to clean-up old address entries was added.
However, the loop to zero out the entries failed to increment the register
address resulting in only 1 entry getting cleared. Fix this to correctly
use the loop index. Also, the end of the loop condition was off by 1 and
should have been <= rather than <.
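
Both mistakes are easy to see on a plain array. In this minimal sketch the
array of 'valid' flags stands in for the filter registers, and the function
names are hypothetical.

```c
#define MAX_FILTER_ADDR 31	/* highest valid filter register index */

/* Old loop: always clears entry 'reg' and stops one short of the end. */
static void clear_old_buggy(unsigned char *valid, int reg)
{
	int i;

	for (i = reg; i < MAX_FILTER_ADDR; i++)
		valid[reg] = 0;
}

/* Fixed loop: clears entries reg..MAX_FILTER_ADDR using the loop index. */
static void clear_old_fixed(unsigned char *valid, int reg)
{
	int i;

	for (i = reg; i <= MAX_FILTER_ADDR; i++)
		valid[i] = 0;
}

/* Count entries that are still marked valid. */
static int count_valid(const unsigned char *valid)
{
	int i, n = 0;

	for (i = 0; i <= MAX_FILTER_ADDR; i++)
		n += valid[i] != 0;
	return n;
}
```

Starting from 32 valid entries and reg = 3, the old loop leaves 31 entries
set, while the fixed loop leaves only entries 0 to 2.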

Signed-off-by: Rob Herring <rob.herring@calxeda.com>
---
 drivers/net/ethernet/calxeda/xgmac.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/drivers/net/ethernet/calxeda/xgmac.c b/drivers/net/ethernet/calxeda/xgmac.c
index 78d6d6b..94358d2 100644
--- a/drivers/net/ethernet/calxeda/xgmac.c
+++ b/drivers/net/ethernet/calxeda/xgmac.c
@@ -1342,8 +1342,8 @@ static void xgmac_set_rx_mode(struct net_device *dev)
 	}
 
 out:
-	for (i = reg; i < XGMAC_MAX_FILTER_ADDR; i++)
-		xgmac_set_mac_addr(ioaddr, NULL, reg);
+	for (i = reg; i <= XGMAC_MAX_FILTER_ADDR; i++)
+		xgmac_set_mac_addr(ioaddr, NULL, i);
 	for (i = 0; i < XGMAC_NUM_HASH; i++)
 		writel(hash_filter[i], ioaddr + XGMAC_HASH(i));
 
-- 
1.8.1.2

^ permalink raw reply related

* [PATCH 0/3] calxedaxgmac: fixes for xgmac_set_rx_mode
From: Rob Herring @ 2013-09-30 20:12 UTC (permalink / raw)
  To: linux-kernel, netdev; +Cc: Rob Herring

From: Rob Herring <rob.herring@calxeda.com>

This is a couple of fixes related to xgmac_set_rx_mode. The changes are
necessary for "bridge fdb add" to work correctly.

Rob

Rob Herring (3):
  net: calxedaxgmac: fix clearing of old filter addresses
  net: calxedaxgmac: add uc and mc filter addresses in promiscuous mode
  net: calxedaxgmac: determine number of address filters at runtime

 drivers/net/ethernet/calxeda/xgmac.c | 23 ++++++++++++++---------
 1 file changed, 14 insertions(+), 9 deletions(-)

-- 
1.8.1.2

^ permalink raw reply

* [PATCH net] unix_diag: fix info leak
From: Mathias Krause @ 2013-09-30 20:05 UTC (permalink / raw)
  To: David S. Miller; +Cc: Mathias Krause, netdev

When filling the netlink message we fail to wipe the pad field,
and therefore leak one byte of heap memory to userland. Fix this by
setting pad to 0.
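
A minimal sketch of the leak and the fix, using a hypothetical struct
diag_msg that only models the start of the reply (three one-byte fields, one
pad byte, then a 32-bit inode number). Any field the fill routine skips keeps
whatever the buffer held before, which is what reaches userland.

```c
#include <stdint.h>

/* Hypothetical model of the reply header; not the kernel definition. */
struct diag_msg {
	uint8_t family;
	uint8_t type;
	uint8_t state;
	uint8_t pad;
	uint32_t ino;
};

/* Fill every field, including the padding byte, before sending. */
static void fill_msg(struct diag_msg *rep, uint8_t type, uint8_t state,
		     uint32_t ino)
{
	rep->family = 1;	/* stands in for AF_UNIX */
	rep->type = type;
	rep->state = state;
	rep->pad = 0;		/* the one-line fix: no stale byte left behind */
	rep->ino = ino;
}
```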

Signed-off-by: Mathias Krause <minipli@googlemail.com>
---
Probably material for stable as well (v3.3+).

 net/unix/diag.c |    1 +
 1 file changed, 1 insertion(+)

diff --git a/net/unix/diag.c b/net/unix/diag.c
index d591091..86fa0f3 100644
--- a/net/unix/diag.c
+++ b/net/unix/diag.c
@@ -124,6 +124,7 @@ static int sk_diag_fill(struct sock *sk, struct sk_buff *skb, struct unix_diag_r
 	rep->udiag_family = AF_UNIX;
 	rep->udiag_type = sk->sk_type;
 	rep->udiag_state = sk->sk_state;
+	rep->pad = 0;
 	rep->udiag_ino = sk_ino;
 	sock_diag_save_cookie(sk, rep->udiag_cookie);
 
-- 
1.7.10.4

^ permalink raw reply related

* [PATCH 4/4] connector - documentation: simplify netlink message length assignment
From: Mathias Krause @ 2013-09-30 20:03 UTC (permalink / raw)
  To: Evgeniy Polyakov; +Cc: Mathias Krause, netdev
In-Reply-To: <1380571389-15343-1-git-send-email-minipli@googlemail.com>

Use the precalculated size instead of obfuscating the message length
calculation by first subtracting the netlink header length from size
and then use the NLMSG_LENGTH() macro to add it back again.

Signed-off-by: Mathias Krause <minipli@googlemail.com>
---
 Documentation/connector/ucon.c |    2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/Documentation/connector/ucon.c b/Documentation/connector/ucon.c
index 4848db8..8a4da64 100644
--- a/Documentation/connector/ucon.c
+++ b/Documentation/connector/ucon.c
@@ -71,7 +71,7 @@ static int netlink_send(int s, struct cn_msg *msg)
 	nlh->nlmsg_seq = seq++;
 	nlh->nlmsg_pid = getpid();
 	nlh->nlmsg_type = NLMSG_DONE;
-	nlh->nlmsg_len = NLMSG_LENGTH(size - sizeof(*nlh));
+	nlh->nlmsg_len = size;
 	nlh->nlmsg_flags = 0;
 
 	m = NLMSG_DATA(nlh);
-- 
1.7.10.4

^ permalink raw reply related

* [PATCH 2/4] connector: use nlmsg_len() to check message length
From: Mathias Krause @ 2013-09-30 20:03 UTC (permalink / raw)
  To: Evgeniy Polyakov; +Cc: Mathias Krause, netdev
In-Reply-To: <1380571389-15343-1-git-send-email-minipli@googlemail.com>

The current code tests the length of the whole netlink message to be
at least as long to fit a cn_msg. This is wrong as nlmsg_len includes
the length of the netlink message header. Use nlmsg_len() instead to
fix this "off-by-NLMSG_HDRLEN" size check.
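
The size of the error is easy to show with numbers. In this sketch HDRLEN
models NLMSG_HDRLEN (16 bytes) and CN_MSG_SIZE models sizeof(struct cn_msg)
(20 bytes); both values and all function names are assumptions made for the
illustration, and the skb->len and maximum-size checks are left out.

```c
#include <stdint.h>

#define HDRLEN		16	/* models NLMSG_HDRLEN */
#define CN_MSG_SIZE	20	/* models sizeof(struct cn_msg) */

/* Payload length, as nlmsg_len() computes it: total minus header. */
static int payload_len(uint32_t nlmsg_len)
{
	return (int)nlmsg_len - HDRLEN;
}

/* Old check: compares the total length, header included. */
static int accept_old(uint32_t nlmsg_len)
{
	return nlmsg_len >= CN_MSG_SIZE;
}

/* New check: compares only the payload against the cn_msg size. */
static int accept_new(uint32_t nlmsg_len)
{
	return payload_len(nlmsg_len) >= CN_MSG_SIZE;
}
```

A message with nlmsg_len 24 carries only 8 bytes of payload, yet passes the
old check; the new check rejects it and accepts 36 and up.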

Cc: stable@vger.kernel.org  # v2.6.14+
Signed-off-by: Mathias Krause <minipli@googlemail.com>
---
 drivers/connector/connector.c |    7 ++++---
 1 file changed, 4 insertions(+), 3 deletions(-)

diff --git a/drivers/connector/connector.c b/drivers/connector/connector.c
index 6ecfa75..0daa11e 100644
--- a/drivers/connector/connector.c
+++ b/drivers/connector/connector.c
@@ -157,17 +157,18 @@ static int cn_call_callback(struct sk_buff *skb)
 static void cn_rx_skb(struct sk_buff *__skb)
 {
 	struct nlmsghdr *nlh;
-	int err;
 	struct sk_buff *skb;
+	int len, err;
 
 	skb = skb_get(__skb);
 
 	if (skb->len >= NLMSG_HDRLEN) {
 		nlh = nlmsg_hdr(skb);
+		len = nlmsg_len(nlh);
 
-		if (nlh->nlmsg_len < sizeof(struct cn_msg) ||
+		if (len < (int)sizeof(struct cn_msg) ||
 		    skb->len < nlh->nlmsg_len ||
-		    nlh->nlmsg_len > CONNECTOR_MAX_MSG_SIZE) {
+		    len > CONNECTOR_MAX_MSG_SIZE) {
 			kfree_skb(skb);
 			return;
 		}
-- 
1.7.10.4

^ permalink raw reply related

* [PATCH 3/4] connector: use 'size' everywhere in cn_netlink_send()
From: Mathias Krause @ 2013-09-30 20:03 UTC (permalink / raw)
  To: Evgeniy Polyakov; +Cc: Mathias Krause, netdev
In-Reply-To: <1380571389-15343-1-git-send-email-minipli@googlemail.com>

We calculated the size for the netlink message buffer as size. Use size
in the memcpy() call as well instead of recalculating it.

Signed-off-by: Mathias Krause <minipli@googlemail.com>
---
 drivers/connector/connector.c |    2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/connector/connector.c b/drivers/connector/connector.c
index 0daa11e..a36749f 100644
--- a/drivers/connector/connector.c
+++ b/drivers/connector/connector.c
@@ -109,7 +109,7 @@ int cn_netlink_send(struct cn_msg *msg, u32 __group, gfp_t gfp_mask)
 
 	data = nlmsg_data(nlh);
 
-	memcpy(data, msg, sizeof(*data) + msg->len);
+	memcpy(data, msg, size);
 
 	NETLINK_CB(skb).dst_group = group;
 
-- 
1.7.10.4

^ permalink raw reply related

