Netdev List

Netdev List
 help / color / mirror / Atom feed

* Re: [PATCH] x86: Run checksumming in parallel accross multiple alu's
From: Neil Horman @ 2013-10-28 16:01 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Eric Dumazet, linux-kernel, sebastien.dugue, Thomas Gleixner,
	Ingo Molnar, H. Peter Anvin, x86, netdev
In-Reply-To: <20131017084121.GC22705@gmail.com>



Ingo, et al.-
	Ok, sorry for the delay, here are the test results you've been asking
for.


First, some information about what I did.  I attached the module that I ran this
test with at the bottom of this email.  You'll note that I started using a
module parameter write patch to trigger the csum rather than the module load
path.  The latter seemed to be giving me lots of variance in my run times, which
I wanted to eliminate.  I attributed it to the module load mechanism itself, and
by using the parameter write path, I was able to get more consistent results.

First, the run time tests:

I ran this command:
for i in `seq 0 1 3`
do
	 echo $i > /sys/module/csum_test/parameters/module_test_mode
	 perf stat --repeat 20 --null echo 1 > echo 1 > /sys/module/csum_test/parameters/test_fire
done

The for loop allows me to chagne the module_test_mode, which is tied to a switch
statement in do_csum that selects which checksumming method we use
(base/prefetch/parallel alu/both).  The results are:


Base:
 Performance counter stats for 'bash -c echo 1 > /sys/module/csum_test/parameters/test_fire' (20 runs):

       0.093269042 seconds time elapsed                                          ( +-  2.24% )

Prefetch (5x64):
 Performance counter stats for 'bash -c echo 1 > /sys/module/csum_test/parameters/test_fire' (20 runs):

       0.079440009 seconds time elapsed                                          ( +-  2.29% )

Parallel ALU:
 Performance counter stats for 'bash -c echo 1 > /sys/module/csum_test/parameters/test_fire' (20 runs):

       0.087666677 seconds time elapsed                                          ( +-  4.01% )

Prefetch + Parallel ALU:
 Performance counter stats for 'bash -c echo 1 > /sys/module/csum_test/parameters/test_fire' (20 runs):

       0.080758702 seconds time elapsed                                          ( +-  2.34% )

So we can see here that we get about a 1% speedup between the base and the both
(Prefetch + Parallel ALU) case, with prefetch accounting for most of that
speedup.

Looking at the specific cpu counters we get this:


Base:
     Total time: 0.179 [sec]

 Performance counter stats for 'perf bench sched messaging -- bash -c echo 1 > /sys/module/csum_test/parameters/test_fire' (20 runs):

       1571.304618 task-clock                #    5.213 CPUs utilized            ( +-  0.45% )
            14,423 context-switches          #    0.009 M/sec                    ( +-  4.28% )
             2,710 cpu-migrations            #    0.002 M/sec                    ( +-  2.83% )
            75,402 page-faults               #    0.048 M/sec                    ( +-  0.07% )
     1,597,349,326 cycles                    #    1.017 GHz                      ( +-  1.74% ) [40.51%]
       104,882,858 stalled-cycles-frontend   #    6.57% frontend cycles idle     ( +-  1.25% ) [40.33%]
     1,043,429,984 stalled-cycles-backend    #   65.32% backend  cycles idle     ( +-  1.25% ) [39.73%]
       868,372,132 instructions              #    0.54  insns per cycle        
                                             #    1.20  stalled cycles per insn  ( +-  1.43% ) [39.88%]
       161,143,820 branches                  #  102.554 M/sec                    ( +-  1.49% ) [39.76%]
         4,348,075 branch-misses             #    2.70% of all branches          ( +-  1.43% ) [39.99%]
       457,042,576 L1-dcache-loads           #  290.868 M/sec                    ( +-  1.25% ) [40.63%]
         8,928,240 L1-dcache-load-misses     #    1.95% of all L1-dcache hits    ( +-  1.26% ) [41.17%]
        15,821,051 LLC-loads                 #   10.069 M/sec                    ( +-  1.56% ) [41.20%]
         4,902,576 LLC-load-misses           #   30.99% of all LL-cache hits     ( +-  1.51% ) [41.36%]
       235,775,688 L1-icache-loads           #  150.051 M/sec                    ( +-  1.39% ) [41.10%]
         3,116,106 L1-icache-load-misses     #    1.32% of all L1-icache hits    ( +-  3.43% ) [40.96%]
       461,315,416 dTLB-loads                #  293.588 M/sec                    ( +-  1.43% ) [41.18%]
           140,280 dTLB-load-misses          #    0.03% of all dTLB cache hits   ( +-  2.30% ) [40.96%]
       236,127,031 iTLB-loads                #  150.275 M/sec                    ( +-  1.63% ) [41.43%]
            46,173 iTLB-load-misses          #    0.02% of all iTLB cache hits   ( +-  3.40% ) [41.11%]
                 0 L1-dcache-prefetches      #    0.000 K/sec                   [40.82%]
                 0 L1-dcache-prefetch-misses #    0.000 K/sec                   [40.37%]

       0.301414024 seconds time elapsed                                          ( +-  0.47% )

Prefetch (5x64):
     Total time: 0.172 [sec]

 Performance counter stats for 'perf bench sched messaging -- bash -c echo 1 > /sys/module/csum_test/parameters/test_fire' (20 runs):

       1565.797128 task-clock                #    5.238 CPUs utilized            ( +-  0.46% )
            13,845 context-switches          #    0.009 M/sec                    ( +-  4.20% )
             2,624 cpu-migrations            #    0.002 M/sec                    ( +-  2.72% )
            75,452 page-faults               #    0.048 M/sec                    ( +-  0.08% )
     1,642,106,355 cycles                    #    1.049 GHz                      ( +-  1.33% ) [40.17%]
       107,786,666 stalled-cycles-frontend   #    6.56% frontend cycles idle     ( +-  1.37% ) [39.90%]
     1,065,286,880 stalled-cycles-backend    #   64.87% backend  cycles idle     ( +-  1.59% ) [39.14%]
       888,815,001 instructions              #    0.54  insns per cycle        
                                             #    1.20  stalled cycles per insn  ( +-  1.29% ) [38.92%]
       163,106,907 branches                  #  104.169 M/sec                    ( +-  1.32% ) [38.93%]
         4,333,456 branch-misses             #    2.66% of all branches          ( +-  1.94% ) [39.77%]
       459,779,806 L1-dcache-loads           #  293.639 M/sec                    ( +-  1.60% ) [40.23%]
         8,827,680 L1-dcache-load-misses     #    1.92% of all L1-dcache hits    ( +-  1.77% ) [41.38%]
        15,556,816 LLC-loads                 #    9.935 M/sec                    ( +-  1.76% ) [41.16%]
         4,885,618 LLC-load-misses           #   31.40% of all LL-cache hits     ( +-  1.40% ) [40.84%]
       236,131,778 L1-icache-loads           #  150.806 M/sec                    ( +-  1.32% ) [40.59%]
         3,037,537 L1-icache-load-misses     #    1.29% of all L1-icache hits    ( +-  2.23% ) [41.13%]
       454,835,028 dTLB-loads                #  290.481 M/sec                    ( +-  1.23% ) [41.34%]
           139,907 dTLB-load-misses          #    0.03% of all dTLB cache hits   ( +-  2.18% ) [41.21%]
       236,357,655 iTLB-loads                #  150.950 M/sec                    ( +-  1.31% ) [41.29%]
            46,633 iTLB-load-misses          #    0.02% of all iTLB cache hits   ( +-  2.74% ) [40.67%]
                 0 L1-dcache-prefetches      #    0.000 K/sec                   [40.16%]
                 0 L1-dcache-prefetch-misses #    0.000 K/sec                   [40.09%]

       0.298948767 seconds time elapsed                                          ( +-  0.36% )

Here it appears everything between the two runs is about the same.  We reduced
the number of dcache misses by a small amount (0.03 percentage points), which is
nice, but I'm not sure would account for the speedup we see in the run time.

Parallel ALU:
     Total time: 0.182 [sec]

 Performance counter stats for 'perf bench sched messaging -- bash -c echo 1 > /sys/module/csum_test/parameters/test_fire' (20 runs):

       1553.544876 task-clock                #    5.217 CPUs utilized            ( +-  0.42% )
            14,066 context-switches          #    0.009 M/sec                    ( +-  6.24% )
             2,831 cpu-migrations            #    0.002 M/sec                    ( +-  3.33% )
            75,432 page-faults               #    0.049 M/sec                    ( +-  0.08% )
     1,659,509,743 cycles                    #    1.068 GHz                      ( +-  1.27% ) [40.10%]
       106,466,680 stalled-cycles-frontend   #    6.42% frontend cycles idle     ( +-  1.50% ) [39.98%]
     1,035,481,957 stalled-cycles-backend    #   62.40% backend  cycles idle     ( +-  1.23% ) [39.38%]
       875,104,201 instructions              #    0.53  insns per cycle        
                                             #    1.18  stalled cycles per insn  ( +-  1.30% ) [38.66%]
       160,553,275 branches                  #  103.346 M/sec                    ( +-  1.32% ) [38.85%]
         4,329,119 branch-misses             #    2.70% of all branches          ( +-  1.39% ) [39.59%]
       448,195,116 L1-dcache-loads           #  288.498 M/sec                    ( +-  1.91% ) [41.07%]
         8,632,347 L1-dcache-load-misses     #    1.93% of all L1-dcache hits    ( +-  1.90% ) [41.56%]
        15,143,145 LLC-loads                 #    9.747 M/sec                    ( +-  1.89% ) [41.05%]
         4,698,204 LLC-load-misses           #   31.03% of all LL-cache hits     ( +-  1.03% ) [41.23%]
       224,316,468 L1-icache-loads           #  144.390 M/sec                    ( +-  1.27% ) [41.39%]
         2,902,842 L1-icache-load-misses     #    1.29% of all L1-icache hits    ( +-  2.65% ) [42.60%]
       433,914,588 dTLB-loads                #  279.306 M/sec                    ( +-  1.75% ) [43.07%]
           132,090 dTLB-load-misses          #    0.03% of all dTLB cache hits   ( +-  2.15% ) [43.12%]
       230,701,361 iTLB-loads                #  148.500 M/sec                    ( +-  1.77% ) [43.47%]
            45,562 iTLB-load-misses          #    0.02% of all iTLB cache hits   ( +-  3.76% ) [42.88%]
                 0 L1-dcache-prefetches      #    0.000 K/sec                   [42.29%]
                 0 L1-dcache-prefetch-misses #    0.000 K/sec                   [41.32%]

       0.297758185 seconds time elapsed                                          ( +-  0.40% )

Here It seems the major advantage was backend stall cycles saved (which makes
sense to me).  Since we split the instruction path into two units that could run
independently of each other we spent less time waiting for prior instructions to
retire.  As a result we dropped two percentage points in our stall number.

Prefetch + Parallel ALU:
     Total time: 0.182 [sec]

 Performance counter stats for 'perf bench sched messaging -- bash -c echo 1 > /sys/module/csum_test/parameters/test_fire' (20 runs):

       1549.171283 task-clock                #    5.231 CPUs utilized            ( +-  0.50% )
            13,717 context-switches          #    0.009 M/sec                    ( +-  4.32% )
             2,721 cpu-migrations            #    0.002 M/sec                    ( +-  2.47% )
            75,432 page-faults               #    0.049 M/sec                    ( +-  0.07% )
     1,579,140,244 cycles                    #    1.019 GHz                      ( +-  1.71% ) [40.06%]
       103,803,034 stalled-cycles-frontend   #    6.57% frontend cycles idle     ( +-  1.74% ) [39.60%]
     1,016,582,613 stalled-cycles-backend    #   64.38% backend  cycles idle     ( +-  1.79% ) [39.57%]
       881,036,653 instructions              #    0.56  insns per cycle        
                                             #    1.15  stalled cycles per insn  ( +-  1.61% ) [39.29%]
       164,333,010 branches                  #  106.078 M/sec                    ( +-  1.51% ) [39.38%]
         4,385,459 branch-misses             #    2.67% of all branches          ( +-  1.62% ) [40.29%]
       463,987,526 L1-dcache-loads           #  299.507 M/sec                    ( +-  1.52% ) [40.20%]
         8,739,535 L1-dcache-load-misses     #    1.88% of all L1-dcache hits    ( +-  1.95% ) [40.37%]
        15,318,497 LLC-loads                 #    9.888 M/sec                    ( +-  1.80% ) [40.43%]
         4,846,148 LLC-load-misses           #   31.64% of all LL-cache hits     ( +-  1.68% ) [40.59%]
       231,982,874 L1-icache-loads           #  149.746 M/sec                    ( +-  1.43% ) [41.25%]
         3,141,106 L1-icache-load-misses     #    1.35% of all L1-icache hits    ( +-  2.32% ) [41.76%]
       459,688,615 dTLB-loads                #  296.732 M/sec                    ( +-  1.75% ) [41.87%]
           138,667 dTLB-load-misses          #    0.03% of all dTLB cache hits   ( +-  1.97% ) [42.31%]
       235,629,204 iTLB-loads                #  152.100 M/sec                    ( +-  1.40% ) [42.04%]
            46,038 iTLB-load-misses          #    0.02% of all iTLB cache hits   ( +-  2.75% ) [41.20%]
                 0 L1-dcache-prefetches      #    0.000 K/sec                   [40.77%]
                 0 L1-dcache-prefetch-misses #    0.000 K/sec                   [40.27%]

       0.296173305 seconds time elapsed                                          ( +-  0.44% )
Here, with both optimizations, we've reduced both our backend stall cycles, and
our dcache miss rate (though our load misses here is higher than it was when we
are just doing parallel ALU execution.  I wonder if the separation of the adcx
path is leading to multiple load requests before the prefetch completes.  I'll
try messing with the stride a bit more to see if I can get some more insight
there.

So there you have it.  I think, looking at this, I can say that its not as big a
win as my initial measurements were indicating, but still a win.

Thoughts?

Regards
Neil

#include <linux/module.h>
#include <linux/kernel.h>
#include <linux/netdevice.h>
#include <linux/etherdevice.h>
#include <linux/init.h>
#include <linux/moduleparam.h>
#include <linux/rtnetlink.h>
#include <net/rtnetlink.h>
#include <linux/u64_stats_sync.h>

#define BUFSIZ 2*1024*1024
#define NBPAGES 16

extern int csum_mode;
int module_test_mode = 0;
int test_fire = 0;

static int __init csum_init_module(void)
{
        return 0;
}

static void __exit csum_cleanup_module(void)
{
        return;
}

static int set_param_str(const char *val, const struct kernel_param *kp)
{
        int i;
        __wsum sum = 0;
        /*u64 start, end;*/
        void *base, *addrs[NBPAGES];
        u32 rnd, offset;

	
        memset(addrs, 0, sizeof(addrs));
        for (i = 0; i < NBPAGES; i++) {
                addrs[i] = kmalloc_node(BUFSIZ, GFP_KERNEL, 0);
                if (!addrs[i])
                        goto out;
        }

	csum_mode = module_test_mode;

	local_bh_disable();
        /*pr_err("STARTING ITERATIONS on cpu %d\n", smp_processor_id());*/
        /*start = ktime_to_ns(ktime_get());*/

        for (i = 0; i < 100000; i++) {
                rnd = prandom_u32();
                base = addrs[rnd % NBPAGES];
                rnd /= NBPAGES;
                offset = rnd % (BUFSIZ - 1500);
                offset &= ~1U;
                sum = csum_partial(base + offset, 1500, sum);
        }
        /*end = ktime_to_ns(ktime_get());*/
        local_bh_enable();

	/*pr_err("COMPLETED 100000 iterations of csum %x in %llu nanosec\n", sum, end - start);*/

	csum_mode = 0;
out:
        for (i = 0; i < NBPAGES; i++)
                kfree(addrs[i]);

        return 0;
}

static int get_param_str(char *buffer, const struct kernel_param *kp)
{
	return sprintf(buffer, "%d\n", test_fire);
}

static struct kernel_param_ops param_ops_str = {
	.set = set_param_str,
	.get = get_param_str,
};

module_param_named(module_test_mode, module_test_mode, int, 0644);
MODULE_PARM_DESC(module_test_mode, "csum test mode");
module_param_cb(test_fire, &param_ops_str, &test_fire, 0644);
module_init(csum_init_module);
module_exit(csum_cleanup_module);
MODULE_LICENSE("GPL");

^ permalink raw reply

* [PATCH net-next v2] net: sched: cls_bpf: add BPF-based classifier
From: Daniel Borkmann @ 2013-10-28 15:43 UTC (permalink / raw)
  To: davem; +Cc: eric.dumazet, netdev, Thomas Graf
In-Reply-To: <cover.1382974535.git.dborkman@redhat.com>

This work contains a lightweight BPF-based traffic classifier that can
serve as a flexible alternative to ematch-based tree classification, i.e.
now that BPF filter engine can also be JITed in the kernel. Naturally, tc
actions and policies are supported as well with cls_bpf. Multiple BPF
programs/filter can be attached for a class, or they can just as well be
written within a single BPF program, that's really up to the user how he
wishes to run/optimize the code, e.g. also for inversion of verdicts etc.
The notion of a BPF program's return/exit codes is being kept as follows:

     0: No match
    -1: Select classid given in "tc filter ..." command
  else: flowid, overwrite the default one

As a minimal usage example with iproute2, we use a 3 band prio root qdisc
on a router with sfq each as leave, and assign ssh and icmp bpf-based
filters to band 1, http traffic to band 2 and the rest to band 3. For the
first two bands we load the bytecode from a file, in the 2nd we load it
inline as an example:

echo 1 > /proc/sys/net/core/bpf_jit_enable

tc qdisc del dev em1 root
tc qdisc add dev em1 root handle 1: prio bands 3 priomap 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1

tc qdisc add dev em1 parent 1:1 sfq perturb 16
tc qdisc add dev em1 parent 1:2 sfq perturb 16
tc qdisc add dev em1 parent 1:3 sfq perturb 16

tc filter add dev em1 parent 1: bpf run bytecode-file /etc/tc/ssh.bpf flowid 1:1
tc filter add dev em1 parent 1: bpf run bytecode-file /etc/tc/icmp.bpf flowid 1:1
tc filter add dev em1 parent 1: bpf run bytecode-file /etc/tc/http.bpf flowid 1:2
tc filter add dev em1 parent 1: bpf run bytecode "`bpfc -f tc -i misc.ops`" flowid 1:3

BPF programs can be easily created and passed to tc, either as inline
'bytecode' or 'bytecode-file'. There are a couple of front-ends that can
compile opcodes, for example:

1) People familiar with tcpdump-like filters:

   tcpdump -iem1 -ddd port 22 | tr '\n' ',' > /etc/tc/ssh.bpf

2) People that want to low-level program their filters or use BPF
   extensions that lack support by libpcap's compiler:

   bpfc -f tc -i ssh.ops > /etc/tc/ssh.bpf

   ssh.ops example code:
   ldh [12]
   jne #0x800, drop
   ldb [23]
   jneq #6, drop
   ldh [20]
   jset #0x1fff, drop
   ldxb 4 * ([14] & 0xf)
   ldh [%x + 14]
   jeq #0x16, pass
   ldh [%x + 16]
   jne #0x16, drop
   pass: ret #-1
   drop: ret #0

It was chosen to load bytecode into tc, since the reverse operation,
tc filter list dev em1, is then able to show the exact commands again.
Possible follow-up work could also include a small expression compiler
for iproute2. Tested with the help of bmon. This idea came up during
the Netfilter Workshop 2013 in Copenhagen. Also thanks to feedback from
Eric Dumazet!

Signed-off-by: Daniel Borkmann <dborkman@redhat.com>
Cc: Thomas Graf <tgraf@suug.ch>
---
 v1->v2:
  - Ok, applied Eric Dumazet's feedback with possibility to overwrite
    default flowid

 iproute2 part: http://patchwork.ozlabs.org/patch/286496/

 include/uapi/linux/pkt_cls.h |  14 ++
 net/sched/Kconfig            |  10 ++
 net/sched/Makefile           |   1 +
 net/sched/cls_bpf.c          | 385 +++++++++++++++++++++++++++++++++++++++++++
 4 files changed, 410 insertions(+)
 create mode 100644 net/sched/cls_bpf.c

diff --git a/include/uapi/linux/pkt_cls.h b/include/uapi/linux/pkt_cls.h
index 082eafa..25731df 100644
--- a/include/uapi/linux/pkt_cls.h
+++ b/include/uapi/linux/pkt_cls.h
@@ -388,6 +388,20 @@ enum {
 
 #define TCA_CGROUP_MAX (__TCA_CGROUP_MAX - 1)
 
+/* BPF classifier */
+
+enum {
+	TCA_BPF_UNSPEC,
+	TCA_BPF_ACT,
+	TCA_BPF_POLICE,
+	TCA_BPF_CLASSID,
+	TCA_BPF_OPS_LEN,
+	TCA_BPF_OPS,
+	__TCA_BPF_MAX,
+};
+
+#define TCA_BPF_MAX (__TCA_BPF_MAX - 1)
+
 /* Extended Matches */
 
 struct tcf_ematch_tree_hdr {
diff --git a/net/sched/Kconfig b/net/sched/Kconfig
index c03a32a..ad1f1d8 100644
--- a/net/sched/Kconfig
+++ b/net/sched/Kconfig
@@ -443,6 +443,16 @@ config NET_CLS_CGROUP
 	  To compile this code as a module, choose M here: the
 	  module will be called cls_cgroup.
 
+config NET_CLS_BPF
+	tristate "BPF-based classifier"
+	select NET_CLS
+	---help---
+	  If you say Y here, you will be able to classify packets based on
+	  programmable BPF (JIT'ed) filters as an alternative to ematches.
+
+	  To compile this code as a module, choose M here: the module will
+	  be called cls_bpf.
+
 config NET_EMATCH
 	bool "Extended Matches"
 	select NET_CLS
diff --git a/net/sched/Makefile b/net/sched/Makefile
index e5f9abe..35fa47a 100644
--- a/net/sched/Makefile
+++ b/net/sched/Makefile
@@ -50,6 +50,7 @@ obj-$(CONFIG_NET_CLS_RSVP6)	+= cls_rsvp6.o
 obj-$(CONFIG_NET_CLS_BASIC)	+= cls_basic.o
 obj-$(CONFIG_NET_CLS_FLOW)	+= cls_flow.o
 obj-$(CONFIG_NET_CLS_CGROUP)	+= cls_cgroup.o
+obj-$(CONFIG_NET_CLS_BPF)	+= cls_bpf.o
 obj-$(CONFIG_NET_EMATCH)	+= ematch.o
 obj-$(CONFIG_NET_EMATCH_CMP)	+= em_cmp.o
 obj-$(CONFIG_NET_EMATCH_NBYTE)	+= em_nbyte.o
diff --git a/net/sched/cls_bpf.c b/net/sched/cls_bpf.c
new file mode 100644
index 0000000..1002a82
--- /dev/null
+++ b/net/sched/cls_bpf.c
@@ -0,0 +1,385 @@
+/*
+ * Berkeley Packet Filter based traffic classifier
+ *
+ * Might be used to classify traffic through flexible, user-defined and
+ * possibly JIT-ed BPF filters for traffic control as an alternative to
+ * ematches.
+ *
+ * (C) 2013 Daniel Borkmann <dborkman@redhat.com>
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License version 2 as
+ * published by the Free Software Foundation.
+ */
+
+#include <linux/module.h>
+#include <linux/types.h>
+#include <linux/skbuff.h>
+#include <linux/filter.h>
+#include <net/rtnetlink.h>
+#include <net/pkt_cls.h>
+#include <net/sock.h>
+
+MODULE_LICENSE("GPL");
+MODULE_AUTHOR("Daniel Borkmann <dborkman@redhat.com>");
+MODULE_DESCRIPTION("TC BPF based classifier");
+
+struct cls_bpf_head {
+	struct list_head plist;
+	u32 hgen;
+};
+
+struct cls_bpf_prog {
+	struct sk_filter *filter;
+	struct sock_filter *bpf_ops;
+	struct tcf_exts exts;
+	struct tcf_result res;
+	struct list_head link;
+	u32 handle;
+	u16 bpf_len;
+};
+
+static const struct nla_policy bpf_policy[TCA_BPF_MAX + 1] = {
+	[TCA_BPF_CLASSID]	= { .type = NLA_U32 },
+	[TCA_BPF_OPS_LEN]	= { .type = NLA_U16 },
+	[TCA_BPF_OPS]		= { .type = NLA_BINARY,
+				    .len = sizeof(struct sock_filter) * BPF_MAXINSNS },
+};
+
+static const struct tcf_ext_map bpf_ext_map = {
+	.action = TCA_BPF_ACT,
+	.police = TCA_BPF_POLICE,
+};
+
+static int cls_bpf_classify(struct sk_buff *skb, const struct tcf_proto *tp,
+			    struct tcf_result *res)
+{
+	struct cls_bpf_head *head = tp->root;
+	struct cls_bpf_prog *prog;
+	int ret;
+
+	list_for_each_entry(prog, &head->plist, link) {
+		int filter_res = SK_RUN_FILTER(prog->filter, skb);
+
+		if (filter_res == 0)
+			continue;
+
+		*res = prog->res;
+		if (filter_res != -1)
+			res->classid = filter_res;
+
+		ret = tcf_exts_exec(skb, &prog->exts, res);
+		if (ret < 0)
+			continue;
+
+		return ret;
+	}
+
+	return -1;
+}
+
+static int cls_bpf_init(struct tcf_proto *tp)
+{
+	struct cls_bpf_head *head;
+
+	head = kzalloc(sizeof(*head), GFP_KERNEL);
+	if (head == NULL)
+		return -ENOBUFS;
+
+	INIT_LIST_HEAD(&head->plist);
+	tp->root = head;
+
+	return 0;
+}
+
+static void cls_bpf_delete_prog(struct tcf_proto *tp, struct cls_bpf_prog *prog)
+{
+	tcf_unbind_filter(tp, &prog->res);
+	tcf_exts_destroy(tp, &prog->exts);
+
+	sk_unattached_filter_destroy(prog->filter);
+
+	kfree(prog->bpf_ops);
+	kfree(prog);
+}
+
+static int cls_bpf_delete(struct tcf_proto *tp, unsigned long arg)
+{
+	struct cls_bpf_head *head = tp->root;
+	struct cls_bpf_prog *prog, *todel = (struct cls_bpf_prog *) arg;
+
+	list_for_each_entry(prog, &head->plist, link) {
+		if (prog == todel) {
+			tcf_tree_lock(tp);
+			list_del(&prog->link);
+			tcf_tree_unlock(tp);
+
+			cls_bpf_delete_prog(tp, prog);
+			return 0;
+		}
+	}
+
+	return -ENOENT;
+}
+
+static void cls_bpf_destroy(struct tcf_proto *tp)
+{
+	struct cls_bpf_head *head = tp->root;
+	struct cls_bpf_prog *prog, *tmp;
+
+	list_for_each_entry_safe(prog, tmp, &head->plist, link) {
+		list_del(&prog->link);
+		cls_bpf_delete_prog(tp, prog);
+	}
+
+	kfree(head);
+}
+
+static unsigned long cls_bpf_get(struct tcf_proto *tp, u32 handle)
+{
+	struct cls_bpf_head *head = tp->root;
+	struct cls_bpf_prog *prog;
+	unsigned long ret = 0UL;
+
+	if (head == NULL)
+		return 0UL;
+
+	list_for_each_entry(prog, &head->plist, link) {
+		if (prog->handle == handle) {
+			ret = (unsigned long) prog;
+			break;
+		}
+	}
+
+	return ret;
+}
+
+static void cls_bpf_put(struct tcf_proto *tp, unsigned long f)
+{
+}
+
+static int cls_bpf_modify_existing(struct net *net, struct tcf_proto *tp,
+				   struct cls_bpf_prog *prog,
+				   unsigned long base, struct nlattr **tb,
+				   struct nlattr *est)
+{
+	struct sock_filter *bpf_ops, *bpf_old;
+	struct tcf_exts exts;
+	struct sock_fprog tmp;
+	struct sk_filter *fp, *fp_old;
+	u16 bpf_size, bpf_len;
+	u32 classid;
+	int ret;
+
+	if (!tb[TCA_BPF_OPS_LEN] || !tb[TCA_BPF_OPS] || !tb[TCA_BPF_CLASSID])
+		return -EINVAL;
+
+	ret = tcf_exts_validate(net, tp, tb, est, &exts, &bpf_ext_map);
+	if (ret < 0)
+		return ret;
+
+	classid = nla_get_u32(tb[TCA_BPF_CLASSID]);
+	bpf_len = nla_get_u16(tb[TCA_BPF_OPS_LEN]);
+	if (bpf_len > BPF_MAXINSNS || bpf_len == 0) {
+		ret = -EINVAL;
+		goto errout;
+	}
+
+	bpf_size = bpf_len * sizeof(*bpf_ops);
+	bpf_ops = kzalloc(bpf_size, GFP_KERNEL);
+	if (bpf_ops == NULL) {
+		ret = -ENOMEM;
+		goto errout;
+	}
+
+	memcpy(bpf_ops, nla_data(tb[TCA_BPF_OPS]), bpf_size);
+
+	tmp.len = bpf_len;
+	tmp.filter = (struct sock_filter __user *) bpf_ops;
+
+	ret = sk_unattached_filter_create(&fp, &tmp);
+	if (ret)
+		goto errout_free;
+
+	tcf_tree_lock(tp);
+	fp_old = prog->filter;
+	bpf_old = prog->bpf_ops;
+
+	prog->bpf_len = bpf_len;
+	prog->bpf_ops = bpf_ops;
+	prog->filter = fp;
+	prog->res.classid = classid;
+	tcf_tree_unlock(tp);
+
+	tcf_bind_filter(tp, &prog->res, base);
+	tcf_exts_change(tp, &prog->exts, &exts);
+
+	if (fp_old)
+		sk_unattached_filter_destroy(fp_old);
+	if (bpf_old)
+		kfree(bpf_old);
+
+	return 0;
+
+errout_free:
+	kfree(bpf_ops);
+errout:
+	tcf_exts_destroy(tp, &exts);
+	return ret;
+}
+
+static u32 cls_bpf_grab_new_handle(struct tcf_proto *tp,
+				   struct cls_bpf_head *head)
+{
+	unsigned int i = 0x80000000;
+
+	do {
+		if (++head->hgen == 0x7FFFFFFF)
+			head->hgen = 1;
+	} while (--i > 0 && cls_bpf_get(tp, head->hgen));
+	if (i == 0)
+		pr_err("Insufficient number of handles\n");
+
+	return i;
+}
+
+static int cls_bpf_change(struct net *net, struct sk_buff *in_skb,
+			  struct tcf_proto *tp, unsigned long base,
+			  u32 handle, struct nlattr **tca,
+			  unsigned long *arg)
+{
+	struct cls_bpf_head *head = tp->root;
+	struct cls_bpf_prog *prog = (struct cls_bpf_prog *) *arg;
+	struct nlattr *tb[TCA_BPF_MAX + 1];
+	int ret;
+
+	if (tca[TCA_OPTIONS] == NULL)
+		return -EINVAL;
+
+	ret = nla_parse_nested(tb, TCA_BPF_MAX, tca[TCA_OPTIONS], bpf_policy);
+	if (ret < 0)
+		return ret;
+
+	if (prog != NULL) {
+		if (handle && prog->handle != handle)
+			return -EINVAL;
+		return cls_bpf_modify_existing(net, tp, prog, base, tb,
+					       tca[TCA_RATE]);
+	}
+
+	prog = kzalloc(sizeof(*prog), GFP_KERNEL);
+	if (prog == NULL)
+		return -ENOBUFS;
+
+	if (handle == 0)
+		prog->handle = cls_bpf_grab_new_handle(tp, head);
+	else
+		prog->handle = handle;
+	if (prog->handle == 0) {
+		ret = -EINVAL;
+		goto errout;
+	}
+
+	ret = cls_bpf_modify_existing(net, tp, prog, base, tb, tca[TCA_RATE]);
+	if (ret < 0)
+		goto errout;
+
+	tcf_tree_lock(tp);
+	list_add(&prog->link, &head->plist);
+	tcf_tree_unlock(tp);
+
+	*arg = (unsigned long) prog;
+
+	return 0;
+errout:
+	if (*arg == 0UL && prog)
+		kfree(prog);
+
+	return ret;
+}
+
+static int cls_bpf_dump(struct tcf_proto *tp, unsigned long fh,
+			struct sk_buff *skb, struct tcmsg *tm)
+{
+	struct cls_bpf_prog *prog = (struct cls_bpf_prog *) fh;
+	struct nlattr *nest, *nla;
+
+	if (prog == NULL)
+		return skb->len;
+
+	tm->tcm_handle = prog->handle;
+
+	nest = nla_nest_start(skb, TCA_OPTIONS);
+	if (nest == NULL)
+		goto nla_put_failure;
+
+	if (nla_put_u32(skb, TCA_BPF_CLASSID, prog->res.classid))
+		goto nla_put_failure;
+	if (nla_put_u16(skb, TCA_BPF_OPS_LEN, prog->bpf_len))
+		goto nla_put_failure;
+
+	nla = nla_reserve(skb, TCA_BPF_OPS, prog->bpf_len *
+			  sizeof(struct sock_filter));
+	if (nla == NULL)
+		goto nla_put_failure;
+
+        memcpy(nla_data(nla), prog->bpf_ops, nla_len(nla));
+
+	if (tcf_exts_dump(skb, &prog->exts, &bpf_ext_map) < 0)
+		goto nla_put_failure;
+
+	nla_nest_end(skb, nest);
+
+	if (tcf_exts_dump_stats(skb, &prog->exts, &bpf_ext_map) < 0)
+		goto nla_put_failure;
+
+	return skb->len;
+
+nla_put_failure:
+	nla_nest_cancel(skb, nest);
+	return -1;
+}
+
+static void cls_bpf_walk(struct tcf_proto *tp, struct tcf_walker *arg)
+{
+	struct cls_bpf_head *head = tp->root;
+	struct cls_bpf_prog *prog;
+
+	list_for_each_entry(prog, &head->plist, link) {
+		if (arg->count < arg->skip)
+			goto skip;
+		if (arg->fn(tp, (unsigned long) prog, arg) < 0) {
+			arg->stop = 1;
+			break;
+		}
+skip:
+		arg->count++;
+	}
+}
+
+static struct tcf_proto_ops cls_bpf_ops __read_mostly = {
+	.kind		=	"bpf",
+	.owner		=	THIS_MODULE,
+	.classify	=	cls_bpf_classify,
+	.init		=	cls_bpf_init,
+	.destroy	=	cls_bpf_destroy,
+	.get		=	cls_bpf_get,
+	.put		=	cls_bpf_put,
+	.change		=	cls_bpf_change,
+	.delete		=	cls_bpf_delete,
+	.walk		=	cls_bpf_walk,
+	.dump		=	cls_bpf_dump,
+};
+
+static int __init cls_bpf_init_mod(void)
+{
+	return register_tcf_proto_ops(&cls_bpf_ops);
+}
+
+static void __exit cls_bpf_exit_mod(void)
+{
+	unregister_tcf_proto_ops(&cls_bpf_ops);
+}
+
+module_init(cls_bpf_init_mod);
+module_exit(cls_bpf_exit_mod);
-- 
1.8.3.1

^ permalink raw reply related

* Re: [PATCH 16/16] wl1251: Add sysfs file address for setting permanent mac address
From: Stephen Hemminger @ 2013-10-28 15:33 UTC (permalink / raw)
  To: Johannes Berg
  Cc: Dan Williams, Pali Rohár, Luciano Coelho, John W. Linville,
	David S. Miller, linux-wireless, netdev, linux-kernel,
	freemangordon, aaro.koskinen, pavel, sre, joni.lapilainen
In-Reply-To: <1382972215.17956.30.camel@jlt4.sipsolutions.net>

On Mon, 28 Oct 2013 15:56:55 +0100
Johannes Berg <johannes@sipsolutions.net> wrote:

> On Mon, 2013-10-28 at 09:46 -0500, Dan Williams wrote:
> 
> > If the device doesn't actually *have* a permanent MAC address, then it
> > shouldn't be returning one via ethtool, and should return an error for
> > ETHTOOL_GPERMADDR.
> 
> There's currently no provision for that, but I agree it would be better
> to do so.
> 
> johannes
> 
> --
> To unsubscribe from this list: send the line "unsubscribe netdev" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html

The current netdevice API handle the case of a device without a permanent
MAC address, slightly differently.

If device does not have a permanent address,
then:
  1. dev->addr_assign_type should not be NET_ADDR_PERM
  2. when device is registered dev->perm_addr will not be set
     and retain all zeros value
  4. when ethtool gets address it will return all zeros which
     is not a valid address.

This case doesn't seem to be handled threo mac80211 API's

^ permalink raw reply

* Re: [PATCH 16/16] wl1251: Add sysfs file address for setting permanent mac address
From: Dan Williams @ 2013-10-28 15:29 UTC (permalink / raw)
  To: Pali Rohár
  Cc: Johannes Berg, Luciano Coelho, John W. Linville, David S. Miller,
	linux-wireless, netdev, linux-kernel, freemangordon,
	aaro.koskinen, pavel, sre, joni.lapilainen
In-Reply-To: <201310281604.48542@pali>

On Mon, 2013-10-28 at 16:04 +0100, Pali Rohár wrote:
> On Monday 28 October 2013 15:56:55 Johannes Berg wrote:
> > On Mon, 2013-10-28 at 09:46 -0500, Dan Williams wrote:
> > > If the device doesn't actually *have* a permanent MAC
> > > address, then it shouldn't be returning one via ethtool,
> > > and should return an error for ETHTOOL_GPERMADDR.
> > 
> > There's currently no provision for that, but I agree it would
> > be better to do so.
> > 
> > johannes
> 
> So what to do with devices which has MAC address stored in some 
> obscure place and there is userspace binary which can read it?
> 
> I think that there should be some way to tell kernel that *this* 
> is the permanent address and it should use it.
> 
> This is reason why I proposed patch which adding sysfs entry for 
> setting permanent address from userspace in wl1251 driver.

Ok, so the N900's MAC is stored in the "cal partition", which is a
region of flash exposed to the OS as /dev/mtd1.  That also stores the
regulatory domain.  Typically userspace binaries are used to pull out
this and other data (see
https://dev.openwrt.org/browser/packages/utils/calvaria/files/src/calvaria.c ) which is then used to initialize the device.

Any idea what kernel driver is used to expose the CAL partition
as /dev/mtd1?  If it's nothing special maybe a small EEPROM-style driver
could be written to read the relevant areas (and since they don't ever
change, don't need to worry about something else writing at the same
time).

Dan

^ permalink raw reply

* Re: [PATCH net-next] net: sched: cls_bpf: add BPF-based classifier
From: Eric Dumazet @ 2013-10-28 15:27 UTC (permalink / raw)
  To: Daniel Borkmann; +Cc: davem, netdev, Thomas Graf
In-Reply-To: <526E7556.1060902@redhat.com>

On Mon, 2013-10-28 at 15:31 +0100, Daniel Borkmann wrote:

> I thought about this, I think this can partially be resolved by the
> user implementing one BPF program to match all possible flows for a
> class instead of implementing multiple BPF programs as mentioned in
> the commit, iow that's up to the user space. And the case you suggest,
> would be another option to further improve this, but would come with
> some difficulties in contrast to the notion 0: mismatch, !0: match.
> I think, this would need additional walk-through through all 'ret'
> opcodes to see if those classes actually exist. Then, we would need
> refcounting and call tcf_{un,}bind_filter() for each class that is
> related to this filter, and tcf_exts_exec() would either need to be
> i) understood in a "per-filter" notion, so as long as something
> matches (!0) exec the very same action/policy (which might not be
> what we want) or ii) could just not be implemented as multiple
> user-defined filters could be defined in iproute2 with different
> actions each, but in some paths return the same flowid. So I think
> this here seems the better trade-off.

I have no idea why you want to do all this.

classid are not checked at filter installation time, but at run time.

All you need to do is change :

+       list_for_each_entry(prog, &head->plist, link) {
+               if (SK_RUN_FILTER(prog->filter, skb) == 0)
+                       continue;
+
+               *res = prog->res;
+               ret = tcf_exts_exec(skb, &prog->exts, res);
+               if (ret < 0)
+                       continue;
+
+               return ret;
+       }

to

	list_for_each_entry(prog, &head->plist, link) {
		int filter_res = SK_RUN_FILTER(prog->filter, skb);

		if (!filter_res)
			continue;


               *res = prog->res;
		if (filter_res != -1)
			res->classid = filter_res;

               ret = tcf_exts_exec(skb, &prog->exts, res);
               if (ret < 0)
                       continue;

               return ret;
       }

(Reserving filter returns codes :
 0 : No match
 -1: select classid given in the "tc filter ...." command)
other values : flowid, overiding the default one.

^ permalink raw reply

* Re: [PATCH 1/3] vxlan: silence one build warning
From: Stephen Hemminger @ 2013-10-28 15:25 UTC (permalink / raw)
  To: David Miller; +Cc: zwu.kernel, netdev, linux-kernel, wuzhy
In-Reply-To: <20131028.003807.1591686723281776238.davem@davemloft.net>

On Mon, 28 Oct 2013 00:38:07 -0400 (EDT)
David Miller <davem@davemloft.net> wrote:

> From: Stephen Hemminger <stephen@networkplumber.org>
> Date: Fri, 25 Oct 2013 08:41:34 -0700
> 
> > I would rather not fix the warning this way since it risks masking
> > later bugs if this code ever changes.
> 
> But this is suboptimally coded, and is asking for the warning.
> 
> Anything returning a pointer by reference is asking for trouble
> in my opinion.
> 
> The correct thing to do is to make create_v{4,6}_sock() return
> the "struct socket *" as an error pointer.
> 
> No more ambiguous initializations, no more warnings.

Agreed, original code used ERR_PTR (see vxlan_socket_create),
the side effect stuff only came with the addition of IPv6.

^ permalink raw reply

* Re: [PATCH v2 1/3] vxlan: silence one build warning
From: Stephen Hemminger @ 2013-10-28 15:24 UTC (permalink / raw)
  To: Zhi Yong Wu; +Cc: netdev, linux-kernel, davem, Zhi Yong Wu
In-Reply-To: <1382940110-10737-1-git-send-email-zwu.kernel@gmail.com>

On Mon, 28 Oct 2013 14:01:48 +0800
Zhi Yong Wu <zwu.kernel@gmail.com> wrote:

> From: Zhi Yong Wu <wuzhy@linux.vnet.ibm.com>
> 
> drivers/net/vxlan.c: In function ‘vxlan_sock_add’:
> drivers/net/vxlan.c:2298:11: warning: ‘sock’ may be used uninitialized in this function [-Wmaybe-uninitialized]
> drivers/net/vxlan.c:2275:17: note: ‘sock’ was declared here
>   LD      drivers/net/built-in.o
> 
> Signed-off-by: Zhi Yong Wu <wuzhy@linux.vnet.ibm.com>

This is much better.

Signed-off-by: Stephen Hemminger <stephen@networkplumber.org>

^ permalink raw reply

* Re: [PATCH 16/16] wl1251: Add sysfs file address for setting permanent mac address
From: Pali Rohár @ 2013-10-28 15:04 UTC (permalink / raw)
  To: Johannes Berg
  Cc: Dan Williams, Luciano Coelho, John W. Linville, David S. Miller,
	linux-wireless, netdev, linux-kernel, freemangordon,
	aaro.koskinen, pavel, sre, joni.lapilainen
In-Reply-To: <1382972215.17956.30.camel@jlt4.sipsolutions.net>

[-- Attachment #1: Type: Text/Plain, Size: 819 bytes --]

On Monday 28 October 2013 15:56:55 Johannes Berg wrote:
> On Mon, 2013-10-28 at 09:46 -0500, Dan Williams wrote:
> > If the device doesn't actually *have* a permanent MAC
> > address, then it shouldn't be returning one via ethtool,
> > and should return an error for ETHTOOL_GPERMADDR.
> 
> There's currently no provision for that, but I agree it would
> be better to do so.
> 
> johannes

So what to do with devices which has MAC address stored in some 
obscure place and there is userspace binary which can read it?

I think that there should be some way to tell kernel that *this* 
is the permanent address and it should use it.

This is reason why I proposed patch which adding sysfs entry for 
setting permanent address from userspace in wl1251 driver.

-- 
Pali Rohár
pali.rohar@gmail.com

[-- Attachment #2: This is a digitally signed message part. --]
[-- Type: application/pgp-signature, Size: 198 bytes --]

^ permalink raw reply

* Re: [PATCH 16/16] wl1251: Add sysfs file address for setting permanent mac address
From: Johannes Berg @ 2013-10-28 14:56 UTC (permalink / raw)
  To: Dan Williams
  Cc: Pali Rohár, Luciano Coelho, John W. Linville,
	David S. Miller, linux-wireless, netdev, linux-kernel,
	freemangordon, aaro.koskinen, pavel, sre, joni.lapilainen
In-Reply-To: <1382971562.1542.6.camel@dcbw.foobar.com>

On Mon, 2013-10-28 at 09:46 -0500, Dan Williams wrote:

> If the device doesn't actually *have* a permanent MAC address, then it
> shouldn't be returning one via ethtool, and should return an error for
> ETHTOOL_GPERMADDR.

There's currently no provision for that, but I agree it would be better
to do so.

johannes

^ permalink raw reply

* Re: [PATCH 16/16] wl1251: Add sysfs file address for setting permanent mac address
From: Dan Williams @ 2013-10-28 14:46 UTC (permalink / raw)
  To: Pali Rohár
  Cc: Johannes Berg, Luciano Coelho, John W. Linville, David S. Miller,
	linux-wireless, netdev, linux-kernel, freemangordon,
	aaro.koskinen, pavel, sre, joni.lapilainen
In-Reply-To: <201310281500.24159@pali>

On Mon, 2013-10-28 at 15:00 +0100, Pali Rohár wrote:
> On Monday 28 October 2013 14:55:22 Johannes Berg wrote:
> > On Mon, 2013-10-28 at 14:49 +0100, Pali Rohár wrote:
> > > On Monday 28 October 2013 14:45:05 Johannes Berg wrote:
> > > > On Sat, 2013-10-26 at 22:34 +0200, Pali Rohár wrote:
> > > > > Driver wl1251 generating mac address randomly at startup
> > > > > and there is no way to set permanent mac address via
> > > > > SET_IEEE80211_PERM_ADDR. This patch export sysfs file
> > > > > which can set permanent mac address by userspace helper
> > > > > program. Patch is needed for devices which do not store
> > > > > mac address in internal wl1251 eeprom.
> > > > 
> > > > This doesn't really seem like a good idea since you can
> > > > also just use 'ip' or whatever to set the MAC address.
> > > > 
> > > > johannes
> > > 
> > > AFAIK you cannot set permanent address (show by ethtool -P
> > > wlan0) via ip/ifconfig.
> > 
> > You probably can't, but that address also doesn't matter at
> > all and isn't really used anywhere.
> > 
> > johannes
> 
> There are some (proprietary) applications which using permanent 
> address for something...
> 
> And also more important: some network managing tools using it. 
> NetworkManager resetting current MAC address to permanent one 
> before starting configuring interface.
> 
> So this will lead to never use correct MAC address (but random 
> permanent one) assigned for that wl1251 wireless card...

If the device doesn't actually *have* a permanent MAC address, then it
shouldn't be returning one via ethtool, and should return an error for
ETHTOOL_GPERMADDR.  Setting the permanent MAC address shouldn't ever be
allowed except by the driver inspecting EEPROM, and certainly not via
sysfs.

mac80211 does have to do some special stuff due to virtual interfaces
and such, but in general, if the MAC address is randomly generated,
neither the driver nor mac80211 should be reporting that address as the
permanent address, only as the current one.

Dan

^ permalink raw reply

* Re: [PATCH net-next] net: sched: cls_bpf: add BPF-based classifier
From: Daniel Borkmann @ 2013-10-28 14:31 UTC (permalink / raw)
  To: Eric Dumazet; +Cc: davem, netdev, Thomas Graf
In-Reply-To: <1382967292.13037.22.camel@edumazet-glaptop.roam.corp.google.com>

On 10/28/2013 02:34 PM, Eric Dumazet wrote:
> On Mon, 2013-10-28 at 12:35 +0100, Daniel Borkmann wrote:
>> This work contains a lightweight BPF-based traffic classifier that can
>> serve as a flexible alternative to ematch-based tree classification, i.e.
>> now that BPF filter engine can also be JITed in the kernel. Naturally, tc
>> actions and policies are supported as well with cls_bpf. Multiple BPF
>> programs/filter can be attached for a class, or they can just as well be
>> written within a single BPF program, that's really up to the user how he
>> wishes to run/optimize the code, e.g. also for inversion of verdicts etc.
>> The notion of a BPF program's return/exit codes is being kept as follows:
>> non-zero for match, zero for mismatch.
>>
>> As a minimal usage example with iproute2, we use a 3 band prio root qdisc
>> on a router with sfq each as leave, and assign ssh and icmp bpf-based
>> filters to band 1, http traffic to band 2 and the rest to band 3. For the
>> first two bands we load the bytecode from a file, in the 2nd we load it
>> inline as an example:
>>
>> echo 1 > /proc/sys/net/core/bpf_jit_enable
>>
>> tc qdisc del dev em1 root
>> tc qdisc add dev em1 root handle 1: prio bands 3 priomap 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
>
>
>> tc qdisc add dev em1 parent 1:1 sfq perturb 16
>> tc qdisc add dev em1 parent 1:2 sfq perturb 16
>> tc qdisc add dev em1 parent 1:3 sfq perturb 16
>>
>> tc filter add dev em1 parent 1: bpf run bytecode-file /etc/tc/ssh.bpf flowid 1:1
>> tc filter add dev em1 parent 1: bpf run bytecode-file /etc/tc/icmp.bpf flowid 1:1
>> tc filter add dev em1 parent 1: bpf run bytecode-file /etc/tc/http.bpf flowid 1:2
>> tc filter add dev em1 parent 1: bpf run bytecode "`bpfc -f tc -i misc.ops`" flowid 1:3
>>
>> BPF programs can be easily created and passed to tc, either as inline
>> 'bytecode' or 'bytecode-file'. There are a couple of front-ends that can
>> compile opcodes, for example:
>>
>> 1) People familiar with tcpdump-like filters:
>>
>>     tcpdump -iem1 -ddd port 22 | tr '\n' ',' > /etc/tc/ssh.bpf
>>
>> 2) People that want to low-level program their filters or use BPF
>>     extensions that lack support by libpcap's compiler:
>>
>>     bpfc -f tc -i ssh.ops > /etc/tc/ssh.bpf
>>
>>     ssh.ops example code:
>>     ldh [12]
>>     jne #0x800, drop
>>     ldb [23]
>>     jneq #6, drop
>>     ldh [20]
>>     jset #0x1fff, drop
>>     ldxb 4 * ([14] & 0xf)
>>     ldh [%x + 14]
>>     jeq #0x16, pass
>>     ldh [%x + 16]
>>     jne #0x16, drop
>>     pass: ret #-1
>>     drop: ret #0
>>
>> It was chosen to load bytecode into tc, since the reverse operation,
>> tc filter list dev em1, is then able to show the exact commands again.
>> Possible follow-up work could also include a small expression compiler
>> for iproute2. Tested with the help of bmon. This idea came up during
>> the Netfilter Workshop 2013 in Copenhagen.
>>
>
> Well, running a large amount of filters might be very expensive [1],
> have you considered returning the flowid from the filter, instead of
> returning 0 and !0 ?
>
> 0 : would mean : not matched filter
> <>0 : flowid

I thought about this, I think this can partially be resolved by the
user implementing one BPF program to match all possible flows for a
class instead of implementing multiple BPF programs as mentioned in
the commit, iow that's up to the user space. And the case you suggest,
would be another option to further improve this, but would come with
some difficulties in contrast to the notion 0: mismatch, !0: match.
I think, this would need additional walk-through through all 'ret'
opcodes to see if those classes actually exist. Then, we would need
refcounting and call tcf_{un,}bind_filter() for each class that is
related to this filter, and tcf_exts_exec() would either need to be
i) understood in a "per-filter" notion, so as long as something
matches (!0) exec the very same action/policy (which might not be
what we want) or ii) could just not be implemented as multiple
user-defined filters could be defined in iproute2 with different
actions each, but in some paths return the same flowid. So I think
this here seems the better trade-off.

^ permalink raw reply

* RE: [PATCH 7/8] rds: Pass pointers to virt_to_page(), not integers
From: Venkat Venkatsubra @ 2013-10-28 14:22 UTC (permalink / raw)
  To: David Miller, ben; +Cc: linux-kernel, rds-devel, netdev
In-Reply-To: <20131028.002654.404759341390215962.davem@davemloft.net>

-----Original Message-----
From: David Miller [mailto:davem@davemloft.net] 
Sent: Sunday, October 27, 2013 11:27 PM
To: ben@decadent.org.uk
Cc: Venkat Venkatsubra; linux-kernel@vger.kernel.org; rds-devel@oss.oracle.com; netdev@vger.kernel.org
Subject: Re: [PATCH 7/8] rds: Pass pointers to virt_to_page(), not integers

From: Ben Hutchings <ben@decadent.org.uk>
Date: Sun, 27 Oct 2013 21:54:16 +0000

> Most architectures define virt_to_page() as a macro that casts its 
> argument such that an argument of type unsigned long will be accepted 
> without complaint.  However, the proper type is void *, and passing 
> unsigned long results in a warning on MIPS.
> 
> Compile-tested only.
> 
> Signed-off-by: Ben Hutchings <ben@decadent.org.uk>

This looks fine:

Acked-by: David S. Miller <davem@davemloft.net>
Acked-by: Venkat Venkatsubra <venkat.x.venkatsubra@oracle.com>

^ permalink raw reply

* Re: [PATCH 16/16] wl1251: Add sysfs file address for setting permanent mac address
From: Pali Rohár @ 2013-10-28 14:00 UTC (permalink / raw)
  To: Johannes Berg
  Cc: Luciano Coelho, John W. Linville, David S. Miller, linux-wireless,
	netdev, linux-kernel, freemangordon, aaro.koskinen, pavel, sre,
	joni.lapilainen
In-Reply-To: <1382968522.17956.11.camel@jlt4.sipsolutions.net>

[-- Attachment #1: Type: Text/Plain, Size: 1415 bytes --]

On Monday 28 October 2013 14:55:22 Johannes Berg wrote:
> On Mon, 2013-10-28 at 14:49 +0100, Pali Rohár wrote:
> > On Monday 28 October 2013 14:45:05 Johannes Berg wrote:
> > > On Sat, 2013-10-26 at 22:34 +0200, Pali Rohár wrote:
> > > > Driver wl1251 generating mac address randomly at startup
> > > > and there is no way to set permanent mac address via
> > > > SET_IEEE80211_PERM_ADDR. This patch export sysfs file
> > > > which can set permanent mac address by userspace helper
> > > > program. Patch is needed for devices which do not store
> > > > mac address in internal wl1251 eeprom.
> > > 
> > > This doesn't really seem like a good idea since you can
> > > also just use 'ip' or whatever to set the MAC address.
> > > 
> > > johannes
> > 
> > AFAIK you cannot set permanent address (show by ethtool -P
> > wlan0) via ip/ifconfig.
> 
> You probably can't, but that address also doesn't matter at
> all and isn't really used anywhere.
> 
> johannes

There are some (proprietary) applications which using permanent 
address for something...

And also more important: some network managing tools using it. 
NetworkManager resetting current MAC address to permanent one 
before starting configuring interface.

So this will lead to never use correct MAC address (but random 
permanent one) assigned for that wl1251 wireless card...

-- 
Pali Rohár
pali.rohar@gmail.com

[-- Attachment #2: This is a digitally signed message part. --]
[-- Type: application/pgp-signature, Size: 198 bytes --]

^ permalink raw reply

* Re: [patch net-next] ipv6: allow userspace to create address with IFLA_F_TEMPORARY flag
From: Vladislav Yasevich @ 2013-10-28 13:56 UTC (permalink / raw)
  To: Jiri Pirko, Vladislav Yasevich, netdev@vger.kernel.org,
	David Miller, Alexey Kuznetsov, jmorris, yoshfuji,
	Patrick McHardy, thaller, Stephen Hemminger
In-Reply-To: <20131027164835.GB4714@order.stressinduktion.org>

On Sun, Oct 27, 2013 at 12:48 PM, Hannes Frederic Sowa
<hannes@stressinduktion.org> wrote:
> Hi Jiri!
>
> On Sun, Oct 27, 2013 at 02:29:41PM +0100, Jiri Pirko wrote:
>> The idea is to provide possibility to do address configuration not in
>> kernel but rather in userspace (as it is done for example in NetworkManager)
>>
>> Maybe I'm missing something, but why is it problem to have the
>> possibility to set lifetime even for temporary prefix?
>
> There is no problem setting the lifetime for a temporary prefix (in
> contrary, it needs one) but the code paths designed for IFA_F_TEMPORARY
> may not fiddle with it. This needs to be checked.
>
> In this constellation addrconf_verify does not refresh the privacy
> address when its preferred lifetime is expired, if you create the
> address by only passing IFA_F_TEMPORARY to rtm_newaddr (as Vlad pointed
> out). E.g. NetworkManager has to take care about that, then.
>
> A temporary address is also bound to a non-privacy public address so
> it's lifetime is determined by its lifetime (e.g. if you switch the
> network and don't receive on-link information for that prefix any
> more). NetworkManager would have to take care about that, too. It is
> just a question of what NetworkManager wants to handle itself or lets
> the kernel handle for it.
>


Exactly.  If we want to re-use the as much of the kernel implementation as
possible, it might be nice to also add a regen notification in case there
is no public address.  This way, Network Manger (or another application)
can do the re-generation if need be for these user configured temporary
addresses.

-vlad


> Greetings,
>
>   Hannes
>

^ permalink raw reply

* Re: [PATCH 16/16] wl1251: Add sysfs file address for setting permanent mac address
From: Johannes Berg @ 2013-10-28 13:55 UTC (permalink / raw)
  To: Pali Rohár
  Cc: Luciano Coelho, John W. Linville, David S. Miller, linux-wireless,
	netdev, linux-kernel, freemangordon, aaro.koskinen, pavel, sre,
	joni.lapilainen
In-Reply-To: <201310281449.58170@pali>

On Mon, 2013-10-28 at 14:49 +0100, Pali Rohár wrote:
> On Monday 28 October 2013 14:45:05 Johannes Berg wrote:
> > On Sat, 2013-10-26 at 22:34 +0200, Pali Rohár wrote:
> > > Driver wl1251 generating mac address randomly at startup and
> > > there is no way to set permanent mac address via
> > > SET_IEEE80211_PERM_ADDR. This patch export sysfs file which
> > > can set permanent mac address by userspace helper program.
> > > Patch is needed for devices which do not store mac address
> > > in internal wl1251 eeprom.
> > 
> > This doesn't really seem like a good idea since you can also
> > just use 'ip' or whatever to set the MAC address.
> > 
> > johannes
> 
> AFAIK you cannot set permanent address (show by ethtool -P wlan0) 
> via ip/ifconfig.

You probably can't, but that address also doesn't matter at all and
isn't really used anywhere.

johannes

^ permalink raw reply

* Re: [PATCH 16/16] wl1251: Add sysfs file address for setting permanent mac address
From: Pali Rohár @ 2013-10-28 13:49 UTC (permalink / raw)
  To: Johannes Berg
  Cc: Luciano Coelho, John W. Linville, David S. Miller, linux-wireless,
	netdev, linux-kernel, freemangordon, aaro.koskinen, pavel, sre,
	joni.lapilainen
In-Reply-To: <1382967905.17956.8.camel@jlt4.sipsolutions.net>

[-- Attachment #1: Type: Text/Plain, Size: 726 bytes --]

On Monday 28 October 2013 14:45:05 Johannes Berg wrote:
> On Sat, 2013-10-26 at 22:34 +0200, Pali Rohár wrote:
> > Driver wl1251 generating mac address randomly at startup and
> > there is no way to set permanent mac address via
> > SET_IEEE80211_PERM_ADDR. This patch export sysfs file which
> > can set permanent mac address by userspace helper program.
> > Patch is needed for devices which do not store mac address
> > in internal wl1251 eeprom.
> 
> This doesn't really seem like a good idea since you can also
> just use 'ip' or whatever to set the MAC address.
> 
> johannes

AFAIK you cannot set permanent address (show by ethtool -P wlan0) 
via ip/ifconfig.

-- 
Pali Rohár
pali.rohar@gmail.com

[-- Attachment #2: This is a digitally signed message part. --]
[-- Type: application/pgp-signature, Size: 198 bytes --]

^ permalink raw reply

* Re: [PATCH 01/16] mac80211: fix TX device statistics for monitor interfaces
From: Johannes Berg @ 2013-10-28 13:47 UTC (permalink / raw)
  To: Pali Rohár
  Cc: Luciano Coelho, John W. Linville, David S. Miller, linux-wireless,
	netdev, linux-kernel, freemangordon, aaro.koskinen, pavel, sre,
	joni.lapilainen, David Gnedt
In-Reply-To: <1382819655-30430-2-git-send-email-pali.rohar@gmail.com>

On Sat, 2013-10-26 at 22:34 +0200, Pali Rohár wrote:

> +	dev->stats.tx_packets++;
> +	dev->stats.tx_bytes += skb->len;

We can still easily drop the packet after this - we can't guarantee
it'll go out but maybe we shouldn't do it here?

johannes

^ permalink raw reply

* Re: [PATCH 15/16] wl1251: Add sysfs file tx_mgmt_frm_rate for setting rate
From: Johannes Berg @ 2013-10-28 13:45 UTC (permalink / raw)
  To: Pali Rohár
  Cc: Luciano Coelho, John W. Linville, David S. Miller, linux-wireless,
	netdev, linux-kernel, freemangordon, aaro.koskinen, pavel, sre,
	joni.lapilainen
In-Reply-To: <1382819655-30430-16-git-send-email-pali.rohar@gmail.com>

On Sat, 2013-10-26 at 22:34 +0200, Pali Rohár wrote:
> This patch was extracted from Maemo 2.6.28 kernel

That's not a description or justification for the patch ....

but again, it seems like a bad idea to use sysfs.

johannes

^ permalink raw reply

* Re: [PATCH 16/16] wl1251: Add sysfs file address for setting permanent mac address
From: Johannes Berg @ 2013-10-28 13:45 UTC (permalink / raw)
  To: Pali Rohár
  Cc: Luciano Coelho, John W. Linville, David S. Miller, linux-wireless,
	netdev, linux-kernel, freemangordon, aaro.koskinen, pavel, sre,
	joni.lapilainen
In-Reply-To: <1382819655-30430-17-git-send-email-pali.rohar@gmail.com>

On Sat, 2013-10-26 at 22:34 +0200, Pali Rohár wrote:
> Driver wl1251 generating mac address randomly at startup and there is no way to
> set permanent mac address via SET_IEEE80211_PERM_ADDR. This patch export sysfs
> file which can set permanent mac address by userspace helper program. Patch is
> needed for devices which do not store mac address in internal wl1251 eeprom.

This doesn't really seem like a good idea since you can also just use
'ip' or whatever to set the MAC address.

johannes

^ permalink raw reply

* [PATCH] bgmac: separate RX descriptor setup code into a new function
From: Rafał Miłecki @ 2013-10-28 13:40 UTC (permalink / raw)
  To: netdev, David S. Miller; +Cc: Rafał Miłecki

This cleans code a bit and will be useful when allocating buffers in
other places (like RX path, to avoid skb_copy_from_linear_data_offset).

Signed-off-by: Rafał Miłecki <zajec5@gmail.com>
---
 drivers/net/ethernet/broadcom/bgmac.c |   41 ++++++++++++++++++---------------
 1 file changed, 22 insertions(+), 19 deletions(-)

diff --git a/drivers/net/ethernet/broadcom/bgmac.c b/drivers/net/ethernet/broadcom/bgmac.c
index e7d98ea..6b7541f 100644
--- a/drivers/net/ethernet/broadcom/bgmac.c
+++ b/drivers/net/ethernet/broadcom/bgmac.c
@@ -269,6 +269,26 @@ static int bgmac_dma_rx_skb_for_slot(struct bgmac *bgmac,
 	return 0;
 }
 
+static void bgmac_dma_rx_setup_desc(struct bgmac *bgmac,
+				    struct bgmac_dma_ring *ring, int desc_idx)
+{
+	struct bgmac_dma_desc *dma_desc = ring->cpu_base + desc_idx;
+	u32 ctl0 = 0, ctl1 = 0;
+
+	if (desc_idx == ring->num_slots - 1)
+		ctl0 |= BGMAC_DESC_CTL0_EOT;
+	ctl1 |= BGMAC_RX_BUF_SIZE & BGMAC_DESC_CTL1_LEN;
+	/* Is there any BGMAC device that requires extension? */
+	/* ctl1 |= (addrext << B43_DMA64_DCTL1_ADDREXT_SHIFT) &
+	 * B43_DMA64_DCTL1_ADDREXT_MASK;
+	 */
+
+	dma_desc->addr_low = cpu_to_le32(lower_32_bits(ring->slots[desc_idx].dma_addr));
+	dma_desc->addr_high = cpu_to_le32(upper_32_bits(ring->slots[desc_idx].dma_addr));
+	dma_desc->ctl0 = cpu_to_le32(ctl0);
+	dma_desc->ctl1 = cpu_to_le32(ctl1);
+}
+
 static int bgmac_dma_rx_read(struct bgmac *bgmac, struct bgmac_dma_ring *ring,
 			     int weight)
 {
@@ -491,8 +511,6 @@ err_dma_free:
 static void bgmac_dma_init(struct bgmac *bgmac)
 {
 	struct bgmac_dma_ring *ring;
-	struct bgmac_dma_desc *dma_desc;
-	u32 ctl0, ctl1;
 	int i;
 
 	for (i = 0; i < BGMAC_MAX_TX_RINGS; i++) {
@@ -525,23 +543,8 @@ static void bgmac_dma_init(struct bgmac *bgmac)
 		if (ring->unaligned)
 			bgmac_dma_rx_enable(bgmac, ring);
 
-		for (j = 0, dma_desc = ring->cpu_base; j < ring->num_slots;
-		     j++, dma_desc++) {
-			ctl0 = ctl1 = 0;
-
-			if (j == ring->num_slots - 1)
-				ctl0 |= BGMAC_DESC_CTL0_EOT;
-			ctl1 |= BGMAC_RX_BUF_SIZE & BGMAC_DESC_CTL1_LEN;
-			/* Is there any BGMAC device that requires extension? */
-			/* ctl1 |= (addrext << B43_DMA64_DCTL1_ADDREXT_SHIFT) &
-			 * B43_DMA64_DCTL1_ADDREXT_MASK;
-			 */
-
-			dma_desc->addr_low = cpu_to_le32(lower_32_bits(ring->slots[j].dma_addr));
-			dma_desc->addr_high = cpu_to_le32(upper_32_bits(ring->slots[j].dma_addr));
-			dma_desc->ctl0 = cpu_to_le32(ctl0);
-			dma_desc->ctl1 = cpu_to_le32(ctl1);
-		}
+		for (j = 0; j < ring->num_slots; j++)
+			bgmac_dma_rx_setup_desc(bgmac, ring, j);
 
 		bgmac_write(bgmac, ring->mmio_base + BGMAC_DMA_RX_INDEX,
 			    ring->index_base +
-- 
1.7.10.4

^ permalink raw reply related

* Re: [PATCH net-next] net: sched: cls_bpf: add BPF-based classifier
From: Eric Dumazet @ 2013-10-28 13:34 UTC (permalink / raw)
  To: Daniel Borkmann; +Cc: davem, netdev, Thomas Graf
In-Reply-To: <1382960120-2198-1-git-send-email-dborkman@redhat.com>

On Mon, 2013-10-28 at 12:35 +0100, Daniel Borkmann wrote:
> This work contains a lightweight BPF-based traffic classifier that can
> serve as a flexible alternative to ematch-based tree classification, i.e.
> now that BPF filter engine can also be JITed in the kernel. Naturally, tc
> actions and policies are supported as well with cls_bpf. Multiple BPF
> programs/filter can be attached for a class, or they can just as well be
> written within a single BPF program, that's really up to the user how he
> wishes to run/optimize the code, e.g. also for inversion of verdicts etc.
> The notion of a BPF program's return/exit codes is being kept as follows:
> non-zero for match, zero for mismatch.
> 
> As a minimal usage example with iproute2, we use a 3 band prio root qdisc
> on a router with sfq each as leave, and assign ssh and icmp bpf-based
> filters to band 1, http traffic to band 2 and the rest to band 3. For the
> first two bands we load the bytecode from a file, in the 2nd we load it
> inline as an example:
> 
> echo 1 > /proc/sys/net/core/bpf_jit_enable
> 
> tc qdisc del dev em1 root
> tc qdisc add dev em1 root handle 1: prio bands 3 priomap 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1


> tc qdisc add dev em1 parent 1:1 sfq perturb 16
> tc qdisc add dev em1 parent 1:2 sfq perturb 16
> tc qdisc add dev em1 parent 1:3 sfq perturb 16
> 
> tc filter add dev em1 parent 1: bpf run bytecode-file /etc/tc/ssh.bpf flowid 1:1
> tc filter add dev em1 parent 1: bpf run bytecode-file /etc/tc/icmp.bpf flowid 1:1
> tc filter add dev em1 parent 1: bpf run bytecode-file /etc/tc/http.bpf flowid 1:2
> tc filter add dev em1 parent 1: bpf run bytecode "`bpfc -f tc -i misc.ops`" flowid 1:3
> 
> BPF programs can be easily created and passed to tc, either as inline
> 'bytecode' or 'bytecode-file'. There are a couple of front-ends that can
> compile opcodes, for example:
> 
> 1) People familiar with tcpdump-like filters:
> 
>    tcpdump -iem1 -ddd port 22 | tr '\n' ',' > /etc/tc/ssh.bpf
> 
> 2) People that want to low-level program their filters or use BPF
>    extensions that lack support by libpcap's compiler:
> 
>    bpfc -f tc -i ssh.ops > /etc/tc/ssh.bpf
> 
>    ssh.ops example code:
>    ldh [12]
>    jne #0x800, drop
>    ldb [23]
>    jneq #6, drop
>    ldh [20]
>    jset #0x1fff, drop
>    ldxb 4 * ([14] & 0xf)
>    ldh [%x + 14]
>    jeq #0x16, pass
>    ldh [%x + 16]
>    jne #0x16, drop
>    pass: ret #-1
>    drop: ret #0
> 
> It was chosen to load bytecode into tc, since the reverse operation,
> tc filter list dev em1, is then able to show the exact commands again.
> Possible follow-up work could also include a small expression compiler
> for iproute2. Tested with the help of bmon. This idea came up during
> the Netfilter Workshop 2013 in Copenhagen.
> 

Well, running a large amount of filters might be very expensive [1],
have you considered returning the flowid from the filter, instead of
returning 0 and !0 ?

0 : would mean : not matched filter
<>0 : flowid

[1] Because of lot of duplicated code in all filters...

^ permalink raw reply

* Re: Bug in skb_segment: fskb->len != len
From: Christoph Paasch @ 2013-10-28 13:28 UTC (permalink / raw)
  To: Eric Dumazet; +Cc: Herbert Xu, netdev
In-Reply-To: <1382966471.13037.18.camel@edumazet-glaptop.roam.corp.google.com>

On 28/10/13 - 06:21:11, Eric Dumazet wrote:
> On Mon, 2013-10-28 at 12:55 +0100, Christoph Paasch wrote:
> > I have been seeing the below BUG in skb_segment with the latest net-next
> > head on my router.
> > 
> > I am forwarding Multipath TCP-traffic on this router. The MPTCP-sender is simply
> > doing an iperf-session. Strangely, I cannot reproduce the bug when sending
> > regular TCP-traffic across the router.
> > Note: The crash happens on a vanilla net-next kernel. It does not has any
> > MPTCP-code in it.
> > 
> > I bisected it down to 8a29111c7c (net: gro: allow to build full sized skb),
> > but I guess 8a29111c7c is just revealing a more fundamental bug in skb_segment.
> > 
> > Some info I found:
> > In skb_segment, when the bug happens, fskb->len is 4284 but the mss and len is 1428.
> 
> fskb seems to contain 3 segments -> 3*1428 = 4284, so it looks fine
> 
> But what do you mean by 'len is 1428' ?

I meant that the variable "len" equals 1428. And thus BUG_ON(fskb->len != len) triggers.

> > Shortly before the bug happens, skb_gro_receive is building a packet where
> > lp->len is equal to 4284 inside the frag_list.
> > 
> > 
> > Seems like skb_segment cannot handle those bigger skb's in the frag_list.
> > 
> 
> Thanks for the report, I'll take a look.
> 
> As mentioned earlier, building very large skbs (with a frag_list) for a
> router makes little sense, because we need to segment them before NIC
> ndo_start_xmit()
> 
> But we also need to fix the skb_segment() bug anyway.
> 
> Thanks !

Let me know if I should provide more info or test a patch.


Cheers,
Christoph

^ permalink raw reply

* Re: Bug in skb_segment: fskb->len != len
From: Eric Dumazet @ 2013-10-28 13:21 UTC (permalink / raw)
  To: Christoph Paasch; +Cc: Herbert Xu, netdev
In-Reply-To: <20131028115552.GC4408@cpaasch-mac>

On Mon, 2013-10-28 at 12:55 +0100, Christoph Paasch wrote:
> Hello,
> 
> I have been seeing the below BUG in skb_segment with the latest net-next
> head on my router.
> 
> I am forwarding Multipath TCP-traffic on this router. The MPTCP-sender is simply
> doing an iperf-session. Strangely, I cannot reproduce the bug when sending
> regular TCP-traffic across the router.
> Note: The crash happens on a vanilla net-next kernel. It does not has any
> MPTCP-code in it.
> 
> I bisected it down to 8a29111c7c (net: gro: allow to build full sized skb),
> but I guess 8a29111c7c is just revealing a more fundamental bug in skb_segment.
> 
> Some info I found:
> In skb_segment, when the bug happens, fskb->len is 4284 but the mss and len is 1428.

fskb seems to contain 3 segments -> 3*1428 = 4284, so it looks fine

But what do you mean by 'len is 1428' ?

> Shortly before the bug happens, skb_gro_receive is building a packet where
> lp->len is equal to 4284 inside the frag_list.
> 
> 
> Seems like skb_segment cannot handle those bigger skb's in the frag_list.
> 

Thanks for the report, I'll take a look.

As mentioned earlier, building very large skbs (with a frag_list) for a
router makes little sense, because we need to segment them before NIC
ndo_start_xmit()

But we also need to fix the skb_segment() bug anyway.

Thanks !

^ permalink raw reply

* Re: [virtio-net] BUG: sleeping function called from invalid context at kernel/mutex.c:616
From: Fengguang Wu @ 2013-10-28 12:49 UTC (permalink / raw)
  To: Jason Wang; +Cc: netdev, linux-kernel, virtualization
In-Reply-To: <526DF9A4.70709@redhat.com>

On Mon, Oct 28, 2013 at 01:44:04PM +0800, Jason Wang wrote:
> On 10/24/2013 07:20 AM, Fengguang Wu wrote:
> > Yes it reduces the sleeping function bug:
> >
> > /kernel/x86_64-lkp-CONFIG_SCSI_DEBUG/7c4ed2767afb813493b0a8fb18d666cd44550963
> >
> > +------------------------------------------------------------------------------------+-----------+--------------+--------------+
> > |                                                                                    | v3.12-rc3 | 3ab098df35f8 | 7c4ed2767afb |
> > +------------------------------------------------------------------------------------+-----------+--------------+--------------+
> > | good_boots                                                                         | 30        | 0            | 93           |
> > | has_kernel_error_warning                                                           | 0         | 20           | 7            |
> > | BUG:sleeping_function_called_from_invalid_context_at_kernel/mutex.c                | 0         | 20           |              |
> > | INFO:rcu_sched_self-detected_stall_on_CPU(t=jiffies_g=c=q=)                        | 0         | 0            | 1            |
> > | INFO:task_blocked_for_more_than_seconds                                            | 0         | 0            | 2            |
> > | INFO:NMI_handler(arch_trigger_all_cpu_backtrace_handler)took_too_long_to_run:msecs | 0         | 0            | 1            |
> > | Kernel_panic-not_syncing:hung_task:blocked_tasks                                   | 0         | 0            | 1            |
> > | BUG:kernel_test_crashed                                                            | 0         | 0            | 3            |
> > | BUG:kernel_test_hang                                                               | 0         | 0            | 1            |
> > +------------------------------------------------------------------------------------+-----------+--------------+--------------+
> >
> > However I'll need to increase tests on v3.12-rc3 to make sure it's not
> > the patch that added the other error messages.
> >
> > Thanks,
> > Fengguang
> 
> Thanks, any more update on the result of v3.12-rc3 for this?

Yes, it's confirmed that there are no new error types in 7c4ed2767afb
comparing to v3.12-rc3, so your fix is fine.

/kernel/x86_64-lkp-CONFIG_SCSI_DEBUG/7c4ed2767afb813493b0a8fb18d666cd44550963

+------------------------------------------------------------------------------------+-----------+--------------+--------------+
|                                                                                    | v3.12-rc3 | 3ab098df35f8 | 7c4ed2767afb |
+------------------------------------------------------------------------------------+-----------+--------------+--------------+
| good_boots                                                                         | 877       | 0            | 93           |
| has_kernel_error_warning                                                           | 154       | 20           | 7            |
| BUG:kernel_test_hang                                                               | 112       | 0            | 1            |
| INFO:rcu_sched_self-detected_stall_on_CPU(t=jiffies_g=c=q=)                        | 1         | 0            | 1            |
| INFO:task_blocked_for_more_than_seconds                                            | 18        | 0            | 2            |
| INFO:NMI_handler(arch_trigger_all_cpu_backtrace_handler)took_too_long_to_run:msecs | 11        | 0            | 1            |
| Kernel_panic-not_syncing:hung_task:blocked_tasks                                   | 11        | 0            | 1            |
| BUG:kernel_test_crashed                                                            | 13        | 0            | 3            |
| Out_of_memory:Kill_process                                                         | 9         |              |              |
| BUG:kernel_early_hang_without_any_printk_output                                    | 1         |              |              |
| page_allocation_failure:order:,mode                                                | 1         |              |              |
| Kernel_panic-not_syncing:Out_of_memory_and_no_killable_processes                   | 2         |              |              |
| BUG:sleeping_function_called_from_invalid_context_at_kernel/mutex.c                | 0         | 20           |              |
+------------------------------------------------------------------------------------+-----------+--------------+--------------+

Thanks,
Fengguang

^ permalink raw reply

* [PATCH net V3] xen-netback: use jiffies_64 value to calculate credit timeout
From: Wei Liu @ 2013-10-28 12:07 UTC (permalink / raw)
  To: xen-devel, netdev
  Cc: david.vrabel, jbeulich, annie.li, Wei Liu, Ian Campbell,
	Jason Luan

time_after_eq() only works if the delta is < MAX_ULONG/2.

For a 32bit Dom0, if netfront sends packets at a very low rate, the time
between subsequent calls to tx_credit_exceeded() may exceed MAX_ULONG/2
and the test for timer_after_eq() will be incorrect. Credit will not be
replenished and the guest may become unable to send packets (e.g., if
prior to the long gap, all credit was exhausted).

Use jiffies_64 variant to mitigate this problem for 32bit Dom0.

Suggested-by: Jan Beulich <jbeulich@suse.com>
Signed-off-by: Wei Liu <wei.liu2@citrix.com>
Reviewed-by: David Vrabel <david.vrabel@citrix.com>
Cc: Ian Campbell <ian.campbell@citrix.com>
Cc: Jason Luan <jianhai.luan@oracle.com>
---
 drivers/net/xen-netback/common.h    |    1 +
 drivers/net/xen-netback/interface.c |    3 +--
 drivers/net/xen-netback/netback.c   |   10 +++++-----
 3 files changed, 7 insertions(+), 7 deletions(-)

diff --git a/drivers/net/xen-netback/common.h b/drivers/net/xen-netback/common.h
index 5715318..400fea1 100644
--- a/drivers/net/xen-netback/common.h
+++ b/drivers/net/xen-netback/common.h
@@ -163,6 +163,7 @@ struct xenvif {
 	unsigned long   credit_usec;
 	unsigned long   remaining_credit;
 	struct timer_list credit_timeout;
+	u64 credit_window_start;
 
 	/* Statistics */
 	unsigned long rx_gso_checksum_fixup;
diff --git a/drivers/net/xen-netback/interface.c b/drivers/net/xen-netback/interface.c
index 01bb854..459935a 100644
--- a/drivers/net/xen-netback/interface.c
+++ b/drivers/net/xen-netback/interface.c
@@ -312,8 +312,7 @@ struct xenvif *xenvif_alloc(struct device *parent, domid_t domid,
 	vif->credit_bytes = vif->remaining_credit = ~0UL;
 	vif->credit_usec  = 0UL;
 	init_timer(&vif->credit_timeout);
-	/* Initialize 'expires' now: it's used to track the credit window. */
-	vif->credit_timeout.expires = jiffies;
+	vif->credit_window_start = get_jiffies_64();
 
 	dev->netdev_ops	= &xenvif_netdev_ops;
 	dev->hw_features = NETIF_F_SG | NETIF_F_IP_CSUM | NETIF_F_TSO;
diff --git a/drivers/net/xen-netback/netback.c b/drivers/net/xen-netback/netback.c
index f3e591c..900da4b 100644
--- a/drivers/net/xen-netback/netback.c
+++ b/drivers/net/xen-netback/netback.c
@@ -1185,9 +1185,8 @@ out:
 
 static bool tx_credit_exceeded(struct xenvif *vif, unsigned size)
 {
-	unsigned long now = jiffies;
-	unsigned long next_credit =
-		vif->credit_timeout.expires +
+	u64 now = get_jiffies_64();
+	u64 next_credit = vif->credit_window_start +
 		msecs_to_jiffies(vif->credit_usec / 1000);
 
 	/* Timer could already be pending in rare cases. */
@@ -1195,8 +1194,8 @@ static bool tx_credit_exceeded(struct xenvif *vif, unsigned size)
 		return true;
 
 	/* Passed the point where we can replenish credit? */
-	if (time_after_eq(now, next_credit)) {
-		vif->credit_timeout.expires = now;
+	if (time_after_eq64(now, next_credit)) {
+		vif->credit_window_start = now;
 		tx_add_credit(vif);
 	}
 
@@ -1208,6 +1207,7 @@ static bool tx_credit_exceeded(struct xenvif *vif, unsigned size)
 			tx_credit_callback;
 		mod_timer(&vif->credit_timeout,
 			  next_credit);
+		vif->credit_window_start = next_credit;
 
 		return true;
 	}
-- 
1.7.10.4

^ permalink raw reply related

page: next (older) | prev (newer) | latest
- recent:[subjects (threaded)|topics (new)|topics (active)]

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox