* Re: Can we rely on ethernet header padding?
From: Eric Dumazet @ 2013-03-19 15:21 UTC
To: Michal Kubecek; +Cc: netdev, netfilter-devel
In-Reply-To: <20130319150545.GA18218@unicorn.suse.cz>
On Tue, 2013-03-19 at 16:05 +0100, Michal Kubecek wrote:
> Hello,
>
> a customer of ours ran into
>
> http://bugzilla.netfilter.org/show_bug.cgi?id=765
>
> They checked that commit a504b86e prevents the crash but I'm not sure it
> is sufficient.
>
> The crash happens when br_nf_pre_routing_finish_bridge() calls
> neigh_hh_bridge() which copies not only destination MAC address but also
> the padding with it. IIUC this is for performance reasons (so that
> aligned 8 bytes are copied rather than 6).
>
> But I wonder whether we can rely on the fact that every skb on an
> ethernet-like device has ethernet header padded at least to the 16 bytes
> expected by neigh_hh_bridge() and neigh_hh_output() or whether the
> bridge code should make sure. I tried to look for such test but couldn't
> find any, even if commit a504b86e description mentions reallocating the
> skb rather than a crash.
That's a side effect.
Before calling netif_rx() the driver usually calls eth_type_trans()
to pull the ethernet header, so there is room for 14 bytes.
Normally a driver has NET_SKB_PAD bytes of headroom before the ethernet
header, so the bridge code is safe only if all drivers use this
NET_SKB_PAD padding on receive side. And they really should for
performance reasons.
Better not touch bridge code to catch offending drivers
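To make the arithmetic in this thread concrete, here is a small user-space model (not kernel code) of the copy Michal describes. It assumes only the numbers quoted above: a 14-byte ethernet header that has already been pulled, and an aligned copy that expects the header to be padded to 16 bytes. The names `bridge_copy_fits`, `ETH_HDR_LEN` and `PADDED_HDR_LEN` are made up for illustration.

```c
#include <assert.h>
#include <stddef.h>

/* Illustrative constants taken from the numbers quoted in this thread;
 * they are not copied from kernel headers. */
#define ETH_HDR_LEN	14u	/* dst MAC (6) + src MAC (6) + ethertype (2) */
#define PADDED_HDR_LEN	16u	/* header rounded up to the aligned size */

/*
 * headroom_before_hdr: bytes the driver left in front of the ethernet
 * header when it filled the receive buffer.
 *
 * Once the header has been pulled, packet data starts ETH_HDR_LEN bytes
 * further in, so the aligned copy begins PADDED_HDR_LEN bytes before the
 * data pointer, i.e. 2 bytes before the ethernet header itself.
 * Returns 1 if that start address is still inside the buffer.
 */
static int bridge_copy_fits(size_t headroom_before_hdr)
{
	size_t room_before_data = headroom_before_hdr + ETH_HDR_LEN;

	assert(PADDED_HDR_LEN >= ETH_HDR_LEN);
	return room_before_data >= PADDED_HDR_LEN;
}
```

In this model a buffer with no headroom at all fails the check by 2 bytes, while any driver that reserves headroom in front of the header passes, which is the point made above about NET_SKB_PAD.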
* Re: [PATCH v2 net-next 1/4] flow_keys: include thoff into flow_keys for later usage
From: Daniel Borkmann @ 2013-03-19 15:06 UTC
To: Eric Dumazet; +Cc: netdev, davem, jasowang
In-Reply-To: <1363705390.2558.3.camel@edumazet-glaptop>
On 03/19/2013 04:03 PM, Eric Dumazet wrote:
> On Tue, 2013-03-19 at 15:34 +0100, Daniel Borkmann wrote:
>> In skb_flow_dissect(), we perform a dissection of a skbuff. Since we're
>> doing the work here anyway, also store thoff for a later usage, e.g. in
>> the BPF filter. Also, by having thoff 16 Bit, we do not need to pack
>> flow_keys and reorder choke_skb_cb.
>>
>> Suggested-by: Eric Dumazet <edumazet@google.com>
>> Signed-off-by: Daniel Borkmann <dborkman@redhat.com>
>> ---
>> This patch also needs to go into the net tree, since Eric or Jason will
>> post a bug fix on top of this one.
>>
>> include/net/flow_keys.h | 1 +
>> net/core/flow_dissector.c | 5 ++++-
>> 2 files changed, 5 insertions(+), 1 deletion(-)
>
> Oh well, you left the choke_skb_cb description in changelog
But it's not the old changelog. ;-)
Also, by having thoff 16 Bit, we do *not* need to pack flow_keys and
reorder choke_skb_cb.
> Acked-by: Eric Dumazet <edumazet@google.com>
* Re: [PATCH v2 net-next 4/4] filter: add minimal BPF JIT emitted image disassembler
From: Eric Dumazet @ 2013-03-19 15:05 UTC
To: Daniel Borkmann; +Cc: netdev, davem, jasowang, Eric Dumazet
In-Reply-To: <18e6abdb8689e02ac099f084d97666c63c22064a.1363702456.git.dborkman@redhat.com>
On Tue, 2013-03-19 at 15:34 +0100, Daniel Borkmann wrote:
> This is a minimal stand-alone user space helper, that allows for debugging or
> verification of emitted BPF JIT images. This is in particular useful for
> emitted opcode debugging, since minor bugs in the JIT compiler can be fatal.
> The disassembler is architecture generic and uses libopcodes and libbfd.
>
> How to get to the disassembly, example:
>
> 1) `echo 2 > /proc/sys/net/core/bpf_jit_enable`
> 2) Load a BPF filter (e.g. `tcpdump -p -n -s 0 -i eth1 host 192.168.20.0/24`)
> 3) Run e.g. `bpf_jit_disasm -o` to disassemble the most recent JIT code output
>
> `bpf_jit_disasm -o` will display the related opcodes to a particular instruction
> as well. Example for x86_64:
>
> $ ./bpf_jit_disasm
> 94 bytes emitted from JIT compiler (pass:3, flen:9)
> ffffffffa0356000 + <x>:
> 0: push %rbp
> 1: mov %rsp,%rbp
> 4: sub $0x60,%rsp
> 8: mov %rbx,-0x8(%rbp)
> c: mov 0x68(%rdi),%r9d
> 10: sub 0x6c(%rdi),%r9d
> 14: mov 0xe0(%rdi),%r8
> 1b: mov $0xc,%esi
> 20: callq 0xffffffffe0d01b71
> 25: cmp $0x86dd,%eax
> 2a: jne 0x000000000000003d
> 2c: mov $0x14,%esi
> 31: callq 0xffffffffe0d01b8d
> 36: cmp $0x6,%eax
> [...]
> 5c: leaveq
> 5d: retq
>
> $ ./bpf_jit_disasm -o
> 94 bytes emitted from JIT compiler (pass:3, flen:9)
> ffffffffa0356000 + <x>:
> 0: push %rbp
> 55
> 1: mov %rsp,%rbp
> 48 89 e5
> 4: sub $0x60,%rsp
> 48 83 ec 60
> 8: mov %rbx,-0x8(%rbp)
> 48 89 5d f8
> c: mov 0x68(%rdi),%r9d
> 44 8b 4f 68
> 10: sub 0x6c(%rdi),%r9d
> 44 2b 4f 6c
> [...]
> 5c: leaveq
> c9
> 5d: retq
> c3
>
> Cc: Eric Dumazet <edumazet@google.com>
> Signed-off-by: Daniel Borkmann <dborkman@redhat.com>
> ---
Very useful, thanks Daniel !
Acked-by: Eric Dumazet <edumazet@google.com>
* Can we rely on ethernet header padding?
From: Michal Kubecek @ 2013-03-19 15:05 UTC
To: netdev; +Cc: netfilter-devel
Hello,
a customer of ours ran into
http://bugzilla.netfilter.org/show_bug.cgi?id=765
They checked that commit a504b86e prevents the crash but I'm not sure it
is sufficient.
The crash happens when br_nf_pre_routing_finish_bridge() calls
neigh_hh_bridge() which copies not only destination MAC address but also
the padding with it. IIUC this is for performance reasons (so that
aligned 8 bytes are copied rather than 6).
But I wonder whether we can rely on the fact that every skb on an
ethernet-like device has ethernet header padded at least to the 16 bytes
expected by neigh_hh_bridge() and neigh_hh_output() or whether the
bridge code should make sure. I tried to look for such test but couldn't
find any, even if commit a504b86e description mentions reallocating the
skb rather than a crash.
Thanks in advance,
Michal Kubecek
* Re: [PATCH v2 net-next 3/4] filter: add ANC_PAY_OFFSET instruction for loading payload start offset
From: Eric Dumazet @ 2013-03-19 15:04 UTC
To: Daniel Borkmann; +Cc: netdev, davem, jasowang
In-Reply-To: <c5924543d43d3870840806cb951b385135d1c0c4.1363702456.git.dborkman@redhat.com>
On Tue, 2013-03-19 at 15:34 +0100, Daniel Borkmann wrote:
> It is very useful to do dynamic truncation of packets. In particular,
> we're interested in pushing the necessary header bytes to user space and
> cutting off user payload that should probably not be transferred for some
> reason (e.g. privacy, speed, or others). With the ancillary extension
> PAY_OFFSET, we can load the payload offset into the accumulator and
> return it. E.g. in bpfc syntax ...
>
> ld #poff ; { 0x20, 0, 0, 0xfffff034 },
> ret a ; { 0x16, 0, 0, 0x00000000 },
>
> ... as a filter will accomplish this without requiring any big hackery
> in a BPF filter itself. Follow-up JIT implementations are welcome.
>
> Thanks to Eric Dumazet for suggesting and discussing this during the
> Netfilter Workshop in Copenhagen.
>
> Suggested-by: Eric Dumazet <edumazet@google.com>
> Signed-off-by: Daniel Borkmann <dborkman@redhat.com>
> ---
Thanks a lot Daniel
Acked-by: Eric Dumazet <edumazet@google.com>
* Re: [PATCH v2 net-next 2/4] net: flow_dissector: add __skb_get_poff to get a start offset to payload
From: Eric Dumazet @ 2013-03-19 15:03 UTC
To: Daniel Borkmann; +Cc: netdev, davem, jasowang
In-Reply-To: <fd6036248f9337e3679cb441f262ce5d5b228c9b.1363702456.git.dborkman@redhat.com>
On Tue, 2013-03-19 at 15:34 +0100, Daniel Borkmann wrote:
> __skb_get_poff() returns the offset to the payload as far as it could
> be dissected. The main user is currently BPF, so that we can dynamically
> truncate packets without needing to push actual payload to the user
> space and instead can analyze headers only.
>
> Suggested-by: Eric Dumazet <edumazet@google.com>
> Signed-off-by: Daniel Borkmann <dborkman@redhat.com>
> ---
> include/linux/skbuff.h | 2 ++
> net/core/flow_dissector.c | 57 +++++++++++++++++++++++++++++++++++++++++++++++
> 2 files changed, 59 insertions(+)
Acked-by: Eric Dumazet <edumazet@google.com>
* Re: [PATCH v2 net-next 1/4] flow_keys: include thoff into flow_keys for later usage
From: Eric Dumazet @ 2013-03-19 15:03 UTC
To: Daniel Borkmann; +Cc: netdev, davem, jasowang
In-Reply-To: <cc0d2911a173be99651dc39e77286aab260c3c19.1363702456.git.dborkman@redhat.com>
On Tue, 2013-03-19 at 15:34 +0100, Daniel Borkmann wrote:
> In skb_flow_dissect(), we perform a dissection of a skbuff. Since we're
> doing the work here anyway, also store thoff for a later usage, e.g. in
> the BPF filter. Also, by having thoff 16 Bit, we do not need to pack
> flow_keys and reorder choke_skb_cb.
>
> Suggested-by: Eric Dumazet <edumazet@google.com>
> Signed-off-by: Daniel Borkmann <dborkman@redhat.com>
> ---
> This patch also needs to go into the net tree, since Eric or Jason will
> post a bug fix on top of this one.
>
> include/net/flow_keys.h | 1 +
> net/core/flow_dissector.c | 5 ++++-
> 2 files changed, 5 insertions(+), 1 deletion(-)
Oh well, you left the choke_skb_cb description in changelog
Acked-by: Eric Dumazet <edumazet@google.com>
* Re: [PATCH v2 net-next 0/4] net: filter: BPF updates
From: David Miller @ 2013-03-19 14:51 UTC
To: dborkman; +Cc: netdev, eric.dumazet, jasowang
In-Reply-To: <5148793F.3000000@redhat.com>
From: Daniel Borkmann <dborkman@redhat.com>
Date: Tue, 19 Mar 2013 15:42:07 +0100
> On 03/19/2013 03:38 PM, David Miller wrote:
>> From: Daniel Borkmann <dborkman@redhat.com>
>> Date: Tue, 19 Mar 2013 15:33:59 +0100
>>
>>> This set adds i) an ancillary operation to the BPF engine and ii) a
>>> BPF JIT image disassembler in order to verify or debug the BPF JIT
>>> compilers under arch/*/net/.
>>>
>>> v1 -> v2:
>>> - No need to reorder choke_skb_cb structure
>>
>> So we want the first patch in 'net' right?
>
> Eric and Jason mentioned that they need to do a bug fix, where they
> could simplify code for the fix by having the 1st patch of this set
> in the net tree as well.
>
> However, all other patches of this set except the last one also depend
> on the first one, so it would be net and net-next then. Eric, please
> correct me if I'm wrong?
That's fine, I can handle such dependencies without any problems.
* [PATCH] igb: fix PHC stopping on max freq
From: Jiri Benc @ 2013-03-19 14:42 UTC
To: netdev; +Cc: e1000-devel, Jeff Kirsher, Stefan Assmann, Miroslav Lichvar
For 82576 MAC type, max_adj is reported as 1000000000 ppb. However, if
this value is passed to igb_ptp_adjfreq_82576, incvalue overflows out of
INCVALUE_82576_MASK, resulting in setting of zero TIMINCA.incvalue, stopping
the PHC (instead of going at twice the nominal speed).
Fix the advertised max_adj value to the largest value hardware can handle.
As there is no min_adj value available (-max_adj is used instead), this will
also prevent stopping the clock intentionally. It's probably not a big deal,
other igb MAC types don't support stopping the clock, either.
Signed-off-by: Jiri Benc <jbenc@redhat.com>
---
drivers/net/ethernet/intel/igb/igb_ptp.c | 2 +-
1 files changed, 1 insertions(+), 1 deletions(-)
diff --git a/drivers/net/ethernet/intel/igb/igb_ptp.c b/drivers/net/ethernet/intel/igb/igb_ptp.c
index 0987822..0a23750 100644
--- a/drivers/net/ethernet/intel/igb/igb_ptp.c
+++ b/drivers/net/ethernet/intel/igb/igb_ptp.c
@@ -740,7 +740,7 @@ void igb_ptp_init(struct igb_adapter *adapter)
case e1000_82576:
snprintf(adapter->ptp_caps.name, 16, "%pm", netdev->dev_addr);
adapter->ptp_caps.owner = THIS_MODULE;
- adapter->ptp_caps.max_adj = 1000000000;
+ adapter->ptp_caps.max_adj = 999999881;
adapter->ptp_caps.n_ext_ts = 0;
adapter->ptp_caps.pps = 0;
adapter->ptp_caps.adjfreq = igb_ptp_adjfreq_82576;
--
1.7.6.5
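The overflow described in this patch can be reproduced with plain integer arithmetic. The sketch below is a user-space model, not the driver code: the base increment `16 << 19`, the 24-bit mask and the `<< 14` / `1953125` scaling are assumed to mirror what igb_ptp_adjfreq_82576 does and should be checked against the driver source. The names `incvalue_for_ppb`, `SKETCH_INCVALUE_BASE` and `SKETCH_INCVALUE_MASK` are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed to mirror the 82576 constants in igb_ptp.c; verify against the
 * driver before relying on them. */
#define SKETCH_INCVALUE_BASE	(16u << 19)		/* nominal increment, 2^23 */
#define SKETCH_INCVALUE_MASK	((1u << 24) - 1)	/* width of TIMINCA.incvalue */

/* Increment value that would be programmed for a positive frequency
 * adjustment of ppb parts per billion. */
static uint32_t incvalue_for_ppb(uint32_t ppb)
{
	uint64_t rate = ppb;

	assert(ppb <= 1000000000u);
	rate <<= 14;
	rate /= 1953125;	/* 2^14 / 5^9 == 2^23 / 10^9: ppb to increment units */

	return (uint32_t)(SKETCH_INCVALUE_BASE + rate) & SKETCH_INCVALUE_MASK;
}
```

Under these assumptions 1000000000 ppb yields 2^23 + 2^23 = 2^24, which the mask reduces to zero, whereas 999999881 ppb is the smallest adjustment that produces the largest representable increment, 0xffffff.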
* Re: [PATCH v2 net-next 0/4] net: filter: BPF updates
From: Daniel Borkmann @ 2013-03-19 14:42 UTC
To: David Miller; +Cc: netdev, eric.dumazet, jasowang
In-Reply-To: <20130319.103803.368214683799955215.davem@davemloft.net>
On 03/19/2013 03:38 PM, David Miller wrote:
> From: Daniel Borkmann <dborkman@redhat.com>
> Date: Tue, 19 Mar 2013 15:33:59 +0100
>
>> This set adds i) an ancillary operation to the BPF engine and ii) a
>> BPF JIT image disassembler in order to verify or debug the BPF JIT
>> compilers under arch/*/net/.
>>
>> v1 -> v2:
>> - No need to reorder choke_skb_cb structure
>
> So we want the first patch in 'net' right?
Eric and Jason mentioned that they need to do a bug fix, where they
could simplify code for the fix by having the 1st patch of this set
in the net tree as well.
However, all other patches of this set except the last one also depend
on the first one, so it would be net and net-next then. Eric, please
correct me if I'm wrong?
* Re: [PATCH v2 net-next 0/4] net: filter: BPF updates
From: David Miller @ 2013-03-19 14:38 UTC
To: dborkman; +Cc: netdev, eric.dumazet, jasowang
In-Reply-To: <cover.1363702456.git.dborkman@redhat.com>
From: Daniel Borkmann <dborkman@redhat.com>
Date: Tue, 19 Mar 2013 15:33:59 +0100
> This set adds i) an ancillary operation to the BPF engine and ii) a
> BPF JIT image disassembler in order to verify or debug the BPF JIT
> compilers under arch/*/net/.
>
> v1 -> v2:
> - No need to reorder choke_skb_cb structure
So we want the first patch in 'net' right?
* Re: [PATCH] use xfrm direction when lookup policy
From: David Miller @ 2013-03-19 14:35 UTC
To: baker.kernel; +Cc: steffen.klassert, herbert, netdev, linux-kernel, baker.zhang
In-Reply-To: <1363703070-2910-1-git-send-email-baker.zhang@gmail.com>
From: Baker Zhang <baker.kernel@gmail.com>
Date: Tue, 19 Mar 2013 22:24:30 +0800
> because xfrm policy direction has same value with corresponding
> flow direction, so this problem is covered.
>
> In xfrm_lookup and __xfrm_policy_check, flow_cache_lookup is used to
> accelerate the lookup.
>
> Flow direction is given to flow_cache_lookup by policy_to_flow_dir.
>
> When the flow cache is mismatched, callback 'resolver' is called.
>
> 'resolver' requires xfrm direction,
> so convert direction back to xfrm direction.
>
> Signed-off-by: Baker Zhang <baker.zhang@gmail.com>
Since this is a cleanup and doesn't really fix a bug, I've applied
this to net-next.
Thanks.
* [PATCH v2 net-next 4/4] filter: add minimal BPF JIT emitted image disassembler
From: Daniel Borkmann @ 2013-03-19 14:34 UTC
To: netdev; +Cc: davem, eric.dumazet, jasowang, Eric Dumazet
In-Reply-To: <cover.1363702456.git.dborkman@redhat.com>
This is a minimal stand-alone user space helper, that allows for debugging or
verification of emitted BPF JIT images. This is in particular useful for
emitted opcode debugging, since minor bugs in the JIT compiler can be fatal.
The disassembler is architecture generic and uses libopcodes and libbfd.
How to get to the disassembly, example:
1) `echo 2 > /proc/sys/net/core/bpf_jit_enable`
2) Load a BPF filter (e.g. `tcpdump -p -n -s 0 -i eth1 host 192.168.20.0/24`)
3) Run e.g. `bpf_jit_disasm -o` to disassemble the most recent JIT code output
`bpf_jit_disasm -o` will display the related opcodes to a particular instruction
as well. Example for x86_64:
$ ./bpf_jit_disasm
94 bytes emitted from JIT compiler (pass:3, flen:9)
ffffffffa0356000 + <x>:
0: push %rbp
1: mov %rsp,%rbp
4: sub $0x60,%rsp
8: mov %rbx,-0x8(%rbp)
c: mov 0x68(%rdi),%r9d
10: sub 0x6c(%rdi),%r9d
14: mov 0xe0(%rdi),%r8
1b: mov $0xc,%esi
20: callq 0xffffffffe0d01b71
25: cmp $0x86dd,%eax
2a: jne 0x000000000000003d
2c: mov $0x14,%esi
31: callq 0xffffffffe0d01b8d
36: cmp $0x6,%eax
[...]
5c: leaveq
5d: retq
$ ./bpf_jit_disasm -o
94 bytes emitted from JIT compiler (pass:3, flen:9)
ffffffffa0356000 + <x>:
0: push %rbp
55
1: mov %rsp,%rbp
48 89 e5
4: sub $0x60,%rsp
48 83 ec 60
8: mov %rbx,-0x8(%rbp)
48 89 5d f8
c: mov 0x68(%rdi),%r9d
44 8b 4f 68
10: sub 0x6c(%rdi),%r9d
44 2b 4f 6c
[...]
5c: leaveq
c9
5d: retq
c3
Cc: Eric Dumazet <edumazet@google.com>
Signed-off-by: Daniel Borkmann <dborkman@redhat.com>
---
scripts/bpf_jit_disasm.c | 216 +++++++++++++++++++++++++++++++++++++++++++++++
1 file changed, 216 insertions(+)
create mode 100644 scripts/bpf_jit_disasm.c
diff --git a/scripts/bpf_jit_disasm.c b/scripts/bpf_jit_disasm.c
new file mode 100644
index 0000000..1fe9fb5
--- /dev/null
+++ b/scripts/bpf_jit_disasm.c
@@ -0,0 +1,216 @@
+/*
+ * Minimal BPF JIT image disassembler
+ *
+ * Disassembles BPF JIT compiler emitted opcodes back to asm insn's for
+ * debugging or verification purposes.
+ *
+ * There is no Makefile. Compile with
+ *
+ * `gcc -Wall -O2 bpf_jit_disasm.c -o bpf_jit_disasm -lopcodes -lbfd -ldl`
+ *
+ * or similar.
+ *
+ * To get the disassembly of the JIT code, do the following:
+ *
+ * 1) `echo 2 > /proc/sys/net/core/bpf_jit_enable`
+ * 2) Load a BPF filter (e.g. `tcpdump -p -n -s 0 -i eth1 host 192.168.20.0/24`)
+ * 3) Run e.g. `./bpf_jit_disasm -o` to read out the last JIT code
+ *
+ * Copyright 2013 Daniel Borkmann <borkmann@redhat.com>
+ * Licensed under the GNU General Public License, version 2.0 (GPLv2)
+ */
+
+#include <stdint.h>
+#include <stdio.h>
+#include <stdlib.h>
+#include <assert.h>
+#include <unistd.h>
+#include <string.h>
+#include <bfd.h>
+#include <dis-asm.h>
+#include <sys/klog.h>
+#include <sys/types.h>
+#include <regex.h>
+
+#define VERSION_STRING "1.0"
+
+static void get_exec_path(char *tpath, size_t size)
+{
+ char *path;
+ ssize_t len;
+
+ snprintf(tpath, size, "/proc/%d/exe", (int) getpid());
+ tpath[size - 1] = 0;
+
+ path = strdup(tpath);
+ assert(path);
+
+ len = readlink(path, tpath, size);
+ tpath[len] = 0;
+
+ free(path);
+}
+
+static void get_asm_insns(uint8_t *image, size_t len, unsigned long base,
+ int opcodes)
+{
+ int count, i, pc = 0;
+ char tpath[256];
+ struct disassemble_info info;
+ disassembler_ftype disassemble;
+ bfd *bfdf;
+
+ memset(tpath, 0, sizeof(tpath));
+ get_exec_path(tpath, sizeof(tpath));
+
+ bfdf = bfd_openr(tpath, NULL);
+ assert(bfdf);
+ assert(bfd_check_format(bfdf, bfd_object));
+
+ init_disassemble_info(&info, stdout, (fprintf_ftype) fprintf);
+ info.arch = bfd_get_arch(bfdf);
+ info.mach = bfd_get_mach(bfdf);
+ info.buffer = image;
+ info.buffer_length = len;
+
+ disassemble_init_for_target(&info);
+
+ disassemble = disassembler(bfdf);
+ assert(disassemble);
+
+ do {
+ printf("%4x:\t", pc);
+
+ count = disassemble(pc, &info);
+
+ if (opcodes) {
+ printf("\n\t");
+ for (i = 0; i < count; ++i)
+ printf("%02x ", (uint8_t) image[pc + i]);
+ }
+ printf("\n");
+
+ pc += count;
+ } while(count > 0 && pc < len);
+
+ bfd_close(bfdf);
+}
+
+static char *get_klog_buff(int *klen)
+{
+ int ret, len = klogctl(10, NULL, 0);
+ char *buff = malloc(len);
+
+ assert(buff && klen);
+ ret = klogctl(3, buff, len);
+ assert(ret >= 0);
+ *klen = ret;
+
+ return buff;
+}
+
+static void put_klog_buff(char *buff)
+{
+ free(buff);
+}
+
+static int get_last_jit_image(char *haystack, size_t hlen,
+ uint8_t *image, size_t ilen,
+ unsigned long *base)
+{
+ char *ptr, *pptr, *tmp;
+ off_t off = 0;
+ int ret, flen, proglen, pass, ulen = 0;
+ regmatch_t pmatch[1];
+ regex_t regex;
+
+ if (hlen == 0)
+ return 0;
+
+ ret = regcomp(&regex, "flen=[[:alnum:]]+ proglen=[[:digit:]]+ "
+ "pass=[[:digit:]]+ image=[[:xdigit:]]+", REG_EXTENDED);
+ assert(ret == 0);
+
+ ptr = haystack;
+ while (1) {
+ ret = regexec(&regex, ptr, 1, pmatch, 0);
+ if (ret == 0) {
+ ptr += pmatch[0].rm_eo;
+ off += pmatch[0].rm_eo;
+ assert(off < hlen);
+ } else
+ break;
+ }
+
+ ptr = haystack + off - (pmatch[0].rm_eo - pmatch[0].rm_so);
+ ret = sscanf(ptr, "flen=%d proglen=%d pass=%d image=%lx",
+ &flen, &proglen, &pass, base);
+ if (ret != 4)
+ return 0;
+
+ tmp = ptr = haystack + off;
+ while ((ptr = strtok(tmp, "\n")) != NULL && ulen < ilen) {
+ tmp = NULL;
+ if (!strstr(ptr, "JIT code"))
+ continue;
+ pptr = ptr;
+ while ((ptr = strstr(pptr, ":")))
+ pptr = ptr + 1;
+ ptr = pptr;
+ do {
+ image[ulen++] = (uint8_t) strtoul(pptr, &pptr, 16);
+ if (ptr == pptr || ulen >= ilen) {
+ ulen--;
+ break;
+ }
+ ptr = pptr;
+ } while (1);
+ }
+
+ assert(ulen == proglen);
+ printf("%d bytes emitted from JIT compiler (pass:%d, flen:%d)\n",
+ proglen, pass, flen);
+ printf("%lx + <x>:\n", *base);
+
+ regfree(&regex);
+ return ulen;
+}
+
+static void help(void)
+{
+ printf("Usage: bpf_jit_disasm [-ohv]\n");
+ printf("Version %s, written by Daniel Borkmann <borkmann@redhat.com>\n",
+ VERSION_STRING);
+ printf(" -o Include opcodes in output\n");
+ printf(" -h|-v Show help/version\n");
+ exit(0);
+}
+
+int main(int argc, char **argv)
+{
+ int len, klen, opcodes = 0;
+ char *kbuff;
+ unsigned long base;
+ uint8_t image[4096];
+
+ if (argc > 1) {
+ if (!strncmp("-o", argv[argc - 1], 2))
+ opcodes = 1;
+ if (!strncmp("-h", argv[argc - 1], 2) ||
+ !strncmp("-v", argv[argc - 1], 2))
+ help();
+ }
+
+ bfd_init();
+ memset(image, 0, sizeof(image));
+
+ kbuff = get_klog_buff(&klen);
+
+ len = get_last_jit_image(kbuff, klen, image, sizeof(image), &base);
+ if (len > 0 && base > 0)
+ get_asm_insns(image, len, base, opcodes);
+
+ put_klog_buff(kbuff);
+
+ return 0;
+}
--
1.7.11.7
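The header line that get_last_jit_image() looks for in the kernel log can be parsed in isolation. The following is a minimal sketch, assuming the input string already starts at `flen=` (the tool itself locates that position with a regex first); `struct jit_header` and `parse_jit_header` are hypothetical names, and `unsigned long long` is used instead of the patch's `unsigned long` so the sketch also works on 32-bit hosts.

```c
#include <assert.h>
#include <stdio.h>

/* Fields of the JIT header line, in the order the patch scans them. */
struct jit_header {
	int flen;			/* number of BPF instructions */
	int proglen;			/* emitted image size in bytes */
	int pass;			/* JIT compiler pass count */
	unsigned long long image;	/* kernel address of the image */
};

/* Returns 1 when all four fields were parsed, 0 otherwise. */
static int parse_jit_header(const char *line, struct jit_header *out)
{
	assert(line && out);
	return sscanf(line, "flen=%d proglen=%d pass=%d image=%llx",
		      &out->flen, &out->proglen, &out->pass,
		      &out->image) == 4;
}
```

Fed the values from the example output above (`flen=9 proglen=94 pass=3 image=ffffffffa0356000`), the sketch fills in 9, 94, 3 and the image base address; a line without the header yields 0.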
* [PATCH v2 net-next 3/4] filter: add ANC_PAY_OFFSET instruction for loading payload start offset
From: Daniel Borkmann @ 2013-03-19 14:34 UTC
To: netdev; +Cc: davem, eric.dumazet, jasowang
In-Reply-To: <cover.1363702456.git.dborkman@redhat.com>
It is very useful to do dynamic truncation of packets. In particular,
we're interested in pushing the necessary header bytes to user space and
cutting off user payload that should probably not be transferred for some
reason (e.g. privacy, speed, or others). With the ancillary extension
PAY_OFFSET, we can load the payload offset into the accumulator and
return it. E.g. in bpfc syntax ...
ld #poff ; { 0x20, 0, 0, 0xfffff034 },
ret a ; { 0x16, 0, 0, 0x00000000 },
... as a filter will accomplish this without requiring any big hackery
in a BPF filter itself. Follow-up JIT implementations are welcome.
Thanks to Eric Dumazet for suggesting and discussing this during the
Netfilter Workshop in Copenhagen.
Suggested-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Daniel Borkmann <dborkman@redhat.com>
---
include/linux/filter.h | 1 +
include/uapi/linux/filter.h | 3 ++-
net/core/filter.c | 5 +++++
3 files changed, 8 insertions(+), 1 deletion(-)
diff --git a/include/linux/filter.h b/include/linux/filter.h
index c45eabc..d2059cb 100644
--- a/include/linux/filter.h
+++ b/include/linux/filter.h
@@ -126,6 +126,7 @@ enum {
BPF_S_ANC_SECCOMP_LD_W,
BPF_S_ANC_VLAN_TAG,
BPF_S_ANC_VLAN_TAG_PRESENT,
+ BPF_S_ANC_PAY_OFFSET,
};
#endif /* __LINUX_FILTER_H__ */
diff --git a/include/uapi/linux/filter.h b/include/uapi/linux/filter.h
index 9cfde69..8eb9cca 100644
--- a/include/uapi/linux/filter.h
+++ b/include/uapi/linux/filter.h
@@ -129,7 +129,8 @@ struct sock_fprog { /* Required for SO_ATTACH_FILTER. */
#define SKF_AD_ALU_XOR_X 40
#define SKF_AD_VLAN_TAG 44
#define SKF_AD_VLAN_TAG_PRESENT 48
-#define SKF_AD_MAX 52
+#define SKF_AD_PAY_OFFSET 52
+#define SKF_AD_MAX 56
#define SKF_NET_OFF (-0x100000)
#define SKF_LL_OFF (-0x200000)
diff --git a/net/core/filter.c b/net/core/filter.c
index 2e20b55..dad2a17 100644
--- a/net/core/filter.c
+++ b/net/core/filter.c
@@ -348,6 +348,9 @@ load_b:
case BPF_S_ANC_VLAN_TAG_PRESENT:
A = !!vlan_tx_tag_present(skb);
continue;
+ case BPF_S_ANC_PAY_OFFSET:
+ A = __skb_get_poff(skb);
+ continue;
case BPF_S_ANC_NLATTR: {
struct nlattr *nla;
@@ -612,6 +615,7 @@ int sk_chk_filter(struct sock_filter *filter, unsigned int flen)
ANCILLARY(ALU_XOR_X);
ANCILLARY(VLAN_TAG);
ANCILLARY(VLAN_TAG_PRESENT);
+ ANCILLARY(PAY_OFFSET);
}
/* ancillary operation unknown or unsupported */
@@ -814,6 +818,7 @@ static void sk_decode_filter(struct sock_filter *filt, struct sock_filter *to)
[BPF_S_ANC_SECCOMP_LD_W] = BPF_LD|BPF_B|BPF_ABS,
[BPF_S_ANC_VLAN_TAG] = BPF_LD|BPF_B|BPF_ABS,
[BPF_S_ANC_VLAN_TAG_PRESENT] = BPF_LD|BPF_B|BPF_ABS,
+ [BPF_S_ANC_PAY_OFFSET] = BPF_LD|BPF_B|BPF_ABS,
[BPF_S_LD_W_LEN] = BPF_LD|BPF_W|BPF_LEN,
[BPF_S_LD_W_IND] = BPF_LD|BPF_W|BPF_IND,
[BPF_S_LD_H_IND] = BPF_LD|BPF_H|BPF_IND,
--
1.7.11.7
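The two raw instructions quoted in the changelog follow directly from the constants this patch touches. The sketch below rebuilds them in user space; `struct sketch_insn` is a local stand-in with the same field order as struct sock_filter so that it compiles without kernel headers, and `build_poff_filter` plus the `SKETCH_*` names are hypothetical. Real code would use <linux/filter.h> and SKF_AD_OFF + SKF_AD_PAY_OFFSET instead.

```c
#include <assert.h>
#include <stdint.h>

/* Local stand-in for struct sock_filter (code, jt, jf, k). */
struct sketch_insn {
	uint16_t code;
	uint8_t jt;
	uint8_t jf;
	uint32_t k;
};

#define SKETCH_AD_OFF		(-0x1000)	/* SKF_AD_OFF in the uapi header */
#define SKETCH_AD_PAY_OFFSET	52		/* SKF_AD_PAY_OFFSET from this patch */

/* Fill prog[0..1] with "ld #poff; ret a" and return the instruction count. */
static unsigned int build_poff_filter(struct sketch_insn prog[2])
{
	assert(prog);

	/* 0x20 == BPF_LD | BPF_W | BPF_ABS: absolute word load; the negative
	 * offset selects the ancillary data area. */
	prog[0] = (struct sketch_insn){ 0x20, 0, 0,
		(uint32_t)(SKETCH_AD_OFF + SKETCH_AD_PAY_OFFSET) };
	/* 0x16 == BPF_RET | BPF_A: return the accumulator as snap length. */
	prog[1] = (struct sketch_insn){ 0x16, 0, 0, 0 };

	return 2;
}
```

The first instruction's k field works out to 0xfffff034, matching the bpfc listing in the changelog. On a kernel carrying this series, the equivalent struct sock_filter array would be attached with SO_ATTACH_FILTER, so that the filter's return value truncates each packet at the start of its payload.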
* [PATCH v2 net-next 2/4] net: flow_dissector: add __skb_get_poff to get a start offset to payload
From: Daniel Borkmann @ 2013-03-19 14:34 UTC
To: netdev; +Cc: davem, eric.dumazet, jasowang
In-Reply-To: <cover.1363702456.git.dborkman@redhat.com>
__skb_get_poff() returns the offset to the payload as far as it could
be dissected. The main user is currently BPF, so that we can dynamically
truncate packets without needing to push actual payload to the user
space and instead can analyze headers only.
Suggested-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Daniel Borkmann <dborkman@redhat.com>
---
include/linux/skbuff.h | 2 ++
net/core/flow_dissector.c | 57 +++++++++++++++++++++++++++++++++++++++++++++++
2 files changed, 59 insertions(+)
diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h
index eb2106f..0e84fd8 100644
--- a/include/linux/skbuff.h
+++ b/include/linux/skbuff.h
@@ -2835,6 +2835,8 @@ static inline void skb_checksum_none_assert(const struct sk_buff *skb)
bool skb_partial_csum_set(struct sk_buff *skb, u16 start, u16 off);
+u32 __skb_get_poff(const struct sk_buff *skb);
+
/**
* skb_head_is_locked - Determine if the skb->head is locked down
* @skb: skb to check
diff --git a/net/core/flow_dissector.c b/net/core/flow_dissector.c
index eb9dde1..8213da7 100644
--- a/net/core/flow_dissector.c
+++ b/net/core/flow_dissector.c
@@ -5,6 +5,10 @@
#include <linux/if_vlan.h>
#include <net/ip.h>
#include <net/ipv6.h>
+#include <linux/igmp.h>
+#include <linux/icmp.h>
+#include <linux/sctp.h>
+#include <linux/dccp.h>
#include <linux/if_tunnel.h>
#include <linux/if_pppox.h>
#include <linux/ppp_defs.h>
@@ -229,6 +233,59 @@ u16 __skb_tx_hash(const struct net_device *dev, const struct sk_buff *skb,
}
EXPORT_SYMBOL(__skb_tx_hash);
+/* __skb_get_poff() returns the offset to the payload as far as it could
+ * be dissected. The main user is currently BPF, so that we can dynamically
+ * truncate packets without needing to push actual payload to the user
+ * space and can analyze headers only, instead.
+ */
+u32 __skb_get_poff(const struct sk_buff *skb)
+{
+ struct flow_keys keys;
+ u32 poff = 0;
+
+ if (!skb_flow_dissect(skb, &keys))
+ return 0;
+
+ poff += keys.thoff;
+ switch (keys.ip_proto) {
+ case IPPROTO_TCP: {
+ const struct tcphdr *tcph;
+ struct tcphdr _tcph;
+
+ tcph = skb_header_pointer(skb, poff, sizeof(_tcph), &_tcph);
+ if (!tcph)
+ return poff;
+
+ poff += max_t(u32, sizeof(struct tcphdr), tcph->doff * 4);
+ break;
+ }
+ case IPPROTO_UDP:
+ case IPPROTO_UDPLITE:
+ poff += sizeof(struct udphdr);
+ break;
+ /* For the rest, we do not really care about header
+ * extensions at this point for now.
+ */
+ case IPPROTO_ICMP:
+ poff += sizeof(struct icmphdr);
+ break;
+ case IPPROTO_ICMPV6:
+ poff += sizeof(struct icmp6hdr);
+ break;
+ case IPPROTO_IGMP:
+ poff += sizeof(struct igmphdr);
+ break;
+ case IPPROTO_DCCP:
+ poff += sizeof(struct dccp_hdr);
+ break;
+ case IPPROTO_SCTP:
+ poff += sizeof(struct sctphdr);
+ break;
+ }
+
+ return poff;
+}
+
static inline u16 dev_cap_txqueue(struct net_device *dev, u16 queue_index)
{
if (unlikely(queue_index >= dev->real_num_tx_queues)) {
--
1.7.11.7
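The TCP branch of __skb_get_poff() is the only one that depends on packet contents, so a worked example may help. This is a user-space sketch of that one computation, assuming a transport header offset `thoff` supplied by the caller and the fixed 20-byte size of struct tcphdr; `tcp_payload_offset` is a hypothetical name, and the value 34 used in the note below is just an illustrative offset for an Ethernet plus option-less IPv4 frame.

```c
#include <assert.h>
#include <stdint.h>

#define SKETCH_TCP_MIN_HDR	20u	/* sizeof(struct tcphdr) */

/* Payload offset for a TCP segment: transport header offset plus the
 * header length announced in the data offset field (in 32-bit words),
 * but never less than the fixed header size. */
static uint32_t tcp_payload_offset(uint32_t thoff, uint8_t doff)
{
	uint32_t hdr_len = (uint32_t)doff * 4;

	assert(doff <= 15);	/* doff is a 4-bit field */
	if (hdr_len < SKETCH_TCP_MIN_HDR)
		hdr_len = SKETCH_TCP_MIN_HDR;

	return thoff + hdr_len;
}
```

With thoff = 34, a segment without options (doff = 5) gives 54, one with 12 bytes of options (doff = 8) gives 66, and a bogus doff of 0 is clamped to the minimum header and also gives 54, which is what the max_t() in the patch is there for.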
* [PATCH v2 net-next 1/4] flow_keys: include thoff into flow_keys for later usage
From: Daniel Borkmann @ 2013-03-19 14:34 UTC
To: netdev; +Cc: davem, eric.dumazet, jasowang
In-Reply-To: <cover.1363702456.git.dborkman@redhat.com>
In skb_flow_dissect(), we perform a dissection of a skbuff. Since we're
doing the work here anyway, also store thoff for a later usage, e.g. in
the BPF filter. Also, by having thoff 16 Bit, we do not need to pack
flow_keys and reorder choke_skb_cb.
Suggested-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Daniel Borkmann <dborkman@redhat.com>
---
This patch also needs to go into the net tree, since Eric or Jason will
post a bug fix on top of this one.
include/net/flow_keys.h | 1 +
net/core/flow_dissector.c | 5 ++++-
2 files changed, 5 insertions(+), 1 deletion(-)
diff --git a/include/net/flow_keys.h b/include/net/flow_keys.h
index 80461c1..bb8271d 100644
--- a/include/net/flow_keys.h
+++ b/include/net/flow_keys.h
@@ -9,6 +9,7 @@ struct flow_keys {
__be32 ports;
__be16 port16[2];
};
+ u16 thoff;
u8 ip_proto;
};
diff --git a/net/core/flow_dissector.c b/net/core/flow_dissector.c
index f8d9e03..eb9dde1 100644
--- a/net/core/flow_dissector.c
+++ b/net/core/flow_dissector.c
@@ -23,7 +23,8 @@ static void iph_to_flow_copy_addrs(struct flow_keys *flow, const struct iphdr *i
bool skb_flow_dissect(const struct sk_buff *skb, struct flow_keys *flow)
{
- int poff, nhoff = skb_network_offset(skb);
+ int poff;
+ u16 nhoff = skb_network_offset(skb);
u8 ip_proto;
__be16 proto = skb->protocol;
@@ -151,6 +152,8 @@ ipv6:
flow->ports = *ports;
}
+ flow->thoff = nhoff;
+
return true;
}
EXPORT_SYMBOL(skb_flow_dissect);
--
1.7.11.7
* [PATCH v2 net-next 0/4] net: filter: BPF updates
From: Daniel Borkmann @ 2013-03-19 14:33 UTC
To: netdev; +Cc: davem, eric.dumazet, jasowang
This set adds i) an ancillary operation to the BPF engine and ii) a
BPF JIT image disassembler in order to verify or debug the BPF JIT
compilers under arch/*/net/.
v1 -> v2:
- No need to reorder choke_skb_cb structure
Daniel Borkmann (4):
flow_keys: include thoff into flow_keys for later usage
net: flow_dissector: add __skb_get_poff to get a start offset to payload
filter: add ANC_PAY_OFFSET instruction for loading payload start offset
filter: add minimal BPF JIT emitted image disassembler
include/linux/filter.h | 1 +
include/linux/skbuff.h | 2 +
include/net/flow_keys.h | 1 +
include/uapi/linux/filter.h | 3 +-
net/core/filter.c | 5 +
net/core/flow_dissector.c | 62 ++++++++++++-
scripts/bpf_jit_disasm.c | 216 ++++++++++++++++++++++++++++++++++++++++++++
7 files changed, 288 insertions(+), 2 deletions(-)
create mode 100644 scripts/bpf_jit_disasm.c
--
1.7.11.7
* Re: Re: [BUG][mvebu] mvneta: cannot request irq 25 on openblocks-ax3
From: Jason Cooper @ 2013-03-19 14:33 UTC
To: Masami Hiramatsu
Cc: linux-arm-kernel, thomas.petazzoni, netdev, linux-kernel,
yrl.pp-manager.tt@hitachi.com
In-Reply-To: <514873D9.6060704@hitachi.com>
On Tue, Mar 19, 2013 at 11:19:05PM +0900, Masami Hiramatsu wrote:
> Hi Jason,
>
> (2013/03/19 22:33), Jason Cooper wrote:
> > On Tue, Mar 19, 2013 at 10:12:37PM +0900, Masami Hiramatsu wrote:
> >> Hi,
> >>
> >> Here I've hit a bug on the recent kernel. As far as I know, this bug
> >> exists on 3.9-rc1 too.
> >>
> >> When I tried the latest mvebu for-next tree
> >> (git://git.infradead.org/users/jcooper/linux.git mvebu/for-next),
> >
> > FYI: that branch isn't stable, it's used as a merge-test of
> > arm-soc/for-next (also not stable) and any branches I am trying to push
> > upstream that day.
>
> Thanks! could you tell me which branch is stable?
> (however, I'd like to try new fixes/features on my device too :))
Generally, we advise using one of Linus' tags (eg v3.9-rc3). Only if
that doesn't work, seek out a branch containing the fix/feature. When
in doubt, I'll do:
$ ./scripts/get_maintainer.pl -f drivers/net/ethernet/marvell/mvneta.c
and ask those listed which branch might contain a fix or feature.
In this case, it'll probably go through David Miller and the netdev
mailing list, but it isn't there yet. ;-)
hth,
Jason.
* Re: [PATCH net] inet: limit length of fragment queue hash table bucket lists
From: Hannes Frederic Sowa @ 2013-03-19 14:31 UTC
To: Jesper Dangaard Brouer; +Cc: David Miller, netdev, eric.dumazet
In-Reply-To: <1363702840.3232.104.camel@localhost>
On Tue, Mar 19, 2013 at 03:20:40PM +0100, Jesper Dangaard Brouer wrote:
> I think it's overkill to implement this now. I just want this patch in
> as a safeguard.
>
> The idea I discussed with Eric, will remove the need for this patch.
> The idea is to drop the LRU lists, increase the hash size a bit, and do
> cleanup/eviction directly on the frag hash tables. And e.g. only allow
> 5 frag queue elements in each hash bucket... but more work and testing
> is needed before I have something ready.
That's cool, I won't rebase the patch.
Thanks,
Hannes
^ permalink raw reply
* Re: [PATCH net] inet: limit length of fragment queue hash table bucket lists
From: David Miller @ 2013-03-19 14:29 UTC (permalink / raw)
To: hannes; +Cc: netdev, eric.dumazet, jbrouer
In-Reply-To: <20130315213230.GB24041@order.stressinduktion.org>
From: Hannes Frederic Sowa <hannes@stressinduktion.org>
Date: Fri, 15 Mar 2013 22:32:30 +0100
> This patch introduces a constant limit of the fragment queue hash
> table bucket list lengths. Currently the limit 128 is chosen somewhat
> arbitrarily and just ensures that we can fill up the fragment cache with
> empty packets up to the default ip_frag_high_thresh limits. It should
> just protect from list iteration eating considerable amounts of cpu.
>
> If we reach the maximum length in one hash bucket a warning is printed.
> This is implemented on the caller side of inet_frag_find to distinguish
> between the different users of inet_fragment.c.
>
> I dropped the out of memory warning in the ipv4 fragment lookup path,
> because we already get a warning by the slab allocator.
>
> Cc: Eric Dumazet <eric.dumazet@gmail.com>
> Cc: Jesper Dangaard Brouer <jbrouer@redhat.com>
> Signed-off-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
Applied and queued up for -stable, thanks.
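
For readers following the thread, the bounded bucket walk described in the quoted changelog can be sketched roughly as below. This is only an illustration, not the code from the patch: `sketch_queue`, `sketch_find` and the `too_deep` flag are hypothetical names, and the depth limit is passed as a parameter rather than using the kernel's INETFRAGS_MAXDEPTH constant.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for a fragment queue chained in one hash bucket. */
struct sketch_queue {
	unsigned int key;
	struct sketch_queue *next;
};

/*
 * Walk one bucket chain looking for 'key'. On a miss, report through
 * '*too_deep' whether the chain was longer than 'max_depth', so that the
 * caller can print its own warning and refuse to add another entry.
 */
static struct sketch_queue *sketch_find(struct sketch_queue *head,
					unsigned int key,
					unsigned int max_depth,
					int *too_deep)
{
	unsigned int depth = 0;
	struct sketch_queue *q;

	assert(too_deep != NULL);
	*too_deep = 0;

	for (q = head; q != NULL; q = q->next) {
		if (q->key == key)
			return q;	/* hit: chain length does not matter */
		depth++;
	}

	if (depth > max_depth)
		*too_deep = 1;		/* miss on an over-long chain */
	return NULL;
}
```

Leaving the warning to the caller mirrors the changelog's point that the different users of the shared lookup code can then be told apart.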
^ permalink raw reply
* Re: [PATCH net] inet: limit length of fragment queue hash table bucket lists
From: David Miller @ 2013-03-19 14:28 UTC (permalink / raw)
To: jbrouer; +Cc: hannes, netdev, eric.dumazet
In-Reply-To: <1363702840.3232.104.camel@localhost>
From: Jesper Dangaard Brouer <jbrouer@redhat.com>
Date: Tue, 19 Mar 2013 15:20:40 +0100
> I think it's overkill to implement this now. I just want this patch in
> as a safeguard.
>
> The idea I discussed with Eric, will remove the need for this patch.
> The idea is to drop the LRU lists, increase the hash size a bit, and do
> cleanup/eviction directly on the frag hash tables. And e.g. only allow
> 5 frag queue elements in each hash bucket... but more work and testing
> is needed before I have something ready.
Fair enough.
^ permalink raw reply
* [PATCH] use xfrm direction when lookup policy
From: Baker Zhang @ 2013-03-19 14:24 UTC (permalink / raw)
To: Steffen Klassert, Herbert Xu, David S. Miller
Cc: netdev, linux-kernel, Baker Zhang
The xfrm policy directions currently have the same values as the
corresponding flow directions, so this problem is masked.
In xfrm_lookup and __xfrm_policy_check, flow_cache_lookup is used to
accelerate the lookup.
The flow direction is passed to flow_cache_lookup via policy_to_flow_dir.
On a flow cache miss, the 'resolver' callback is called.
'resolver' requires the xfrm direction,
so convert the flow direction back to the xfrm direction.
Signed-off-by: Baker Zhang <baker.zhang@gmail.com>
---
net/xfrm/xfrm_policy.c | 23 +++++++++++++++++++++--
1 file changed, 21 insertions(+), 2 deletions(-)
diff --git a/net/xfrm/xfrm_policy.c b/net/xfrm/xfrm_policy.c
index 167c67d..23cea0f 100644
--- a/net/xfrm/xfrm_policy.c
+++ b/net/xfrm/xfrm_policy.c
@@ -1037,6 +1037,24 @@ __xfrm_policy_lookup(struct net *net, const struct flowi *fl, u16 family, u8 dir
return xfrm_policy_lookup_bytype(net, XFRM_POLICY_TYPE_MAIN, fl, family, dir);
}
+static int flow_to_policy_dir(int dir)
+{
+ if (XFRM_POLICY_IN == FLOW_DIR_IN &&
+ XFRM_POLICY_OUT == FLOW_DIR_OUT &&
+ XFRM_POLICY_FWD == FLOW_DIR_FWD)
+ return dir;
+
+ switch (dir) {
+ default:
+ case FLOW_DIR_IN:
+ return XFRM_POLICY_IN;
+ case FLOW_DIR_OUT:
+ return XFRM_POLICY_OUT;
+ case FLOW_DIR_FWD:
+ return XFRM_POLICY_FWD;
+ }
+}
+
static struct flow_cache_object *
xfrm_policy_lookup(struct net *net, const struct flowi *fl, u16 family,
u8 dir, struct flow_cache_object *old_obj, void *ctx)
@@ -1046,7 +1064,7 @@ xfrm_policy_lookup(struct net *net, const struct flowi *fl, u16 family,
if (old_obj)
xfrm_pol_put(container_of(old_obj, struct xfrm_policy, flo));
- pol = __xfrm_policy_lookup(net, fl, family, dir);
+ pol = __xfrm_policy_lookup(net, fl, family, flow_to_policy_dir(dir));
if (IS_ERR_OR_NULL(pol))
return ERR_CAST(pol);
@@ -1932,7 +1950,8 @@ xfrm_bundle_lookup(struct net *net, const struct flowi *fl, u16 family, u8 dir,
* previous cache entry */
if (xdst == NULL) {
num_pols = 1;
- pols[0] = __xfrm_policy_lookup(net, fl, family, dir);
+ pols[0] = __xfrm_policy_lookup(net, fl, family,
+ flow_to_policy_dir(dir));
err = xfrm_expand_policies(fl, family, pols,
&num_pols, &num_xfrms);
if (err < 0)
--
1.7.9.5
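
To illustrate why the reverse mapping matters, here is a small standalone sketch. It does not use the kernel headers: the `SK_*` constants and `sketch_*` helpers are hypothetical stand-ins, and the two enums are deliberately given different values (unlike the current kernel constants, which happen to coincide) so that a missing conversion would be visible.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-ins; values differ on purpose to expose a missing conversion. */
enum sketch_policy_dir { SK_POLICY_IN = 0, SK_POLICY_OUT = 1, SK_POLICY_FWD = 2 };
enum sketch_flow_dir { SK_FLOW_IN = 10, SK_FLOW_OUT = 11, SK_FLOW_FWD = 12 };

/* Forward mapping, applied before the flow cache lookup. */
static int sketch_policy_to_flow_dir(int dir)
{
	switch (dir) {
	default:
	case SK_POLICY_IN:
		return SK_FLOW_IN;
	case SK_POLICY_OUT:
		return SK_FLOW_OUT;
	case SK_POLICY_FWD:
		return SK_FLOW_FWD;
	}
}

/* Reverse mapping, applied in the resolver before the policy lookup. */
static int sketch_flow_to_policy_dir(int dir)
{
	switch (dir) {
	default:
	case SK_FLOW_IN:
		return SK_POLICY_IN;
	case SK_FLOW_OUT:
		return SK_POLICY_OUT;
	case SK_FLOW_FWD:
		return SK_POLICY_FWD;
	}
}

/* Every policy direction must survive the trip through the flow cache. */
static void sketch_check_roundtrip(void)
{
	static const int dirs[] = { SK_POLICY_IN, SK_POLICY_OUT, SK_POLICY_FWD };
	size_t i;

	for (i = 0; i < sizeof(dirs) / sizeof(dirs[0]); i++)
		assert(sketch_flow_to_policy_dir(
			       sketch_policy_to_flow_dir(dirs[i])) == dirs[i]);
}
```

With distinct values, passing the flow direction straight to a policy lookup would hand it 10, 11 or 12 instead of 0, 1 or 2, which is the mismatch the patch guards against.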
^ permalink raw reply related
* Re: [PATCH net] inet: limit length of fragment queue hash table bucket lists
From: Hannes Frederic Sowa @ 2013-03-19 14:22 UTC (permalink / raw)
To: Eric Dumazet; +Cc: David Miller, netdev, jbrouer
In-Reply-To: <1363702543.2558.2.camel@edumazet-glaptop>
On Tue, Mar 19, 2013 at 07:15:43AM -0700, Eric Dumazet wrote:
> On Tue, 2013-03-19 at 10:03 -0400, David Miller wrote:
> > From: Hannes Frederic Sowa <hannes@stressinduktion.org>
> > Date: Fri, 15 Mar 2013 22:32:30 +0100
> >
> > > This patch introduces a constant limit of the fragment queue hash
> > > table bucket list lengths. Currently the limit 128 is chosen somewhat
> > > arbitrarily and just ensures that we can fill up the fragment cache with
> > > empty packets up to the default ip_frag_high_thresh limits. It should
> > > just protect from list iteration eating considerable amounts of cpu.
> > >
> > > If we reach the maximum length in one hash bucket a warning is printed.
> > > This is implemented on the caller side of inet_frag_find to distinguish
> > > between the different users of inet_fragment.c.
> > >
> > > I dropped the out of memory warning in the ipv4 fragment lookup path,
> > > because we already get a warning by the slab allocator.
> > >
> > > Cc: Eric Dumazet <eric.dumazet@gmail.com>
> > > Cc: Jesper Dangaard Brouer <jbrouer@redhat.com>
> > > Signed-off-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
> >
> > This looks mostly fine to me, Eric could you give it a quick review?
> >
>
> Sure, it looks ok for me
>
> Acked-by: Eric Dumazet <edumazet@google.com>
>
> > Although one comment from me:
> >
> > > +/* averaged:
> > > + * max_depth = default ipfrag_high_thresh / INETFRAGS_HASHSZ /
> > > + * rounded up (SKB_TRUELEN(0) + sizeof(struct ipq or
> > > + * struct frag_queue))
> > > + */
> > > +#define INETFRAGS_MAXDEPTH 128
> >
> > If we deem this to be the ideal formula, maybe we can maintain it
> > accurately and very cheaply at run time. We'd do this by adding a
> > handler for the ipfrag_high_thresh sysctl, and use that to recalculate
> > the maxdepth any time ipfrag_high_thresh is changed by the user.
>
> This can probably be done in a second patch for net-next
I'll rebase the old patch introducing inet_frag_update_high_thresh on top
of this one. I think the dynamic update might be useful if we lower the
maxdepth limit in the future.
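
As a rough illustration of the formula in the quoted comment, the depth could be recomputed whenever the threshold changes along these lines. This is a sketch only: `sketch_frag_max_depth` and its parameters are hypothetical, the per-queue cost is assumed to be already rounded up by the caller, and the numbers in the usage note are example inputs rather than values taken from this thread.

```c
#include <assert.h>
#include <stddef.h>

/*
 * max_depth = high_thresh / hash_size / per_queue_bytes
 *
 * high_thresh:     memory limit for the fragment cache, in bytes
 * hash_size:       number of hash buckets
 * per_queue_bytes: assumed cost of one empty fragment queue, rounded up
 */
static size_t sketch_frag_max_depth(size_t high_thresh, size_t hash_size,
				    size_t per_queue_bytes)
{
	assert(hash_size > 0 && per_queue_bytes > 0);
	return high_thresh / hash_size / per_queue_bytes;
}
```

For example, with an assumed 4 MiB threshold, 64 buckets and 512 bytes per queue, the function returns 128; halving the threshold halves the result to 64.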
^ permalink raw reply
* Re: [PATCH 1/1] net/smsc911x: Use NULL instead of integer for pointer
From: David Miller @ 2013-03-19 14:20 UTC (permalink / raw)
To: sachin.kamat; +Cc: netdev, steve.glendinning
In-Reply-To: <1363676498-9370-1-git-send-email-sachin.kamat@linaro.org>
From: Sachin Kamat <sachin.kamat@linaro.org>
Date: Tue, 19 Mar 2013 12:31:38 +0530
> Silences the following sparse warning:
> drivers/net/ethernet/smsc/smsc911x.c:2145:30:
> warning: Using plain integer as NULL pointer
>
> Signed-off-by: Sachin Kamat <sachin.kamat@linaro.org>
Applied, thanks.
^ permalink raw reply
* Re: [PATCH net] inet: limit length of fragment queue hash table bucket lists
From: Jesper Dangaard Brouer @ 2013-03-19 14:20 UTC (permalink / raw)
To: David Miller; +Cc: hannes, netdev, eric.dumazet
In-Reply-To: <20130319.100324.927922515830950770.davem@davemloft.net>
On Tue, 2013-03-19 at 10:03 -0400, David Miller wrote:
> From: Hannes Frederic Sowa <hannes@stressinduktion.org>
> Date: Fri, 15 Mar 2013 22:32:30 +0100
>
> > This patch introduces a constant limit of the fragment queue hash
> > table bucket list lengths. Currently the limit 128 is chosen somewhat
> > arbitrarily and just ensures that we can fill up the fragment cache with
> > empty packets up to the default ip_frag_high_thresh limits. It should
> > just protect from list iteration eating considerable amounts of cpu.
> >
> > If we reach the maximum length in one hash bucket a warning is printed.
> > This is implemented on the caller side of inet_frag_find to distinguish
> > between the different users of inet_fragment.c.
> >
> > I dropped the out of memory warning in the ipv4 fragment lookup path,
> > because we already get a warning by the slab allocator.
> >
> > Cc: Eric Dumazet <eric.dumazet@gmail.com>
> > Cc: Jesper Dangaard Brouer <jbrouer@redhat.com>
> > Signed-off-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
>
> This looks mostly fine to me, Eric could you give it a quick review?
>
> Although one comment from me:
>
> > +/* averaged:
> > + * max_depth = default ipfrag_high_thresh / INETFRAGS_HASHSZ /
> > + * rounded up (SKB_TRUELEN(0) + sizeof(struct ipq or
> > + * struct frag_queue))
> > + */
> > +#define INETFRAGS_MAXDEPTH 128
>
> If we deem this to be the ideal formula, maybe we can maintain it
> accurately and very cheaply at run time. We'd do this by adding a
> handler for the ipfrag_high_thresh sysctl, and use that to recalculate
> the maxdepth any time ipfrag_high_thresh is changed by the user.
I think it's overkill to implement this now. I just want this patch in
as a safeguard.
The idea I discussed with Eric will remove the need for this patch.
The idea is to drop the LRU lists, increase the hash size a bit, and do
cleanup/eviction directly on the frag hash tables. And e.g. only allow
5 frag queue elements in each hash bucket... but more work and testing
is needed before I have something ready.
--
Best regards,
Jesper Dangaard Brouer
MSc.CS, Sr. Network Kernel Developer at Red Hat
Author of http://www.iptv-analyzer.org
LinkedIn: http://www.linkedin.com/in/brouer
^ permalink raw reply