* Linux IP forwarding performance benchmarks
From: Oleg Arkhangelsky @ 2012-12-08 12:01 UTC (permalink / raw)
To: netdev
Hello,
Does anyone have some Linux IP forwarding performance
benchmarks on Nehalem Xeon platform versus Sandy
Bridge Xeon E5? Intel DDIO looks pretty tasty and should
give significant performance boost (at least in theory)
but currently we don't have this hardware at hand to
test, so I'm asking here.
Thank you!
--
wbr, Oleg.
"Anarchy is about taking complete responsibility for yourself."
Alan Moore.
^ permalink raw reply
* Re: [PATCH net-next 2/9] tipc: eliminate aggregate sk_receive_queue limit
From: Neil Horman @ 2012-12-08 14:07 UTC (permalink / raw)
To: Paul Gortmaker; +Cc: David Miller, netdev, Jon Maloy, Ying Xue
In-Reply-To: <1354929558-16948-3-git-send-email-paul.gortmaker@windriver.com>
On Fri, Dec 07, 2012 at 08:19:11PM -0500, Paul Gortmaker wrote:
> From: Ying Xue <ying.xue@windriver.com>
>
> As a complement to the per-socket sk_recv_queue limit, TIPC keeps a
> global atomic counter for the sum of sk_recv_queue sizes across all
> tipc sockets. When incremented, the counter is compared to an upper
> threshold value, and if this is reached, the message is rejected
> with error code TIPC_OVERLOAD.
>
> This check was originally meant to protect the node against
> buffer exhaustion and general CPU overload. However, all experience
> indicates that the feature not only is redundant on Linux, but even
> harmful. Users run into the limit very often, causing disturbances
> for their applications, while removing it seems to have no negative
> effects at all. We have also seen that overall performance is
> boosted significantly when this bottleneck is removed.
>
> Furthermore, we don't see any other network protocols maintaining
> such a mechanism, something strengthening our conviction that this
> control can be eliminated.
>
> As a result, the atomic variable tipc_queue_size is now unused
> and so it can be deleted. There is a getsockopt call that used
> to allow reading it; we retain that but just return zero for
> maximum compatibility.
>
> Signed-off-by: Ying Xue <ying.xue@windriver.com>
> Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
> Cc: Neil Horman <nhorman@tuxdriver.com>
> [PG: phase out tipc_queue_size as pointed out by Neil Horman]
> Signed-off-by: Paul Gortmaker <paul.gortmaker@windriver.com>
> ---
> net/tipc/socket.c | 23 ++++-------------------
> 1 file changed, 4 insertions(+), 19 deletions(-)
>
> diff --git a/net/tipc/socket.c b/net/tipc/socket.c
> index 1a720c8..848be69 100644
> --- a/net/tipc/socket.c
> +++ b/net/tipc/socket.c
> @@ -2,7 +2,7 @@
> * net/tipc/socket.c: TIPC socket API
> *
> * Copyright (c) 2001-2007, Ericsson AB
> - * Copyright (c) 2004-2008, 2010-2011, Wind River Systems
> + * Copyright (c) 2004-2008, 2010-2012, Wind River Systems
> * All rights reserved.
> *
> * Redistribution and use in source and binary forms, with or without
> @@ -73,8 +73,6 @@ static struct proto tipc_proto;
>
> static int sockets_enabled;
>
> -static atomic_t tipc_queue_size = ATOMIC_INIT(0);
> -
> /*
> * Revised TIPC socket locking policy:
> *
> @@ -128,7 +126,6 @@ static atomic_t tipc_queue_size = ATOMIC_INIT(0);
> static void advance_rx_queue(struct sock *sk)
> {
> kfree_skb(__skb_dequeue(&sk->sk_receive_queue));
> - atomic_dec(&tipc_queue_size);
> }
>
> /**
> @@ -140,10 +137,8 @@ static void discard_rx_queue(struct sock *sk)
> {
> struct sk_buff *buf;
>
> - while ((buf = __skb_dequeue(&sk->sk_receive_queue))) {
> - atomic_dec(&tipc_queue_size);
> + while ((buf = __skb_dequeue(&sk->sk_receive_queue)))
> kfree_skb(buf);
> - }
> }
>
> /**
> @@ -155,10 +150,8 @@ static void reject_rx_queue(struct sock *sk)
> {
> struct sk_buff *buf;
>
> - while ((buf = __skb_dequeue(&sk->sk_receive_queue))) {
> + while ((buf = __skb_dequeue(&sk->sk_receive_queue)))
> tipc_reject_msg(buf, TIPC_ERR_NO_PORT);
> - atomic_dec(&tipc_queue_size);
> - }
> }
>
> /**
> @@ -280,7 +273,6 @@ static int release(struct socket *sock)
> buf = __skb_dequeue(&sk->sk_receive_queue);
> if (buf == NULL)
> break;
> - atomic_dec(&tipc_queue_size);
> if (TIPC_SKB_CB(buf)->handle != 0)
> kfree_skb(buf);
> else {
> @@ -1241,11 +1233,6 @@ static u32 filter_rcv(struct sock *sk, struct sk_buff *buf)
> }
>
> /* Reject message if there isn't room to queue it */
> - recv_q_len = (u32)atomic_read(&tipc_queue_size);
> - if (unlikely(recv_q_len >= OVERLOAD_LIMIT_BASE)) {
> - if (rx_queue_full(msg, recv_q_len, OVERLOAD_LIMIT_BASE))
> - return TIPC_ERR_OVERLOAD;
> - }
> recv_q_len = skb_queue_len(&sk->sk_receive_queue);
> if (unlikely(recv_q_len >= (OVERLOAD_LIMIT_BASE / 2))) {
> if (rx_queue_full(msg, recv_q_len, OVERLOAD_LIMIT_BASE / 2))
> @@ -1254,7 +1241,6 @@ static u32 filter_rcv(struct sock *sk, struct sk_buff *buf)
>
> /* Enqueue message (finally!) */
> TIPC_SKB_CB(buf)->handle = 0;
> - atomic_inc(&tipc_queue_size);
> __skb_queue_tail(&sk->sk_receive_queue, buf);
>
> /* Initiate connection termination for an incoming 'FIN' */
> @@ -1578,7 +1564,6 @@ restart:
> /* Disconnect and send a 'FIN+' or 'FIN-' message to peer */
> buf = __skb_dequeue(&sk->sk_receive_queue);
> if (buf) {
> - atomic_dec(&tipc_queue_size);
> if (TIPC_SKB_CB(buf)->handle != 0) {
> kfree_skb(buf);
> goto restart;
> @@ -1717,7 +1702,7 @@ static int getsockopt(struct socket *sock,
> /* no need to set "res", since already 0 at this point */
> break;
> case TIPC_NODE_RECVQ_DEPTH:
> - value = (u32)atomic_read(&tipc_queue_size);
> + value = 0; /* was tipc_queue_size, now obsolete */
> break;
> case TIPC_SOCK_RECVQ_DEPTH:
> value = skb_queue_len(&sk->sk_receive_queue);
> --
> 1.7.12.1
>
>
Thank you, looks good
Acked-by: Neil Horman <nhorman@tuxdriver.com>
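
For anyone following along, here is a minimal userspace sketch of why the aggregate limit hurts. It is a simplified model, not the TIPC code: LIMIT_BASE, struct model_node, struct model_sock and both enqueue helpers are hypothetical names, the threshold value is made up, and the importance-based rx_queue_full() test from the real code is left out entirely.

```c
#define LIMIT_BASE 8u	/* hypothetical stand-in for OVERLOAD_LIMIT_BASE */

struct model_node { unsigned int total_queued; };	/* node-wide counter */
struct model_sock { unsigned int queued; };		/* per-socket depth  */

/* Old behaviour (simplified): reject on node-wide total OR per-socket depth. */
int enqueue_with_aggregate_limit(struct model_node *node, struct model_sock *sk)
{
	if (node->total_queued >= LIMIT_BASE)
		return -1;	/* other sockets filled the node-wide budget */
	if (sk->queued >= LIMIT_BASE / 2)
		return -1;	/* this socket's own queue is full */
	sk->queued++;
	node->total_queued++;
	return 0;
}

/* New behaviour (simplified): only the socket's own queue depth matters. */
int enqueue_per_socket_only(struct model_sock *sk)
{
	if (sk->queued >= LIMIT_BASE / 2)
		return -1;
	sk->queued++;
	return 0;
}
```

In this model, once two sockets have filled their own queues, a third and completely idle socket is rejected by the aggregate check, while the per-socket-only variant still accepts its first message.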
^ permalink raw reply
* Re: [PATCH 2/2] netfilter: add xt_bpf xtables match
From: Daniel Borkmann @ 2012-12-08 16:02 UTC (permalink / raw)
To: Pablo Neira Ayuso
Cc: Willem de Bruijn, netfilter-devel, netdev, Eric Dumazet,
David Miller, kaber
In-Reply-To: <20121208033111.GB28114@1984>
On Sat, Dec 8, 2012 at 4:31 AM, Pablo Neira Ayuso <pablo@netfilter.org> wrote:
> On Fri, Dec 07, 2012 at 11:56:05AM -0500, Willem de Bruijn wrote:
>> On Fri, Dec 7, 2012 at 8:16 AM, Pablo Neira Ayuso <pablo@netfilter.org> wrote:
>> > On Wed, Dec 05, 2012 at 03:10:13PM -0500, Willem de Bruijn wrote:
>> >> On Wed, Dec 5, 2012 at 2:48 PM, Pablo Neira Ayuso <pablo@netfilter.org> wrote:
>> >> > Hi Willem,
>> >> >
>> >> > On Wed, Dec 05, 2012 at 02:22:19PM -0500, Willem de Bruijn wrote:
>> >> >> A new match that executes sk_run_filter on every packet. BPF filters
>> >> >> can access skbuff fields that are out of scope for existing iptables
>> >> >> rules, allow more expressive logic, and on platforms with JIT support
>> >> >> can even be faster.
>> >> >>
>> >> >> I have a corresponding iptables patch that takes `tcpdump -ddd`
>> >> >> output, as used in the examples below. The two parts communicate
>> >> >> using a variable length structure. This is similar to ebt_among,
>> >> >> but new for iptables.
>> >> >>
>> >> >> Verified functionality by inserting an ip source filter on chain
>> >> >> INPUT and an ip dest filter on chain OUTPUT and noting that ping
>> >> >> failed while a rule was active:
>> >> >>
>> >> >> iptables -v -A INPUT -m bpf --bytecode '4,32 0 0 12,21 0 1 $SADDR,6 0 0 96,6 0 0 0,' -j DROP
>> >> >> iptables -v -A OUTPUT -m bpf --bytecode '4,32 0 0 16,21 0 1 $DADDR,6 0 0 96,6 0 0 0,' -j DROP
>> >> >
>> >> > I like this BPF idea for iptables.
>> >> >
>> >> > I made a similar extension some time ago, but it took a file as a
>> >> > parameter. That file contained BPF code. I made a simple bison
>> >> > parser that takes BPF code and puts it into the BPF array of
>> >> > instructions. It would be a bit more intuitive to define a filter
>> >> > that way, and we can distribute it with iptables.
>> >>
>> >> That's cleaner, indeed. I actually like how tcpdump operates as a
>> >> code generator if you pass -ddd. Unfortunately, it generates code only
>> >> for link layer types of its supported devices, such as DLT_EN10MB and
>> >> DLT_LINUX_SLL. The network layer interface of basic iptables
>> >> (forgetting device dependent mechanisms as used in xt_mac) is DLT_RAW,
>> >> but that is rarely supported.
>> >
>> > Indeed, you'll have to hack on tcpdump to select the offset. In
>> > iptables the base is the layer 3 header. With that change you could
>> > use tcpdump to generate code automagically from its syntax.
>> >
>> >> > Let me check on my internal trees, I can put that user-space code
>> >> > somewhere in case you're interested.
>> >>
>> >> Absolutely. I'll be happy to revise to get it in. I'm also considering
>> >> sending a patch to tcpdump to make it generate code independent of the
>> >> installed hardware when specifying -y.
>> >
>> > I found a version of the old parser code I made:
>> >
>> > http://1984.lsi.us.es/git/nfbpf/
>> >
>> > It interprets a filter expressed in a similar way to tcpdump -dd but
>> > it's using the BPF constants. It's quite preliminary and simple if you
>> > look at the code.
>> >
>> > Extending it to interpret some syntax similar to tcpdump -d would
>> > make the BPF filter even more readable.
>> >
>> > Some time ago I also thought about taking the kernel code that checks
>> > that the filter is correct. Currently you get -EINVAL if you pass a
>> > handcrafted filter which is incorrect, so it's a hard task to debug
>> > what you did wrong.
>> >
>> > It could be added to the iptables tree. Or, if it is generic enough for
>> > BPF and worth the effort, just provide some small library that iptables
>> > can link with and a small compiler/checker to help people develop BPF
>> > filters.
>>
>> Or use pcap_compile? I went with the tcpdump output to avoid
>> introducing a direct dependency on pcap to iptables. One possible
>> downside I see to pcap_compile vs. developing from scratch is that it
>> might lag in supporting the LSF ancillary data fields.
>
> I suggest putting the code of that preliminary nfbpf utility into
> iptables so that it can read BPF filters from a file and put them
> into the BPF array of instructions. I can help with that.
>
>> > Back to your xt_bpf thing, we can use the file containing the code
>> > instead:
>> >
>> > iptables -v -A INPUT -m bpf --bytecode-file filter1.bpf -j DROP
>> > iptables -v -A OUTPUT -m bpf --bytecode-file filter2.bpf -j DROP
>> >
>> > We can still allow the inlined filter via --bytecode if you want.
>>
>> I'll add that. I'd like to keep --bytecode to be able to generate the
>> code inline using backticks.
>
> As said, I'm fine with that, but I'll be really happy if we can
> provide some utility to generate that code using backticks for the
> masses (in case they want to pass it inlined in that format).
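
To make the inline format concrete, here is a minimal sketch of a parser for the --bytecode string as it appears in the examples quoted above: an instruction count, then one "code jt jf k" tuple per instruction, each followed by a comma. The names parsed_insn, next_field and parse_bytecode are hypothetical and this is not the iptables code. It assumes decimal fields, single separators and a trailing comma, as in the examples, and that $SADDR expands to a decimal 32-bit value.

```c
#include <ctype.h>
#include <errno.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical mirror of one classic BPF instruction. */
struct parsed_insn {
	uint16_t code;
	uint8_t  jt;
	uint8_t  jf;
	uint32_t k;
};

/* Read one decimal field that must be followed by 'sep'; advance *p past it. */
static int next_field(const char **p, char sep, unsigned long max,
		      unsigned long *val)
{
	char *end;

	if (!isdigit((unsigned char)**p))
		return -1;
	errno = 0;
	*val = strtoul(*p, &end, 10);
	if (errno || *val > max || *end != sep)
		return -1;
	*p = end + 1;
	return 0;
}

/* Returns the number of instructions parsed, or -1 on malformed input. */
int parse_bytecode(const char *s, struct parsed_insn *out, int max_insns)
{
	unsigned long count, code, jt, jf, k;
	unsigned long i;

	if (max_insns < 0 ||
	    next_field(&s, ',', (unsigned long)max_insns, &count))
		return -1;
	for (i = 0; i < count; i++) {
		if (next_field(&s, ' ', 0xffff, &code) ||
		    next_field(&s, ' ', 0xff, &jt) ||
		    next_field(&s, ' ', 0xff, &jf) ||
		    next_field(&s, ',', 0xffffffffUL, &k))
			return -1;
		out[i].code = (uint16_t)code;
		out[i].jt = (uint8_t)jt;
		out[i].jf = (uint8_t)jf;
		out[i].k = (uint32_t)k;
	}
	/* Anything left over means the count and the tuples disagree. */
	return *s == '\0' ? (int)count : -1;
}
```

With $SADDR replaced by 3232235777 (192.168.1.1 as a decimal 32-bit value), the INPUT example parses to four instructions; a string whose count does not match the number of tuples is rejected.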
If it helps, you could use "bpfc", or rip-off its code to not have a
dependency; it's part of the netsniff-ng toolkit.
It can be used like:
bpfc examples/bpfc/arp.bpf
{ 0x28, 0, 0, 0x0000000c },
{ 0x15, 0, 1, 0x00000806 },
{ 0x6, 0, 0, 0xffffffff },
{ 0x6, 0, 0, 0x00000000 },
where arp.bpf is, for instance:
_main:
ldh [12]
jeq #0x806, keep, drop
keep:
ret #0xffffffff
drop:
ret #0
"Core" files are: src/bpf_lexer.l, src/bpf_parser.y
It also supports all Linux ANC-operations that were added to the
kernel (like VLAN, XOR and so on). I started on a higher-level language
that would translate into something like the example above (which in
turn translates to opcodes), but didn't have time to continue it.
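
To show what those four opcodes do, here is a minimal userspace sketch of an interpreter covering only the instructions used in this thread (0x20 load word, 0x28 load half-word, 0x15 jump-if-equal, 0x06 return). sketch_insn, arp_prog and sketch_run_filter are hypothetical names; this is neither bpfc nor the kernel's sk_run_filter, and returning 0 on an out-of-bounds load or an unknown opcode is a choice made for this sketch.

```c
#include <stddef.h>
#include <stdint.h>

struct sketch_insn {
	uint16_t code;
	uint8_t  jt;
	uint8_t  jf;
	uint32_t k;
};

/* The four instructions printed by bpfc for arp.bpf above. */
const struct sketch_insn arp_prog[4] = {
	{ 0x28, 0, 0, 0x0000000c },	/* ldh [12]               */
	{ 0x15, 0, 1, 0x00000806 },	/* jeq #0x806, keep, drop */
	{ 0x06, 0, 0, 0xffffffff },	/* keep: ret #0xffffffff  */
	{ 0x06, 0, 0, 0x00000000 },	/* drop: ret #0           */
};

/* Run prog over pkt; the return value is the filter's verdict. */
uint32_t sketch_run_filter(const struct sketch_insn *prog, size_t len,
			   const uint8_t *pkt, size_t pkt_len)
{
	uint32_t a = 0;		/* accumulator */
	size_t pc = 0;

	while (pc < len) {
		const struct sketch_insn *in = &prog[pc++];

		switch (in->code) {
		case 0x20:	/* A = big-endian 32-bit word at pkt[k] */
			if (pkt_len < 4 || in->k > pkt_len - 4)
				return 0;
			a = (uint32_t)pkt[in->k] << 24 |
			    (uint32_t)pkt[in->k + 1] << 16 |
			    (uint32_t)pkt[in->k + 2] << 8 |
			    (uint32_t)pkt[in->k + 3];
			break;
		case 0x28:	/* A = big-endian 16-bit half-word at pkt[k] */
			if (pkt_len < 2 || in->k > pkt_len - 2)
				return 0;
			a = (uint32_t)pkt[in->k] << 8 | pkt[in->k + 1];
			break;
		case 0x15:	/* skip jt instructions if A == k, else jf */
			pc += (a == in->k) ? in->jt : in->jf;
			break;
		case 0x06:	/* return the constant k */
			return in->k;
		default:	/* opcode outside this sketch's subset */
			return 0;
		}
	}
	return 0;
}
```

Run over a 14-byte Ethernet header, the program returns 0xffffffff when bytes 12 and 13 hold 0x0806 (ARP) and 0 otherwise. Offset 12 assumes the buffer starts at the link-layer header, which is the offset issue discussed earlier in the thread for iptables.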
^ permalink raw reply
* Re: Linux IP forwarding performance benchmarks
From: Ben Greear @ 2012-12-08 18:44 UTC (permalink / raw)
To: Oleg Arkhangelsky; +Cc: netdev
In-Reply-To: <1219381354968093@web3d.yandex.ru>
On 12/08/2012 04:01 AM, Oleg Arkhangelsky wrote:
> Hello,
>
> Does anyone have some Linux IP forwarding performance
> benchmarks on Nehalem Xeon platform versus Sandy
> Bridge Xeon E5? Intel DDIO looks pretty tasty and should
> give significant performance boost (at least in theory)
> but currently we don't have this hardware at hand to
> test, so I'm asking here.
Well, I don't have forwarding numbers, but using a modified pktgen,
we can send and receive about 800,000 packets per second on each of
4 10G NICs in our E5 test system.
Our modified pktgen is probably slower than the upstream, as it has
a bunch of rx logic in it as well...
And, for our network emulator module (like a bridge, mostly), we can
get 7-9.8Gbps bi-directional throughput, depending on some issues
due to IRQ pinning and the spread among the rx-queues, it seems.
Thanks,
Ben
>
> Thank you!
>
--
Ben Greear <greearb@candelatech.com>
Candela Technologies Inc http://www.candelatech.com
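
For scale, a back-of-the-envelope sketch of Ethernet line rate in packets per second. The message above does not state the packet size used, so the frame sizes below are assumptions; line_rate_pps and ETH_WIRE_OVERHEAD are hypothetical names, and frame_bytes is taken to include the Ethernet header and FCS.

```c
#include <stdint.h>

/* Per-frame bytes on the wire that are not part of the frame itself:
 * 7-byte preamble + 1-byte start-of-frame delimiter + 12-byte inter-frame gap.
 */
#define ETH_WIRE_OVERHEAD 20u

/* Theoretical maximum frames per second for a given link speed and frame size. */
uint64_t line_rate_pps(uint64_t link_bps, uint32_t frame_bytes)
{
	uint64_t bits_per_frame = ((uint64_t)frame_bytes + ETH_WIRE_OVERHEAD) * 8;

	return link_bps / bits_per_frame;
}
```

Under these assumptions, a 10G link carries at most about 812,743 full-size (1518-byte) frames per second and about 14,880,952 minimum-size (64-byte) frames per second, so what a figure of 800,000 packets per second means depends heavily on the packet size.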
^ permalink raw reply
* ipgre rss is broken since gro
From: Dmitry Kravkov @ 2012-12-08 22:35 UTC (permalink / raw)
To: Eric Dumazet, netdev@vger.kernel.org
Hi Eric,
I'm trying to use GRE with RSS, but it looks broken on net-next since:
60769a5dcd8755715c7143b4571d5c44f01796f1 is the first bad commit
commit 60769a5dcd8755715c7143b4571d5c44f01796f1
Author: Eric Dumazet <edumazet@google.com>
Date: Thu Sep 27 02:48:50 2012 +0000
ipv4: gre: add GRO capability
Add GRO capability to IPv4 GRE tunnels, using the gro_cells
infrastructure.
Tested using IPv4 and IPv6 TCP traffic inside this tunnel, and
checking GRO is building large packets.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
:040000 040000 8eb4f570181b6d72abe24f8c1123b7e49134e662 fa20194bb14d1745e9271c8a962d0f140801a226 M include
:040000 040000 6f605ade7fed9fbe5fd57d4a0c3a8dc687e64ed6 4c06b880a6a6068aa791decb900f7b449c6ec7b5 M net
Multiple TCP streams over the tunnel cause the GRE interface to drop all ingress packets almost immediately.
Please note that at the current net-next head the behavior is different: I hit a NULL pointer dereference. I will try to bisect that behavior too.
^ permalink raw reply
* [PATCH] ssb: use WARN in main.c
From: Cong Ding @ 2012-12-08 23:11 UTC (permalink / raw)
To: Michael Buesch, netdev, linux-kernel; +Cc: Cong Ding
Use WARN rather than printk followed by WARN_ON(1), for conciseness.
Signed-off-by: Cong Ding <dinggnu@gmail.com>
---
drivers/ssb/main.c | 3 +--
1 files changed, 1 insertions(+), 2 deletions(-)
diff --git a/drivers/ssb/main.c b/drivers/ssb/main.c
index bd7115c..c82c5c9 100644
--- a/drivers/ssb/main.c
+++ b/drivers/ssb/main.c
@@ -1133,8 +1133,7 @@ static u32 ssb_tmslow_reject_bitmask(struct ssb_device *dev)
case SSB_IDLOW_SSBREV_27: /* same here */
return SSB_TMSLOW_REJECT; /* this is a guess */
default:
- printk(KERN_INFO "ssb: Backplane Revision 0x%.8X\n", rev);
- WARN_ON(1);
+ WARN(1, KERN_INFO "ssb: Backplane Revision 0x%.8X\n", rev);
}
return (SSB_TMSLOW_REJECT | SSB_TMSLOW_REJECT_23);
}
--
1.7.4.5
^ permalink raw reply related
* RE: ipgre rss is broken since gro
From: Dmitry Kravkov @ 2012-12-08 23:31 UTC (permalink / raw)
To: Dmitry Kravkov, Eric Dumazet, netdev@vger.kernel.org
In-Reply-To: <504C9EFCA2D0054393414C9CB605C37F1BFB80B2@SJEXCHMB06.corp.ad.broadcom.com>
> -----Original Message-----
> From: netdev-owner@vger.kernel.org [mailto:netdev-owner@vger.kernel.org]
> On Behalf Of Dmitry Kravkov
> Sent: Sunday, December 09, 2012 12:35 AM
> To: Eric Dumazet; netdev@vger.kernel.org
> Subject: ipgre rss is broken since gro
>
> Please note that at current net-next head behavior is different - I hit null pointer
> dereference, I will try to bisect this behavior too.
Here is the trace in the meantime:
BUG: unable to handle kernel NULL pointer dereference at (null)
IP: [<ffffffff8144f35e>] skb_gro_receive+0xbe/0x5a0
PGD 0
Oops: 0002 [#1] SMP
Modules linked in: ip_gre gre bnx2x(O) netconsole configfs ip6table_filter ip6_tables ebtable_nat ebtables nf_conntrack_ipv4 nf_defrag_ipv4 xt_state nf_conntrack ipt_REJECT xt_CHECKSUM iptable_mangle iptable_filter ip_tables bridge stp llc autofs4 sunrpc cpufreq_ondemand acpi_cpufreq freq_table mperf ipv6 dm_mirror dm_region_hash dm_log dm_mod vhost_net macvtap macvlan tun uinput iTCO_wdt iTCO_vendor_support sg coretemp hwmon kvm_intel kvm crc32c_intel ghash_clmulni_intel microcode serio_raw pcspkr snd_hda_intel snd_hda_codec snd_hwdep snd_seq snd_seq_device snd_pcm snd_timer snd soundcore snd_page_alloc i7core_edac edac_core i2c_i801 i2c_core lpc_ich mfd_core igb dca ptp pps_core libcrc32c mdio ext3 jbd mbcache sr_mod cdrom sd_mod crc_t10dif aesni_intel ablk_helper cryptd lrw aes_x86_64 xts gf128mul pata_acpi ata_generic ata_piix [last unloaded: bnx2x]
CPU 0
Pid: 0, comm: swapper/0 Tainted: G O 3.7.0-rc7+ #38 Supermicro X8QB6/X8QB6
RIP: 0010:[<ffffffff8144f35e>] [<ffffffff8144f35e>] skb_gro_receive+0xbe/0x5a0
RSP: 0018:ffff88047f803c80 EFLAGS: 00010286
RAX: 0000000000000000 RBX: ffff88046cc557c0 RCX: 0000000000001c04
RDX: 0000000000000900 RSI: 0000000000000000 RDI: ffff88046e37d800
RBP: ffff88047f803cf0 R08: ffff88046cc557e8 R09: ffff88046e37dec0
R10: 00000000000005c4 R11: ffff88046d872ec0 R12: ffff88046e013480
R13: 0000000000000034 R14: 0000000000000590 R15: ffff880466b9dc50
FS: 0000000000000000(0000) GS:ffff88047f800000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b
CR2: 0000000000000000 CR3: 0000000001a0b000 CR4: 00000000000007f0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
Process swapper/0 (pid: 0, threadinfo ffffffff81a00000, task ffffffff81a13420)
Stack:
ffff880469cc79a8 ffff880866296800 0000000000000482 ffff880469cc7980
ffff88047f803cd0 ffffffff81461558 ffff880866296800 ffff88046d452080
ffff88086ce601c8 ffff88046e013480 0000000000000590 ffff88046cc557e8
Call Trace:
<IRQ>
[<ffffffff81461558>] ? napi_gro_receive+0x238/0x270
[<ffffffff814a3ef1>] tcp_gro_receive+0x271/0x2d0
[<ffffffff814b6aa0>] tcp4_gro_receive+0xb0/0x130
[<ffffffff814cd02a>] inet_gro_receive+0x16a/0x210
[<ffffffff81460c79>] dev_gro_receive+0x1c9/0x2d0
[<ffffffff8146144b>] napi_gro_receive+0x12b/0x270
[<ffffffffa00daace>] gro_cell_poll+0x2e/0x60 [ip_gre]
[<ffffffff81460f73>] net_rx_action+0x103/0x280
[<ffffffff8105dfd7>] __do_softirq+0xd7/0x240
[<ffffffff815362dc>] call_softirq+0x1c/0x30
[<ffffffff810164a5>] do_softirq+0x65/0xa0
[<ffffffff8105ddbd>] irq_exit+0xbd/0xe0
[<ffffffff81536b66>] do_IRQ+0x66/0xe0
[<ffffffff8152ca2d>] common_interrupt+0x6d/0x6d
<EOI>
[<ffffffff8107e4df>] ? __hrtimer_start_range_ns+0x18f/0x420
[<ffffffff812bccc1>] ? intel_idle+0xe1/0x150
[<ffffffff812bcca7>] ? intel_idle+0xc7/0x150
[<ffffffff8141df79>] cpuidle_enter+0x19/0x20
[<ffffffff8141df97>] cpuidle_enter_state+0x17/0x50
[<ffffffff8141e8af>] cpuidle_idle_call+0xcf/0x1a0
[<ffffffff8101cd2f>] cpu_idle+0xcf/0x120
[<ffffffff815110e5>] rest_init+0x75/0x80
[<ffffffff81b05f10>] start_kernel+0x3da/0x3e7
[<ffffffff81b05954>] ? repair_env_string+0x5b/0x5b
[<ffffffff81b05356>] x86_64_start_reservations+0x131/0x136
[<ffffffff81b0545e>] x86_64_start_kernel+0x103/0x112
Code: e8 00 00 00 0f 87 8b 00 00 00 8b 43 68 44 29 e8 3b 43 6c 89 43 68 0f 82 c7 04 00 00 45 89 ed 4c 01 ab e0 00 00 00 49 8b 44 24 08 <48> 89 18 49 89 5c 24 08 0f b6 43 7c a8 10 0f 85 a8 04 00 00 83
RIP [<ffffffff8144f35e>] skb_gro_receive+0xbe/0x5a0
RSP <ffff88047f803c80>
CR2: 0000000000000000
---[ end trace e828b50927d09339 ]---
Kernel panic - not syncing: Fatal exception in interrupt
>
^ permalink raw reply
* Re: [PATCH v2 net-next 0/9] tipc: more updates for the v3.8 content
From: David Miller @ 2012-12-09 1:26 UTC (permalink / raw)
To: paul.gortmaker; +Cc: netdev, jon.maloy
In-Reply-To: <1354929558-16948-1-git-send-email-paul.gortmaker@windriver.com>
From: Paul Gortmaker <paul.gortmaker@windriver.com>
Date: Fri, 7 Dec 2012 20:19:09 -0500
> The following changes since commit b93196dc5af7729ff7cc50d3d322ab1a364aa14f:
>
> net: fix some compiler warning in net/core/neighbour.c (2012-12-05 21:50:37 -0500)
>
> are available in the git repository at:
>
> git://git.kernel.org/pub/scm/linux/kernel/git/paulg/linux.git tipc_net-next_v2
Pulled, thanks Paul.
^ permalink raw reply
* Re: ipgre rss is broken since gro
From: Eric Dumazet @ 2012-12-09 2:01 UTC (permalink / raw)
To: Dmitry Kravkov; +Cc: netdev@vger.kernel.org
In-Reply-To: <504C9EFCA2D0054393414C9CB605C37F1BFB80D8@SJEXCHMB06.corp.ad.broadcom.com>
On Sat, Dec 8, 2012 at 3:31 PM, Dmitry Kravkov <dmitry@broadcom.com> wrote:
> Here is the trace for a while:
>
> BUG: unable to handle kernel NULL pointer dereference at (null)
> IP: [<ffffffff8144f35e>] skb_gro_receive+0xbe/0x5a0
Hi Dmitry
NULL pointer deref probably fixed on net tree (or Linus tree) by
http://git.kernel.org/?p=linux/kernel/git/davem/net.git;a=commit;h=c3c7c254b2e8cd99b0adf288c2a1bddacd7ba255
For the GRO stuff and RSS, I wonder if skbs have a property that makes
them dropped somewhere, you might try drop_monitor / drop_watch
^ permalink raw reply
* Re: ipgre rss is broken since gro
From: Dmitry Kravkov @ 2012-12-09 2:04 UTC (permalink / raw)
To: Eric Dumazet; +Cc: netdev@vger.kernel.org
In-Reply-To: <CANn89iKE2RxRBWRuXGwjydQKjPLhPbcoZaJS5rcBKbJWWnL4ZQ@mail.gmail.com>
On Sat, 2012-12-08 at 18:01 -0800, Eric Dumazet wrote:
> On Sat, Dec 8, 2012 at 3:31 PM, Dmitry Kravkov <dmitry@broadcom.com> wrote:
> > Here is the trace for a while:
> >
> > BUG: unable to handle kernel NULL pointer dereference at (null)
> > IP: [<ffffffff8144f35e>] skb_gro_receive+0xbe/0x5a0
>
> Hi Dmitry
>
> NULL pointer deref probably fixed on net tree (or Linus tree) by
>
> http://git.kernel.org/?p=linux/kernel/git/davem/net.git;a=commit;h=c3c7c254b2e8cd99b0adf288c2a1bddacd7ba255
>
> For the GRO stuff and RSS, I wonder if skbs have a property that makes
> them dropped somewhere, you might try drop_monitor / drop_watch
>
I will try both and update ...
Thanks
^ permalink raw reply
* Re: GPF in ip6_dst_lookup_tail
From: Eric Dumazet @ 2012-12-09 2:04 UTC (permalink / raw)
To: Dave Jones; +Cc: netdev
In-Reply-To: <20121207141525.GA20613@redhat.com>
On Fri, Dec 7, 2012 at 6:15 AM, Dave Jones <davej@redhat.com> wrote:
> I just hit this gpf in overnight testing.
>
> general protection fault: 0000 [#1] PREEMPT SMP
> Modules linked in: sctp libcrc32c ipt_ULOG fuse binfmt_misc nfnetlink nfc caif_socket caif phonet bluetooth rfkill can llc2 pppoe pppox ppp_generic slhc irda crc_ccitt rds af_key decnet rose x25 atm netrom appletalk ipx p8023 psnap p8022 llc ax25 nfsv3 nfs_acl nfs fscache lockd sunrpc ip6t_REJECT nf_conntrack_ipv6 nf_defrag_ipv6 xt_conntrack nf_conntrack ip6table_filter ip6_tables usb_debug microcode serio_raw pcspkr i2c_piix4 i2c_core r8169 mii vhost_net tun macvtap macvlan kvm_amd kvm
> CPU 3
> Pid: 19371, comm: trinity-child3 Not tainted 3.7.0-rc8+ #6 Gigabyte Technology Co., Ltd. GA-MA78GM-S2H/GA-MA78GM-S2H
> RIP: 0010:[<ffffffff815fe038>] [<ffffffff815fe038>] ip6_dst_lookup_tail+0xe8/0x200
> RSP: 0018:ffff880046277968 EFLAGS: 00010206
> RAX: 2000000000000011 RBX: 0000000000000000 RCX: 0000000000000000
> RDX: ffff880046277b00 RSI: ffff880070a6cd80 RDI: ffff8800bb22d000
> RBP: ffff8800462779f8 R08: 0000000000000001 R09: 0000000000000000
> R10: 0000000000000001 R11: 0000000000000000 R12: ffff880046277a10
> R13: ffff880046277b00 R14: ffff8800bb22d000 R15: ffffffff81cb40c0
> FS: 00007f0817b3f740(0000) GS:ffff88012b400000(0000) knlGS:0000000000000000
> CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> CR2: 0000000000000004 CR3: 00000000a7dfb000 CR4: 00000000000007e0
> DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
> DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
> Process trinity-child3 (pid: 19371, threadinfo ffff880046276000, task ffff880044dca450)
> Stack:
> ffff8800bb22d000 0000000000000002 0000000000000001 0000000000000000
> ffff880046277998 ffffffff81335b56 ffff8800462779b8 ffffffff81074cec
> 0000000000000001 ffff880070a6cd80 ffff8800462779f8 00000000ae4317df
> Call Trace:
> [<ffffffff81335b56>] ? __const_udelay+0x36/0x40
> [<ffffffff81074cec>] ? __rcu_read_unlock+0x5c/0xa0
> [<ffffffff81527945>] ? sk_dst_check+0x5/0x260
> [<ffffffff815fe3fb>] ip6_sk_dst_lookup_flow+0xcb/0x1b0
> [<ffffffff8161f45e>] udpv6_sendmsg+0x66e/0xb90
> [<ffffffff81335b56>] ? __const_udelay+0x36/0x40
> [<ffffffff81074cec>] ? __rcu_read_unlock+0x5c/0xa0
> [<ffffffff815bd4e4>] inet_sendmsg+0x114/0x230
> [<ffffffff815bd3d5>] ? inet_sendmsg+0x5/0x230
> [<ffffffff815261e9>] ? sock_update_classid+0xa9/0x2d0
> [<ffffffff81526270>] ? sock_update_classid+0x130/0x2d0
> [<ffffffff81520d1c>] sock_sendmsg+0xbc/0xf0
> [<ffffffff810b7ab2>] ? get_lock_stats+0x22/0x70
> [<ffffffff810bd037>] ? lock_release_non_nested+0x2b7/0x2f0
> [<ffffffff815210fc>] __sys_sendmsg+0x3ac/0x3c0
> [<ffffffff810b7b88>] ? trace_hardirqs_off_caller+0x28/0xc0
> [<ffffffff810b7ab2>] ? get_lock_stats+0x22/0x70
> [<ffffffff810b7f1e>] ? put_lock_stats.isra.23+0xe/0x40
> [<ffffffff8100a136>] ? native_sched_clock+0x26/0x90
> [<ffffffff810b7b88>] ? trace_hardirqs_off_caller+0x28/0xc0
> [<ffffffff81055e54>] ? do_setitimer+0x1c4/0x300
> [<ffffffff810b7f1e>] ? put_lock_stats.isra.23+0xe/0x40
> [<ffffffff811d6bea>] ? fget_light+0x3ca/0x500
> [<ffffffff810bdf9d>] ? trace_hardirqs_on+0xd/0x10
> [<ffffffff815237d9>] sys_sendmsg+0x49/0x90
> [<ffffffff81684b42>] system_call_fastpath+0x16/0x1b
> Code: 00 00 48 8b 5d d8 4c 8b 65 e0 4c 8b 6d e8 4c 8b 75 f0 4c 8b 7d f8 c9 c3 0f 1f 00 49 8b 34 24 48 8b 86 98 00 00 00 48 85 c0 74 c2 <f6> 80 6d 01 00 00 de 75 b9 48 8b 56 18 49 8d 75 24 b9 01 00 00
> RIP [<ffffffff815fe038>] ip6_dst_lookup_tail+0xe8/0x200
> RSP <ffff880046277968>
> ---[ end trace f7dde22e5674fdd6 ]---
>
>
>
> 0000000000000000 <.text>:
> 0: 00 00 add %al,(%rax)
> 2: 48 8b 5d d8 mov -0x28(%rbp),%rbx
> 6: 4c 8b 65 e0 mov -0x20(%rbp),%r12
> a: 4c 8b 6d e8 mov -0x18(%rbp),%r13
> e: 4c 8b 75 f0 mov -0x10(%rbp),%r14
> 12: 4c 8b 7d f8 mov -0x8(%rbp),%r15
> 16: c9 leaveq
> 17: c3 retq
> 18: 0f 1f 00 nopl (%rax)
> 1b: 49 8b 34 24 mov (%r12),%rsi
> 1f: 48 8b 86 98 00 00 00 mov 0x98(%rsi),%rax
> 26: 48 85 c0 test %rax,%rax
> 29: 74 c2 je 0xffffffffffffffed
>
> /home/davej/tmp/tmp.YOesIDyrIr.o: file format elf64-x86-64
>
>
> Disassembly of section .text:
>
> 0000000000000000 <.text>:
> 0: f6 80 6d 01 00 00 de testb $0xde,0x16d(%rax)
> 7: 75 b9 jne 0xffffffffffffffc2
> 9: 48 8b 56 18 mov 0x18(%rsi),%rdx
> d: 49 8d 75 24 lea 0x24(%r13),%rsi
> 11: b9 .byte 0xb9
> 12: 01 00 add %eax,(%rax)
> ...
>
>
> which looks like an inlined copy of ipv6_addr_any
>
> * marked as OPTIMISTIC, we release the found
> * dst entry and replace it instead with the
> * dst entry of the nexthop router
> */
> rt = (struct rt6_info *) *dst;
> n = rt->n;
> dc: 48 8b 86 98 00 00 00 mov 0x98(%rsi),%rax
> if (n && !(n->nud_state & NUD_VALID)) {
> e3: 48 85 c0 test %rax,%rax
> e6: 74 c2 je aa <ip6_dst_lookup_tail+0xaa>
> e8: f6 80 6d 01 00 00 de testb $0xde,0x16d(%rax)
> ef: 75 b9 jne aa <ip6_dst_lookup_tail+0xaa>
>
>
> RAX here is 2000000000000011 , which clearly isn't a valid rt address to dereference.
>
More precisely, rt was fine, but n (rt->n) contains garbage.
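
As a side note on why this shows up as a general protection fault rather than a page fault: with 48-bit virtual addresses, bits 63..47 of a pointer must all be equal, and an access through a non-canonical address raises #GP. A minimal sketch of that check follows; is_canonical_48 is a hypothetical helper, and 48-bit (4-level paging) addressing is an assumption about the machine in the report.

```c
#include <stdbool.h>
#include <stdint.h>

/* True if addr is canonical for 48-bit virtual addressing,
 * i.e. bits 63..47 are all zeros or all ones.
 */
bool is_canonical_48(uint64_t addr)
{
	uint64_t top = addr >> 47;	/* the 17 bits 63..47 */

	return top == 0 || top == 0x1ffff;
}
```

The RAX value from the trace, 0x2000000000000011, fails this check, while the RSP value 0xffff880046277968 and the FS base 0x00007f0817b3f740 pass it.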
^ permalink raw reply
* [net-next:master 195/198] net/bridge/br_mdb.c:79:35: sparse: incompatible types in comparison expression (different address spaces)
From: kbuild test robot @ 2012-12-09 2:54 UTC (permalink / raw)
To: Cong Wang; +Cc: netdev
tree: git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next.git master
head: 9ecb9aabaf634677c77af467f4e3028b09d7bcda
commit: ee07c6e7a6f8a25c18f0a6b18152fbd7499245f6 [195/198] bridge: export multicast database via netlink
sparse warnings:
+ net/bridge/br_mdb.c:79:35: sparse: incompatible types in comparison expression (different address spaces)
vim +79 net/bridge/br_mdb.c
ee07c6e7 Cong Wang 2012-12-07 63 struct hlist_node *h;
ee07c6e7 Cong Wang 2012-12-07 64 struct net_bridge_mdb_entry *mp;
ee07c6e7 Cong Wang 2012-12-07 65 struct net_bridge_port_group *p, **pp;
ee07c6e7 Cong Wang 2012-12-07 66 struct net_bridge_port *port;
ee07c6e7 Cong Wang 2012-12-07 67
ee07c6e7 Cong Wang 2012-12-07 68 hlist_for_each_entry_rcu(mp, h, &mdb->mhash[i], hlist[mdb->ver]) {
ee07c6e7 Cong Wang 2012-12-07 69 if (idx < s_idx)
ee07c6e7 Cong Wang 2012-12-07 70 goto skip;
ee07c6e7 Cong Wang 2012-12-07 71
ee07c6e7 Cong Wang 2012-12-07 72 nest2 = nla_nest_start(skb, MDBA_MDB_ENTRY);
ee07c6e7 Cong Wang 2012-12-07 73 if (nest2 == NULL) {
ee07c6e7 Cong Wang 2012-12-07 74 err = -EMSGSIZE;
ee07c6e7 Cong Wang 2012-12-07 75 goto out;
ee07c6e7 Cong Wang 2012-12-07 76 }
ee07c6e7 Cong Wang 2012-12-07 77
ee07c6e7 Cong Wang 2012-12-07 78 for (pp = &mp->ports;
ee07c6e7 Cong Wang 2012-12-07 @79 (p = rcu_dereference(*pp)) != NULL;
ee07c6e7 Cong Wang 2012-12-07 80 pp = &p->next) {
ee07c6e7 Cong Wang 2012-12-07 81 port = p->port;
ee07c6e7 Cong Wang 2012-12-07 82 if (port) {
ee07c6e7 Cong Wang 2012-12-07 83 struct br_mdb_entry e;
ee07c6e7 Cong Wang 2012-12-07 84 e.ifindex = port->dev->ifindex;
ee07c6e7 Cong Wang 2012-12-07 85 e.addr.u.ip4 = p->addr.u.ip4;
ee07c6e7 Cong Wang 2012-12-07 86 #if IS_ENABLED(CONFIG_IPV6)
ee07c6e7 Cong Wang 2012-12-07 87 e.addr.u.ip6 = p->addr.u.ip6;
---
0-DAY kernel build testing backend Open Source Technology Center
Fengguang Wu, Yuanhan Liu Intel Corporation
^ permalink raw reply
* Re: [PATCH v4 net-next 0/5] tunneling: Add support for hardware-offloaded encapsulation
From: David Miller @ 2012-12-09 5:21 UTC (permalink / raw)
To: joseph.gasparakis
Cc: shemminger, chrisw, gospo, netdev, linux-kernel, dmitry,
saeed.bishara, bhutchings
In-Reply-To: <1354925658-24115-1-git-send-email-joseph.gasparakis@intel.com>
From: Joseph Gasparakis <joseph.gasparakis@intel.com>
Date: Fri, 7 Dec 2012 16:14:13 -0800
> The series contains updates to add in the NIC Rx and Tx checksumming
> support for encapsulated packets.
Ok, all applied (except the ixgbe patch, of course), thanks.
^ permalink raw reply
* Re: [PATCH net-next #2 1/1] r8169: workaround for missing extended GigaMAC registers
From: David Miller @ 2012-12-09 5:32 UTC (permalink / raw)
To: romieu; +Cc: netdev, jlee, udknight, hayeswang
In-Reply-To: <20121207212021.GA8412@electric-eye.fr.zoreil.com>
From: Francois Romieu <romieu@fr.zoreil.com>
Date: Fri, 7 Dec 2012 22:20:21 +0100
> GigaMAC registers have been reported left uninitialized in several
> situations:
> - after cold boot from power-off state
> - after S3 resume
>
> Tweaking rtl_hw_phy_config takes care of both.
>
> This patch removes an excess entry (",") at the end of the exgmac_reg
> array as well.
>
> Signed-off-by: Francois Romieu <romieu@fr.zoreil.com>
> Signed-off-by: Wang YanQing <udknight@gmail.com>
> Cc: Hayes Wang <hayeswang@realtek.com>
Applied, thanks.
^ permalink raw reply
* Re: [PATCH net-next v3 0/3] Multiqueue support in virtio-net
From: David Miller @ 2012-12-09 5:32 UTC (permalink / raw)
To: jasowang
Cc: krkumar2, kvm, mst, netdev, linux-kernel, virtualization,
bhutchings, jwhan, shiyer
In-Reply-To: <1354899897-10423-1-git-send-email-jasowang@redhat.com>
From: Jason Wang <jasowang@redhat.com>
Date: Sat, 8 Dec 2012 01:04:54 +0800
> This series is an updated version (hopefully the final one) of multiqueue
> (VIRTIO_NET_F_MQ) support in the virtio-net driver. All previous comments were
> addressed. The work is based on Krishna Kumar's work to let virtio-net use
> multiple rx/tx queues for packet reception and transmission. Performance
> tests show the aggregate latency was increased greatly, but there may be some
> regression in small packet transmission. Because of this, multiqueue is
> disabled by default. Users who want to benefit from multiqueue can enable it
> with ethtool -L.
These changes look fine to me, applied, thanks.
^ permalink raw reply
* Re: [PATCH net-next 1/2] caif_usb: Check driver name before reading driver state in netdev notifier
From: David Miller @ 2012-12-09 5:34 UTC (permalink / raw)
To: bhutchings; +Cc: sjur.brandeland, netdev
In-Reply-To: <1354897046.2707.7.camel@bwh-desktop.uk.solarflarecom.com>
From: Ben Hutchings <bhutchings@solarflare.com>
Date: Fri, 7 Dec 2012 16:17:26 +0000
> In cfusbl_device_notify(), the usbnet and usbdev variables are
> initialised before the driver name has been checked. In case the
> device's driver is not cdc_ncm, this may result in reading beyond the
> end of the netdev private area. Move the initialisation below the
> driver name check.
>
> Signed-off-by: Ben Hutchings <bhutchings@solarflare.com>
Applied.
^ permalink raw reply
* Re: [PATCH net-next 2/2] caif_usb: Make the driver name check more efficient
From: David Miller @ 2012-12-09 5:34 UTC (permalink / raw)
To: bhutchings; +Cc: sjur.brandeland, netdev
In-Reply-To: <1354897227.2707.10.camel@bwh-desktop.uk.solarflarecom.com>
From: Ben Hutchings <bhutchings@solarflare.com>
Date: Fri, 7 Dec 2012 16:20:27 +0000
> Use the device model to get just the name, rather than using the
> ethtool API to get all driver information.
>
> Signed-off-by: Ben Hutchings <bhutchings@solarflare.com>
> ---
> Compile-tested only. I'm assuming that the strncmp() is not really
> necessary, but perhaps there is some OOT variant of cdc_ncm that is also
> supposed to be supported?
Applied, I guess you found this while looking around for tests
of ethtool_ops being NULL?
^ permalink raw reply
* [PATCH net 1/3] inet_diag: fix oops for IPv4 AF_INET6 TCP SYN-RECV state
From: Neal Cardwell @ 2012-12-09 5:43 UTC (permalink / raw)
To: David Miller; +Cc: edumazet, netdev, Neal Cardwell
Fix inet_diag to be aware of the fact that AF_INET6 TCP connections
instantiated for IPv4 traffic and in the SYN-RECV state were actually
created with inet_reqsk_alloc(), instead of inet6_reqsk_alloc(). This
means that for such connections inet6_rsk(req) returns a pointer to a
random spot in memory up to roughly 64KB beyond the end of the
request_sock.
With this bug, for a server using AF_INET6 TCP sockets and serving
IPv4 traffic, an inet_diag user like `ss state SYN-RECV` would lead to
inet_diag_fill_req() causing an oops or the export to user space of 16
bytes of kernel memory as a garbage IPv6 address, depending on where
the garbage inet6_rsk(req) pointed.
Signed-off-by: Neal Cardwell <ncardwell@google.com>
---
net/ipv4/inet_diag.c | 53 ++++++++++++++++++++++++++++++++++++-------------
1 files changed, 39 insertions(+), 14 deletions(-)
diff --git a/net/ipv4/inet_diag.c b/net/ipv4/inet_diag.c
index 0c34bfa..16cfa42 100644
--- a/net/ipv4/inet_diag.c
+++ b/net/ipv4/inet_diag.c
@@ -44,6 +44,10 @@ struct inet_diag_entry {
u16 dport;
u16 family;
u16 userlocks;
+#if IS_ENABLED(CONFIG_IPV6)
+ struct in6_addr saddr_storage; /* for IPv4-mapped-IPv6 addresses */
+ struct in6_addr daddr_storage; /* for IPv4-mapped-IPv6 addresses */
+#endif
};
static DEFINE_MUTEX(inet_diag_table_mutex);
@@ -596,6 +600,36 @@ static int inet_twsk_diag_dump(struct inet_timewait_sock *tw,
cb->nlh->nlmsg_seq, NLM_F_MULTI, cb->nlh);
}
+/* Get the IPv4, IPv6, or IPv4-mapped-IPv6 local and remote addresses
+ * from a request_sock. For IPv4-mapped-IPv6 we must map IPv4 to IPv6.
+ */
+static inline void inet_diag_req_addrs(const struct sock *sk,
+ const struct request_sock *req,
+ struct inet_diag_entry *entry)
+{
+ struct inet_request_sock *ireq = inet_rsk(req);
+
+#if IS_ENABLED(CONFIG_IPV6)
+ if (sk->sk_family == AF_INET6) {
+ if (req->rsk_ops->family == AF_INET6) {
+ entry->saddr = inet6_rsk(req)->loc_addr.s6_addr32;
+ entry->daddr = inet6_rsk(req)->rmt_addr.s6_addr32;
+ } else if (req->rsk_ops->family == AF_INET) {
+ ipv6_addr_set_v4mapped(ireq->loc_addr,
+ &entry->saddr_storage);
+ ipv6_addr_set_v4mapped(ireq->rmt_addr,
+ &entry->daddr_storage);
+ entry->saddr = entry->saddr_storage.s6_addr32;
+ entry->daddr = entry->daddr_storage.s6_addr32;
+ }
+ } else
+#endif
+ {
+ entry->saddr = &ireq->loc_addr;
+ entry->daddr = &ireq->rmt_addr;
+ }
+}
+
static int inet_diag_fill_req(struct sk_buff *skb, struct sock *sk,
struct request_sock *req,
struct user_namespace *user_ns,
@@ -637,8 +671,10 @@ static int inet_diag_fill_req(struct sk_buff *skb, struct sock *sk,
r->idiag_inode = 0;
#if IS_ENABLED(CONFIG_IPV6)
if (r->idiag_family == AF_INET6) {
- *(struct in6_addr *)r->id.idiag_src = inet6_rsk(req)->loc_addr;
- *(struct in6_addr *)r->id.idiag_dst = inet6_rsk(req)->rmt_addr;
+ struct inet_diag_entry entry;
+ inet_diag_req_addrs(sk, req, &entry);
+ memcpy(r->id.idiag_src, entry.saddr, sizeof(struct in6_addr));
+ memcpy(r->id.idiag_dst, entry.daddr, sizeof(struct in6_addr));
}
#endif
@@ -691,18 +727,7 @@ static int inet_diag_dump_reqs(struct sk_buff *skb, struct sock *sk,
continue;
if (bc) {
- entry.saddr =
-#if IS_ENABLED(CONFIG_IPV6)
- (entry.family == AF_INET6) ?
- inet6_rsk(req)->loc_addr.s6_addr32 :
-#endif
- &ireq->loc_addr;
- entry.daddr =
-#if IS_ENABLED(CONFIG_IPV6)
- (entry.family == AF_INET6) ?
- inet6_rsk(req)->rmt_addr.s6_addr32 :
-#endif
- &ireq->rmt_addr;
+ inet_diag_req_addrs(sk, req, &entry);
entry.dport = ntohs(ireq->rmt_port);
if (!inet_diag_bc_run(bc, &entry))
--
1.7.7.3
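
For readers unfamiliar with the IPv4-mapped form that the patch above builds via ipv6_addr_set_v4mapped(), here is a minimal userspace sketch. The helper names v4mapped_from_ipv4() and is_v4mapped() are made up for illustration and are not kernel functions; the sketch only assumes the standard ::ffff:a.b.c.d layout.

```c
#include <assert.h>
#include <string.h>
#include <arpa/inet.h>
#include <netinet/in.h>

/* Hypothetical userspace stand-in for the kernel's ipv6_addr_set_v4mapped():
 * build ::ffff:a.b.c.d from an IPv4 address held in network byte order.
 */
void v4mapped_from_ipv4(struct in_addr v4, struct in6_addr *out)
{
	assert(out != NULL);
	memset(out, 0, sizeof(*out));          /* bytes 0..9 stay zero */
	out->s6_addr[10] = 0xff;               /* bytes 10..11 are 0xffff */
	out->s6_addr[11] = 0xff;
	memcpy(&out->s6_addr[12], &v4.s_addr,  /* bytes 12..15 carry the IPv4 address */
	       sizeof(v4.s_addr));
}

/* Return 1 if the address starts with the ::ffff:0:0/96 prefix. */
int is_v4mapped(const struct in6_addr *a)
{
	static const unsigned char prefix[12] = {
		0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0xff, 0xff
	};

	assert(a != NULL);
	return memcmp(a->s6_addr, prefix, sizeof(prefix)) == 0;
}
```

The point of the per-entry storage in the patch is the same as the `out` parameter here: the mapped address has to live somewhere, because an inet_request_sock only holds the 32-bit IPv4 addresses.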
^ permalink raw reply related
* [PATCH net 2/3] inet_diag: validate byte code to prevent oops in inet_diag_bc_run()
From: Neal Cardwell @ 2012-12-09 5:43 UTC (permalink / raw)
To: David Miller; +Cc: edumazet, netdev, Neal Cardwell
In-Reply-To: <1355031803-14547-1-git-send-email-ncardwell@google.com>
Add logic to validate INET_DIAG_BC_S_COND and INET_DIAG_BC_D_COND
operations.
Previously we did not validate the inet_diag_hostcond, address family,
address length, and prefix length. So a malicious user could make the
kernel read beyond the end of the bytecode array by claiming to have a
whole inet_diag_hostcond when the bytecode was not long enough to
contain a whole inet_diag_hostcond of the given address family. Or
they could make the kernel read up to about 27 bytes beyond the end of
a connection address by passing a prefix length that exceeded the
length of addresses of the given family.
Signed-off-by: Neal Cardwell <ncardwell@google.com>
---
net/ipv4/inet_diag.c | 48 +++++++++++++++++++++++++++++++++++++++++++++---
1 files changed, 45 insertions(+), 3 deletions(-)
diff --git a/net/ipv4/inet_diag.c b/net/ipv4/inet_diag.c
index 16cfa42..529747d 100644
--- a/net/ipv4/inet_diag.c
+++ b/net/ipv4/inet_diag.c
@@ -513,6 +513,44 @@ static int valid_cc(const void *bc, int len, int cc)
return 0;
}
+/* Validate an inet_diag_hostcond. */
+static bool valid_hostcond(const struct inet_diag_bc_op *op, int len,
+ int *min_len)
+{
+ int addr_len;
+ struct inet_diag_hostcond *cond;
+
+ /* Check hostcond space. */
+ *min_len += sizeof(struct inet_diag_hostcond);
+ if (len < *min_len)
+ return false;
+ cond = (struct inet_diag_hostcond *)(op + 1);
+
+ /* Check address family and address length. */
+ switch (cond->family) {
+ case AF_UNSPEC:
+ addr_len = 0;
+ break;
+ case AF_INET:
+ addr_len = sizeof(struct in_addr);
+ break;
+ case AF_INET6:
+ addr_len = sizeof(struct in6_addr);
+ break;
+ default:
+ return false;
+ }
+ *min_len += addr_len;
+ if (len < *min_len)
+ return false;
+
+ /* Check prefix length (in bits) vs address length (in bytes). */
+ if (cond->prefix_len > 8 * addr_len)
+ return false;
+
+ return true;
+}
+
static int inet_diag_bc_audit(const void *bytecode, int bytecode_len)
{
const void *bc = bytecode;
@@ -520,18 +558,22 @@ static int inet_diag_bc_audit(const void *bytecode, int bytecode_len)
while (len > 0) {
const struct inet_diag_bc_op *op = bc;
+ int min_len = sizeof(struct inet_diag_bc_op);
//printk("BC: %d %d %d {%d} / %d\n", op->code, op->yes, op->no, op[1].no, len);
switch (op->code) {
- case INET_DIAG_BC_AUTO:
case INET_DIAG_BC_S_COND:
case INET_DIAG_BC_D_COND:
+ if (!valid_hostcond(bc, len, &min_len))
+ return -EINVAL;
+ /* fall through */
+ case INET_DIAG_BC_AUTO:
case INET_DIAG_BC_S_GE:
case INET_DIAG_BC_S_LE:
case INET_DIAG_BC_D_GE:
case INET_DIAG_BC_D_LE:
case INET_DIAG_BC_JMP:
- if (op->no < 4 || op->no > len + 4 || op->no & 3)
+ if (op->no < min_len || op->no > len + 4 || op->no & 3)
return -EINVAL;
if (op->no < len &&
!valid_cc(bytecode, bytecode_len, len - op->no))
@@ -542,7 +584,7 @@ static int inet_diag_bc_audit(const void *bytecode, int bytecode_len)
default:
return -EINVAL;
}
- if (op->yes < 4 || op->yes > len + 4 || op->yes & 3)
+ if (op->yes < min_len || op->yes > len + 4 || op->yes & 3)
return -EINVAL;
bc += op->yes;
len -= op->yes;
--
1.7.7.3
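
The length and prefix checks described above can be tried outside the kernel with a small model. This is only a sketch: struct bc_op and struct hostcond are local, hypothetical mirrors of the bytecode layout (not the uapi definitions), and hostcond_is_valid() is an illustrative name, not the function added by the patch.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <sys/socket.h>

/* Hypothetical mirrors of the bytecode structures, for illustration only. */
struct bc_op {
	uint8_t code;
	uint8_t yes;
	uint16_t no;
};

struct hostcond {
	uint8_t family;
	uint8_t prefix_len;
	int32_t port;
	/* address bytes follow: 0, 4, or 16 of them */
};

/* Check that `len` bytes starting at `op` can hold an op header, a
 * hostcond header, and an address of the claimed family, and that the
 * prefix length does not exceed the address width.
 */
bool hostcond_is_valid(const unsigned char *op, size_t len)
{
	size_t min_len = sizeof(struct bc_op) + sizeof(struct hostcond);
	size_t addr_len;
	uint8_t family, prefix_len;

	assert(op != NULL);
	if (len < min_len)
		return false;           /* no room for the hostcond header */

	family = op[sizeof(struct bc_op) + offsetof(struct hostcond, family)];
	prefix_len = op[sizeof(struct bc_op) +
			offsetof(struct hostcond, prefix_len)];

	switch (family) {
	case AF_UNSPEC:
		addr_len = 0;
		break;
	case AF_INET:
		addr_len = 4;
		break;
	case AF_INET6:
		addr_len = 16;
		break;
	default:
		return false;           /* unknown address family */
	}
	if (len < min_len + addr_len)
		return false;           /* address bytes would be truncated */

	/* prefix length is in bits, address length in bytes */
	return prefix_len <= 8 * addr_len;
}
```

With this layout an AF_INET condition needs at least 4 + 8 + 4 = 16 bytes, and a prefix length of 33 is rejected even when the buffer is long enough.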
^ permalink raw reply related
* [PATCH net 3/3] inet_diag: avoid unsafe and nonsensical prefix matches in inet_diag_bc_run()
From: Neal Cardwell @ 2012-12-09 5:43 UTC (permalink / raw)
To: David Miller; +Cc: edumazet, netdev, Neal Cardwell
In-Reply-To: <1355031803-14547-1-git-send-email-ncardwell@google.com>
Add logic to check the address family of the user-supplied conditional
and the address family of the connection entry. We now do not do
prefix matching of addresses from different address families (AF_INET
vs AF_INET6), except for the previously existing support for having an
IPv4 prefix match an IPv4-mapped IPv6 address (which this commit
maintains as-is).
This change is needed for two reasons:
(1) The addresses are different lengths, so comparing a 128-bit IPv6
prefix match condition to a 32-bit IPv4 connection address can cause
us to unwittingly walk off the end of the IPv4 address and read
garbage or oops.
(2) The IPv4 and IPv6 address spaces are semantically distinct, so a
simple bit-wise comparison of the prefixes is not meaningful, and
would lead to bogus results (except for the IPv4-mapped IPv6 case,
which this commit maintains).
Signed-off-by: Neal Cardwell <ncardwell@google.com>
---
net/ipv4/inet_diag.c | 28 +++++++++++++++++-----------
1 files changed, 17 insertions(+), 11 deletions(-)
diff --git a/net/ipv4/inet_diag.c b/net/ipv4/inet_diag.c
index 529747d..95f1a45 100644
--- a/net/ipv4/inet_diag.c
+++ b/net/ipv4/inet_diag.c
@@ -432,25 +432,31 @@ static int inet_diag_bc_run(const struct nlattr *_bc,
break;
}
- if (cond->prefix_len == 0)
- break;
-
if (op->code == INET_DIAG_BC_S_COND)
addr = entry->saddr;
else
addr = entry->daddr;
+ if (cond->family != AF_UNSPEC &&
+ cond->family != entry->family) {
+ if (entry->family == AF_INET6 &&
+ cond->family == AF_INET) {
+ if (addr[0] == 0 && addr[1] == 0 &&
+ addr[2] == htonl(0xffff) &&
+ bitstring_match(addr + 3,
+ cond->addr,
+ cond->prefix_len))
+ break;
+ }
+ yes = 0;
+ break;
+ }
+
+ if (cond->prefix_len == 0)
+ break;
if (bitstring_match(addr, cond->addr,
cond->prefix_len))
break;
- if (entry->family == AF_INET6 &&
- cond->family == AF_INET) {
- if (addr[0] == 0 && addr[1] == 0 &&
- addr[2] == htonl(0xffff) &&
- bitstring_match(addr + 3, cond->addr,
- cond->prefix_len))
- break;
- }
yes = 0;
break;
}
--
1.7.7.3
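
The matching rules described above can be modelled in userspace roughly as follows. This is a sketch under assumptions: prefix_bits_equal() and cond_matches() are hypothetical names standing in for the kernel's bitstring_match() and the S_COND/D_COND branch of inet_diag_bc_run(), addresses are passed as raw bytes in network order, and the prefix length is assumed to have been validated against the condition's family already.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <string.h>
#include <sys/socket.h>

/* Compare the first `bits` bits of two byte strings. */
bool prefix_bits_equal(const uint8_t *a, const uint8_t *b, unsigned int bits)
{
	unsigned int bytes = bits / 8;
	unsigned int rem = bits % 8;

	assert(a != NULL && b != NULL);
	if (bytes && memcmp(a, b, bytes) != 0)
		return false;
	if (rem) {
		uint8_t mask = (uint8_t)(0xff << (8 - rem));

		return ((a[bytes] ^ b[bytes]) & mask) == 0;
	}
	return true;
}

/* Family-aware prefix match: never compare addresses of different
 * families bit-wise, except an IPv4 prefix against an IPv4-mapped
 * IPv6 address (::ffff:a.b.c.d).
 */
bool cond_matches(int entry_family, const uint8_t *entry_addr,
		  int cond_family, const uint8_t *cond_addr,
		  unsigned int prefix_len)
{
	static const uint8_t v4mapped[12] = { [10] = 0xff, [11] = 0xff };

	if (cond_family != AF_UNSPEC && cond_family != entry_family) {
		if (entry_family == AF_INET6 && cond_family == AF_INET)
			return memcmp(entry_addr, v4mapped, 12) == 0 &&
			       prefix_bits_equal(entry_addr + 12, cond_addr,
						 prefix_len);
		return false;   /* e.g. IPv6 prefix vs IPv4 connection */
	}
	if (prefix_len == 0)
		return true;    /* empty prefix matches everything */
	return prefix_bits_equal(entry_addr, cond_addr, prefix_len);
}
```

In this model a 128-bit IPv6 condition is never compared against a 4-byte IPv4 address, which is the out-of-bounds read that reason (1) describes.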
^ permalink raw reply related
* Re: [PATCH net 1/3] inet_diag: fix oops for IPv4 AF_INET6 TCP SYN-RECV state
From: David Miller @ 2012-12-09 5:46 UTC (permalink / raw)
To: ncardwell; +Cc: edumazet, netdev
In-Reply-To: <1355031803-14547-1-git-send-email-ncardwell@google.com>
Thanks a lot for working on a complete fix for these problems, I'll
review these patches soon.
^ permalink raw reply
* Re: [PATCH net 1/3] inet_diag: fix oops for IPv4 AF_INET6 TCP SYN-RECV state
From: Neal Cardwell @ 2012-12-09 6:01 UTC (permalink / raw)
To: David Miller; +Cc: Eric Dumazet, Netdev
In-Reply-To: <20121209.004656.468043362420071590.davem@davemloft.net>
On Sun, Dec 9, 2012 at 12:46 AM, David Miller <davem@davemloft.net> wrote:
>
> Thanks a lot for working on a complete fix for these problems, I'll
> review these patches soon.
Thanks, David! I appreciate it.
neal
^ permalink raw reply
* Re: BUG: scheduling while atomic: ifup-bonding/3711/0x00000002 -- V3.6.7
From: Linda Walsh @ 2012-12-09 7:48 UTC (permalink / raw)
To: Jay Vosburgh; +Cc: Cong Wang, LKML, Linux Kernel Network Developers
In-Reply-To: <1013.1354914054@death.nxdomain>
Jay Vosburgh wrote:
>> ---
>> If I am running 'rr' on 2 channels -- specifically for the purpose
>> of link speed aggregation (getting 1 20Gb channel out of 2 10Gb channels)
>> I'm not sure I see how miimon would provide benefit. -- if 1 link dies,
>> the other, being on the same card is likely to be dead too, so would
>> it really serve a purpose?
>>
>
> Perhaps, but if the link partner experiences a failure, that may
> be a different situation. Not all failures will necessarily cause both
> links to fail simultaneously.
>
>
>>> Running without it will not detect failure of
>>> the bonding slaves, which is likely not what you want. The mode,
>>> balance-rr in your case, is what selects the load balance to use, and is
>>> separate from the miimon.
>>>
>>>
>> ----
>> Wouldn't the entire link die if a slave dies -- like RAID0, 1 disk
>> dies, the entire link goes down?
>>
> No; failure of a single slave does not cause the entire bond to
> fail (unless that is the last available slave). For round robin, a
> failed slave is taken out of the set used to transmit traffic, and any
> remaining slaves continue to round robin amongst themselves.
>
>
>> The other end (windows) doesn't dynamically config for a static-link
>> aggregation, so I don't think it would provide benefit.
>>
> So it (windows) has no means to disable (and discontinue use of)
> one channel of the aggregation should it fail, even in a static link
> aggregation?
>
-----------------
Actually in rereading the docs again, it should, but not w/o packet loss.
It has a static and a dynamic link aggregation, and I thought only the dynamic
link aggregation had that -- but both do, and both claim to balance all
traffic.
FWIW, my cables are direct connect, so only the capabilities of the
end cards (both Intel X540-T2 cards) are at issue, I believe.
I don't know if that is a problem or not, as each of the two ports
on the cards will only see half the traffic (from the wire
that is directly connected to it).
>
> How are you testing the throughput? If you configure the
> aggregation with just one link, how does the throughput compare to the
> aggregation with both links?
>
----
When I did 1 link, I got about 2x faster writes, and reads that were
no faster, but I didn't do extensive testing...not sure how reliable those
figures were -- but they were sufficiently disappointing that I didn't
bother doing more testing and went immediately to trying teaming/bonding.
> It most likely is combining links properly, but any link
> aggregation scheme has tradeoffs, and the best load balance algorithm to
> use depends upon the work load. Two aggregated 10G links are not
> interchangeable with a single 20G link.
>
---
Not exactly, but for TCP streams, they mostly should be.
I have tried a few TCP bench tests, and they got slower speeds than
my file R/W speeds through samba. So I use samba for testing, as
it seems to provide fairly low overhead such that I can get
line-speed writes w/1Gb ethers and >97% line speed reads.
I'm not sure, but I think the scheduler may be coming into play
more on linux (though I would have thought it would have been Windows
slowing things down -- but I guess they got lots of grief over
their perf in WinXP and Vista, as Win7 seems to be better in that
regard). Both cards are using 9k packets, and all possible offloading
(udp/tcp send/receive in addition to standard chksum offloading).
> For a round robin transmission scheme, issues arise because
> packets are delivered at the other end out of order. This in turn
> triggers various TCP behaviors to deal with what is perceived to be
> transmission errors or lost packets (TCP fast retransmit being the most
> notable). This usually results in a single TCP connection being unable
> to completely saturate a round-robin aggregated set of links.
>
----
I don't see that much retry traffic ... What appears maybe to be
a periodic drop -- like some periodic tick(?)... I do have tcp_low_latency set,
but a 10Gb connection should be low latency. I have tcp_reordering set
to 16... which isn't a new change -- had the stack tuned for optimal perf
on 1Gb... but 10Gb/20Gb... -- not really sure where to start...
> There are a few parameters on linux that can be adjusted. I
> don't know what the windows equivalents might be.
>
> On linux, adjusting the net.ipv4.tcp_reordering sysctl value
> will increase the tolerance for out of order delivery.
>
> The sysctl is adjusted via something like
>
> sysctl -w net.ipv4.tcp_reordering=10
>
---
yeah... already got that.
> the default value is 3, and higher values increase the tolerance
> for out of order delivery. If memory serves, the setting is applied to
> connections as they are created, so existing connections will not see
> changes.
>
> Also, adjusting the packet coalescing setting for the receiving
> devices may also permit higher throughput. The packet coalescing setting
> is adjusted via ethtool; the current settings can be viewed via
>
> ethtool -c eth0
>
> and then adjusted via something like
>
> ethtool -C eth0 rx-usecs 30
>
---
Had no clue what to set there....
Besides, wouldn't I need to set it on the bond interface, as
it is the stream coming from the bond interface that needs coalescing?
When I try it with the bond interface, I get 'not supported'
(it is supported on the slave interfaces, but it seems like those wouldn't
"fit", as there wouldn't be contiguous i/o to either slave as
they alternate packets... (?))
> I've seen reports that raising the "rx-usecs" parameter at the
> receiver can increase the round-robin throughput. My recollection is
> that the value used was 30, but the best settings will likely be
> dependent upon your particular hardware and configuration.
>
---
Will have to play w/those... right now, all '0's.
Thanks for the patch(s)...and hints on ethtool..
FWIW, windows has 2 timers -- a once/sec status timer and a 1/10sec
load tick -- but I don't see the load tick doing anything on
static aggregation.
Linda W.
^ permalink raw reply
* ixgbe: pci_get_device() call without counterpart call of pci_dev_put()
From: Elena Gurevich @ 2012-12-09 9:47 UTC (permalink / raw)
To: netdev
Hi all,
I am new to Linux device drivers and am using the Intel 82599 NIC driver as a
reference model.
While investigating the driver sources I found some suspicious code:
Is the code sequence (1) and (2) a possible device reference count leak?
Thanks a lot in advance
Lena
-------- snipped from ixgbe_main.c, function ixgbe_io_error_detected() --------
. . .
		/* Find the pci device of the offending VF */
		vfdev = pci_get_device(PCI_VENDOR_ID_INTEL, device_id, NULL);
		while (vfdev) {
			if (vfdev->devfn == (req_id & 0xFF))
				break;
<------ (1) leaves the loop with a successful get call !!!!
			vfdev = pci_get_device(PCI_VENDOR_ID_INTEL,
					       device_id, vfdev);
		}
		/*
		 * There's a slim chance the VF could have been hot plugged,
		 * so if it is no longer present we don't need to issue the
		 * VFLR. Just clean up the AER in that case.
		 */
		if (vfdev) {
			e_dev_err("Issuing VFLR to VF %d\n", vf);
			pci_write_config_dword(vfdev, 0xA8, 0x00008000);
		}
		pci_cleanup_aer_uncorrect_error_status(pdev);
	}
	/*
	 * Even though the error may have occurred on the other port
	 * we still need to increment the vf error reference count for
	 * both ports because the I/O resume function will be called
	 * for both of them.
	 */
	adapter->vferr_refcount++;
	return PCI_ERS_RESULT_RECOVERED;
<------ (2) leaves the function without a put call !!!!
---------------------------------- snipped -----------------------------------
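
To make the question concrete, here is a small userspace model of the get/put pattern in the snippet. It is only a sketch: struct model_dev, model_get_device(), model_dev_put() and model_find_devfn() are hypothetical names, not kernel APIs. The one behaviour it assumes from the real pci_get_device() is that the call drops the reference on its `from` argument and returns the next match with its reference count raised.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical userspace model of a refcounted device table. */
struct model_dev {
	unsigned int devfn;
	int refcount;
};

/* Modelled on pci_get_device(): drop the reference on `from` (if any),
 * then return the next device with its reference count raised.
 */
struct model_dev *model_get_device(struct model_dev *devs, size_t n,
				   struct model_dev *from)
{
	size_t next = from ? (size_t)(from - devs) + 1 : 0;

	if (from) {
		assert(from->refcount > 0);
		from->refcount--;
	}
	if (next >= n)
		return NULL;
	devs[next].refcount++;
	return &devs[next];
}

/* Modelled on pci_dev_put(): release one reference; NULL is a no-op. */
void model_dev_put(struct model_dev *dev)
{
	if (!dev)
		return;
	assert(dev->refcount > 0);
	dev->refcount--;
}

/* Same search loop as in the quoted snippet. */
struct model_dev *model_find_devfn(struct model_dev *devs, size_t n,
				   unsigned int devfn)
{
	struct model_dev *dev = model_get_device(devs, n, NULL);

	while (dev) {
		if (dev->devfn == devfn)
			break;  /* (1) leaves the loop still holding a reference */
		dev = model_get_device(devs, n, dev);
	}
	return dev;             /* caller must call model_dev_put() on a hit */
}
```

In this model, a device found via the `break` path keeps a reference count of 1 until the caller releases it; devices that were merely stepped over end up balanced. If the real function behaves the same way, a put call after the VFLR write at (2) would be the counterpart that appears to be missing.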
^ permalink raw reply