* [PATCH] ssb: use WARN in main.c
From: Cong Ding @ 2012-12-08 23:11 UTC (permalink / raw)
To: Michael Buesch, netdev, linux-kernel; +Cc: Cong Ding
Use WARN rather than printk followed by WARN_ON(1), for conciseness.
Signed-off-by: Cong Ding <dinggnu@gmail.com>
---
drivers/ssb/main.c | 3 +--
1 files changed, 1 insertions(+), 2 deletions(-)
diff --git a/drivers/ssb/main.c b/drivers/ssb/main.c
index bd7115c..c82c5c9 100644
--- a/drivers/ssb/main.c
+++ b/drivers/ssb/main.c
@@ -1133,8 +1133,7 @@ static u32 ssb_tmslow_reject_bitmask(struct ssb_device *dev)
case SSB_IDLOW_SSBREV_27: /* same here */
return SSB_TMSLOW_REJECT; /* this is a guess */
default:
- printk(KERN_INFO "ssb: Backplane Revision 0x%.8X\n", rev);
- WARN_ON(1);
+ WARN(1, KERN_INFO "ssb: Backplane Revision 0x%.8X\n", rev);
}
return (SSB_TMSLOW_REJECT | SSB_TMSLOW_REJECT_23);
}
--
1.7.4.5
^ permalink raw reply related
* ipgre rss is broken since gro
From: Dmitry Kravkov @ 2012-12-08 22:35 UTC (permalink / raw)
To: Eric Dumazet, netdev@vger.kernel.org
Hi Eric,
I'm trying to use GRE with RSS, but it looks broken on net-next since:
60769a5dcd8755715c7143b4571d5c44f01796f1 is the first bad commit
commit 60769a5dcd8755715c7143b4571d5c44f01796f1
Author: Eric Dumazet <edumazet@google.com>
Date: Thu Sep 27 02:48:50 2012 +0000
ipv4: gre: add GRO capability
Add GRO capability to IPv4 GRE tunnels, using the gro_cells
infrastructure.
Tested using IPv4 and IPv6 TCP traffic inside this tunnel, and
checking GRO is building large packets.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
:040000 040000 8eb4f570181b6d72abe24f8c1123b7e49134e662 fa20194bb14d1745e9271c8a962d0f140801a226 M include
:040000 040000 6f605ade7fed9fbe5fd57d4a0c3a8dc687e64ed6 4c06b880a6a6068aa791decb900f7b449c6ec7b5 M net
Multiple TCP streams over the tunnel cause the GRE interface to drop every ingress packet (almost) immediately.
Please note that the behavior at the current net-next head is different: there I hit a null pointer dereference. I will try to bisect that too.
^ permalink raw reply
* Re: Linux IP forwarding performance benchmarks
From: Ben Greear @ 2012-12-08 18:44 UTC (permalink / raw)
To: Oleg Arkhangelsky; +Cc: netdev
In-Reply-To: <1219381354968093@web3d.yandex.ru>
On 12/08/2012 04:01 AM, Oleg Arkhangelsky wrote:
> Hello,
>
> Does anyone have some Linux IP forwarding performance
> benchmarks on Nehalem Xeon platform versus Sandy
> Bridge Xeon E5? Intel DDIO looks pretty tasty and should
> give significant performance boost (at least in theory)
> but currently we don't have this hardware at hand to
> test so asking here.
Well, I don't have forwarding numbers, but using a modified pktgen,
we can send and receive about 800,000 packets per second on each of
4 10G NICs in our E5 test system.
Our modified pktgen is probably slower than the upstream, as it has
a bunch of rx logic in it as well...
And, for our network emulator module (like a bridge, mostly), we can
get 7-9.8Gbps bi-directional throughput, depending on some issues
due to IRQ pinning and the spread among the rx-queues, it seems.
Thanks,
Ben
>
> Thank you!
>
--
Ben Greear <greearb@candelatech.com>
Candela Technologies Inc http://www.candelatech.com
^ permalink raw reply
* Re: [PATCH 2/2] netfilter: add xt_bpf xtables match
From: Daniel Borkmann @ 2012-12-08 16:02 UTC (permalink / raw)
To: Pablo Neira Ayuso
Cc: Willem de Bruijn, netfilter-devel, netdev, Eric Dumazet,
David Miller, kaber
In-Reply-To: <20121208033111.GB28114@1984>
On Sat, Dec 8, 2012 at 4:31 AM, Pablo Neira Ayuso <pablo@netfilter.org> wrote:
> On Fri, Dec 07, 2012 at 11:56:05AM -0500, Willem de Bruijn wrote:
>> On Fri, Dec 7, 2012 at 8:16 AM, Pablo Neira Ayuso <pablo@netfilter.org> wrote:
>> > On Wed, Dec 05, 2012 at 03:10:13PM -0500, Willem de Bruijn wrote:
>> >> On Wed, Dec 5, 2012 at 2:48 PM, Pablo Neira Ayuso <pablo@netfilter.org> wrote:
>> >> > Hi Willem,
>> >> >
>> >> > On Wed, Dec 05, 2012 at 02:22:19PM -0500, Willem de Bruijn wrote:
>> >> >> A new match that executes sk_run_filter on every packet. BPF filters
>> >> >> can access skbuff fields that are out of scope for existing iptables
>> >> >> rules, allow more expressive logic, and on platforms with JIT support
>> >> >> can even be faster.
>> >> >>
>> >> >> I have a corresponding iptables patch that takes `tcpdump -ddd`
>> >> >> output, as used in the examples below. The two parts communicate
>> >> >> using a variable length structure. This is similar to ebt_among,
>> >> >> but new for iptables.
>> >> >>
>> >> >> Verified functionality by inserting an ip source filter on chain
>> >> >> INPUT and an ip dest filter on chain OUTPUT and noting that ping
>> >> >> failed while a rule was active:
>> >> >>
>> >> >> iptables -v -A INPUT -m bpf --bytecode '4,32 0 0 12,21 0 1 $SADDR,6 0 0 96,6 0 0 0,' -j DROP
>> >> >> iptables -v -A OUTPUT -m bpf --bytecode '4,32 0 0 16,21 0 1 $DADDR,6 0 0 96,6 0 0 0,' -j DROP
>> >> >
>> >> > I like this BPF idea for iptables.
>> >> >
>> >> > I made a similar extension some time ago, but it was taking a file as
>> >> > parameter. That file contained BPF code. I made a simple bison
>> >> > parser that takes BPF code and puts it into the bpf array of
>> >> > instructions. It would be a bit more intuitive to define a filter and
>> >> > we can distribute it with iptables.
>> >>
>> >> That's cleaner, indeed. I actually like how tcpdump operates as a
>> >> code generator if you pass -ddd. Unfortunately, it generates code only
>> >> for link layer types of its supported devices, such as DLT_EN10MB and
>> >> DLT_LINUX_SLL. The network layer interface of basic iptables
>> >> (forgetting device dependent mechanisms as used in xt_mac) is DLT_RAW,
>> >> but that is rarely supported.
>> >
>> > Indeed, you'll have to hack on tcpdump to select the offset. In
>> > iptables the base is the layer 3 header. With that change you could
>> > use tcpdump to generate code automagically from its syntax.
>> >
>> >> > Let me check on my internal trees, I can put that user-space code
>> >> > somewhere in case you're interested.
>> >>
>> >> Absolutely. I'll be happy to revise to get it in. I'm also considering
>> >> sending a patch to tcpdump to make it generate code independent of the
>> >> installed hardware when specifying -y.
>> >
>> > I found a version of the old parser code I made:
>> >
>> > http://1984.lsi.us.es/git/nfbpf/
>> >
>> > It interprets a filter expressed in a similar way to tcpdump -dd but
>> > it's using the BPF constants. It's quite preliminary and simple if you
>> > look at the code.
>> >
>> > Extending it to interpret some syntax similar to tcpdump -d would make
>> > the BPF filter even more readable.
>> >
>> > Some time ago I also thought about taking the kernel code that checks
>> > that the filter is correct. Currently you get -EINVAL if you pass a
>> > handcrafted filter which is incorrect, so it's a hard task to debug
>> > what you did wrong.
>> >
>> > It could be added to the iptables tree. Or, if it is generic enough for
>> > BPF and worth the effort, just provide some small library that iptables
>> > can link with and a small compiler/checker to help people develop BPF
>> > filters.
>>
>> Or use pcap_compile? I went with the tcpdump output to avoid
>> introducing a direct dependency on pcap to iptables. One possible
>> downside I see to pcap_compile vs. developing from scratch is that it
>> might lag in supporting the LSF ancillary data fields.
>
> I suggest putting the code of that preliminary nfbpf utility into
> iptables so that it can read BPF filters from a file and put them
> into the BPF array of instructions. I can help with that.
>
>> > Back to your xt_bpf thing, we can use the file containing the code
>> > instead:
>> >
>> > iptables -v -A INPUT -m bpf --bytecode-file filter1.bpf -j DROP
>> > iptables -v -A OUTPUT -m bpf --bytecode-file filter2.bpf -j DROP
>> >
>> > We can still allow the inlined filter via --bytecode if you want.
>>
>> I'll add that. I'd like to keep --bytecode to be able to generate the
>> code inline using backticks.
>
> As said, I'm fine with that, but I'll be really happy if we can
> provide some utility to generate that code using backticks for the
> masses (in case they want to pass it inlined in that format).
If it helps, you could use "bpfc", or rip off its code to avoid a
dependency; it's part of the netsniff-ng toolkit.
It can be used like:
bpfc examples/bpfc/arp.bpf
{ 0x28, 0, 0, 0x0000000c },
{ 0x15, 0, 1, 0x00000806 },
{ 0x6, 0, 0, 0xffffffff },
{ 0x6, 0, 0, 0x00000000 },
where arp.bpf is, for instance:
_main:
ldh [12]
jeq #0x806, keep, drop
keep:
ret #0xffffffff
drop:
ret #0
"Core" files are: src/bpf_lexer.l, src/bpf_parser.y
It also supports all Linux ANC-operations that were added to the
kernel (like VLAN, XOR and so on). I started on a higher-level
language that would translate to something like the example above
(which then translates again to opcodes), but didn't have time to
continue it.
^ permalink raw reply
* Re: [PATCH net-next 2/9] tipc: eliminate aggregate sk_receive_queue limit
From: Neil Horman @ 2012-12-08 14:07 UTC (permalink / raw)
To: Paul Gortmaker; +Cc: David Miller, netdev, Jon Maloy, Ying Xue
In-Reply-To: <1354929558-16948-3-git-send-email-paul.gortmaker@windriver.com>
On Fri, Dec 07, 2012 at 08:19:11PM -0500, Paul Gortmaker wrote:
> From: Ying Xue <ying.xue@windriver.com>
>
> As a complement to the per-socket sk_receive_queue limit, TIPC keeps a
> global atomic counter for the sum of sk_receive_queue sizes across all
> tipc sockets. When incremented, the counter is compared to an upper
> threshold value, and if this is reached, the message is rejected
> with error code TIPC_ERR_OVERLOAD.
>
> This check was originally meant to protect the node against
> buffer exhaustion and general CPU overload. However, all experience
> indicates that the feature not only is redundant on Linux, but even
> harmful. Users run into the limit very often, causing disturbances
> for their applications, while removing it seems to have no negative
> effects at all. We have also seen that overall performance is
> boosted significantly when this bottleneck is removed.
>
> Furthermore, we don't see any other network protocols maintaining
> such a mechanism, something strengthening our conviction that this
> control can be eliminated.
>
> As a result, the atomic variable tipc_queue_size is now unused
> and so it can be deleted. There is a getsockopt call that used
> to allow reading it; we retain that but just return zero for
> maximum compatibility.
>
> Signed-off-by: Ying Xue <ying.xue@windriver.com>
> Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
> Cc: Neil Horman <nhorman@tuxdriver.com>
> [PG: phase out tipc_queue_size as pointed out by Neil Horman]
> Signed-off-by: Paul Gortmaker <paul.gortmaker@windriver.com>
> ---
> net/tipc/socket.c | 23 ++++-------------------
> 1 file changed, 4 insertions(+), 19 deletions(-)
>
> diff --git a/net/tipc/socket.c b/net/tipc/socket.c
> index 1a720c8..848be69 100644
> --- a/net/tipc/socket.c
> +++ b/net/tipc/socket.c
> @@ -2,7 +2,7 @@
> * net/tipc/socket.c: TIPC socket API
> *
> * Copyright (c) 2001-2007, Ericsson AB
> - * Copyright (c) 2004-2008, 2010-2011, Wind River Systems
> + * Copyright (c) 2004-2008, 2010-2012, Wind River Systems
> * All rights reserved.
> *
> * Redistribution and use in source and binary forms, with or without
> @@ -73,8 +73,6 @@ static struct proto tipc_proto;
>
> static int sockets_enabled;
>
> -static atomic_t tipc_queue_size = ATOMIC_INIT(0);
> -
> /*
> * Revised TIPC socket locking policy:
> *
> @@ -128,7 +126,6 @@ static atomic_t tipc_queue_size = ATOMIC_INIT(0);
> static void advance_rx_queue(struct sock *sk)
> {
> kfree_skb(__skb_dequeue(&sk->sk_receive_queue));
> - atomic_dec(&tipc_queue_size);
> }
>
> /**
> @@ -140,10 +137,8 @@ static void discard_rx_queue(struct sock *sk)
> {
> struct sk_buff *buf;
>
> - while ((buf = __skb_dequeue(&sk->sk_receive_queue))) {
> - atomic_dec(&tipc_queue_size);
> + while ((buf = __skb_dequeue(&sk->sk_receive_queue)))
> kfree_skb(buf);
> - }
> }
>
> /**
> @@ -155,10 +150,8 @@ static void reject_rx_queue(struct sock *sk)
> {
> struct sk_buff *buf;
>
> - while ((buf = __skb_dequeue(&sk->sk_receive_queue))) {
> + while ((buf = __skb_dequeue(&sk->sk_receive_queue)))
> tipc_reject_msg(buf, TIPC_ERR_NO_PORT);
> - atomic_dec(&tipc_queue_size);
> - }
> }
>
> /**
> @@ -280,7 +273,6 @@ static int release(struct socket *sock)
> buf = __skb_dequeue(&sk->sk_receive_queue);
> if (buf == NULL)
> break;
> - atomic_dec(&tipc_queue_size);
> if (TIPC_SKB_CB(buf)->handle != 0)
> kfree_skb(buf);
> else {
> @@ -1241,11 +1233,6 @@ static u32 filter_rcv(struct sock *sk, struct sk_buff *buf)
> }
>
> /* Reject message if there isn't room to queue it */
> - recv_q_len = (u32)atomic_read(&tipc_queue_size);
> - if (unlikely(recv_q_len >= OVERLOAD_LIMIT_BASE)) {
> - if (rx_queue_full(msg, recv_q_len, OVERLOAD_LIMIT_BASE))
> - return TIPC_ERR_OVERLOAD;
> - }
> recv_q_len = skb_queue_len(&sk->sk_receive_queue);
> if (unlikely(recv_q_len >= (OVERLOAD_LIMIT_BASE / 2))) {
> if (rx_queue_full(msg, recv_q_len, OVERLOAD_LIMIT_BASE / 2))
> @@ -1254,7 +1241,6 @@ static u32 filter_rcv(struct sock *sk, struct sk_buff *buf)
>
> /* Enqueue message (finally!) */
> TIPC_SKB_CB(buf)->handle = 0;
> - atomic_inc(&tipc_queue_size);
> __skb_queue_tail(&sk->sk_receive_queue, buf);
>
> /* Initiate connection termination for an incoming 'FIN' */
> @@ -1578,7 +1564,6 @@ restart:
> /* Disconnect and send a 'FIN+' or 'FIN-' message to peer */
> buf = __skb_dequeue(&sk->sk_receive_queue);
> if (buf) {
> - atomic_dec(&tipc_queue_size);
> if (TIPC_SKB_CB(buf)->handle != 0) {
> kfree_skb(buf);
> goto restart;
> @@ -1717,7 +1702,7 @@ static int getsockopt(struct socket *sock,
> /* no need to set "res", since already 0 at this point */
> break;
> case TIPC_NODE_RECVQ_DEPTH:
> - value = (u32)atomic_read(&tipc_queue_size);
> + value = 0; /* was tipc_queue_size, now obsolete */
> break;
> case TIPC_SOCK_RECVQ_DEPTH:
> value = skb_queue_len(&sk->sk_receive_queue);
> --
> 1.7.12.1
>
>
Thank you, looks good
Acked-by: Neil Horman <nhorman@tuxdriver.com>
^ permalink raw reply
* Linux IP forwarding performance benchmarks
From: Oleg Arkhangelsky @ 2012-12-08 12:01 UTC (permalink / raw)
To: netdev
Hello,
Does anyone have some Linux IP forwarding performance
benchmarks on Nehalem Xeon platform versus Sandy
Bridge Xeon E5? Intel DDIO looks pretty tasty and should
give significant performance boost (at least in theory)
but currently we don't have this hardware at hand to
test so asking here.
Thank you!
--
wbr, Oleg.
"Anarchy is about taking complete responsibility for yourself."
Alan Moore.
^ permalink raw reply
* [Patch net-next] rtnetlink: add missing message types to selinux perm table
From: Cong Wang @ 2012-12-08 4:59 UTC (permalink / raw)
To: netdev; +Cc: David S. Miller, Cong Wang
From: Cong Wang <amwang@redhat.com>
Rebased on the latest net-next tree.
RTM_NEWNETCONF and RTM_GETNETCONF are missing in this table.
Cc: David S. Miller <davem@davemloft.net>
Signed-off-by: Cong Wang <amwang@redhat.com>
---
diff --git a/security/selinux/nlmsgtab.c b/security/selinux/nlmsgtab.c
index 163aaa7..370a646 100644
--- a/security/selinux/nlmsgtab.c
+++ b/security/selinux/nlmsgtab.c
@@ -67,6 +67,8 @@ static struct nlmsg_perm nlmsg_route_perms[] =
{ RTM_GETADDRLABEL, NETLINK_ROUTE_SOCKET__NLMSG_READ },
{ RTM_GETDCB, NETLINK_ROUTE_SOCKET__NLMSG_READ },
{ RTM_SETDCB, NETLINK_ROUTE_SOCKET__NLMSG_WRITE },
+ { RTM_NEWNETCONF, NETLINK_ROUTE_SOCKET__NLMSG_WRITE },
+ { RTM_GETNETCONF, NETLINK_ROUTE_SOCKET__NLMSG_READ },
{ RTM_GETMDB, NETLINK_ROUTE_SOCKET__NLMSG_READ },
};
^ permalink raw reply related
* Re: [PATCH net] inet_diag: fix oops for IPv4 AF_INET6 TCP SYN-RECV state
From: Neal Cardwell @ 2012-12-08 3:36 UTC (permalink / raw)
To: David Miller; +Cc: Eric Dumazet, Netdev
In-Reply-To: <CADVnQykWjmpiPrSpGmOPc_WuamRNK_eO_ZC_6Q_TktRN2ALWew@mail.gmail.com>
On Fri, Dec 7, 2012 at 9:30 PM, Neal Cardwell <ncardwell@google.com> wrote:
> It also seems like it considers IPv4 and IPv6 with the same prefix as
> matching, which seems bogus; eg IMHO 128.0.0.0 should not match 1::/1.
Oops, I think my example should be "128.0.0.0 should not match 8000::/1".
> In general it seems to me that a mismatch between entry->family and
> cond->family should prevent a match, except for the IPv4-mapped-IPv6
> case it already handles.
But I think that general issue still remains.
neal
^ permalink raw reply
* Re: [PATCH 2/2] netfilter: add xt_bpf xtables match
From: Pablo Neira Ayuso @ 2012-12-08 3:31 UTC (permalink / raw)
To: Willem de Bruijn
Cc: netfilter-devel, netdev, Eric Dumazet, David Miller, kaber
In-Reply-To: <CA+FuTSdq0Mfpw6QRaa6LMBYAOcMfc8dGcSwWZBa7rvW1e89qQA@mail.gmail.com>
On Fri, Dec 07, 2012 at 11:56:05AM -0500, Willem de Bruijn wrote:
> On Fri, Dec 7, 2012 at 8:16 AM, Pablo Neira Ayuso <pablo@netfilter.org> wrote:
> > On Wed, Dec 05, 2012 at 03:10:13PM -0500, Willem de Bruijn wrote:
> >> On Wed, Dec 5, 2012 at 2:48 PM, Pablo Neira Ayuso <pablo@netfilter.org> wrote:
> >> > Hi Willem,
> >> >
> >> > On Wed, Dec 05, 2012 at 02:22:19PM -0500, Willem de Bruijn wrote:
> >> >> A new match that executes sk_run_filter on every packet. BPF filters
> >> >> can access skbuff fields that are out of scope for existing iptables
> >> >> rules, allow more expressive logic, and on platforms with JIT support
> >> >> can even be faster.
> >> >>
> >> >> I have a corresponding iptables patch that takes `tcpdump -ddd`
> >> >> output, as used in the examples below. The two parts communicate
> >> >> using a variable length structure. This is similar to ebt_among,
> >> >> but new for iptables.
> >> >>
> >> >> Verified functionality by inserting an ip source filter on chain
> >> >> INPUT and an ip dest filter on chain OUTPUT and noting that ping
> >> >> failed while a rule was active:
> >> >>
> >> >> iptables -v -A INPUT -m bpf --bytecode '4,32 0 0 12,21 0 1 $SADDR,6 0 0 96,6 0 0 0,' -j DROP
> >> >> iptables -v -A OUTPUT -m bpf --bytecode '4,32 0 0 16,21 0 1 $DADDR,6 0 0 96,6 0 0 0,' -j DROP
> >> >
> >> > I like this BPF idea for iptables.
> >> >
> >> > I made a similar extension some time ago, but it was taking a file as
> >> > parameter. That file contained BPF code. I made a simple bison
> >> > parser that takes BPF code and puts it into the bpf array of
> >> > instructions. It would be a bit more intuitive to define a filter and
> >> > we can distribute it with iptables.
> >>
> >> That's cleaner, indeed. I actually like how tcpdump operates as a
> >> code generator if you pass -ddd. Unfortunately, it generates code only
> >> for link layer types of its supported devices, such as DLT_EN10MB and
> >> DLT_LINUX_SLL. The network layer interface of basic iptables
> >> (forgetting device dependent mechanisms as used in xt_mac) is DLT_RAW,
> >> but that is rarely supported.
> >
> > Indeed, you'll have to hack on tcpdump to select the offset. In
> > iptables the base is the layer 3 header. With that change you could
> > use tcpdump to generate code automagically from its syntax.
> >
> >> > Let me check on my internal trees, I can put that user-space code
> >> > somewhere in case you're interested.
> >>
> >> Absolutely. I'll be happy to revise to get it in. I'm also considering
> >> sending a patch to tcpdump to make it generate code independent of the
> >> installed hardware when specifying -y.
> >
> > I found a version of the old parser code I made:
> >
> > http://1984.lsi.us.es/git/nfbpf/
> >
> > It interprets a filter expressed in a similar way to tcpdump -dd but
> > it's using the BPF constants. It's quite preliminary and simple if you
> > look at the code.
> >
> > Extending it to interpret some syntax similar to tcpdump -d would make
> > the BPF filter even more readable.
> >
> > Some time ago I also thought about taking the kernel code that checks
> > that the filter is correct. Currently you get -EINVAL if you pass a
> > handcrafted filter which is incorrect, so it's a hard task to debug
> > what you did wrong.
> >
> > It could be added to the iptables tree. Or, if it is generic enough for
> > BPF and worth the effort, just provide some small library that iptables
> > can link with and a small compiler/checker to help people develop BPF
> > filters.
>
> Or use pcap_compile? I went with the tcpdump output to avoid
> introducing a direct dependency on pcap to iptables. One possible
> downside I see to pcap_compile vs. developing from scratch is that it
> might lag in supporting the LSF ancillary data fields.
I suggest putting the code of that preliminary nfbpf utility into
iptables so that it can read BPF filters from a file and put them
into the BPF array of instructions. I can help with that.
> > Back to your xt_bpf thing, we can use the file containing the code
> > instead:
> >
> > iptables -v -A INPUT -m bpf --bytecode-file filter1.bpf -j DROP
> > iptables -v -A OUTPUT -m bpf --bytecode-file filter2.bpf -j DROP
> >
> > We can still allow the inlined filter via --bytecode if you want.
>
> I'll add that. I'd like to keep --bytecode to be able to generate the
> code inline using backticks.
As said, I'm fine with that, but I'll be really happy if we can
provide some utility to generate that code using backticks for the
masses (in case they want to pass it inlined in that format).
^ permalink raw reply
* Re: [PATCH net] inet_diag: fix oops for IPv4 AF_INET6 TCP SYN-RECV state
From: Neal Cardwell @ 2012-12-08 2:30 UTC (permalink / raw)
To: David Miller; +Cc: Eric Dumazet, Netdev
In-Reply-To: <20121207.132019.1647690876686095833.davem@davemloft.net>
On Fri, Dec 7, 2012 at 1:20 PM, David Miller <davem@davemloft.net> wrote:
> From: Neal Cardwell <ncardwell@google.com>
> Date: Thu, 6 Dec 2012 10:42:26 -0500
>
>> Fix inet_diag to be aware of the fact that AF_INET6 TCP connections
>> instantiated for IPv4 traffic and in the SYN-RECV state were actually
>> created with inet_reqsk_alloc(), instead of inet6_reqsk_alloc(). This
>> means that for such connections inet6_rsk(req) returns a pointer to a
>> random spot in memory up to roughly 64KB beyond the end of the
>> request_sock.
>>
>> With this bug, for a server using AF_INET6 TCP sockets and serving
>> IPv4 traffic, an inet_diag user like `ss state SYN-RECV` would lead to
>> inet_diag_fill_req() causing an oops or the export to user space of 16
>> bytes of kernel memory as a garbage IPv6 address, depending on where
>> the garbage inet6_rsk(req) pointed.
>>
>> Signed-off-by: Neal Cardwell <ncardwell@google.com>
>
> Thanks for this fix, but it opens up more questions.
>
> We don't seem to make any validations upon inet_diag_hostcond's
> prefix_len. That parameter we pass into bitstring_match() can
> be just about anything.
>
> As another example, what if we do an ipv6 128-bit compare on what's
> actually an ipv4 address in the inet request sock?
>
> I think we need to, using cond->family, make some kind of validations
> upon cond->prefix_len.
OK, sounds good. I will add a patch to fix the adjacent prefix_len
issues you mention.
It also seems like it considers IPv4 and IPv6 with the same prefix as
matching, which seems bogus; eg IMHO 128.0.0.0 should not match 1::/1.
In general it seems to me that a mismatch between entry->family and
cond->family should prevent a match, except for the IPv4-mapped-IPv6
case it already handles.
Would you like these patches against net or net-next?
neal
^ permalink raw reply
* [PATCH net-next 2/9] tipc: eliminate aggregate sk_receive_queue limit
From: Paul Gortmaker @ 2012-12-08 1:19 UTC (permalink / raw)
To: David Miller; +Cc: netdev, Jon Maloy, Ying Xue, Neil Horman, Paul Gortmaker
In-Reply-To: <1354929558-16948-1-git-send-email-paul.gortmaker@windriver.com>
From: Ying Xue <ying.xue@windriver.com>
As a complement to the per-socket sk_receive_queue limit, TIPC keeps a
global atomic counter for the sum of sk_receive_queue sizes across all
tipc sockets. When incremented, the counter is compared to an upper
threshold value, and if this is reached, the message is rejected
with error code TIPC_ERR_OVERLOAD.
This check was originally meant to protect the node against
buffer exhaustion and general CPU overload. However, all experience
indicates that the feature not only is redundant on Linux, but even
harmful. Users run into the limit very often, causing disturbances
for their applications, while removing it seems to have no negative
effects at all. We have also seen that overall performance is
boosted significantly when this bottleneck is removed.
Furthermore, we don't see any other network protocols maintaining
such a mechanism, something strengthening our conviction that this
control can be eliminated.
As a result, the atomic variable tipc_queue_size is now unused
and so it can be deleted. There is a getsockopt call that used
to allow reading it; we retain that but just return zero for
maximum compatibility.
Signed-off-by: Ying Xue <ying.xue@windriver.com>
Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
Cc: Neil Horman <nhorman@tuxdriver.com>
[PG: phase out tipc_queue_size as pointed out by Neil Horman]
Signed-off-by: Paul Gortmaker <paul.gortmaker@windriver.com>
---
net/tipc/socket.c | 23 ++++-------------------
1 file changed, 4 insertions(+), 19 deletions(-)
diff --git a/net/tipc/socket.c b/net/tipc/socket.c
index 1a720c8..848be69 100644
--- a/net/tipc/socket.c
+++ b/net/tipc/socket.c
@@ -2,7 +2,7 @@
* net/tipc/socket.c: TIPC socket API
*
* Copyright (c) 2001-2007, Ericsson AB
- * Copyright (c) 2004-2008, 2010-2011, Wind River Systems
+ * Copyright (c) 2004-2008, 2010-2012, Wind River Systems
* All rights reserved.
*
* Redistribution and use in source and binary forms, with or without
@@ -73,8 +73,6 @@ static struct proto tipc_proto;
static int sockets_enabled;
-static atomic_t tipc_queue_size = ATOMIC_INIT(0);
-
/*
* Revised TIPC socket locking policy:
*
@@ -128,7 +126,6 @@ static atomic_t tipc_queue_size = ATOMIC_INIT(0);
static void advance_rx_queue(struct sock *sk)
{
kfree_skb(__skb_dequeue(&sk->sk_receive_queue));
- atomic_dec(&tipc_queue_size);
}
/**
@@ -140,10 +137,8 @@ static void discard_rx_queue(struct sock *sk)
{
struct sk_buff *buf;
- while ((buf = __skb_dequeue(&sk->sk_receive_queue))) {
- atomic_dec(&tipc_queue_size);
+ while ((buf = __skb_dequeue(&sk->sk_receive_queue)))
kfree_skb(buf);
- }
}
/**
@@ -155,10 +150,8 @@ static void reject_rx_queue(struct sock *sk)
{
struct sk_buff *buf;
- while ((buf = __skb_dequeue(&sk->sk_receive_queue))) {
+ while ((buf = __skb_dequeue(&sk->sk_receive_queue)))
tipc_reject_msg(buf, TIPC_ERR_NO_PORT);
- atomic_dec(&tipc_queue_size);
- }
}
/**
@@ -280,7 +273,6 @@ static int release(struct socket *sock)
buf = __skb_dequeue(&sk->sk_receive_queue);
if (buf == NULL)
break;
- atomic_dec(&tipc_queue_size);
if (TIPC_SKB_CB(buf)->handle != 0)
kfree_skb(buf);
else {
@@ -1241,11 +1233,6 @@ static u32 filter_rcv(struct sock *sk, struct sk_buff *buf)
}
/* Reject message if there isn't room to queue it */
- recv_q_len = (u32)atomic_read(&tipc_queue_size);
- if (unlikely(recv_q_len >= OVERLOAD_LIMIT_BASE)) {
- if (rx_queue_full(msg, recv_q_len, OVERLOAD_LIMIT_BASE))
- return TIPC_ERR_OVERLOAD;
- }
recv_q_len = skb_queue_len(&sk->sk_receive_queue);
if (unlikely(recv_q_len >= (OVERLOAD_LIMIT_BASE / 2))) {
if (rx_queue_full(msg, recv_q_len, OVERLOAD_LIMIT_BASE / 2))
@@ -1254,7 +1241,6 @@ static u32 filter_rcv(struct sock *sk, struct sk_buff *buf)
/* Enqueue message (finally!) */
TIPC_SKB_CB(buf)->handle = 0;
- atomic_inc(&tipc_queue_size);
__skb_queue_tail(&sk->sk_receive_queue, buf);
/* Initiate connection termination for an incoming 'FIN' */
@@ -1578,7 +1564,6 @@ restart:
/* Disconnect and send a 'FIN+' or 'FIN-' message to peer */
buf = __skb_dequeue(&sk->sk_receive_queue);
if (buf) {
- atomic_dec(&tipc_queue_size);
if (TIPC_SKB_CB(buf)->handle != 0) {
kfree_skb(buf);
goto restart;
@@ -1717,7 +1702,7 @@ static int getsockopt(struct socket *sock,
/* no need to set "res", since already 0 at this point */
break;
case TIPC_NODE_RECVQ_DEPTH:
- value = (u32)atomic_read(&tipc_queue_size);
+ value = 0; /* was tipc_queue_size, now obsolete */
break;
case TIPC_SOCK_RECVQ_DEPTH:
value = skb_queue_len(&sk->sk_receive_queue);
--
1.7.12.1
^ permalink raw reply related
* [PATCH net-next 9/9] tipc: refactor accept() code for improved readability
From: Paul Gortmaker @ 2012-12-08 1:19 UTC (permalink / raw)
To: David Miller; +Cc: netdev, Jon Maloy, Paul Gortmaker
In-Reply-To: <1354929558-16948-1-git-send-email-paul.gortmaker@windriver.com>
In TIPC's accept() routine, there is a large block of code relating
to initialization of a new socket, all within an if condition checking
if the allocation succeeded.
Here, we simply flip the check of the if, so that the main execution
path stays at the same indentation level, which improves readability.
If the allocation fails, we jump to an already existing exit label.
Signed-off-by: Paul Gortmaker <paul.gortmaker@windriver.com>
---
net/tipc/socket.c | 89 ++++++++++++++++++++++++++++++-------------------------
1 file changed, 48 insertions(+), 41 deletions(-)
diff --git a/net/tipc/socket.c b/net/tipc/socket.c
index b5c9795..9b4e483 100644
--- a/net/tipc/socket.c
+++ b/net/tipc/socket.c
@@ -1509,8 +1509,13 @@ static int listen(struct socket *sock, int len)
*/
static int accept(struct socket *sock, struct socket *new_sock, int flags)
{
- struct sock *sk = sock->sk;
+ struct sock *new_sk, *sk = sock->sk;
struct sk_buff *buf;
+ struct tipc_sock *new_tsock;
+ struct tipc_port *new_tport;
+ struct tipc_msg *msg;
+ u32 new_ref;
+
int res;
lock_sock(sk);
@@ -1536,49 +1541,51 @@ static int accept(struct socket *sock, struct socket *new_sock, int flags)
buf = skb_peek(&sk->sk_receive_queue);
res = tipc_create(sock_net(sock->sk), new_sock, 0, 0);
- if (!res) {
- struct sock *new_sk = new_sock->sk;
- struct tipc_sock *new_tsock = tipc_sk(new_sk);
- struct tipc_port *new_tport = new_tsock->p;
- u32 new_ref = new_tport->ref;
- struct tipc_msg *msg = buf_msg(buf);
-
- /* we lock on new_sk; but lockdep sees the lock on sk */
- lock_sock_nested(new_sk, SINGLE_DEPTH_NESTING);
-
- /*
- * Reject any stray messages received by new socket
- * before the socket lock was taken (very, very unlikely)
- */
- reject_rx_queue(new_sk);
-
- /* Connect new socket to it's peer */
- new_tsock->peer_name.ref = msg_origport(msg);
- new_tsock->peer_name.node = msg_orignode(msg);
- tipc_connect(new_ref, &new_tsock->peer_name);
- new_sock->state = SS_CONNECTED;
-
- tipc_set_portimportance(new_ref, msg_importance(msg));
- if (msg_named(msg)) {
- new_tport->conn_type = msg_nametype(msg);
- new_tport->conn_instance = msg_nameinst(msg);
- }
+ if (res)
+ goto exit;
- /*
- * Respond to 'SYN-' by discarding it & returning 'ACK'-.
- * Respond to 'SYN+' by queuing it on new socket.
- */
- if (!msg_data_sz(msg)) {
- struct msghdr m = {NULL,};
+ new_sk = new_sock->sk;
+ new_tsock = tipc_sk(new_sk);
+ new_tport = new_tsock->p;
+ new_ref = new_tport->ref;
+ msg = buf_msg(buf);
- advance_rx_queue(sk);
- send_packet(NULL, new_sock, &m, 0);
- } else {
- __skb_dequeue(&sk->sk_receive_queue);
- __skb_queue_head(&new_sk->sk_receive_queue, buf);
- }
- release_sock(new_sk);
+ /* we lock on new_sk; but lockdep sees the lock on sk */
+ lock_sock_nested(new_sk, SINGLE_DEPTH_NESTING);
+
+ /*
+ * Reject any stray messages received by new socket
+ * before the socket lock was taken (very, very unlikely)
+ */
+ reject_rx_queue(new_sk);
+
+ /* Connect new socket to it's peer */
+ new_tsock->peer_name.ref = msg_origport(msg);
+ new_tsock->peer_name.node = msg_orignode(msg);
+ tipc_connect(new_ref, &new_tsock->peer_name);
+ new_sock->state = SS_CONNECTED;
+
+ tipc_set_portimportance(new_ref, msg_importance(msg));
+ if (msg_named(msg)) {
+ new_tport->conn_type = msg_nametype(msg);
+ new_tport->conn_instance = msg_nameinst(msg);
}
+
+ /*
+ * Respond to 'SYN-' by discarding it & returning 'ACK'-.
+ * Respond to 'SYN+' by queuing it on new socket.
+ */
+ if (!msg_data_sz(msg)) {
+ struct msghdr m = {NULL,};
+
+ advance_rx_queue(sk);
+ send_packet(NULL, new_sock, &m, 0);
+ } else {
+ __skb_dequeue(&sk->sk_receive_queue);
+ __skb_queue_head(&new_sk->sk_receive_queue, buf);
+ }
+ release_sock(new_sk);
+
exit:
release_sock(sk);
return res;
--
1.7.12.1
^ permalink raw reply related
* [PATCH net-next 8/9] tipc: add lock nesting notation to quiet lockdep warning
From: Paul Gortmaker @ 2012-12-08 1:19 UTC (permalink / raw)
To: David Miller; +Cc: netdev, Jon Maloy, Ying Xue, Paul Gortmaker
In-Reply-To: <1354929558-16948-1-git-send-email-paul.gortmaker@windriver.com>
From: Ying Xue <ying.xue@windriver.com>
TIPC accept() call grabs the socket lock on a newly allocated
socket while holding the socket lock on an old socket. But lockdep
worries that this might be a recursive lock attempt:
[ INFO: possible recursive locking detected ]
---------------------------------------------
kworker/u:0/6 is trying to acquire lock:
(sk_lock-AF_TIPC){+.+.+.}, at: [<c8c1226c>] accept+0x15c/0x310 [tipc]
but task is already holding lock:
(sk_lock-AF_TIPC){+.+.+.}, at: [<c8c12138>] accept+0x28/0x310 [tipc]
other info that might help us debug this:
Possible unsafe locking scenario:
CPU0
----
lock(sk_lock-AF_TIPC);
lock(sk_lock-AF_TIPC);
*** DEADLOCK ***
May be due to missing lock nesting notation
[...]
Tell lockdep that this locking is safe by using lock_sock_nested().
This is similar to what was done in commit 5131a184a3458d9 for
SCTP code ("SCTP: lock_sock_nested in sctp_sock_migrate").
Also note that this isn't something that is seen normally,
as it was uncovered with some experimental work-in-progress
code not yet ready for mainline. So no need for stable
backports or similar of this commit.
Signed-off-by: Ying Xue <ying.xue@windriver.com>
Signed-off-by: Paul Gortmaker <paul.gortmaker@windriver.com>
---
net/tipc/socket.c | 3 ++-
1 file changed, 2 insertions(+), 1 deletion(-)
diff --git a/net/tipc/socket.c b/net/tipc/socket.c
index ef75b62..b5c9795 100644
--- a/net/tipc/socket.c
+++ b/net/tipc/socket.c
@@ -1543,7 +1543,8 @@ static int accept(struct socket *sock, struct socket *new_sock, int flags)
u32 new_ref = new_tport->ref;
struct tipc_msg *msg = buf_msg(buf);
- lock_sock(new_sk);
+ /* we lock on new_sk; but lockdep sees the lock on sk */
+ lock_sock_nested(new_sk, SINGLE_DEPTH_NESTING);
/*
* Reject any stray messages received by new socket
--
1.7.12.1
^ permalink raw reply related
* [PATCH net-next 7/9] tipc: eliminate connection setup for implied connect in recv_msg()
From: Paul Gortmaker @ 2012-12-08 1:19 UTC (permalink / raw)
To: David Miller; +Cc: netdev, Jon Maloy, Ying Xue, Paul Gortmaker
In-Reply-To: <1354929558-16948-1-git-send-email-paul.gortmaker@windriver.com>
From: Ying Xue <ying.xue@windriver.com>
As connection setup is now completed asynchronously in BH context,
in the function filter_connect(), the corresponding code in recv_msg()
becomes redundant.
Signed-off-by: Ying Xue <ying.xue@windriver.com>
Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: Paul Gortmaker <paul.gortmaker@windriver.com>
---
net/tipc/socket.c | 7 -------
1 file changed, 7 deletions(-)
diff --git a/net/tipc/socket.c b/net/tipc/socket.c
index dbce274..ef75b62 100644
--- a/net/tipc/socket.c
+++ b/net/tipc/socket.c
@@ -946,13 +946,6 @@ restart:
sz = msg_data_sz(msg);
err = msg_errcode(msg);
- /* Complete connection setup for an implied connect */
- if (unlikely(sock->state == SS_CONNECTING)) {
- res = auto_connect(sock, msg);
- if (res)
- goto exit;
- }
-
/* Discard an empty non-errored message & try again */
if ((!sz) && (!err)) {
advance_rx_queue(sk);
--
1.7.12.1
^ permalink raw reply related
* [PATCH net-next 4/9] tipc: standardize across connect/disconnect function naming
From: Paul Gortmaker @ 2012-12-08 1:19 UTC (permalink / raw)
To: David Miller; +Cc: netdev, Jon Maloy, Paul Gortmaker
In-Reply-To: <1354929558-16948-1-git-send-email-paul.gortmaker@windriver.com>
Currently we have tipc_disconnect and tipc_disconnect_port. It is
not clear from the names alone, what they do or how they differ.
It turns out that tipc_disconnect just deals with the port locking
and then calls tipc_disconnect_port which does all the work.
If we rename as follows: tipc_disconnect_port --> __tipc_disconnect
then we will be following typical linux convention, where:
__tipc_disconnect: "raw" function that does all the work.
tipc_disconnect: wrapper that deals with locking and then calls
the real core __tipc_disconnect function
With this, the difference is immediately evident, and locking
violations are more apt to be spotted by chance while working on,
or even just while reading the code.
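The naming convention described above can be sketched in user space as follows. This is only an illustration under stated assumptions: struct demo_port, __demo_connect() and demo_connect() are hypothetical names, and a pthread mutex stands in for the TIPC port lock; none of this is the kernel code touched by the patch.

```c
#include <errno.h>
#include <pthread.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical user-space stand-in for a port; not the kernel structure. */
struct demo_port {
	pthread_mutex_t lock;
	bool connected;
	unsigned int peer_ref;
};

/*
 * "Raw" worker that does all the work. The caller must hold port->lock.
 * The leading double underscore mirrors the kernel convention; such names
 * are formally reserved in ordinary user-space C.
 */
static int __demo_connect(struct demo_port *port, unsigned int peer_ref)
{
	if (port->connected || !peer_ref)
		return -EINVAL;
	port->peer_ref = peer_ref;
	port->connected = true;
	return 0;
}

/* Wrapper: takes the lock, calls the worker, then drops the lock. */
static int demo_connect(struct demo_port *port, unsigned int peer_ref)
{
	int res;

	if (!port)
		return -EINVAL;
	pthread_mutex_lock(&port->lock);
	res = __demo_connect(port, peer_ref);
	pthread_mutex_unlock(&port->lock);
	return res;
}
```

A caller that already holds the lock uses __demo_connect() directly, while everyone else goes through demo_connect(); the name alone tells the reader which locking contract applies.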
On the connect side of things, we currently only have the single
"tipc_connect2port" function. It does both the locking at enter/exit,
and the core of the work. Pending changes will make it desirable to
have the connect be a two part locking wrapper + worker function,
just like the disconnect is already.
Here, we make the connect look just like the updated disconnect case,
for the above reason, and for consistency. In the process, we also
get rid of the "2port" suffix that was on the original name, since
it adds no descriptive value.
On close examination, one might notice that the above connect
changes implicitly move the call to tipc_link_get_max_pkt() to be
within the scope of tipc_port_lock() protected region; when it was
not previously. We don't see any issues with this, and it is in
keeping with __tipc_connect doing the work and tipc_connect just
handling the locking.
Signed-off-by: Paul Gortmaker <paul.gortmaker@windriver.com>
---
net/tipc/port.c | 32 +++++++++++++++++++++++---------
net/tipc/port.h | 6 ++++--
net/tipc/socket.c | 6 +++---
net/tipc/subscr.c | 2 +-
4 files changed, 31 insertions(+), 15 deletions(-)
diff --git a/net/tipc/port.c b/net/tipc/port.c
index 07c42fb..18098ca 100644
--- a/net/tipc/port.c
+++ b/net/tipc/port.c
@@ -726,7 +726,7 @@ static void port_dispatcher_sigh(void *dummy)
if (unlikely(!cb))
goto reject;
if (unlikely(!connected)) {
- if (tipc_connect2port(dref, &orig))
+ if (tipc_connect(dref, &orig))
goto reject;
} else if (peer_invalid)
goto reject;
@@ -1036,15 +1036,30 @@ int tipc_withdraw(u32 ref, unsigned int scope, struct tipc_name_seq const *seq)
return res;
}
-int tipc_connect2port(u32 ref, struct tipc_portid const *peer)
+int tipc_connect(u32 ref, struct tipc_portid const *peer)
{
struct tipc_port *p_ptr;
- struct tipc_msg *msg;
- int res = -EINVAL;
+ int res;
p_ptr = tipc_port_lock(ref);
if (!p_ptr)
return -EINVAL;
+ res = __tipc_connect(ref, p_ptr, peer);
+ tipc_port_unlock(p_ptr);
+ return res;
+}
+
+/*
+ * __tipc_connect - connect to a remote peer
+ *
+ * Port must be locked.
+ */
+int __tipc_connect(u32 ref, struct tipc_port *p_ptr,
+ struct tipc_portid const *peer)
+{
+ struct tipc_msg *msg;
+ int res = -EINVAL;
+
if (p_ptr->published || p_ptr->connected)
goto exit;
if (!peer->ref)
@@ -1067,17 +1082,16 @@ int tipc_connect2port(u32 ref, struct tipc_portid const *peer)
(net_ev_handler)port_handle_node_down);
res = 0;
exit:
- tipc_port_unlock(p_ptr);
p_ptr->max_pkt = tipc_link_get_max_pkt(peer->node, ref);
return res;
}
-/**
- * tipc_disconnect_port - disconnect port from peer
+/*
+ * __tipc_disconnect - disconnect port from peer
*
* Port must be locked.
*/
-int tipc_disconnect_port(struct tipc_port *tp_ptr)
+int __tipc_disconnect(struct tipc_port *tp_ptr)
{
int res;
@@ -1104,7 +1118,7 @@ int tipc_disconnect(u32 ref)
p_ptr = tipc_port_lock(ref);
if (!p_ptr)
return -EINVAL;
- res = tipc_disconnect_port(p_ptr);
+ res = __tipc_disconnect(p_ptr);
tipc_port_unlock(p_ptr);
return res;
}
diff --git a/net/tipc/port.h b/net/tipc/port.h
index 4660e30..fb66e2e 100644
--- a/net/tipc/port.h
+++ b/net/tipc/port.h
@@ -190,7 +190,7 @@ int tipc_publish(u32 portref, unsigned int scope,
int tipc_withdraw(u32 portref, unsigned int scope,
struct tipc_name_seq const *name_seq);
-int tipc_connect2port(u32 portref, struct tipc_portid const *port);
+int tipc_connect(u32 portref, struct tipc_portid const *port);
int tipc_disconnect(u32 portref);
@@ -200,7 +200,9 @@ int tipc_shutdown(u32 ref);
/*
* The following routines require that the port be locked on entry
*/
-int tipc_disconnect_port(struct tipc_port *tp_ptr);
+int __tipc_disconnect(struct tipc_port *tp_ptr);
+int __tipc_connect(u32 ref, struct tipc_port *p_ptr,
+ struct tipc_portid const *peer);
int tipc_port_peer_msg(struct tipc_port *p_ptr, struct tipc_msg *msg);
/*
diff --git a/net/tipc/socket.c b/net/tipc/socket.c
index f553fde..b630f38 100644
--- a/net/tipc/socket.c
+++ b/net/tipc/socket.c
@@ -783,7 +783,7 @@ static int auto_connect(struct socket *sock, struct tipc_msg *msg)
tsock->peer_name.ref = msg_origport(msg);
tsock->peer_name.node = msg_orignode(msg);
- tipc_connect2port(tsock->p->ref, &tsock->peer_name);
+ tipc_connect(tsock->p->ref, &tsock->peer_name);
tipc_set_portimportance(tsock->p->ref, msg_importance(msg));
sock->state = SS_CONNECTED;
return 0;
@@ -1246,7 +1246,7 @@ static u32 filter_rcv(struct sock *sk, struct sk_buff *buf)
/* Initiate connection termination for an incoming 'FIN' */
if (unlikely(msg_errcode(msg) && (sock->state == SS_CONNECTED))) {
sock->state = SS_DISCONNECTING;
- tipc_disconnect_port(tipc_sk_port(sk));
+ __tipc_disconnect(tipc_sk_port(sk));
}
sk->sk_data_ready(sk, 0);
@@ -1506,7 +1506,7 @@ static int accept(struct socket *sock, struct socket *new_sock, int flags)
/* Connect new socket to it's peer */
new_tsock->peer_name.ref = msg_origport(msg);
new_tsock->peer_name.node = msg_orignode(msg);
- tipc_connect2port(new_ref, &new_tsock->peer_name);
+ tipc_connect(new_ref, &new_tsock->peer_name);
new_sock->state = SS_CONNECTED;
tipc_set_portimportance(new_ref, msg_importance(msg));
diff --git a/net/tipc/subscr.c b/net/tipc/subscr.c
index 0f7d0d0..6b42d47 100644
--- a/net/tipc/subscr.c
+++ b/net/tipc/subscr.c
@@ -462,7 +462,7 @@ static void subscr_named_msg_event(void *usr_handle,
kfree(subscriber);
return;
}
- tipc_connect2port(subscriber->port_ref, orig);
+ tipc_connect(subscriber->port_ref, orig);
/* Lock server port (& save lock address for future use) */
subscriber->lock = tipc_port_lock(subscriber->port_ref)->lock;
--
1.7.12.1
^ permalink raw reply related
* [PATCH net-next 6/9] tipc: introduce non-blocking socket connect
From: Paul Gortmaker @ 2012-12-08 1:19 UTC (permalink / raw)
To: David Miller; +Cc: netdev, Jon Maloy, Ying Xue, Paul Gortmaker
In-Reply-To: <1354929558-16948-1-git-send-email-paul.gortmaker@windriver.com>
From: Ying Xue <ying.xue@windriver.com>
TIPC has so far only supported blocking connect(), meaning that a call
to connect() doesn't return until either the connection is fully
established, or an error occurs. This has proved insufficient for many
users, so we now introduce non-blocking connect(), analogous to how
this is done in TCP and other protocols.
With this feature, if a connection cannot be established instantly,
connect() will return the error code "-EINPROGRESS".
If the user later calls connect() again, the return code will be
either "-EALREADY" (the connection is still being set up) or
"-EISCONN" (the connection has been established).
The user must have explicitly set the socket to be non-blocking
(SOCK_NONBLOCK or O_NONBLOCK, depending on method used), so unless
for some reason they had set this already (the socket would anyway
remain blocking in current TIPC) this change should be completely
backwards compatible.
It is also now possible to call select() or poll() to wait for the
completion of a connection.
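From user space, the flow described above might look roughly like the sketch below. It is not taken from the patch: connect_nonblocking() is a hypothetical helper, it assumes the socket was already made non-blocking by the caller, and it assumes POLLOUT is the event worth waiting for (the commit text does not say which event TIPC reports). After poll() returns, it simply calls connect() again and interprets the result as described in the commit message.

```c
#include <errno.h>
#include <poll.h>
#include <sys/socket.h>

/*
 * Hypothetical helper: start a connect on a non-blocking socket and wait
 * up to timeout_ms for it to finish. Returns 0 on success or a negative
 * errno value. -EALREADY means the attempt is still pending.
 */
static int connect_nonblocking(int fd, const struct sockaddr *addr,
			       socklen_t len, int timeout_ms)
{
	struct pollfd pfd = { .fd = fd, .events = POLLOUT };
	int n;

	if (connect(fd, addr, len) == 0)
		return 0;		/* connected immediately */
	if (errno != EINPROGRESS)
		return -errno;		/* hard failure, e.g. EBADF */

	/* Wait for the socket to report progress; an EINTR restarts the wait. */
	do {
		n = poll(&pfd, 1, timeout_ms);
	} while (n < 0 && errno == EINTR);
	if (n < 0)
		return -errno;
	if (n == 0)
		return -ETIMEDOUT;

	/* Ask again: a repeated connect() reports the connection status. */
	if (connect(fd, addr, len) == 0 || errno == EISCONN)
		return 0;
	return -errno;
}
```

Treating both a zero return and EISCONN as success keeps the helper independent of which of the two a given protocol reports once the connection is up.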
An effect of the above is that the actual completion of a connection
may now be performed asynchronously, independent of the calls from
user space. Therefore, we now execute this code in BH context, in
the function filter_rcv(), which is executed upon reception of
messages in the socket.
Signed-off-by: Ying Xue <ying.xue@windriver.com>
Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
[PG: minor refactoring for improved connect/disconnect function names]
Signed-off-by: Paul Gortmaker <paul.gortmaker@windriver.com>
---
net/tipc/socket.c | 158 ++++++++++++++++++++++++++++++++----------------------
1 file changed, 93 insertions(+), 65 deletions(-)
diff --git a/net/tipc/socket.c b/net/tipc/socket.c
index d16a6de..dbce274 100644
--- a/net/tipc/socket.c
+++ b/net/tipc/socket.c
@@ -775,16 +775,19 @@ exit:
static int auto_connect(struct socket *sock, struct tipc_msg *msg)
{
struct tipc_sock *tsock = tipc_sk(sock->sk);
-
- if (msg_errcode(msg)) {
- sock->state = SS_DISCONNECTING;
- return -ECONNREFUSED;
- }
+ struct tipc_port *p_ptr;
tsock->peer_name.ref = msg_origport(msg);
tsock->peer_name.node = msg_orignode(msg);
- tipc_connect(tsock->p->ref, &tsock->peer_name);
- tipc_set_portimportance(tsock->p->ref, msg_importance(msg));
+ p_ptr = tipc_port_deref(tsock->p->ref);
+ if (!p_ptr)
+ return -EINVAL;
+
+ __tipc_connect(tsock->p->ref, p_ptr, &tsock->peer_name);
+
+ if (msg_importance(msg) > TIPC_CRITICAL_IMPORTANCE)
+ return -EINVAL;
+ msg_set_importance(&p_ptr->phdr, (u32)msg_importance(msg));
sock->state = SS_CONNECTED;
return 0;
}
@@ -1198,7 +1201,9 @@ static u32 filter_connect(struct tipc_sock *tsock, struct sk_buff **buf)
{
struct socket *sock = tsock->sk.sk_socket;
struct tipc_msg *msg = buf_msg(*buf);
+ struct sock *sk = &tsock->sk;
u32 retval = TIPC_ERR_NO_PORT;
+ int res;
if (msg_mcast(msg))
return retval;
@@ -1216,8 +1221,36 @@ static u32 filter_connect(struct tipc_sock *tsock, struct sk_buff **buf)
break;
case SS_CONNECTING:
/* Accept only ACK or NACK message */
- if (msg_connected(msg) || (msg_errcode(msg)))
+ if (unlikely(msg_errcode(msg))) {
+ sock->state = SS_DISCONNECTING;
+ sk->sk_err = -ECONNREFUSED;
+ retval = TIPC_OK;
+ break;
+ }
+
+ if (unlikely(!msg_connected(msg)))
+ break;
+
+ res = auto_connect(sock, msg);
+ if (res) {
+ sock->state = SS_DISCONNECTING;
+ sk->sk_err = res;
retval = TIPC_OK;
+ break;
+ }
+
+ /* If an incoming message is an 'ACK-', it should be
+ * discarded here because it doesn't contain useful
+ * data. In addition, we should try to wake up
+ * connect() routine if sleeping.
+ */
+ if (msg_data_sz(msg) == 0) {
+ kfree_skb(*buf);
+ *buf = NULL;
+ if (waitqueue_active(sk_sleep(sk)))
+ wake_up_interruptible(sk_sleep(sk));
+ }
+ retval = TIPC_OK;
break;
case SS_LISTENING:
case SS_UNCONNECTED:
@@ -1361,8 +1394,6 @@ static int connect(struct socket *sock, struct sockaddr *dest, int destlen,
struct sock *sk = sock->sk;
struct sockaddr_tipc *dst = (struct sockaddr_tipc *)dest;
struct msghdr m = {NULL,};
- struct sk_buff *buf;
- struct tipc_msg *msg;
unsigned int timeout;
int res;
@@ -1374,26 +1405,6 @@ static int connect(struct socket *sock, struct sockaddr *dest, int destlen,
goto exit;
}
- /* For now, TIPC does not support the non-blocking form of connect() */
- if (flags & O_NONBLOCK) {
- res = -EOPNOTSUPP;
- goto exit;
- }
-
- /* Issue Posix-compliant error code if socket is in the wrong state */
- if (sock->state == SS_LISTENING) {
- res = -EOPNOTSUPP;
- goto exit;
- }
- if (sock->state == SS_CONNECTING) {
- res = -EALREADY;
- goto exit;
- }
- if (sock->state != SS_UNCONNECTED) {
- res = -EISCONN;
- goto exit;
- }
-
/*
* Reject connection attempt using multicast address
*
@@ -1405,49 +1416,66 @@ static int connect(struct socket *sock, struct sockaddr *dest, int destlen,
goto exit;
}
- /* Reject any messages already in receive queue (very unlikely) */
- reject_rx_queue(sk);
+ timeout = (flags & O_NONBLOCK) ? 0 : tipc_sk(sk)->conn_timeout;
+
+ switch (sock->state) {
+ case SS_UNCONNECTED:
+ /* Send a 'SYN-' to destination */
+ m.msg_name = dest;
+ m.msg_namelen = destlen;
+
+ /* If connect is in non-blocking case, set MSG_DONTWAIT to
+ * indicate send_msg() is never blocked.
+ */
+ if (!timeout)
+ m.msg_flags = MSG_DONTWAIT;
+
+ res = send_msg(NULL, sock, &m, 0);
+ if ((res < 0) && (res != -EWOULDBLOCK))
+ goto exit;
- /* Send a 'SYN-' to destination */
- m.msg_name = dest;
- m.msg_namelen = destlen;
- res = send_msg(NULL, sock, &m, 0);
- if (res < 0)
+ /* Just entered SS_CONNECTING state; the only
+ * difference is that return value in non-blocking
+ * case is EINPROGRESS, rather than EALREADY.
+ */
+ res = -EINPROGRESS;
+ break;
+ case SS_CONNECTING:
+ res = -EALREADY;
+ break;
+ case SS_CONNECTED:
+ res = -EISCONN;
+ break;
+ default:
+ res = -EINVAL;
goto exit;
+ }
- /* Wait until an 'ACK' or 'RST' arrives, or a timeout occurs */
- timeout = tipc_sk(sk)->conn_timeout;
- release_sock(sk);
- res = wait_event_interruptible_timeout(*sk_sleep(sk),
- (!skb_queue_empty(&sk->sk_receive_queue) ||
- (sock->state != SS_CONNECTING)),
- timeout ? (long)msecs_to_jiffies(timeout)
- : MAX_SCHEDULE_TIMEOUT);
- lock_sock(sk);
+ if (sock->state == SS_CONNECTING) {
+ if (!timeout)
+ goto exit;
- if (res > 0) {
- buf = skb_peek(&sk->sk_receive_queue);
- if (buf != NULL) {
- msg = buf_msg(buf);
- res = auto_connect(sock, msg);
- if (!res) {
- if (!msg_data_sz(msg))
- advance_rx_queue(sk);
- }
- } else {
- if (sock->state == SS_CONNECTED)
- res = -EISCONN;
+ /* Wait until an 'ACK' or 'RST' arrives, or a timeout occurs */
+ release_sock(sk);
+ res = wait_event_interruptible_timeout(*sk_sleep(sk),
+ sock->state != SS_CONNECTING,
+ timeout ? (long)msecs_to_jiffies(timeout)
+ : MAX_SCHEDULE_TIMEOUT);
+ lock_sock(sk);
+ if (res <= 0) {
+ if (res == 0)
+ res = -ETIMEDOUT;
else
- res = -ECONNREFUSED;
+ ; /* leave "res" unchanged */
+ goto exit;
}
- } else {
- if (res == 0)
- res = -ETIMEDOUT;
- else
- ; /* leave "res" unchanged */
- sock->state = SS_DISCONNECTING;
}
+ if (unlikely(sock->state == SS_DISCONNECTING))
+ res = sock_error(sk);
+ else
+ res = 0;
+
exit:
release_sock(sk);
return res;
--
1.7.12.1
^ permalink raw reply related
* [PATCH net-next 5/9] tipc: consolidate connection-oriented message reception in one function
From: Paul Gortmaker @ 2012-12-08 1:19 UTC (permalink / raw)
To: David Miller; +Cc: netdev, Jon Maloy, Ying Xue, Paul Gortmaker
In-Reply-To: <1354929558-16948-1-git-send-email-paul.gortmaker@windriver.com>
From: Ying Xue <ying.xue@windriver.com>
Handling of connection-related message reception is currently scattered
around at different places in the code. This makes it harder to verify
that things are handled correctly in all possible scenarios.
So we consolidate the existing processing of connection-oriented
message reception in a single routine. In the process, we convert the
chain of if/else into a switch/case for improved readability.
A cast on the socket_state in the switch is needed to avoid compile
warnings on 32 bit, like "net/tipc/socket.c:1252:2: warning: case value
‘4294967295’ not in enumerated type". This happens because existing
tipc code pseudo extends the default linux socket state values with:
#define SS_LISTENING -1 /* socket is listening */
#define SS_READY -2 /* socket is connectionless */
It may make sense to add these as _positive_ values to the existing
socket state enum list someday, vs. these already existing defines.
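The situation can be reproduced outside the kernel with a small sketch. The names below (enum demo_state, DEMO_LISTENING, demo_state_name) are made up for illustration and only mimic the pattern of negative defines extending an enum; the cast to int in the switch is what lets the negative case labels match without the compiler complaining that they are not part of the enumerated type.

```c
/* Hypothetical state enum; all declared values are positive. */
enum demo_state {
	DEMO_UNCONNECTED = 1,
	DEMO_CONNECTING,
	DEMO_CONNECTED,
	DEMO_DISCONNECTING
};

/* Pseudo-extensions of the enum, defined as negative integers. */
#define DEMO_LISTENING	-1
#define DEMO_READY	-2

static const char *demo_state_name(enum demo_state state)
{
	/* Switch on int so the negative case labels are legal matches. */
	switch ((int)state) {
	case DEMO_UNCONNECTED:
		return "unconnected";
	case DEMO_CONNECTING:
		return "connecting";
	case DEMO_CONNECTED:
		return "connected";
	case DEMO_DISCONNECTING:
		return "disconnecting";
	case DEMO_LISTENING:
		return "listening";
	case DEMO_READY:
		return "ready";
	default:
		return "unknown";
	}
}
```

Adding the extra states to the enum itself, as the commit message suggests for some later date, would make the cast unnecessary.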
Signed-off-by: Ying Xue <ying.xue@windriver.com>
Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
[PG: add cast to fix warning; remove returns from middle of switch]
Signed-off-by: Paul Gortmaker <paul.gortmaker@windriver.com>
---
net/tipc/socket.c | 75 +++++++++++++++++++++++++++++++++++++------------------
1 file changed, 51 insertions(+), 24 deletions(-)
diff --git a/net/tipc/socket.c b/net/tipc/socket.c
index b630f38..d16a6de 100644
--- a/net/tipc/socket.c
+++ b/net/tipc/socket.c
@@ -1187,6 +1187,53 @@ static int rx_queue_full(struct tipc_msg *msg, u32 queue_size, u32 base)
}
/**
+ * filter_connect - Handle all incoming messages for a connection-based socket
+ * @tsock: TIPC socket
+ * @msg: message
+ *
+ * Returns TIPC error status code and socket error status code
+ * once it encounters some errors
+ */
+static u32 filter_connect(struct tipc_sock *tsock, struct sk_buff **buf)
+{
+ struct socket *sock = tsock->sk.sk_socket;
+ struct tipc_msg *msg = buf_msg(*buf);
+ u32 retval = TIPC_ERR_NO_PORT;
+
+ if (msg_mcast(msg))
+ return retval;
+
+ switch ((int)sock->state) {
+ case SS_CONNECTED:
+ /* Accept only connection-based messages sent by peer */
+ if (msg_connected(msg) && tipc_port_peer_msg(tsock->p, msg)) {
+ if (unlikely(msg_errcode(msg))) {
+ sock->state = SS_DISCONNECTING;
+ __tipc_disconnect(tsock->p);
+ }
+ retval = TIPC_OK;
+ }
+ break;
+ case SS_CONNECTING:
+ /* Accept only ACK or NACK message */
+ if (msg_connected(msg) || (msg_errcode(msg)))
+ retval = TIPC_OK;
+ break;
+ case SS_LISTENING:
+ case SS_UNCONNECTED:
+ /* Accept only SYN message */
+ if (!msg_connected(msg) && !(msg_errcode(msg)))
+ retval = TIPC_OK;
+ break;
+ case SS_DISCONNECTING:
+ break;
+ default:
+ pr_err("Unknown socket state %u\n", sock->state);
+ }
+ return retval;
+}
+
+/**
* filter_rcv - validate incoming message
* @sk: socket
* @buf: message
@@ -1203,6 +1250,7 @@ static u32 filter_rcv(struct sock *sk, struct sk_buff *buf)
struct socket *sock = sk->sk_socket;
struct tipc_msg *msg = buf_msg(buf);
u32 recv_q_len;
+ u32 res = TIPC_OK;
/* Reject message if it is wrong sort of message for socket */
if (msg_type(msg) > TIPC_DIRECT_MSG)
@@ -1212,24 +1260,9 @@ static u32 filter_rcv(struct sock *sk, struct sk_buff *buf)
if (msg_connected(msg))
return TIPC_ERR_NO_PORT;
} else {
- if (msg_mcast(msg))
- return TIPC_ERR_NO_PORT;
- if (sock->state == SS_CONNECTED) {
- if (!msg_connected(msg) ||
- !tipc_port_peer_msg(tipc_sk_port(sk), msg))
- return TIPC_ERR_NO_PORT;
- } else if (sock->state == SS_CONNECTING) {
- if (!msg_connected(msg) && (msg_errcode(msg) == 0))
- return TIPC_ERR_NO_PORT;
- } else if (sock->state == SS_LISTENING) {
- if (msg_connected(msg) || msg_errcode(msg))
- return TIPC_ERR_NO_PORT;
- } else if (sock->state == SS_DISCONNECTING) {
- return TIPC_ERR_NO_PORT;
- } else /* (sock->state == SS_UNCONNECTED) */ {
- if (msg_connected(msg) || msg_errcode(msg))
- return TIPC_ERR_NO_PORT;
- }
+ res = filter_connect(tipc_sk(sk), &buf);
+ if (res != TIPC_OK || buf == NULL)
+ return res;
}
/* Reject message if there isn't room to queue it */
@@ -1243,12 +1276,6 @@ static u32 filter_rcv(struct sock *sk, struct sk_buff *buf)
TIPC_SKB_CB(buf)->handle = 0;
__skb_queue_tail(&sk->sk_receive_queue, buf);
- /* Initiate connection termination for an incoming 'FIN' */
- if (unlikely(msg_errcode(msg) && (sock->state == SS_CONNECTED))) {
- sock->state = SS_DISCONNECTING;
- __tipc_disconnect(tipc_sk_port(sk));
- }
-
sk->sk_data_ready(sk, 0);
return TIPC_OK;
}
--
1.7.12.1
^ permalink raw reply related
* [PATCH net-next 3/9] tipc: change sk_receive_queue upper limit
From: Paul Gortmaker @ 2012-12-08 1:19 UTC (permalink / raw)
To: David Miller; +Cc: netdev, Jon Maloy, Paul Gortmaker
In-Reply-To: <1354929558-16948-1-git-send-email-paul.gortmaker@windriver.com>
From: Jon Maloy <jon.maloy@ericsson.com>
The sk_receive_queue upper limit for connectionless sockets has empirically
turned out to be too low. When we double the current limit we get far
fewer rejected messages and no noticeable negative side-effects.
Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: Paul Gortmaker <paul.gortmaker@windriver.com>
---
net/tipc/socket.c | 4 ++--
1 file changed, 2 insertions(+), 2 deletions(-)
diff --git a/net/tipc/socket.c b/net/tipc/socket.c
index 848be69..f553fde 100644
--- a/net/tipc/socket.c
+++ b/net/tipc/socket.c
@@ -1,7 +1,7 @@
/*
* net/tipc/socket.c: TIPC socket API
*
- * Copyright (c) 2001-2007, Ericsson AB
+ * Copyright (c) 2001-2007, 2012 Ericsson AB
* Copyright (c) 2004-2008, 2010-2012, Wind River Systems
* All rights reserved.
*
@@ -43,7 +43,7 @@
#define SS_LISTENING -1 /* socket is listening */
#define SS_READY -2 /* socket is connectionless */
-#define OVERLOAD_LIMIT_BASE 5000
+#define OVERLOAD_LIMIT_BASE 10000
#define CONN_TIMEOUT_DEFAULT 8000 /* default connect timeout = 8s */
struct tipc_sock {
--
1.7.12.1
^ permalink raw reply related
* [PATCH net-next 1/9] tipc: remove obsolete flush of stale reassembly buffer
From: Paul Gortmaker @ 2012-12-08 1:19 UTC (permalink / raw)
To: David Miller; +Cc: netdev, Jon Maloy, Erik Hugne, Paul Gortmaker
In-Reply-To: <1354929558-16948-1-git-send-email-paul.gortmaker@windriver.com>
From: Erik Hugne <erik.hugne@ericsson.com>
Each link instance has a periodic job checking if there is a stale
ongoing message reassembly associated with the link. If no new
fragment has been received during the last 4*[link_tolerance] period,
it is assumed the missing fragment will never arrive. As a consequence,
the reassembly buffer is discarded, and a gap in the message sequence
occurs.
This assumption is wrong. After we abandoned our ambition to develop
packet routing for multi-cluster networks, only single-hop packet
transfer remains as an option. For those, all packets are guaranteed
to be delivered in sequence to the defragmentation layer. Any failure
to achieve sequenced delivery will eventually lead to link reset, and
the reassembly buffer will be flushed anyway.
So we just remove this periodic check, which is now obsolete.
Signed-off-by: Erik Hugne <erik.hugne@ericsson.com>
Acked-by: Ying Xue <ying.xue@windriver.com>
Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
[PG: also delete get/inc_timer count, since they are now unused]
Signed-off-by: Paul Gortmaker <paul.gortmaker@windriver.com>
---
net/tipc/link.c | 44 --------------------------------------------
1 file changed, 44 deletions(-)
diff --git a/net/tipc/link.c b/net/tipc/link.c
index 87bf5aa..daa6080 100644
--- a/net/tipc/link.c
+++ b/net/tipc/link.c
@@ -97,7 +97,6 @@ static int link_send_sections_long(struct tipc_port *sender,
struct iovec const *msg_sect,
u32 num_sect, unsigned int total_len,
u32 destnode);
-static void link_check_defragm_bufs(struct tipc_link *l_ptr);
static void link_state_event(struct tipc_link *l_ptr, u32 event);
static void link_reset_statistics(struct tipc_link *l_ptr);
static void link_print(struct tipc_link *l_ptr, const char *str);
@@ -271,7 +270,6 @@ static void link_timeout(struct tipc_link *l_ptr)
}
/* do all other link processing performed on a periodic basis */
- link_check_defragm_bufs(l_ptr);
link_state_event(l_ptr, TIMEOUT_EVT);
@@ -2497,16 +2495,6 @@ static void set_expected_frags(struct sk_buff *buf, u32 exp)
msg_set_bcast_ack(buf_msg(buf), exp);
}
-static u32 get_timer_cnt(struct sk_buff *buf)
-{
- return msg_reroute_cnt(buf_msg(buf));
-}
-
-static void incr_timer_cnt(struct sk_buff *buf)
-{
- msg_incr_reroute_cnt(buf_msg(buf));
-}
-
/*
* tipc_link_recv_fragment(): Called with node lock on. Returns
* the reassembled buffer if message is complete.
@@ -2585,38 +2573,6 @@ int tipc_link_recv_fragment(struct sk_buff **pending, struct sk_buff **fb,
return 0;
}
-/**
- * link_check_defragm_bufs - flush stale incoming message fragments
- * @l_ptr: pointer to link
- */
-static void link_check_defragm_bufs(struct tipc_link *l_ptr)
-{
- struct sk_buff *prev = NULL;
- struct sk_buff *next = NULL;
- struct sk_buff *buf = l_ptr->defragm_buf;
-
- if (!buf)
- return;
- if (!link_working_working(l_ptr))
- return;
- while (buf) {
- u32 cnt = get_timer_cnt(buf);
-
- next = buf->next;
- if (cnt < 4) {
- incr_timer_cnt(buf);
- prev = buf;
- } else {
- if (prev)
- prev->next = buf->next;
- else
- l_ptr->defragm_buf = buf->next;
- kfree_skb(buf);
- }
- buf = next;
- }
-}
-
static void link_set_supervision_props(struct tipc_link *l_ptr, u32 tolerance)
{
if ((tolerance < TIPC_MIN_LINK_TOL) || (tolerance > TIPC_MAX_LINK_TOL))
--
1.7.12.1
^ permalink raw reply related
* [PATCH v2 net-next 0/9] tipc: more updates for the v3.8 content
From: Paul Gortmaker @ 2012-12-08 1:19 UTC (permalink / raw)
To: David Miller; +Cc: netdev, Jon Maloy, Paul Gortmaker
Changes since v1:
-get rid of essentially unused variable spotted by
Neil Horman (patch #2)
-drop patch #3; defer it for 3.9 content, so Neil,
Jon and Ying can discuss its specifics at their
leisure while net-next is closed. (It had no
direct dependencies to the rest of the series, and
was just an optimization)
-fix indentation of accept() code directly in place
vs. forking it out to a separate function (was patch
#10, now patch #9).
Rebuilt and re-ran tests just to ensure nothing odd happened.
Original v1 text follows; updated pull information follows that.
---------
Here is another batch of TIPC changes. The most interesting
thing is probably the non-blocking socket connect - I'm told
there were several users looking forward to seeing this.
Also there were some resource limitation changes that had
the right intent back in 2005, but were now apparently causing
needless limitations to people's real use cases; those have
been relaxed/removed.
There is a lockdep splat fix, but no need for a stable backport,
since it is virtually impossible to trigger in mainline; you
have to essentially modify code to force the probabilities
in your favour to see it.
The rest can largely be categorized as general cleanup of things
seen in the process of getting the above changes done.
Tested between 64 and 32 bit nodes with the test suite. I've
also compile tested all the individual commits on the chain.
I'd originally figured on this queue not being ready for 3.8, but
the extended stabilization window of 3.7 has changed that. On
the other hand, this can still be 3.9 material, if that simply
works better for folks - no problem for me to defer it to 2013.
If anyone spots any problems then I'll definitely defer it,
rather than rush a last minute respin.
Thanks,
Paul.
---
The following changes since commit b93196dc5af7729ff7cc50d3d322ab1a364aa14f:
net: fix some compiler warning in net/core/neighbour.c (2012-12-05 21:50:37 -0500)
are available in the git repository at:
git://git.kernel.org/pub/scm/linux/kernel/git/paulg/linux.git tipc_net-next_v2
for you to fetch changes up to 0fef8f205f6f4cdff1869e54e44f317a79902785:
tipc: refactor accept() code for improved readability (2012-12-07 17:23:24 -0500)
----------------------------------------------------------------
Erik Hugne (1):
tipc: remove obsolete flush of stale reassembly buffer
Jon Maloy (1):
tipc: change sk_receive_queue upper limit
Paul Gortmaker (2):
tipc: standardize across connect/disconnect function naming
tipc: refactor accept() code for improved readability
Ying Xue (5):
tipc: eliminate aggregate sk_receive_queue limit
tipc: consolidate connection-oriented message reception in one function
tipc: introduce non-blocking socket connect
tipc: eliminate connection setup for implied connect in recv_msg()
tipc: add lock nesting notation to quiet lockdep warning
net/tipc/link.c | 44 -------
net/tipc/port.c | 32 +++--
net/tipc/port.h | 6 +-
net/tipc/socket.c | 353 ++++++++++++++++++++++++++++++------------------------
net/tipc/subscr.c | 2 +-
5 files changed, 225 insertions(+), 212 deletions(-)
^ permalink raw reply
* [PATCH v4 5/5] vxlan: Add capability of Rx checksum offload for inner packet
From: Joseph Gasparakis @ 2012-12-08 0:14 UTC (permalink / raw)
To: davem, shemminger, chrisw, gospo
Cc: Joseph Gasparakis, netdev, linux-kernel, dmitry, saeed.bishara,
bhutchings
In-Reply-To: <1354925658-24115-1-git-send-email-joseph.gasparakis@intel.com>
This patch adds capability in vxlan to identify received
checksummed inner packets and signal them to the upper layers of
the stack. The driver needs to set the skb->encapsulation bit
and also set the skb->ip_summed to CHECKSUM_UNNECESSARY.
Signed-off-by: Joseph Gasparakis <joseph.gasparakis@intel.com>
---
drivers/net/vxlan.c | 16 ++++++++++++++--
1 file changed, 14 insertions(+), 2 deletions(-)
diff --git a/drivers/net/vxlan.c b/drivers/net/vxlan.c
index 88b31f2..3b3fdf6 100644
--- a/drivers/net/vxlan.c
+++ b/drivers/net/vxlan.c
@@ -607,7 +607,17 @@ static int vxlan_udp_encap_recv(struct sock *sk, struct sk_buff *skb)
__skb_tunnel_rx(skb, vxlan->dev);
skb_reset_network_header(skb);
- skb->ip_summed = CHECKSUM_NONE;
+
+ /* If the NIC driver gave us an encapsulated packet with
+ * CHECKSUM_UNNECESSARY and Rx checksum feature is enabled,
+ * leave the CHECKSUM_UNNECESSARY, the device checksummed it
+ * for us. Otherwise force the upper layers to verify it.
+ */
+ if (skb->ip_summed != CHECKSUM_UNNECESSARY || !skb->encapsulation ||
+ !(vxlan->dev->features & NETIF_F_RXCSUM))
+ skb->ip_summed = CHECKSUM_NONE;
+
+ skb->encapsulation = 0;
err = IP_ECN_decapsulate(oip, skb);
if (unlikely(err)) {
@@ -1175,7 +1185,9 @@ static void vxlan_setup(struct net_device *dev)
dev->features |= NETIF_F_LLTX;
dev->features |= NETIF_F_NETNS_LOCAL;
dev->features |= NETIF_F_SG | NETIF_F_HW_CSUM;
- dev->hw_features |= NETIF_F_SG | NETIF_F_HW_CSUM;
+ dev->features |= NETIF_F_RXCSUM;
+
+ dev->hw_features |= NETIF_F_SG | NETIF_F_HW_CSUM | NETIF_F_RXCSUM;
dev->priv_flags &= ~IFF_XMIT_DST_RELEASE;
spin_lock_init(&vxlan->hash_lock);
--
1.7.11.7
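
The receive-side rule in the hunk above can be read as a small decision function. What follows is a minimal, self-contained sketch of that rule and not the kernel code itself: struct model_skb, the MODEL_* names and the feature bit position are hypothetical stand-ins for struct sk_buff, CHECKSUM_NONE/CHECKSUM_UNNECESSARY and NETIF_F_RXCSUM.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Stand-ins for the kernel's checksum states; the values are illustrative. */
enum model_ip_summed {
	MODEL_CHECKSUM_NONE,
	MODEL_CHECKSUM_UNNECESSARY,
};

/* Stand-in for NETIF_F_RXCSUM; the bit position is arbitrary here. */
#define MODEL_F_RXCSUM (UINT64_C(1) << 0)

/* Only the two fields the rule looks at (hypothetical model type). */
struct model_skb {
	enum model_ip_summed ip_summed;
	bool encapsulation;
};

/*
 * Keep CHECKSUM_UNNECESSARY only when all three conditions hold: the
 * lower driver reported it, the driver flagged the packet as
 * encapsulated, and Rx checksumming is enabled on the tunnel device.
 * In every other case fall back to CHECKSUM_NONE so the upper layers
 * verify the checksum. The encapsulation flag is always cleared.
 */
void model_vxlan_rx_csum(struct model_skb *skb, uint64_t dev_features)
{
	if (skb->ip_summed != MODEL_CHECKSUM_UNNECESSARY ||
	    !skb->encapsulation ||
	    !(dev_features & MODEL_F_RXCSUM))
		skb->ip_summed = MODEL_CHECKSUM_NONE;

	skb->encapsulation = false;
}

/* Usage example: the offloaded case and one fallback case. */
void model_vxlan_rx_csum_example(void)
{
	struct model_skb ok = { MODEL_CHECKSUM_UNNECESSARY, true };
	struct model_skb plain = { MODEL_CHECKSUM_UNNECESSARY, false };

	model_vxlan_rx_csum(&ok, MODEL_F_RXCSUM);
	assert(ok.ip_summed == MODEL_CHECKSUM_UNNECESSARY);

	model_vxlan_rx_csum(&plain, MODEL_F_RXCSUM);
	assert(plain.ip_summed == MODEL_CHECKSUM_NONE);
}
```

Clearing the flag unconditionally means that, in this model, a stale encapsulation mark never travels past the tunnel device.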
^ permalink raw reply related
* [PATCH v4 4/5] ixgbe: Adding tx encapsulation capability
From: Joseph Gasparakis @ 2012-12-08 0:14 UTC (permalink / raw)
To: davem, shemminger, chrisw, gospo
Cc: Joseph Gasparakis, netdev, linux-kernel, dmitry, saeed.bishara,
bhutchings, Alexander Duyck
In-Reply-To: <1354925658-24115-1-git-send-email-joseph.gasparakis@intel.com>
This patch allows ixgbe to recognize encapsulated packets and do the tx
checksum offload in hardware. This patch is only for demonstration
purposes and should not be applied.
Signed-off-by: Joseph Gasparakis <joseph.gasparakis@intel.com>
Signed-off-by: Alexander Duyck <alexander.h.duyck@intel.com>
---
drivers/net/ethernet/intel/ixgbe/ixgbe_main.c | 46 +++++++++++++++++++++------
1 file changed, 37 insertions(+), 9 deletions(-)
diff --git a/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c b/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c
index fb165b6..62a7d6e 100644
--- a/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c
+++ b/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c
@@ -5972,17 +5972,42 @@ static void ixgbe_tx_csum(struct ixgbe_ring *tx_ring,
if (!(first->tx_flags & IXGBE_TX_FLAGS_TXSW))
return;
}
+ vlan_macip_lens |= skb_network_offset(skb)
+ << IXGBE_ADVTXD_MACLEN_SHIFT;
} else {
u8 l4_hdr = 0;
- switch (first->protocol) {
- case __constant_htons(ETH_P_IP):
- vlan_macip_lens |= skb_network_header_len(skb);
+ union {
+ struct iphdr *ipv4;
+ struct ipv6hdr *ipv6;
+ u8 *raw;
+ } network_hdr;
+ union {
+ struct tcphdr *tcphdr;
+ u8 *raw;
+ } transport_hdr;
+
+ if (skb->encapsulation) {
+ network_hdr.raw = skb_inner_network_header(skb);
+ transport_hdr.raw = skb_inner_transport_header(skb);
+ vlan_macip_lens |= skb_inner_network_offset(skb) <<
+ IXGBE_ADVTXD_MACLEN_SHIFT;
+ } else {
+ network_hdr.raw = skb_network_header(skb);
+ transport_hdr.raw = skb_transport_header(skb);
+ vlan_macip_lens |= skb_network_offset(skb) <<
+ IXGBE_ADVTXD_MACLEN_SHIFT;
+ }
+
+ /* use first 4 bits to determine IP version */
+ switch (network_hdr.ipv4->version) {
+ case 4:
+ vlan_macip_lens |= transport_hdr.raw - network_hdr.raw;
type_tucmd |= IXGBE_ADVTXD_TUCMD_IPV4;
- l4_hdr = ip_hdr(skb)->protocol;
+ l4_hdr = network_hdr.ipv4->protocol;
break;
- case __constant_htons(ETH_P_IPV6):
- vlan_macip_lens |= skb_network_header_len(skb);
- l4_hdr = ipv6_hdr(skb)->nexthdr;
+ case 6:
+ vlan_macip_lens |= transport_hdr.raw - network_hdr.raw;
+ l4_hdr = network_hdr.ipv6->nexthdr;
break;
default:
if (unlikely(net_ratelimit())) {
@@ -5996,7 +6021,7 @@ static void ixgbe_tx_csum(struct ixgbe_ring *tx_ring,
switch (l4_hdr) {
case IPPROTO_TCP:
type_tucmd |= IXGBE_ADVTXD_TUCMD_L4T_TCP;
- mss_l4len_idx = tcp_hdrlen(skb) <<
+ mss_l4len_idx = (transport_hdr.tcphdr->doff * 4) <<
IXGBE_ADVTXD_L4LEN_SHIFT;
break;
case IPPROTO_SCTP:
@@ -6022,7 +6047,6 @@ static void ixgbe_tx_csum(struct ixgbe_ring *tx_ring,
}
/* vlan_macip_lens: MACLEN, VLAN tag */
- vlan_macip_lens |= skb_network_offset(skb) << IXGBE_ADVTXD_MACLEN_SHIFT;
vlan_macip_lens |= first->tx_flags & IXGBE_TX_FLAGS_VLAN_MASK;
ixgbe_tx_ctxtdesc(tx_ring, vlan_macip_lens, 0,
@@ -7383,6 +7407,10 @@ static int ixgbe_probe(struct pci_dev *pdev,
netdev->hw_features = netdev->features;
+ netdev->hw_enc_features = NETIF_F_IP_CSUM |
+ NETIF_F_IPV6_CSUM |
+ NETIF_F_SG;
+
switch (adapter->hw.mac.type) {
case ixgbe_mac_82599EB:
case ixgbe_mac_X540:
--
1.7.11.7
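
As a worked illustration of the length arithmetic in ixgbe_tx_csum() above, here is a minimal user-space sketch. It assumes the header offsets are already known, and the shift values (9 for MACLEN, 8 for L4LEN) are assumptions about IXGBE_ADVTXD_MACLEN_SHIFT and IXGBE_ADVTXD_L4LEN_SHIFT, not something this patch states. All model_* and MODEL_* names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Assumed shift values; check the driver headers before relying on them. */
#define MODEL_MACLEN_SHIFT 9
#define MODEL_L4LEN_SHIFT  8

/* Byte offsets of the headers to be checksummed, from the frame start. */
struct model_hdr_offsets {
	size_t network;    /* inner or outer IP header */
	size_t transport;  /* inner or outer L4 header */
};

/* The IP version is the high nibble of the first byte for IPv4 and IPv6. */
unsigned int model_ip_version(const uint8_t *network_hdr)
{
	return network_hdr[0] >> 4;
}

/*
 * MACLEN is everything in front of the chosen network header; for an
 * encapsulated frame that includes the outer headers. The IP header
 * length is the distance between the two header positions.
 */
uint32_t model_vlan_macip_lens(struct model_hdr_offsets off)
{
	uint32_t mac_len, ip_len;

	assert(off.transport >= off.network);
	mac_len = (uint32_t)off.network;
	ip_len = (uint32_t)(off.transport - off.network);

	return (mac_len << MODEL_MACLEN_SHIFT) | ip_len;
}

/* The TCP data offset (in 32-bit words) is the high nibble of byte 12. */
uint32_t model_tcp_l4len(const uint8_t *tcp_hdr)
{
	uint32_t doff = tcp_hdr[12] >> 4;

	return (doff * 4) << MODEL_L4LEN_SHIFT;
}
```

For a VXLAN frame with 14 + 20 + 8 + 8 bytes of outer headers and a 14-byte inner Ethernet header, the inner network header sits at offset 64 and the inner transport header at 84, so the packed value is (64 << 9) | 20.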
^ permalink raw reply related
* [PATCH v4 3/5] vxlan: capture inner headers during encapsulation
From: Joseph Gasparakis @ 2012-12-08 0:14 UTC (permalink / raw)
To: davem, shemminger, chrisw, gospo
Cc: Joseph Gasparakis, netdev, linux-kernel, dmitry, saeed.bishara,
bhutchings, Peter P Waskiewicz Jr, Alexander Duyck
In-Reply-To: <1354925658-24115-1-git-send-email-joseph.gasparakis@intel.com>
Allow VXLAN to make use of Tx checksum offloading and Tx scatter-gather.
The advantage of these two changes is that they also allow VXLAN to
make use of GSO.
Signed-off-by: Joseph Gasparakis <joseph.gasparakis@intel.com>
Signed-off-by: Peter P Waskiewicz Jr <peter.p.waskiewicz.jr@intel.com>
Signed-off-by: Alexander Duyck <alexander.h.duyck@intel.com>
---
drivers/net/vxlan.c | 10 +++++++++-
1 file changed, 9 insertions(+), 1 deletion(-)
diff --git a/drivers/net/vxlan.c b/drivers/net/vxlan.c
index ce77b8b..88b31f2 100644
--- a/drivers/net/vxlan.c
+++ b/drivers/net/vxlan.c
@@ -876,6 +876,11 @@ static netdev_tx_t vxlan_xmit(struct sk_buff *skb, struct net_device *dev)
goto drop;
}
+ if (!skb->encapsulation) {
+ skb_reset_inner_headers(skb);
+ skb->encapsulation = 1;
+ }
+
/* Need space for new headers (invalidates iph ptr) */
if (skb_cow_head(skb, VXLAN_HEADROOM))
goto drop;
@@ -947,7 +952,8 @@ static netdev_tx_t vxlan_xmit(struct sk_buff *skb, struct net_device *dev)
vxlan_set_owner(dev, skb);
/* See iptunnel_xmit() */
- skb->ip_summed = CHECKSUM_NONE;
+ if (skb->ip_summed != CHECKSUM_PARTIAL)
+ skb->ip_summed = CHECKSUM_NONE;
ip_select_ident(iph, &rt->dst, NULL);
err = ip_local_out(skb);
@@ -1168,6 +1174,8 @@ static void vxlan_setup(struct net_device *dev)
dev->tx_queue_len = 0;
dev->features |= NETIF_F_LLTX;
dev->features |= NETIF_F_NETNS_LOCAL;
+ dev->features |= NETIF_F_SG | NETIF_F_HW_CSUM;
+ dev->hw_features |= NETIF_F_SG | NETIF_F_HW_CSUM;
dev->priv_flags &= ~IFF_XMIT_DST_RELEASE;
spin_lock_init(&vxlan->hash_lock);
--
1.7.11.7
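
The two transmit-side changes above can be modelled in isolation. This is a sketch under stated assumptions: struct model_tx_skb and the MODEL_* names are hypothetical, and the body of model_reset_inner_headers() is only a guess at what skb_reset_inner_headers() does (copy the current header offsets into the inner ones), since that helper is defined in another patch of the series.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Illustrative stand-ins for the kernel's checksum states. */
enum model_ip_summed {
	MODEL_CHECKSUM_NONE,
	MODEL_CHECKSUM_PARTIAL,
};

/* Hypothetical model of the sk_buff fields touched on transmit. */
struct model_tx_skb {
	size_t network_header;
	size_t transport_header;
	size_t inner_network_header;
	size_t inner_transport_header;
	bool encapsulation;
	enum model_ip_summed ip_summed;
};

/* Assumed behaviour: remember where the payload's own headers are. */
static void model_reset_inner_headers(struct model_tx_skb *skb)
{
	skb->inner_network_header = skb->network_header;
	skb->inner_transport_header = skb->transport_header;
}

/*
 * Record the inner headers once, before the tunnel headers are pushed.
 * A packet that is already marked keeps the offsets recorded earlier.
 */
void model_vxlan_mark_inner(struct model_tx_skb *skb)
{
	if (!skb->encapsulation) {
		model_reset_inner_headers(skb);
		skb->encapsulation = true;
	}
}

/* Keep a pending checksum request; reset every other state to NONE. */
void model_vxlan_tx_csum(struct model_tx_skb *skb)
{
	if (skb->ip_summed != MODEL_CHECKSUM_PARTIAL)
		skb->ip_summed = MODEL_CHECKSUM_NONE;
}
```

The point of the guard in model_vxlan_mark_inner() is that the innermost offsets survive if the packet was already marked as encapsulated.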
^ permalink raw reply related
* [PATCH v4 2/5] net: Handle encapsulated offloads before fragmentation or handing to lower dev
From: Joseph Gasparakis @ 2012-12-08 0:14 UTC (permalink / raw)
To: davem, shemminger, chrisw, gospo
Cc: Alexander Duyck, netdev, linux-kernel, dmitry, saeed.bishara,
bhutchings
In-Reply-To: <1354925658-24115-1-git-send-email-joseph.gasparakis@intel.com>
From: Alexander Duyck <alexander.h.duyck@intel.com>
This change allows VXLAN to enable Tx checksum offloading even on
devices that do not support encapsulated checksum offloads. The
advantage is that the lower device can change, for example because of
routing table changes, without affecting the features of the VXLAN
device itself.
Signed-off-by: Alexander Duyck <alexander.h.duyck@intel.com>
---
net/core/dev.c | 15 +++++++++++++--
net/ipv4/ip_output.c | 4 ++++
2 files changed, 17 insertions(+), 2 deletions(-)
diff --git a/net/core/dev.c b/net/core/dev.c
index 307142a..a4c4a1b 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -2324,6 +2324,13 @@ int dev_hard_start_xmit(struct sk_buff *skb, struct net_device *dev,
skb->vlan_tci = 0;
}
+ /* If encapsulation offload request, verify we are testing
+ * hardware encapsulation features instead of standard
+ * features for the netdev
+ */
+ if (skb->encapsulation)
+ features &= dev->hw_enc_features;
+
if (netif_needs_gso(skb, features)) {
if (unlikely(dev_gso_segment(skb, features)))
goto out_kfree_skb;
@@ -2339,8 +2346,12 @@ int dev_hard_start_xmit(struct sk_buff *skb, struct net_device *dev,
* checksumming here.
*/
if (skb->ip_summed == CHECKSUM_PARTIAL) {
- skb_set_transport_header(skb,
- skb_checksum_start_offset(skb));
+ if (skb->encapsulation)
+ skb_set_inner_transport_header(skb,
+ skb_checksum_start_offset(skb));
+ else
+ skb_set_transport_header(skb,
+ skb_checksum_start_offset(skb));
if (!(features & NETIF_F_ALL_CSUM) &&
skb_checksum_help(skb))
goto out_kfree_skb;
diff --git a/net/ipv4/ip_output.c b/net/ipv4/ip_output.c
index 6537a40..3e98ed2 100644
--- a/net/ipv4/ip_output.c
+++ b/net/ipv4/ip_output.c
@@ -595,6 +595,10 @@ slow_path_clean:
}
slow_path:
+ /* for offloaded checksums cleanup checksum before fragmentation */
+ if ((skb->ip_summed == CHECKSUM_PARTIAL) && skb_checksum_help(skb))
+ goto fail;
+
left = skb->len - hlen; /* Space per frame */
ptr = hlen; /* Where to start from */
--
1.7.11.7
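
To make the feature test in dev_hard_start_xmit() above concrete, here is a minimal sketch of the masking step and of the resulting software-checksum decision. The MODEL_F_* bits and function names are hypothetical stand-ins for the NETIF_F_* flags, dev->hw_enc_features and the check that leads to skb_checksum_help(); the bit positions are arbitrary.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical feature bits standing in for NETIF_F_* flags. */
#define MODEL_F_SG   (UINT64_C(1) << 0)
#define MODEL_F_CSUM (UINT64_C(1) << 1)
#define MODEL_F_TSO  (UINT64_C(1) << 2)

/*
 * For an encapsulated packet, only features the device also advertises
 * for encapsulated traffic remain usable. Other packets see the normal
 * feature set unchanged.
 */
uint64_t model_effective_features(uint64_t features,
				  uint64_t hw_enc_features,
				  bool encapsulation)
{
	if (encapsulation)
		features &= hw_enc_features;
	return features;
}

/*
 * A pending checksum request must be completed in software when the
 * effective feature set has no checksum offload left.
 */
bool model_needs_sw_checksum(bool csum_partial, uint64_t features)
{
	return csum_partial && !(features & MODEL_F_CSUM);
}

/* Usage example: a device with no encapsulation offloads at all. */
void model_features_example(void)
{
	uint64_t dev = MODEL_F_SG | MODEL_F_CSUM | MODEL_F_TSO;
	uint64_t eff = model_effective_features(dev, 0, true);

	assert(eff == 0);
	assert(model_needs_sw_checksum(true, eff));
}
```

Because the mask is applied per packet at transmit time, the tunnel device in this model can keep advertising checksum offload whichever lower device the route selects.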
^ permalink raw reply related