Netdev List

Netdev List
 help / color / mirror / Atom feed

* Re: [RFC PATCH] af_packet: don't to defrag shared skb
From: Eric Leblond @ 2012-12-07 21:46 UTC (permalink / raw)
  To: Johannes Berg; +Cc: David Miller, netdev, linux-wireless, linville
In-Reply-To: <1354915824.9124.11.camel@jlt4.sipsolutions.net>

Hi,

On Fri, 2012-12-07 at 22:30 +0100, Johannes Berg wrote:
> 
> > Wireless folks, please take a look.  The issue is that,
> > under the circumstances listed below, we get SKBs in
> > the AF_PACKET input path that are shared.
> 
> Ok so I took a look, but I can't see where the wireless stack is going
> wrong.
> 
> > Given the logic present in ieee80211_deliver_skb() I think
> > the mac80211 code doesn't expect this either.
> 
> This is correct, but the driver should never give us a shared skb. From
> the other mail it seems Eric is using iwlwifi, which is definitely not
> creating shared SKBs. Nothing in mac80211 creates them either.
> 
> > > I've observed this crash under the following condition:
> > >  1. a program is listening to an wifi interface (let say wlan0)
> > >  2. it is using fanout capture in flow load balancing mode
> > >  3. defrag option is on on the fanout socket
> 
> How do you set this up, and what does it do? I'd like to try to
> reproduce this.



> > >  4. the interface disconnect (radio down for example)
> > >  5. the interface reconnect (radio switched up)
> > >  6. once reconnected a single packet is seen with skb->users=2
> 
> That's interesting. A single one seems odd. I might have expected two,
> but not one. Well, since you removed the crash ... I guess I'll have to
> believe that there's just one and the second one doesn't show up because
> we crashed before :-)

It was the case with initial code but I've suppressed the BUG() call and
replaced it with a return ;)

> > So if we look at ieee80211_deliver_skb(), it has code to deal with unaligned
> > packet headers, wherein it memoves() the data into a better aligned location.
> > 
> > But if these SKBs really are skb_shared(), this packet data
> > modification is illegal.
> > 
> > I suspect that the assumptions built into this unaligned data handling
> > code, and AF_PACKET, are correct.  Meaning that we should never see
> > skb_shared() packets here.  We just have a missing skb_copy()
> > somewhere in mac80211, Johannes can you please take a look?
> 
> My first theory was related to multiple virtual interfaces, but Eric
> didn't say he was running that, but we use skb_copy() for that in
> ieee80211_prepare_and_rx_handle(). That's not necessarily the most
> efficient (another reason for drivers to use paged RX here) but clearly
> not causing the issue.
> 
> The only other theory I can come up with right now is that the skb_get()
> happens in deliver_skb via __netif_receive_skb. Keeping in mind that
> wpa_supplicant might have another packet socket open for authentication
> packets, that seems like a possibility. I'll test it once I figure out
> how to do this "defrag" option you speak of :)

I've no simple code available to test it. I've add the problem when
running suricata. Maybe you could use it. It is packaged in most
distribution now.
To enable packet fanout. Modify default /etc/suricata/suricata.yaml to
have something like:
af-packet:
  - interface: wlan0
    # Number of receive threads (>1 will enable experimental flow pinned
    # runmode)
    threads: 3

Start it with: suricata --af-packet=wlan0
Then get wlan0 interface down and up. After a few seconds, the crash
will occur.
It is a bit complicated for a simple test case. I can cook a little
example code if you want.

BR,
-- 
Eric Leblond <eric@regit.org>
Blog: https://home.regit.org/

^ permalink raw reply

* Re: [RFC PATCH] af_packet: don't to defrag shared skb
From: Johannes Berg @ 2012-12-07 21:56 UTC (permalink / raw)
  To: Eric Leblond
  Cc: David Miller, netdev-u79uwXL29TY76Z2rM5mHXA,
	linux-wireless-u79uwXL29TY76Z2rM5mHXA,
	linville-2XuSBdqkA4R54TAoqtyWWQ
In-Reply-To: <1354916768.4530.18.camel@tiger2>

HI,

> > That's interesting. A single one seems odd. I might have expected two,
> > but not one. Well, since you removed the crash ... I guess I'll have to
> > believe that there's just one and the second one doesn't show up because
> > we crashed before :-)
> 
> It was the case with initial code but I've suppressed the BUG() call and
> replaced it with a return ;)

Right. Well, does it actually still work then? I wonder if then you
don't see a second packet because the first one doesn't make it
through ... I'm thinking that this is just an internal af_packet problem
with multiple listeners.


> I've no simple code available to test it. I've add the problem when
> running suricata. Maybe you could use it. It is packaged in most
> distribution now.
> To enable packet fanout. Modify default /etc/suricata/suricata.yaml to
> have something like:
> af-packet:
>   - interface: wlan0
>     # Number of receive threads (>1 will enable experimental flow pinned
>     # runmode)
>     threads: 3

That'll should do, thanks.

johannes

--
To unsubscribe from this list: send the line "unsubscribe linux-wireless" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply

* Re: [RFC PATCH] af_packet: don't to defrag shared skb
From: Johannes Berg @ 2012-12-07 22:12 UTC (permalink / raw)
  To: David Miller
  Cc: eric-EVVnsjFE0OfYtjvyW6yDsg, netdev-u79uwXL29TY76Z2rM5mHXA,
	linux-wireless-u79uwXL29TY76Z2rM5mHXA,
	linville-2XuSBdqkA4R54TAoqtyWWQ, Eric Dumazet
In-Reply-To: <1354916502.9124.18.camel-8Nb76shvtaUJvtFkdXX2HixXY32XiHfO@public.gmane.org>

On Fri, 2012-12-07 at 22:41 +0100, Johannes Berg wrote:

> wpa_supplicant opens a packet socket for ETH_P_EAPOL, which indirectly
> eventually calls dev_add_pack(). But if you do the same for another
> socket, you'll get the same again, and then deliver_skb() will deliver
> only a refcounted packet to the prot_hook->func().
> 
> This seems like it could very well cause the problem?

Ok I couldn't reproduce it because suricata didn't work this way, but I
did try using tcpdump (without the fanout) and deliver_skb() *is* called
twice for each packet as I thought. It thus seems the problem is
entirely in af_packet itself. It was changed a bit by Eric Dumazet in
bc416d9768 but the original code goes back to the original defrag
support in 7736d33f4, as far as I can tell. aec27311c changed the code
to not do skb_clone(), but it seems the skb_share_check() should be
before the pskb_pull().

Well, it seems ip_check_defrag() should simply not modify the SKB before
it unshares it ... like this maybe:

diff --git a/net/ipv4/ip_fragment.c b/net/ipv4/ip_fragment.c
index 448e685..8d5cc75 100644
--- a/net/ipv4/ip_fragment.c
+++ b/net/ipv4/ip_fragment.c
@@ -707,28 +707,27 @@ EXPORT_SYMBOL(ip_defrag);
 
 struct sk_buff *ip_check_defrag(struct sk_buff *skb, u32 user)
 {
-	const struct iphdr *iph;
+	struct iphdr iph;
 	u32 len;
 
 	if (skb->protocol != htons(ETH_P_IP))
 		return skb;
 
-	if (!pskb_may_pull(skb, sizeof(struct iphdr)))
+	if (!skb_copy_bits(skb, 0, &iph, sizeof(iph)))
 		return skb;
 
-	iph = ip_hdr(skb);
-	if (iph->ihl < 5 || iph->version != 4)
+	if (iph.ihl < 5 || iph.version != 4)
 		return skb;
-	if (!pskb_may_pull(skb, iph->ihl*4))
-		return skb;
-	iph = ip_hdr(skb);
-	len = ntohs(iph->tot_len);
-	if (skb->len < len || len < (iph->ihl * 4))
+
+	len = ntohs(iph.tot_len);
+	if (skb->len < len || len < (iph.ihl * 4))
 		return skb;
 
-	if (ip_is_fragment(ip_hdr(skb))) {
+	if (ip_is_fragment(&iph)) {
 		skb = skb_share_check(skb, GFP_ATOMIC);
 		if (skb) {
+			if (!pskb_may_pull(skb, iph.ihl*4))
+				return skb;
 			if (pskb_trim_rcsum(skb, len))
 				return skb;
 			memset(IPCB(skb), 0, sizeof(struct inet_skb_parm));

johannes

--
To unsubscribe from this list: send the line "unsubscribe linux-wireless" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply related

* Re: [RFC PATCH] af_packet: don't to defrag shared skb
From: Johannes Berg @ 2012-12-07 22:23 UTC (permalink / raw)
  To: David Miller; +Cc: eric, netdev, linux-wireless, linville, Eric Dumazet
In-Reply-To: <1354918363.9124.29.camel@jlt4.sipsolutions.net>

Oh I should say ... IP is like black magic to me ;-)

> -	if (!pskb_may_pull(skb, sizeof(struct iphdr)))
> +	if (!skb_copy_bits(skb, 0, &iph, sizeof(iph)))
>  		return skb;
>  
> -	iph = ip_hdr(skb);
> -	if (iph->ihl < 5 || iph->version != 4)
> +	if (iph.ihl < 5 || iph.version != 4)
>  		return skb;
> -	if (!pskb_may_pull(skb, iph->ihl*4))
> -		return skb;
> -	iph = ip_hdr(skb);
> -	len = ntohs(iph->tot_len);
> -	if (skb->len < len || len < (iph->ihl * 4))
> +
> +	len = ntohs(iph.tot_len);
> +	if (skb->len < len || len < (iph.ihl * 4))
>  		return skb;
>  
> -	if (ip_is_fragment(ip_hdr(skb))) {
> +	if (ip_is_fragment(&iph)) {
>  		skb = skb_share_check(skb, GFP_ATOMIC);
>  		if (skb) {
> +			if (!pskb_may_pull(skb, iph.ihl*4))
> +				return skb;

I moved this here but I have no idea what it does. Asking if we can pull
this here seems a bit pointless, and in the previous place it seemed
similarly pointless since we only use the static part of the IP header
and don't look at any options... So to me it seems this pskb_may_pull()
could just be removed, and that most likely means I'm messing with code
I don't understand :-)

johannes

^ permalink raw reply

* Re: [PATCH net-next 03/10] tipc: sk_recv_queue size check only for connectionless sockets
From: Jon Maloy @ 2012-12-07 22:30 UTC (permalink / raw)
  To: Neil Horman; +Cc: Paul Gortmaker, David Miller, netdev, Ying Xue
In-Reply-To: <20121207192030.GA30339@hmsreliant.think-freely.org>

On 12/07/2012 02:20 PM, Neil Horman wrote:
> On Fri, Dec 07, 2012 at 09:28:11AM -0500, Paul Gortmaker wrote:
>> From: Ying Xue <ying.xue@windriver.com>
>>
>> The sk_receive_queue limit control is currently performed for
>> all arriving messages, disregarding socket and message type.
>> But for connected sockets this check is redundant, since the protocol
>> flow control already makes queue overflow impossible.
>>
> Can you explain where that occurs?  

It happens in the functions port_dispatcher_sigh() and tipc_send(), 
among other places. Both are to be found in the file port.c, which 
was supposed to contain the 'generic' (i.e., API independent) part 
of the send/receive code.
Now that we have only one API left, the socket API, we are 
planning to merge the code in socket.c and port.c, and get rid of 
some code overhead.

The flow control in TIPC is message based, where the sender
requires to receive an explicit acknowledge message for each 
512 message the receiver reads to user space.
If the sender has more than 1024 messages outstanding without having
received an acknowledge he will be suspended or receive EAGAIN until 
he does.
The plan going forward is to replace this mechanism with a more 
standard looking byte based flow control, while maintaining 
backwards compatibility.

> I see where the tipc dispatch function calls
> sk_add_backlog, which checks the per socket recieve queue (regardless of weather
> the receiving socket is connection oriented or connectionless), but if the
> receiver doesn't call receive very often, This just adds a check against your
> global limit, doing nothing for your per-socket limits. 

OVERLOAD_LIMIT_BASE is tested against a per-socket message counter, so it _is_
our per-socket limit. In fact, TIPC connectionless overflow control currently 
is a kind of a hybrid, based on a message counter when the socket is not locked, 
and based on sk_rcv_queue's byte limit when a message has to be added to the 
backlog.
We are planning to fix this inconsistency too.

 In fact it seems to
> repeat the same check twice, as in the worst case of the incomming message being
> TIPC_LOW_IMPORTANCE, its just going to check that the global limit is exactly
> OVERLOAD_LIMIT_BASE/2 again.

Yes, you are right. The intention is that only the first test, 
if (unlikely(recv_q_len >= (OVERLOAD_LIMIT_BASE / 2)){..}
will be run for the vast majority of messages, since we must assume
that there is no overload most of the time.
An inelegant optimization, perhaps, but not logically wrong.

///jon

> 
> Neil
> 
> --
> To unsubscribe from this list: send the line "unsubscribe netdev" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> 

^ permalink raw reply

* Re: [PATCH net-next v3 1/3] virtio-net: separate fields of sending/receiving queue from virtnet_info
From: Stephen Hemminger @ 2012-12-07 23:11 UTC (permalink / raw)
  To: Jason Wang
  Cc: krkumar2, kvm, mst, netdev, linux-kernel, virtualization,
	bhutchings, jwhan, davem, shiyer
In-Reply-To: <1354899897-10423-2-git-send-email-jasowang@redhat.com>

Minor style issue reported by checkpatch which can be fixed after merge.
Although sizeof is actually an operator in C, it is considered correct
style to treat it as a function.


WARNING: sizeof hdr->hdr should be sizeof(hdr->hdr)
#293: FILE: drivers/net/virtio_net.c:395:
+	sg_set_buf(rq->sg, &hdr->hdr, sizeof hdr->hdr);

WARNING: sizeof hdr->mhdr should be sizeof(hdr->mhdr)
#552: FILE: drivers/net/virtio_net.c:641:
+		sg_set_buf(sq->sg, &hdr->mhdr, sizeof hdr->mhdr);

WARNING: sizeof hdr->hdr should be sizeof(hdr->hdr)
#555: FILE: drivers/net/virtio_net.c:643:
+		sg_set_buf(sq->sg, &hdr->hdr, sizeof hdr->hdr);

^ permalink raw reply

* Re: [PATCH net-next v3 0/3] Multiqueue support in virtio-net
From: Stephen Hemminger @ 2012-12-07 23:13 UTC (permalink / raw)
  To: David Miller
  Cc: krkumar2, kvm, mst, netdev, linux-kernel, virtualization,
	bhutchings, jwhan, shiyer
In-Reply-To: <20121207.153556.989159003987459765.davem@davemloft.net>

On Fri, 07 Dec 2012 15:35:56 -0500 (EST)
David Miller <davem@davemloft.net> wrote:

> From: Jason Wang <jasowang@redhat.com>
> Date: Sat,  8 Dec 2012 01:04:54 +0800
> 
> > This series is an update version (hope the final version) of multiqueue
> > (VIRTIO_NET_F_MQ) support in virtio-net driver. All previous comments were
> > addressed, the work were based on Krishna Kumar's work to let virtio-net use
> > multiple rx/tx queues to do the packets reception and transmission. Performance
> > test show the aggregate latency were increased greately but may get some
> > regression in small packet transmission. Due to this, multiqueue were disabled
> > by default. If user want to benefit form the multiqueue, ethtool -L could be
> > used to enable the feature.
> > 
> > Please review and comments.
> > 
> > A protype implementation of qemu-kvm support could by found in
> > git://github.com/jasowang/qemu-kvm-mq.git. To start a guest with two queues, you
> > could specify the queues parameters to both tap and virtio-net like:
> > 
> > ./qemu-kvm -netdev tap,queues=2,... -device virtio-net-pci,queues=2,...
> > 
> > then enable the multiqueue through ethtool by:
> > 
> > ethtool -L eth0 combined 2
> 
> It seems like most, if not all, of the feedback given for this series
> has been addressed by Jason.
> 
> Can I get some ACKs?

Other than the minor style nit in the first patch, I see no issues.
This is really needed by Virtual Routers.

Acked-by: Stephen Hemminger <shemminger@vyatta.com>

^ permalink raw reply

* Re: [PATCH v3 1/4] net: Add support for hardware-offloaded encapsulation
From: Jeff Kirsher @ 2012-12-07 23:15 UTC (permalink / raw)
  To: Joseph Gasparakis
  Cc: davem, shemminger, chrisw, gospo, netdev, linux-kernel, dmitry,
	saeed.bishara, Peter P Waskiewicz Jr, Alexander Duyck
In-Reply-To: <1354845419-22483-2-git-send-email-joseph.gasparakis@intel.com>

[-- Attachment #1: Type: text/plain, Size: 848 bytes --]

On 12/06/2012 05:56 PM, Joseph Gasparakis wrote:
> This patch adds support in the kernel for offloading in the NIC Tx and Rx
> checksumming for encapsulated packets (such as VXLAN and IP GRE).
>
> Signed-off-by: Joseph Gasparakis <joseph.gasparakis@intel.com>
> Signed-off-by: Peter P Waskiewicz Jr <peter.p.waskiewicz.jr@intel.com>
> Signed-off-by: Alexander Duyck <alexander.h.duyck@intel.com>
> ---
>  include/linux/ip.h        |    5 ++
>  include/linux/ipv6.h      |    5 ++
>  include/linux/netdevice.h |    2 +
>  include/linux/skbuff.h    |   90 ++++++++++++++++++++++++++++++++++++++++++++-
>  include/linux/tcp.h       |   10 +++++
>  include/linux/udp.h       |    5 ++
>  net/core/skbuff.c         |    9 ++++
>  7 files changed, 125 insertions(+), 1 deletions(-)
Acked-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>


[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 899 bytes --]

^ permalink raw reply

* Re: [PATCH v3 2/4] net: Handle encapsulated offloads before fragmentation or handing to lower dev
From: Jeff Kirsher @ 2012-12-07 23:15 UTC (permalink / raw)
  To: Joseph Gasparakis
  Cc: davem, shemminger, chrisw, gospo, Alexander Duyck, netdev,
	linux-kernel, dmitry, saeed.bishara
In-Reply-To: <1354845419-22483-3-git-send-email-joseph.gasparakis@intel.com>

[-- Attachment #1: Type: text/plain, Size: 667 bytes --]

On 12/06/2012 05:56 PM, Joseph Gasparakis wrote:
> From: Alexander Duyck <alexander.h.duyck@intel.com>
>
> This change allows the VXLAN to enable Tx checksum offloading even on
> devices that do not support encapsulated checksum offloads. The
> advantage to this is that it allows for the lower device to change due
> to routing table changes without impacting features on the VXLAN itself.
>
> Signed-off-by: Alexander Duyck <alexander.h.duyck@intel.com>
> ---
>  net/core/dev.c       |   15 +++++++++++++--
>  net/ipv4/ip_output.c |    4 ++++
>  2 files changed, 17 insertions(+), 2 deletions(-)
Acked-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>


[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 899 bytes --]

^ permalink raw reply

* Re: [PATCH v3 3/4] vxlan: capture inner headers during encapsulation
From: Jeff Kirsher @ 2012-12-07 23:16 UTC (permalink / raw)
  To: Joseph Gasparakis
  Cc: davem, shemminger, chrisw, gospo, netdev, linux-kernel, dmitry,
	saeed.bishara, Peter P Waskiewicz Jr, Alexander Duyck
In-Reply-To: <1354845419-22483-4-git-send-email-joseph.gasparakis@intel.com>

[-- Attachment #1: Type: text/plain, Size: 585 bytes --]

On 12/06/2012 05:56 PM, Joseph Gasparakis wrote:
> Allow VXLAN to make use of Tx checksum offloading and Tx scatter-gather.
> The advantage to these two changes is that it also allows the VXLAN to
> make use of GSO.
>
> Signed-off-by: Joseph Gasparakis <joseph.gasparakis@intel.com>
> Signed-off-by: Peter P Waskiewicz Jr <peter.p.waskiewicz.jr@intel.com>
> Signed-off-by: Alexander Duyck <alexander.h.duyck@intel.com>
> ---
>  drivers/net/vxlan.c |   10 +++++++++-
>  1 files changed, 9 insertions(+), 1 deletions(-)
Acked-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>


[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 899 bytes --]

^ permalink raw reply

* Experimental network protocol produces "dropping frag"
From: Templin, Fred L @ 2012-12-07 23:41 UTC (permalink / raw)
  To: netdev@vger.kernel.org
In-Reply-To: <1354897227.2707.10.camel@bwh-desktop.uk.solarflarecom.com>

Hello,

Under the 3.2.17 kernel, I am working with an experimental
tunneling protocol. I am using ip-proto-253 for the experiment,
and have coded the tunnel code in as a module under net/ipv4
by cloning net/ipv4/ip_gre.c. I have been testing the module
by using modprobe to reload the module without rebooting the
machines.

I was successfully able to send and receive packets via the
tunnel and verified that both directions were working using
tcpdump and wireshark which showed ip-proto-253 packets going
in both directions. Since then, however, I recompiled and
reinistalled the entire kernel and rebooted the machines.

Now, I can still send my ip-proto-253 packets, but on the
receiving side I see dmesg output that says:

  "dropping frag"

for every packet received. The input packets appear to be
accounted for when I type "ifconfig eth0" (i.e., the input
packet count is incrementing) but printk's from my tunnel
driver code are not showing up when I type dmesg and
"ifconfig tun0" on the tunnel interface does not show any
input packets. It is as if the receiver has "forgotten"
how to deal with ip-proto-253.

Any ideas on how I may have shot myself in the foot, and
ways to recover?

Thanks - Fred
fred.l.templin@boeing.com

^ permalink raw reply

* [PATCH v4 net-next 0/5] tunneling: Add support for hardware-offloaded encapsulation
From: Joseph Gasparakis @ 2012-12-08  0:14 UTC (permalink / raw)
  To: davem, shemminger, chrisw, gospo
  Cc: Joseph Gasparakis, netdev, linux-kernel, dmitry, saeed.bishara,
	bhutchings

The series contains updates to add in the NIC Rx and Tx checksumming support
for encapsulated packets.

The sk_buff needs to somehow have information of the inner packet, and adding
three fields for the inner mac, network and transport headers was the prefered
approach. 

Not adding these fields would mean that the drivers would need to parse the
sk_buff data in hot-path, having a negative impact in the performance.

Adding in sk_buff a pointer to the skbuff of the inner packet made sense, but
would be a complicated change as assumptions needed to be made with regards to
helper functions such as skb_clone() skb_copy(). Also code for the existing
encapsulation protocols (such as VXLAN and IP GRE) had to be reworked, so the
decision was to have the simple approach of adding these three fields.

v2 Makes sure that checksumming for IP GRE does not take place if the offload
   flag is set in the skb's netdev features

v3 Fixes issues picked up by the community in v2 and is intended to provide
   ability to demo vxlan Tx offloading with Intel's ixgbe. As part of this, 
   it provides an RFC patch for ixgbe to take advantage of the offloading
   mechanism

   Now it is possible to create a vxlan interface like this:
    #ip link add vxlan0 type vxlan id 40 ttl 10 group 239.1.1.1 dev eth0

   Then turn on/off the encapsulation offload mechanism by doing:
    #ethtool -K eth0 tx-checksum-ip-generic on

   In v3 ipgre work got paused (and therefore patches not included) and I will
   come back to it when vxlan is accepted by the community.

v4 Added more detailed commit logs and code comments as per request in v3
   Also now the Rx offload encapsulation patch is included in the series.

^ permalink raw reply

* [PATCH v4 1/5] net: Add support for hardware-offloaded encapsulation
From: Joseph Gasparakis @ 2012-12-08  0:14 UTC (permalink / raw)
  To: davem, shemminger, chrisw, gospo
  Cc: Joseph Gasparakis, netdev, linux-kernel, dmitry, saeed.bishara,
	bhutchings, Peter P Waskiewicz Jr, Alexander Duyck
In-Reply-To: <1354925658-24115-1-git-send-email-joseph.gasparakis@intel.com>

This patch adds support in the kernel for offloading in the NIC Tx and Rx
checksumming for encapsulated packets (such as VXLAN and IP GRE).

For Tx encapsulation offload, the driver will need to set the right bits
in netdev->hw_enc_features. The protocol driver will have to set the
skb->encapsulation bit and populate the inner headers, so the NIC driver will
use those inner headers to calculate the csum in hardware.

For Rx encapsulation offload, the driver will need to set again the
skb->encapsulation flag and the skb->ip_csum to CHECKSUM_UNNECESSARY.
In that case the protocol driver should push the decapsulated packet up
to the stack, again with CHECKSUM_UNNECESSARY. In ether case, the protocol
driver should set the skb->encapsulation flag back to zero. Fianlly the
protocol driver should have NETIF_F_RXCSUM flag set in its features.

Signed-off-by: Joseph Gasparakis <joseph.gasparakis@intel.com>
Signed-off-by: Peter P Waskiewicz Jr <peter.p.waskiewicz.jr@intel.com>
Signed-off-by: Alexander Duyck <alexander.h.duyck@intel.com>
---
 include/linux/ip.h        |  5 +++
 include/linux/ipv6.h      |  5 +++
 include/linux/netdevice.h |  6 +++
 include/linux/skbuff.h    | 95 ++++++++++++++++++++++++++++++++++++++++++++++-
 include/linux/tcp.h       | 10 +++++
 include/linux/udp.h       |  5 +++
 net/core/skbuff.c         |  9 +++++
 7 files changed, 134 insertions(+), 1 deletion(-)

diff --git a/include/linux/ip.h b/include/linux/ip.h
index 58b82a2..492bc65 100644
--- a/include/linux/ip.h
+++ b/include/linux/ip.h
@@ -25,6 +25,11 @@ static inline struct iphdr *ip_hdr(const struct sk_buff *skb)
 	return (struct iphdr *)skb_network_header(skb);
 }
 
+static inline struct iphdr *inner_ip_hdr(const struct sk_buff *skb)
+{
+	return (struct iphdr *)skb_inner_network_header(skb);
+}
+
 static inline struct iphdr *ipip_hdr(const struct sk_buff *skb)
 {
 	return (struct iphdr *)skb_transport_header(skb);
diff --git a/include/linux/ipv6.h b/include/linux/ipv6.h
index 12729e9..faed1e3 100644
--- a/include/linux/ipv6.h
+++ b/include/linux/ipv6.h
@@ -67,6 +67,11 @@ static inline struct ipv6hdr *ipv6_hdr(const struct sk_buff *skb)
 	return (struct ipv6hdr *)skb_network_header(skb);
 }
 
+static inline struct ipv6hdr *inner_ipv6_hdr(const struct sk_buff *skb)
+{
+	return (struct ipv6hdr *)skb_inner_network_header(skb);
+}
+
 static inline struct ipv6hdr *ipipv6_hdr(const struct sk_buff *skb)
 {
 	return (struct ipv6hdr *)skb_transport_header(skb);
diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
index 18c5dc9..c6a14d4 100644
--- a/include/linux/netdevice.h
+++ b/include/linux/netdevice.h
@@ -1063,6 +1063,12 @@ struct net_device {
 	netdev_features_t	wanted_features;
 	/* mask of features inheritable by VLAN devices */
 	netdev_features_t	vlan_features;
+	/* mask of features inherited by encapsulating devices
+	 * This field indicates what encapsulation offloads
+	 * the hardware is capable of doing, and drivers will
+	 * need to set them appropriately.
+	 */
+	netdev_features_t	hw_enc_features;
 
 	/* Interface index. Unique device identifier	*/
 	int			ifindex;
diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h
index f2af494..320e976 100644
--- a/include/linux/skbuff.h
+++ b/include/linux/skbuff.h
@@ -376,6 +376,8 @@ typedef unsigned char *sk_buff_data_t;
  *	@mark: Generic packet mark
  *	@dropcount: total number of sk_receive_queue overflows
  *	@vlan_tci: vlan tag control information
+ *	@inner_transport_header: Inner transport layer header (encapsulation)
+ *	@inner_network_header: Network layer header (encapsulation)
  *	@transport_header: Transport layer header
  *	@network_header: Network layer header
  *	@mac_header: Link layer header
@@ -471,7 +473,13 @@ struct sk_buff {
 	__u8			wifi_acked:1;
 	__u8			no_fcs:1;
 	__u8			head_frag:1;
-	/* 8/10 bit hole (depending on ndisc_nodetype presence) */
+	/* Encapsulation protocol and NIC drivers should use
+	 * this flag to indicate to each other if the skb contains
+	 * encapsulated packet or not and maybe use the inner packet
+	 * headers if needed
+	 */
+	__u8			encapsulation:1;
+	/* 7/9 bit hole (depending on ndisc_nodetype presence) */
 	kmemcheck_bitfield_end(flags2);
 
 #ifdef CONFIG_NET_DMA
@@ -486,6 +494,8 @@ struct sk_buff {
 		__u32		avail_size;
 	};
 
+	sk_buff_data_t		inner_transport_header;
+	sk_buff_data_t		inner_network_header;
 	sk_buff_data_t		transport_header;
 	sk_buff_data_t		network_header;
 	sk_buff_data_t		mac_header;
@@ -1435,12 +1445,53 @@ static inline void skb_reserve(struct sk_buff *skb, int len)
 	skb->tail += len;
 }
 
+static inline void skb_reset_inner_headers(struct sk_buff *skb)
+{
+	skb->inner_network_header = skb->network_header;
+	skb->inner_transport_header = skb->transport_header;
+}
+
 static inline void skb_reset_mac_len(struct sk_buff *skb)
 {
 	skb->mac_len = skb->network_header - skb->mac_header;
 }
 
 #ifdef NET_SKBUFF_DATA_USES_OFFSET
+static inline unsigned char *skb_inner_transport_header(const struct sk_buff
+							*skb)
+{
+	return skb->head + skb->inner_transport_header;
+}
+
+static inline void skb_reset_inner_transport_header(struct sk_buff *skb)
+{
+	skb->inner_transport_header = skb->data - skb->head;
+}
+
+static inline void skb_set_inner_transport_header(struct sk_buff *skb,
+						   const int offset)
+{
+	skb_reset_inner_transport_header(skb);
+	skb->inner_transport_header += offset;
+}
+
+static inline unsigned char *skb_inner_network_header(const struct sk_buff *skb)
+{
+	return skb->head + skb->inner_network_header;
+}
+
+static inline void skb_reset_inner_network_header(struct sk_buff *skb)
+{
+	skb->inner_network_header = skb->data - skb->head;
+}
+
+static inline void skb_set_inner_network_header(struct sk_buff *skb,
+						const int offset)
+{
+	skb_reset_inner_network_header(skb);
+	skb->inner_network_header += offset;
+}
+
 static inline unsigned char *skb_transport_header(const struct sk_buff *skb)
 {
 	return skb->head + skb->transport_header;
@@ -1496,6 +1547,38 @@ static inline void skb_set_mac_header(struct sk_buff *skb, const int offset)
 }
 
 #else /* NET_SKBUFF_DATA_USES_OFFSET */
+static inline unsigned char *skb_inner_transport_header(const struct sk_buff
+							*skb)
+{
+	return skb->inner_transport_header;
+}
+
+static inline void skb_reset_inner_transport_header(struct sk_buff *skb)
+{
+	skb->inner_transport_header = skb->data;
+}
+
+static inline void skb_set_inner_transport_header(struct sk_buff *skb,
+						   const int offset)
+{
+	skb->inner_transport_header = skb->data + offset;
+}
+
+static inline unsigned char *skb_inner_network_header(const struct sk_buff *skb)
+{
+	return skb->inner_network_header;
+}
+
+static inline void skb_reset_inner_network_header(struct sk_buff *skb)
+{
+	skb->inner_network_header = skb->data;
+}
+
+static inline void skb_set_inner_network_header(struct sk_buff *skb,
+						const int offset)
+{
+	skb->inner_network_header = skb->data + offset;
+}
 
 static inline unsigned char *skb_transport_header(const struct sk_buff *skb)
 {
@@ -1574,11 +1657,21 @@ static inline u32 skb_network_header_len(const struct sk_buff *skb)
 	return skb->transport_header - skb->network_header;
 }
 
+static inline u32 skb_inner_network_header_len(const struct sk_buff *skb)
+{
+	return skb->inner_transport_header - skb->inner_network_header;
+}
+
 static inline int skb_network_offset(const struct sk_buff *skb)
 {
 	return skb_network_header(skb) - skb->data;
 }
 
+static inline int skb_inner_network_offset(const struct sk_buff *skb)
+{
+	return skb_inner_network_header(skb) - skb->data;
+}
+
 static inline int pskb_network_may_pull(struct sk_buff *skb, unsigned int len)
 {
 	return pskb_may_pull(skb, skb_network_offset(skb) + len);
diff --git a/include/linux/tcp.h b/include/linux/tcp.h
index 60b7aac..4e1d228 100644
--- a/include/linux/tcp.h
+++ b/include/linux/tcp.h
@@ -35,6 +35,16 @@ static inline unsigned int tcp_hdrlen(const struct sk_buff *skb)
 	return tcp_hdr(skb)->doff * 4;
 }
 
+static inline struct tcphdr *inner_tcp_hdr(const struct sk_buff *skb)
+{
+	return (struct tcphdr *)skb_inner_transport_header(skb);
+}
+
+static inline unsigned int inner_tcp_hdrlen(const struct sk_buff *skb)
+{
+	return inner_tcp_hdr(skb)->doff * 4;
+}
+
 static inline unsigned int tcp_optlen(const struct sk_buff *skb)
 {
 	return (tcp_hdr(skb)->doff - 5) * 4;
diff --git a/include/linux/udp.h b/include/linux/udp.h
index 0b67d77..9d81de1 100644
--- a/include/linux/udp.h
+++ b/include/linux/udp.h
@@ -27,6 +27,11 @@ static inline struct udphdr *udp_hdr(const struct sk_buff *skb)
 	return (struct udphdr *)skb_transport_header(skb);
 }
 
+static inline struct udphdr *inner_udp_hdr(const struct sk_buff *skb)
+{
+	return (struct udphdr *)skb_inner_transport_header(skb);
+}
+
 #define UDP_HTABLE_SIZE_MIN		(CONFIG_BASE_SMALL ? 128 : 256)
 
 static inline int udp_hashfn(struct net *net, unsigned num, unsigned mask)
diff --git a/net/core/skbuff.c b/net/core/skbuff.c
index 880722e2..ccbabf5 100644
--- a/net/core/skbuff.c
+++ b/net/core/skbuff.c
@@ -682,11 +682,14 @@ static void __copy_skb_header(struct sk_buff *new, const struct sk_buff *old)
 	new->transport_header	= old->transport_header;
 	new->network_header	= old->network_header;
 	new->mac_header		= old->mac_header;
+	new->inner_transport_header = old->inner_transport_header;
+	new->inner_network_header = old->inner_transport_header;
 	skb_dst_copy(new, old);
 	new->rxhash		= old->rxhash;
 	new->ooo_okay		= old->ooo_okay;
 	new->l4_rxhash		= old->l4_rxhash;
 	new->no_fcs		= old->no_fcs;
+	new->encapsulation	= old->encapsulation;
 #ifdef CONFIG_XFRM
 	new->sp			= secpath_get(old->sp);
 #endif
@@ -892,6 +895,8 @@ static void copy_skb_header(struct sk_buff *new, const struct sk_buff *old)
 	new->network_header   += offset;
 	if (skb_mac_header_was_set(new))
 		new->mac_header	      += offset;
+	new->inner_transport_header += offset;
+	new->inner_network_header   += offset;
 #endif
 	skb_shinfo(new)->gso_size = skb_shinfo(old)->gso_size;
 	skb_shinfo(new)->gso_segs = skb_shinfo(old)->gso_segs;
@@ -1089,6 +1094,8 @@ int pskb_expand_head(struct sk_buff *skb, int nhead, int ntail,
 	skb->network_header   += off;
 	if (skb_mac_header_was_set(skb))
 		skb->mac_header += off;
+	skb->inner_transport_header += off;
+	skb->inner_network_header += off;
 	/* Only adjust this if it actually is csum_start rather than csum */
 	if (skb->ip_summed == CHECKSUM_PARTIAL)
 		skb->csum_start += nhead;
@@ -1188,6 +1195,8 @@ struct sk_buff *skb_copy_expand(const struct sk_buff *skb,
 	n->network_header   += off;
 	if (skb_mac_header_was_set(skb))
 		n->mac_header += off;
+	n->inner_transport_header += off;
+	n->inner_network_header	   += off;
 #endif
 
 	return n;
-- 
1.7.11.7

^ permalink raw reply related

* [PATCH v4 2/5] net: Handle encapsulated offloads before fragmentation or handing to lower dev
From: Joseph Gasparakis @ 2012-12-08  0:14 UTC (permalink / raw)
  To: davem, shemminger, chrisw, gospo
  Cc: Alexander Duyck, netdev, linux-kernel, dmitry, saeed.bishara,
	bhutchings
In-Reply-To: <1354925658-24115-1-git-send-email-joseph.gasparakis@intel.com>

From: Alexander Duyck <alexander.h.duyck@intel.com>

This change allows the VXLAN to enable Tx checksum offloading even on
devices that do not support encapsulated checksum offloads. The
advantage to this is that it allows for the lower device to change due
to routing table changes without impacting features on the VXLAN itself.

Signed-off-by: Alexander Duyck <alexander.h.duyck@intel.com>
---
 net/core/dev.c       | 15 +++++++++++++--
 net/ipv4/ip_output.c |  4 ++++
 2 files changed, 17 insertions(+), 2 deletions(-)

diff --git a/net/core/dev.c b/net/core/dev.c
index 307142a..a4c4a1b 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -2324,6 +2324,13 @@ int dev_hard_start_xmit(struct sk_buff *skb, struct net_device *dev,
 			skb->vlan_tci = 0;
 		}
 
+		/* If encapsulation offload request, verify we are testing
+		 * hardware encapsulation features instead of standard
+		 * features for the netdev
+		 */
+		if (skb->encapsulation)
+			features &= dev->hw_enc_features;
+
 		if (netif_needs_gso(skb, features)) {
 			if (unlikely(dev_gso_segment(skb, features)))
 				goto out_kfree_skb;
@@ -2339,8 +2346,12 @@ int dev_hard_start_xmit(struct sk_buff *skb, struct net_device *dev,
 			 * checksumming here.
 			 */
 			if (skb->ip_summed == CHECKSUM_PARTIAL) {
-				skb_set_transport_header(skb,
-					skb_checksum_start_offset(skb));
+				if (skb->encapsulation)
+					skb_set_inner_transport_header(skb,
+						skb_checksum_start_offset(skb));
+				else
+					skb_set_transport_header(skb,
+						skb_checksum_start_offset(skb));
 				if (!(features & NETIF_F_ALL_CSUM) &&
 				     skb_checksum_help(skb))
 					goto out_kfree_skb;
diff --git a/net/ipv4/ip_output.c b/net/ipv4/ip_output.c
index 6537a40..3e98ed2 100644
--- a/net/ipv4/ip_output.c
+++ b/net/ipv4/ip_output.c
@@ -595,6 +595,10 @@ slow_path_clean:
 	}
 
 slow_path:
+	/* for offloaded checksums cleanup checksum before fragmentation */
+	if ((skb->ip_summed == CHECKSUM_PARTIAL) && skb_checksum_help(skb))
+		goto fail;
+
 	left = skb->len - hlen;		/* Space per frame */
 	ptr = hlen;		/* Where to start from */
 
-- 
1.7.11.7

^ permalink raw reply related

* [PATCH v4 3/5] vxlan: capture inner headers during encapsulation
From: Joseph Gasparakis @ 2012-12-08  0:14 UTC (permalink / raw)
  To: davem, shemminger, chrisw, gospo
  Cc: Joseph Gasparakis, netdev, linux-kernel, dmitry, saeed.bishara,
	bhutchings, Peter P Waskiewicz Jr, Alexander Duyck
In-Reply-To: <1354925658-24115-1-git-send-email-joseph.gasparakis@intel.com>

Allow VXLAN to make use of Tx checksum offloading and Tx scatter-gather.
The advantage to these two changes is that it also allows the VXLAN to
make use of GSO.

Signed-off-by: Joseph Gasparakis <joseph.gasparakis@intel.com>
Signed-off-by: Peter P Waskiewicz Jr <peter.p.waskiewicz.jr@intel.com>
Signed-off-by: Alexander Duyck <alexander.h.duyck@intel.com>
---
 drivers/net/vxlan.c | 10 +++++++++-
 1 file changed, 9 insertions(+), 1 deletion(-)

diff --git a/drivers/net/vxlan.c b/drivers/net/vxlan.c
index ce77b8b..88b31f2 100644
--- a/drivers/net/vxlan.c
+++ b/drivers/net/vxlan.c
@@ -876,6 +876,11 @@ static netdev_tx_t vxlan_xmit(struct sk_buff *skb, struct net_device *dev)
 		goto drop;
 	}
 
+	if (!skb->encapsulation) {
+		skb_reset_inner_headers(skb);
+		skb->encapsulation = 1;
+	}
+
 	/* Need space for new headers (invalidates iph ptr) */
 	if (skb_cow_head(skb, VXLAN_HEADROOM))
 		goto drop;
@@ -947,7 +952,8 @@ static netdev_tx_t vxlan_xmit(struct sk_buff *skb, struct net_device *dev)
 	vxlan_set_owner(dev, skb);
 
 	/* See iptunnel_xmit() */
-	skb->ip_summed = CHECKSUM_NONE;
+	if (skb->ip_summed != CHECKSUM_PARTIAL)
+		skb->ip_summed = CHECKSUM_NONE;
 	ip_select_ident(iph, &rt->dst, NULL);
 
 	err = ip_local_out(skb);
@@ -1168,6 +1174,8 @@ static void vxlan_setup(struct net_device *dev)
 	dev->tx_queue_len = 0;
 	dev->features	|= NETIF_F_LLTX;
 	dev->features	|= NETIF_F_NETNS_LOCAL;
+	dev->features	|= NETIF_F_SG | NETIF_F_HW_CSUM;
+	dev->hw_features |= NETIF_F_SG | NETIF_F_HW_CSUM;
 	dev->priv_flags	&= ~IFF_XMIT_DST_RELEASE;
 
 	spin_lock_init(&vxlan->hash_lock);
-- 
1.7.11.7

^ permalink raw reply related

* [PATCH v4 4/5] ixgbe: Adding tx encapsulation capability
From: Joseph Gasparakis @ 2012-12-08  0:14 UTC (permalink / raw)
  To: davem, shemminger, chrisw, gospo
  Cc: Joseph Gasparakis, netdev, linux-kernel, dmitry, saeed.bishara,
	bhutchings, Alexander Duyck
In-Reply-To: <1354925658-24115-1-git-send-email-joseph.gasparakis@intel.com>

This patch allows ixgbe to recognize encapsulated packets and do the tx
checksum offload in hardware. This patch is only for demonstration
purposes and should not be applied.

Signed-off-by: Joseph Gasparakis <joseph.gasparakis@intel.com>
Signed-off-by: Alexander Duyck <alexander.h.duyck@intel.com>
---
 drivers/net/ethernet/intel/ixgbe/ixgbe_main.c | 46 +++++++++++++++++++++------
 1 file changed, 37 insertions(+), 9 deletions(-)

diff --git a/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c b/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c
index fb165b6..62a7d6e 100644
--- a/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c
+++ b/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c
@@ -5972,17 +5972,42 @@ static void ixgbe_tx_csum(struct ixgbe_ring *tx_ring,
 			if (!(first->tx_flags & IXGBE_TX_FLAGS_TXSW))
 				return;
 		}
+		vlan_macip_lens |= skb_network_offset(skb)
+				   << IXGBE_ADVTXD_MACLEN_SHIFT;
 	} else {
 		u8 l4_hdr = 0;
-		switch (first->protocol) {
-		case __constant_htons(ETH_P_IP):
-			vlan_macip_lens |= skb_network_header_len(skb);
+		union {
+			struct iphdr *ipv4;
+			struct ipv6hdr *ipv6;
+			u8 *raw;
+		} network_hdr;
+		union {
+			struct tcphdr *tcphdr;
+			u8 *raw;
+		} transport_hdr;
+
+		if (skb->encapsulation) {
+			network_hdr.raw = skb_inner_network_header(skb);
+			transport_hdr.raw = skb_inner_transport_header(skb);
+			vlan_macip_lens |= skb_inner_network_offset(skb) <<
+					   IXGBE_ADVTXD_MACLEN_SHIFT;
+		} else {
+			network_hdr.raw = skb_network_header(skb);
+			transport_hdr.raw = skb_transport_header(skb);
+			vlan_macip_lens |= skb_network_offset(skb) <<
+					   IXGBE_ADVTXD_MACLEN_SHIFT;
+		}
+
+		/* use first 4 bits to determine IP version */
+		switch (network_hdr.ipv4->version) {
+		case 4:
+			vlan_macip_lens |= transport_hdr.raw - network_hdr.raw;
 			type_tucmd |= IXGBE_ADVTXD_TUCMD_IPV4;
-			l4_hdr = ip_hdr(skb)->protocol;
+			l4_hdr = network_hdr.ipv4->protocol;
 			break;
-		case __constant_htons(ETH_P_IPV6):
-			vlan_macip_lens |= skb_network_header_len(skb);
-			l4_hdr = ipv6_hdr(skb)->nexthdr;
+		case 6:
+			vlan_macip_lens |= transport_hdr.raw - network_hdr.raw;
+			l4_hdr = network_hdr.ipv6->nexthdr;
 			break;
 		default:
 			if (unlikely(net_ratelimit())) {
@@ -5996,7 +6021,7 @@ static void ixgbe_tx_csum(struct ixgbe_ring *tx_ring,
 		switch (l4_hdr) {
 		case IPPROTO_TCP:
 			type_tucmd |= IXGBE_ADVTXD_TUCMD_L4T_TCP;
-			mss_l4len_idx = tcp_hdrlen(skb) <<
+			mss_l4len_idx = (transport_hdr.tcphdr->doff * 4) <<
 					IXGBE_ADVTXD_L4LEN_SHIFT;
 			break;
 		case IPPROTO_SCTP:
@@ -6022,7 +6047,6 @@ static void ixgbe_tx_csum(struct ixgbe_ring *tx_ring,
 	}
 
 	/* vlan_macip_lens: MACLEN, VLAN tag */
-	vlan_macip_lens |= skb_network_offset(skb) << IXGBE_ADVTXD_MACLEN_SHIFT;
 	vlan_macip_lens |= first->tx_flags & IXGBE_TX_FLAGS_VLAN_MASK;
 
 	ixgbe_tx_ctxtdesc(tx_ring, vlan_macip_lens, 0,
@@ -7383,6 +7407,10 @@ static int ixgbe_probe(struct pci_dev *pdev,
 
 	netdev->hw_features = netdev->features;
 
+	netdev->hw_enc_features = NETIF_F_IP_CSUM |
+				  NETIF_F_IPV6_CSUM |
+				  NETIF_F_SG;
+
 	switch (adapter->hw.mac.type) {
 	case ixgbe_mac_82599EB:
 	case ixgbe_mac_X540:
-- 
1.7.11.7

^ permalink raw reply related

* [PATCH v4 5/5] vxlan: Add capability of Rx checksum offload for inner packet
From: Joseph Gasparakis @ 2012-12-08  0:14 UTC (permalink / raw)
  To: davem, shemminger, chrisw, gospo
  Cc: Joseph Gasparakis, netdev, linux-kernel, dmitry, saeed.bishara,
	bhutchings
In-Reply-To: <1354925658-24115-1-git-send-email-joseph.gasparakis@intel.com>

This patch adds capability in vxlan to identify received
checksummed inner packets and signal them to the upper layers of
the stack. The driver needs to set the skb->encapsulation bit
and also set the skb->ip_summed to CHECKSUM_UNNECESSARY.

Signed-off-by: Joseph Gasparakis <joseph.gasparakis@intel.com>
---
 drivers/net/vxlan.c | 16 ++++++++++++++--
 1 file changed, 14 insertions(+), 2 deletions(-)

diff --git a/drivers/net/vxlan.c b/drivers/net/vxlan.c
index 88b31f2..3b3fdf6 100644
--- a/drivers/net/vxlan.c
+++ b/drivers/net/vxlan.c
@@ -607,7 +607,17 @@ static int vxlan_udp_encap_recv(struct sock *sk, struct sk_buff *skb)
 
 	__skb_tunnel_rx(skb, vxlan->dev);
 	skb_reset_network_header(skb);
-	skb->ip_summed = CHECKSUM_NONE;
+
+	/* If the NIC driver gave us an encapsulated packet with
+	 * CHECKSUM_UNNECESSARY and Rx checksum feature is enabled,
+	 * leave the CHECKSUM_UNNECESSARY, the device checksummed it
+	 * for us. Otherwise force the upper layers to verify it.
+	 */
+	if (skb->ip_summed != CHECKSUM_UNNECESSARY || !skb->encapsulation ||
+	    !(vxlan->dev->features & NETIF_F_RXCSUM))
+		skb->ip_summed = CHECKSUM_NONE;
+
+	skb->encapsulation = 0;
 
 	err = IP_ECN_decapsulate(oip, skb);
 	if (unlikely(err)) {
@@ -1175,7 +1185,9 @@ static void vxlan_setup(struct net_device *dev)
 	dev->features	|= NETIF_F_LLTX;
 	dev->features	|= NETIF_F_NETNS_LOCAL;
 	dev->features	|= NETIF_F_SG | NETIF_F_HW_CSUM;
-	dev->hw_features |= NETIF_F_SG | NETIF_F_HW_CSUM;
+	dev->features   |= NETIF_F_RXCSUM;
+
+	dev->hw_features |= NETIF_F_SG | NETIF_F_HW_CSUM | NETIF_F_RXCSUM;
 	dev->priv_flags	&= ~IFF_XMIT_DST_RELEASE;
 
 	spin_lock_init(&vxlan->hash_lock);
-- 
1.7.11.7

^ permalink raw reply related

* loan offer
From: SEC Capitals Loan @ 2012-12-08  0:49 UTC (permalink / raw)



Loan Offer at 3%, Feel Free to REPLY back to us for more info

^ permalink raw reply

* [PATCH v2 net-next 0/9] tipc: more updates for the v3.8 content
From: Paul Gortmaker @ 2012-12-08  1:19 UTC (permalink / raw)
  To: David Miller; +Cc: netdev, Jon Maloy, Paul Gortmaker

Changes since v1:
	-get rid of essentially unused variable spotted by
	 Neil Horman (patch #2)

	-drop patch #3; defer it for 3.9 content, so Neil,
	 Jon and Ying can discuss its specifics at their
	 leisure while net-next is closed.  (It had no
	 direct dependencies to the rest of the series, and
	 was just an optimization)

	-fix indentation of accept() code directly in place
	 vs. forking it out to a separate function (was patch
	 #10, now patch #9).

Rebuilt and re-ran tests just to ensure nothing odd happened.

Original v1 text follows, updated pull information follows that.

           ---------

Here is another batch of TIPC changes.  The most interesting
thing is probably the non-blocking socket connect - I'm told
there were several users looking forward to seeing this.

Also there were some resource limitation changes that had
the right intent back in 2005, but were now apparently causing
needless limitations to people's real use cases; those have
been relaxed/removed.

There is a lockdep splat fix, but no need for a stable backport,
since it is virtually impossible to trigger in mainline; you
have to essentially modify code to force the probabilities
in your favour to see it.

The rest can largely be categorized as general cleanup of things
seen in the process of getting the above changes done.

Tested between 64 and 32 bit nodes with the test suite.  I've
also compile tested all the individual commits on the chain.

I'd originally figured on this queue not being ready for 3.8, but
the extended stabilization window of 3.7 has changed that.  On
the other hand, this can still be 3.9 material, if that simply
works better for folks - no problem for me to defer it to 2013.
If anyone spots any problems then I'll definitely defer it,
rather than rush a last minute respin.

Thanks,
Paul.
---

The following changes since commit b93196dc5af7729ff7cc50d3d322ab1a364aa14f:

  net: fix some compiler warning in net/core/neighbour.c (2012-12-05 21:50:37 -0500)

are available in the git repository at:

  git://git.kernel.org/pub/scm/linux/kernel/git/paulg/linux.git tipc_net-next_v2

for you to fetch changes up to 0fef8f205f6f4cdff1869e54e44f317a79902785:

  tipc: refactor accept() code for improved readability (2012-12-07 17:23:24 -0500)

----------------------------------------------------------------
Erik Hugne (1):
      tipc: remove obsolete flush of stale reassembly buffer

Jon Maloy (1):
      tipc: change sk_receive_queue upper limit

Paul Gortmaker (2):
      tipc: standardize across connect/disconnect function naming
      tipc: refactor accept() code for improved readability

Ying Xue (5):
      tipc: eliminate aggregate sk_receive_queue limit
      tipc: consolidate connection-oriented message reception in one function
      tipc: introduce non-blocking socket connect
      tipc: eliminate connection setup for implied connect in recv_msg()
      tipc: add lock nesting notation to quiet lockdep warning

 net/tipc/link.c   |  44 -------
 net/tipc/port.c   |  32 +++--
 net/tipc/port.h   |   6 +-
 net/tipc/socket.c | 353 ++++++++++++++++++++++++++++++------------------------
 net/tipc/subscr.c |   2 +-
 5 files changed, 225 insertions(+), 212 deletions(-)

^ permalink raw reply

* [PATCH net-next 1/9] tipc: remove obsolete flush of stale reassembly buffer
From: Paul Gortmaker @ 2012-12-08  1:19 UTC (permalink / raw)
  To: David Miller; +Cc: netdev, Jon Maloy, Erik Hugne, Paul Gortmaker
In-Reply-To: <1354929558-16948-1-git-send-email-paul.gortmaker@windriver.com>

From: Erik Hugne <erik.hugne@ericsson.com>

Each link instance has a periodic job checking if there is a stale
ongoing message reassembly associated to the link. If no new
fragment has been received during the last 4*[link_tolerance] period,
it is assumed the missing fragment will never arrive. As a consequence,
the reassembly buffer is discarded, and a gap in the message sequence
occurs.

This assumption is wrong. After we abandoned our ambition to develop
packet routing for multi-cluster networks, only single-hop packet
transfer remains as an option. For those, all packets are guaranteed
to be delivered in sequence to the defragmentation layer. Any failure
to achieve sequenced delivery will eventually lead to link reset, and
the reassembly buffer will be flushed anyway.

So we just remove this periodic check, which is now obsolete.

Signed-off-by: Erik Hugne <erik.hugne@ericsson.com>
Acked-by: Ying Xue <ying.xue@windriver.com>
Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
[PG: also delete get/inc_timer count, since they are now unused]
Signed-off-by: Paul Gortmaker <paul.gortmaker@windriver.com>
---
 net/tipc/link.c | 44 --------------------------------------------
 1 file changed, 44 deletions(-)

diff --git a/net/tipc/link.c b/net/tipc/link.c
index 87bf5aa..daa6080 100644
--- a/net/tipc/link.c
+++ b/net/tipc/link.c
@@ -97,7 +97,6 @@ static int  link_send_sections_long(struct tipc_port *sender,
 				    struct iovec const *msg_sect,
 				    u32 num_sect, unsigned int total_len,
 				    u32 destnode);
-static void link_check_defragm_bufs(struct tipc_link *l_ptr);
 static void link_state_event(struct tipc_link *l_ptr, u32 event);
 static void link_reset_statistics(struct tipc_link *l_ptr);
 static void link_print(struct tipc_link *l_ptr, const char *str);
@@ -271,7 +270,6 @@ static void link_timeout(struct tipc_link *l_ptr)
 	}
 
 	/* do all other link processing performed on a periodic basis */
-	link_check_defragm_bufs(l_ptr);
 
 	link_state_event(l_ptr, TIMEOUT_EVT);
 
@@ -2497,16 +2495,6 @@ static void set_expected_frags(struct sk_buff *buf, u32 exp)
 	msg_set_bcast_ack(buf_msg(buf), exp);
 }
 
-static u32 get_timer_cnt(struct sk_buff *buf)
-{
-	return msg_reroute_cnt(buf_msg(buf));
-}
-
-static void incr_timer_cnt(struct sk_buff *buf)
-{
-	msg_incr_reroute_cnt(buf_msg(buf));
-}
-
 /*
  * tipc_link_recv_fragment(): Called with node lock on. Returns
  * the reassembled buffer if message is complete.
@@ -2585,38 +2573,6 @@ int tipc_link_recv_fragment(struct sk_buff **pending, struct sk_buff **fb,
 	return 0;
 }
 
-/**
- * link_check_defragm_bufs - flush stale incoming message fragments
- * @l_ptr: pointer to link
- */
-static void link_check_defragm_bufs(struct tipc_link *l_ptr)
-{
-	struct sk_buff *prev = NULL;
-	struct sk_buff *next = NULL;
-	struct sk_buff *buf = l_ptr->defragm_buf;
-
-	if (!buf)
-		return;
-	if (!link_working_working(l_ptr))
-		return;
-	while (buf) {
-		u32 cnt = get_timer_cnt(buf);
-
-		next = buf->next;
-		if (cnt < 4) {
-			incr_timer_cnt(buf);
-			prev = buf;
-		} else {
-			if (prev)
-				prev->next = buf->next;
-			else
-				l_ptr->defragm_buf = buf->next;
-			kfree_skb(buf);
-		}
-		buf = next;
-	}
-}
-
 static void link_set_supervision_props(struct tipc_link *l_ptr, u32 tolerance)
 {
 	if ((tolerance < TIPC_MIN_LINK_TOL) || (tolerance > TIPC_MAX_LINK_TOL))
-- 
1.7.12.1

^ permalink raw reply related

* [PATCH net-next 3/9] tipc: change sk_receive_queue upper limit
From: Paul Gortmaker @ 2012-12-08  1:19 UTC (permalink / raw)
  To: David Miller; +Cc: netdev, Jon Maloy, Paul Gortmaker
In-Reply-To: <1354929558-16948-1-git-send-email-paul.gortmaker@windriver.com>

From: Jon Maloy <jon.maloy@ericsson.com>

The sk_recv_queue upper limit for connectionless sockets has empirically
turned out to be too low. When we double the current limit we get much
fewer rejected messages and no noticable negative side-effects.

Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: Paul Gortmaker <paul.gortmaker@windriver.com>
---
 net/tipc/socket.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/net/tipc/socket.c b/net/tipc/socket.c
index 848be69..f553fde 100644
--- a/net/tipc/socket.c
+++ b/net/tipc/socket.c
@@ -1,7 +1,7 @@
 /*
  * net/tipc/socket.c: TIPC socket API
  *
- * Copyright (c) 2001-2007, Ericsson AB
+ * Copyright (c) 2001-2007, 2012 Ericsson AB
  * Copyright (c) 2004-2008, 2010-2012, Wind River Systems
  * All rights reserved.
  *
@@ -43,7 +43,7 @@
 #define SS_LISTENING	-1	/* socket is listening */
 #define SS_READY	-2	/* socket is connectionless */
 
-#define OVERLOAD_LIMIT_BASE	5000
+#define OVERLOAD_LIMIT_BASE	10000
 #define CONN_TIMEOUT_DEFAULT	8000	/* default connect timeout = 8s */
 
 struct tipc_sock {
-- 
1.7.12.1

^ permalink raw reply related

* [PATCH net-next 5/9] tipc: consolidate connection-oriented message reception in one function
From: Paul Gortmaker @ 2012-12-08  1:19 UTC (permalink / raw)
  To: David Miller; +Cc: netdev, Jon Maloy, Ying Xue, Paul Gortmaker
In-Reply-To: <1354929558-16948-1-git-send-email-paul.gortmaker@windriver.com>

From: Ying Xue <ying.xue@windriver.com>

Handling of connection-related message reception is currently scattered
around at different places in the code. This makes it harder to verify
that things are handled correctly in all possible scenarios.
So we consolidate the existing processing of connection-oriented
message reception in a single routine.  In the process, we convert the
chain of if/else into a switch/case for improved readability.

A cast on the socket_state in the switch is needed to avoid compile
warnings on 32 bit, like "net/tipc/socket.c:1252:2: warning: case value
‘4294967295’ not in enumerated type".  This happens because existing
tipc code pseudo extends the default linux socket state values with:

	#define SS_LISTENING    -1      /* socket is listening */
	#define SS_READY        -2      /* socket is connectionless */

It may make sense to add these as _positive_ values to the existing
socket state enum list someday, vs. these already existing defines.

Signed-off-by: Ying Xue <ying.xue@windriver.com>
Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
[PG: add cast to fix warning; remove returns from middle of switch]
Signed-off-by: Paul Gortmaker <paul.gortmaker@windriver.com>
---
 net/tipc/socket.c | 75 +++++++++++++++++++++++++++++++++++++------------------
 1 file changed, 51 insertions(+), 24 deletions(-)

diff --git a/net/tipc/socket.c b/net/tipc/socket.c
index b630f38..d16a6de 100644
--- a/net/tipc/socket.c
+++ b/net/tipc/socket.c
@@ -1187,6 +1187,53 @@ static int rx_queue_full(struct tipc_msg *msg, u32 queue_size, u32 base)
 }
 
 /**
+ * filter_connect - Handle all incoming messages for a connection-based socket
+ * @tsock: TIPC socket
+ * @msg: message
+ *
+ * Returns TIPC error status code and socket error status code
+ * once it encounters some errors
+ */
+static u32 filter_connect(struct tipc_sock *tsock, struct sk_buff **buf)
+{
+	struct socket *sock = tsock->sk.sk_socket;
+	struct tipc_msg *msg = buf_msg(*buf);
+	u32 retval = TIPC_ERR_NO_PORT;
+
+	if (msg_mcast(msg))
+		return retval;
+
+	switch ((int)sock->state) {
+	case SS_CONNECTED:
+		/* Accept only connection-based messages sent by peer */
+		if (msg_connected(msg) && tipc_port_peer_msg(tsock->p, msg)) {
+			if (unlikely(msg_errcode(msg))) {
+				sock->state = SS_DISCONNECTING;
+				__tipc_disconnect(tsock->p);
+			}
+			retval = TIPC_OK;
+		}
+		break;
+	case SS_CONNECTING:
+		/* Accept only ACK or NACK message */
+		if (msg_connected(msg) || (msg_errcode(msg)))
+			retval = TIPC_OK;
+		break;
+	case SS_LISTENING:
+	case SS_UNCONNECTED:
+		/* Accept only SYN message */
+		if (!msg_connected(msg) && !(msg_errcode(msg)))
+			retval = TIPC_OK;
+		break;
+	case SS_DISCONNECTING:
+		break;
+	default:
+		pr_err("Unknown socket state %u\n", sock->state);
+	}
+	return retval;
+}
+
+/**
  * filter_rcv - validate incoming message
  * @sk: socket
  * @buf: message
@@ -1203,6 +1250,7 @@ static u32 filter_rcv(struct sock *sk, struct sk_buff *buf)
 	struct socket *sock = sk->sk_socket;
 	struct tipc_msg *msg = buf_msg(buf);
 	u32 recv_q_len;
+	u32 res = TIPC_OK;
 
 	/* Reject message if it is wrong sort of message for socket */
 	if (msg_type(msg) > TIPC_DIRECT_MSG)
@@ -1212,24 +1260,9 @@ static u32 filter_rcv(struct sock *sk, struct sk_buff *buf)
 		if (msg_connected(msg))
 			return TIPC_ERR_NO_PORT;
 	} else {
-		if (msg_mcast(msg))
-			return TIPC_ERR_NO_PORT;
-		if (sock->state == SS_CONNECTED) {
-			if (!msg_connected(msg) ||
-			    !tipc_port_peer_msg(tipc_sk_port(sk), msg))
-				return TIPC_ERR_NO_PORT;
-		} else if (sock->state == SS_CONNECTING) {
-			if (!msg_connected(msg) && (msg_errcode(msg) == 0))
-				return TIPC_ERR_NO_PORT;
-		} else if (sock->state == SS_LISTENING) {
-			if (msg_connected(msg) || msg_errcode(msg))
-				return TIPC_ERR_NO_PORT;
-		} else if (sock->state == SS_DISCONNECTING) {
-			return TIPC_ERR_NO_PORT;
-		} else /* (sock->state == SS_UNCONNECTED) */ {
-			if (msg_connected(msg) || msg_errcode(msg))
-				return TIPC_ERR_NO_PORT;
-		}
+		res = filter_connect(tipc_sk(sk), &buf);
+		if (res != TIPC_OK || buf == NULL)
+			return res;
 	}
 
 	/* Reject message if there isn't room to queue it */
@@ -1243,12 +1276,6 @@ static u32 filter_rcv(struct sock *sk, struct sk_buff *buf)
 	TIPC_SKB_CB(buf)->handle = 0;
 	__skb_queue_tail(&sk->sk_receive_queue, buf);
 
-	/* Initiate connection termination for an incoming 'FIN' */
-	if (unlikely(msg_errcode(msg) && (sock->state == SS_CONNECTED))) {
-		sock->state = SS_DISCONNECTING;
-		__tipc_disconnect(tipc_sk_port(sk));
-	}
-
 	sk->sk_data_ready(sk, 0);
 	return TIPC_OK;
 }
-- 
1.7.12.1

^ permalink raw reply related

* [PATCH net-next 6/9] tipc: introduce non-blocking socket connect
From: Paul Gortmaker @ 2012-12-08  1:19 UTC (permalink / raw)
  To: David Miller; +Cc: netdev, Jon Maloy, Ying Xue, Paul Gortmaker
In-Reply-To: <1354929558-16948-1-git-send-email-paul.gortmaker@windriver.com>

From: Ying Xue <ying.xue@windriver.com>

TIPC has so far only supported blocking connect(), meaning that a call
to connect() doesn't return until either the connection is fully
established, or an error occurs. This has proved insufficient for many
users, so we now introduce non-blocking connect(), analogous to how
this is done in TCP and other protocols.

With this feature, if a connection cannot be established instantly,
connect() will return the error code "-EINPROGRESS".
If the user later calls connect() again, he will either have the
return code "-EALREADY" or "-EISCONN", depending on whether the
connection has been established or not.

The user must have explicitly set the socket to be non-blocking
(SOCK_NONBLOCK or O_NONBLOCK, depending on method used), so unless
for some reason they had set this already (the socket would anyway
remain blocking in current TIPC) this change should be completely
backwards compatible.

It is also now possible to call select() or poll() to wait for the
completion of a connection.

An effect of the above is that the actual completion of a connection
may now be performed asynchronously, independent of the calls from
user space. Therefore, we now execute this code in BH context, in
the function filter_rcv(), which is executed upon reception of
messages in the socket.

Signed-off-by: Ying Xue <ying.xue@windriver.com>
Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
[PG: minor refactoring for improved connect/disconnect function names]
Signed-off-by: Paul Gortmaker <paul.gortmaker@windriver.com>
---
 net/tipc/socket.c | 158 ++++++++++++++++++++++++++++++++----------------------
 1 file changed, 93 insertions(+), 65 deletions(-)

diff --git a/net/tipc/socket.c b/net/tipc/socket.c
index d16a6de..dbce274 100644
--- a/net/tipc/socket.c
+++ b/net/tipc/socket.c
@@ -775,16 +775,19 @@ exit:
 static int auto_connect(struct socket *sock, struct tipc_msg *msg)
 {
 	struct tipc_sock *tsock = tipc_sk(sock->sk);
-
-	if (msg_errcode(msg)) {
-		sock->state = SS_DISCONNECTING;
-		return -ECONNREFUSED;
-	}
+	struct tipc_port *p_ptr;
 
 	tsock->peer_name.ref = msg_origport(msg);
 	tsock->peer_name.node = msg_orignode(msg);
-	tipc_connect(tsock->p->ref, &tsock->peer_name);
-	tipc_set_portimportance(tsock->p->ref, msg_importance(msg));
+	p_ptr = tipc_port_deref(tsock->p->ref);
+	if (!p_ptr)
+		return -EINVAL;
+
+	__tipc_connect(tsock->p->ref, p_ptr, &tsock->peer_name);
+
+	if (msg_importance(msg) > TIPC_CRITICAL_IMPORTANCE)
+		return -EINVAL;
+	msg_set_importance(&p_ptr->phdr, (u32)msg_importance(msg));
 	sock->state = SS_CONNECTED;
 	return 0;
 }
@@ -1198,7 +1201,9 @@ static u32 filter_connect(struct tipc_sock *tsock, struct sk_buff **buf)
 {
 	struct socket *sock = tsock->sk.sk_socket;
 	struct tipc_msg *msg = buf_msg(*buf);
+	struct sock *sk = &tsock->sk;
 	u32 retval = TIPC_ERR_NO_PORT;
+	int res;
 
 	if (msg_mcast(msg))
 		return retval;
@@ -1216,8 +1221,36 @@ static u32 filter_connect(struct tipc_sock *tsock, struct sk_buff **buf)
 		break;
 	case SS_CONNECTING:
 		/* Accept only ACK or NACK message */
-		if (msg_connected(msg) || (msg_errcode(msg)))
+		if (unlikely(msg_errcode(msg))) {
+			sock->state = SS_DISCONNECTING;
+			sk->sk_err = -ECONNREFUSED;
+			retval = TIPC_OK;
+			break;
+		}
+
+		if (unlikely(!msg_connected(msg)))
+			break;
+
+		res = auto_connect(sock, msg);
+		if (res) {
+			sock->state = SS_DISCONNECTING;
+			sk->sk_err = res;
 			retval = TIPC_OK;
+			break;
+		}
+
+		/* If an incoming message is an 'ACK-', it should be
+		 * discarded here because it doesn't contain useful
+		 * data. In addition, we should try to wake up
+		 * connect() routine if sleeping.
+		 */
+		if (msg_data_sz(msg) == 0) {
+			kfree_skb(*buf);
+			*buf = NULL;
+			if (waitqueue_active(sk_sleep(sk)))
+				wake_up_interruptible(sk_sleep(sk));
+		}
+		retval = TIPC_OK;
 		break;
 	case SS_LISTENING:
 	case SS_UNCONNECTED:
@@ -1361,8 +1394,6 @@ static int connect(struct socket *sock, struct sockaddr *dest, int destlen,
 	struct sock *sk = sock->sk;
 	struct sockaddr_tipc *dst = (struct sockaddr_tipc *)dest;
 	struct msghdr m = {NULL,};
-	struct sk_buff *buf;
-	struct tipc_msg *msg;
 	unsigned int timeout;
 	int res;
 
@@ -1374,26 +1405,6 @@ static int connect(struct socket *sock, struct sockaddr *dest, int destlen,
 		goto exit;
 	}
 
-	/* For now, TIPC does not support the non-blocking form of connect() */
-	if (flags & O_NONBLOCK) {
-		res = -EOPNOTSUPP;
-		goto exit;
-	}
-
-	/* Issue Posix-compliant error code if socket is in the wrong state */
-	if (sock->state == SS_LISTENING) {
-		res = -EOPNOTSUPP;
-		goto exit;
-	}
-	if (sock->state == SS_CONNECTING) {
-		res = -EALREADY;
-		goto exit;
-	}
-	if (sock->state != SS_UNCONNECTED) {
-		res = -EISCONN;
-		goto exit;
-	}
-
 	/*
 	 * Reject connection attempt using multicast address
 	 *
@@ -1405,49 +1416,66 @@ static int connect(struct socket *sock, struct sockaddr *dest, int destlen,
 		goto exit;
 	}
 
-	/* Reject any messages already in receive queue (very unlikely) */
-	reject_rx_queue(sk);
+	timeout = (flags & O_NONBLOCK) ? 0 : tipc_sk(sk)->conn_timeout;
+
+	switch (sock->state) {
+	case SS_UNCONNECTED:
+		/* Send a 'SYN-' to destination */
+		m.msg_name = dest;
+		m.msg_namelen = destlen;
+
+		/* If connect is in non-blocking case, set MSG_DONTWAIT to
+		 * indicate send_msg() is never blocked.
+		 */
+		if (!timeout)
+			m.msg_flags = MSG_DONTWAIT;
+
+		res = send_msg(NULL, sock, &m, 0);
+		if ((res < 0) && (res != -EWOULDBLOCK))
+			goto exit;
 
-	/* Send a 'SYN-' to destination */
-	m.msg_name = dest;
-	m.msg_namelen = destlen;
-	res = send_msg(NULL, sock, &m, 0);
-	if (res < 0)
+		/* Just entered SS_CONNECTING state; the only
+		 * difference is that return value in non-blocking
+		 * case is EINPROGRESS, rather than EALREADY.
+		 */
+		res = -EINPROGRESS;
+		break;
+	case SS_CONNECTING:
+		res = -EALREADY;
+		break;
+	case SS_CONNECTED:
+		res = -EISCONN;
+		break;
+	default:
+		res = -EINVAL;
 		goto exit;
+	}
 
-	/* Wait until an 'ACK' or 'RST' arrives, or a timeout occurs */
-	timeout = tipc_sk(sk)->conn_timeout;
-	release_sock(sk);
-	res = wait_event_interruptible_timeout(*sk_sleep(sk),
-			(!skb_queue_empty(&sk->sk_receive_queue) ||
-			(sock->state != SS_CONNECTING)),
-			timeout ? (long)msecs_to_jiffies(timeout)
-				: MAX_SCHEDULE_TIMEOUT);
-	lock_sock(sk);
+	if (sock->state == SS_CONNECTING) {
+		if (!timeout)
+			goto exit;
 
-	if (res > 0) {
-		buf = skb_peek(&sk->sk_receive_queue);
-		if (buf != NULL) {
-			msg = buf_msg(buf);
-			res = auto_connect(sock, msg);
-			if (!res) {
-				if (!msg_data_sz(msg))
-					advance_rx_queue(sk);
-			}
-		} else {
-			if (sock->state == SS_CONNECTED)
-				res = -EISCONN;
+		/* Wait until an 'ACK' or 'RST' arrives, or a timeout occurs */
+		release_sock(sk);
+		res = wait_event_interruptible_timeout(*sk_sleep(sk),
+				sock->state != SS_CONNECTING,
+				timeout ? (long)msecs_to_jiffies(timeout)
+					: MAX_SCHEDULE_TIMEOUT);
+		lock_sock(sk);
+		if (res <= 0) {
+			if (res == 0)
+				res = -ETIMEDOUT;
 			else
-				res = -ECONNREFUSED;
+				; /* leave "res" unchanged */
+			goto exit;
 		}
-	} else {
-		if (res == 0)
-			res = -ETIMEDOUT;
-		else
-			; /* leave "res" unchanged */
-		sock->state = SS_DISCONNECTING;
 	}
 
+	if (unlikely(sock->state == SS_DISCONNECTING))
+		res = sock_error(sk);
+	else
+		res = 0;
+
 exit:
 	release_sock(sk);
 	return res;
-- 
1.7.12.1

^ permalink raw reply related

* [PATCH net-next 4/9] tipc: standardize across connect/disconnect function naming
From: Paul Gortmaker @ 2012-12-08  1:19 UTC (permalink / raw)
  To: David Miller; +Cc: netdev, Jon Maloy, Paul Gortmaker
In-Reply-To: <1354929558-16948-1-git-send-email-paul.gortmaker@windriver.com>

Currently we have tipc_disconnect and tipc_disconnect_port.  It is
not clear from the names alone, what they do or how they differ.
It turns out that tipc_disconnect just deals with the port locking
and then calls tipc_disconnect_port which does all the work.

If we rename as follows: tipc_disconnect_port --> __tipc_disconnect
then we will be following typical linux convention, where:

   __tipc_disconnect: "raw" function that does all the work.

   tipc_disconnect: wrapper that deals with locking and then calls
		    the real core __tipc_disconnect function

With this, the difference is immediately evident, and locking
violations are more apt to be spotted by chance while working on,
or even just while reading the code.

On the connect side of things, we currently only have the single
"tipc_connect2port" function.  It does both the locking at enter/exit,
and the core of the work.  Pending changes will make it desireable to
have the connect be a two part locking wrapper + worker function,
just like the disconnect is already.

Here, we make the connect look just like the updated disconnect case,
for the above reason, and for consistency.  In the process, we also
get rid of the "2port" suffix that was on the original name, since
it adds no descriptive value.

On close examination, one might notice that the above connect
changes implicitly move the call to tipc_link_get_max_pkt() to be
within the scope of tipc_port_lock() protected region; when it was
not previously.  We don't see any issues with this, and it is in
keeping with __tipc_connect doing the work and tipc_connect just
handling the locking.

Signed-off-by: Paul Gortmaker <paul.gortmaker@windriver.com>
---
 net/tipc/port.c   | 32 +++++++++++++++++++++++---------
 net/tipc/port.h   |  6 ++++--
 net/tipc/socket.c |  6 +++---
 net/tipc/subscr.c |  2 +-
 4 files changed, 31 insertions(+), 15 deletions(-)

diff --git a/net/tipc/port.c b/net/tipc/port.c
index 07c42fb..18098ca 100644
--- a/net/tipc/port.c
+++ b/net/tipc/port.c
@@ -726,7 +726,7 @@ static void port_dispatcher_sigh(void *dummy)
 				if (unlikely(!cb))
 					goto reject;
 				if (unlikely(!connected)) {
-					if (tipc_connect2port(dref, &orig))
+					if (tipc_connect(dref, &orig))
 						goto reject;
 				} else if (peer_invalid)
 					goto reject;
@@ -1036,15 +1036,30 @@ int tipc_withdraw(u32 ref, unsigned int scope, struct tipc_name_seq const *seq)
 	return res;
 }
 
-int tipc_connect2port(u32 ref, struct tipc_portid const *peer)
+int tipc_connect(u32 ref, struct tipc_portid const *peer)
 {
 	struct tipc_port *p_ptr;
-	struct tipc_msg *msg;
-	int res = -EINVAL;
+	int res;
 
 	p_ptr = tipc_port_lock(ref);
 	if (!p_ptr)
 		return -EINVAL;
+	res = __tipc_connect(ref, p_ptr, peer);
+	tipc_port_unlock(p_ptr);
+	return res;
+}
+
+/*
+ * __tipc_connect - connect to a remote peer
+ *
+ * Port must be locked.
+ */
+int __tipc_connect(u32 ref, struct tipc_port *p_ptr,
+			struct tipc_portid const *peer)
+{
+	struct tipc_msg *msg;
+	int res = -EINVAL;
+
 	if (p_ptr->published || p_ptr->connected)
 		goto exit;
 	if (!peer->ref)
@@ -1067,17 +1082,16 @@ int tipc_connect2port(u32 ref, struct tipc_portid const *peer)
 			  (net_ev_handler)port_handle_node_down);
 	res = 0;
 exit:
-	tipc_port_unlock(p_ptr);
 	p_ptr->max_pkt = tipc_link_get_max_pkt(peer->node, ref);
 	return res;
 }
 
-/**
- * tipc_disconnect_port - disconnect port from peer
+/*
+ * __tipc_disconnect - disconnect port from peer
  *
  * Port must be locked.
  */
-int tipc_disconnect_port(struct tipc_port *tp_ptr)
+int __tipc_disconnect(struct tipc_port *tp_ptr)
 {
 	int res;
 
@@ -1104,7 +1118,7 @@ int tipc_disconnect(u32 ref)
 	p_ptr = tipc_port_lock(ref);
 	if (!p_ptr)
 		return -EINVAL;
-	res = tipc_disconnect_port(p_ptr);
+	res = __tipc_disconnect(p_ptr);
 	tipc_port_unlock(p_ptr);
 	return res;
 }
diff --git a/net/tipc/port.h b/net/tipc/port.h
index 4660e30..fb66e2e 100644
--- a/net/tipc/port.h
+++ b/net/tipc/port.h
@@ -190,7 +190,7 @@ int tipc_publish(u32 portref, unsigned int scope,
 int tipc_withdraw(u32 portref, unsigned int scope,
 		struct tipc_name_seq const *name_seq);
 
-int tipc_connect2port(u32 portref, struct tipc_portid const *port);
+int tipc_connect(u32 portref, struct tipc_portid const *port);
 
 int tipc_disconnect(u32 portref);
 
@@ -200,7 +200,9 @@ int tipc_shutdown(u32 ref);
 /*
  * The following routines require that the port be locked on entry
  */
-int tipc_disconnect_port(struct tipc_port *tp_ptr);
+int __tipc_disconnect(struct tipc_port *tp_ptr);
+int __tipc_connect(u32 ref, struct tipc_port *p_ptr,
+		   struct tipc_portid const *peer);
 int tipc_port_peer_msg(struct tipc_port *p_ptr, struct tipc_msg *msg);
 
 /*
diff --git a/net/tipc/socket.c b/net/tipc/socket.c
index f553fde..b630f38 100644
--- a/net/tipc/socket.c
+++ b/net/tipc/socket.c
@@ -783,7 +783,7 @@ static int auto_connect(struct socket *sock, struct tipc_msg *msg)
 
 	tsock->peer_name.ref = msg_origport(msg);
 	tsock->peer_name.node = msg_orignode(msg);
-	tipc_connect2port(tsock->p->ref, &tsock->peer_name);
+	tipc_connect(tsock->p->ref, &tsock->peer_name);
 	tipc_set_portimportance(tsock->p->ref, msg_importance(msg));
 	sock->state = SS_CONNECTED;
 	return 0;
@@ -1246,7 +1246,7 @@ static u32 filter_rcv(struct sock *sk, struct sk_buff *buf)
 	/* Initiate connection termination for an incoming 'FIN' */
 	if (unlikely(msg_errcode(msg) && (sock->state == SS_CONNECTED))) {
 		sock->state = SS_DISCONNECTING;
-		tipc_disconnect_port(tipc_sk_port(sk));
+		__tipc_disconnect(tipc_sk_port(sk));
 	}
 
 	sk->sk_data_ready(sk, 0);
@@ -1506,7 +1506,7 @@ static int accept(struct socket *sock, struct socket *new_sock, int flags)
 		/* Connect new socket to it's peer */
 		new_tsock->peer_name.ref = msg_origport(msg);
 		new_tsock->peer_name.node = msg_orignode(msg);
-		tipc_connect2port(new_ref, &new_tsock->peer_name);
+		tipc_connect(new_ref, &new_tsock->peer_name);
 		new_sock->state = SS_CONNECTED;
 
 		tipc_set_portimportance(new_ref, msg_importance(msg));
diff --git a/net/tipc/subscr.c b/net/tipc/subscr.c
index 0f7d0d0..6b42d47 100644
--- a/net/tipc/subscr.c
+++ b/net/tipc/subscr.c
@@ -462,7 +462,7 @@ static void subscr_named_msg_event(void *usr_handle,
 		kfree(subscriber);
 		return;
 	}
-	tipc_connect2port(subscriber->port_ref, orig);
+	tipc_connect(subscriber->port_ref, orig);
 
 	/* Lock server port (& save lock address for future use) */
 	subscriber->lock = tipc_port_lock(subscriber->port_ref)->lock;
-- 
1.7.12.1

^ permalink raw reply related

* [PATCH net-next 7/9] tipc: eliminate connection setup for implied connect in recv_msg()
From: Paul Gortmaker @ 2012-12-08  1:19 UTC (permalink / raw)
  To: David Miller; +Cc: netdev, Jon Maloy, Ying Xue, Paul Gortmaker
In-Reply-To: <1354929558-16948-1-git-send-email-paul.gortmaker@windriver.com>

From: Ying Xue <ying.xue@windriver.com>

As connection setup is now completed asynchronously in BH context,
in the function filter_connect(), the corresponding code in recv_msg()
becomes redundant.

Signed-off-by: Ying Xue <ying.xue@windriver.com>
Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: Paul Gortmaker <paul.gortmaker@windriver.com>
---
 net/tipc/socket.c | 7 -------
 1 file changed, 7 deletions(-)

diff --git a/net/tipc/socket.c b/net/tipc/socket.c
index dbce274..ef75b62 100644
--- a/net/tipc/socket.c
+++ b/net/tipc/socket.c
@@ -946,13 +946,6 @@ restart:
 	sz = msg_data_sz(msg);
 	err = msg_errcode(msg);
 
-	/* Complete connection setup for an implied connect */
-	if (unlikely(sock->state == SS_CONNECTING)) {
-		res = auto_connect(sock, msg);
-		if (res)
-			goto exit;
-	}
-
 	/* Discard an empty non-errored message & try again */
 	if ((!sz) && (!err)) {
 		advance_rx_queue(sk);
-- 
1.7.12.1

^ permalink raw reply related

page: next (older) | prev (newer) | latest
- recent:[subjects (threaded)|topics (new)|topics (active)]

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox