Netdev List

Netdev List
 help / color / mirror / Atom feed

* Re: [PATCH net] bridge: allow receiption on disabled port
From: Felix Fietkau @ 2013-10-10 20:56 UTC (permalink / raw)
  To: Stephen Hemminger; +Cc: netdev
In-Reply-To: <20131010133646.1bdd42c1@nehalam.linuxnetplumber.net>

On 2013-10-10 10:36 PM, Stephen Hemminger wrote:
> On Thu, 10 Oct 2013 14:52:50 +0200
> Felix Fietkau <nbd@openwrt.org> wrote:
> 
>> When an ethernet device is enslaved to a bridge, and the bridge STP
>> detects loss of carrier (or operational state down), then normally
>> packet receiption is blocked.
>> 
>> This breaks control applications like WPA which maybe expecting to
>> receive packets to negotiate to bring link up. The bridge needs to
>> block forwarding packets from these disabled ports, but there is no
>> hard requirement to not allow local packet delivery.
>> 
>> Signed-off-by: Stephen Hemminger <stephen@networkplumber.org>
>> Signed-off-by: Felix Fietkau <nbd@openwrt.org>
> 
> No. This will cause duplicate packets to be delivered.
How? I haven't observed any duplications in my tests with this patch.

> If doing a link layer protocol like WPA then it should be done directly
> on the underlying device, not the bridge itself.
When the ETH_P_PAE protocol is set for the packet socket inside
wpa_supplicant, the bridge steals all packets before the protocol
handler gets them.
In __netif_receive_skb_core, only ptype_all gets processed before the rx
handler, not ptype_base.


- Felix

^ permalink raw reply

* Re: [PATCH net-next] openvswitch: fix vport-netdev unregister
From: Alexei Starovoitov @ 2013-10-10 20:48 UTC (permalink / raw)
  To: Pravin Shelar
  Cc: David S. Miller, Jesse Gross, Jiri Pirko, dev@openvswitch.org,
	netdev
In-Reply-To: <CALnjE+oXJQoPG4io2JOn1zBbf7XtPKaBE+vHPNKXm2TBygvB5Q@mail.gmail.com>

On Thu, Oct 10, 2013 at 11:21 AM, Pravin Shelar <pshelar@nicira.com> wrote:
> On Wed, Oct 9, 2013 at 11:26 PM, Alexei Starovoitov <ast@plumgrid.com> wrote:
>> On Wed, Oct 9, 2013 at 11:07 PM, Pravin Shelar <pshelar@nicira.com> wrote:
>>> On Wed, Oct 9, 2013 at 9:11 PM, Alexei Starovoitov <ast@plumgrid.com> wrote:
>>>> On Wed, Oct 9, 2013 at 8:02 PM, Pravin Shelar <pshelar@nicira.com> wrote:
>>>>> On Tue, Oct 8, 2013 at 8:07 PM, Alexei Starovoitov <ast@plumgrid.com> wrote:
>>>>>> The combination of two commits
>>>>>>
>>>>>> commit 8e4e1713e4
>>>>>> ("openvswitch: Simplify datapath locking.")
>>>>>>
>>>>>> and
>>>>>>
>>>>>> commit 2537b4dd0a
>>>>>> ("openvswitch:: link upper device for port devices")
>>>>>>
>>>>>> introduced a bug where upper_dev wasn't unlinked upon
>>>>>> netdev_unregister notification
>>>>>>
>>>>>> The following steps:
>>>>>>
>>>>>>   modprobe openvswitch
>>>>>>   ovs-dpctl add-dp test
>>>>>>   ip tuntap add dev tap1 mode tap
>>>>>>   ovs-dpctl add-if test tap1
>>>>>>   ip tuntap del dev tap1 mode tap
>>>>>>
>>>>>> are causing multiple warnings:
>>>>>> diff --git a/net/openvswitch/dp_notify.c b/net/openvswitch/dp_notify.c
>>>>>> index c323567..e9380bd 100644
>>>>>> --- a/net/openvswitch/dp_notify.c
>>>>>> +++ b/net/openvswitch/dp_notify.c
>>>>>> @@ -88,6 +88,11 @@ static int dp_device_event(struct notifier_block *unused, unsigned long event,
>>>>>>                 return NOTIFY_DONE;
>>>>>>
>>>>>>         if (event == NETDEV_UNREGISTER) {
>>>>>> +               /* rx_handler_unregister and upper_dev_unlink immediately */
>>>>>> +               if (dev->reg_state == NETREG_UNREGISTERING)
>>>>>> +                       ovs_netdev_unlink_dev(vport);
>>>>>> +
>>>>>
>>>>> Rather than doing vport destroy here, we can just unlink upper device
>>>>> and let workq do rest of work.
>>>>
>>>> isn't it what it's doing?
>>>
>>> I meant just call netdev_upper_dev_unlink() here in event handler and
>>> rest of vport destroy can be done in workq.
>>
>> netdev_upper_dev_unlink() without netdev_rx_handler_unregister() ?!
>> that's dangerous.
> why is it dangerous? ovs already had ref to net-device.

comment from dev.c:
                /* Notify protocols, that we are about to destroy
                   this device. They should clean all the things.
                */
                call_netdevice_notifiers(NETDEV_UNREGISTER, dev);

so here you're suggesting to just netdev_upper_dev_unlink() to silence
the warning,
but then do dev_set_promisc(-1) in workqueue?

well, as a minimum audit of promiscuity will be wrong.
ndo_change_rx_flags will be called after ndo_uninit,
all sorts of other cleanups are done.
I cannot track all possible scenarios, but it seems much safer to
cleanup everything possible
as soon as ovs received NETDEV_UNREGISTER event.

May be all these risks are worth taking, then please explain what is
the problem with the proposed patch?

Thanks
Alex

^ permalink raw reply

* Re: [PATCH net] bridge: allow receiption on disabled port
From: Stephen Hemminger @ 2013-10-10 20:36 UTC (permalink / raw)
  To: Felix Fietkau; +Cc: netdev
In-Reply-To: <1381409570-1892-1-git-send-email-nbd@openwrt.org>

On Thu, 10 Oct 2013 14:52:50 +0200
Felix Fietkau <nbd@openwrt.org> wrote:

> When an ethernet device is enslaved to a bridge, and the bridge STP
> detects loss of carrier (or operational state down), then normally
> packet receiption is blocked.
> 
> This breaks control applications like WPA which maybe expecting to
> receive packets to negotiate to bring link up. The bridge needs to
> block forwarding packets from these disabled ports, but there is no
> hard requirement to not allow local packet delivery.
> 
> Signed-off-by: Stephen Hemminger <stephen@networkplumber.org>
> Signed-off-by: Felix Fietkau <nbd@openwrt.org>

No. This will cause duplicate packets to be delivered.

If doing a link layer protocol like WPA then it should be done directly
on the underlying device, not the bridge itself.

^ permalink raw reply

* [PATCH] bridge: update mdb expiration timer upon reports.
From: Vlad Yasevich @ 2013-10-10 19:57 UTC (permalink / raw)
  To: netdev; +Cc: Vlad Yasevich, Cong Wang, Herbert Xu, Stephen Hemminger

commit 9f00b2e7cf241fa389733d41b615efdaa2cb0f5b
	bridge: only expire the mdb entry when query is received
changed the mdb expiration timer to be armed only when QUERY is
received.  Howerver, this causes issues in an environment where
the multicast server socket comes and goes very fast while a client
is trying to send traffic to it.

The root cause is a race where a sequence of LEAVE followed by REPORT
messages can race against QUERY messages generated in response to LEAVE.
The QUERY ends up starting the expiration timer, and that timer can
potentially expire after the new REPORT message has been received signaling
the new join operation.  This leads to a significant drop in multicast
traffic and possible complete stall. 

The solution is to have REPORT messages update the expiration timer
on entries that already exist.

CC: Cong Wang <xiyou.wangcong@gmail.com>
CC: Herbert Xu <herbert@gondor.apana.org.au>
CC: Stephen Hemminger <stephen@networkplumber.org>
Signed-off-by: Vlad Yasevich <vyasevic@redhat.com>
---
 net/bridge/br_multicast.c | 9 ++++++++-
 1 file changed, 8 insertions(+), 1 deletion(-)

diff --git a/net/bridge/br_multicast.c b/net/bridge/br_multicast.c
index d1c5786..1085f21 100644
--- a/net/bridge/br_multicast.c
+++ b/net/bridge/br_multicast.c
@@ -611,6 +611,9 @@ rehash:
 		break;

 	default:
+		/* If we have an existing entry, update it's expire timer */
+		mod_timer(&mp->timer,
+			  jiffies + br->multicast_membership_interval);
 		goto out;
 	}

@@ -680,8 +683,12 @@ static int br_multicast_add_group(struct net_bridge *br,
 	for (pp = &mp->ports;
 	     (p = mlock_dereference(*pp, br)) != NULL;
 	     pp = &p->next) {
-		if (p->port == port)
+		if (p->port == port) {
+			/* We already have a portgroup, update the timer.  */
+			mod_timer(&p->timer,
+				  jiffies + br->multicast_membership_interval);
 			goto out;
+		}
 		if ((unsigned long)p->port < (unsigned long)port)
 			break;
 	}
-- 
1.8.3.1

^ permalink raw reply related

* Re: [net-next 00/13][pull request] Intel Wired LAN Driver Updates
From: David Miller @ 2013-10-10 19:30 UTC (permalink / raw)
  To: jeffrey.t.kirsher; +Cc: netdev, gospo, sassmann
In-Reply-To: <1381387271-29605-1-git-send-email-jeffrey.t.kirsher@intel.com>

From: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
Date: Wed,  9 Oct 2013 23:40:58 -0700

> This series contains updates to i40e only.
> 
> Alex provides the majority of the patches against i40e, where he does
> cleanup of the Tx and RX queues and to align the code with the known
> good Tx/Rx queue code in the ixgbe driver.
> 
> Anjali provides an i40e patch to update link events to not print to
> the log until the device is administratively up.
> 
> Catherine provides a patch to update the driver version.

Pulled, thanks a lot Jeff.

^ permalink raw reply

* Re: [PATCH v2 tip/core/rcu 07/13] ipv6/ip6_tunnel: Apply rcu_access_pointer() to avoid sparse false positive
From: Paul E. McKenney @ 2013-10-10 19:05 UTC (permalink / raw)
  To: Eric Dumazet, Josh Triplett, linux-kernel, mingo, laijs, dipankar,
	akpm, mathieu.desnoyers, niv, tglx, peterz, rostedt, dhowells,
	edumazet, darren, fweisbec, sbw, David S. Miller,
	Alexey Kuznetsov, James Morris, Hideaki YOSHIFUJI,
	Patrick McHardy, netdev
In-Reply-To: <20131010020422.GB24368@order.stressinduktion.org>

On Thu, Oct 10, 2013 at 04:04:22AM +0200, Hannes Frederic Sowa wrote:
> On Wed, Oct 09, 2013 at 05:28:33PM -0700, Paul E. McKenney wrote:
> > On Wed, Oct 09, 2013 at 05:12:40PM -0700, Eric Dumazet wrote:
> > > On Wed, 2013-10-09 at 16:40 -0700, Josh Triplett wrote:
> > > 
> > > > that.  Constructs like list_del_rcu are much clearer, and not
> > > > open-coded.  Open-coding synchronization code is almost always a Bad
> > > > Idea.
> > > 
> > > OK, so you think there is synchronization code.
> > > 
> > > I will shut up then, no need to waste time.
> > 
> > As you said earlier, we should at least get rid of the memory barrier
> > as long as we are changing the code.
> 
> Interesting thread!
> 
> Sorry to chime in and asking a question:
> 
> Why do we need an ACCESS_ONCE here if rcu_assign_pointer can do without one?
> In other words I wonder why rcu_assign_pointer is not a static inline function
> to use the sequence point in argument evaluation (if I remember correctly this
> also holds for inline functions) to not allow something like this:
> 
> E.g. we want to publish which lock to take first to prevent an ABBA problem
> (extreme example):
> 
> rcu_assign_pointer(lockptr, min(lptr1, lptr2));
> 
> Couldn't a compiler spill the lockptr memory location as a temporary buffer
> if the compiler is under register pressure? (yes, this seems unlikely if we
> flushed out most registers to memory because of the barrier, but still... ;) )
> 
> This seems to be also the case if we publish a multi-dereferencing pointers
> e.g. ptr->ptr->ptr.

IIRC, sequence points only confine volatile accesses.  For non-volatile
accesses, the so-called "as-if rule" allows compiler writers to do some
surprisingly global reordering.

The reason that rcu_assign_pointer() isn't an inline function is because
it needs to be type-generic, in other words, it needs to be OK to use
it on any type of pointers as long as the C types of the two pointers
match (the sparse types can vary a bit).

One of the reasons for wanting a volatile cast in rcu_assign_pointer() is
to prevent compiler mischief such as you described in your last two
paragraphs.  That said, it would take a very brave compiler to pull
a pointer-referenced memory location into a register and keep it there.
Unfortunately, increasing compiler bravery seems to be a solid long-term
trend.

							Thanx, Paul

^ permalink raw reply

* Re: [PATCH net-next] inet: rename ir_loc_port to ir_num
From: David Miller @ 2013-10-10 18:39 UTC (permalink / raw)
  To: eric.dumazet; +Cc: netdev
In-Reply-To: <1381388677.4971.65.camel@edumazet-glaptop.roam.corp.google.com>

From: Eric Dumazet <eric.dumazet@gmail.com>
Date: Thu, 10 Oct 2013 00:04:37 -0700

> From: Eric Dumazet <edumazet@google.com>
> 
> In commit 634fb979e8f ("inet: includes a sock_common in request_sock")
> I forgot that the two ports in sock_common do not have same byte order :
> 
> skc_dport is __be16 (network order), but skc_num is __u16 (host order)
> 
> So sparse complains because ir_loc_port (mapped into skc_num) is
> considered as __u16 while it should be __be16
> 
> Let rename ir_loc_port to ireq->ir_num (analogy with inet->inet_num),
> and perform appropriate htons/ntohs conversions.
> 
> Signed-off-by: Eric Dumazet <edumazet@google.com>
> Reported-by: Wu Fengguang <fengguang.wu@intel.com>

Applied, thanks Eric.

^ permalink raw reply

* Re: [PATCH net-next] openvswitch: fix vport-netdev unregister
From: Pravin Shelar @ 2013-10-10 18:21 UTC (permalink / raw)
  To: Alexei Starovoitov
  Cc: David S. Miller, Jesse Gross, Jiri Pirko, dev@openvswitch.org,
	netdev
In-Reply-To: <CAMEtUuzP1b_VNqP1texkrQ7kvCRxEukUoRz=Dj-pMX9gsO9AWw@mail.gmail.com>

On Wed, Oct 9, 2013 at 11:26 PM, Alexei Starovoitov <ast@plumgrid.com> wrote:
> On Wed, Oct 9, 2013 at 11:07 PM, Pravin Shelar <pshelar@nicira.com> wrote:
>> On Wed, Oct 9, 2013 at 9:11 PM, Alexei Starovoitov <ast@plumgrid.com> wrote:
>>> On Wed, Oct 9, 2013 at 8:02 PM, Pravin Shelar <pshelar@nicira.com> wrote:
>>>> On Tue, Oct 8, 2013 at 8:07 PM, Alexei Starovoitov <ast@plumgrid.com> wrote:
>>>>> The combination of two commits
>>>>>
>>>>> commit 8e4e1713e4
>>>>> ("openvswitch: Simplify datapath locking.")
>>>>>
>>>>> and
>>>>>
>>>>> commit 2537b4dd0a
>>>>> ("openvswitch:: link upper device for port devices")
>>>>>
>>>>> introduced a bug where upper_dev wasn't unlinked upon
>>>>> netdev_unregister notification
>>>>>
>>>>> The following steps:
>>>>>
>>>>>   modprobe openvswitch
>>>>>   ovs-dpctl add-dp test
>>>>>   ip tuntap add dev tap1 mode tap
>>>>>   ovs-dpctl add-if test tap1
>>>>>   ip tuntap del dev tap1 mode tap
>>>>>
>>>>> are causing multiple warnings:
>>>>> diff --git a/net/openvswitch/dp_notify.c b/net/openvswitch/dp_notify.c
>>>>> index c323567..e9380bd 100644
>>>>> --- a/net/openvswitch/dp_notify.c
>>>>> +++ b/net/openvswitch/dp_notify.c
>>>>> @@ -88,6 +88,11 @@ static int dp_device_event(struct notifier_block *unused, unsigned long event,
>>>>>                 return NOTIFY_DONE;
>>>>>
>>>>>         if (event == NETDEV_UNREGISTER) {
>>>>> +               /* rx_handler_unregister and upper_dev_unlink immediately */
>>>>> +               if (dev->reg_state == NETREG_UNREGISTERING)
>>>>> +                       ovs_netdev_unlink_dev(vport);
>>>>> +
>>>>
>>>> Rather than doing vport destroy here, we can just unlink upper device
>>>> and let workq do rest of work.
>>>
>>> isn't it what it's doing?
>>
>> I meant just call netdev_upper_dev_unlink() here in event handler and
>> rest of vport destroy can be done in workq.
>
> netdev_upper_dev_unlink() without netdev_rx_handler_unregister() ?!
> that's dangerous.
why is it dangerous? ovs already had ref to net-device.

> If that is acceptable, then there was no reason to link them in the first place.
>
I do not see any harm in linking device hierarchy for ovs.

> notifier asks to unregister. imo the only acceptable deferred task
> here is to delay dev_put,
> since ovs structures are still referring to it.

^ permalink raw reply

* Re: [PATCH] carl9170: fix leaks at failure path in carl9170_usb_probe()
From: Christian Lamparter @ 2013-10-10 18:17 UTC (permalink / raw)
  To: John W. Linville
  Cc: Alexey Khoroshilov, Fabio Estevam, linux-wireless,
	netdev@vger.kernel.org, linux-kernel, ldv-project
In-Reply-To: <20131010175952.GG2691@tuxdriver.com>

On Thursday, October 10, 2013 01:59:52 PM John W. Linville wrote:
> On Sat, Sep 28, 2013 at 01:16:20AM -0400, Alexey Khoroshilov wrote:
> > On 28.09.2013 00:17, Fabio Estevam wrote:
> > >On Sat, Sep 28, 2013 at 12:51 AM, Alexey Khoroshilov
> > ><khoroshilov@ispras.ru> wrote:
> > >
> > >>-       return request_firmware_nowait(THIS_MODULE, 1, CARL9170FW_NAME,
> > >>+       err = request_firmware_nowait(THIS_MODULE, 1, CARL9170FW_NAME,
> > >>                 &ar->udev->dev, GFP_KERNEL, ar, carl9170_usb_firmware_step2);
> > >>+       if (err) {
> > >>+               usb_put_dev(udev);
> > >>+               usb_put_dev(udev);
> > >You are doing the same free twice.
> > Yes, because it was get twice.
> > >I guess you meant to also free: usb_put_dev(ar->udev)
> > udev and ar->udev are equal, so technically the patch is correct.
> > 
> > I agree that there is some inconsistency, but I would prefer to fix
> > it at usb_get_dev() side with a comment about reasons for the double
> > get.
> 
> What is the reason for the double get?

The idea is:
One (extra) reference protects the asynchronous firmware loader callback
from disappearing "udev".

Regards,
Chr
 
 

^ permalink raw reply

* Re: [PATCH RFC 00/77] Re-design MSI/MSI-X interrupts enablement pattern
From: Alexander Gordeev @ 2013-10-10 18:07 UTC (permalink / raw)
  To: H. Peter Anvin
  Cc: Benjamin Herrenschmidt, linux-kernel, Bjorn Helgaas, Ralf Baechle,
	Michael Ellerman, Martin Schwidefsky, Ingo Molnar, Tejun Heo,
	Dan Williams, Andy King, Jon Mason, Matt Porter, linux-pci,
	linux-mips, linuxppc-dev, linux390, linux-s390, x86, linux-ide,
	iss_storagedev, linux-nvme, linux-rdma, netdev, e1000-devel,
	linux-dr
In-Reply-To: <5256D5AB.4050105@zytor.com>

On Thu, Oct 10, 2013 at 09:28:27AM -0700, H. Peter Anvin wrote:
> On 10/10/2013 03:17 AM, Alexander Gordeev wrote:
> > On Wed, Oct 09, 2013 at 03:24:08PM +1100, Benjamin Herrenschmidt wrote:
> > 
> > Ok, this suggestion sounded in one or another form by several people.
> > What about name it pcim_enable_msix_range() and wrap in couple more
> > helpers to complete an API:
> > 
> > int pcim_enable_msix_range(pdev, msix_entries, nvec, minvec);
> > 	<0 - error code
> > 	>0 - number of MSIs allocated, where minvec >= result <= nvec
> > 
> > int pcim_enable_msix(pdev, msix_entries, nvec);
> > 	<0 - error code
> > 	>0 - number of MSIs allocated, where 1 >= result <= nvec 
> > 
> > int pcim_enable_msix_exact(pdev, msix_entries, nvec);
> > 	<0 - error code
> > 	>0 - number of MSIs allocated, where result == nvec
> > 
> > The latter's return value seems odd, but I can not help to make
> > it consistent with the first two.
> > 
> 
> Is there a reason for the wrappers, as opposed to just specifying either
> 1 or nvec as the minimum?

The wrappers are more handy IMO.

I.e. can imagine people start struggling to figure out what minvec to provide:
1 or 0? Why 1? Oh.. okay.. Or should we tolerate 0 (as opposite to -ERANGE)?

Well, do not know.. pcim_enable_msix(pdev, msix_entries, nvec, nvec) is 
less readable for me than just pcim_enable_msix_exact(pdev, msix_entries,
nvec).

> 	-hpa

-- 
Regards,
Alexander Gordeev
agordeev@redhat.com

^ permalink raw reply

* Re: [PATCH] carl9170: fix leaks at failure path in carl9170_usb_probe()
From: John W. Linville @ 2013-10-10 17:59 UTC (permalink / raw)
  To: Alexey Khoroshilov
  Cc: Fabio Estevam, Christian Lamparter, linux-wireless,
	netdev@vger.kernel.org, linux-kernel, ldv-project
In-Reply-To: <52466624.4040106@ispras.ru>

On Sat, Sep 28, 2013 at 01:16:20AM -0400, Alexey Khoroshilov wrote:
> On 28.09.2013 00:17, Fabio Estevam wrote:
> >On Sat, Sep 28, 2013 at 12:51 AM, Alexey Khoroshilov
> ><khoroshilov@ispras.ru> wrote:
> >
> >>-       return request_firmware_nowait(THIS_MODULE, 1, CARL9170FW_NAME,
> >>+       err = request_firmware_nowait(THIS_MODULE, 1, CARL9170FW_NAME,
> >>                 &ar->udev->dev, GFP_KERNEL, ar, carl9170_usb_firmware_step2);
> >>+       if (err) {
> >>+               usb_put_dev(udev);
> >>+               usb_put_dev(udev);
> >You are doing the same free twice.
> Yes, because it was get twice.
> >I guess you meant to also free: usb_put_dev(ar->udev)
> udev and ar->udev are equal, so technically the patch is correct.
> 
> I agree that there is some inconsistency, but I would prefer to fix
> it at usb_get_dev() side with a comment about reasons for the double
> get.

What is the reason for the double get?

-- 
John W. Linville		Someday the world will need a hero, and you
linville@tuxdriver.com			might be all we have.  Be ready.

^ permalink raw reply

* [PATCH] ipv6: Initialize ip6_tnl.hlen in gre tunnel even if no route is found
From: Oussama Ghorbel @ 2013-10-10 17:50 UTC (permalink / raw)
  To: David S. Miller, Alexey Kuznetsov, James Morris,
	Hideaki YOSHIFUJI, Patrick McHardy
  Cc: netdev, linux-kernel, Hannes Frederic Sowa, Oussama Ghorbel

The ip6_tnl.hlen (gre and ipv6 headers length) is independent from the
outgoing interface, so it would be better to initialize it even when no
route is found, otherwise its value will be zero.
While I'm not sure if this could happen in real life, but doing that
will avoid to call the skb_push function with a zero in ip6gre_header
function.

Suggested-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
Signed-off-by: Oussama Ghorbel <ou.ghorbel@gmail.com>
---
 net/ipv6/ip6_gre.c |    3 +--
 1 file changed, 1 insertion(+), 2 deletions(-)

diff --git a/net/ipv6/ip6_gre.c b/net/ipv6/ip6_gre.c
index 90747f1..8e4d42e 100644
--- a/net/ipv6/ip6_gre.c
+++ b/net/ipv6/ip6_gre.c
@@ -978,6 +978,7 @@ static void ip6gre_tnl_link_config(struct ip6_tnl *t, int set_mtu)
 		if (t->parms.o_flags&GRE_SEQ)
 			addend += 4;
 	}
+	t->hlen = addend;
 
 	if (p->flags & IP6_TNL_F_CAP_XMIT) {
 		int strict = (ipv6_addr_type(&p->raddr) &
@@ -1004,8 +1005,6 @@ static void ip6gre_tnl_link_config(struct ip6_tnl *t, int set_mtu)
 		}
 		ip6_rt_put(rt);
 	}
-
-	t->hlen = addend;
 }
 
 static int ip6gre_tnl_change(struct ip6_tnl *t,
-- 
1.7.9.5

^ permalink raw reply related

* Re: [PATCH RFC 00/77] Re-design MSI/MSI-X interrupts enablement pattern
From: H. Peter Anvin @ 2013-10-10 16:28 UTC (permalink / raw)
  To: Alexander Gordeev
  Cc: Benjamin Herrenschmidt, linux-kernel, Bjorn Helgaas, Ralf Baechle,
	Michael Ellerman, Martin Schwidefsky, Ingo Molnar, Tejun Heo,
	Dan Williams, Andy King, Jon Mason, Matt Porter, linux-pci,
	linux-mips, linuxppc-dev, linux390, linux-s390, x86, linux-ide,
	iss_storagedev, linux-nvme, linux-rdma, netdev, e1000-devel,
	linux-dr
In-Reply-To: <20131010101704.GC11874@dhcp-26-207.brq.redhat.com>

On 10/10/2013 03:17 AM, Alexander Gordeev wrote:
> On Wed, Oct 09, 2013 at 03:24:08PM +1100, Benjamin Herrenschmidt wrote:
> 
> Ok, this suggestion sounded in one or another form by several people.
> What about name it pcim_enable_msix_range() and wrap in couple more
> helpers to complete an API:
> 
> int pcim_enable_msix_range(pdev, msix_entries, nvec, minvec);
> 	<0 - error code
> 	>0 - number of MSIs allocated, where minvec >= result <= nvec
> 
> int pcim_enable_msix(pdev, msix_entries, nvec);
> 	<0 - error code
> 	>0 - number of MSIs allocated, where 1 >= result <= nvec 
> 
> int pcim_enable_msix_exact(pdev, msix_entries, nvec);
> 	<0 - error code
> 	>0 - number of MSIs allocated, where result == nvec
> 
> The latter's return value seems odd, but I can not help to make
> it consistent with the first two.
> 

Is there a reason for the wrappers, as opposed to just specifying either
1 or nvec as the minimum?

	-hpa

^ permalink raw reply

* Re: [PATCH net-next] tcp: tcp_transmit_skb() optimizations
From: Yuchung Cheng @ 2013-10-10 15:54 UTC (permalink / raw)
  To: Eric Dumazet; +Cc: David Miller, netdev, MF Nowlan, Neal Cardwell
In-Reply-To: <1381419780.4971.84.camel@edumazet-glaptop.roam.corp.google.com>

On Thu, Oct 10, 2013 at 8:43 AM, Eric Dumazet <eric.dumazet@gmail.com> wrote:
> From: Eric Dumazet <edumazet@google.com>
>
> 1) We need to take a timestamp only for skb that should be cloned.
>
> Other skbs are not in write queue and no rtt estimation is done on them.
>
> 2) the unlikely() hint is wrong for receivers (they send pure ACK)
>
> Signed-off-by: Eric Dumazet <edumazet@google.com>
> Cc: MF Nowlan <fitz@cs.yale.edu>
> Cc: Yuchung Cheng <ycheng@google.com>
> Cc: Neal Cardwell <ncardwell@google.com>
Acked-By: Yuchung Cheng <ycheng@google.com>
> ---
>  net/ipv4/tcp_output.c |   14 +++++++-------
>  1 file changed, 7 insertions(+), 7 deletions(-)
>
> diff --git a/net/ipv4/tcp_output.c b/net/ipv4/tcp_output.c
> index faec813..c40f608 100644
> --- a/net/ipv4/tcp_output.c
> +++ b/net/ipv4/tcp_output.c
> @@ -850,15 +850,15 @@ static int tcp_transmit_skb(struct sock *sk, struct sk_buff *skb, int clone_it,
>
>         BUG_ON(!skb || !tcp_skb_pcount(skb));
>
> -       /* If congestion control is doing timestamping, we must
> -        * take such a timestamp before we potentially clone/copy.
> -        */
> -       if (icsk->icsk_ca_ops->flags & TCP_CONG_RTT_STAMP)
> -               __net_timestamp(skb);
> -
> -       if (likely(clone_it)) {
> +       if (clone_it) {
>                 const struct sk_buff *fclone = skb + 1;
>
> +               /* If congestion control is doing timestamping, we must
> +                * take such a timestamp before we potentially clone/copy.
> +                */
> +               if (icsk->icsk_ca_ops->flags & TCP_CONG_RTT_STAMP)
> +                       __net_timestamp(skb);
> +
>                 if (unlikely(skb->fclone == SKB_FCLONE_ORIG &&
>                              fclone->fclone == SKB_FCLONE_CLONE))
>                         NET_INC_STATS_BH(sock_net(sk),
>
>

^ permalink raw reply

* [PATCH net-next] tcp: tcp_transmit_skb() optimizations
From: Eric Dumazet @ 2013-10-10 15:43 UTC (permalink / raw)
  To: David Miller; +Cc: netdev, MF Nowlan, Yuchung Cheng, Neal Cardwell

From: Eric Dumazet <edumazet@google.com>

1) We need to take a timestamp only for skb that should be cloned.

Other skbs are not in write queue and no rtt estimation is done on them.

2) the unlikely() hint is wrong for receivers (they send pure ACK)

Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: MF Nowlan <fitz@cs.yale.edu>
Cc: Yuchung Cheng <ycheng@google.com>
Cc: Neal Cardwell <ncardwell@google.com>
---
 net/ipv4/tcp_output.c |   14 +++++++-------
 1 file changed, 7 insertions(+), 7 deletions(-)

diff --git a/net/ipv4/tcp_output.c b/net/ipv4/tcp_output.c
index faec813..c40f608 100644
--- a/net/ipv4/tcp_output.c
+++ b/net/ipv4/tcp_output.c
@@ -850,15 +850,15 @@ static int tcp_transmit_skb(struct sock *sk, struct sk_buff *skb, int clone_it,
 
 	BUG_ON(!skb || !tcp_skb_pcount(skb));
 
-	/* If congestion control is doing timestamping, we must
-	 * take such a timestamp before we potentially clone/copy.
-	 */
-	if (icsk->icsk_ca_ops->flags & TCP_CONG_RTT_STAMP)
-		__net_timestamp(skb);
-
-	if (likely(clone_it)) {
+	if (clone_it) {
 		const struct sk_buff *fclone = skb + 1;
 
+		/* If congestion control is doing timestamping, we must
+		 * take such a timestamp before we potentially clone/copy.
+		 */
+		if (icsk->icsk_ca_ops->flags & TCP_CONG_RTT_STAMP)
+			__net_timestamp(skb);
+
 		if (unlikely(skb->fclone == SKB_FCLONE_ORIG &&
 			     fclone->fclone == SKB_FCLONE_CLONE))
 			NET_INC_STATS_BH(sock_net(sk),

^ permalink raw reply related

* Re: [PATCH RFC 50/77] mlx5: Update MSI/MSI-X interrupts enablement code
From: Eli Cohen @ 2013-10-10 15:29 UTC (permalink / raw)
  To: Alexander Gordeev
  Cc: linux-kernel-u79uwXL29TY76Z2rM5mHXA, Bjorn Helgaas, Ralf Baechle,
	Michael Ellerman, Benjamin Herrenschmidt, Martin Schwidefsky,
	Ingo Molnar, Tejun Heo, Dan Williams, Andy King, Jon Mason,
	Matt Porter, linux-pci-u79uwXL29TY76Z2rM5mHXA,
	linux-mips-6z/3iImG2C8G8FEW9MqTrA,
	linuxppc-dev-uLR06cmDAlY/bJ5BZ2RsiQ,
	linux390-tA70FqPdS9bQT0dZR+AlfA,
	linux-s390-u79uwXL29TY76Z2rM5mHXA, x86-DgEjT+Ai2ygdnm+yROfE0A,
	linux-ide-u79uwXL29TY76Z2rM5mHXA, iss_storagedev-VXdhtT5mjnY,
	linux-nvme-IAPFreCvJWM7uuMidbF8XUB+6BGkLq7r,
	linux-rdma-u79uwXL29TY76Z2rM5mHXA, netdev-u79uwXL29TY76Z2rM5mHXA,
	e1000-devel-5NWGOfrQmneRv+LV9MX5uipxlwaOVQ5f,
	linux-driver-h88ZbnxC6KDQT0dZR+AlfA, Solarflare linux maintainers,
	"VMware, Inc." <pv-dr
In-Reply-To: <20131003194837.GA27636-hdGaXg0bp3uRXgp2RCiI5R/sF2h8X+2i0E9HWUfgJXw@public.gmane.org>

On Thu, Oct 03, 2013 at 09:48:39PM +0200, Alexander Gordeev wrote:
> 
> pci_enable_msix() may fail, but it can not return a positive number.
>

That is true according to the current logic but the comment on top of
pci_enable_msix() still says:
"A return of < 0 indicates a failure.  Or a return of > 0 indicates
that driver request is exceeding the number of irqs or MSI-X vectors
available"

So you're counting on an implementation that may change in the future.
I think leaving the code as it is now is safer.
--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply

* [PATCH net-next v3 2/5] xen-netback: add support for IPv6 checksum offload from guest
From: Paul Durrant @ 2013-10-10 15:25 UTC (permalink / raw)
  To: xen-devel, netdev; +Cc: Paul Durrant, Wei Liu, David Vrabel, Ian Campbell
In-Reply-To: <1381418733-20470-1-git-send-email-paul.durrant@citrix.com>

For performance of VM to VM traffic on a single host it is better to avoid
calculation of TCP/UDP checksum in the sending frontend. To allow this this
patch adds the code necessary to set up partial checksum for IPv6 packets
and xenstore flag feature-ipv6-csum-offload to advertise that fact to
frontends.

Signed-off-by: Paul Durrant <paul.durrant@citrix.com>
Cc: Wei Liu <wei.liu2@citrix.com>
Cc: David Vrabel <david.vrabel@citrix.com>
Cc: Ian Campbell <ian.campbell@citrix.com>
---
 drivers/net/xen-netback/netback.c |  243 +++++++++++++++++++++++++++++++------
 drivers/net/xen-netback/xenbus.c  |    9 ++
 2 files changed, 213 insertions(+), 39 deletions(-)

diff --git a/drivers/net/xen-netback/netback.c b/drivers/net/xen-netback/netback.c
index f3e591c..4c12030 100644
--- a/drivers/net/xen-netback/netback.c
+++ b/drivers/net/xen-netback/netback.c
@@ -109,15 +109,10 @@ static inline unsigned long idx_to_kaddr(struct xenvif *vif,
 	return (unsigned long)pfn_to_kaddr(idx_to_pfn(vif, idx));
 }
 
-/*
- * This is the amount of packet we copy rather than map, so that the
- * guest can't fiddle with the contents of the headers while we do
- * packet processing on them (netfilter, routing, etc).
+/* This is a miniumum size for the linear area to avoid lots of
+ * calls to __pskb_pull_tail() as we set up checksum offsets.
  */
-#define PKT_PROT_LEN    (ETH_HLEN + \
-			 VLAN_HLEN + \
-			 sizeof(struct iphdr) + MAX_IPOPTLEN + \
-			 sizeof(struct tcphdr) + MAX_TCP_OPTION_SPACE)
+#define PKT_PROT_LEN 128
 
 static u16 frag_get_pending_idx(skb_frag_t *frag)
 {
@@ -1118,61 +1113,74 @@ static int xenvif_set_skb_gso(struct xenvif *vif,
 	return 0;
 }
 
-static int checksum_setup(struct xenvif *vif, struct sk_buff *skb)
+static inline void maybe_pull_tail(struct sk_buff *skb, unsigned int len)
+{
+	if (skb_is_nonlinear(skb) && skb_headlen(skb) < len) {
+		/* If we need to pullup then pullup to the max, so we
+		 * won't need to do it again.
+		 */
+		int target = min_t(int, skb->len, MAX_TCP_HEADER);
+		__pskb_pull_tail(skb, target - skb_headlen(skb));
+	}
+}
+
+static int checksum_setup_ip(struct xenvif *vif, struct sk_buff *skb,
+			     int recalculate_partial_csum)
 {
-	struct iphdr *iph;
+	struct iphdr *iph = (void *)skb->data;
+	unsigned int header_size;
+	unsigned int off;
 	int err = -EPROTO;
-	int recalculate_partial_csum = 0;
 
-	/*
-	 * A GSO SKB must be CHECKSUM_PARTIAL. However some buggy
-	 * peers can fail to set NETRXF_csum_blank when sending a GSO
-	 * frame. In this case force the SKB to CHECKSUM_PARTIAL and
-	 * recalculate the partial checksum.
-	 */
-	if (skb->ip_summed != CHECKSUM_PARTIAL && skb_is_gso(skb)) {
-		vif->rx_gso_checksum_fixup++;
-		skb->ip_summed = CHECKSUM_PARTIAL;
-		recalculate_partial_csum = 1;
-	}
+	off = sizeof(struct iphdr);
 
-	/* A non-CHECKSUM_PARTIAL SKB does not require setup. */
-	if (skb->ip_summed != CHECKSUM_PARTIAL)
-		return 0;
+	header_size = skb->network_header + off + MAX_IPOPTLEN;
+	maybe_pull_tail(skb, header_size);
 
-	if (skb->protocol != htons(ETH_P_IP))
-		goto out;
+	off = iph->ihl * 4;
 
-	iph = (void *)skb->data;
 	switch (iph->protocol) {
 	case IPPROTO_TCP:
-		if (!skb_partial_csum_set(skb, 4 * iph->ihl,
+		if (!skb_partial_csum_set(skb, off,
 					  offsetof(struct tcphdr, check)))
 			goto out;
 
 		if (recalculate_partial_csum) {
 			struct tcphdr *tcph = tcp_hdr(skb);
+
+			header_size = skb->network_header +
+				off +
+				sizeof(struct tcphdr);
+			maybe_pull_tail(skb, header_size);
+
 			tcph->check = ~csum_tcpudp_magic(iph->saddr, iph->daddr,
-							 skb->len - iph->ihl*4,
+							 skb->len - off,
 							 IPPROTO_TCP, 0);
 		}
 		break;
 	case IPPROTO_UDP:
-		if (!skb_partial_csum_set(skb, 4 * iph->ihl,
+		if (!skb_partial_csum_set(skb, off,
 					  offsetof(struct udphdr, check)))
 			goto out;
 
 		if (recalculate_partial_csum) {
 			struct udphdr *udph = udp_hdr(skb);
+
+			header_size = skb->network_header +
+				off +
+				sizeof(struct udphdr);
+			maybe_pull_tail(skb, header_size);
+
 			udph->check = ~csum_tcpudp_magic(iph->saddr, iph->daddr,
-							 skb->len - iph->ihl*4,
+							 skb->len - off,
 							 IPPROTO_UDP, 0);
 		}
 		break;
 	default:
 		if (net_ratelimit())
 			netdev_err(vif->dev,
-				   "Attempting to checksum a non-TCP/UDP packet, dropping a protocol %d packet\n",
+				   "Attempting to checksum a non-TCP/UDP packet, "
+				   "dropping a protocol %d packet\n",
 				   iph->protocol);
 		goto out;
 	}
@@ -1183,6 +1191,168 @@ out:
 	return err;
 }
 
+static int checksum_setup_ipv6(struct xenvif *vif, struct sk_buff *skb,
+			       int recalculate_partial_csum)
+{
+	int err = -EPROTO;
+	struct ipv6hdr *ipv6h = (void *)skb->data;
+	u8 nexthdr;
+	unsigned int header_size;
+	unsigned int off;
+	bool done;
+	bool found;
+
+	done = false;
+	found = false;
+
+	off = sizeof(struct ipv6hdr);
+
+	header_size = skb->network_header + off;
+	maybe_pull_tail(skb, header_size);
+
+	nexthdr = ipv6h->nexthdr;
+
+	while ((off <= sizeof(struct ipv6hdr) + ntohs(ipv6h->payload_len)) &&
+	       !done) {
+		switch (nexthdr) {
+		case IPPROTO_FRAGMENT: {
+			struct frag_hdr *hp = (void *)(skb->data + off);
+
+			header_size = skb->network_header +
+				off +
+				sizeof(struct frag_hdr);
+			maybe_pull_tail(skb, header_size);
+
+			nexthdr = hp->nexthdr;
+			off += 8;
+			break;
+		}
+		case IPPROTO_DSTOPTS:
+		case IPPROTO_HOPOPTS:
+		case IPPROTO_ROUTING: {
+			struct ipv6_opt_hdr *hp = (void *)(skb->data + off);
+
+			header_size = skb->network_header +
+				off +
+				sizeof(struct ipv6_opt_hdr);
+			maybe_pull_tail(skb, header_size);
+
+			nexthdr = hp->nexthdr;
+			off += ipv6_optlen(hp);
+			break;
+		}
+		case IPPROTO_AH: {
+			struct ip_auth_hdr *hp = (void *)(skb->data + off);
+
+			header_size = skb->network_header +
+				off +
+				sizeof(struct ip_auth_hdr);
+			maybe_pull_tail(skb, header_size);
+
+			nexthdr = hp->nexthdr;
+			off += (hp->hdrlen+2)<<2;
+			break;
+		}
+		case IPPROTO_TCP:
+		case IPPROTO_UDP:
+			found = true;
+			/* Fall through */
+		default:
+			done = true;
+			break;
+		}
+	}
+
+	if (!done) {
+		if (net_ratelimit())
+			netdev_err(vif->dev, "Failed to parse packet header\n");
+		goto out;
+	}
+
+	if (!found) {
+		if (net_ratelimit())
+			netdev_err(vif->dev,
+				   "Attempting to checksum a non-TCP/UDP packet, "
+				   "dropping a protocol %d packet\n",
+				   nexthdr);
+		goto out;
+	}
+
+	switch (nexthdr) {
+	case IPPROTO_TCP:
+		if (!skb_partial_csum_set(skb, off,
+					  offsetof(struct tcphdr, check)))
+			goto out;
+
+		if (recalculate_partial_csum) {
+			struct tcphdr *tcph = tcp_hdr(skb);
+
+			header_size = skb->network_header +
+				off +
+				sizeof(struct tcphdr);
+			maybe_pull_tail(skb, header_size);
+
+			tcph->check = ~csum_ipv6_magic(&ipv6h->saddr,
+						       &ipv6h->daddr,
+						       skb->len - off,
+						       IPPROTO_TCP, 0);
+		}
+		break;
+	case IPPROTO_UDP:
+		if (!skb_partial_csum_set(skb, off,
+					  offsetof(struct udphdr, check)))
+			goto out;
+
+		if (recalculate_partial_csum) {
+			struct udphdr *udph = udp_hdr(skb);
+
+			header_size = skb->network_header +
+				off +
+				sizeof(struct udphdr);
+			maybe_pull_tail(skb, header_size);
+
+			udph->check = ~csum_ipv6_magic(&ipv6h->saddr,
+						       &ipv6h->daddr,
+						       skb->len - off,
+						       IPPROTO_UDP, 0);
+		}
+		break;
+	}
+
+	err = 0;
+
+out:
+	return err;
+}
+
+static int checksum_setup(struct xenvif *vif, struct sk_buff *skb)
+{
+	int err = -EPROTO;
+	int recalculate_partial_csum = 0;
+
+	/* A GSO SKB must be CHECKSUM_PARTIAL. However some buggy
+	 * peers can fail to set NETRXF_csum_blank when sending a GSO
+	 * frame. In this case force the SKB to CHECKSUM_PARTIAL and
+	 * recalculate the partial checksum.
+	 */
+	if (skb->ip_summed != CHECKSUM_PARTIAL && skb_is_gso(skb)) {
+		vif->rx_gso_checksum_fixup++;
+		skb->ip_summed = CHECKSUM_PARTIAL;
+		recalculate_partial_csum = 1;
+	}
+
+	/* A non-CHECKSUM_PARTIAL SKB does not require setup. */
+	if (skb->ip_summed != CHECKSUM_PARTIAL)
+		return 0;
+
+	if (skb->protocol == htons(ETH_P_IP))
+		err = checksum_setup_ip(vif, skb, recalculate_partial_csum);
+	else if (skb->protocol == htons(ETH_P_IPV6))
+		err = checksum_setup_ipv6(vif, skb, recalculate_partial_csum);
+
+	return err;
+}
+
 static bool tx_credit_exceeded(struct xenvif *vif, unsigned size)
 {
 	unsigned long now = jiffies;
@@ -1428,12 +1598,7 @@ static int xenvif_tx_submit(struct xenvif *vif, int budget)
 
 		xenvif_fill_frags(vif, skb);
 
-		/*
-		 * If the initial fragment was < PKT_PROT_LEN then
-		 * pull through some bytes from the other fragments to
-		 * increase the linear region to PKT_PROT_LEN bytes.
-		 */
-		if (skb_headlen(skb) < PKT_PROT_LEN && skb_is_nonlinear(skb)) {
+		if (skb_is_nonlinear(skb) && skb_headlen(skb) < PKT_PROT_LEN) {
 			int target = min_t(int, skb->len, PKT_PROT_LEN);
 			__pskb_pull_tail(skb, target - skb_headlen(skb));
 		}
diff --git a/drivers/net/xen-netback/xenbus.c b/drivers/net/xen-netback/xenbus.c
index 99343ca..9c9b37d 100644
--- a/drivers/net/xen-netback/xenbus.c
+++ b/drivers/net/xen-netback/xenbus.c
@@ -105,6 +105,15 @@ static int netback_probe(struct xenbus_device *dev,
 			goto abort_transaction;
 		}
 
+		/* We support partial checksum setup for IPv6 packets */
+		err = xenbus_printf(xbt, dev->nodename,
+				    "feature-ipv6-csum-offload",
+				    "%d", 1);
+		if (err) {
+			message = "writing feature-ipv6-csum-offload";
+			goto abort_transaction;
+		}
+
 		/* We support rx-copy path. */
 		err = xenbus_printf(xbt, dev->nodename,
 				    "feature-rx-copy", "%d", 1);
-- 
1.7.10.4

^ permalink raw reply related

* [PATCH net-next v3 5/5] xen-netback: enable IPv6 TCP GSO to the guest
From: Paul Durrant @ 2013-10-10 15:25 UTC (permalink / raw)
  To: xen-devel, netdev; +Cc: Paul Durrant, Wei Liu, David Vrabel, Ian Campbell
In-Reply-To: <1381418733-20470-1-git-send-email-paul.durrant@citrix.com>

This patch adds code to handle SKB_GSO_TCPV6 skbs and construct appropriate
extra or prefix segments to pass the large packet to the frontend. New
xenstore flags, feature-gso-tcpv6 and feature-gso-tcpv6-prefix, are sampled
to determine if the frontend is capable of handling such packets.

Signed-off-by: Paul Durrant <paul.durrant@citrix.com>
Cc: Wei Liu <wei.liu2@citrix.com>
Cc: David Vrabel <david.vrabel@citrix.com>
Cc: Ian Campbell <ian.campbell@citrix.com>
---
 drivers/net/xen-netback/common.h    |    9 +++++--
 drivers/net/xen-netback/interface.c |    6 +++--
 drivers/net/xen-netback/netback.c   |   48 +++++++++++++++++++++++++++--------
 drivers/net/xen-netback/xenbus.c    |   29 +++++++++++++++++++--
 include/xen/interface/io/netif.h    |    1 +
 5 files changed, 77 insertions(+), 16 deletions(-)

diff --git a/drivers/net/xen-netback/common.h b/drivers/net/xen-netback/common.h
index b4a9a3c..55b8dec 100644
--- a/drivers/net/xen-netback/common.h
+++ b/drivers/net/xen-netback/common.h
@@ -87,9 +87,13 @@ struct pending_tx_info {
 struct xenvif_rx_meta {
 	int id;
 	int size;
+	int gso_type;
 	int gso_size;
 };
 
+#define GSO_BIT(type) \
+	(1 << XEN_NETIF_GSO_TYPE_ ## type)
+
 /* Discriminate from any valid pending_idx value. */
 #define INVALID_PENDING_IDX 0xFFFF
 
@@ -150,9 +154,10 @@ struct xenvif {
 	u8               fe_dev_addr[6];
 
 	/* Frontend feature information. */
+	int gso_mask;
+	int gso_prefix_mask;
+
 	u8 can_sg:1;
-	u8 gso:1;
-	u8 gso_prefix:1;
 	u8 ip_csum:1;
 	u8 ipv6_csum:1;
 
diff --git a/drivers/net/xen-netback/interface.c b/drivers/net/xen-netback/interface.c
index cb0d8ea..e4aa267 100644
--- a/drivers/net/xen-netback/interface.c
+++ b/drivers/net/xen-netback/interface.c
@@ -214,8 +214,10 @@ static netdev_features_t xenvif_fix_features(struct net_device *dev,
 
 	if (!vif->can_sg)
 		features &= ~NETIF_F_SG;
-	if (!vif->gso && !vif->gso_prefix)
+	if (~(vif->gso_mask | vif->gso_prefix_mask) & GSO_BIT(TCPV4))
 		features &= ~NETIF_F_TSO;
+	if (~(vif->gso_mask | vif->gso_prefix_mask) & GSO_BIT(TCPV6))
+		features &= ~NETIF_F_TSO6;
 	if (!vif->ip_csum)
 		features &= ~NETIF_F_IP_CSUM;
 	if (!vif->ipv6_csum)
@@ -320,7 +322,7 @@ struct xenvif *xenvif_alloc(struct device *parent, domid_t domid,
 	dev->netdev_ops	= &xenvif_netdev_ops;
 	dev->hw_features = NETIF_F_SG |
 		NETIF_F_IP_CSUM | NETIF_F_IPV6_CSUM |
-		NETIF_F_TSO;
+		NETIF_F_TSO | NETIF_F_TSO6;
 	dev->features = dev->hw_features | NETIF_F_RXCSUM;
 	SET_ETHTOOL_OPS(dev, &xenvif_ethtool_ops);
 
diff --git a/drivers/net/xen-netback/netback.c b/drivers/net/xen-netback/netback.c
index afc055c..bb31921 100644
--- a/drivers/net/xen-netback/netback.c
+++ b/drivers/net/xen-netback/netback.c
@@ -140,7 +140,7 @@ static int max_required_rx_slots(struct xenvif *vif)
 	int max = DIV_ROUND_UP(vif->dev->mtu, PAGE_SIZE);
 
 	/* XXX FIXME: RX path dependent on MAX_SKB_FRAGS */
-	if (vif->can_sg || vif->gso || vif->gso_prefix)
+	if (vif->can_sg || vif->gso_mask || vif->gso_prefix_mask)
 		max += MAX_SKB_FRAGS + 1; /* extra_info + frags */
 
 	return max;
@@ -312,6 +312,7 @@ static struct xenvif_rx_meta *get_next_rx_buffer(struct xenvif *vif,
 	req = RING_GET_REQUEST(&vif->rx, vif->rx.req_cons++);
 
 	meta = npo->meta + npo->meta_prod++;
+	meta->gso_type = 0;
 	meta->gso_size = 0;
 	meta->size = 0;
 	meta->id = req->id;
@@ -334,6 +335,7 @@ static void xenvif_gop_frag_copy(struct xenvif *vif, struct sk_buff *skb,
 	struct gnttab_copy *copy_gop;
 	struct xenvif_rx_meta *meta;
 	unsigned long bytes;
+	int gso_type;
 
 	/* Data must not cross a page boundary. */
 	BUG_ON(size + offset > PAGE_SIZE<<compound_order(page));
@@ -392,7 +394,14 @@ static void xenvif_gop_frag_copy(struct xenvif *vif, struct sk_buff *skb,
 		}
 
 		/* Leave a gap for the GSO descriptor. */
-		if (*head && skb_shinfo(skb)->gso_size && !vif->gso_prefix)
+		if (skb_shinfo(skb)->gso_type & SKB_GSO_TCPV4)
+			gso_type = XEN_NETIF_GSO_TYPE_TCPV4;
+		else if (skb_shinfo(skb)->gso_type & SKB_GSO_TCPV6)
+			gso_type = XEN_NETIF_GSO_TYPE_TCPV6;
+		else
+			gso_type = XEN_NETIF_GSO_TYPE_NONE;
+
+		if (*head && ((1 << gso_type) & vif->gso_mask))
 			vif->rx.req_cons++;
 
 		*head = 0; /* There must be something in this buffer now. */
@@ -423,14 +432,28 @@ static int xenvif_gop_skb(struct sk_buff *skb,
 	unsigned char *data;
 	int head = 1;
 	int old_meta_prod;
+	int gso_type;
+	int gso_size;
 
 	old_meta_prod = npo->meta_prod;
 
+	if (skb_shinfo(skb)->gso_type & SKB_GSO_TCPV4) {
+		gso_type = XEN_NETIF_GSO_TYPE_TCPV4;
+		gso_size = skb_shinfo(skb)->gso_size;
+	} else if (skb_shinfo(skb)->gso_type & SKB_GSO_TCPV6) {
+		gso_type = XEN_NETIF_GSO_TYPE_TCPV6;
+		gso_size = skb_shinfo(skb)->gso_size;
+	} else {
+		gso_type = 0;
+		gso_size = 0;
+	}
+
 	/* Set up a GSO prefix descriptor, if necessary */
-	if (skb_shinfo(skb)->gso_size && vif->gso_prefix) {
+	if ((1 << skb_shinfo(skb)->gso_type) & vif->gso_prefix_mask) {
 		req = RING_GET_REQUEST(&vif->rx, vif->rx.req_cons++);
 		meta = npo->meta + npo->meta_prod++;
-		meta->gso_size = skb_shinfo(skb)->gso_size;
+		meta->gso_type = gso_type;
+		meta->gso_size = gso_size;
 		meta->size = 0;
 		meta->id = req->id;
 	}
@@ -438,10 +461,13 @@ static int xenvif_gop_skb(struct sk_buff *skb,
 	req = RING_GET_REQUEST(&vif->rx, vif->rx.req_cons++);
 	meta = npo->meta + npo->meta_prod++;
 
-	if (!vif->gso_prefix)
-		meta->gso_size = skb_shinfo(skb)->gso_size;
-	else
+	if ((1 << gso_type) & vif->gso_mask) {
+		meta->gso_type = gso_type;
+		meta->gso_size = gso_size;
+	} else {
+		meta->gso_type = 0;
 		meta->gso_size = 0;
+	}
 
 	meta->size = 0;
 	meta->id = req->id;
@@ -587,7 +613,8 @@ void xenvif_rx_action(struct xenvif *vif)
 
 		vif = netdev_priv(skb->dev);
 
-		if (vif->meta[npo.meta_cons].gso_size && vif->gso_prefix) {
+		if ((1 << vif->meta[npo.meta_cons].gso_type) &
+		    vif->gso_prefix_mask) {
 			resp = RING_GET_RESPONSE(&vif->rx,
 						 vif->rx.rsp_prod_pvt++);
 
@@ -624,7 +651,8 @@ void xenvif_rx_action(struct xenvif *vif)
 					vif->meta[npo.meta_cons].size,
 					flags);
 
-		if (vif->meta[npo.meta_cons].gso_size && !vif->gso_prefix) {
+		if ((1 << vif->meta[npo.meta_cons].gso_type) &
+		    vif->gso_mask) {
 			struct xen_netif_extra_info *gso =
 				(struct xen_netif_extra_info *)
 				RING_GET_RESPONSE(&vif->rx,
@@ -632,8 +660,8 @@ void xenvif_rx_action(struct xenvif *vif)
 
 			resp->flags |= XEN_NETRXF_extra_info;
 
+			gso->u.gso.type = vif->meta[npo.meta_cons].gso_type;
 			gso->u.gso.size = vif->meta[npo.meta_cons].gso_size;
-			gso->u.gso.type = XEN_NETIF_GSO_TYPE_TCPV4;
 			gso->u.gso.pad = 0;
 			gso->u.gso.features = 0;
 
diff --git a/drivers/net/xen-netback/xenbus.c b/drivers/net/xen-netback/xenbus.c
index 7e4dcc9..6a005f1 100644
--- a/drivers/net/xen-netback/xenbus.c
+++ b/drivers/net/xen-netback/xenbus.c
@@ -577,15 +577,40 @@ static int connect_rings(struct backend_info *be)
 		val = 0;
 	vif->can_sg = !!val;
 
+	vif->gso_mask = 0;
+	vif->gso_prefix_mask = 0;
+
 	if (xenbus_scanf(XBT_NIL, dev->otherend, "feature-gso-tcpv4",
 			 "%d", &val) < 0)
 		val = 0;
-	vif->gso = !!val;
+	if (val)
+		vif->gso_mask |= GSO_BIT(TCPV4);
 
 	if (xenbus_scanf(XBT_NIL, dev->otherend, "feature-gso-tcpv4-prefix",
 			 "%d", &val) < 0)
 		val = 0;
-	vif->gso_prefix = !!val;
+	if (val)
+		vif->gso_prefix_mask |= GSO_BIT(TCPV4);
+
+	if (xenbus_scanf(XBT_NIL, dev->otherend, "feature-gso-tcpv6",
+			 "%d", &val) < 0)
+		val = 0;
+	if (val)
+		vif->gso_mask |= GSO_BIT(TCPV6);
+
+	if (xenbus_scanf(XBT_NIL, dev->otherend, "feature-gso-tcpv6-prefix",
+			 "%d", &val) < 0)
+		val = 0;
+	if (val)
+		vif->gso_prefix_mask |= GSO_BIT(TCPV6);
+
+	if (vif->gso_mask & vif->gso_prefix_mask) {
+		xenbus_dev_fatal(dev, err,
+				 "%s: gso and gso prefix flags are not "
+				 "mutually exclusive",
+				 dev->otherend);
+		return -EOPNOTSUPP;
+	}
 
 	/* Before feature-ipv6-csum-offload was introduced, checksum offload
 	 * was turned on by a zero value in (or lack of)
diff --git a/include/xen/interface/io/netif.h b/include/xen/interface/io/netif.h
index d7dd8d7..41674bc 100644
--- a/include/xen/interface/io/netif.h
+++ b/include/xen/interface/io/netif.h
@@ -113,6 +113,7 @@ struct xen_netif_tx_request {
 #define  XEN_NETIF_EXTRA_FLAG_MORE	(1U<<_XEN_NETIF_EXTRA_FLAG_MORE)
 
 /* GSO types */
+#define XEN_NETIF_GSO_TYPE_NONE         (0)
 #define XEN_NETIF_GSO_TYPE_TCPV4	(1)
 #define XEN_NETIF_GSO_TYPE_TCPV6	(2)
 
-- 
1.7.10.4

^ permalink raw reply related

* [PATCH net-next v3 4/5] xen-netback: handle IPv6 TCP GSO packets from the guest
From: Paul Durrant @ 2013-10-10 15:25 UTC (permalink / raw)
  To: xen-devel, netdev; +Cc: Paul Durrant, Wei Liu, David Vrabel, Ian Campbell
In-Reply-To: <1381418733-20470-1-git-send-email-paul.durrant@citrix.com>

This patch adds a xenstore feature flag, festure-gso-tcpv6, to advertise
that netback can handle IPv6 TCP GSO packets. It creates SKB_GSO_TCPV6 skbs
if the frontend passes an extra segment with the new type
XEN_NETIF_GSO_TYPE_TCPV6 added to netif.h.

Signed-off-by: Paul Durrant <paul.durrant@citrix.com>
Cc: Wei Liu <wei.liu2@citrix.com>
Cc: David Vrabel <david.vrabel@citrix.com>
Cc: Ian Campbell <ian.campbell@citrix.com>
---
 drivers/net/xen-netback/netback.c |   11 ++++++++---
 drivers/net/xen-netback/xenbus.c  |    7 +++++++
 include/xen/interface/io/netif.h  |   10 +++++++++-
 3 files changed, 24 insertions(+), 4 deletions(-)

diff --git a/drivers/net/xen-netback/netback.c b/drivers/net/xen-netback/netback.c
index 4c12030..afc055c 100644
--- a/drivers/net/xen-netback/netback.c
+++ b/drivers/net/xen-netback/netback.c
@@ -1096,15 +1096,20 @@ static int xenvif_set_skb_gso(struct xenvif *vif,
 		return -EINVAL;
 	}
 
-	/* Currently only TCPv4 S.O. is supported. */
-	if (gso->u.gso.type != XEN_NETIF_GSO_TYPE_TCPV4) {
+	switch (gso->u.gso.type) {
+	case XEN_NETIF_GSO_TYPE_TCPV4:
+		skb_shinfo(skb)->gso_type = SKB_GSO_TCPV4;
+		break;
+	case XEN_NETIF_GSO_TYPE_TCPV6:
+		skb_shinfo(skb)->gso_type = SKB_GSO_TCPV6;
+		break;
+	default:
 		netdev_err(vif->dev, "Bad GSO type %d.\n", gso->u.gso.type);
 		xenvif_fatal_tx_err(vif);
 		return -EINVAL;
 	}
 
 	skb_shinfo(skb)->gso_size = gso->u.gso.size;
-	skb_shinfo(skb)->gso_type = SKB_GSO_TCPV4;
 
 	/* Header must be checked, and gso_segs computed. */
 	skb_shinfo(skb)->gso_type |= SKB_GSO_DODGY;
diff --git a/drivers/net/xen-netback/xenbus.c b/drivers/net/xen-netback/xenbus.c
index 9c9b37d..7e4dcc9 100644
--- a/drivers/net/xen-netback/xenbus.c
+++ b/drivers/net/xen-netback/xenbus.c
@@ -105,6 +105,13 @@ static int netback_probe(struct xenbus_device *dev,
 			goto abort_transaction;
 		}
 
+		err = xenbus_printf(xbt, dev->nodename, "feature-gso-tcpv6",
+				    "%d", sg);
+		if (err) {
+			message = "writing feature-gso-tcpv6";
+			goto abort_transaction;
+		}
+
 		/* We support partial checksum setup for IPv6 packets */
 		err = xenbus_printf(xbt, dev->nodename,
 				    "feature-ipv6-csum-offload",
diff --git a/include/xen/interface/io/netif.h b/include/xen/interface/io/netif.h
index d9fb44739..d7dd8d7 100644
--- a/include/xen/interface/io/netif.h
+++ b/include/xen/interface/io/netif.h
@@ -61,6 +61,13 @@
  */
 
 /*
+ * "feature-gso-tcpv4" and "feature-gso-tcpv6" advertise the capability to
+ * handle large TCP packets (in IPv4 or IPv6 form respectively). Neither
+ * frontends nor backends are assumed to be capable unless the flags are
+ * present.
+ */
+
+/*
  * This is the 'wire' format for packets:
  *  Request 1: xen_netif_tx_request  -- XEN_NETTXF_* (any flags)
  * [Request 2: xen_netif_extra_info]    (only if request 1 has XEN_NETTXF_extra_info)
@@ -105,8 +112,9 @@ struct xen_netif_tx_request {
 #define _XEN_NETIF_EXTRA_FLAG_MORE	(0)
 #define  XEN_NETIF_EXTRA_FLAG_MORE	(1U<<_XEN_NETIF_EXTRA_FLAG_MORE)
 
-/* GSO types - only TCPv4 currently supported. */
+/* GSO types */
 #define XEN_NETIF_GSO_TYPE_TCPV4	(1)
+#define XEN_NETIF_GSO_TYPE_TCPV6	(2)
 
 /*
  * This structure needs to fit within both netif_tx_request and
-- 
1.7.10.4

^ permalink raw reply related

* [PATCH net-next v3 3/5] xen-netback: Unconditionally set NETIF_F_RXCSUM
From: Paul Durrant @ 2013-10-10 15:25 UTC (permalink / raw)
  To: xen-devel, netdev; +Cc: Paul Durrant, Wei Liu, David Vrabel, Ian Campbell
In-Reply-To: <1381418733-20470-1-git-send-email-paul.durrant@citrix.com>

There is no mechanism to insist that a guest always generates a packet
with good checksum (at least for IPv4) so we must handle checksum
offloading from the guest and hence should set NETIF_F_RXCSUM.

Signed-off-by: Paul Durrant <paul.durrant@citrix.com>
Cc: Wei Liu <wei.liu2@citrix.com>
Cc: David Vrabel <david.vrabel@citrix.com>
Cc: Ian Campbell <ian.campbell@citrix.com>
---
 drivers/net/xen-netback/interface.c |    2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/net/xen-netback/interface.c b/drivers/net/xen-netback/interface.c
index 8e92783..cb0d8ea 100644
--- a/drivers/net/xen-netback/interface.c
+++ b/drivers/net/xen-netback/interface.c
@@ -321,7 +321,7 @@ struct xenvif *xenvif_alloc(struct device *parent, domid_t domid,
 	dev->hw_features = NETIF_F_SG |
 		NETIF_F_IP_CSUM | NETIF_F_IPV6_CSUM |
 		NETIF_F_TSO;
-	dev->features = dev->hw_features;
+	dev->features = dev->hw_features | NETIF_F_RXCSUM;
 	SET_ETHTOOL_OPS(dev, &xenvif_ethtool_ops);
 
 	dev->tx_queue_len = XENVIF_QUEUE_LENGTH;
-- 
1.7.10.4

^ permalink raw reply related

* [PATCH net-next v3 1/5] xen-netback: add support for IPv6 checksum offload to guest
From: Paul Durrant @ 2013-10-10 15:25 UTC (permalink / raw)
  To: xen-devel, netdev; +Cc: Paul Durrant, Wei Liu, David Vrabel, Ian Campbell
In-Reply-To: <1381418733-20470-1-git-send-email-paul.durrant@citrix.com>

Check xenstore flag feature-ipv6-csum-offload to determine if a
guest is happy to accept IPv6 packets with only partial checksum.
Also check analogous feature-ip-csum-offload to determine if a
guest is happy to accept IPv4 packets with only partial checksum
as a replacement for a negated feature-no-csum-offload value and
add a comment to deprecate use of feature-no-csum-offload.

Signed-off-by: Paul Durrant <paul.durrant@citrix.com>
Cc: Wei Liu <wei.liu2@citrix.com>
Cc: David Vrabel <david.vrabel@citrix.com>
Cc: Ian Campbell <ian.campbell@citrix.com>
---
 drivers/net/xen-netback/common.h    |    3 ++-
 drivers/net/xen-netback/interface.c |   10 +++++++---
 drivers/net/xen-netback/xenbus.c    |   24 +++++++++++++++++++++++-
 include/xen/interface/io/netif.h    |   10 ++++++++++
 4 files changed, 42 insertions(+), 5 deletions(-)

diff --git a/drivers/net/xen-netback/common.h b/drivers/net/xen-netback/common.h
index 5715318..b4a9a3c 100644
--- a/drivers/net/xen-netback/common.h
+++ b/drivers/net/xen-netback/common.h
@@ -153,7 +153,8 @@ struct xenvif {
 	u8 can_sg:1;
 	u8 gso:1;
 	u8 gso_prefix:1;
-	u8 csum:1;
+	u8 ip_csum:1;
+	u8 ipv6_csum:1;
 
 	/* Internal feature information. */
 	u8 can_queue:1;	    /* can queue packets for receiver? */
diff --git a/drivers/net/xen-netback/interface.c b/drivers/net/xen-netback/interface.c
index 01bb854..8e92783 100644
--- a/drivers/net/xen-netback/interface.c
+++ b/drivers/net/xen-netback/interface.c
@@ -216,8 +216,10 @@ static netdev_features_t xenvif_fix_features(struct net_device *dev,
 		features &= ~NETIF_F_SG;
 	if (!vif->gso && !vif->gso_prefix)
 		features &= ~NETIF_F_TSO;
-	if (!vif->csum)
+	if (!vif->ip_csum)
 		features &= ~NETIF_F_IP_CSUM;
+	if (!vif->ipv6_csum)
+		features &= ~NETIF_F_IPV6_CSUM;
 
 	return features;
 }
@@ -306,7 +308,7 @@ struct xenvif *xenvif_alloc(struct device *parent, domid_t domid,
 	vif->domid  = domid;
 	vif->handle = handle;
 	vif->can_sg = 1;
-	vif->csum = 1;
+	vif->ip_csum = 1;
 	vif->dev = dev;
 
 	vif->credit_bytes = vif->remaining_credit = ~0UL;
@@ -316,7 +318,9 @@ struct xenvif *xenvif_alloc(struct device *parent, domid_t domid,
 	vif->credit_timeout.expires = jiffies;
 
 	dev->netdev_ops	= &xenvif_netdev_ops;
-	dev->hw_features = NETIF_F_SG | NETIF_F_IP_CSUM | NETIF_F_TSO;
+	dev->hw_features = NETIF_F_SG |
+		NETIF_F_IP_CSUM | NETIF_F_IPV6_CSUM |
+		NETIF_F_TSO;
 	dev->features = dev->hw_features;
 	SET_ETHTOOL_OPS(dev, &xenvif_ethtool_ops);
 
diff --git a/drivers/net/xen-netback/xenbus.c b/drivers/net/xen-netback/xenbus.c
index 1b08d87..99343ca 100644
--- a/drivers/net/xen-netback/xenbus.c
+++ b/drivers/net/xen-netback/xenbus.c
@@ -571,10 +571,32 @@ static int connect_rings(struct backend_info *be)
 		val = 0;
 	vif->gso_prefix = !!val;
 
+	/* Before feature-ipv6-csum-offload was introduced, checksum offload
+	 * was turned on by a zero value in (or lack of)
+	 * feature-no-csum-offload. Therefore, for compatibility, assume
+	 * that if feature-ip-csum-offload is missing that IP (v4) offload
+	 * is turned on. If this is not the case then the flag will be
+	 * cleared when we subsequently sample feature-no-csum-offload.
+	 */
+	if (xenbus_scanf(XBT_NIL, dev->otherend, "feature-ip-csum-offload",
+			 "%d", &val) < 0)
+		val = 1;
+	vif->ip_csum = !!val;
+
+	if (xenbus_scanf(XBT_NIL, dev->otherend, "feature-ipv6-csum-offload",
+			 "%d", &val) < 0)
+		val = 0;
+	vif->ipv6_csum = !!val;
+
+	/* This is deprecated. Frontends should set IP v4 and v6 checksum
+	 * offload using feature-ip-csum-offload and
+	 * feature-ipv6-csum-offload respectively.
+	 */
 	if (xenbus_scanf(XBT_NIL, dev->otherend, "feature-no-csum-offload",
 			 "%d", &val) < 0)
 		val = 0;
-	vif->csum = !val;
+	if (val)
+		vif->ip_csum = 0;
 
 	/* Map the shared frame, irq etc. */
 	err = xenvif_connect(vif, tx_ring_ref, rx_ring_ref,
diff --git a/include/xen/interface/io/netif.h b/include/xen/interface/io/netif.h
index eb262e3..d9fb44739 100644
--- a/include/xen/interface/io/netif.h
+++ b/include/xen/interface/io/netif.h
@@ -51,6 +51,16 @@
  */
 
 /*
+ * "feature-no-csum-offload" was used to turn off IPv4 TCP/UDP checksum
+ * offload but is now deprecated. Two new feature flags should now be used
+ * to control checksum offload:
+ * "feature-ip-csum-offload" should be used to turn IPv4 TCP/UDP checksum
+ * offload on or off. If it is missing then the feature is assumed to be on.
+ * "feature-ipv6-csum-offload" should be used to turn IPv6 TCP/UDP checksum
+ * offload on or off. If it is missing then the feature is assumed to be off.
+ */
+
+/*
  * This is the 'wire' format for packets:
  *  Request 1: xen_netif_tx_request  -- XEN_NETTXF_* (any flags)
  * [Request 2: xen_netif_extra_info]    (only if request 1 has XEN_NETTXF_extra_info)
-- 
1.7.10.4

^ permalink raw reply related

* [PATCH net-next v3 0/5] xen-netback: IPv6 offload support
From: Paul Durrant @ 2013-10-10 15:25 UTC (permalink / raw)
  To: xen-devel, netdev

This patch series adds support for checksum and large packet offloads into
xen-netback.
Testing has mainly been done using the Microsoft network hardware
certification suite running in Server 2008R2 VMs with Citrix PV frontends.

v2:
- Fixed Wei's email address in Cc lines

v3:
- Responded to Wei's comments:
 - netif.h now updated with comments and a definition of
   XEN_NETIF_GSO_TYPE_NONE.
 - limited number of pullups
- Responded to Annie's comments:
 - New GSO_BIT macro

^ permalink raw reply

* LOAN OFFER.
From: tenger @ 2013-10-10 12:53 UTC (permalink / raw)
  To: Recipients

Apply for loan at 3% interest rate. Contact us with your requirement via Email; tenger.g@mail.com, for more information. Try us and be convinced.

^ permalink raw reply

* Re: [PATCH net] {xfrm, sctp} Stick to software crc32 even if hardware is capable of that
From: Vlad Yasevich @ 2013-10-10 14:11 UTC (permalink / raw)
  To: Fan Du, nhorman, steffen.klassert; +Cc: davem, netdev
In-Reply-To: <1381384296-1821-1-git-send-email-fan.du@windriver.com>

On 10/10/2013 01:51 AM, Fan Du wrote:
> igb/ixgbe have hardware sctp checksum support, when this feature is enabled
> and also IPsec is armed to protect sctp traffic, ugly things happened as
> xfrm_output checks CHECKSUM_PARTIAL to do check sum operation(sum every thing
> up and pack the 16bits result in the checksum field). The result is fail
> establishment of sctp communication.
>
> And I saw another point in this part of code, when IPsec is not armed,
> sctp communication is good, however setting setting CHECKSUM_PARTIAL will
> make xfrm_output compute dummy checksum values which will be overwritten by
> hardware lately.
>
> So this patch try to solve above two issues together.
>
> Signed-off-by: Fan Du <fan.du@windriver.com>
> ---
> note:
>    igb/ixgbe hardware is not handy on my side, so just build test only.
>
> ---
>   net/sctp/output.c |   27 +++++++++++++++++++--------
>   1 file changed, 19 insertions(+), 8 deletions(-)
>
> diff --git a/net/sctp/output.c b/net/sctp/output.c
> index 0ac3a65..f0b9cc5 100644
> --- a/net/sctp/output.c
> +++ b/net/sctp/output.c
> @@ -372,6 +372,16 @@ static void sctp_packet_set_owner_w(struct sk_buff *skb, struct sock *sk)
>   	atomic_inc(&sk->sk_wmem_alloc);
>   }
>
> +static int is_xfrm_armed(struct dst_entry *dst)
> +{
> +#ifdef CONFIG_XFRM
> +	/* If dst->xfrm is valid, this skb needs to be transformed */
> +	return dst->xfrm != NULL;
> +#else
> +	return 0;
> +#endif
> +}
> +
>   /* All packets are sent to the network through this function from
>    * sctp_outq_tail().
>    *
> @@ -536,20 +546,21 @@ int sctp_packet_transmit(struct sctp_packet *packet)
>   	 * by CRC32-C as described in <draft-ietf-tsvwg-sctpcsum-02.txt>.
>   	 */
>   	if (!sctp_checksum_disable) {
> -		if (!(dst->dev->features & NETIF_F_SCTP_CSUM)) {
> +		if ((!(dst->dev->features & NETIF_F_SCTP_CSUM)) ||
> +			is_xfrm_armed(dst)) {
> +
>   			__u32 crc32 = sctp_start_cksum((__u8 *)sh, cksum_buf_len);
>
>   			/* 3) Put the resultant value into the checksum field in the
>   			 *    common header, and leave the rest of the bits unchanged.
>   			 */
>   			sh->checksum = sctp_end_cksum(crc32);
> -		} else {
> -			/* no need to seed pseudo checksum for SCTP */
> -			nskb->ip_summed = CHECKSUM_PARTIAL;
> -			nskb->csum_start = (skb_transport_header(nskb) -
> -			                    nskb->head);
> -			nskb->csum_offset = offsetof(struct sctphdr, checksum);
> -		}
> +		} else
> +			/* Mark skb as CHECKSUM_UNNECESSARY to let hardware compute
> +			 * the checksum, and also avoid xfrm_output to do unceccessary
> +			 * checksum.
> +			 */
> +			nskb->ip_summed = CHECKSUM_UNNECESSARY;
>   	}

In addition to what Niel said, the use of CHECKSUM_UNNECESSARY is
incorrect as it will cause the nic to not compute the checksum.
The checksum offload depends on the use of CHECKSUM_PARTIAL.

-vlad


>
>   	/* IP layer ECN support
>

^ permalink raw reply

* LOAN OFFER.
From: tenger @ 2013-10-10 12:53 UTC (permalink / raw)
  To: Recipients

Apply for loan at 3% interest rate. Contact us with your requirement via Email; tenger.g@mail.com, for more information. Try us and be convinced.

^ permalink raw reply

page: next (older) | prev (newer) | latest
- recent:[subjects (threaded)|topics (new)|topics (active)]

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox