Netdev List

Netdev List
 help / color / mirror / Atom feed

* [PATCH net] packet: fix heap info leak in PACKET_DIAG_MCLIST sock_diag interface
From: Mathias Krause @ 2016-04-10 10:52 UTC (permalink / raw)
  To: David S. Miller; +Cc: netdev, Eric W. Biederman, Pavel Emelyanov

Because we miss to wipe the remainder of i->addr[] in packet_mc_add(),
pdiag_put_mclist() leaks uninitialized heap bytes via the
PACKET_DIAG_MCLIST netlink attribute.

Fix this by explicitly memset(0)ing the remaining bytes in i->addr[].

Fixes: eea68e2f1a00 ("packet: Report socket mclist info via diag module")
Signed-off-by: Mathias Krause <minipli@googlemail.com>
Cc: Eric W. Biederman <ebiederm@xmission.com>
Cc: Pavel Emelyanov <xemul@parallels.com>
---
The bug itself precedes commit eea68e2f1a00 but the list wasn't exposed
to userland before the introduction of the packet_diag interface.
Therefore the "Fixes:" line on that commit.

 net/packet/af_packet.c |    1 +
 1 file changed, 1 insertion(+)

diff --git a/net/packet/af_packet.c b/net/packet/af_packet.c
index 992396aa635c..86a408cf38d5 100644
--- a/net/packet/af_packet.c
+++ b/net/packet/af_packet.c
@@ -3441,6 +3441,7 @@ static int packet_mc_add(struct sock *sk, struct packet_mreq_max *mreq)
 	i->ifindex = mreq->mr_ifindex;
 	i->alen = mreq->mr_alen;
 	memcpy(i->addr, mreq->mr_address, i->alen);
+	memset(i->addr + i->alen, 0, sizeof(i->addr) - i->alen);
 	i->count = 1;
 	i->next = po->mclist;
 	po->mclist = i;
-- 
1.7.10.4

^ permalink raw reply related

* [PATCH] ath9k: remove duplicate assignment of variable ah
From: Colin King @ 2016-04-10 11:25 UTC (permalink / raw)
  To: QCA ath9k Development, Kalle Valo, linux-wireless, ath9k-devel,
	netdev
  Cc: linux-kernel

From: Colin Ian King <colin.king@canonical.com>

ah is written twice with the same value, remove one of the
redundant assignments to ah.

Signed-off-by: Colin Ian King <colin.king@canonical.com>
---
 drivers/net/wireless/ath/ath9k/init.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/net/wireless/ath/ath9k/init.c b/drivers/net/wireless/ath/ath9k/init.c
index 77ace8d..fb702c4 100644
--- a/drivers/net/wireless/ath/ath9k/init.c
+++ b/drivers/net/wireless/ath/ath9k/init.c
@@ -477,7 +477,7 @@ static void ath9k_eeprom_request_cb(const struct firmware *eeprom_blob,
 static int ath9k_eeprom_request(struct ath_softc *sc, const char *name)
 {
 	struct ath9k_eeprom_ctx ec;
-	struct ath_hw *ah = ah = sc->sc_ah;
+	struct ath_hw *ah = sc->sc_ah;
 	int err;
 
 	/* try to load the EEPROM content asynchronously */
-- 
2.7.4

^ permalink raw reply related

* Re: AP firmware for TI wl1251 wifi chip (wl1251-fw-ap.bin)
From: Pavel Machek @ 2016-04-10 11:51 UTC (permalink / raw)
  To: Machani, Yaniv
  Cc: Pali Rohár, Luciano Coelho, Balbi, Felipe, Chalmers, Kevin,
	Shahar Levi, Kalle Valo, Davis, Andrew, Mishol, Guy, Arik Nemtsov,
	Gery Kahn, Felipe Balbi, Luciano Coelho, David Woodhouse,
	Aaro Koskinen, Ben Hutchings, David Gnedt, Ivaylo Dimitrov,
	Sebastian Reichel, Tony Lindgren, Menon, Nishanth,
	linux-wireless-u79uwXL29TY76Z2rM5mHXA@public.gmane.org,
	netdev-u79uwXL29TY76Z2rM5mHXA@public.gmane.org,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
In-Reply-To: <AE1C82FB3D0EC64DB1F752C81CBD1101390F685E-1tpBd5JUCm6IQmiDNMet8wC/G2K4zDHf@public.gmane.org>

Hi!

> > > wl1251 does not support AP mode, so there is no firmware for it in 
> > > the tree.
> > >
> > > Regards,
> > > Yaniv
> > 
> > Hi Yaniv! I read on some TI whitepaper, that wl1251 hardware supports 
> > some Soft-AP mode. So I expect that either special FW is needed for it 
> > or somehow it is possible to use current released. Do you have any information about it?
> 
> Hi Pali,
> This must be some typo, the device does not support Soft-AP.
> More than that, wl1251 family is not officially supported via the mainline Linux.


> For Soft-AP, and other new features based on Linux you should use WiLink8 chip family.

Too late for that, we already have the devices, and they were manufactured quite long
time ago.

Is it "hardware can't do AP", "firmware can't do AP" or "current drivers 
do not support AP"?

Best regards,
									Pavel
-- 
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html
--
To unsubscribe from this list: send the line "unsubscribe linux-wireless" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply

* Re: [PATCHv2 net-next 1/6] sctp: add sctp_info dump api for sctp_diag
From: Jamal Hadi Salim @ 2016-04-10 13:02 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: Xin Long, network dev, linux-sctp, Marcelo Ricardo Leitner,
	Vlad Yasevich, daniel, davem
In-Reply-To: <1460222507.6473.493.camel@edumazet-glaptop3.roam.corp.google.com>

On 16-04-09 01:21 PM, Eric Dumazet wrote:


> Well, once a hole is there, nothing we can do really, because of
> compatibility with old kernels / old binaries.
>
>
> But when a _new_ structure is defined, this is the time where we can ask
> for doing sensible things ;)
>

This one is fixable. sizeof() already includes the accounting of
the pad. something like:

diff --git a/include/uapi/linux/tcp.h b/include/uapi/linux/tcp.h
index fe95446..52542eb 100644
--- a/include/uapi/linux/tcp.h
+++ b/include/uapi/linux/tcp.h
@@ -158,6 +158,7 @@ struct tcp_info {
         __u8    tcpi_options;
         __u8    tcpi_snd_wscale : 4, tcpi_rcv_wscale : 4;

+       __u8    pad;    /*reuse this space if you need 8bits for something*/
         __u32   tcpi_rto;
         __u32   tcpi_ato;
         __u32   tcpi_snd_mss;

cheers,
jamal

^ permalink raw reply related

* Re: [RFC PATCH v2 1/5] bpf: add PHYS_DEV prog type for early driver filter
From: Jamal Hadi Salim @ 2016-04-10 13:07 UTC (permalink / raw)
  To: Alexei Starovoitov
  Cc: Tom Herbert, Jesper Dangaard Brouer, Brenden Blanco,
	David S. Miller, Linux Kernel Network Developers, Or Gerlitz,
	Daniel Borkmann, Eric Dumazet, Edward Cree, john fastabend,
	Thomas Graf, Johannes Berg, eranlinuxmellanox, Lorenzo Colitti,
	linux-mm, Robert Olsson, kuznet
In-Reply-To: <20160409172634.GA55330@ast-mbp.thefacebook.com>

On 16-04-09 01:26 PM, Alexei Starovoitov wrote:

>
> yeah, no stack, no queues in bpf.

Thanks.

>
>> If this is _forwarding only_ it maybe useful to look at
>> Alexey's old code in particular the DMA bits;
>> he built his own lookup algorithm but sounds like bpf is
>> a much better fit today.
>
> a link to these old bits?
>

Dang. Trying to remember exact name (I think it has been gone for at
least 10 years now). I know it is not CONFIG_NET_FASTROUTE although
it could have been that depending on the driver (tulip had some
nice DMA properties - which by todays standards would be considered
primitive ;->).
+Cc Robert and Alexey (Trying to figure out name of driver based
routing code that DMAed from ingress to egress port)

> Just to be clear: this rfc is not the only thing we're considering.
> In particular huawei guys did a monster effort to improve performance
> in this area as well. We'll try to blend all the code together and
> pick what's the best.
>

Sounds very interesting.

cheers,
jamal

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply

* Re: [PATCH net-next v2 1/2] rtnetlink: add new RTM_GETSTATS message to dump link stats
From: Jamal Hadi Salim @ 2016-04-10 13:48 UTC (permalink / raw)
  To: roopa; +Cc: netdev, davem
In-Reply-To: <57094326.9040009@cumulusnetworks.com>

On 16-04-09 02:00 PM, roopa wrote:

> This EXTENDED_HW_STATS is for ethtool like extended hw stats. This is keeping in
> mind that we want to also move ethtool to netlink in the future and with switchdev
> it becomes more necessary that we provide all stats closer to the other netdev stats.
> So far hw extended stats have always been available through this separate ethtool
> channel. The intent here is to unify the api for extended hw and software only stats.

Ok, so these are _not_ the stats which are broken down by packet size
ranges which are quiet common; but rather "proprietary" per port
type h/w stats? I browsed a couple of users of ethtool_stats and i see
they return proprietary looking 64 bit counters (batman for example
has a very strange meaning to those stats).
What i meant is a lot of ASICS have counters for byte ranges
[64bytes-128bytes], [128bytes-256bytes], etc - sorry i cant pin a name
to those but i am sure you have seen them and i thought those at minimal
need their own TLV since they are always fixed.

> XSTATS is per netdev can be included as a nested attribute inside IFLA_EXTENDED_STATS
> which are per netdev. bridge vlan stats will also fall here.
>

And you are going to distinguish which come from hardware and which are
software derived?

> And this api  provides netdev specific stats. We have always mapped all
> asic stats to the switch port netdev stats. and this api does not cover the non-netdev specific stats.
> If you are for example asking for stats for a hardware offloaded bridge, then yes, they will fall here
> and be available on the bridge netdev. For asic stats that don't map to any netdev, devlink will be an
> appropriate infrastructure IMO.
>
> I am not sure if I answered your question :).
>

It is useful to have this discussion; unfortunately these user APIS once 
are in will never be allowed to change. The answers could come
in time.

>> Should such a command then not be rejected with an error code?
> Dumps with no data are not rejected with an error code AFAIK. ie they don't return
> -ENODATA. This is consistent with all other dumps (unless i missed it).
>   But, if there is a need for an error code, i can certainly check again.
>

It is mostly because you chose a whitelist filter i.e you list what
is allowed to be sent back. And if such a list is missing there
needs to be an opposite default (which is a deny all).

>>> +/* STATS section */
>>> +
>>> +struct if_stats_msg {
>>> +    __u8  family;
>>> +    __u32 ifindex;
>>> +    __u32 filter_mask;
>>> +};
>>
>> Needs to be 32 bit aligned.
>> Do you need 32 bits for the filter mask?
> yes, i think we should keep it minimum 32 bits.

Ok, that is fine then. I thought it wont exceed
3-4 per port type but i could be wrong. 32 bits
should be safer.

>> Perhaps a 16bit mask and an 8bit pad for future use.
>>
>> struct if_stats_msg {
>>             __u32 ifindex;
>>         __u16 filter_mask;
>>         __u8  family;
>>             __u8 pad; /* future use */
>> };
>>
>> Or you could reverse those (from smallest to largest).
>
> The __u8 family needs to be the first field in the structure and at the first byte in the header data.
> hence family is first and i added the others after that. It follows the format for existing such structs (for other message types).
>

True, but it must be 32 bit aligned. So something like:
struct if_stats_msg {
      __u8  family;
      __u8 pad1;
      __u16 pad2;
     __u32 ifindex;
     __u32 filter_mask;
}


> Yeah i remember :). But deferring it for a later incremental feature. That needs some more thought.

NP ;->

> Right now there is an urgent need to get the basic get stats api for a bunch of other stats: mpls, bridge vlan etc.
> Because it is not clean to extend the current stats infra part of the link message for this. So trying to get this in first.
>

Agreed.
The only thing i strongly feel about is the if_stats_msg - please fix
that and lets get at least the basics in. We can resolve other things
with further discussions.

> And this patchset only adds a handler for RTM_NEWSTATS dump and get stats. Your stats events request should be part of the  RTM_NEWSTATS handler and can include other attributes (like timeout) in the future.
>

Ok.

cheers,
jamal
> Thanks,
> Roopa
>

^ permalink raw reply

* Re: [RFC PATCH v2 1/5] bpf: add PHYS_DEV prog type for early driver filter
From: Tom Herbert @ 2016-04-10 16:53 UTC (permalink / raw)
  To: Thomas Graf
  Cc: Alexei Starovoitov, Jamal Hadi Salim, Jesper Dangaard Brouer,
	Brenden Blanco, David S. Miller, Linux Kernel Network Developers,
	Or Gerlitz, Daniel Borkmann, Eric Dumazet, Edward Cree,
	john fastabend, Johannes Berg, eranlinuxmellanox, Lorenzo Colitti,
	linux-mm
In-Reply-To: <20160410075550.GA22873@pox.localdomain>

On Sun, Apr 10, 2016 at 12:55 AM, Thomas Graf <tgraf@suug.ch> wrote:
> On 04/09/16 at 10:26am, Alexei Starovoitov wrote:
>> On Sat, Apr 09, 2016 at 11:29:18AM -0400, Jamal Hadi Salim wrote:
>> > If this is _forwarding only_ it maybe useful to look at
>> > Alexey's old code in particular the DMA bits;
>> > he built his own lookup algorithm but sounds like bpf is
>> > a much better fit today.
>>
>> a link to these old bits?
>>
>> Just to be clear: this rfc is not the only thing we're considering.
>> In particular huawei guys did a monster effort to improve performance
>> in this area as well. We'll try to blend all the code together and
>> pick what's the best.
>
> What's the plan on opening the discussion on this? Can we get a peek?
> Is it an alternative to XDP and the driver hook? Different architecture
> or just different implementation? I understood it as another pseudo
> skb model with a path on converting to real skbs for stack processing.
>
We started discussions about this in IOvisor. The Huawei project is
called ceth (Common Ethernet). It is essentially a layer called
directly from drivers intended for fast path forwarding and network
virtualization. They have put quite a bit of effort into buffer
management and other parts of the infrastructure, much of which we
would like to leverage in XDP. The code is currently in github, will
ask them to make it generally accessible.

The programmability part, essentially BPF, should be part of a common
solution. We can define the necessary interfaces independently of the
underlying infrastructure which is really the only way we can do this
if we want the BPF programs to be portable across different
platforms-- in Linux, userspace, HW, etc.

Tom

> I really like the current proposal by Brenden for its simplicity and
> targeted compatibility with cls_bpf.

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply

* Re: [RFC PATCH v2 1/5] bpf: add PHYS_DEV prog type for early driver filter
From: Jamal Hadi Salim @ 2016-04-10 18:09 UTC (permalink / raw)
  To: Tom Herbert, Thomas Graf
  Cc: Alexei Starovoitov, Jesper Dangaard Brouer, Brenden Blanco,
	David S. Miller, Linux Kernel Network Developers, Or Gerlitz,
	Daniel Borkmann, Eric Dumazet, Edward Cree, john fastabend,
	Johannes Berg, eranlinuxmellanox, Lorenzo Colitti, linux-mm
In-Reply-To: <CALx6S37gzDE0-S4JjS3m+FyuA2xypyp+O+XYdeK3ciAnvmmkyQ@mail.gmail.com>

On 16-04-10 12:53 PM, Tom Herbert wrote:

> We started discussions about this in IOvisor. The Huawei project is
> called ceth (Common Ethernet). It is essentially a layer called
> directly from drivers intended for fast path forwarding and network
> virtualization. They have put quite a bit of effort into buffer
> management and other parts of the infrastructure, much of which we
> would like to leverage in XDP. The code is currently in github, will
> ask them to make it generally accessible.
>

Cant seem to find any info on it on the googles.
If it is forwarding then it should hopefully at least make use of Linux
control APIs I hope.

cheers,
jamal

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply

* Re: [PATCH net-next v2 1/2] rtnetlink: add new RTM_GETSTATS message to dump link stats
From: roopa @ 2016-04-10 18:28 UTC (permalink / raw)
  To: Thomas Graf; +Cc: netdev, jhs, davem
In-Reply-To: <20160410081650.GB22873@pox.localdomain>

On 4/10/16, 1:16 AM, Thomas Graf wrote:
> On 04/08/16 at 11:38pm, Roopa Prabhu wrote:
>> From: Roopa Prabhu <roopa@cumulusnetworks.com>
>>
>> This patch adds a new RTM_GETSTATS message to query link stats via netlink
>> from the kernel. RTM_NEWLINK also dumps stats today, but RTM_NEWLINK
>> returns a lot more than just stats and is expensive in some cases when
>> frequent polling for stats from userspace is a common operation.
>>
>> RTM_GETSTATS is an attempt to provide a light weight netlink message
>> to explicity query only link stats from the kernel on an interface.
>> The idea is to also keep it extensible so that new kinds of stats can be
>> added to it in the future.
>>
>> This patch adds the following attribute for NETDEV stats:
>> struct nla_policy ifla_stats_policy[IFLA_STATS_MAX + 1] = {
>>         [IFLA_STATS_LINK64]  = { .len = sizeof(struct rtnl_link_stats64) },
>> };
>>
>> This patch also allows for af family stats (an example af stats for IPV6
>> is available with the second patch in the series).
>>
>> Like any other rtnetlink message, RTM_GETSTATS can be used to get stats of
>> a single interface or all interfaces with NLM_F_DUMP.
> Awesome stuff Roopa.
>
> This currently ties everything to a net_device with a selector to
> include certain bits of that net_device. How about we take it half a
> step further and allow for non net_device stats such as IP, TCP,
> routing or ipsec stats to be retrieved as well?
yes, absolutely. and that is also the goal.
> A simple array of nested attributes replacing IFLA_STATS_* would
> allow for that, e.g.
>
> 1. {.type = ST_IPSTATS, value = { ...} }
>
> 2. {.type = ST_LINK, .value = {
>     {.type = ST_LINK_NAME, .value = "eth0"},
>     {.type = ST_LINK_Q, .value = 10}
>   }}
>
> 3. ...

One thing though,  Its unclear to me if we absolutely need the additional nest.
Every stats netlink msg has an ifindex in the header (if_stats_msg) if the scope
of the stats is a netdev. If the msg does not have an ifindex in the if_stats_msg,
it represents a global stat. ie Cant a dump, include other stats netlink msgs after
all the netdev msgs are done when the filter has global stat filters ?.
same will apply to RTM_GETSTATS (without NLM_F_DUMP).

Since the msg may potentially have more nest levels
in the IFLA_EXT_STATS categories, just trying to see if i can avoid adding another
top-level nest. We can sure add it if there is no other way to include global
stats in the same dump.
> So for your initial patch, you'd simply add a new top level attribute
> which ties all the net_device specific statistics to a top level
> attribute and we can add the non net_device specific stats later on.
> We can't do that later on though without breaking ABI so that would
> have to go in with the first iteration.
agreed
>
> We can preserve all the existing attribute formats, we simply have
> to introduce new attribute types which don't overlap and document
> which statistic identifier maps to which existing attribute id.
sure,

thanks Thomas.

^ permalink raw reply

* [PATCH 4.5 155/238] ARC: [plat-axs10x] add Ethernet PHY description in .dts
From: Greg Kroah-Hartman @ 2016-04-10 18:35 UTC (permalink / raw)
  To: linux-kernel
  Cc: Greg Kroah-Hartman, stable, Alexey Brodkin, Rob Herring,
	Phil Reid, David S. Miller, netdev, Sergei Shtylyov, Vineet Gupta
In-Reply-To: <20160410183456.398741366@linuxfoundation.org>

4.5-stable review patch.  If anyone has any objections, please let me know.

------------------

From: Alexey Brodkin <Alexey.Brodkin@synopsys.com>

commit 667a490bdb6e27db0887d2ca515b907d6aa87118 upstream.

Commit e34d65696d2e ("stmmac: create of compatible mdio bus for stmmac
driver") broke DW GMAC functionality on ARC AXS10x boards:

That's what happens on eth0 up:
  --------------------------->8------------------------
| libphy: PHY stmmac-0:ffffffff not found
| eth0: Could not attach to PHY
| stmmac_open: Cannot attach to PHY (error: -19)
  --------------------------->8------------------------

Simplest solution is to add PHY description in board's .dts.
And so we do here.

Signed-off-by: Alexey Brodkin <abrodkin@synopsys.com>
Cc: Rob Herring <robh@kernel.org>
Cc: Phil Reid <preid@electromag.com.au>
Cc: David S. Miller <davem@davemloft.net>
Cc: linux-kernel@vger.kernel.org
Cc: netdev@vger.kernel.org
Reviewed-by: Sergei Shtylyov <sergei.shtylyov@cogentembedded.com>
Signed-off-by: Vineet Gupta <vgupta@synopsys.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>

---
 arch/arc/boot/dts/axs10x_mb.dtsi |    8 ++++++++
 1 file changed, 8 insertions(+)

--- a/arch/arc/boot/dts/axs10x_mb.dtsi
+++ b/arch/arc/boot/dts/axs10x_mb.dtsi
@@ -47,6 +47,14 @@
 			clocks = <&apbclk>;
 			clock-names = "stmmaceth";
 			max-speed = <100>;
+			mdio0 {
+				#address-cells = <1>;
+				#size-cells = <0>;
+				compatible = "snps,dwmac-mdio";
+				phy1: ethernet-phy@1 {
+					reg = <1>;
+				};
+			};
 		};

 		ehci@0x40000 {

^ permalink raw reply

* Re: [RFC PATCH v2 5/5] Add sample for adding simple drop program to link
From: Brenden Blanco @ 2016-04-10 18:38 UTC (permalink / raw)
  To: Jamal Hadi Salim
  Cc: davem, netdev, tom, alexei.starovoitov, ogerlitz, daniel, brouer,
	eric.dumazet, ecree, john.fastabend, tgraf, johannes,
	eranlinuxmellanox, lorenzo
In-Reply-To: <57093B67.1080604@mojatatu.com>

On Sat, Apr 09, 2016 at 01:27:03PM -0400, Jamal Hadi Salim wrote:
> On 16-04-09 12:43 PM, Brenden Blanco wrote:
> >On Sat, Apr 09, 2016 at 10:48:05AM -0400, Jamal Hadi Salim wrote:
> 
> 
> >>Ok, sorry - should have looked this far before sending earlier email.
> >>So when you run concurently you see about 5Mpps per core but if you
> >>shoot all traffic at a single core you see 20Mpps?
> >No, only sender is multiple, receiver is still single core. The flow is
> >the same in all 4 of the send threads. Note that only ksoftirqd/6 is
> >active.
> 
> Got it.
> The sender was limited to the 20Mpps and you are able to keep up
> if i understand correctly.
Perhaps, though I can't say 100%. The sender is able to do about 21/22
Mpps when pause frames are disabled. The sender is likely CPU limited as
it is an older Xeon.
> 
> 
> >>
> >>Devil's advocate question:
> >>If the bottleneck is the driver - is there an advantage in adding the
> >>bpf code at all in the driver?
> >Only by adding this hook into the driver has it become the bottleneck.
> >
> >Prior to this, the bottleneck was later in the codepath, primarily in
> >allocations.
> >
> 
> Maybe useful in your commit log to show the prior and after.
I can add this, sure.
> Looking at both your and Daniel's profile you show in this email
> mlx4_en_process_rx_cq() seems to be where the action is on both, no?
I don't draw this conclusion. With the phys_dev drop,
mlx4_en_process_rx_cq is the majority time consumer. In the perf output
showing drop in tc, the functions such as dev_gro_receive,
kmem_cache_free, napi_gro_frags, inet_gro_receive, __build_skb, etc
combined add up to 60% of the time spent. None of these are called when
early drop occurs. Just because mlx4_en_process_rx_cq is at the top of
the list doesn't mean it is the lowest hanging fruit.
> 
> >If a packet is to be dropped, and a determination can be made with fewer
> >cpu cycles spent, then there is more time for the goodput.
> >
> 
> Agreed.
> 
> >Beyond that, even if the skb allocation gets 10x or 100x or whatever
> >improvement, there is still a non-zero cost associated, and dropping bad
> >packets with minimal time spent has value. The same argument holds for
> >physical nic forwarding decisions.
> >
> 
> I always go for the lowest hanging fruit.
Which to me is the 60% time spent above the driver level as shown above.
> It seemed it was the driver path in your case. When we removed
> the driver overhead (as demoed at the tc workshop in netdev11) we saw
> __netif_receive_skb_core() at the top of the profile.
> So in this case seems it was mlx4_en_process_rx_cq() - thats why i
> was saying the bottleneck is the driver.
I wouldn't call it a bottleneck when the time spent is additive,
aka run-to-completion.
> Having said that: I agree that early drop is useful if not for anything
> else to avoid the longer code path (but was worried after reading on
> thread this was going to get into a messy stack-in-the-driver and i am
> not sure it is avoidable either given a new ops interface is showing
>  up).
> 
> >>I am curious than before to see the comparison for the same bpf code
> >>running at tc level vs in the driver..
> >Here is a perf report for drop in the clsact qdisc with direct-action,
> >which Daniel earlier showed to have the best performance to-date. On my
> >machine, this gets about 6.5Mpps drop single core. Drop due to failed
> >IP lookup (not shown here) is worse @4.5Mpps.
> >
> 
> Nice.
> However, still for this to be orange/orange comparison you have to
> run it on the _same receiver machine_ as opposed to Daniel doing
> it on his for the one case. And two different kernels booted up
> one patched  with your changes and another virgin without them.
Of course the second perf report is on the same machine as the commit
message. That was generated fresh for this email thread. All of the
numbers I've quoted come from the same single-sender/single-receiver
setup. I did also revert the change the in mlx4 driver and there was no
change in the tc numbers.
> 
> cheers,
> jamal

^ permalink raw reply

* Re: [Lsf] [Lsf-pc] [LSF/MM TOPIC] Generic page-pool recycle facility?
From: Sagi Grimberg @ 2016-04-10 18:45 UTC (permalink / raw)
  To: Bart Van Assche, Christoph Hellwig, Jesper Dangaard Brouer
  Cc: lsf@lists.linux-foundation.org, Tom Herbert, Brenden Blanco,
	James Bottomley, linux-mm, netdev@vger.kernel.org,
	lsf-pc@lists.linux-foundation.org, Alexei Starovoitov
In-Reply-To: <570678B7.7010802@sandisk.com>


>> This is also very interesting for storage targets, which face the same
>> issue.  SCST has a mode where it caches some fully constructed SGLs,
>> which is probably very similar to what NICs want to do.
>
> I think a cached allocator for page sets + the scatterlists that
> describe these page sets would not only be useful for SCSI target
> implementations but also for the Linux SCSI initiator. Today the scsi-mq
> code reserves space in each scsi_cmnd for a scatterlist of
> SCSI_MAX_SG_SEGMENTS. If scatterlists would be cached together with page
> sets less memory would be needed per scsi_cmnd.

If we go down this road how about also attaching some driver opaques
to the page sets?

I know of some drivers that can make good use of those ;)

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply

* Re: [PATCH net-next v2 1/2] rtnetlink: add new RTM_GETSTATS message to dump link stats
From: roopa @ 2016-04-10 19:17 UTC (permalink / raw)
  To: Jamal Hadi Salim; +Cc: netdev, davem
In-Reply-To: <570A59B4.30007@mojatatu.com>

On 4/10/16, 6:48 AM, Jamal Hadi Salim wrote:
> On 16-04-09 02:00 PM, roopa wrote:
>
>> This EXTENDED_HW_STATS is for ethtool like extended hw stats. This is keeping in
>> mind that we want to also move ethtool to netlink in the future and with switchdev
>> it becomes more necessary that we provide all stats closer to the other netdev stats.
>> So far hw extended stats have always been available through this separate ethtool
>> channel. The intent here is to unify the api for extended hw and software only stats.
>
> Ok, so these are _not_ the stats which are broken down by packet size
> ranges which are quiet common; but rather "proprietary" per port
> type h/w stats? I browsed a couple of users of ethtool_stats and i see
> they return proprietary looking 64 bit counters (batman for example
> has a very strange meaning to those stats).
> What i meant is a lot of ASICS have counters for byte ranges
> [64bytes-128bytes], [128bytes-256bytes], etc - sorry i cant pin a name
> to those but i am sure you have seen them and i thought those at minimal
> need their own TLV since they are always fixed.

yes, I think i have seen this. But, these if presented by the driver or hardware,
can be part of the extended hw stats (just like how ethtool would do).
So, there is the base netdev stats IFLA_STATS_LINK64 which is nothing but
rtnl_link_stats64 (which this patch adds). And this patch does not add but lists
examples for future additions IFLA_STATS_EXTENDED (for example: bridge
(can include igmp, stp, vlan stats here). bond can includelacp, and other stats here). All these will be TLV based.
and note that this patch does notrestrict or define the design for these yet.


>
>> XSTATS is per netdev can be included as a nested attribute inside IFLA_EXTENDED_STATS
>> which are per netdev. bridge vlan stats will also fall here.
>>
>
> And you are going to distinguish which come from hardware and which are
> software derived?

a) IFLA_EXTENDED_STATS - these are all software. ( for example: bridge (can include 
igmp, stp, vlan stats here).
bond can include lacp, and other stats here).

b) IFLA_HW_EXTENDED_STATS - for hardware stats. for a switch port these will be 
similar to ethtool
physical port stats today. For logical devices, this can include specific hw 
stats that switch asics
provide and those which cannot be added to the software stats.

And since this patch does not define the format for these stats yet, we can 
potentially design it when we see the first case of such stats.




>
>> And this api  provides netdev specific stats. We have always mapped all
>> asic stats to the switch port netdev stats. and this api does not cover the non-netdev specific stats.
>> If you are for example asking for stats for a hardware offloaded bridge, then yes, they will fall here
>> and be available on the bridge netdev. For asic stats that don't map to any netdev, devlink will be an
>> appropriate infrastructure IMO.
>>
>> I am not sure if I answered your question :).
>>
>
> It is useful to have this discussion; unfortunately these user APIS once are in will never be allowed to change. The answers could come
> in time.

yep, agreed.
>
>>> Should such a command then not be rejected with an error code?
>> Dumps with no data are not rejected with an error code AFAIK. ie they don't return
>> -ENODATA. This is consistent with all other dumps (unless i missed it).
>>   But, if there is a need for an error code, i can certainly check again.
>>
>
> It is mostly because you chose a whitelist filter i.e you list what
> is allowed to be sent back. And if such a list is missing there
> needs to be an opposite default (which is a deny all).

ok, by default no-filter = no-data is what we decided. And that
keeps the api simple. will think some more about if we should return
an err in case of no-filter.


>
>

[snip]

> True, but it must be 32 bit aligned. So something like:
> struct if_stats_msg {
>      __u8  family;
>      __u8 pad1;
>      __u16 pad2;
>     __u32 ifindex;
>     __u32 filter_mask;
> }
>

ok ack

>
>> Yeah i remember :). But deferring it for a later incremental feature. That needs some more thought.
>
> NP ;->
>
>> Right now there is an urgent need to get the basic get stats api for a bunch of other stats: mpls, bridge vlan etc.
>> Because it is not clean to extend the current stats infra part of the link message for this. So trying to get this in first.
>>
>
> Agreed.
> The only thing i strongly feel about is the if_stats_msg - please fix
> that and lets get at least the basics in. We can resolve other things
> with further discussions.

will do. Thanks for the review.

^ permalink raw reply

* [PATCH] ravb: make ravb_ptp_interrupt() *void*
From: Sergei Shtylyov @ 2016-04-10 20:55 UTC (permalink / raw)
  To: netdev; +Cc: linux-renesas-soc
In-Reply-To: <2504091.pf4fyK11eg@wasted.cogentembedded.com>

When we have the ISS.CGIS bit set, we already know that gPTP interrupt has
happened, so an extra GIS register check at the end of ravb_ptp_interrupt()
seems superfluous.  We can model the gPTP interrupt  handler like all other
dedicated interrupt handlers in the driver and make it *void*.

Signed-off-by: Sergei Shtylyov <sergei.shtylyov@cogentembedded.com>

---
The patch is against the Dave Miller's 'net-next.git' repo.

 drivers/net/ethernet/renesas/ravb.h      |    2 +-
 drivers/net/ethernet/renesas/ravb_main.c |    8 ++++++--
 drivers/net/ethernet/renesas/ravb_ptp.c  |    9 ++-------
 3 files changed, 9 insertions(+), 10 deletions(-)

Index: net-next/drivers/net/ethernet/renesas/ravb.h
===================================================================
--- net-next.orig/drivers/net/ethernet/renesas/ravb.h
+++ net-next/drivers/net/ethernet/renesas/ravb.h
@@ -1045,7 +1045,7 @@ void ravb_modify(struct net_device *ndev
 		 u32 set);
 int ravb_wait(struct net_device *ndev, enum ravb_reg reg, u32 mask, u32 value);
 
-irqreturn_t ravb_ptp_interrupt(struct net_device *ndev);
+void ravb_ptp_interrupt(struct net_device *ndev);
 void ravb_ptp_init(struct net_device *ndev, struct platform_device *pdev);
 void ravb_ptp_stop(struct net_device *ndev);
 
Index: net-next/drivers/net/ethernet/renesas/ravb_main.c
===================================================================
--- net-next.orig/drivers/net/ethernet/renesas/ravb_main.c
+++ net-next/drivers/net/ethernet/renesas/ravb_main.c
@@ -807,8 +807,10 @@ static irqreturn_t ravb_interrupt(int ir
 	}
 
 	/* gPTP interrupt status summary */
-	if ((iss & ISS_CGIS) && ravb_ptp_interrupt(ndev) == IRQ_HANDLED)
+	if (iss & ISS_CGIS) {
+		ravb_ptp_interrupt(ndev);
 		result = IRQ_HANDLED;
+	}
 
 	mmiowb();
 	spin_unlock(&priv->lock);
@@ -838,8 +840,10 @@ static irqreturn_t ravb_multi_interrupt(
 	}
 
 	/* gPTP interrupt status summary */
-	if ((iss & ISS_CGIS) && ravb_ptp_interrupt(ndev) == IRQ_HANDLED)
+	if (iss & ISS_CGIS) {
+		ravb_ptp_interrupt(ndev);
 		result = IRQ_HANDLED;
+	}
 
 	mmiowb();
 	spin_unlock(&priv->lock);
Index: net-next/drivers/net/ethernet/renesas/ravb_ptp.c
===================================================================
--- net-next.orig/drivers/net/ethernet/renesas/ravb_ptp.c
+++ net-next/drivers/net/ethernet/renesas/ravb_ptp.c
@@ -296,7 +296,7 @@ static const struct ptp_clock_info ravb_
 };
 
 /* Caller must hold the lock */
-irqreturn_t ravb_ptp_interrupt(struct net_device *ndev)
+void ravb_ptp_interrupt(struct net_device *ndev)
 {
 	struct ravb_private *priv = netdev_priv(ndev);
 	u32 gis = ravb_read(ndev, GIS);
@@ -319,12 +319,7 @@ irqreturn_t ravb_ptp_interrupt(struct ne
 		}
 	}
 
-	if (gis) {
-		ravb_write(ndev, ~gis, GIS);
-		return IRQ_HANDLED;
-	}
-
-	return IRQ_NONE;
+	ravb_write(ndev, ~gis, GIS);
 }
 
 void ravb_ptp_init(struct net_device *ndev, struct platform_device *pdev)

^ permalink raw reply

* Re: optimizations to sk_buff handling in rds_tcp_data_ready
From: Eric Dumazet @ 2016-04-10 23:54 UTC (permalink / raw)
  To: Sowmini Varadhan; +Cc: netdev
In-Reply-To: <20160410001853.GA30364@oracle.com>

On Sat, 2016-04-09 at 20:18 -0400, Sowmini Varadhan wrote:
> On (04/07/16 07:16), Eric Dumazet wrote:
> > Use skb split like TCP in output path ?
> > Really, pskb_expand_head() is not supposed to copy payload ;)
> 
> Question- how come skb_split doesnt have to deal with frag_list
> and do a skb_walk_frags()? Couldn't the split-line be somewhere
> in the frag_list? Also even for the skb_split_inside_header,
> dont we have to set
>   skb_shinfo(skb1)->frag_list = skb_shinfo(skb)->frag_list;
> and cut loose the skb_shinfo(skb)->frag_list?
> 
> As I try to mimic skb_split in some new set of "skb_carve"
> funtions, I'm running into all  the various frag_list cases. I'm 
> afraid I might end up needing most of the stuff under the "Pure 
> masohism" (sic) comment in __pskb_pull_tail(). 
> 

The helper was written for TCP I guess. TCP in output path never uses
skb_shinfo(skb)->frag_list : It is guaranteed to be NULL for all skbs.

^ permalink raw reply

* Вы сэкономите средства на кофе и чай, Вы сэкономите средства на кофе и чай
From: Галина (кофейная компания) @ 2016-04-08 11:33 UTC (permalink / raw)
  To: midiza, losb9111, n.oborina00, maestrowinever, piratk0, olgavm84,
	pincer, phoenix-fire, pylatik, mileninagv, notthatkind,
	maystrov2000, nariste-1993, m.n.makaroff, master-union, ovt69,
	netdev, podruga.lp, oukca, olgak12071979

Добрый день.

Поставляем Вашему офису кофе-машину абсолютно бесплатно(действительно бесплатно)!

Какая выгода для нас - Ваши сотрудники будут покупать кофе в нашей кофе-машине.

В чем выгода у Вас:
- Кофе разбудит сонных работников;
- Вы сэкономите деньги на бесплатные чай и кофе, которые они потребляли за счет фирмы.
- Кофе-аппарат порадует еще и Ваших покупателей, посетивших офис.

Данный кофе-аппарат работает в в автономном режиме.

Преимущества нашего кофе-аппарата:
- Современный внешний вид;
- Высокая скорость кофе - от 15 секунд;
- Бесплатный сервис;
- Аппарат готовит: эспрессо, американо, капучино, латте, мокачино, горячий шоколад, молочный шоколад, ванильный капучино.

Если принципиальный интерес у Вас есть - я вышлю Вам фотографии наших аппаратов и их характеристики.
Напишите пожалуйста - сколько человек в Вашем офисе, чтобы я могла сориентироваться - какого объема аппараты Вам можно предложить.

С уважением, Галина.

Буду ждать ответ от Вас.

P.S. Предложение действительно для фирм  со штатом от 50 человек находящихся в Москве.

^ permalink raw reply

* Re: [PATCH] ath9k: remove duplicate assignment of variable ah
From: Julian Calaby @ 2016-04-11  1:39 UTC (permalink / raw)
  To: Colin King
  Cc: QCA ath9k Development, Kalle Valo, linux-wireless, ath9k-devel,
	netdev, linux-kernel@vger.kernel.org
In-Reply-To: <1460287531-17385-1-git-send-email-colin.king@canonical.com>

Hi All,

On Sun, Apr 10, 2016 at 9:25 PM, Colin King <colin.king@canonical.com> wrote:
> From: Colin Ian King <colin.king@canonical.com>
>
> ah is written twice with the same value, remove one of the
> redundant assignments to ah.
>
> Signed-off-by: Colin Ian King <colin.king@canonical.com>

Looks right to me.

Signed-off-by: Julian Calaby <julian.calaby@gmail.com>

Thanks,

> ---
>  drivers/net/wireless/ath/ath9k/init.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
>
> diff --git a/drivers/net/wireless/ath/ath9k/init.c b/drivers/net/wireless/ath/ath9k/init.c
> index 77ace8d..fb702c4 100644
> --- a/drivers/net/wireless/ath/ath9k/init.c
> +++ b/drivers/net/wireless/ath/ath9k/init.c
> @@ -477,7 +477,7 @@ static void ath9k_eeprom_request_cb(const struct firmware *eeprom_blob,
>  static int ath9k_eeprom_request(struct ath_softc *sc, const char *name)
>  {
>         struct ath9k_eeprom_ctx ec;
> -       struct ath_hw *ah = ah = sc->sc_ah;
> +       struct ath_hw *ah = sc->sc_ah;
>         int err;
>
>         /* try to load the EEPROM content asynchronously */
> --
> 2.7.4
>
> --
> To unsubscribe from this list: send the line "unsubscribe linux-wireless" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html



-- 
Julian Calaby

Email: julian.calaby@gmail.com
Profile: http://www.google.com/profiles/julian.calaby/

^ permalink raw reply

* [net-next PATCH v2 0/5] GRO Fixed IPv4 ID support and GSO partial support
From: Alexander Duyck @ 2016-04-11  1:44 UTC (permalink / raw)
  To: herbert, tom, jesse, alexander.duyck, edumazet, netdev, davem

This patch series sets up a few different things.

First it adds support for GRO of frames with a fixed IP ID value.  This
will allow us to perform GRO for frames that go through things like an IPv6
to IPv4 header translation.

The second item we add is support for segmenting frames that are generated
this way.  Most devices only support an incrementing IP ID value, and in
the case of TCP the IP ID can be ignored in many cases since the DF bit
should be set.  So we can technically segment these frames using existing
TSO if we are willing to allow the IP ID to be mangled.  As such I have
added a matching feature for the new form of GRO/GSO called TCP IPv4 ID
mangling.  With this enabled we can assemble and disassemble a frame with
the sequence number fixed and the only ill effect will be that the IPv4 ID
will be altered which may or may not have any noticeable effect.  As such I
have defaulted the feature to disabled.

The third item this patch series adds is support for partial GSO
segmentation.  Partial GSO segmentation allows us to split a large frame
into two pieces.  The first piece will have an even multiple of MSS worth
of data and the headers before the one pointed to by csum_start will have
been updated so that they are correct for if the data payload had already
been segmented.  By doing this we can do things such as precompute the
outer header checksums for a frame to be segmented allowing us to perform
TSO on devices that don't support tunneling, or tunneling with outer header
checksums.

This patch set is based on the net-next tree, but I included "net: remove
netdevice gso_min_segs" in my tree as I assume it is likely to be applied
before this patch set will and I wanted to avoid a merge conflict.

v2: Fixed items reported by Jesse Gross
	fixed missing GSO flag in MPLS check
	adding DF check for MANGLEID
    Moved extra GSO feature checks into gso_features_check
    Rebased batches to account for "net: remove netdevice gso_min_segs"

Driver patches from the first patch set should still be compatible.  However
I do have a few changes in them so I will submit a v2 of those to Jeff
Kirsher once these patches are accepted into net-next.

Example driver patches for i40e, ixgbe, and igb:
https://patchwork.ozlabs.org/patch/608221/
https://patchwork.ozlabs.org/patch/608224/
https://patchwork.ozlabs.org/patch/608225/

---

Alexander Duyck (5):
      ethtool: Add support for toggling any of the GSO offloads
      GSO: Add GSO type for fixed IPv4 ID
      GRO: Add support for TCP with fixed IPv4 ID field, limit tunnel IP ID values
      GSO: Support partial segmentation offload
      Documentation: Add documentation for TSO and GSO features

 Documentation/networking/segmentation-offloads.txt |  130 ++++++++++++++++++++
 include/linux/netdev_features.h                    |    8 +
 include/linux/netdevice.h                          |    8 +
 include/linux/skbuff.h                             |   27 +++-
 net/core/dev.c                                     |   67 +++++++++-
 net/core/ethtool.c                                 |    4 +
 net/core/skbuff.c                                  |   29 ++++
 net/ipv4/af_inet.c                                 |   70 ++++++++---
 net/ipv4/gre_offload.c                             |   27 +++-
 net/ipv4/tcp_offload.c                             |   30 ++++-
 net/ipv4/udp_offload.c                             |   27 +++-
 net/ipv6/ip6_offload.c                             |   21 +++
 net/mpls/mpls_gso.c                                |    1 
 13 files changed, 395 insertions(+), 54 deletions(-)
 create mode 100644 Documentation/networking/segmentation-offloads.txt

^ permalink raw reply

* [net-next PATCH v2 1/5] ethtool: Add support for toggling any of the GSO offloads
From: Alexander Duyck @ 2016-04-11  1:44 UTC (permalink / raw)
  To: herbert, tom, jesse, alexander.duyck, edumazet, netdev, davem
In-Reply-To: <20160411013855.11189.56567.stgit@ahduyck-xeon-server>

The strings were missing for several of the GSO offloads that are
available.  This patch provides the missing strings so that we can toggle
or query any of them via the ethtool command.

Signed-off-by: Alexander Duyck <aduyck@mirantis.com>
---
 net/core/ethtool.c |    2 ++
 1 file changed, 2 insertions(+)

diff --git a/net/core/ethtool.c b/net/core/ethtool.c
index f426c5ad6149..6a7f99661c2f 100644
--- a/net/core/ethtool.c
+++ b/net/core/ethtool.c
@@ -82,9 +82,11 @@ static const char netdev_features_strings[NETDEV_FEATURE_COUNT][ETH_GSTRING_LEN]
 	[NETIF_F_TSO6_BIT] =             "tx-tcp6-segmentation",
 	[NETIF_F_FSO_BIT] =              "tx-fcoe-segmentation",
 	[NETIF_F_GSO_GRE_BIT] =		 "tx-gre-segmentation",
+	[NETIF_F_GSO_GRE_CSUM_BIT] =	 "tx-gre-csum-segmentation",
 	[NETIF_F_GSO_IPIP_BIT] =	 "tx-ipip-segmentation",
 	[NETIF_F_GSO_SIT_BIT] =		 "tx-sit-segmentation",
 	[NETIF_F_GSO_UDP_TUNNEL_BIT] =	 "tx-udp_tnl-segmentation",
+	[NETIF_F_GSO_UDP_TUNNEL_CSUM_BIT] = "tx-udp_tnl-csum-segmentation",
 
 	[NETIF_F_FCOE_CRC_BIT] =         "tx-checksum-fcoe-crc",
 	[NETIF_F_SCTP_CRC_BIT] =        "tx-checksum-sctp",

^ permalink raw reply related

* [net-next PATCH v2 2/5] GSO: Add GSO type for fixed IPv4 ID
From: Alexander Duyck @ 2016-04-11  1:44 UTC (permalink / raw)
  To: herbert, tom, jesse, alexander.duyck, edumazet, netdev, davem
In-Reply-To: <20160411013855.11189.56567.stgit@ahduyck-xeon-server>

This patch adds support for TSO using IPv4 headers with a fixed IP ID
field.  This is meant to allow us to do a lossless GRO in the case of TCP
flows that use a fixed IP ID such as those that convert IPv6 header to IPv4
headers.

In addition I am adding a feature that for now I am referring to TSO with
IP ID mangling.  Basically when this flag is enabled the device has the
option to either output the flow with incrementing IP IDs or with a fixed
IP ID regardless of what the original IP ID ordering was.  This is useful
in cases where the DF bit is set and we do not care if the original IP ID
value is maintained.

Signed-off-by: Alexander Duyck <aduyck@mirantis.com>
---
 include/linux/netdev_features.h |    3 +++
 include/linux/netdevice.h       |    1 +
 include/linux/skbuff.h          |   20 +++++++++++---------
 net/core/dev.c                  |   34 +++++++++++++++++++++++++++++-----
 net/core/ethtool.c              |    1 +
 net/ipv4/af_inet.c              |   19 +++++++++++--------
 net/ipv4/gre_offload.c          |    1 +
 net/ipv4/tcp_offload.c          |    4 +++-
 net/ipv6/ip6_offload.c          |    3 ++-
 net/mpls/mpls_gso.c             |    1 +
 10 files changed, 63 insertions(+), 24 deletions(-)

diff --git a/include/linux/netdev_features.h b/include/linux/netdev_features.h
index a734bf43d190..7cf272a4b5c8 100644
--- a/include/linux/netdev_features.h
+++ b/include/linux/netdev_features.h
@@ -39,6 +39,7 @@ enum {
 	NETIF_F_UFO_BIT,		/* ... UDPv4 fragmentation */
 	NETIF_F_GSO_ROBUST_BIT,		/* ... ->SKB_GSO_DODGY */
 	NETIF_F_TSO_ECN_BIT,		/* ... TCP ECN support */
+	NETIF_F_TSO_MANGLEID_BIT,	/* ... IPV4 ID mangling allowed */
 	NETIF_F_TSO6_BIT,		/* ... TCPv6 segmentation */
 	NETIF_F_FSO_BIT,		/* ... FCoE segmentation */
 	NETIF_F_GSO_GRE_BIT,		/* ... GRE with TSO */
@@ -120,6 +121,7 @@ enum {
 #define NETIF_F_GSO_SIT		__NETIF_F(GSO_SIT)
 #define NETIF_F_GSO_UDP_TUNNEL	__NETIF_F(GSO_UDP_TUNNEL)
 #define NETIF_F_GSO_UDP_TUNNEL_CSUM __NETIF_F(GSO_UDP_TUNNEL_CSUM)
+#define NETIF_F_TSO_MANGLEID	__NETIF_F(TSO_MANGLEID)
 #define NETIF_F_GSO_TUNNEL_REMCSUM __NETIF_F(GSO_TUNNEL_REMCSUM)
 #define NETIF_F_HW_VLAN_STAG_FILTER __NETIF_F(HW_VLAN_STAG_FILTER)
 #define NETIF_F_HW_VLAN_STAG_RX	__NETIF_F(HW_VLAN_STAG_RX)
@@ -147,6 +149,7 @@ enum {
 
 /* List of features with software fallbacks. */
 #define NETIF_F_GSO_SOFTWARE	(NETIF_F_TSO | NETIF_F_TSO_ECN | \
+				 NETIF_F_TSO_MANGLEID | \
 				 NETIF_F_TSO6 | NETIF_F_UFO)
 
 /* List of IP checksum features. Note that NETIF_F_ HW_CSUM should not be
diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
index 347ad5de0d93..eb7f037a4068 100644
--- a/include/linux/netdevice.h
+++ b/include/linux/netdevice.h
@@ -3992,6 +3992,7 @@ static inline bool net_gso_ok(netdev_features_t features, int gso_type)
 	BUILD_BUG_ON(SKB_GSO_UDP     != (NETIF_F_UFO >> NETIF_F_GSO_SHIFT));
 	BUILD_BUG_ON(SKB_GSO_DODGY   != (NETIF_F_GSO_ROBUST >> NETIF_F_GSO_SHIFT));
 	BUILD_BUG_ON(SKB_GSO_TCP_ECN != (NETIF_F_TSO_ECN >> NETIF_F_GSO_SHIFT));
+	BUILD_BUG_ON(SKB_GSO_TCP_FIXEDID != (NETIF_F_TSO_MANGLEID >> NETIF_F_GSO_SHIFT));
 	BUILD_BUG_ON(SKB_GSO_TCPV6   != (NETIF_F_TSO6 >> NETIF_F_GSO_SHIFT));
 	BUILD_BUG_ON(SKB_GSO_FCOE    != (NETIF_F_FSO >> NETIF_F_GSO_SHIFT));
 	BUILD_BUG_ON(SKB_GSO_GRE     != (NETIF_F_GSO_GRE >> NETIF_F_GSO_SHIFT));
diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h
index 007381270ff8..5fba16658f9d 100644
--- a/include/linux/skbuff.h
+++ b/include/linux/skbuff.h
@@ -465,23 +465,25 @@ enum {
 	/* This indicates the tcp segment has CWR set. */
 	SKB_GSO_TCP_ECN = 1 << 3,
 
-	SKB_GSO_TCPV6 = 1 << 4,
+	SKB_GSO_TCP_FIXEDID = 1 << 4,
 
-	SKB_GSO_FCOE = 1 << 5,
+	SKB_GSO_TCPV6 = 1 << 5,
 
-	SKB_GSO_GRE = 1 << 6,
+	SKB_GSO_FCOE = 1 << 6,
 
-	SKB_GSO_GRE_CSUM = 1 << 7,
+	SKB_GSO_GRE = 1 << 7,
 
-	SKB_GSO_IPIP = 1 << 8,
+	SKB_GSO_GRE_CSUM = 1 << 8,
 
-	SKB_GSO_SIT = 1 << 9,
+	SKB_GSO_IPIP = 1 << 9,
 
-	SKB_GSO_UDP_TUNNEL = 1 << 10,
+	SKB_GSO_SIT = 1 << 10,
 
-	SKB_GSO_UDP_TUNNEL_CSUM = 1 << 11,
+	SKB_GSO_UDP_TUNNEL = 1 << 11,
 
-	SKB_GSO_TUNNEL_REMCSUM = 1 << 12,
+	SKB_GSO_UDP_TUNNEL_CSUM = 1 << 12,
+
+	SKB_GSO_TUNNEL_REMCSUM = 1 << 13,
 };
 
 #if BITS_PER_LONG > 32
diff --git a/net/core/dev.c b/net/core/dev.c
index 09fb1ace9dc8..e896b1953ab6 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -2825,14 +2825,36 @@ static netdev_features_t dflt_features_check(const struct sk_buff *skb,
 	return vlan_features_check(skb, features);
 }
 
+static netdev_features_t gso_features_check(const struct sk_buff *skb,
+					    struct net_device *dev,
+					    netdev_features_t features)
+{
+	u16 gso_segs = skb_shinfo(skb)->gso_segs;
+
+	if (gso_segs > dev->gso_max_segs)
+		return features & ~NETIF_F_GSO_MASK;
+
+	/* Make sure to clear the IPv4 ID mangling feature if
+	 * the IPv4 header has the potential to be fragmented.
+	 */
+	if (skb_shinfo(skb)->gso_type & SKB_GSO_TCPV4) {
+		struct iphdr *iph = skb->encapsulation ?
+				    inner_ip_hdr(skb) : ip_hdr(skb);
+
+		if (!(iph->frag_off & htons(IP_DF)))
+			features &= ~NETIF_F_TSO_MANGLEID;
+	}
+
+	return features;
+}
+
 netdev_features_t netif_skb_features(struct sk_buff *skb)
 {
 	struct net_device *dev = skb->dev;
 	netdev_features_t features = dev->features;
-	u16 gso_segs = skb_shinfo(skb)->gso_segs;
 
-	if (gso_segs > dev->gso_max_segs)
-		features &= ~NETIF_F_GSO_MASK;
+	if (skb_is_gso(skb))
+		features = gso_features_check(skb, dev, features);
 
 	/* If encapsulation offload request, verify we are testing
 	 * hardware encapsulation features instead of standard
@@ -6976,9 +6998,11 @@ int register_netdevice(struct net_device *dev)
 	dev->features |= NETIF_F_SOFT_FEATURES;
 	dev->wanted_features = dev->features & dev->hw_features;
 
-	if (!(dev->flags & IFF_LOOPBACK)) {
+	if (!(dev->flags & IFF_LOOPBACK))
 		dev->hw_features |= NETIF_F_NOCACHE_COPY;
-	}
+
+	if (dev->hw_features & NETIF_F_TSO)
+		dev->hw_features |= NETIF_F_TSO_MANGLEID;
 
 	/* Make NETIF_F_HIGHDMA inheritable to VLAN devices.
 	 */
diff --git a/net/core/ethtool.c b/net/core/ethtool.c
index 6a7f99661c2f..9494c41cc77c 100644
--- a/net/core/ethtool.c
+++ b/net/core/ethtool.c
@@ -79,6 +79,7 @@ static const char netdev_features_strings[NETDEV_FEATURE_COUNT][ETH_GSTRING_LEN]
 	[NETIF_F_UFO_BIT] =              "tx-udp-fragmentation",
 	[NETIF_F_GSO_ROBUST_BIT] =       "tx-gso-robust",
 	[NETIF_F_TSO_ECN_BIT] =          "tx-tcp-ecn-segmentation",
+	[NETIF_F_TSO_MANGLEID_BIT] =	 "tx-tcp-mangleid-segmentation",
 	[NETIF_F_TSO6_BIT] =             "tx-tcp6-segmentation",
 	[NETIF_F_FSO_BIT] =              "tx-fcoe-segmentation",
 	[NETIF_F_GSO_GRE_BIT] =		 "tx-gre-segmentation",
diff --git a/net/ipv4/af_inet.c b/net/ipv4/af_inet.c
index 8217cd22f921..5bbea9a0ce96 100644
--- a/net/ipv4/af_inet.c
+++ b/net/ipv4/af_inet.c
@@ -1195,10 +1195,10 @@ EXPORT_SYMBOL(inet_sk_rebuild_header);
 static struct sk_buff *inet_gso_segment(struct sk_buff *skb,
 					netdev_features_t features)
 {
+	bool udpfrag = false, fixedid = false, encap;
 	struct sk_buff *segs = ERR_PTR(-EINVAL);
 	const struct net_offload *ops;
 	unsigned int offset = 0;
-	bool udpfrag, encap;
 	struct iphdr *iph;
 	int proto;
 	int nhoff;
@@ -1217,6 +1217,7 @@ static struct sk_buff *inet_gso_segment(struct sk_buff *skb,
 		       SKB_GSO_TCPV6 |
 		       SKB_GSO_UDP_TUNNEL |
 		       SKB_GSO_UDP_TUNNEL_CSUM |
+		       SKB_GSO_TCP_FIXEDID |
 		       SKB_GSO_TUNNEL_REMCSUM |
 		       0)))
 		goto out;
@@ -1248,11 +1249,14 @@ static struct sk_buff *inet_gso_segment(struct sk_buff *skb,
 
 	segs = ERR_PTR(-EPROTONOSUPPORT);
 
-	if (skb->encapsulation &&
-	    skb_shinfo(skb)->gso_type & (SKB_GSO_SIT|SKB_GSO_IPIP))
-		udpfrag = proto == IPPROTO_UDP && encap;
-	else
-		udpfrag = proto == IPPROTO_UDP && !skb->encapsulation;
+	if (!skb->encapsulation || encap) {
+		udpfrag = !!(skb_shinfo(skb)->gso_type & SKB_GSO_UDP);
+		fixedid = !!(skb_shinfo(skb)->gso_type & SKB_GSO_TCP_FIXEDID);
+
+		/* fixed ID is invalid if DF bit is not set */
+		if (fixedid && !(iph->frag_off & htons(IP_DF)))
+			goto out;
+	}
 
 	ops = rcu_dereference(inet_offloads[proto]);
 	if (likely(ops && ops->callbacks.gso_segment))
@@ -1265,12 +1269,11 @@ static struct sk_buff *inet_gso_segment(struct sk_buff *skb,
 	do {
 		iph = (struct iphdr *)(skb_mac_header(skb) + nhoff);
 		if (udpfrag) {
-			iph->id = htons(id);
 			iph->frag_off = htons(offset >> 3);
 			if (skb->next)
 				iph->frag_off |= htons(IP_MF);
 			offset += skb->len - nhoff - ihl;
-		} else {
+		} else if (!fixedid) {
 			iph->id = htons(id++);
 		}
 		iph->tot_len = htons(skb->len - nhoff);
diff --git a/net/ipv4/gre_offload.c b/net/ipv4/gre_offload.c
index 6a5bd4317866..6376b0cdf693 100644
--- a/net/ipv4/gre_offload.c
+++ b/net/ipv4/gre_offload.c
@@ -32,6 +32,7 @@ static struct sk_buff *gre_gso_segment(struct sk_buff *skb,
 				  SKB_GSO_UDP |
 				  SKB_GSO_DODGY |
 				  SKB_GSO_TCP_ECN |
+				  SKB_GSO_TCP_FIXEDID |
 				  SKB_GSO_GRE |
 				  SKB_GSO_GRE_CSUM |
 				  SKB_GSO_IPIP |
diff --git a/net/ipv4/tcp_offload.c b/net/ipv4/tcp_offload.c
index 773083b7f1e9..08dd25d835af 100644
--- a/net/ipv4/tcp_offload.c
+++ b/net/ipv4/tcp_offload.c
@@ -89,6 +89,7 @@ struct sk_buff *tcp_gso_segment(struct sk_buff *skb,
 			     ~(SKB_GSO_TCPV4 |
 			       SKB_GSO_DODGY |
 			       SKB_GSO_TCP_ECN |
+			       SKB_GSO_TCP_FIXEDID |
 			       SKB_GSO_TCPV6 |
 			       SKB_GSO_GRE |
 			       SKB_GSO_GRE_CSUM |
@@ -98,7 +99,8 @@ struct sk_buff *tcp_gso_segment(struct sk_buff *skb,
 			       SKB_GSO_UDP_TUNNEL_CSUM |
 			       SKB_GSO_TUNNEL_REMCSUM |
 			       0) ||
-			     !(type & (SKB_GSO_TCPV4 | SKB_GSO_TCPV6))))
+			     !(type & (SKB_GSO_TCPV4 |
+				       SKB_GSO_TCPV6))))
 			goto out;
 
 		skb_shinfo(skb)->gso_segs = DIV_ROUND_UP(skb->len, mss);
diff --git a/net/ipv6/ip6_offload.c b/net/ipv6/ip6_offload.c
index 204af2219471..b3a779393d71 100644
--- a/net/ipv6/ip6_offload.c
+++ b/net/ipv6/ip6_offload.c
@@ -73,6 +73,8 @@ static struct sk_buff *ipv6_gso_segment(struct sk_buff *skb,
 		       SKB_GSO_UDP |
 		       SKB_GSO_DODGY |
 		       SKB_GSO_TCP_ECN |
+		       SKB_GSO_TCP_FIXEDID |
+		       SKB_GSO_TCPV6 |
 		       SKB_GSO_GRE |
 		       SKB_GSO_GRE_CSUM |
 		       SKB_GSO_IPIP |
@@ -80,7 +82,6 @@ static struct sk_buff *ipv6_gso_segment(struct sk_buff *skb,
 		       SKB_GSO_UDP_TUNNEL |
 		       SKB_GSO_UDP_TUNNEL_CSUM |
 		       SKB_GSO_TUNNEL_REMCSUM |
-		       SKB_GSO_TCPV6 |
 		       0)))
 		goto out;
 
diff --git a/net/mpls/mpls_gso.c b/net/mpls/mpls_gso.c
index 0183b32da942..bbcf60465e5c 100644
--- a/net/mpls/mpls_gso.c
+++ b/net/mpls/mpls_gso.c
@@ -31,6 +31,7 @@ static struct sk_buff *mpls_gso_segment(struct sk_buff *skb,
 				  SKB_GSO_TCPV6 |
 				  SKB_GSO_UDP |
 				  SKB_GSO_DODGY |
+				  SKB_GSO_TCP_FIXEDID |
 				  SKB_GSO_TCP_ECN)))
 		goto out;
 

^ permalink raw reply related

* [net-next PATCH v2 3/5] GRO: Add support for TCP with fixed IPv4 ID field, limit tunnel IP ID values
From: Alexander Duyck @ 2016-04-11  1:44 UTC (permalink / raw)
  To: herbert, tom, jesse, alexander.duyck, edumazet, netdev, davem
In-Reply-To: <20160411013855.11189.56567.stgit@ahduyck-xeon-server>

This patch does two things.

First it allows TCP to aggregate TCP frames with a fixed IPv4 ID field.  As
a result we should now be able to aggregate flows that were converted from
IPv6 to IPv4.  In addition this allows us more flexibility for future
implementations of segmentation as we may be able to use a fixed IP ID when
segmenting the flow.

The second thing this does is that it places limitations on the outer IPv4
ID header in the case of tunneled frames.  Specifically it forces the IP ID
to be incrementing by 1 unless the DF bit is set in the outer IPv4 header.
This way we can avoid creating overlapping series of IP IDs that could
possibly be fragmented if the frame goes through GRO and is then
resegmented via GSO.

Signed-off-by: Alexander Duyck <aduyck@mirantis.com>
---
 include/linux/netdevice.h |    5 ++++-
 net/core/dev.c            |    1 +
 net/ipv4/af_inet.c        |   35 ++++++++++++++++++++++++++++-------
 net/ipv4/tcp_offload.c    |   16 +++++++++++++++-
 net/ipv6/ip6_offload.c    |    8 ++++++--
 5 files changed, 54 insertions(+), 11 deletions(-)

diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
index eb7f037a4068..6a248a3a44bf 100644
--- a/include/linux/netdevice.h
+++ b/include/linux/netdevice.h
@@ -2121,7 +2121,10 @@ struct napi_gro_cb {
 	/* Used in GRE, set in fou/gue_gro_receive */
 	u8	is_fou:1;
 
-	/* 6 bit hole */
+	/* Used to determine if flush_id can be ignored */
+	u8	is_atomic:1;
+
+	/* 5 bit hole */
 
 	/* used to support CHECKSUM_COMPLETE for tunneling protocols */
 	__wsum	csum;
diff --git a/net/core/dev.c b/net/core/dev.c
index e896b1953ab6..b78b586b1856 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -4462,6 +4462,7 @@ static enum gro_result dev_gro_receive(struct napi_struct *napi, struct sk_buff
 		NAPI_GRO_CB(skb)->free = 0;
 		NAPI_GRO_CB(skb)->encap_mark = 0;
 		NAPI_GRO_CB(skb)->is_fou = 0;
+		NAPI_GRO_CB(skb)->is_atomic = 1;
 		NAPI_GRO_CB(skb)->gro_remcsum_start = 0;
 
 		/* Setup for GRO checksum validation */
diff --git a/net/ipv4/af_inet.c b/net/ipv4/af_inet.c
index 5bbea9a0ce96..8564cab96189 100644
--- a/net/ipv4/af_inet.c
+++ b/net/ipv4/af_inet.c
@@ -1328,6 +1328,7 @@ static struct sk_buff **inet_gro_receive(struct sk_buff **head,
 
 	for (p = *head; p; p = p->next) {
 		struct iphdr *iph2;
+		u16 flush_id;
 
 		if (!NAPI_GRO_CB(p)->same_flow)
 			continue;
@@ -1351,16 +1352,36 @@ static struct sk_buff **inet_gro_receive(struct sk_buff **head,
 			(iph->tos ^ iph2->tos) |
 			((iph->frag_off ^ iph2->frag_off) & htons(IP_DF));
 
-		/* Save the IP ID check to be included later when we get to
-		 * the transport layer so only the inner most IP ID is checked.
-		 * This is because some GSO/TSO implementations do not
-		 * correctly increment the IP ID for the outer hdrs.
-		 */
-		NAPI_GRO_CB(p)->flush_id =
-			    ((u16)(ntohs(iph2->id) + NAPI_GRO_CB(p)->count) ^ id);
 		NAPI_GRO_CB(p)->flush |= flush;
+
+		/* We need to store of the IP ID check to be included later
+		 * when we can verify that this packet does in fact belong
+		 * to a given flow.
+		 */
+		flush_id = (u16)(id - ntohs(iph2->id));
+
+		/* This bit of code makes it much easier for us to identify
+		 * the cases where we are doing atomic vs non-atomic IP ID
+		 * checks.  Specifically an atomic check can return IP ID
+		 * values 0 - 0xFFFF, while a non-atomic check can only
+		 * return 0 or 0xFFFF.
+		 */
+		if (!NAPI_GRO_CB(p)->is_atomic ||
+		    !(iph->frag_off & htons(IP_DF))) {
+			flush_id ^= NAPI_GRO_CB(p)->count;
+			flush_id = flush_id ? 0xFFFF : 0;
+		}
+
+		/* If the previous IP ID value was based on an atomic
+		 * datagram we can overwrite the value and ignore it.
+		 */
+		if (NAPI_GRO_CB(skb)->is_atomic)
+			NAPI_GRO_CB(p)->flush_id = flush_id;
+		else
+			NAPI_GRO_CB(p)->flush_id |= flush_id;
 	}
 
+	NAPI_GRO_CB(skb)->is_atomic = !!(iph->frag_off & htons(IP_DF));
 	NAPI_GRO_CB(skb)->flush |= flush;
 	skb_set_network_header(skb, off);
 	/* The above will be needed by the transport layer if there is one
diff --git a/net/ipv4/tcp_offload.c b/net/ipv4/tcp_offload.c
index 08dd25d835af..d1ffd55289bd 100644
--- a/net/ipv4/tcp_offload.c
+++ b/net/ipv4/tcp_offload.c
@@ -239,7 +239,7 @@ struct sk_buff **tcp_gro_receive(struct sk_buff **head, struct sk_buff *skb)
 
 found:
 	/* Include the IP ID check below from the inner most IP hdr */
-	flush = NAPI_GRO_CB(p)->flush | NAPI_GRO_CB(p)->flush_id;
+	flush = NAPI_GRO_CB(p)->flush;
 	flush |= (__force int)(flags & TCP_FLAG_CWR);
 	flush |= (__force int)((flags ^ tcp_flag_word(th2)) &
 		  ~(TCP_FLAG_CWR | TCP_FLAG_FIN | TCP_FLAG_PSH));
@@ -248,6 +248,17 @@ found:
 		flush |= *(u32 *)((u8 *)th + i) ^
 			 *(u32 *)((u8 *)th2 + i);
 
+	/* When we receive our second frame we can made a decision on if we
+	 * continue this flow as an atomic flow with a fixed ID or if we use
+	 * an incrementing ID.
+	 */
+	if (NAPI_GRO_CB(p)->flush_id != 1 ||
+	    NAPI_GRO_CB(p)->count != 1 ||
+	    !NAPI_GRO_CB(p)->is_atomic)
+		flush |= NAPI_GRO_CB(p)->flush_id;
+	else
+		NAPI_GRO_CB(p)->is_atomic = false;
+
 	mss = skb_shinfo(p)->gso_size;
 
 	flush |= (len - 1) >= mss;
@@ -316,6 +327,9 @@ static int tcp4_gro_complete(struct sk_buff *skb, int thoff)
 				  iph->daddr, 0);
 	skb_shinfo(skb)->gso_type |= SKB_GSO_TCPV4;
 
+	if (NAPI_GRO_CB(skb)->is_atomic)
+		skb_shinfo(skb)->gso_type |= SKB_GSO_TCP_FIXEDID;
+
 	return tcp_gro_complete(skb);
 }
 
diff --git a/net/ipv6/ip6_offload.c b/net/ipv6/ip6_offload.c
index b3a779393d71..061adcda65f3 100644
--- a/net/ipv6/ip6_offload.c
+++ b/net/ipv6/ip6_offload.c
@@ -240,10 +240,14 @@ static struct sk_buff **ipv6_gro_receive(struct sk_buff **head,
 		NAPI_GRO_CB(p)->flush |= !!(first_word & htonl(0x0FF00000));
 		NAPI_GRO_CB(p)->flush |= flush;
 
-		/* Clear flush_id, there's really no concept of ID in IPv6. */
-		NAPI_GRO_CB(p)->flush_id = 0;
+		/* If the previous IP ID value was based on an atomic
+		 * datagram we can overwrite the value and ignore it.
+		 */
+		if (NAPI_GRO_CB(skb)->is_atomic)
+			NAPI_GRO_CB(p)->flush_id = 0;
 	}
 
+	NAPI_GRO_CB(skb)->is_atomic = true;
 	NAPI_GRO_CB(skb)->flush |= flush;
 
 	skb_gro_postpull_rcsum(skb, iph, nlen);

^ permalink raw reply related

* [net-next PATCH v2 4/5] GSO: Support partial segmentation offload
From: Alexander Duyck @ 2016-04-11  1:45 UTC (permalink / raw)
  To: herbert, tom, jesse, alexander.duyck, edumazet, netdev, davem
In-Reply-To: <20160411013855.11189.56567.stgit@ahduyck-xeon-server>

This patch adds support for something I am referring to as GSO partial.
The basic idea is that we can support a broader range of devices for
segmentation if we use fixed outer headers and have the hardware only
really deal with segmenting the inner header.  The idea behind the naming
is due to the fact that everything before csum_start will be fixed headers,
and everything after will be the region that is handled by hardware.

With the current implementation it allows us to add support for the
following GSO types with an inner TSO_MANGLEID or TSO6 offload:
NETIF_F_GSO_GRE
NETIF_F_GSO_GRE_CSUM
NETIF_F_GSO_IPIP
NETIF_F_GSO_SIT
NETIF_F_UDP_TUNNEL
NETIF_F_UDP_TUNNEL_CSUM

In the case of hardware that already supports tunneling we may be able to
extend this further to support TSO_TCPV4 without TSO_MANGLEID if the
hardware can support updating inner IPv4 headers.

Signed-off-by: Alexander Duyck <aduyck@mirantis.com>
---
 include/linux/netdev_features.h |    5 +++++
 include/linux/netdevice.h       |    2 ++
 include/linux/skbuff.h          |    9 +++++++--
 net/core/dev.c                  |   36 +++++++++++++++++++++++++++++++++---
 net/core/ethtool.c              |    1 +
 net/core/skbuff.c               |   29 ++++++++++++++++++++++++++++-
 net/ipv4/af_inet.c              |   20 ++++++++++++++++----
 net/ipv4/gre_offload.c          |   26 +++++++++++++++++++++-----
 net/ipv4/tcp_offload.c          |   10 ++++++++--
 net/ipv4/udp_offload.c          |   27 +++++++++++++++++++++------
 net/ipv6/ip6_offload.c          |   10 +++++++++-
 11 files changed, 151 insertions(+), 24 deletions(-)

diff --git a/include/linux/netdev_features.h b/include/linux/netdev_features.h
index 7cf272a4b5c8..9fc79df0e561 100644
--- a/include/linux/netdev_features.h
+++ b/include/linux/netdev_features.h
@@ -48,6 +48,10 @@ enum {
 	NETIF_F_GSO_SIT_BIT,		/* ... SIT tunnel with TSO */
 	NETIF_F_GSO_UDP_TUNNEL_BIT,	/* ... UDP TUNNEL with TSO */
 	NETIF_F_GSO_UDP_TUNNEL_CSUM_BIT,/* ... UDP TUNNEL with TSO & CSUM */
+	NETIF_F_GSO_PARTIAL_BIT,	/* ... Only segment inner-most L4
+					 *     in hardware and all other
+					 *     headers in software.
+					 */
 	NETIF_F_GSO_TUNNEL_REMCSUM_BIT, /* ... TUNNEL with TSO & REMCSUM */
 	/**/NETIF_F_GSO_LAST =		/* last bit, see GSO_MASK */
 		NETIF_F_GSO_TUNNEL_REMCSUM_BIT,
@@ -122,6 +126,7 @@ enum {
 #define NETIF_F_GSO_UDP_TUNNEL	__NETIF_F(GSO_UDP_TUNNEL)
 #define NETIF_F_GSO_UDP_TUNNEL_CSUM __NETIF_F(GSO_UDP_TUNNEL_CSUM)
 #define NETIF_F_TSO_MANGLEID	__NETIF_F(TSO_MANGLEID)
+#define NETIF_F_GSO_PARTIAL	 __NETIF_F(GSO_PARTIAL)
 #define NETIF_F_GSO_TUNNEL_REMCSUM __NETIF_F(GSO_TUNNEL_REMCSUM)
 #define NETIF_F_HW_VLAN_STAG_FILTER __NETIF_F(HW_VLAN_STAG_FILTER)
 #define NETIF_F_HW_VLAN_STAG_RX	__NETIF_F(HW_VLAN_STAG_RX)
diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
index 6a248a3a44bf..e15fbcd79be6 100644
--- a/include/linux/netdevice.h
+++ b/include/linux/netdevice.h
@@ -1654,6 +1654,7 @@ struct net_device {
 	netdev_features_t	vlan_features;
 	netdev_features_t	hw_enc_features;
 	netdev_features_t	mpls_features;
+	netdev_features_t	gso_partial_features;
 
 	int			ifindex;
 	int			group;
@@ -4004,6 +4005,7 @@ static inline bool net_gso_ok(netdev_features_t features, int gso_type)
 	BUILD_BUG_ON(SKB_GSO_SIT     != (NETIF_F_GSO_SIT >> NETIF_F_GSO_SHIFT));
 	BUILD_BUG_ON(SKB_GSO_UDP_TUNNEL != (NETIF_F_GSO_UDP_TUNNEL >> NETIF_F_GSO_SHIFT));
 	BUILD_BUG_ON(SKB_GSO_UDP_TUNNEL_CSUM != (NETIF_F_GSO_UDP_TUNNEL_CSUM >> NETIF_F_GSO_SHIFT));
+	BUILD_BUG_ON(SKB_GSO_PARTIAL != (NETIF_F_GSO_PARTIAL >> NETIF_F_GSO_SHIFT));
 	BUILD_BUG_ON(SKB_GSO_TUNNEL_REMCSUM != (NETIF_F_GSO_TUNNEL_REMCSUM >> NETIF_F_GSO_SHIFT));
 
 	return (features & feature) == feature;
diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h
index 5fba16658f9d..da0ace389fec 100644
--- a/include/linux/skbuff.h
+++ b/include/linux/skbuff.h
@@ -483,7 +483,9 @@ enum {
 
 	SKB_GSO_UDP_TUNNEL_CSUM = 1 << 12,
 
-	SKB_GSO_TUNNEL_REMCSUM = 1 << 13,
+	SKB_GSO_PARTIAL = 1 << 13,
+
+	SKB_GSO_TUNNEL_REMCSUM = 1 << 14,
 };
 
 #if BITS_PER_LONG > 32
@@ -3591,7 +3593,10 @@ static inline struct sec_path *skb_sec_path(struct sk_buff *skb)
  * Keeps track of level of encapsulation of network headers.
  */
 struct skb_gso_cb {
-	int	mac_offset;
+	union {
+		int	mac_offset;
+		int	data_offset;
+	};
 	int	encap_level;
 	__wsum	csum;
 	__u16	csum_start;
diff --git a/net/core/dev.c b/net/core/dev.c
index b78b586b1856..556dd09af3b8 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -2711,6 +2711,19 @@ struct sk_buff *__skb_gso_segment(struct sk_buff *skb,
 			return ERR_PTR(err);
 	}
 
+	/* Only report GSO partial support if it will enable us to
+	 * support segmentation on this frame without needing additional
+	 * work.
+	 */
+	if (features & NETIF_F_GSO_PARTIAL) {
+		netdev_features_t partial_features = NETIF_F_GSO_ROBUST;
+		struct net_device *dev = skb->dev;
+
+		partial_features |= dev->features & dev->gso_partial_features;
+		if (!skb_gso_ok(skb, features | partial_features))
+			features &= ~NETIF_F_GSO_PARTIAL;
+	}
+
 	BUILD_BUG_ON(SKB_SGO_CB_OFFSET +
 		     sizeof(*SKB_GSO_CB(skb)) > sizeof(skb->cb));
 
@@ -2834,8 +2847,17 @@ static netdev_features_t gso_features_check(const struct sk_buff *skb,
 	if (gso_segs > dev->gso_max_segs)
 		return features & ~NETIF_F_GSO_MASK;
 
-	/* Make sure to clear the IPv4 ID mangling feature if
-	 * the IPv4 header has the potential to be fragmented.
+	/* Support for GSO partial features requires software
+	 * intervention before we can actually process the packets
+	 * so we need to strip support for any partial features now
+	 * and we can pull them back in after we have partially
+	 * segmented the frame.
+	 */
+	if (!(skb_shinfo(skb)->gso_type & SKB_GSO_PARTIAL))
+		features &= ~dev->gso_partial_features;
+
+	/* Make sure to clear the IPv4 ID mangling feature if the
+	 * IPv4 header has the potential to be fragmented.
 	 */
 	if (skb_shinfo(skb)->gso_type & SKB_GSO_TCPV4) {
 		struct iphdr *iph = skb->encapsulation ?
@@ -6729,6 +6751,14 @@ static netdev_features_t netdev_fix_features(struct net_device *dev,
 		}
 	}
 
+	/* GSO partial features require GSO partial be set */
+	if ((features & dev->gso_partial_features) &&
+	    !(features & NETIF_F_GSO_PARTIAL)) {
+		netdev_dbg(dev,
+			   "Dropping partially supported GSO features since no GSO partial.\n");
+		features &= ~dev->gso_partial_features;
+	}
+
 #ifdef CONFIG_NET_RX_BUSY_POLL
 	if (dev->netdev_ops->ndo_busy_poll)
 		features |= NETIF_F_BUSY_POLL;
@@ -7011,7 +7041,7 @@ int register_netdevice(struct net_device *dev)
 
 	/* Make NETIF_F_SG inheritable to tunnel devices.
 	 */
-	dev->hw_enc_features |= NETIF_F_SG;
+	dev->hw_enc_features |= NETIF_F_SG | NETIF_F_GSO_PARTIAL;
 
 	/* Make NETIF_F_SG inheritable to MPLS.
 	 */
diff --git a/net/core/ethtool.c b/net/core/ethtool.c
index 9494c41cc77c..e0cf20a3b3dd 100644
--- a/net/core/ethtool.c
+++ b/net/core/ethtool.c
@@ -88,6 +88,7 @@ static const char netdev_features_strings[NETDEV_FEATURE_COUNT][ETH_GSTRING_LEN]
 	[NETIF_F_GSO_SIT_BIT] =		 "tx-sit-segmentation",
 	[NETIF_F_GSO_UDP_TUNNEL_BIT] =	 "tx-udp_tnl-segmentation",
 	[NETIF_F_GSO_UDP_TUNNEL_CSUM_BIT] = "tx-udp_tnl-csum-segmentation",
+	[NETIF_F_GSO_PARTIAL_BIT] =	 "tx-gso-partial",
 
 	[NETIF_F_FCOE_CRC_BIT] =         "tx-checksum-fcoe-crc",
 	[NETIF_F_SCTP_CRC_BIT] =        "tx-checksum-sctp",
diff --git a/net/core/skbuff.c b/net/core/skbuff.c
index d04c2d1c8c87..4cc594cdaada 100644
--- a/net/core/skbuff.c
+++ b/net/core/skbuff.c
@@ -3076,8 +3076,9 @@ struct sk_buff *skb_segment(struct sk_buff *head_skb,
 	struct sk_buff *frag_skb = head_skb;
 	unsigned int offset = doffset;
 	unsigned int tnl_hlen = skb_tnl_header_len(head_skb);
+	unsigned int partial_segs = 0;
 	unsigned int headroom;
-	unsigned int len;
+	unsigned int len = head_skb->len;
 	__be16 proto;
 	bool csum;
 	int sg = !!(features & NETIF_F_SG);
@@ -3094,6 +3095,15 @@ struct sk_buff *skb_segment(struct sk_buff *head_skb,
 
 	csum = !!can_checksum_protocol(features, proto);
 
+	/* GSO partial only requires that we trim off any excess that
+	 * doesn't fit into an MSS sized block, so take care of that
+	 * now.
+	 */
+	if (features & NETIF_F_GSO_PARTIAL) {
+		partial_segs = len / mss;
+		mss *= partial_segs;
+	}
+
 	headroom = skb_headroom(head_skb);
 	pos = skb_headlen(head_skb);
 
@@ -3281,6 +3291,23 @@ perform_csum_check:
 	 */
 	segs->prev = tail;
 
+	/* Update GSO info on first skb in partial sequence. */
+	if (partial_segs) {
+		int type = skb_shinfo(head_skb)->gso_type;
+
+		/* Update type to add partial and then remove dodgy if set */
+		type |= SKB_GSO_PARTIAL;
+		type &= ~SKB_GSO_DODGY;
+
+		/* Update GSO info and prepare to start updating headers on
+		 * our way back down the stack of protocols.
+		 */
+		skb_shinfo(segs)->gso_size = skb_shinfo(head_skb)->gso_size;
+		skb_shinfo(segs)->gso_segs = partial_segs;
+		skb_shinfo(segs)->gso_type = type;
+		SKB_GSO_CB(segs)->data_offset = skb_headroom(segs) + doffset;
+	}
+
 	/* Following permits correct backpressure, for protocols
 	 * using skb_set_owner_w().
 	 * Idea is to tranfert ownership from head_skb to last segment.
diff --git a/net/ipv4/af_inet.c b/net/ipv4/af_inet.c
index 8564cab96189..2e6e65fc4d20 100644
--- a/net/ipv4/af_inet.c
+++ b/net/ipv4/af_inet.c
@@ -1200,7 +1200,7 @@ static struct sk_buff *inet_gso_segment(struct sk_buff *skb,
 	const struct net_offload *ops;
 	unsigned int offset = 0;
 	struct iphdr *iph;
-	int proto;
+	int proto, tot_len;
 	int nhoff;
 	int ihl;
 	int id;
@@ -1219,6 +1219,7 @@ static struct sk_buff *inet_gso_segment(struct sk_buff *skb,
 		       SKB_GSO_UDP_TUNNEL_CSUM |
 		       SKB_GSO_TCP_FIXEDID |
 		       SKB_GSO_TUNNEL_REMCSUM |
+		       SKB_GSO_PARTIAL |
 		       0)))
 		goto out;
 
@@ -1273,10 +1274,21 @@ static struct sk_buff *inet_gso_segment(struct sk_buff *skb,
 			if (skb->next)
 				iph->frag_off |= htons(IP_MF);
 			offset += skb->len - nhoff - ihl;
-		} else if (!fixedid) {
-			iph->id = htons(id++);
+			tot_len = skb->len - nhoff;
+		} else if (skb_is_gso(skb)) {
+			if (!fixedid) {
+				iph->id = htons(id);
+				id += skb_shinfo(skb)->gso_segs;
+			}
+			tot_len = skb_shinfo(skb)->gso_size +
+				  SKB_GSO_CB(skb)->data_offset +
+				  skb->head - (unsigned char *)iph;
+		} else {
+			if (!fixedid)
+				iph->id = htons(id++);
+			tot_len = skb->len - nhoff;
 		}
-		iph->tot_len = htons(skb->len - nhoff);
+		iph->tot_len = htons(tot_len);
 		ip_send_check(iph);
 		if (encap)
 			skb_reset_inner_headers(skb);
diff --git a/net/ipv4/gre_offload.c b/net/ipv4/gre_offload.c
index 6376b0cdf693..20557f211408 100644
--- a/net/ipv4/gre_offload.c
+++ b/net/ipv4/gre_offload.c
@@ -36,7 +36,8 @@ static struct sk_buff *gre_gso_segment(struct sk_buff *skb,
 				  SKB_GSO_GRE |
 				  SKB_GSO_GRE_CSUM |
 				  SKB_GSO_IPIP |
-				  SKB_GSO_SIT)))
+				  SKB_GSO_SIT |
+				  SKB_GSO_PARTIAL)))
 		goto out;
 
 	if (!skb->encapsulation)
@@ -87,7 +88,7 @@ static struct sk_buff *gre_gso_segment(struct sk_buff *skb,
 	skb = segs;
 	do {
 		struct gre_base_hdr *greh;
-		__be32 *pcsum;
+		__sum16 *pcsum;
 
 		/* Set up inner headers if we are offloading inner checksum */
 		if (skb->ip_summed == CHECKSUM_PARTIAL) {
@@ -107,10 +108,25 @@ static struct sk_buff *gre_gso_segment(struct sk_buff *skb,
 			continue;
 
 		greh = (struct gre_base_hdr *)skb_transport_header(skb);
-		pcsum = (__be32 *)(greh + 1);
+		pcsum = (__sum16 *)(greh + 1);
+
+		if (skb_is_gso(skb)) {
+			unsigned int partial_adj;
+
+			/* Adjust checksum to account for the fact that
+			 * the partial checksum is based on actual size
+			 * whereas headers should be based on MSS size.
+			 */
+			partial_adj = skb->len + skb_headroom(skb) -
+				      SKB_GSO_CB(skb)->data_offset -
+				      skb_shinfo(skb)->gso_size;
+			*pcsum = ~csum_fold((__force __wsum)htonl(partial_adj));
+		} else {
+			*pcsum = 0;
+		}
 
-		*pcsum = 0;
-		*(__sum16 *)pcsum = gso_make_checksum(skb, 0);
+		*(pcsum + 1) = 0;
+		*pcsum = gso_make_checksum(skb, 0);
 	} while ((skb = skb->next));
 out:
 	return segs;
diff --git a/net/ipv4/tcp_offload.c b/net/ipv4/tcp_offload.c
index d1ffd55289bd..02737b607aa7 100644
--- a/net/ipv4/tcp_offload.c
+++ b/net/ipv4/tcp_offload.c
@@ -109,6 +109,12 @@ struct sk_buff *tcp_gso_segment(struct sk_buff *skb,
 		goto out;
 	}
 
+	/* GSO partial only requires splitting the frame into an MSS
+	 * multiple and possibly a remainder.  So update the mss now.
+	 */
+	if (features & NETIF_F_GSO_PARTIAL)
+		mss = skb->len - (skb->len % mss);
+
 	copy_destructor = gso_skb->destructor == tcp_wfree;
 	ooo_okay = gso_skb->ooo_okay;
 	/* All segments but the first should have ooo_okay cleared */
@@ -133,7 +139,7 @@ struct sk_buff *tcp_gso_segment(struct sk_buff *skb,
 	newcheck = ~csum_fold((__force __wsum)((__force u32)th->check +
 					       (__force u32)delta));
 
-	do {
+	while (skb->next) {
 		th->fin = th->psh = 0;
 		th->check = newcheck;
 
@@ -153,7 +159,7 @@ struct sk_buff *tcp_gso_segment(struct sk_buff *skb,
 
 		th->seq = htonl(seq);
 		th->cwr = 0;
-	} while (skb->next);
+	}
 
 	/* Following permits TCP Small Queues to work well with GSO :
 	 * The callback to TCP stack will be called at the time last frag
diff --git a/net/ipv4/udp_offload.c b/net/ipv4/udp_offload.c
index 6230cf4b0d2d..097060def7f0 100644
--- a/net/ipv4/udp_offload.c
+++ b/net/ipv4/udp_offload.c
@@ -39,8 +39,11 @@ static struct sk_buff *__skb_udp_tunnel_segment(struct sk_buff *skb,
 	 * 16 bit length field due to the header being added outside of an
 	 * IP or IPv6 frame that was already limited to 64K - 1.
 	 */
-	partial = csum_sub(csum_unfold(uh->check),
-			   (__force __wsum)htonl(skb->len));
+	if (skb_shinfo(skb)->gso_type & SKB_GSO_PARTIAL)
+		partial = (__force __wsum)uh->len;
+	else
+		partial = (__force __wsum)htonl(skb->len);
+	partial = csum_sub(csum_unfold(uh->check), partial);
 
 	/* setup inner skb. */
 	skb->encapsulation = 0;
@@ -89,7 +92,7 @@ static struct sk_buff *__skb_udp_tunnel_segment(struct sk_buff *skb,
 	udp_offset = outer_hlen - tnl_hlen;
 	skb = segs;
 	do {
-		__be16 len;
+		unsigned int len;
 
 		if (remcsum)
 			skb->ip_summed = CHECKSUM_NONE;
@@ -107,14 +110,26 @@ static struct sk_buff *__skb_udp_tunnel_segment(struct sk_buff *skb,
 		skb_reset_mac_header(skb);
 		skb_set_network_header(skb, mac_len);
 		skb_set_transport_header(skb, udp_offset);
-		len = htons(skb->len - udp_offset);
+		len = skb->len - udp_offset;
 		uh = udp_hdr(skb);
-		uh->len = len;
+
+		/* If we are only performing partial GSO the inner header
+		 * will be using a length value equal to only one MSS sized
+		 * segment instead of the entire frame.
+		 */
+		if (skb_is_gso(skb)) {
+			uh->len = htons(skb_shinfo(skb)->gso_size +
+					SKB_GSO_CB(skb)->data_offset +
+					skb->head - (unsigned char *)uh);
+		} else {
+			uh->len = htons(len);
+		}
 
 		if (!need_csum)
 			continue;
 
-		uh->check = ~csum_fold(csum_add(partial, (__force __wsum)len));
+		uh->check = ~csum_fold(csum_add(partial,
+				       (__force __wsum)htonl(len)));
 
 		if (skb->encapsulation || !offload_csum) {
 			uh->check = gso_make_checksum(skb, ~uh->check);
diff --git a/net/ipv6/ip6_offload.c b/net/ipv6/ip6_offload.c
index 061adcda65f3..f5eb184e1093 100644
--- a/net/ipv6/ip6_offload.c
+++ b/net/ipv6/ip6_offload.c
@@ -63,6 +63,7 @@ static struct sk_buff *ipv6_gso_segment(struct sk_buff *skb,
 	int proto;
 	struct frag_hdr *fptr;
 	unsigned int unfrag_ip6hlen;
+	unsigned int payload_len;
 	u8 *prevhdr;
 	int offset = 0;
 	bool encap, udpfrag;
@@ -82,6 +83,7 @@ static struct sk_buff *ipv6_gso_segment(struct sk_buff *skb,
 		       SKB_GSO_UDP_TUNNEL |
 		       SKB_GSO_UDP_TUNNEL_CSUM |
 		       SKB_GSO_TUNNEL_REMCSUM |
+		       SKB_GSO_PARTIAL |
 		       0)))
 		goto out;
 
@@ -118,7 +120,13 @@ static struct sk_buff *ipv6_gso_segment(struct sk_buff *skb,
 
 	for (skb = segs; skb; skb = skb->next) {
 		ipv6h = (struct ipv6hdr *)(skb_mac_header(skb) + nhoff);
-		ipv6h->payload_len = htons(skb->len - nhoff - sizeof(*ipv6h));
+		if (skb_is_gso(skb))
+			payload_len = skb_shinfo(skb)->gso_size +
+				      SKB_GSO_CB(skb)->data_offset +
+				      skb->head - (unsigned char *)(ipv6h + 1);
+		else
+			payload_len = skb->len - nhoff - sizeof(*ipv6h);
+		ipv6h->payload_len = htons(payload_len);
 		skb->network_header = (u8 *)ipv6h - skb->head;
 
 		if (udpfrag) {

^ permalink raw reply related

* [net-next PATCH v2 5/5] Documentation: Add documentation for TSO and GSO features
From: Alexander Duyck @ 2016-04-11  1:45 UTC (permalink / raw)
  To: herbert, tom, jesse, alexander.duyck, edumazet, netdev, davem
In-Reply-To: <20160411013855.11189.56567.stgit@ahduyck-xeon-server>

This document is a starting point for defining the TSO and GSO features.
The whole thing is starting to get a bit messy so I wanted to make sure we
have notes somwhere to start describing what does and doesn't work.

Signed-off-by: Alexander Duyck <aduyck@mirantis.com>
---
 Documentation/networking/segmentation-offloads.txt |  130 ++++++++++++++++++++
 1 file changed, 130 insertions(+)
 create mode 100644 Documentation/networking/segmentation-offloads.txt

diff --git a/Documentation/networking/segmentation-offloads.txt b/Documentation/networking/segmentation-offloads.txt
new file mode 100644
index 000000000000..f200467ade38
--- /dev/null
+++ b/Documentation/networking/segmentation-offloads.txt
@@ -0,0 +1,130 @@
+Segmentation Offloads in the Linux Networking Stack
+
+Introduction
+============
+
+This document describes a set of techniques in the Linux networking stack
+to take advantage of segmentation offload capabilities of various NICs.
+
+The following technologies are described:
+ * TCP Segmentation Offload - TSO
+ * UDP Fragmentation Offload - UFO
+ * IPIP, SIT, GRE, and UDP Tunnel Offloads
+ * Generic Segmentation Offload - GSO
+ * Generic Receive Offload - GRO
+ * Partial Generic Segmentation Offload - GSO_PARTIAL
+
+TCP Segmentation Offload
+========================
+
+TCP segmentation allows a device to segment a single frame into multiple
+frames with a data payload size specified in skb_shinfo()->gso_size.
+When TCP segmentation requested the bit for either SKB_GSO_TCP or
+SKB_GSO_TCP6 should be set in skb_shinfo()->gso_type and
+skb_shinfo()->gso_size should be set to a non-zero value.
+
+TCP segmentation is dependent on support for the use of partial checksum
+offload.  For this reason TSO is normally disabled if the Tx checksum
+offload for a given device is disabled.
+
+In order to support TCP segmentation offload it is necessary to populate
+the network and transport header offsets of the skbuff so that the device
+drivers will be able determine the offsets of the IP or IPv6 header and the
+TCP header.  In addition as CHECKSUM_PARTIAL is required csum_start should
+also point to the TCP header of the packet.
+
+For IPv4 segmentation we support one of two types in terms of the IP ID.
+The default behavior is to increment the IP ID with every segment.  If the
+GSO type SKB_GSO_TCP_FIXEDID is specified then we will not increment the IP
+ID and all segments will use the same IP ID.  If a device has
+NETIF_F_TSO_MANGLEID set then the IP ID can be ignored when performing TSO
+and we will either increment the IP ID for all frames, or leave it at a
+static value based on driver preference.
+
+UDP Fragmentation Offload
+=========================
+
+UDP fragmentation offload allows a device to fragment an oversized UDP
+datagram into multiple IPv4 fragments.  Many of the requirements for UDP
+fragmentation offload are the same as TSO.  However the IPv4 ID for
+fragments should not increment as a single IPv4 datagram is fragmented.
+
+IPIP, SIT, GRE, UDP Tunnel, and Remote Checksum Offloads
+========================================================
+
+In addition to the offloads described above it is possible for a frame to
+contain additional headers such as an outer tunnel.  In order to account
+for such instances an additional set of segmentation offload types were
+introduced including SKB_GSO_IPIP, SKB_GSO_SIT, SKB_GSO_GRE, and
+SKB_GSO_UDP_TUNNEL.  These extra segmentation types are used to identify
+cases where there are more than just 1 set of headers.  For example in the
+case of IPIP and SIT we should have the network and transport headers moved
+from the standard list of headers to "inner" header offsets.
+
+Currently only two levels of headers are supported.  The convention is to
+refer to the tunnel headers as the outer headers, while the encapsulated
+data is normally referred to as the inner headers.  Below is the list of
+calls to access the given headers:
+
+IPIP/SIT Tunnel:
+		Outer			Inner
+MAC		skb_mac_header
+Network		skb_network_header	skb_inner_network_header
+Transport	skb_transport_header
+
+UDP/GRE Tunnel:
+		Outer			Inner
+MAC		skb_mac_header		skb_inner_mac_header
+Network		skb_network_header	skb_inner_network_header
+Transport	skb_transport_header	skb_inner_transport_header
+
+In addition to the above tunnel types there are also SKB_GSO_GRE_CSUM and
+SKB_GSO_UDP_TUNNEL_CSUM.  These two additional tunnel types reflect the
+fact that the outer header also requests to have a non-zero checksum
+included in the outer header.
+
+Finally there is SKB_GSO_REMCSUM which indicates that a given tunnel header
+has requested a remote checksum offload.  In this case the inner headers
+will be left with a partial checksum and only the outer header checksum
+will be computed.
+
+Generic Segmentation Offload
+============================
+
+Generic segmentation offload is a pure software offload that is meant to
+deal with cases where device drivers cannot perform the offloads described
+above.  What occurs in GSO is that a given skbuff will have its data broken
+out over multiple skbuffs that have been resized to match the MSS provided
+via skb_shinfo()->gso_size.
+
+Before enabling any hardware segmentation offload a corresponding software
+offload is required in GSO.  Otherwise it becomes possible for a frame to
+be re-routed between devices and end up being unable to be transmitted.
+
+Generic Receive Offload
+=======================
+
+Generic receive offload is the complement to GSO.  Ideally any frame
+assembled by GRO should be segmented to create an identical sequence of
+frames using GSO, and any sequence of frames segmented by GSO should be
+able to be reassembled back to the original by GRO.  The only exception to
+this is IPv4 ID in the case that the DF bit is set for a given IP header.
+If the value of the IPv4 ID is not sequentially incrementing it will be
+altered so that it is when a frame assembled via GRO is segmented via GSO.
+
+Partial Generic Segmentation Offload
+====================================
+
+Partial generic segmentation offload is a hybrid between TSO and GSO.  What
+it effectively does is take advantage of certain traits of TCP and tunnels
+so that instead of having to rewrite the packet headers for each segment
+only the inner-most transport header and possibly the outer-most network
+header need to be updated.  This allows devices that do not support tunnel
+offloads or tunnel offloads with checksum to still make use of segmentation.
+
+With the partial offload what occurs is that all headers excluding the
+inner transport header are updated such that they will contain the correct
+values for if the header was simply duplicated.  The one exception to this
+is the outer IPv4 ID field.  It is up to the device drivers to guarantee
+that the IPv4 ID field is incremented in the case that a given header does
+not have the DF bit set.

^ permalink raw reply related

* Re: [PATCH v2] sctp: avoid refreshing heartbeat timer too often
From: David Miller @ 2016-04-11  2:23 UTC (permalink / raw)
  To: marcelo.leitner; +Cc: netdev, nhorman, vyasevich, David.Laight, linux-sctp
In-Reply-To: <7aa05d5d8b9100fa48798b43887de6d14d577eac.1459966346.git.marcelo.leitner@gmail.com>

From: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com>
Date: Wed,  6 Apr 2016 15:15:19 -0300

> Currently on high rate SCTP streams the heartbeat timer refresh can
> consume quite a lot of resources as timer updates are costly and it
> contains a random factor, which a) is also costly and b) invalidates
> mod_timer() optimization for not editing a timer to the same value.
> It may even cause the timer to be slightly advanced, for no good reason.
> 
> As suggested by David Laight this patch now removes this timer update
> from hot path by leaving the timer on and re-evaluating upon its
> expiration if the heartbeat is still needed or not, similarly to what is
> done for TCP. If it's not needed anymore the timer is re-scheduled to
> the new timeout, considering the time already elapsed.
> 
> For this, we now record the last tx timestamp per transport, updated in
> the same spots as hb timer was restarted on tx. Also split up
> sctp_transport_reset_timers into sctp_transport_reset_t3_rtx and
> sctp_transport_reset_hb_timer, so we can re-arm T3 without re-arming the
> heartbeat one.
> 
> On loopback with MTU of 65535 and data chunks with 1636, so that we
> have a considerable amount of chunks without stressing system calls,
> netperf -t SCTP_STREAM -l 30, perf looked like this before:
 . ..
> And after this patch, now with netperf -l 60:
 ...
> Throughput-wise, from 6800mbps without the patch to 7050mbps with it,
> ~3.7%.
> 
> Signed-off-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com>

Applied, thanks Marcelo.

^ permalink raw reply

* Re: [PATCH 1/3] bonding: do not allow rlb updates to invalid mac
From: David Miller @ 2016-04-11  2:26 UTC (permalink / raw)
  To: dbanerje; +Cc: j.vosburgh, vfalico, gospo, netdev, linux-kernel
In-Reply-To: <1459974275-24944-1-git-send-email-dbanerje@akamai.com>

Please resubmit this patch series with a proper cover letter.

It should have "[PATCH 0/3] ..." as the subject line and explain
at a high level what your patch series is doing, how it is doing
it, and why it is doing it that way.

You must also be explicit about which of my trees your changes
are targetting.

Thanks.

^ permalink raw reply

page: next (older) | prev (newer) | latest
- recent:[subjects (threaded)|topics (new)|topics (active)]

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox