Netdev List


* Re: [PATCH v3 net-next 3/3] openvswitch: Fix skb->protocol for vlan frames.
From: Pravin Shelar @ 2016-12-01 20:31 UTC
  To: Jiri Benc; +Cc: Jarno Rajahalme, Linux Kernel Network Developers, Eric Garver
In-Reply-To: <20161130153041.7a9590ef@griffin>

On Wed, Nov 30, 2016 at 6:30 AM, Jiri Benc <jbenc@redhat.com> wrote:
> On Tue, 29 Nov 2016 15:30:53 -0800, Jarno Rajahalme wrote:
>> Do not always set skb->protocol to be the ethertype of the L3 header.
>> For a packet with non-accelerated VLAN tags skb->protocol needs to be
>> the ethertype of the outermost non-accelerated VLAN ethertype.
>
> Well, the current handling of skb->protocol matches what used to be the
> handling of the kernel net stack before Jiri Pirko cleaned up the vlan
> code.
>
> I'm not opposed to changing this but I'm afraid it needs much deeper
> review. Because with this in place, no core kernel functions that
> depend on skb->protocol may be called from within openvswitch.
>
Can you give a specific example where it does not work?

>> @@ -361,6 +362,11 @@ static int parse_vlan(struct sk_buff *skb, struct sw_flow_key *key)
>>       if (res <= 0)
>>               return res;
>>
>> +     /* If the outer vlan tag was accelerated, skb->protocol should
>> +      * refelect the inner vlan type. */
>> +     if (!eth_type_vlan(skb->protocol))
>> +             skb->protocol = key->eth.cvlan.tpid;
>
> This should not depend on the current value in skb->protocol which
> could be arbitrary at this point (from the point of view of how this
> patch understands the skb->protocol values). It's easy to fix, though -
> just add a local bool variable tracking whether the skb->protocol has
> been set.
>
The skb->protocol value is set by the caller, so it should not be
arbitrary. Is it missing in any case?


* Re: [PATCH net-next 5/6] net: dsa: mv88e6xxx: add helper for switch ready
From: Vivien Didelot @ 2016-12-01 20:31 UTC
  To: Andrew Lunn
  Cc: netdev, linux-kernel, kernel, David S. Miller, Florian Fainelli
In-Reply-To: <20161130233810.GT21645@lunn.ch>

Hi Andrew,

Andrew Lunn <andrew@lunn.ch> writes:

> As we have seen in the past, this sort of loop is broken if we end up
> sleeping for a long time. Please take the opportunity to replace it
> with one of our _wait() helpers, e.g. mv88e6xxx_g1_wait()

That won't work. The _wait() helpers are made to wait on self-clear (SC)
bits, i.e. looping until they are cleared to zero.

Here we want the opposite.

I will keep this existing wait loop for the moment and work soon on a
new patchset to rework the wait routines. We need a generic access to
test a given value against a given mask and wrappers for busy bits, etc.
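
The generic accessor described here (test a value against a mask, with
wrappers for busy bits) could look roughly like the userspace sketch
below. Every name in it (reg_wait_mask, reg_read_fn, the fake chip) is
hypothetical and not part of the mv88e6xxx driver; a real driver would
read over MDIO and sleep between polls rather than spin.

```c
#include <errno.h>
#include <stdint.h>

/* Hypothetical register accessor: returns 0 on success and stores the value. */
typedef int (*reg_read_fn)(void *ctx, int reg, uint16_t *val);

/*
 * Poll `reg` until (value & mask) == expected, or give up after `attempts`
 * reads. This covers both cases from the thread: waiting for a self-clearing
 * busy bit (expected == 0) and waiting for a ready bit to be set
 * (expected == mask).
 */
static int reg_wait_mask(void *ctx, reg_read_fn rd, int reg,
			 uint16_t mask, uint16_t expected, int attempts)
{
	uint16_t val;
	int err;
	int i;

	for (i = 0; i < attempts; i++) {
		err = rd(ctx, reg, &val);
		if (err)
			return err;
		if ((val & mask) == expected)
			return 0;
		/* A real driver would sleep here instead of busy-looping. */
	}
	return -ETIMEDOUT;
}

/* Fake device for illustration: the ready bit shows up on the third read. */
#define FAKE_INIT_READY 0x0800	/* bit 11, as in the quoted comment */

struct fake_chip {
	int reads;
};

static int fake_read(void *ctx, int reg, uint16_t *val)
{
	struct fake_chip *chip = ctx;

	(void)reg;
	chip->reads++;
	*val = chip->reads >= 3 ? FAKE_INIT_READY : 0;
	return 0;
}
```

With this shape, a busy-bit wrapper passes expected = 0 and an
InitReady-style wrapper passes expected = mask, so one loop serves both.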

>> +int mv88e6xxx_g1_init_ready(struct mv88e6xxx_chip *chip, bool *ready)
>> +{
>> +	u16 val;
>> +	int err;
>> +
>> +	/* Check the value of the InitReady bit 11 */
>> +	err = mv88e6xxx_g1_read(chip, GLOBAL_STATUS, &val);
>> +	if (err)
>> +		return err;
>> +
>> +	*ready = !!(val & GLOBAL_STATUS_INIT_READY);
>
> I would actually do the wait here.

That is better indeed.

Thanks,

        Vivien


* Re: [PATCH net] tcp: warn on bogus MSS and try to amend it
From: David Miller @ 2016-12-01 20:29 UTC
  To: marcelo.leitner
  Cc: netdev, jmaxwell37, alexandre.sidorenko, kuznet, jmorris,
	yoshfuji, kaber, tlfalcon, brking, eric.dumazet
In-Reply-To: <0d41deb00d57206f518e6bffae1b0be355bbc726.1480511277.git.marcelo.leitner@gmail.com>

From: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com>
Date: Wed, 30 Nov 2016 11:14:32 -0200

> There have been some reports lately about TCP connection stalls caused
> by NIC drivers that aren't setting gso_size on aggregated packets on rx
> path. This causes TCP to assume that the MSS is actually the size of the
> aggregated packet, which is invalid.
> 
> Although the proper fix is to be done at each driver, it's often hard
> and cumbersome for one to debug, come to such root cause and report/fix
> it.
> 
> This patch amends this situation in two ways. First, it adds a warning
> on when this situation occurs, so it gives a hint to those trying to
> debug this. It also limits the maximum probed MSS to the advertised MSS,
> as it should never be any higher than that.
> 
> The result is that the connection may not have the best performance ever
> but it shouldn't stall, and the admin will have a hint on what to look
> for.
> 
> Tested with virtio by forcing gso_size to 0.
> 
> Cc: Jonathan Maxwell <jmaxwell37@gmail.com>
> Signed-off-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com>

I totally agree with this change, however I think the warning message can
be improved in two ways:

>  	len = skb_shinfo(skb)->gso_size ? : skb->len;
>  	if (len >= icsk->icsk_ack.rcv_mss) {
> -		icsk->icsk_ack.rcv_mss = len;
> +		icsk->icsk_ack.rcv_mss = min_t(unsigned int, len,
> +					       tcp_sk(sk)->advmss);
> +		if (icsk->icsk_ack.rcv_mss != len)
> +			pr_warn_once("Seems your NIC driver is doing bad RX acceleration. TCP performance may be compromised.\n");

We know it's a bad GRO implementation that causes this so let's be specific in the
message, perhaps something like:

	Driver has suspect GRO implementation, TCP performance may be compromised.

Also, we have skb->dev available here most likely, so prefixing the message with
skb->dev->name would make analyzing this situation even easier for someone hitting
this.

I'm not certain if an skb->dev==NULL check is necessary here or not, but it is
definitely something you need to consider.
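
To make the two suggestions concrete, here is a small userspace sketch
of the clamp plus a device-prefixed warning with a NULL guard. The
struct and function names (fake_skb, fake_dev, measure_rcv_mss) are
simplified stand-ins invented for illustration, not the kernel
definitions, and the "Unknown" fallback is only one possible choice.

```c
#include <stddef.h>
#include <stdio.h>

/* Simplified stand-ins for the kernel structures; not the real layouts. */
struct fake_dev {
	const char *name;
};

struct fake_skb {
	unsigned int len;
	unsigned int gso_size;
	const struct fake_dev *dev;	/* may be NULL */
};

/*
 * Returns the updated rcv_mss. If the measured length had to be clamped to
 * the advertised MSS, a warning naming the device is written to `warn`;
 * otherwise `warn` is left as an empty string.
 */
static unsigned int measure_rcv_mss(const struct fake_skb *skb,
				    unsigned int rcv_mss, unsigned int advmss,
				    char *warn, size_t warn_len)
{
	unsigned int len = skb->gso_size ? skb->gso_size : skb->len;

	if (warn_len)
		warn[0] = '\0';
	if (len < rcv_mss)
		return rcv_mss;	/* this sketch leaves the smaller-len path alone */

	rcv_mss = len < advmss ? len : advmss;
	if (rcv_mss != len)
		snprintf(warn, warn_len,
			 "%s: Driver has suspect GRO implementation, TCP performance may be compromised.",
			 skb->dev ? skb->dev->name : "Unknown");
	return rcv_mss;
}
```

An aggregated packet with gso_size left at 0 and a length far above the
advertised MSS takes the clamped path and produces the warning; a
normal full-sized segment does not.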

Thanks!


* [PATCH -next] net: ethernet: ti: davinci_cpdma: add missing EXPORTs
From: Paul Gortmaker @ 2016-12-01 20:25 UTC
  To: David S. Miller
  Cc: Paul Gortmaker, Ivan Khoronzhuk, Mugunthan V N, Grygorii Strashko,
	linux-omap, netdev

As of commit 8f32b90981dcdb355516fb95953133f8d4e6b11d
("net: ethernet: ti: davinci_cpdma: add set rate for a channel") the
ARM allmodconfig builds would fail modpost with:

ERROR: "cpdma_chan_set_weight" [drivers/net/ethernet/ti/ti_cpsw.ko] undefined!
ERROR: "cpdma_chan_get_rate" [drivers/net/ethernet/ti/ti_cpsw.ko] undefined!
ERROR: "cpdma_chan_get_min_rate" [drivers/net/ethernet/ti/ti_cpsw.ko] undefined!
ERROR: "cpdma_chan_set_rate" [drivers/net/ethernet/ti/ti_cpsw.ko] undefined!

Since these weren't declared as static, it is assumed they were
meant to be shared outside the file, and that modular build testing
was simply overlooked.

Fixes: 8f32b90981dc ("net: ethernet: ti: davinci_cpdma: add set rate for a channel")
Cc: Ivan Khoronzhuk <ivan.khoronzhuk@linaro.org>
Cc: Mugunthan V N <mugunthanvnm@ti.com>
Cc: Grygorii Strashko <grygorii.strashko@ti.com>
Cc: linux-omap@vger.kernel.org
Cc: netdev@vger.kernel.org
Signed-off-by: Paul Gortmaker <paul.gortmaker@windriver.com>
---
 drivers/net/ethernet/ti/davinci_cpdma.c | 4 ++++
 1 file changed, 4 insertions(+)

diff --git a/drivers/net/ethernet/ti/davinci_cpdma.c b/drivers/net/ethernet/ti/davinci_cpdma.c
index c776e4575d2d..36518fc5c7cc 100644
--- a/drivers/net/ethernet/ti/davinci_cpdma.c
+++ b/drivers/net/ethernet/ti/davinci_cpdma.c
@@ -796,6 +796,7 @@ int cpdma_chan_set_weight(struct cpdma_chan *ch, int weight)
 	spin_unlock_irqrestore(&ctlr->lock, flags);
 	return ret;
 }
+EXPORT_SYMBOL_GPL(cpdma_chan_set_weight);
 
 /* cpdma_chan_get_min_rate - get minimum allowed rate for channel
  * Should be called before cpdma_chan_set_rate.
@@ -810,6 +811,7 @@ u32 cpdma_chan_get_min_rate(struct cpdma_ctlr *ctlr)
 
 	return DIV_ROUND_UP(divident, divisor);
 }
+EXPORT_SYMBOL_GPL(cpdma_chan_get_min_rate);
 
 /* cpdma_chan_set_rate - limits bandwidth for transmit channel.
  * The bandwidth * limited channels have to be in order beginning from lowest.
@@ -853,6 +855,7 @@ int cpdma_chan_set_rate(struct cpdma_chan *ch, u32 rate)
 	spin_unlock_irqrestore(&ctlr->lock, flags);
 	return ret;
 }
+EXPORT_SYMBOL_GPL(cpdma_chan_set_rate);
 
 u32 cpdma_chan_get_rate(struct cpdma_chan *ch)
 {
@@ -865,6 +868,7 @@ u32 cpdma_chan_get_rate(struct cpdma_chan *ch)
 
 	return rate;
 }
+EXPORT_SYMBOL_GPL(cpdma_chan_get_rate);
 
 struct cpdma_chan *cpdma_chan_create(struct cpdma_ctlr *ctlr, int chan_num,
 				     cpdma_handler_fn handler, int rx_type)
-- 
2.11.0


* Re: [RFC PATCH net-next] ipv6: implement consistent hashing for equal-cost multipath routing
From: David Miller @ 2016-12-01 20:26 UTC
  To: hannes; +Cc: david.lebrun, netdev
In-Reply-To: <1480511568.3649771.803688521.5B47BE8F@webmail.messagingengine.com>

From: Hannes Frederic Sowa <hannes@stressinduktion.org>
Date: Wed, 30 Nov 2016 14:12:48 +0100

> David, one question: do you remember if you measured with linked lists
> at that time or also with arrays. I actually would expect small arrays
> that entirely fit into cachelines to be actually faster than our current
> approach, which also walks a linked list, probably the best algorithm to
> trash cache lines. I ask because I currently prefer this approach more
> than having large allocations in the O(1) case because of easier code
> and easier management.

I did not try this and I do agree with you that for extremely small table
sizes a list or array would perform better because of the cache behavior.


* Re: [PATCH] stmmac: simplify flag assignment
From: David Miller @ 2016-12-01 20:23 UTC
  To: pavel; +Cc: peppe.cavallaro, netdev, linux-kernel
In-Reply-To: <20161130114431.GB14296@amd>

From: Pavel Machek <pavel@ucw.cz>
Date: Wed, 30 Nov 2016 12:44:31 +0100

> 
> Simplify flag assignment.
>     
> Signed-off-by: Pavel Machek <pavel@denx.de>
> 
> diff --git a/drivers/net/ethernet/stmicro/stmmac/stmmac_main.c b/drivers/net/ethernet/stmicro/stmmac/stmmac_main.c
> index ed20668..0b706a7 100644
> --- a/drivers/net/ethernet/stmicro/stmmac/stmmac_main.c
> +++ b/drivers/net/ethernet/stmicro/stmmac/stmmac_main.c
> @@ -2771,12 +2771,8 @@ static netdev_features_t stmmac_fix_features(struct net_device *dev,
>  		features &= ~NETIF_F_CSUM_MASK;
>  
>  	/* Disable tso if asked by ethtool */
> -	if ((priv->plat->tso_en) && (priv->dma_cap.tsoen)) {
> -		if (features & NETIF_F_TSO)
> -			priv->tso = true;
> -		else
> -			priv->tso = false;
> -	}
> +	if ((priv->plat->tso_en) && (priv->dma_cap.tsoen))
> +		priv->tso = !!(features & NETIF_F_TSO);
>  

Pavel, this really seems arbitrary.

Whilst I really appreciate you're looking into this driver a bit because
of some issues you are trying to resolve, I'd like to ask that you not
start bombarding me with nit-pick cleanups here and there and instead
concentrate on the real bug or issue.

Thanks in advance.


* [PATCH iproute2 1/1] tc: updated man page to reflect handle-id use in filter GET command.
From: Roman Mashak @ 2016-12-01 20:20 UTC
  To: stephen; +Cc: netdev, sathya.perla, Roman Mashak

Signed-off-by: Roman Mashak <mrv@mojatatu.com>
---
 man/man8/tc.8 | 6 ++++--
 1 file changed, 4 insertions(+), 2 deletions(-)

diff --git a/man/man8/tc.8 b/man/man8/tc.8
index 8a47a2b..d957ffa 100644
--- a/man/man8/tc.8
+++ b/man/man8/tc.8
@@ -32,7 +32,9 @@ class-id ] qdisc
 DEV
 .B [ parent
 qdisc-id
-.B | root ] protocol
+.B | root ] [ handle
+handle-id ]
+.B protocol
 protocol
 .B prio
 priority filtertype
@@ -577,7 +579,7 @@ it is created.
 
 .TP
 get
-Displays a single filter given the interface, parent ID, priority, protocol and handle ID.
+Displays a single filter given the interface, qdisc-id, priority, protocol and handle-id.
 
 .TP
 show
-- 
1.9.1


* Re: [RFC net-next 0/3] net: bridge: Allow CPU port configuration
From: Florian Fainelli @ 2016-12-01 20:21 UTC
  To: Ido Schimmel; +Cc: idosch, andrew, vivien.didelot, netdev, bridge, jiri, davem
In-Reply-To: <20161123134856.cwk6sznnwa7p4xtq@splinter.mtl.com>

On 11/23/2016 05:48 AM, Ido Schimmel wrote:
> Hi Florian,
> 
> On Tue, Nov 22, 2016 at 09:56:30AM -0800, Florian Fainelli wrote:
>> On 11/22/2016 09:41 AM, Ido Schimmel wrote:
>>> Hi Florian,
>>>
>>> On Mon, Nov 21, 2016 at 11:09:22AM -0800, Florian Fainelli wrote:
>>>> Hi all,
>>>>
>>>> This patch series allows using the bridge master interface to configure
>>>> an Ethernet switch port's CPU/management port with different VLAN attributes than
>>>> those of the bridge downstream ports/members.
>>>>
>>>> Jiri, Ido, Andrew, Vivien, please review the impact on mlxsw and mv88e6xxx, I
>>>> tested this with b53 and a mockup DSA driver.
>>>
>>> We'll need to add a check in mlxsw and ignore any VLAN configuration for
>>> the bridge device itself. Otherwise, any configuration done on br0 will
>>> be propagated to all of its slaves, which is incorrect.
>>>
>>>>
>>>> Open questions:
>>>>
>>>> - if we have more than one bridge on top of a physical switch, the driver
>>>>   should keep track of that and verify that we are not going to change
>>>>   the CPU port VLAN attributes in a way that results in incompatible settings
>>>>   to be applied
>>>>
>>>> - if the default behavior is to have all VLANs associated with the CPU port
>>>>   be ingressing/egressing tagged to the CPU, is this really useful?
>>>
>>> First of all, I want to be sure that when we say "CPU port", we're
>>> talking about the same thing. In mlxsw, the CPU port is a pipe between
>>> the device and the host, through which all packets trapped to the host
>>> go through. So, when a packet is trapped, the driver reads its Rx
>>> descriptor, checks through which port it ingressed, resolves its netdev,
>>> sets skb->dev accordingly and injects it to the Rx path via
>>> netif_receive_skb(). The CPU port itself isn't represented using a
>>> netdev.
>>
>> In the case of DSA, the CPU port is a normal Ethernet MAC driver, but in
>> principle, this driver plus the DSA tag protocol hook do exactly the same
>> things as you just describe.
> 
> Thanks for the detailed explanation! I also took the time to read
> dsa.txt, however I still don't quite understand the motivation for
> VLAN filtering on the CPU port. In which cases would you like to prevent
> packets from going to the host due to their VLAN header? This change
> would make sense to me if you only had a limited number of VLANs you
> could enable on the CPU port, but from your response I understand that
> this isn't the case.

It's not much about VLAN filtering per-se, but more about the default
VLAN membership of the CPU port, in the absence of any explicit
configuration. As a user, I find it a little inconvenient to have to
create one VLAN interface per VLAN that I am adding to the bridge to be
able to terminate that traffic properly towards the host/CPU/management
interface, especially when this VLAN is untagged.

This is really the motivation for these patches: if there is only one
VLAN configured, and it's the default VLAN for all ports, then the
bridge master interface also terminates this VLAN with the same
properties as those added by default (typically with default_pvid: VID 1
untagged, unless changed of course).

If you don't want that as a user, you now have the ability to change
it, and make this VLAN (or any other for that matter) to be terminated
differently at the host/CPU/management port level than how it is
egressing at the downstream ports part of that VLAN too.

Maybe it's a bit overkill...

> 
> FWIW, I don't have a problem with patches if they are useful for you,
> I'm just trying to understand the use case. We can easily patch mlxsw to
> ignore calls with orig_dev=br0.

OK, if I resubmit, I will try to take care of mlxsw and rocker as well.

Thanks!
-- 
Florian


* Re: [WIP] net+mlx4: auto doorbell
From: David Miller @ 2016-12-01 20:20 UTC
  To: eric.dumazet; +Cc: brouer, saeedm, rick.jones2, netdev, saeedm, tariqt
In-Reply-To: <1480611857.18162.319.camel@edumazet-glaptop3.roam.corp.google.com>

From: Eric Dumazet <eric.dumazet@gmail.com>
Date: Thu, 01 Dec 2016 09:04:17 -0800

> On Thu, 2016-12-01 at 17:04 +0100, Jesper Dangaard Brouer wrote:
> 
>> When qdisc layer or trafgen/af_packet see this indication it knows it
>> should/must flush the queue when it doesn't have more work left.  Perhaps
>> through net_tx_action(), by registering itself and e.g. if qdisc_run()
>> is called and queue is empty then check if queue needs a flush. I would
>> also allow driver to flush and clear this bit.
> 
> net_tx_action() is not normally called, unless BQL limit is hit and/or
> some qdiscs with throttling (HTB, TBF, FQ, ...)

The one thing I wonder about is whether we should "ramp up" into a mode
where the NAPI poll does the doorbells instead of going directly there.

Maybe I misunderstand your algorithm, but it looks to me like if there
are any active packets in the TX queue at enqueue time you will defer
the doorbell to the interrupt handler.

Let's say we put 1 packet in, and hit the doorbell.

Then another packet comes in and we defer the doorbell to the IRQ.

At this point there are a couple things I'm unclear about.

For example, if we didn't hit the doorbell, will the chip still take a
peek at the second descriptor?  Depending upon how the doorbell works
it might, or it might not.

Either way, wouldn't there be a possible condition where the chip
wouldn't see the second enqueued packet and we'd thus have the wire
idle until the interrupt + NAPI runs and hits the doorbell?

This is why I think we should "ramp up" the doorbell deferral, in
order to avoid this potential wire idle time situation.

Maybe the situation I'm worried about is not possible, so please
explain it to me :-)
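
One possible reading of the "ramp up" idea, sketched in userspace C:
keep ringing the doorbell directly until the ring has been found busy
on several consecutive enqueues, and only then start deferring to the
completion path. The thread does not specify an algorithm, so the
struct, the function names and the DEFER_THRESHOLD value below are all
assumptions made for illustration, not mlx4 code.

```c
#include <stdbool.h>

#define DEFER_THRESHOLD 4	/* hypothetical tuning value */

struct tx_ring_state {
	unsigned int inflight;		/* descriptors announced to the NIC, not yet completed */
	unsigned int pending;		/* descriptors enqueued but not yet announced */
	unsigned int busy_streak;	/* consecutive enqueues that found inflight > 0 */
};

/* Enqueue one descriptor. Returns true if the doorbell is rung right away. */
static bool tx_enqueue(struct tx_ring_state *r)
{
	if (r->inflight > 0)
		r->busy_streak++;
	else
		r->busy_streak = 0;

	r->pending++;

	/* Defer only once the ring has been busy for a while. */
	if (r->busy_streak >= DEFER_THRESHOLD)
		return false;

	r->inflight += r->pending;
	r->pending = 0;
	return true;
}

/*
 * Completion path (the IRQ/NAPI side). Returns true if a deferred doorbell
 * had to be rung for descriptors that were enqueued without one.
 */
static bool tx_complete(struct tx_ring_state *r, unsigned int done)
{
	if (done > r->inflight)
		done = r->inflight;
	r->inflight -= done;

	if (r->pending == 0)
		return false;

	r->inflight += r->pending;
	r->pending = 0;
	return true;
}
```

Under this scheme the "1 packet in, then a second packet" case from
the message above still rings the doorbell for the second packet;
deferral starts only after the streak reaches the threshold. Whether
that is enough to avoid wire idle time depends on how the hardware
fetches descriptors, which the sketch does not model.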


* Re: Initial thoughts on TXDP
From: Tom Herbert @ 2016-12-01 20:18 UTC
  To: Rick Jones; +Cc: Sowmini Varadhan, Linux Kernel Network Developers
In-Reply-To: <aac93b13-6298-b9eb-7f3c-b074f22c388c@hpe.com>

On Thu, Dec 1, 2016 at 11:48 AM, Rick Jones <rick.jones2@hpe.com> wrote:
> On 12/01/2016 11:05 AM, Tom Herbert wrote:
>>
>> For the GSO and GRO the rationale is that performing the extra SW
>> processing to do the offloads is significantly less expensive than
>> running each packet through the full stack. This is true in a
>> multi-layered generalized stack. In TXDP, however, we should be able
>> to optimize the stack data path such that that would no longer be
>> true. For instance, if we can process the packets received on a
>> connection quickly enough so that it's about the same or just a little
>> more costly than GRO processing then we might bypass GRO entirely.
>> TSO is probably still relevant in TXDP since it reduces overheads
>> processing TX in the device itself.
>
>
> Just how much per-packet path-length are you thinking will go away under the
> likes of TXDP?  It is admittedly "just" netperf but losing TSO/GSO does some
> non-trivial things to effective overhead (service demand) and so throughput:
>
For plain in-order TCP packets I believe we should be able to process
each packet at nearly the same speed as GRO. Most of the protocol
processing we do between GRO and the stack is the same; the
difference is that we need to do a connection lookup in the stack
path (note we now do this in UDP GRO and that hasn't shown up as a
major hit). We also need to consider enqueue/dequeue on the socket,
which is a major reason to try for lockless sockets in this instance.

> stack@np-cp1-c0-m1-mgmt:~/rjones2$ ./netperf -c -H np-cp1-c1-m3-mgmt -- -P 12867
> MIGRATED TCP STREAM TEST from 0.0.0.0 (0.0.0.0) port 12867 AF_INET to np-cp1-c1-m3-mgmt () port 12867 AF_INET : demo
> Recv   Send    Send                          Utilization       Service Demand
> Socket Socket  Message  Elapsed              Send     Recv     Send    Recv
> Size   Size    Size     Time     Throughput  local    remote   local   remote
> bytes  bytes   bytes    secs.    10^6bits/s  % S      % U      us/KB   us/KB
>
>  87380  16384  16384    10.00      9260.24   2.02     -1.00    0.428   -1.000
> stack@np-cp1-c0-m1-mgmt:~/rjones2$ sudo ethtool -K hed0 tso off gso off
> stack@np-cp1-c0-m1-mgmt:~/rjones2$ ./netperf -c -H np-cp1-c1-m3-mgmt -- -P 12867
> MIGRATED TCP STREAM TEST from 0.0.0.0 (0.0.0.0) port 12867 AF_INET to np-cp1-c1-m3-mgmt () port 12867 AF_INET : demo
> Recv   Send    Send                          Utilization       Service Demand
> Socket Socket  Message  Elapsed              Send     Recv     Send    Recv
> Size   Size    Size     Time     Throughput  local    remote   local   remote
> bytes  bytes   bytes    secs.    10^6bits/s  % S      % U      us/KB   us/KB
>
>  87380  16384  16384    10.00      5621.82   4.25     -1.00    1.486   -1.000
>
> And that is still with the stretch-ACKs induced by GRO at the receiver.
>
Sure, but try running something that emulates a more realistic workload
than a TCP stream, like an RR test with relatively small payloads and many
connections.

> Losing GRO has quite similar results:
> stack@np-cp1-c0-m1-mgmt:~/rjones2$ ./netperf -c -H np-cp1-c1-m3-mgmt -t TCP_MAERTS -- -P 12867
> MIGRATED TCP MAERTS TEST from 0.0.0.0 (0.0.0.0) port 12867 AF_INET to np-cp1-c1-m3-mgmt () port 12867 AF_INET : demo
> Recv   Send    Send                          Utilization       Service Demand
> Socket Socket  Message  Elapsed              Recv     Send     Recv    Send
> Size   Size    Size     Time     Throughput  local    remote   local   remote
> bytes  bytes   bytes    secs.    10^6bits/s  % S      % U      us/KB   us/KB
>
>  87380  16384  16384    10.00      9154.02   4.00     -1.00    0.860   -1.000
> stack@np-cp1-c0-m1-mgmt:~/rjones2$ sudo ethtool -K hed0 gro off
>
> stack@np-cp1-c0-m1-mgmt:~/rjones2$ ./netperf -c -H np-cp1-c1-m3-mgmt -t TCP_MAERTS -- -P 12867
> MIGRATED TCP MAERTS TEST from 0.0.0.0 (0.0.0.0) port 12867 AF_INET to np-cp1-c1-m3-mgmt () port 12867 AF_INET : demo
> Recv   Send    Send                          Utilization       Service Demand
> Socket Socket  Message  Elapsed              Recv     Send     Recv    Send
> Size   Size    Size     Time     Throughput  local    remote   local   remote
> bytes  bytes   bytes    secs.    10^6bits/s  % S      % U      us/KB   us/KB
>
>  87380  16384  16384    10.00      4212.06   5.36     -1.00    2.502   -1.000
>
> I'm sure there is a very non-trivial "it depends" component here - netperf
> will get the peak benefit from *SO and so one will see the peak difference
> in service demands - but even if one gets only 6 segments per *SO that is a
> lot of path-length to make-up.
>
True, but I think there's a lot of path we'll be able to cut out. In
this mode we don't need IPtables, Netfilter, input route, IPvlan
check, or other similar lookups. Once we've successfully matched an
established TCP state, anything related to policy on both TX and RX for
that connection is inferred from the state. We want the processing
path in this case to be concerned only with protocol processing
and the interface to the user.

> 4.4 kernel, BE3 NICs ... E5-2640 0 @ 2.50GHz
>
> And even if one does have the CPU cycles to burn so to speak, the effect on
> power consumption needs to be included in the calculus.
>
Definitely, power consumption is the downside of spin polling CPUs.
As I said, we should never be spinning any more CPUs than
necessary to handle the load.

Tom

> happy benchmarking,
>
> rick jones


* Re: Initial thoughts on TXDP
From: Sowmini Varadhan @ 2016-12-01 20:13 UTC
  To: Tom Herbert; +Cc: Linux Kernel Network Developers
In-Reply-To: <CALx6S35DCyi_2z1pqCLaB1bVyNykP_J3YaYEXUT8xxmuzyBDwA@mail.gmail.com>

On (12/01/16 11:05), Tom Herbert wrote:
> 
> Polling does not necessarily imply that networking monopolizes the CPU
> except when the CPU is otherwise idle. Presumably the application
> drives the polling when it is ready to receive work.

I'm not grokking that- "if the cpu is idle, we want to busy-poll
and make it 0% idle"?  Keeping CPU 0% idle has all sorts
of issues, see slide 20 of
 http://www.slideshare.net/shemminger/dpdk-performance

> > and one other critical difference from the hot-potato-forwarding
> > model (the sort of OVS model that DPDK etc might arguably be a fit for)
> > does not apply: in order to figure out the ethernet and IP headers
> > in the response correctly at all times (in the face of things like VRRP,
> > gw changes, gw's mac addr changes etc) the application should really
> > be listening on NETLINK sockets for modifications to the networking
> > state - again points to needing a select() socket set where you can
> > have both the I/O fds and the netlink socket,
> >
> I would think that management would not be implemented in a
> fast path processing thread for an application.

sure, but my point was that *XDP and other stack-bypass methods need
to provide a select()able socket: when your use-case is not about just
networking, you have to snoop on changes to the control plane, and update
your data path. In the OVS case (pure networking) the OVS control plane
updates are intrinsic to OVS. For the rest of the request/response world,
we need a select()able socket set to do this elegantly (not really
possible in DPDK, for example)
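
As an illustration of the pattern being asked for, the following
Linux-only sketch waits on a data descriptor and an rtnetlink socket
in one readiness call. It uses poll() rather than select(), which is
the same idea with a different interface. The helper names, the EV_*
flags and the choice of multicast groups are assumptions made for
this example.

```c
#include <poll.h>
#include <string.h>
#include <sys/socket.h>
#include <unistd.h>
#include <linux/netlink.h>
#include <linux/rtnetlink.h>

#define EV_DATA 0x1	/* request/response traffic is readable */
#define EV_CTRL 0x2	/* a control-plane notification is readable */

/* Open an rtnetlink socket subscribed to link, neighbour and IPv4 route changes. */
static int open_rtnl_monitor(void)
{
	struct sockaddr_nl addr;
	int fd = socket(AF_NETLINK, SOCK_RAW, NETLINK_ROUTE);

	if (fd < 0)
		return -1;

	memset(&addr, 0, sizeof(addr));
	addr.nl_family = AF_NETLINK;
	addr.nl_groups = RTMGRP_LINK | RTMGRP_NEIGH | RTMGRP_IPV4_ROUTE;
	if (bind(fd, (struct sockaddr *)&addr, sizeof(addr)) < 0) {
		close(fd);
		return -1;
	}
	return fd;
}

/*
 * Wait for either descriptor to become readable. Returns a bitmask of
 * EV_DATA / EV_CTRL, 0 on timeout, or -1 on error.
 */
static int wait_for_events(int data_fd, int ctrl_fd, int timeout_ms)
{
	struct pollfd fds[2] = {
		{ .fd = data_fd, .events = POLLIN },
		{ .fd = ctrl_fd, .events = POLLIN },
	};
	int ready = 0;

	if (poll(fds, 2, timeout_ms) < 0)
		return -1;
	if (fds[0].revents & POLLIN)
		ready |= EV_DATA;
	if (fds[1].revents & POLLIN)
		ready |= EV_CTRL;
	return ready;
}
```

An application would pass the descriptor from open_rtnl_monitor() as
ctrl_fd and refresh its cached Ethernet/IP header state whenever
EV_CTRL is reported, before handling EV_DATA.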

> The *SOs are always an interesting question. They make for great
> benchmarks, but in real life the amount of benefit is somewhat
> unclear. Under the wrong conditions, like all cwnds have collapsed or

I think Rick's already bringing up this one.

--Sowmini


* Re: [WIP] net+mlx4: auto doorbell
From: Eric Dumazet @ 2016-12-01 20:11 UTC
  To: Jesper Dangaard Brouer
  Cc: Saeed Mahameed, Rick Jones, Linux Netdev List, Saeed Mahameed,
	Tariq Toukan
In-Reply-To: <20161201201707.5f51a02e@redhat.com>

On Thu, 2016-12-01 at 20:17 +0100, Jesper Dangaard Brouer wrote:
> On Thu, 01 Dec 2016 09:04:17 -0800 Eric Dumazet <eric.dumazet@gmail.com> wrote:
> 
> > BTW, if you are doing tests on mlx4 40Gbit,
> 
> I'm mostly testing with mlx5 50Gbit, but I do have 40G NIC in the
> machines too.
> 
> >  would you check the
> > following quick/dirty hack, using lots of low-rate flows ?
> 
> What tool should I use to send "low-rate flows"?
> 

You could use https://github.com/google/neper

It supports SO_MAX_PACING_RATE, and you could launch 1600 flows, rate
limited to 3028000 bytes per second  (so sending one 2-MSS TSO packet
every ms per flow)
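
The 3028000 figure can be checked with a little arithmetic. The sketch
below assumes that "2-MSS" here corresponds to two 1514-byte Ethernet
frames (1500-byte MTU plus a 14-byte header); that reading is an
assumption, chosen because it reproduces the number exactly. The
helper names are made up for this example, and set_max_pacing_rate()
only wraps the standard setsockopt() call where the option is defined.

```c
#include <stdint.h>
#include <sys/socket.h>

/* Bytes per second needed to send `frames_per_burst` frames, `bursts_per_sec` times a second. */
static uint64_t pacing_rate_bytes_per_sec(unsigned int frame_bytes,
					  unsigned int frames_per_burst,
					  unsigned int bursts_per_sec)
{
	return (uint64_t)frame_bytes * frames_per_burst * bursts_per_sec;
}

/* Aggregate offered load in bits per second across `flows` identical flows. */
static uint64_t aggregate_bits_per_sec(uint64_t per_flow_bytes_per_sec,
				       unsigned int flows)
{
	return per_flow_bytes_per_sec * flows * 8;
}

/* Cap a socket's pacing rate (bytes per second); returns 0 on success. */
static int set_max_pacing_rate(int fd, unsigned int rate)
{
#ifdef SO_MAX_PACING_RATE
	return setsockopt(fd, SOL_SOCKET, SO_MAX_PACING_RATE,
			  &rate, sizeof(rate));
#else
	(void)fd;
	(void)rate;
	return -1;	/* option not available on this platform */
#endif
}
```

Under that assumption, 1514 * 2 * 1000 = 3028000 bytes per second per
flow, and 1600 such flows offer 3028000 * 1600 * 8 = 38758400000
bits per second, i.e. just under the 40Gbit line rate.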



> And what am I looking for?

Max throughput, in packets per second :/


* Re: [PATCH 2/2] net: rfkill: Add rfkill-any LED trigger
From: Michał Kępień @ 2016-12-01 20:08 UTC
  To: kbuild test robot
  Cc: kbuild-all, Johannes Berg, David S . Miller, linux-wireless,
	netdev, linux-kernel
In-Reply-To: <201612020131.aDbI7Mq9%fengguang.wu@intel.com>

> Hi Michał,
> 
> [auto build test ERROR on mac80211-next/master]
> [also build test ERROR on v4.9-rc7 next-20161201]
> [if your patch is applied to the wrong git tree, please drop us a note to help improve the system]
> 
> url:    https://github.com/0day-ci/linux/commits/Micha-K-pie/net-rfkill-Cleanup-error-handling-in-rfkill_init/20161202-002119
> base:   https://git.kernel.org/pub/scm/linux/kernel/git/jberg/mac80211-next.git master
> config: i386-randconfig-x004-201648 (attached as .config)
> compiler: gcc-6 (Debian 6.2.0-3) 6.2.0 20160901
> reproduce:
>         # save the attached .config to linux build tree
>         make ARCH=i386 
> 
> All errors (new ones prefixed by >>):
> 
>    net/rfkill/core.c: In function 'rfkill_set_block':
> >> net/rfkill/core.c:354:2: error: implicit declaration of function '__rfkill_any_led_trigger_event' [-Werror=implicit-function-declaration]
>      __rfkill_any_led_trigger_event();
>      ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
>    net/rfkill/core.c: In function 'rfkill_init':
>    net/rfkill/core.c:1349:1: warning: label 'error_led_trigger' defined but not used [-Wunused-label]
>     error_led_trigger:
>     ^~~~~~~~~~~~~~~~~
>    At top level:
>    net/rfkill/core.c:243:13: warning: 'rfkill_any_led_trigger_unregister' defined but not used [-Wunused-function]
>     static void rfkill_any_led_trigger_unregister(void)
>                 ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
>    net/rfkill/core.c:238:12: warning: 'rfkill_any_led_trigger_register' defined but not used [-Wunused-function]
>     static int rfkill_any_led_trigger_register(void)
>                ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
>    cc1: some warnings being treated as errors
> 
> vim +/__rfkill_any_led_trigger_event +354 net/rfkill/core.c
> 
>    348		rfkill->state &= ~RFKILL_BLOCK_SW_SETCALL;
>    349		rfkill->state &= ~RFKILL_BLOCK_SW_PREV;
>    350		curr = rfkill->state & RFKILL_BLOCK_SW;
>    351		spin_unlock_irqrestore(&rfkill->lock, flags);
>    352	
>    353		rfkill_led_trigger_event(rfkill);
>  > 354		__rfkill_any_led_trigger_event();
>    355	
>    356		if (prev != curr)
>    357			rfkill_event(rfkill);

Thanks, these are obviously all valid concerns.  Sorry for being sloppy
with the ifdefs.  If I get positive feedback on the proposed feature
itself, all these issues (and the warning pointed out in the other
message) will be resolved in v2.
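
For reference, the usual way to avoid this class of build failure is
to provide a no-op stub for every helper when the feature is compiled
out, so that callers need no #ifdef of their own. The sketch below
shows the pattern in plain C; it assumes the feature is gated by a
symbol such as CONFIG_RFKILL_LEDS, the helper names follow the build
log without the leading underscores, and the *_demo callers and the
led_events counter are invented for illustration.

```c
static int led_events;	/* counts trigger events in the "real" build */

#ifdef CONFIG_RFKILL_LEDS
/* Real implementations, built only when the feature is enabled. */
static void rfkill_any_led_trigger_event(void)
{
	led_events++;
}

static int rfkill_any_led_trigger_register(void)
{
	return 0;
}

static void rfkill_any_led_trigger_unregister(void)
{
}
#else
/* Stubs with identical signatures, so callers compile either way. */
static inline void rfkill_any_led_trigger_event(void)
{
}

static inline int rfkill_any_led_trigger_register(void)
{
	return 0;
}

static inline void rfkill_any_led_trigger_unregister(void)
{
}
#endif

/* Callers stay free of #ifdefs and of unused labels or functions. */
static int rfkill_set_block_demo(void)
{
	rfkill_any_led_trigger_event();
	return led_events;
}

static int rfkill_init_demo(void)
{
	int err = rfkill_any_led_trigger_register();

	if (err)
		return err;
	rfkill_any_led_trigger_unregister();
	return 0;
}
```

Because both branches define all three helpers, neither the implicit
declaration error nor the unused-function warnings from the report can
occur in either configuration.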

-- 
Best regards,
Michał Kępień


* Re: [patch net-next v3 11/12] mlxsw: spectrum_router: Request a dump of FIB tables during init
From: David Miller @ 2016-12-01 20:04 UTC
  To: idosch
  Cc: hannes, jiri, netdev, idosch, eladr, yotamg, nogahf, arkadis,
	ogerlitz, roopa, dsa, nikolay, andy, vivien.didelot, andrew,
	f.fainelli, alexander.h.duyck, kaber
In-Reply-To: <20161130163229.rkxvuwukgg35ktrx@splinter.mtl.com>


Hannes and Ido,

It looks like we are very close to having this in mergeable shape, can
you guys work out this final issue and figure out if it really is
a merge stopper or not?

Thanks.


* Re: Initial thoughts on TXDP
From: Tom Herbert @ 2016-12-01 19:51 UTC
  To: Florian Westphal; +Cc: Linux Kernel Network Developers
In-Reply-To: <20161201024407.GE26507@breakpoint.cc>

On Wed, Nov 30, 2016 at 6:44 PM, Florian Westphal <fw@strlen.de> wrote:
> Tom Herbert <tom@herbertland.com> wrote:
>> Posting for discussion....
>
> Warning: You are not going to like this reply...
>
>> Now that XDP seems to be nicely gaining traction
>
> Yes, I regret to see that.  XDP seems useful to create impressive
> benchmark numbers (and little else).
>
> I will send a separate email to keep that flamebait part away from
> this thread though.
>
> [..]
>
>> addresses the performance gap for stateless packet processing). The
>> problem statement is analogous to that which we had for XDP, namely
>> can we create a mode in the kernel that offers the same performance
>> that is seen with L4 protocols over kernel bypass
>
> Why?  If you want to bypass the kernel, then DO IT.
>
I don't want kernel bypass. I want the Linux stack to provide
something close to bare metal performance for TCP/UDP for some latency
sensitive applications we run.

> There is nothing wrong with DPDK.  The ONLY problem is that the kernel
> does not offer a userspace fastpath like Windows RIO or FreeBSDs netmap.
>
> But even without that it's not difficult to get DPDK running.
>
That is not true for large scale deployments. Also, TXDP is about
accelerating transport layers like TCP; DPDK is just the interface
from userspace to device queues. We need a whole lot more with DPDK, a
userspace TCP/IP stack for instance, before we can consider it to have
equivalent functionality.

> (T)XDP seems born from spite, not technical rationale.
> IMO everyone would be better off if we'd just have something netmap-esque
> in the network core (also see below).
>
>> I imagine there are a few reasons why userspace TCP stacks can get
>> good performance:
>>
>> - Spin polling (we already can do this in kernel)
>> - Lockless, I would assume that threads typically have exclusive
>> access to a queue pair for a connection
>> - Minimal TCP/IP stack code
>> - Zero copy TX/RX
>> - Light weight structures for queuing
>> - No context switches
>> - Fast data path for in order, uncongested flows
>> - Silo'ing between application and device queues
>
> I only see two cases:
>
> 1. Many applications running (standard OS model) that need to
> send/receive data
> -> Linux Network Stack
>
> 2. Single dedicated application that does all rx/tx
>
> -> no queueing needed (can block network rx completely if receiver
> is slow)
> -> no allocations needed at runtime at all
> -> no locking needed (single producer, single consumer)
>
> If you have #2 and you need to be fast etc then full userspace
> bypass is fine.  We will -- no matter what we do in kernel -- never
> be able to keep up with the speed you can get with that
> because we have to deal with #1.  (Plus the ease of use/freedom of doing
> userspace programming).  And yes, I think that #2 is something we
> should address solely by providing netmap or something similar.
>
> But even considering #1 there are ways to speed stack up:
>
> I'd kill RFS/RPS so we don't have IPI anymore and skb stays
> on same cpu up to where it gets queued (ofo or rx queue).
>
The reference to RPS and RFS is only to move packets off the hot CPU
that are not in the datapath. For instance, if we get a FIN for a
connection we can put it into a slow path, since FIN processing is
not latency sensitive but may take a considerable amount of CPU to
process.

> Then we could tell driver what happened with the skb it gave us, e.g.
> we can tell driver it can do immediate page/dma reuse, for example
> in pure ack case as opposed to skb sitting in ofo or receive queue.
>
> (RPS/RFS functionality could still be provided via one of the gazillion
>  hooks we now have in the stack for those that need/want it), so I do
> not think we would lose functionality.
>
>>   - Call into TCP/IP stack with page data directly from driver-- no
>> skbuff allocation or interface. This is essentially provided by the
>> XDP API although we would need to generalize the interface to call
>> stack functions (I previously posted patches for that). We will also
>> need a new action, XDP_HELD?, that indicates the XDP function held the
>> packet (put on a socket for instance).
>
> Seems this will not work at all with the planned page pool thing when
> pages start to be held indefinitely.
>
The processing needed to gift a page to the stack shouldn't be very
different from what a driver needs to do when an skbuff is created
when XDP_PASS is returned. We probably would want to pass additional
meta data, things like checksum and vlan information from received
descriptor to the stack. A callback can be included if the stack
decides it wants to hold on to the buffer and driver needs to do
dma_sync etc. for that.

> You can also never get even close to userspace offload stacks once you
> need/do this; allocations in hotpath are too expensive.
>
> [..]
>
>>   - When we transmit, it would be nice to go straight from TCP
>> connection to an XDP device queue and in particular skip the qdisc
>> layer. This follows the principle of low latency being first criteria.
>
> It will never be lower than userspace offloads so anyone with serious
> "low latency" requirement (trading) will use that instead.
>
Maybe, but the question is how close can we get? If we can get within
say 10-20% performance that would be a win.

> What's your target audience?
>
Many applications, but the most recent one that seems to be driving the
need for very low latency is machine learning. The competition here
really isn't DPDK but is still RDMA (tomorrow's technology for the
past twenty years ;-) ). When the apps guys run their tests, they see
a huge difference between RDMA performance and the stack out of the
box-- like latency for an op goes from 1 usec to 30 usecs. So the apps
guys naturally want RDMA, but anyone in kernel or network ops knows
the nightmare that deploying RDMA entails. If we can get the latencies
and variance down to something comparable (say <5 usecs) then we have
a much stronger argument that we can avoid the immense costs that RDMA
brings in.

>> longer latencies in effect which likely means TXDP isn't appropriate
>> in such cases. BQL is also out, however we would want the TX
>> batching of XDP.
>
> Right, congestion control and buffer bloat are totally overrated .. 8-(
>
> So far I haven't seen anything that would need XDP at all.
>
> What makes it technically impossible to apply these miracles to the
> stack...?
>
> E.g. "mini-skb": Even if we assume that this provides a speedup
> (where does that come from? should make no difference if a 32 or
>  320 byte buffer gets allocated).
>
It's the zero'ing of three cache lines. I believe we talked about that
at netdev.

> If we assume that it's the zeroing of sk_buff (but iirc it made little
> to no difference), could add
>
> unsigned long skb_extensions[1];
>
> to sk_buff, then move everything not needed for tcp fastpath
> (e.g. secpath, conntrack, nf_bridge, tunnel encap, tc, ...)
> below that
>
Yes, that's the intent.

> Then convert accesses to accessors and init it on demand.
>
> One could probably also split cb[] into a smaller fastpath area
> and another one at the end that won't be touched at allocation time.
>
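A minimal sketch of the partial-zeroing idea from the exchange above, in
plain userspace C. The struct mini_skb, its fields, the ext_ready flag and
the helper names are all hypothetical stand-ins, not the real struct sk_buff
layout or any kernel API. The sketch only assumes that a marker member
separates fast-path fields from rarely used ones, that allocation clears
just the part above the marker, and that accessors initialise the rest on
first use.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical, heavily reduced stand-in for struct sk_buff. */
struct mini_skb {
	/* Fast-path fields: cleared on every allocation. */
	unsigned int len;
	unsigned short protocol;
	unsigned char ext_ready;	/* set once the area below is valid */

	/* Marker: nothing from here on is touched at allocation time. */
	unsigned long skb_extensions[1];

	/* Rarely used fields (secpath, conntrack, ... in the real struct). */
	void *secpath;
	void *nfct;
};

#define MINI_SKB_FAST_LEN offsetof(struct mini_skb, skb_extensions)

/* Allocation-time init: zero only the fast-path part. */
static void mini_skb_init(struct mini_skb *skb)
{
	assert(skb != NULL);
	memset(skb, 0, MINI_SKB_FAST_LEN);
}

/* Lazily zero the extension area the first time it is needed. */
static void mini_skb_ext_prepare(struct mini_skb *skb)
{
	if (skb->ext_ready)
		return;
	memset((char *)skb + MINI_SKB_FAST_LEN, 0,
	       sizeof(*skb) - MINI_SKB_FAST_LEN);
	skb->ext_ready = 1;
}

/* Accessors: every read or write of an extension field goes through these. */
static void *mini_skb_nfct(struct mini_skb *skb)
{
	mini_skb_ext_prepare(skb);
	return skb->nfct;
}

static void mini_skb_set_nfct(struct mini_skb *skb, void *ct)
{
	mini_skb_ext_prepare(skb);
	skb->nfct = ct;
}
```

The flag that says whether the extension area is valid has to live in the
part that is always zeroed, otherwise the accessors could not trust it.
Whether this saves anything measurable depends on how many cache lines the
always-zeroed part still spans, which is the open question in the thread.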
>> Miscellaneous
>
>> contemplating that connections/sockets can be bound to particular
>> CPUs and that any operations (socket operations, timers, receive
>> processing) must occur on that CPU. The CPU would be the one where RX
>> happens. Note this implies perfect silo'ing, everything from driver RX
>> to application processing happens inline on the CPU. The stack would
>> not cross CPUs for a connection while in this mode.
>
> Again don't see how this relates to xdp.  Could also be done with
> current stack if we make rps/rfs pluggable since nothing else
> currently pushes skb to another cpu (except when scheduler is involved
> via tc mirred, netfilter userspace queueing etc) but that is always
> explicit (i.e. skb leaves softirq protection).
>
> Can we please fix and improve what we already have rather than creating
> yet another NIH thing that will have to be maintained forever?
>
That's what we are doing, and this is the major reason why we need to
improve Linux as opposed to introducing parallel stacks. The cost of
supporting modifications to Linux pales in comparison to what we would
need to support a parallel stack.

Tom

> Thanks.

^ permalink raw reply

* Re: [PATCH v3 net-next 2/3] openvswitch: Use is_skb_forwardable() for length check.
From: Pravin Shelar @ 2016-12-01 19:50 UTC (permalink / raw)
  To: Jiri Benc; +Cc: Jarno Rajahalme, Linux Kernel Network Developers, Eric Garver
In-Reply-To: <20161130145159.3cee7ba4@griffin>

On Wed, Nov 30, 2016 at 5:51 AM, Jiri Benc <jbenc@redhat.com> wrote:
> On Tue, 29 Nov 2016 15:30:52 -0800, Jarno Rajahalme wrote:
>> @@ -504,11 +485,20 @@ void ovs_vport_send(struct vport *vport, struct sk_buff *skb, u8 mac_proto)
>>               goto drop;
>>       }
>>
>> -     if (unlikely(packet_length(skb, vport->dev) > mtu &&
>> -                  !skb_is_gso(skb))) {
>> -             net_warn_ratelimited("%s: dropped over-mtu packet: %d > %d\n",
>> -                                  vport->dev->name,
>> -                                  packet_length(skb, vport->dev), mtu);
>> +     if (unlikely(!is_skb_forwardable(vport->dev, skb))) {
>
> How does this work when the vlan tag is accelerated? Then we can be
> over MTU, yet the check will pass.
>

This does not change any behavior compared to the current OVS vlan checks.
A single vlan header is not considered in the MTU check.

^ permalink raw reply

* Re: Initial thoughts on TXDP
From: Rick Jones @ 2016-12-01 19:48 UTC (permalink / raw)
  To: Tom Herbert, Sowmini Varadhan; +Cc: Linux Kernel Network Developers
In-Reply-To: <CALx6S35DCyi_2z1pqCLaB1bVyNykP_J3YaYEXUT8xxmuzyBDwA@mail.gmail.com>

On 12/01/2016 11:05 AM, Tom Herbert wrote:
> For the GSO and GRO the rationale is that performing the extra SW
> processing to do the offloads is significantly less expensive than
> running each packet through the full stack. This is true in a
> multi-layered generalized stack. In TXDP, however, we should be able
> to optimize the stack data path such that that would no longer be
> true. For instance, if we can process the packets received on a
> connection quickly enough so that it's about the same or just a little
> more costly than GRO processing then we might bypass GRO entirely.
> TSO is probably still relevant in TXDP since it reduces overheads
> processing TX in the device itself.

Just how much per-packet path-length are you thinking will go away under 
the likes of TXDP?  It is admittedly "just" netperf but losing TSO/GSO 
does some non-trivial things to effective overhead (service demand) and 
so throughput:

stack@np-cp1-c0-m1-mgmt:~/rjones2$ ./netperf -c -H np-cp1-c1-m3-mgmt -- -P 12867
MIGRATED TCP STREAM TEST from 0.0.0.0 (0.0.0.0) port 12867 AF_INET to np-cp1-c1-m3-mgmt () port 12867 AF_INET : demo
Recv   Send    Send                          Utilization       Service Demand
Socket Socket  Message  Elapsed              Send     Recv     Send    Recv
Size   Size    Size     Time     Throughput  local    remote   local   remote
bytes  bytes   bytes    secs.    10^6bits/s  % S      % U      us/KB   us/KB

  87380  16384  16384    10.00      9260.24   2.02     -1.00    0.428   -1.000
stack@np-cp1-c0-m1-mgmt:~/rjones2$ sudo ethtool -K hed0 tso off gso off
stack@np-cp1-c0-m1-mgmt:~/rjones2$ ./netperf -c -H np-cp1-c1-m3-mgmt -- -P 12867
MIGRATED TCP STREAM TEST from 0.0.0.0 (0.0.0.0) port 12867 AF_INET to np-cp1-c1-m3-mgmt () port 12867 AF_INET : demo
Recv   Send    Send                          Utilization       Service Demand
Socket Socket  Message  Elapsed              Send     Recv     Send    Recv
Size   Size    Size     Time     Throughput  local    remote   local   remote
bytes  bytes   bytes    secs.    10^6bits/s  % S      % U      us/KB   us/KB

  87380  16384  16384    10.00      5621.82   4.25     -1.00    1.486   -1.000

And that is still with the stretch-ACKs induced by GRO at the receiver.

Losing GRO has quite similar results:
stack@np-cp1-c0-m1-mgmt:~/rjones2$ ./netperf -c -H np-cp1-c1-m3-mgmt -t TCP_MAERTS -- -P 12867
MIGRATED TCP MAERTS TEST from 0.0.0.0 (0.0.0.0) port 12867 AF_INET to np-cp1-c1-m3-mgmt () port 12867 AF_INET : demo
Recv   Send    Send                          Utilization       Service Demand
Socket Socket  Message  Elapsed              Recv     Send     Recv    Send
Size   Size    Size     Time     Throughput  local    remote   local   remote
bytes  bytes   bytes    secs.    10^6bits/s  % S      % U      us/KB   us/KB

  87380  16384  16384    10.00      9154.02   4.00     -1.00    0.860   -1.000
stack@np-cp1-c0-m1-mgmt:~/rjones2$ sudo ethtool -K hed0 gro off

stack@np-cp1-c0-m1-mgmt:~/rjones2$ ./netperf -c -H np-cp1-c1-m3-mgmt -t TCP_MAERTS -- -P 12867
MIGRATED TCP MAERTS TEST from 0.0.0.0 (0.0.0.0) port 12867 AF_INET to np-cp1-c1-m3-mgmt () port 12867 AF_INET : demo
Recv   Send    Send                          Utilization       Service Demand
Socket Socket  Message  Elapsed              Recv     Send     Recv    Send
Size   Size    Size     Time     Throughput  local    remote   local   remote
bytes  bytes   bytes    secs.    10^6bits/s  % S      % U      us/KB   us/KB

  87380  16384  16384    10.00      4212.06   5.36     -1.00    2.502   -1.000

I'm sure there is a very non-trivial "it depends" component here - 
netperf will get the peak benefit from *SO and so one will see the peak 
difference in service demands - but even if one gets only 6 segments per 
*SO that is a lot of path-length to make-up.
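To put the four runs above side by side, here is a small helper sketch in C
that turns a pair of netperf results into a cost ratio and a throughput
loss. The function names are made up for illustration; the only inputs are
the local service demand (us/KB) and throughput (10^6 bits/s) columns quoted
above.

```c
#include <assert.h>

/* Ratio of local service demand (us/KB) with the offload off vs on. */
static double cost_ratio(double sd_on, double sd_off)
{
	assert(sd_on > 0.0);
	return sd_off / sd_on;
}

/* Fraction of throughput lost when the offload is switched off. */
static double throughput_loss(double tput_on, double tput_off)
{
	assert(tput_on > 0.0);
	return 1.0 - tput_off / tput_on;
}
```

With the TSO/GSO pair, cost_ratio(0.428, 1.486) comes out at roughly 3.5 and
throughput_loss(9260.24, 5621.82) at roughly 0.39. With the GRO pair,
cost_ratio(0.860, 2.502) is roughly 2.9 and throughput_loss(9154.02, 4212.06)
roughly 0.54. These are single 10 second runs on one setup, so they indicate
the order of magnitude rather than a precise figure.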

4.4 kernel, BE3 NICs ... E5-2640 0 @ 2.50GHz

And even if one does have the CPU cycles to burn so to speak, the effect 
on power consumption needs to be included in the calculus.

happy benchmarking,

rick jones

^ permalink raw reply

* Re: [PATCH v2] tun: Use netif_receive_skb instead of netif_rx
From: David Miller @ 2016-12-01 19:43 UTC (permalink / raw)
  To: andreyknvl
  Cc: herbert, jasowang, edumazet, pmk, pabeni, mst, soheil, elfring,
	rppt, netdev, linux-kernel, dvyukov, kcc, syzkaller
In-Reply-To: <1480584880-48651-1-git-send-email-andreyknvl@google.com>

From: Andrey Konovalov <andreyknvl@google.com>
Date: Thu,  1 Dec 2016 10:34:40 +0100

> This patch changes tun.c to call netif_receive_skb instead of netif_rx
> when a packet is received (if CONFIG_4KSTACKS is not enabled to avoid
> stack exhaustion). The difference between the two is that netif_rx queues
> the packet into the backlog, and netif_receive_skb processes the packet
> in the current context.
> 
> This patch is required for syzkaller [1] to collect coverage from packet
> receive paths, when a packet is being received through tun (syzkaller collects
> coverage per process in the process context).
> 
> As mentioned by Eric this change also speeds up tun/tap. As measured by
> Peter it speeds up his closed-loop single-stream tap/OVS benchmark by
> about 23%, from 700k packets/second to 867k packets/second.
> 
> A similar patch was introduced back in 2010 [2, 3], but the author found
> out that the patch doesn't help with the task he had in mind (for cgroups
> to shape network traffic based on the original process) and decided not to
> go further with it. The main concern back then was about possible stack
> exhaustion with 4K stacks.
> 
> [1] https://github.com/google/syzkaller
> 
> [2] https://www.spinics.net/lists/netdev/thrd440.html#130570
> 
> [3] https://www.spinics.net/lists/netdev/msg130570.html
> 
> Signed-off-by: Andrey Konovalov <andreyknvl@google.com>
> ---
> 
> Changes since v1:
> - incorporate Eric's note about speed improvements in commit description
> - use netif_receive_skb only if CONFIG_4KSTACKS is not enabled

Applied to net-next, thanks!

^ permalink raw reply

* Re: [PATCH net-next 2/3] net/act_pedit: Support using offset relative to the conventional network headers
From: David Miller @ 2016-12-01 19:41 UTC (permalink / raw)
  To: amir; +Cc: netdev, jhs, ogerlitz, hadarh
In-Reply-To: <20161130090928.14816-3-amir@vadai.me>

From: Amir Vadai <amir@vadai.me>
Date: Wed, 30 Nov 2016 11:09:27 +0200

> @@ -119,18 +119,45 @@ static bool offset_valid(struct sk_buff *skb, int offset)
>  	return true;
>  }
>  
> +static int pedit_skb_hdr_offset(struct sk_buff *skb,
> +				enum pedit_header_type htype, int *hoffset)
> +{
> +	int ret = -1;
> +
> +	switch (htype) {
> +	case PEDIT_HDR_TYPE_ETH:
> +		if (skb_mac_header_was_set(skb)) {
> +			*hoffset = skb_mac_offset(skb);
> +			ret = 0;
> +		}
> +		break;
> +	case PEDIT_HDR_TYPE_RAW:
> +	case PEDIT_HDR_TYPE_IP4:
> +	case PEDIT_HDR_TYPE_IP6:
> +		*hoffset = skb_network_offset(skb);
> +		ret = 0;
> +		break;
> +	case PEDIT_HDR_TYPE_TCP:
> +	case PEDIT_HDR_TYPE_UDP:
> +		if (skb_transport_header_was_set(skb)) {
> +			*hoffset = skb_transport_offset(skb);
> +			ret = 0;
> +		}
> +		break;
> +	};
> +
> +	return ret;
> +}
> +

The only distinction between the cases is "L2", "L3", and "L4".

Therefore I don't see any reason to break it down into IP4 vs. IP6 vs.
RAW, for example.  They all map to the same thing.

So why not just have PEDIT_HDR_TYPE_L2, PEDIT_HDR_TYPE_L3, and
PEDIT_HDR_TYPE_L4?  It definitely seems more straightforward
and cleaner that way.
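A rough sketch of what the layer-based variant could look like, written as
standalone C so it compiles outside the kernel. struct toy_skb and its
fields are a made-up stand-in for the sk_buff header bookkeeping
(skb_mac_header_was_set(), skb_network_offset() and friends in the patch),
and the enum values simply follow the PEDIT_HDR_TYPE_L2/L3/L4 names
suggested above; none of this is taken from an actual follow-up patch.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical layer-based header types, per the L2/L3/L4 suggestion. */
enum pedit_header_type {
	PEDIT_HDR_TYPE_L2,
	PEDIT_HDR_TYPE_L3,
	PEDIT_HDR_TYPE_L4,
};

/* Toy stand-in for the header offsets an sk_buff keeps track of. */
struct toy_skb {
	bool mac_set;		/* models skb_mac_header_was_set() */
	bool transport_set;	/* models skb_transport_header_was_set() */
	int mac_offset;
	int network_offset;
	int transport_offset;
};

/* Returns 0 and fills *hoffset, or -1 if that header is not available. */
static int pedit_hdr_offset(const struct toy_skb *skb,
			    enum pedit_header_type htype, int *hoffset)
{
	assert(skb != NULL && hoffset != NULL);

	switch (htype) {
	case PEDIT_HDR_TYPE_L2:
		if (!skb->mac_set)
			return -1;
		*hoffset = skb->mac_offset;
		return 0;
	case PEDIT_HDR_TYPE_L3:
		*hoffset = skb->network_offset;
		return 0;
	case PEDIT_HDR_TYPE_L4:
		if (!skb->transport_set)
			return -1;
		*hoffset = skb->transport_offset;
		return 0;
	}
	return -1;
}
```

With one case per layer the switch no longer needs fall-through labels, and
adding another L3 or L4 protocol would not require a new enum value.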

Thanks.

^ permalink raw reply

* Re: [net-next] rtnetlink: return the correct error code
From: David Miller @ 2016-12-01 19:39 UTC (permalink / raw)
  To: zhangshengju; +Cc: netdev
In-Reply-To: <1480495054-5114-1-git-send-email-zhangshengju@cmss.chinamobile.com>

From: Zhang Shengju <zhangshengju@cmss.chinamobile.com>
Date: Wed, 30 Nov 2016 16:37:34 +0800

> Before this patch, function ndo_dflt_fdb_dump() will always return code
> from uc fdb dump. The return code of mc fdb dump is lost.
> 
> Signed-off-by: Zhang Shengju <zhangshengju@cmss.chinamobile.com>

Applied.

^ permalink raw reply

* Re: [PATCH] net: asix: Fix AX88772_suspend() USB vendor commands failure issues
From: David Miller @ 2016-12-01 19:35 UTC (permalink / raw)
  To: allan
  Cc: jonathanh, freddy, Dean_Jenkins, Mark_Craske, robert.foss,
	ivecera, john.stultz, vpalatin, stephen, grundler, changchias,
	andrew, tremyfr, colin.king, linux-usb, netdev, linux-kernel,
	vpalatin
In-Reply-To: <00d701d24ae3$d4f4f2a0$7eded7e0$@asix.com.tw>

From: "ASIX_Allan [Office]" <allan@asix.com.tw>
Date: Wed, 30 Nov 2016 16:29:08 +0800

> The change fixes AX88772_suspend() USB vendor commands failure issues.
> 
> Signed-off-by: Allan Chou <allan@asix.com.tw>
> Tested-by: Allan Chou <allan@asix.com.tw>
> Tested-by: Jon Hunter <jonathanh@nvidia.com>

Patch applied, thanks.

^ permalink raw reply

* Re: [WIP] net+mlx4: auto doorbell
From: Willem de Bruijn @ 2016-12-01 19:24 UTC (permalink / raw)
  To: Tom Herbert
  Cc: Eric Dumazet, Jesper Dangaard Brouer, Willem de Bruijn,
	Rick Jones, Linux Kernel Network Developers, Saeed Mahameed,
	Tariq Toukan, Achiad Shochat
In-Reply-To: <CALx6S34ENTbUmCGx_4izNHoXbdy5UHuvUesbFGw+8kQSidesEg@mail.gmail.com>

>>> > So we end up with one cpu doing the ndo_start_xmit() and TX completions,
>>> > and RX work.

This problem is somewhat tangential to the doorbell avoidance discussion.

>>> >
>>> > This problem is magnified when XPS is used, if one mono-threaded application deals with
>>> > thousands of TCP sockets.
>>> >
>> We have thousands of applications, and some of them 'kind of multicast'
>> events to a broad number of TCP sockets.
>>
>> Very often, applications writers use a single thread for doing this,
>> when all they need is to send small packets to 10,000 sockets, and they
>> do not really care about doing this very fast. They also do not want to
>> hurt other applications sharing the NIC.
>>
>> Very often, process scheduler will also run this single thread in a
>> single cpu, ie avoiding expensive migrations if they are not needed.
>>
>> Problem is this behavior locks one TX queue for the duration of the
>> multicast, since XPS will force all the TX packets to go to one TX
>> queue.
>>
> The fact that XPS is forcing TX packets to go over one CPU is a result
> of how you've configured XPS. There are other potential
> configurations that present different tradeoffs.

Right, XPS supports multiple txqueue mappings, using skb_tx_hash
to decide among them. Unfortunately cross-cpu is more expensive
than tx + completion on the same core, so this is suboptimal for
the common case where there is no excessive load imbalance.

> For instance, TX
> queues are plentiful nowadays to the point that we could map a number
> of queues to each CPU while respecting NUMA locality between the
> sending CPU and where TX completions occur. If the set is sufficiently
> large this would also help to address device lock contention. The
> other thing I'm wondering is if Willem's concepts for distributing DoS
> attacks in RPS might be applicable in XPS. If we could detect that a
> TX queue is "under attack" maybe we can automatically back off to
> distributing the load to more queues in XPS somehow.

If only targeting states of imbalance, that indeed could work. For the
10,000 socket case, instead of load balancing qdisc servicing, we
could perhaps modify tx queue selection in __netdev_pick_tx to
choose another queue if the initial choice is paused or otherwise
backlogged.
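As a toy illustration of that fallback, the following userspace C sketch
keeps the initially chosen queue unless it is paused or over a backlog
limit, and otherwise walks the remaining queues in order. struct toy_txq,
pick_tx_queue() and the backlog_limit parameter are invented for this
example; this is not how __netdev_pick_tx is written, only the selection
policy described above under simplified assumptions.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical per-queue state; the real netdev_queue looks different. */
struct toy_txq {
	bool stopped;		/* queue is paused by the driver */
	unsigned int backlog;	/* packets currently waiting */
};

/*
 * Keep the initial choice when it is usable, otherwise take the next
 * queue that is neither stopped nor over the backlog limit. If every
 * queue is busy, fall back to the initial choice.
 */
static size_t pick_tx_queue(const struct toy_txq *q, size_t nq,
			    size_t first, unsigned int backlog_limit)
{
	assert(q != NULL && nq > 0 && first < nq);

	for (size_t i = 0; i < nq; i++) {
		size_t idx = (first + i) % nq;

		if (!q[idx].stopped && q[idx].backlog <= backlog_limit)
			return idx;
	}
	return first;
}
```

In the balanced case this returns the initial queue on the first iteration,
so the cheaper same-core tx and completion path stays the default and the
cross-cpu cost is only paid when the initial queue is actually in trouble.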

^ permalink raw reply

* Re: [WIP] net+mlx4: auto doorbell
From: Jesper Dangaard Brouer @ 2016-12-01 19:17 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: Saeed Mahameed, Rick Jones, Linux Netdev List, Saeed Mahameed,
	Tariq Toukan, brouer
In-Reply-To: <1480611857.18162.319.camel@edumazet-glaptop3.roam.corp.google.com>


On Thu, 01 Dec 2016 09:04:17 -0800 Eric Dumazet <eric.dumazet@gmail.com> wrote:

> BTW, if you are doing tests on mlx4 40Gbit,

I'm mostly testing with mlx5 50Gbit, but I do have 40G NIC in the
machines too.

>  would you check the
> following quick/dirty hack, using lots of low-rate flows ?

What tool should I use to send "low-rate flows"?

And what am I looking for?

> mlx4 has a really hard time transmitting small TSO packets (2 or 3 MSS)
> 
> diff --git a/drivers/net/ethernet/mellanox/mlx4/en_netdev.c b/drivers/net/ethernet/mellanox/mlx4/en_netdev.c
> index 12ea3405f442..96940666abd3 100644
> --- a/drivers/net/ethernet/mellanox/mlx4/en_netdev.c
> +++ b/drivers/net/ethernet/mellanox/mlx4/en_netdev.c
> @@ -2631,6 +2631,11 @@ static void mlx4_en_del_vxlan_port(struct  net_device *dev,
>         queue_work(priv->mdev->workqueue, &priv->vxlan_del_task);
>  }
>  
> +static int mlx4_gso_segs_min = 4; /* TSO packets with less than 4 segments are segmented */
> +module_param_named(mlx4_gso_segs_min, mlx4_gso_segs_min, uint, 0644);
> +MODULE_PARM_DESC(mlx4_gso_segs_min, "threshold for software segmentation of small TSO packets");
> +
> +
>  static netdev_features_t mlx4_en_features_check(struct sk_buff *skb,
>                                                 struct net_device *dev,
>                                                 netdev_features_t features)
> @@ -2651,6 +2656,8 @@ static netdev_features_t mlx4_en_features_check(struct sk_buff *skb,
>                     (udp_hdr(skb)->dest != priv->vxlan_port))
>                         features &= ~(NETIF_F_CSUM_MASK | NETIF_F_GSO_MASK);
>         }
> +       if (skb_is_gso(skb) && skb_shinfo(skb)->gso_segs < mlx4_gso_segs_min)
> +               features &= ~NETIF_F_GSO_MASK;
>  
>         return features;
>  }
 

-- 
Best regards,
  Jesper Dangaard Brouer
  MSc.CS, Principal Kernel Engineer at Red Hat
  LinkedIn: http://www.linkedin.com/in/brouer

^ permalink raw reply

* Re: [PATCH net-next iproute2 PATCH 2/2 v2] ss: Add inet raw sockets information gathering via netlink diag interface
From: Cyrill Gorcunov @ 2016-12-01 19:13 UTC (permalink / raw)
  To: Stephen Hemminger; +Cc: netdev, avagin
In-Reply-To: <20161201105701.1c54b258@xeon-e3>

On Thu, Dec 01, 2016 at 10:57:01AM -0800, Stephen Hemminger wrote:
> 
> Applied both patches, but needed to remove inet_diag.h since
> already updated kernel headers.

Thank you! I think we might need to extend the matching interface
for killing raw sockets in the near future, because for now it is
too wildcard. I put this into my todo list; once I finish with
more urgent tasks I will get back to this one.

^ permalink raw reply

* Re: [PATCH] netfilter: avoid warn and OOM on vmalloc call
From: Marcelo Ricardo Leitner @ 2016-12-01 19:08 UTC (permalink / raw)
  To: Andrey Konovalov
  Cc: Florian Westphal, Neil Horman, netdev, netfilter-devel, LKML
In-Reply-To: <CAAeHK+wjP=h_4YxB6VUc+FjKcZi9igmyTs3nPAuUJeNomYSA0w@mail.gmail.com>

On Thu, Dec 01, 2016 at 10:42:22AM +0100, Andrey Konovalov wrote:
> On Wed, Nov 30, 2016 at 8:42 PM, Marcelo Ricardo Leitner
> <marcelo.leitner@gmail.com> wrote:
> > Hi Andrey,
> >
> > Please let me know how this works for you. It seems good here, though
> > your poc may still trigger OOM through other means.
> 
> Hi Marcelo,
> 
> Don't see any reports with this patch.
> 
> Thanks!

Thanks Andrey.
I'll post a v2 after a few more tests here and to s/OOM/OOM killer/ in
most of the changelog.

> 
> >
> > Thanks,
> > Marcelo
> >
> > ---8<---
> >
> > Andrey Konovalov reported that this vmalloc call is based on a
> > userspace request and that it's spewing traces, which may flood the logs
> > and cause DoS if abused.
> >
> > Florian Westphal also mentioned that this call should not trigger OOM,
> > as the kmalloc one is already not triggering it.
> >
> > This patch brings the vmalloc call in sync with kmalloc: it disables the
> > warn trace on allocation failure and also disables OOM invocation.
> >
> > Note, however, that under such a stress situation, other places may
> > trigger OOM invocation.
> >
> > Reported-by: Andrey Konovalov <andreyknvl@google.com>
> > Cc: Florian Westphal <fw@strlen.de>
> > Signed-off-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com>
> > ---
> >  net/netfilter/x_tables.c | 4 +++-
> >  1 file changed, 3 insertions(+), 1 deletion(-)
> >
> > diff --git a/net/netfilter/x_tables.c b/net/netfilter/x_tables.c
> > index fc4977456c30e098197b4f987b758072c9cf60d9..dece525bf83a0098dad607fce665cd0bde228362 100644
> > --- a/net/netfilter/x_tables.c
> > +++ b/net/netfilter/x_tables.c
> > @@ -958,7 +958,9 @@ struct xt_table_info *xt_alloc_table_info(unsigned int size)
> >         if (sz <= (PAGE_SIZE << PAGE_ALLOC_COSTLY_ORDER))
> >                 info = kmalloc(sz, GFP_KERNEL | __GFP_NOWARN | __GFP_NORETRY);
> >         if (!info) {
> > -               info = vmalloc(sz);
> > +               info = __vmalloc(sz, GFP_KERNEL | __GFP_NOWARN |
> > +                                    __GFP_NORETRY | __GFP_HIGHMEM,
> > +                                PAGE_KERNEL);
> >                 if (!info)
> >                         return NULL;
> >         }
> > --
> > 2.9.3
> >
> 

^ permalink raw reply
