Netdev List
 help / color / mirror / Atom feed
* Re: [patch net-next v3 11/12] mlxsw: spectrum_router: Request a dump of FIB tables during init
From: Hannes Frederic Sowa @ 2016-12-01 21:57 UTC (permalink / raw)
  To: Ido Schimmel
  Cc: Jiri Pirko, netdev, davem, idosch, eladr, yotamg, nogahf, arkadis,
	ogerlitz, roopa, dsa, nikolay, andy, vivien.didelot, andrew,
	f.fainelli, alexander.h.duyck, kaber
In-Reply-To: <20161130182243.cpoyoqvxuv2bnn4y@splinter.mtl.com>

On 30.11.2016 19:22, Ido Schimmel wrote:
> On Wed, Nov 30, 2016 at 05:49:56PM +0100, Hannes Frederic Sowa wrote:
>> On 30.11.2016 17:32, Ido Schimmel wrote:
>>> On Wed, Nov 30, 2016 at 04:37:48PM +0100, Hannes Frederic Sowa wrote:
>>>> On 30.11.2016 11:09, Jiri Pirko wrote:
>>>>> From: Ido Schimmel <idosch@mellanox.com>
>>>>>
>>>>> Make sure the device has a complete view of the FIB tables by invoking
>>>>> their dump during module init.
>>>>>
>>>>> Signed-off-by: Ido Schimmel <idosch@mellanox.com>
>>>>> Signed-off-by: Jiri Pirko <jiri@mellanox.com>
>>>>> ---
>>>>>  .../net/ethernet/mellanox/mlxsw/spectrum_router.c  | 23 ++++++++++++++++++++++
>>>>>  1 file changed, 23 insertions(+)
>>>>>
>>>>> diff --git a/drivers/net/ethernet/mellanox/mlxsw/spectrum_router.c b/drivers/net/ethernet/mellanox/mlxsw/spectrum_router.c
>>>>> index 14bed1d..d176047 100644
>>>>> --- a/drivers/net/ethernet/mellanox/mlxsw/spectrum_router.c
>>>>> +++ b/drivers/net/ethernet/mellanox/mlxsw/spectrum_router.c
>>>>> @@ -2027,8 +2027,23 @@ static int mlxsw_sp_router_fib_event(struct notifier_block *nb,
>>>>>  	return NOTIFY_DONE;
>>>>>  }
>>>>>  
>>>>> +static void mlxsw_sp_router_fib_dump_flush(struct notifier_block *nb)
>>>>> +{
>>>>> +	struct mlxsw_sp *mlxsw_sp = container_of(nb, struct mlxsw_sp, fib_nb);
>>>>> +
>>>>> +	/* Flush pending FIB notifications and then flush the device's
>>>>> +	 * table before requesting another dump. Do that with RTNL held,
>>>>> +	 * as FIB notification block is already registered.
>>>>> +	 */
>>>>> +	mlxsw_core_flush_owq();
>>>>> +	rtnl_lock();
>>>>> +	mlxsw_sp_router_fib_flush(mlxsw_sp);
>>>>> +	rtnl_unlock();
>>>>> +}
>>>>> +
>>>>>  int mlxsw_sp_router_init(struct mlxsw_sp *mlxsw_sp)
>>>>>  {
>>>>> +	fib_dump_cb_t *cb = mlxsw_sp_router_fib_dump_flush;
>>>>>  	int err;
>>>>>  
>>>>>  	INIT_LIST_HEAD(&mlxsw_sp->router.nexthop_neighs_list);
>>>>> @@ -2048,8 +2063,16 @@ int mlxsw_sp_router_init(struct mlxsw_sp *mlxsw_sp)
>>>>>  
>>>>>  	mlxsw_sp->fib_nb.notifier_call = mlxsw_sp_router_fib_event;
>>>>>  	register_fib_notifier(&mlxsw_sp->fib_nb);
>>>>
>>>> Sorry to pick in here again:
>>>>
>>>> There is a race here. You need to protect the registration of the fib
>>>> notifier as well by the sequence counter. Updates here are not ordered
>>>> in relation to this code below.
>>>
>>> You mean updates that can be received after you registered the notifier
>>> and until the dump started? I'm aware of that and that's OK. This
>>> listener should be able to handle duplicates.
>>
>> I am not concerned about duplicates, but about ordering deletes and
>> getting an add from the RCU code you will add the node to hw while it is
>> deleted in the software path. You probably will ignore the delete
>> because nothing is installed in hw and later add the node which was
>> actually deleted but just reordered which happend on another CPU, no?
> 
> Are you referring to reordering in the workqueue? We already covered
> this using an ordered workqueue, which has one context of execution
> system-wide.

Ups, sorry, I missed that mail. Probably read it on the mobile phone and
it became invisible for me later on. Busy day... ;)

The reordering in the workqueue seems fine to me and also still necessary.

Basically, if you delete a node right now the kernel might simply do a
RCU_INIT_POINTER(ptr_location, NULL), which has absolutely no barriers
or synchronization with the reader side. Thus you might get a callback
from the notifier for a delete event on the one CPU and you end up
queueing this fib entry after the delete queue, because the RCU walk
isn't protected by any means.

Looking closer at this series again, I overlooked the fact that you
fetch fib_seq using a rtnl_lock and rtnl_unlock pair, which first of all
orders fetching of fib_seq and thus the RCU dumping after any concurrent
executing fib table update, also the mutex_lock and unlock provide
proper acquire and release fences, so the CPU indeed sees the effect of
a RCU_INIT_POINTER update done on another CPU, because they pair with
the rtnl_unlock which might happen on the other CPU.

My question is if this is a bit of luck and if we should make this
explicit by putting the registration itself under the protection of the
sequence counter. I favor the additional protection, e.g. if we some day
actually we optimize the fib_seq code? Otherwise we might probably
document this fact. :)

>>> I've a follow up patchset that introduces a new event in switchdev
>>> notification chain called SWITCHDEV_SYNC, which is sent when port
>>> netdevs are enslaved / released  from a master device (points in time
>>> where kernel<->device can get out of sync). It will invoke
>>> re-propagation of configuration from different parts of the stack
>>> (e.g. bridge driver, 8021q driver, fib/neigh code), which can result
>>> in duplicates.
>>
>> Okay, understood. I wonder how we can protect against accidentally abort
>> calls actually. E.g. if I start to inject routes into my routing domain
>> how can I make sure the box doesn't die after I try to insert enough
>> routes. Do we need to touch quagga etc?
> 
> The whole point of moving abort mechanism to the driver is that the
> system won't die, but instead routing will be done in the kernel. If you
> respect hardware limitations, then there's no reason for abort mechanism
> to kick in.

Quick follow-up question: How can I quickly find out the hw limitations
via the kernel api?

Thanks,
Hannes

^ permalink raw reply

* Re: [flamebait] xdp, well meaning but pointless
From: Tom Herbert @ 2016-12-01 21:51 UTC (permalink / raw)
  To: Hannes Frederic Sowa
  Cc: Thomas Graf, Florian Westphal, Linux Kernel Network Developers
In-Reply-To: <9b4264f8-26b9-a611-56f0-0840cecf9c44@stressinduktion.org>

On Thu, Dec 1, 2016 at 1:27 PM, Hannes Frederic Sowa
<hannes@stressinduktion.org> wrote:
> On 01.12.2016 22:12, Tom Herbert wrote:
>> On Thu, Dec 1, 2016 at 12:44 PM, Hannes Frederic Sowa
>> <hannes@stressinduktion.org> wrote:
>>> Hello,
>>>
>>> this is a good conversation and I simply want to bring my worries
>>> across. I don't have good solutions for the problems XDP tries to solve
>>> but I fear we could get caught up in maintenance problems in the long
>>> term given the ideas floating around on how to evolve XDP currently.
>>>
>>> On 01.12.2016 17:28, Thomas Graf wrote:
>>>> On 12/01/16 at 04:52pm, Hannes Frederic Sowa wrote:
>>>>> First of all, this is a rant targeted at XDP and not at eBPF as a whole.
>>>>> XDP manipulates packets at free will and thus all security guarantees
>>>>> are off as well as in any user space solution.
>>>>>
>>>>> Secondly user space provides policy, acl, more controlled memory
>>>>> protection, restartability and better debugability. If I had multi
>>>>> tenant workloads I would definitely put more complex "business/acl"
>>>>> logic into user space, so I can make use of LSM and other features to
>>>>> especially prevent a network facing service to attack the tenants. If
>>>>> stuff gets put into the kernel you run user controlled code in the
>>>>> kernel exposing a much bigger attack vector.
>>>>>
>>>>> What use case do you see in XDP specifically e.g. for container networking?
>>>>
>>>> DDOS mitigation to protect distributed applications in large clusters.
>>>> Relying on CDN works to protect API gateways and frontends (as long as
>>>> they don't throw you out of their network) but offers no protection
>>>> beyond that, e.g. a noisy/hostile neighbour. Doing this at the server
>>>> level and allowing the mitigation capability to scale up with the number
>>>> of servers is natural and cheap.
>>>
>>> So far we e.g. always considered L2 attacks a problem of the network
>>> admin to correctly protect the environment. Are you talking about
>>> protecting the L3 data plane? Are there custom proprietary protocols in
>>> place which need custom protocol parsers that need involvement of the
>>> kernel before it could verify the packet?
>>>
>>> In the past we tried to protect the L3 data plane as good as we can in
>>> Linux to allow the plain old server admin to set an IP address on an
>>> interface and install whatever software in user space. We try not only
>>> to protect it but also try to achieve fairness by adding a lot of
>>> counters everywhere. Are protections missing right now or are we talking
>>> about better performance?
>>>
>> The technical plenary at last IETF on Seoul a couple of weeks ago was
>> exclusively focussed on DDOS in light of the recent attack against
>> Dyn. There were speakers form Cloudflare and Dyn. The Cloudflare
>> presentation by Nick Sullivan
>> (https://www.ietf.org/proceedings/97/slides/slides-97-ietf-sessb-how-to-stay-online-harsh-realities-of-operating-in-a-hostile-network-nick-sullivan-01.pdf)
>> alluded to some implementation of DDOS mitigation. In particular, on
>> slide 6 Nick gave some numbers for drop rates in DDOS. The "kernel"
>> numbers he gave we're based in iptables+BPF and that was a whole
>> 1.2Mpps-- somehow that seems ridiculously to me (I said so at the mic
>> and that's also when I introduced XDP to whole IETF :-) ). If that's
>> the best we can do the Internet is in a world hurt. DDOS mitigation
>> alone is probably a sufficient motivation to look at XDP. We need
>> something that drops bad packets as quickly as possible when under
>> attack, we need this to be integrated into the stack, we need it to be
>> programmable to deal with the increasing savvy of attackers, and we
>> don't want to be forced to be dependent on HW solutions. This is why
>> we created XDP!
>
> I totally understand that. But in my reply to David in this thread I
> mentioned DNS apex processing as being problematic which is actually
> being referred in your linked slide deck on page 9 ("What do floods look
> like") and the problematic of parsing DNS packets in XDP due to string
> processing and looping inside eBPF.
>
I agree that eBPF is not going to be sufficient from everything we'll
want to do. Undoubtably, we'll continue see new addition of more
helpers to assist in processing, but at some point we will want a to
load a kernel module that handles more complex processing and insert
it at the XDP callout. Nothing in the design of XDP precludes doing
that and I have already posted the patches to generalize the XDP
callout for that. Taking either of these routes has tradeoffs, but
regardless of whether this is BPF or module code, the principles of
XDP and its value to help solve some class of problems remains.

 Tom

> Not to mention the fact that you might have to deal with fragments in
> the Internet. Some DOS mitigations were already abused to generate
> blackholes for other users. Filtering such stuff is quite complicated.
>
> I argued also under the aspect of what Thomas said, that the outside
> world of the cluster is already protected by a CDN.
>
> Bye,
> Hannes
>

^ permalink raw reply

* [PATCH] net: ethernet: ti: davinci_cpdma: sanitize inter-module API
From: Arnd Bergmann @ 2016-12-01 21:51 UTC (permalink / raw)
  To: Mugunthan V N
  Cc: Arnd Bergmann, Grygorii Strashko, David S. Miller,
	Ivan Khoronzhuk, Uwe Kleine-König, linux-omap, netdev,
	linux-kernel

The davinci_cpdma module is a helper library that is used by the
actual device drivers and does nothing by itself, so all its API
functions need to be exported.

Four new functions were added recently without an export, so now
we get a build error:

ERROR: "cpdma_chan_set_weight" [drivers/net/ethernet/ti/ti_cpsw.ko] undefined!
ERROR: "cpdma_chan_get_rate" [drivers/net/ethernet/ti/ti_cpsw.ko] undefined!
ERROR: "cpdma_chan_get_min_rate" [drivers/net/ethernet/ti/ti_cpsw.ko] undefined!
ERROR: "cpdma_chan_set_rate" [drivers/net/ethernet/ti/ti_cpsw.ko] undefined!

This exports those symbols. After taking a closer look, I found two
more global functions in this file that are not exported and not used
anywhere, so they can obviously be removed. There is also one function
that is only used internally in this module, and should be 'static'.

Fixes: 8f32b90981dc ("net: ethernet: ti: davinci_cpdma: add set rate for a channel")
Fixes: 0fc6432cc78d ("net: ethernet: ti: davinci_cpdma: add weight function for channels")
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
---
 drivers/net/ethernet/ti/davinci_cpdma.c | 63 +++++++++++----------------------
 drivers/net/ethernet/ti/davinci_cpdma.h |  4 ---
 2 files changed, 21 insertions(+), 46 deletions(-)

diff --git a/drivers/net/ethernet/ti/davinci_cpdma.c b/drivers/net/ethernet/ti/davinci_cpdma.c
index c776e4575d2d..b9d40f0cdf6c 100644
--- a/drivers/net/ethernet/ti/davinci_cpdma.c
+++ b/drivers/net/ethernet/ti/davinci_cpdma.c
@@ -628,6 +628,23 @@ int cpdma_ctlr_destroy(struct cpdma_ctlr *ctlr)
 }
 EXPORT_SYMBOL_GPL(cpdma_ctlr_destroy);
 
+static int cpdma_chan_int_ctrl(struct cpdma_chan *chan, bool enable)
+{
+	unsigned long flags;
+
+	spin_lock_irqsave(&chan->lock, flags);
+	if (chan->state != CPDMA_STATE_ACTIVE) {
+		spin_unlock_irqrestore(&chan->lock, flags);
+		return -EINVAL;
+	}
+
+	dma_reg_write(chan->ctlr, enable ? chan->int_set : chan->int_clear,
+		      chan->mask);
+	spin_unlock_irqrestore(&chan->lock, flags);
+
+	return 0;
+}
+
 int cpdma_ctlr_int_ctrl(struct cpdma_ctlr *ctlr, bool enable)
 {
 	unsigned long flags;
@@ -796,6 +813,7 @@ int cpdma_chan_set_weight(struct cpdma_chan *ch, int weight)
 	spin_unlock_irqrestore(&ctlr->lock, flags);
 	return ret;
 }
+EXPORT_SYMBOL_GPL(cpdma_chan_set_weight);
 
 /* cpdma_chan_get_min_rate - get minimum allowed rate for channel
  * Should be called before cpdma_chan_set_rate.
@@ -810,6 +828,7 @@ u32 cpdma_chan_get_min_rate(struct cpdma_ctlr *ctlr)
 
 	return DIV_ROUND_UP(divident, divisor);
 }
+EXPORT_SYMBOL_GPL(cpdma_chan_get_min_rate);
 
 /* cpdma_chan_set_rate - limits bandwidth for transmit channel.
  * The bandwidth * limited channels have to be in order beginning from lowest.
@@ -853,6 +872,7 @@ int cpdma_chan_set_rate(struct cpdma_chan *ch, u32 rate)
 	spin_unlock_irqrestore(&ctlr->lock, flags);
 	return ret;
 }
+EXPORT_SYMBOL_GPL(cpdma_chan_set_rate);
 
 u32 cpdma_chan_get_rate(struct cpdma_chan *ch)
 {
@@ -865,6 +885,7 @@ u32 cpdma_chan_get_rate(struct cpdma_chan *ch)
 
 	return rate;
 }
+EXPORT_SYMBOL_GPL(cpdma_chan_get_rate);
 
 struct cpdma_chan *cpdma_chan_create(struct cpdma_ctlr *ctlr, int chan_num,
 				     cpdma_handler_fn handler, int rx_type)
@@ -1270,46 +1291,4 @@ int cpdma_chan_stop(struct cpdma_chan *chan)
 }
 EXPORT_SYMBOL_GPL(cpdma_chan_stop);
 
-int cpdma_chan_int_ctrl(struct cpdma_chan *chan, bool enable)
-{
-	unsigned long flags;
-
-	spin_lock_irqsave(&chan->lock, flags);
-	if (chan->state != CPDMA_STATE_ACTIVE) {
-		spin_unlock_irqrestore(&chan->lock, flags);
-		return -EINVAL;
-	}
-
-	dma_reg_write(chan->ctlr, enable ? chan->int_set : chan->int_clear,
-		      chan->mask);
-	spin_unlock_irqrestore(&chan->lock, flags);
-
-	return 0;
-}
-
-int cpdma_control_get(struct cpdma_ctlr *ctlr, int control)
-{
-	unsigned long flags;
-	int ret;
-
-	spin_lock_irqsave(&ctlr->lock, flags);
-	ret = _cpdma_control_get(ctlr, control);
-	spin_unlock_irqrestore(&ctlr->lock, flags);
-
-	return ret;
-}
-
-int cpdma_control_set(struct cpdma_ctlr *ctlr, int control, int value)
-{
-	unsigned long flags;
-	int ret;
-
-	spin_lock_irqsave(&ctlr->lock, flags);
-	ret = _cpdma_control_set(ctlr, control, value);
-	spin_unlock_irqrestore(&ctlr->lock, flags);
-
-	return ret;
-}
-EXPORT_SYMBOL_GPL(cpdma_control_set);
-
 MODULE_LICENSE("GPL");
diff --git a/drivers/net/ethernet/ti/davinci_cpdma.h b/drivers/net/ethernet/ti/davinci_cpdma.h
index 4a167db2abab..36d0a09a3d44 100644
--- a/drivers/net/ethernet/ti/davinci_cpdma.h
+++ b/drivers/net/ethernet/ti/davinci_cpdma.h
@@ -87,7 +87,6 @@ int cpdma_chan_process(struct cpdma_chan *chan, int quota);
 
 int cpdma_ctlr_int_ctrl(struct cpdma_ctlr *ctlr, bool enable);
 void cpdma_ctlr_eoi(struct cpdma_ctlr *ctlr, u32 value);
-int cpdma_chan_int_ctrl(struct cpdma_chan *chan, bool enable);
 u32 cpdma_ctrl_rxchs_state(struct cpdma_ctlr *ctlr);
 u32 cpdma_ctrl_txchs_state(struct cpdma_ctlr *ctlr);
 bool cpdma_check_free_tx_desc(struct cpdma_chan *chan);
@@ -111,7 +110,4 @@ enum cpdma_control {
 	CPDMA_RX_BUFFER_OFFSET,		/* read-write */
 };
 
-int cpdma_control_get(struct cpdma_ctlr *ctlr, int control);
-int cpdma_control_set(struct cpdma_ctlr *ctlr, int control, int value);
-
 #endif
-- 
2.9.0

^ permalink raw reply related

* Re: Initial thoughts on TXDP
From: Rick Jones @ 2016-12-01 21:47 UTC (permalink / raw)
  To: Tom Herbert; +Cc: Sowmini Varadhan, Linux Kernel Network Developers
In-Reply-To: <CALx6S37aknqfrj66AP8hEfVT9X2OmBFK9GVa9A+0FpydPbm9kg@mail.gmail.com>

On 12/01/2016 12:18 PM, Tom Herbert wrote:
> On Thu, Dec 1, 2016 at 11:48 AM, Rick Jones <rick.jones2@hpe.com> wrote:
>> Just how much per-packet path-length are you thinking will go away under the
>> likes of TXDP?  It is admittedly "just" netperf but losing TSO/GSO does some
>> non-trivial things to effective overhead (service demand) and so throughput:
>>
> For plain in order TCP packets I believe we should be able process
> each packet at nearly same speed as GRO. Most of the protocol
> processing we do between GRO and the stack are the same, the
> differences are that we need to do a connection lookup in the stack
> path (note we now do this is UDP GRO and that hasn't show up as a
> major hit). We also need to consider enqueue/dequeue on the socket
> which is a major reason to try for lockless sockets in this instance.

So waving hands a bit, and taking the service demand for the GRO-on 
receive test in my previous message (860 ns/KB), that would be ~ 
(1448/1024)*860 or ~1.216 usec of CPU time per TCP segment, including 
ACK generation which unless an explicit ACK-avoidance heuristic a la 
HP-UX 11/Solaris 2 is put in place would be for every-other segment. Etc 
etc.

> Sure, but trying running something emulates a more realistic workload
> than a TCP stream, like RR test with relative small payload and many
> connections.

That is a good point, which of course is why the RR tests are there in 
netperf :) Don't get me wrong, I *like* seeing path-length reductions. 
What would you posit is a relatively small payload?  The promotion of 
IR10 suggests that perhaps 14KB or so is a sufficiently common so I'll 
grasp at that as the length of a piece of string:

stack@np-cp1-c0-m1-mgmt:~/rjones2$ ./netperf -c -H np-cp1-c1-m3-mgmt -t 
TCP_RR -- -P 12867 -r 128,14K
MIGRATED TCP REQUEST/RESPONSE TEST from 0.0.0.0 (0.0.0.0) port 12867 
AF_INET to np-cp1-c1-m3-mgmt () port 12867 AF_INET : demo : first burst 0
Local /Remote
Socket Size   Request Resp.  Elapsed Trans.   CPU    CPU    S.dem   S.dem
Send   Recv   Size    Size   Time    Rate     local  remote local   remote
bytes  bytes  bytes   bytes  secs.   per sec  % S    % U    us/Tr   us/Tr

16384  87380  128     14336  10.00   8118.31  1.57   -1.00  46.410  -1.000
16384  87380
stack@np-cp1-c0-m1-mgmt:~/rjones2$ sudo ethtool -K hed0 gro off
stack@np-cp1-c0-m1-mgmt:~/rjones2$ ./netperf -c -H np-cp1-c1-m3-mgmt -t 
TCP_RR -- -P 12867 -r 128,14K
MIGRATED TCP REQUEST/RESPONSE TEST from 0.0.0.0 (0.0.0.0) port 12867 
AF_INET to np-cp1-c1-m3-mgmt () port 12867 AF_INET : demo : first burst 0
Local /Remote
Socket Size   Request Resp.  Elapsed Trans.   CPU    CPU    S.dem   S.dem
Send   Recv   Size    Size   Time    Rate     local  remote local   remote
bytes  bytes  bytes   bytes  secs.   per sec  % S    % U    us/Tr   us/Tr

16384  87380  128     14336  10.00   5837.35  2.20   -1.00  90.628  -1.000
16384  87380

So, losing GRO doubled the service demand.  I suppose I could see 
cutting path-length in half based on the things you listed which would 
be bypassed?

I'm sure mileage will vary with different NICs and CPUs.  The ones used 
here happened to be to hand.

happy benchmarking,

rick

Just to get a crude feel for sensitivity, doubling to 28K unsurprisingly 
goes to more than doubling, and halving to 7K narrows the delta:

stack@np-cp1-c0-m1-mgmt:~/rjones2$ ./netperf -c -H np-cp1-c1-m3-mgmt -t 
TCP_RR -- -P 12867 -r 128,28K
MIGRATED TCP REQUEST/RESPONSE TEST from 0.0.0.0 (0.0.0.0) port 12867 
AF_INET to np-cp1-c1-m3-mgmt () port 12867 AF_INET : demo : first burst 0
Local /Remote
Socket Size   Request Resp.  Elapsed Trans.   CPU    CPU    S.dem   S.dem
Send   Recv   Size    Size   Time    Rate     local  remote local   remote
bytes  bytes  bytes   bytes  secs.   per sec  % S    % U    us/Tr   us/Tr

16384  87380  128     28672  10.00   6732.32  1.79   -1.00  63.819  -1.000
16384  87380
stack@np-cp1-c0-m1-mgmt:~/rjones2$ sudo ethtool -K hed0 gro off
stack@np-cp1-c0-m1-mgmt:~/rjones2$ ./netperf -c -H np-cp1-c1-m3-mgmt -t 
TCP_RR -- -P 12867 -r 128,28K
MIGRATED TCP REQUEST/RESPONSE TEST from 0.0.0.0 (0.0.0.0) port 12867 
AF_INET to np-cp1-c1-m3-mgmt () port 12867 AF_INET : demo : first burst 0
Local /Remote
Socket Size   Request Resp.  Elapsed Trans.   CPU    CPU    S.dem   S.dem
Send   Recv   Size    Size   Time    Rate     local  remote local   remote
bytes  bytes  bytes   bytes  secs.   per sec  % S    % U    us/Tr   us/Tr

16384  87380  128     28672  10.00   3780.47  2.32   -1.00  147.280  -1.000
16384  87380



stack@np-cp1-c0-m1-mgmt:~/rjones2$ ./netperf -c -H np-cp1-c1-m3-mgmt -t 
TCP_RR -- -P 12867 -r 128,7K
MIGRATED TCP REQUEST/RESPONSE TEST from 0.0.0.0 (0.0.0.0) port 12867 
AF_INET to np-cp1-c1-m3-mgmt () port 12867 AF_INET : demo : first burst 0
Local /Remote
Socket Size   Request Resp.  Elapsed Trans.   CPU    CPU    S.dem   S.dem
Send   Recv   Size    Size   Time    Rate     local  remote local   remote
bytes  bytes  bytes   bytes  secs.   per sec  % S    % U    us/Tr   us/Tr

16384  87380  128     7168   10.00   10535.01  1.52   -1.00  34.664  -1.000
16384  87380
stack@np-cp1-c0-m1-mgmt:~/rjones2$ sudo ethtool -K hed0 gro off
stack@np-cp1-c0-m1-mgmt:~/rjones2$ ./netperf -c -H np-cp1-c1-m3-mgmt -t 
TCP_RR -- -P 12867 -r 128,7K
MIGRATED TCP REQUEST/RESPONSE TEST from 0.0.0.0 (0.0.0.0) port 12867 
AF_INET to np-cp1-c1-m3-mgmt () port 12867 AF_INET : demo : first burst 0
Local /Remote
Socket Size   Request Resp.  Elapsed Trans.   CPU    CPU    S.dem   S.dem
Send   Recv   Size    Size   Time    Rate     local  remote local   remote
bytes  bytes  bytes   bytes  secs.   per sec  % S    % U    us/Tr   us/Tr

16384  87380  128     7168   10.00   8225.17  1.80   -1.00  52.661  -1.000
16384  87380

^ permalink raw reply

* Re: DSA vs. SWTICHDEV ?
From: Murali Karicheri @ 2016-12-01 21:38 UTC (permalink / raw)
  To: Andrew Lunn
  Cc: Joakim Tjernlund, netdev@vger.kernel.org, Roger Quadros,
	Grygorii Strashko
In-Reply-To: <20161201173100.GG21887@lunn.ch>

Hi Andrew,
On 12/01/2016 12:31 PM, Andrew Lunn wrote:
> Hi Murali 
> 
>> 2. Switch mode where it implements a simple Ethernet switch. Currently
>>    it doesn't have address learning capability, but in future it
>>    can.
> 
> If it does not have address learning capabilities, does it act like a
> plain old hub? What comes in one port goes out all others?

Thanks for the response!

Yes. It is a plain hub. it replicates frame to both ports. So need to
run a bridge layer for address learning in software.

> 
> Or can you do the learning in software on the host and program tables,
> which the hardware then uses?
> 

I think not. I see we have a non Linux implementation that does address
learning in software using a hash table and look up MAC for each packet
to see which port it needs to be sent to.

Murali

>> 3. Switch with HSR/PRP offload where it provides HSR/PRP protocol
>>    support and cut through switch.
>>
>> So a device need to function in one of the modes. A a regular Ethernet
>> driver that provides two network devices, one per port, and switchdev
>> for each physical port (in switch mode) will look ideal in this case.
> 
> Yes, this seems the right model to use.
> 
>      Andrew
> 


-- 
Murali Karicheri
Linux Kernel, Keystone

^ permalink raw reply

* Re: [WIP] net+mlx4: auto doorbell
From: Alexander Duyck @ 2016-12-01 21:32 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: Jesper Dangaard Brouer, Rick Jones, Netdev, Saeed Mahameed,
	Tariq Toukan
In-Reply-To: <1480402716.18162.124.camel@edumazet-glaptop3.roam.corp.google.com>

On Mon, Nov 28, 2016 at 10:58 PM, Eric Dumazet <eric.dumazet@gmail.com> wrote:
> On Mon, 2016-11-21 at 10:10 -0800, Eric Dumazet wrote:
>
>
>> Not sure it this has been tried before, but the doorbell avoidance could
>> be done by the driver itself, because it knows a TX completion will come
>> shortly (well... if softirqs are not delayed too much !)
>>
>> Doorbell would be forced only if :
>>
>> (    "skb->xmit_more is not set" AND "TX engine is not 'started yet'" )
>> OR
>> ( too many [1] packets were put in TX ring buffer, no point deferring
>> more)
>>
>> Start the pump, but once it is started, let the doorbells being done by
>> TX completion.
>>
>> ndo_start_xmit and TX completion handler would have to maintain a shared
>> state describing if packets were ready but doorbell deferred.
>>
>>
>> Note that TX completion means "if at least one packet was drained",
>> otherwise busy polling, constantly calling napi->poll() would force a
>> doorbell too soon for devices sharing a NAPI for both RX and TX.
>>
>> But then, maybe busy poll would like to force a doorbell...
>>
>> I could try these ideas on mlx4 shortly.
>>
>>
>> [1] limit could be derived from active "ethtool -c" params, eg tx-frames
>
> I have a WIP, that increases pktgen rate by 75 % on mlx4 when bulking is
> not used.
>
> lpaa23:~# echo 0 >/sys/class/net/eth0/doorbell_opt
> lpaa23:~# sar -n DEV 1 10|grep eth0
> 22:43:26         eth0      0.00 5822800.00      0.00 597064.41      0.00      0.00      1.00
> 22:43:27         eth0     24.00 5788237.00      2.09 593520.26      0.00      0.00      0.00
> 22:43:28         eth0     12.00 5817777.00      1.43 596551.47      0.00      0.00      1.00
> 22:43:29         eth0     22.00 5841516.00      1.61 598982.87      0.00      0.00      0.00
> 22:43:30         eth0      4.00 4389137.00      0.71 450058.08      0.00      0.00      1.00
> 22:43:31         eth0      4.00 5871008.00      0.72 602007.79      0.00      0.00      0.00
> 22:43:32         eth0     12.00 5891809.00      1.43 604142.60      0.00      0.00      1.00
> 22:43:33         eth0     10.00 5901904.00      1.12 605175.70      0.00      0.00      0.00
> 22:43:34         eth0      5.00 5907982.00      0.69 605798.99      0.00      0.00      1.00
> 22:43:35         eth0      2.00 5847086.00      0.12 599554.71      0.00      0.00      0.00
> Average:         eth0      9.50 5707925.60      0.99 585285.69      0.00      0.00      0.50
> lpaa23:~# echo 1 >/sys/class/net/eth0/doorbell_opt
> lpaa23:~# sar -n DEV 1 10|grep eth0
> 22:43:47         eth0      9.00 10226424.00      1.02 1048608.05      0.00      0.00      1.00
> 22:43:48         eth0      1.00 10316955.00      0.06 1057890.89      0.00      0.00      0.00
> 22:43:49         eth0      1.00 10310104.00      0.10 1057188.32      0.00      0.00      1.00
> 22:43:50         eth0      0.00 10249423.00      0.00 1050966.23      0.00      0.00      0.00
> 22:43:51         eth0      0.00 10210441.00      0.00 1046969.05      0.00      0.00      1.00
> 22:43:52         eth0      2.00 10198389.00      0.16 1045733.17      0.00      0.00      1.00
> 22:43:53         eth0      8.00 10079257.00      1.43 1033517.83      0.00      0.00      0.00
> 22:43:54         eth0      0.00 7693509.00      0.00 788885.16      0.00      0.00      0.00
> 22:43:55         eth0      2.00 10343076.00      0.20 1060569.32      0.00      0.00      1.00
> 22:43:56         eth0      1.00 10224571.00      0.14 1048417.93      0.00      0.00      0.00
> Average:         eth0      2.40 9985214.90      0.31 1023874.60      0.00      0.00      0.50
>
> And about 11 % improvement on an mono-flow UDP_STREAM test.
>
> skb_set_owner_w() is now the most consuming function.

A few years back when I did something like this on ixgbe I was told by
you that the issue was that doing something like this would add too
much latency.  I was just wondering what the latency impact is on a
change like this and if this now isn't that much of an issue?

>
> lpaa23:~# ./udpsnd -4 -H 10.246.7.152 -d 2 &
> [1] 13696
> lpaa23:~# echo 0 >/sys/class/net/eth0/doorbell_opt
> lpaa23:~# sar -n DEV 1 10|grep eth0
> 22:50:47         eth0      3.00 1355422.00      0.45 319706.04      0.00      0.00      0.00
> 22:50:48         eth0      2.00 1344270.00      0.42 317035.21      0.00      0.00      1.00
> 22:50:49         eth0      3.00 1350503.00      0.51 318478.34      0.00      0.00      0.00
> 22:50:50         eth0     29.00 1348593.00      2.86 318113.02      0.00      0.00      1.00
> 22:50:51         eth0     14.00 1354855.00      1.83 319508.56      0.00      0.00      0.00
> 22:50:52         eth0      7.00 1357794.00      0.73 320226.89      0.00      0.00      1.00
> 22:50:53         eth0      5.00 1326130.00      0.63 312784.72      0.00      0.00      0.00
> 22:50:54         eth0      2.00 994584.00      0.12 234598.40      0.00      0.00      1.00
> 22:50:55         eth0      5.00 1318209.00      0.75 310932.46      0.00      0.00      0.00
> 22:50:56         eth0     20.00 1323445.00      1.73 312178.11      0.00      0.00      1.00
> Average:         eth0      9.00 1307380.50      1.00 308356.18      0.00      0.00      0.50
> lpaa23:~# echo 3 >/sys/class/net/eth0/doorbell_opt
> lpaa23:~# sar -n DEV 1 10|grep eth0
> 22:51:03         eth0      4.00 1512055.00      0.54 356599.40      0.00      0.00      0.00
> 22:51:04         eth0      4.00 1507631.00      0.55 355609.46      0.00      0.00      1.00
> 22:51:05         eth0      4.00 1487789.00      0.42 350917.47      0.00      0.00      0.00
> 22:51:06         eth0      7.00 1474460.00      1.22 347811.16      0.00      0.00      1.00
> 22:51:07         eth0      2.00 1496529.00      0.24 352995.18      0.00      0.00      0.00
> 22:51:08         eth0      3.00 1485856.00      0.49 350425.65      0.00      0.00      1.00
> 22:51:09         eth0      1.00 1114808.00      0.06 262905.38      0.00      0.00      0.00
> 22:51:10         eth0      2.00 1510924.00      0.30 356397.53      0.00      0.00      1.00
> 22:51:11         eth0      2.00 1506408.00      0.30 355345.76      0.00      0.00      0.00
> 22:51:12         eth0      2.00 1499122.00      0.32 353668.75      0.00      0.00      1.00
> Average:         eth0      3.10 1459558.20      0.44 344267.57      0.00      0.00      0.50
>
>  drivers/net/ethernet/mellanox/mlx4/en_rx.c   |    2
>  drivers/net/ethernet/mellanox/mlx4/en_tx.c   |   90 +++++++++++------
>  drivers/net/ethernet/mellanox/mlx4/mlx4_en.h |    4
>  include/linux/netdevice.h                    |    1
>  net/core/net-sysfs.c                         |   18 +++
>  5 files changed, 83 insertions(+), 32 deletions(-)
>
> diff --git a/drivers/net/ethernet/mellanox/mlx4/en_rx.c b/drivers/net/ethernet/mellanox/mlx4/en_rx.c
> index 6562f78b07f4..fbea83218fc0 100644
> --- a/drivers/net/ethernet/mellanox/mlx4/en_rx.c
> +++ b/drivers/net/ethernet/mellanox/mlx4/en_rx.c
> @@ -1089,7 +1089,7 @@ int mlx4_en_process_rx_cq(struct net_device *dev, struct mlx4_en_cq *cq, int bud
>
>         if (polled) {
>                 if (doorbell_pending)
> -                       mlx4_en_xmit_doorbell(priv->tx_ring[TX_XDP][cq->ring]);
> +                       mlx4_en_xmit_doorbell(dev, priv->tx_ring[TX_XDP][cq->ring]);
>
>                 mlx4_cq_set_ci(&cq->mcq);
>                 wmb(); /* ensure HW sees CQ consumer before we post new buffers */
> diff --git a/drivers/net/ethernet/mellanox/mlx4/en_tx.c b/drivers/net/ethernet/mellanox/mlx4/en_tx.c
> index 4b597dca5c52..affebb435679 100644
> --- a/drivers/net/ethernet/mellanox/mlx4/en_tx.c
> +++ b/drivers/net/ethernet/mellanox/mlx4/en_tx.c
> @@ -67,7 +67,7 @@ int mlx4_en_create_tx_ring(struct mlx4_en_priv *priv,
>         ring->size = size;
>         ring->size_mask = size - 1;
>         ring->sp_stride = stride;
> -       ring->full_size = ring->size - HEADROOM - MAX_DESC_TXBBS;
> +       ring->full_size = ring->size - HEADROOM - 2*MAX_DESC_TXBBS;
>

Just wondering if this should be a separate change?  Seems like this
is something that might apply outside of your changes since it seems
like it is reserving enough room for 2 buffers.

>         tmp = size * sizeof(struct mlx4_en_tx_info);
>         ring->tx_info = kmalloc_node(tmp, GFP_KERNEL | __GFP_NOWARN, node);
> @@ -193,6 +193,7 @@ int mlx4_en_activate_tx_ring(struct mlx4_en_priv *priv,
>         ring->sp_cqn = cq;
>         ring->prod = 0;
>         ring->cons = 0xffffffff;
> +       ring->ncons = 0;
>         ring->last_nr_txbb = 1;
>         memset(ring->tx_info, 0, ring->size * sizeof(struct mlx4_en_tx_info));
>         memset(ring->buf, 0, ring->buf_size);
> @@ -227,9 +228,9 @@ void mlx4_en_deactivate_tx_ring(struct mlx4_en_priv *priv,
>                        MLX4_QP_STATE_RST, NULL, 0, 0, &ring->sp_qp);
>  }
>
> -static inline bool mlx4_en_is_tx_ring_full(struct mlx4_en_tx_ring *ring)
> +static inline bool mlx4_en_is_tx_ring_full(const struct mlx4_en_tx_ring *ring)
>  {
> -       return ring->prod - ring->cons > ring->full_size;
> +       return READ_ONCE(ring->prod) - READ_ONCE(ring->cons) > ring->full_size;
>  }
>

With the use of READ_ONCE are there some barriers that are going to
need to be added to make sure these are populated in the correct
spots?  Any ordering constraints on the updates these fields?

>  static void mlx4_en_stamp_wqe(struct mlx4_en_priv *priv,
> @@ -374,6 +375,7 @@ int mlx4_en_free_tx_buf(struct net_device *dev, struct mlx4_en_tx_ring *ring)
>
>         /* Skip last polled descriptor */
>         ring->cons += ring->last_nr_txbb;
> +       ring->ncons += ring->last_nr_txbb;
>         en_dbg(DRV, priv, "Freeing Tx buf - cons:0x%x prod:0x%x\n",
>                  ring->cons, ring->prod);
>
> @@ -389,6 +391,7 @@ int mlx4_en_free_tx_buf(struct net_device *dev, struct mlx4_en_tx_ring *ring)
>                                                 !!(ring->cons & ring->size), 0,
>                                                 0 /* Non-NAPI caller */);
>                 ring->cons += ring->last_nr_txbb;
> +               ring->ncons += ring->last_nr_txbb;
>                 cnt++;
>         }
>
> @@ -401,6 +404,38 @@ int mlx4_en_free_tx_buf(struct net_device *dev, struct mlx4_en_tx_ring *ring)
>         return cnt;
>  }
>
> +void mlx4_en_xmit_doorbell(const struct net_device *dev,
> +                          struct mlx4_en_tx_ring *ring)
> +{
> +
> +       if (dev->doorbell_opt & 1) {
> +               u32 oval = READ_ONCE(ring->prod_bell);
> +               u32 nval = READ_ONCE(ring->prod);
> +
> +               if (oval == nval)
> +                       return;
> +
> +               /* I can not tell yet if a cmpxchg() is needed or not */
> +               if (dev->doorbell_opt & 2)
> +                       WRITE_ONCE(ring->prod_bell, nval);
> +               else
> +                       if (cmpxchg(&ring->prod_bell, oval, nval) != oval)
> +                               return;
> +       }
> +       /* Since there is no iowrite*_native() that writes the
> +        * value as is, without byteswapping - using the one
> +        * the doesn't do byteswapping in the relevant arch
> +        * endianness.
> +        */
> +#if defined(__LITTLE_ENDIAN)
> +       iowrite32(
> +#else
> +       iowrite32be(
> +#endif
> +                 ring->doorbell_qpn,
> +                 ring->bf.uar->map + MLX4_SEND_DOORBELL);
> +}
> +

I realize you are just copying from the function further down, but
couldn't this just be __raw_writel instead of iowrite32?

Also you may be able to squeeze a bit more out of this by doing some
barrier clean-ups.  In most drivers the wmb() is needed between writes
of the DMA memory and the MMIO memory.  So odds are you could hold off
on using a wmb() until you call this function and you wouldn't really
need it until just before the call to iowrite32 or __raw_writel.  The
rest of it could use either an smp_wmb() or dma_wmb().

>  static bool mlx4_en_process_tx_cq(struct net_device *dev,
>                                   struct mlx4_en_cq *cq, int napi_budget)
>  {
> @@ -496,8 +531,13 @@ static bool mlx4_en_process_tx_cq(struct net_device *dev,
>         wmb();
>
>         /* we want to dirty this cache line once */
> -       ACCESS_ONCE(ring->last_nr_txbb) = last_nr_txbb;
> -       ACCESS_ONCE(ring->cons) = ring_cons + txbbs_skipped;
> +       WRITE_ONCE(ring->last_nr_txbb, last_nr_txbb);
> +       ring_cons += txbbs_skipped;
> +       WRITE_ONCE(ring->cons, ring_cons);
> +       WRITE_ONCE(ring->ncons, ring_cons + last_nr_txbb);
> +
> +       if (dev->doorbell_opt)
> +               mlx4_en_xmit_doorbell(dev, ring);
>

It might be more readable to put the addition on ring_cons out in
front of all the WRITE_ONCE macro calls.

>         if (ring->free_tx_desc == mlx4_en_recycle_tx_desc)
>                 return done < budget;
> @@ -725,29 +765,14 @@ static void mlx4_bf_copy(void __iomem *dst, const void *src,
>         __iowrite64_copy(dst, src, bytecnt / 8);
>  }
>
> -void mlx4_en_xmit_doorbell(struct mlx4_en_tx_ring *ring)
> -{
> -       wmb();
> -       /* Since there is no iowrite*_native() that writes the
> -        * value as is, without byteswapping - using the one
> -        * the doesn't do byteswapping in the relevant arch
> -        * endianness.
> -        */
> -#if defined(__LITTLE_ENDIAN)
> -       iowrite32(
> -#else
> -       iowrite32be(
> -#endif
> -                 ring->doorbell_qpn,
> -                 ring->bf.uar->map + MLX4_SEND_DOORBELL);
> -}
>
>  static void mlx4_en_tx_write_desc(struct mlx4_en_tx_ring *ring,
>                                   struct mlx4_en_tx_desc *tx_desc,
>                                   union mlx4_wqe_qpn_vlan qpn_vlan,
>                                   int desc_size, int bf_index,
>                                   __be32 op_own, bool bf_ok,
> -                                 bool send_doorbell)
> +                                 bool send_doorbell,
> +                                 const struct net_device *dev, int nr_txbb)
>  {
>         tx_desc->ctrl.qpn_vlan = qpn_vlan;
>
> @@ -761,6 +786,7 @@ static void mlx4_en_tx_write_desc(struct mlx4_en_tx_ring *ring,
>
>                 wmb();
>
> +               ring->prod += nr_txbb;
>                 mlx4_bf_copy(ring->bf.reg + ring->bf.offset, &tx_desc->ctrl,
>                              desc_size);
>
> @@ -773,8 +799,9 @@ static void mlx4_en_tx_write_desc(struct mlx4_en_tx_ring *ring,
>                  */
>                 dma_wmb();
>                 tx_desc->ctrl.owner_opcode = op_own;
> +               ring->prod += nr_txbb;
>                 if (send_doorbell)
> -                       mlx4_en_xmit_doorbell(ring);
> +                       mlx4_en_xmit_doorbell(dev, ring);
>                 else
>                         ring->xmit_more++;
>         }
> @@ -1017,8 +1044,6 @@ netdev_tx_t mlx4_en_xmit(struct sk_buff *skb, struct net_device *dev)
>                         op_own |= cpu_to_be32(MLX4_WQE_CTRL_IIP);
>         }
>
> -       ring->prod += nr_txbb;
> -
>         /* If we used a bounce buffer then copy descriptor back into place */
>         if (unlikely(bounce))
>                 tx_desc = mlx4_en_bounce_to_desc(priv, ring, index, desc_size);
> @@ -1033,6 +1058,14 @@ netdev_tx_t mlx4_en_xmit(struct sk_buff *skb, struct net_device *dev)
>         }
>         send_doorbell = !skb->xmit_more || netif_xmit_stopped(ring->tx_queue);
>
> +       /* Doorbell avoidance : We can omit doorbell if we know a TX completion
> +        * will happen shortly.
> +        */
> +       if (send_doorbell &&
> +           dev->doorbell_opt &&
> +           (s32)(READ_ONCE(ring->prod_bell) - READ_ONCE(ring->ncons)) > 0)
> +               send_doorbell = false;
> +
>         real_size = (real_size / 16) & 0x3f;
>
>         bf_ok &= desc_size <= MAX_BF && send_doorbell;
> @@ -1043,7 +1076,7 @@ netdev_tx_t mlx4_en_xmit(struct sk_buff *skb, struct net_device *dev)
>                 qpn_vlan.fence_size = real_size;
>
>         mlx4_en_tx_write_desc(ring, tx_desc, qpn_vlan, desc_size, bf_index,
> -                             op_own, bf_ok, send_doorbell);
> +                             op_own, bf_ok, send_doorbell, dev, nr_txbb);
>
>         if (unlikely(stop_queue)) {
>                 /* If queue was emptied after the if (stop_queue) , and before
> @@ -1054,7 +1087,6 @@ netdev_tx_t mlx4_en_xmit(struct sk_buff *skb, struct net_device *dev)
>                  */
>                 smp_rmb();
>
> -               ring_cons = ACCESS_ONCE(ring->cons);
>                 if (unlikely(!mlx4_en_is_tx_ring_full(ring))) {
>                         netif_tx_wake_queue(ring->tx_queue);
>                         ring->wake_queue++;
> @@ -1158,8 +1190,6 @@ netdev_tx_t mlx4_en_xmit_frame(struct mlx4_en_rx_ring *rx_ring,
>         rx_ring->xdp_tx++;
>         AVG_PERF_COUNTER(priv->pstats.tx_pktsz_avg, length);
>
> -       ring->prod += nr_txbb;
> -
>         stop_queue = mlx4_en_is_tx_ring_full(ring);
>         send_doorbell = stop_queue ||
>                                 *doorbell_pending > MLX4_EN_DOORBELL_BUDGET;
> @@ -1173,7 +1203,7 @@ netdev_tx_t mlx4_en_xmit_frame(struct mlx4_en_rx_ring *rx_ring,
>                 qpn_vlan.fence_size = real_size;
>
>         mlx4_en_tx_write_desc(ring, tx_desc, qpn_vlan, TXBB_SIZE, bf_index,
> -                             op_own, bf_ok, send_doorbell);
> +                             op_own, bf_ok, send_doorbell, dev, nr_txbb);
>         *doorbell_pending = send_doorbell ? 0 : *doorbell_pending + 1;
>
>         return NETDEV_TX_OK;
> diff --git a/drivers/net/ethernet/mellanox/mlx4/mlx4_en.h b/drivers/net/ethernet/mellanox/mlx4/mlx4_en.h
> index 574bcbb1b38f..c3fd0deda198 100644
> --- a/drivers/net/ethernet/mellanox/mlx4/mlx4_en.h
> +++ b/drivers/net/ethernet/mellanox/mlx4/mlx4_en.h
> @@ -280,6 +280,7 @@ struct mlx4_en_tx_ring {
>          */
>         u32                     last_nr_txbb;
>         u32                     cons;
> +       u32                     ncons;
>         unsigned long           wake_queue;
>         struct netdev_queue     *tx_queue;
>         u32                     (*free_tx_desc)(struct mlx4_en_priv *priv,
> @@ -290,6 +291,7 @@ struct mlx4_en_tx_ring {
>
>         /* cache line used and dirtied in mlx4_en_xmit() */
>         u32                     prod ____cacheline_aligned_in_smp;
> +       u32                     prod_bell;
>         unsigned int            tx_dropped;
>         unsigned long           bytes;
>         unsigned long           packets;
> @@ -699,7 +701,7 @@ netdev_tx_t mlx4_en_xmit_frame(struct mlx4_en_rx_ring *rx_ring,
>                                struct mlx4_en_rx_alloc *frame,
>                                struct net_device *dev, unsigned int length,
>                                int tx_ind, int *doorbell_pending);
> -void mlx4_en_xmit_doorbell(struct mlx4_en_tx_ring *ring);
> +void mlx4_en_xmit_doorbell(const struct net_device *dev, struct mlx4_en_tx_ring *ring);
>  bool mlx4_en_rx_recycle(struct mlx4_en_rx_ring *ring,
>                         struct mlx4_en_rx_alloc *frame);
>
> diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
> index 4ffcd874cc20..39565b5425a6 100644
> --- a/include/linux/netdevice.h
> +++ b/include/linux/netdevice.h
> @@ -1816,6 +1816,7 @@ struct net_device {
>         DECLARE_HASHTABLE       (qdisc_hash, 4);
>  #endif
>         unsigned long           tx_queue_len;
> +       unsigned long           doorbell_opt;
>         spinlock_t              tx_global_lock;
>         int                     watchdog_timeo;
>
> diff --git a/net/core/net-sysfs.c b/net/core/net-sysfs.c
> index b0c04cf4851d..df05f81f5150 100644
> --- a/net/core/net-sysfs.c
> +++ b/net/core/net-sysfs.c
> @@ -367,6 +367,23 @@ static ssize_t gro_flush_timeout_store(struct device *dev,
>  }
>  NETDEVICE_SHOW_RW(gro_flush_timeout, fmt_ulong);
>
> +static int change_doorbell_opt(struct net_device *dev, unsigned long val)
> +{
> +       dev->doorbell_opt = val;
> +       return 0;
> +}
> +
> +static ssize_t doorbell_opt_store(struct device *dev,
> +                                 struct device_attribute *attr,
> +                                 const char *buf, size_t len)
> +{
> +       if (!capable(CAP_NET_ADMIN))
> +               return -EPERM;
> +
> +       return netdev_store(dev, attr, buf, len, change_doorbell_opt);
> +}
> +NETDEVICE_SHOW_RW(doorbell_opt, fmt_ulong);
> +
>  static ssize_t ifalias_store(struct device *dev, struct device_attribute *attr,
>                              const char *buf, size_t len)
>  {
> @@ -531,6 +548,7 @@ static struct attribute *net_class_attrs[] = {
>         &dev_attr_phys_port_name.attr,
>         &dev_attr_phys_switch_id.attr,
>         &dev_attr_proto_down.attr,
> +       &dev_attr_doorbell_opt.attr,
>         NULL,
>  };
>  ATTRIBUTE_GROUPS(net_class);
>
>

^ permalink raw reply

* Re: [PATCH iproute2] ip: update link types to show 6lowpan and ieee802.15.4 monitor
From: Stefan Schmidt @ 2016-12-01 21:31 UTC (permalink / raw)
  To: Stephen Hemminger; +Cc: netdev@vger.kernel.org, linux-wpan@vger.kernel.org
In-Reply-To: <1477647723-14641-1-git-send-email-stefan@datenfreihafen.org>

Hello.

On 28.10.2016 11:42, Stefan Schmidt wrote:
> Both types have been missing here and thus ip always showed
> only the numbers.
> 
> Based on a suggestion from Alexander Aring.
> 
> Signed-off-by: Stefan Schmidt <stefan@datenfreihafen.org>

Did you somehow mangle this patch manually?

Looking at the patch in your git repo it shows no author name but
instead just my patch was send with git format-patch and git send-email
as usual and shows the right author. Was there something worn on my side
or yours? Just checking to avoid it in the future.

http://git.kernel.org/cgit/linux/kernel/git/shemminger/iproute2.git/commit/?id=8ae2c5382bd9d98a8f7ddcb1faad1a978d773909

regards
Stefan Schmidt

^ permalink raw reply

* Re: [flamebait] xdp, well meaning but pointless
From: Hannes Frederic Sowa @ 2016-12-01 21:27 UTC (permalink / raw)
  To: Tom Herbert
  Cc: Thomas Graf, Florian Westphal, Linux Kernel Network Developers
In-Reply-To: <CALx6S36=Y0dSux+-omWXZKtxb73_g4+DXhPBawb79+UF6rp9rw@mail.gmail.com>

On 01.12.2016 22:12, Tom Herbert wrote:
> On Thu, Dec 1, 2016 at 12:44 PM, Hannes Frederic Sowa
> <hannes@stressinduktion.org> wrote:
>> Hello,
>>
>> this is a good conversation and I simply want to bring my worries
>> across. I don't have good solutions for the problems XDP tries to solve
>> but I fear we could get caught up in maintenance problems in the long
>> term given the ideas floating around on how to evolve XDP currently.
>>
>> On 01.12.2016 17:28, Thomas Graf wrote:
>>> On 12/01/16 at 04:52pm, Hannes Frederic Sowa wrote:
>>>> First of all, this is a rant targeted at XDP and not at eBPF as a whole.
>>>> XDP manipulates packets at free will and thus all security guarantees
>>>> are off as well as in any user space solution.
>>>>
>>>> Secondly user space provides policy, acl, more controlled memory
>>>> protection, restartability and better debugability. If I had multi
>>>> tenant workloads I would definitely put more complex "business/acl"
>>>> logic into user space, so I can make use of LSM and other features to
>>>> especially prevent a network facing service to attack the tenants. If
>>>> stuff gets put into the kernel you run user controlled code in the
>>>> kernel exposing a much bigger attack vector.
>>>>
>>>> What use case do you see in XDP specifically e.g. for container networking?
>>>
>>> DDOS mitigation to protect distributed applications in large clusters.
>>> Relying on CDN works to protect API gateways and frontends (as long as
>>> they don't throw you out of their network) but offers no protection
>>> beyond that, e.g. a noisy/hostile neighbour. Doing this at the server
>>> level and allowing the mitigation capability to scale up with the number
>>> of servers is natural and cheap.
>>
>> So far we e.g. always considered L2 attacks a problem of the network
>> admin to correctly protect the environment. Are you talking about
>> protecting the L3 data plane? Are there custom proprietary protocols in
>> place which need custom protocol parsers that need involvement of the
>> kernel before it could verify the packet?
>>
>> In the past we tried to protect the L3 data plane as good as we can in
>> Linux to allow the plain old server admin to set an IP address on an
>> interface and install whatever software in user space. We try not only
>> to protect it but also try to achieve fairness by adding a lot of
>> counters everywhere. Are protections missing right now or are we talking
>> about better performance?
>>
> The technical plenary at last IETF on Seoul a couple of weeks ago was
> exclusively focussed on DDOS in light of the recent attack against
> Dyn. There were speakers form Cloudflare and Dyn. The Cloudflare
> presentation by Nick Sullivan
> (https://www.ietf.org/proceedings/97/slides/slides-97-ietf-sessb-how-to-stay-online-harsh-realities-of-operating-in-a-hostile-network-nick-sullivan-01.pdf)
> alluded to some implementation of DDOS mitigation. In particular, on
> slide 6 Nick gave some numbers for drop rates in DDOS. The "kernel"
> numbers he gave we're based in iptables+BPF and that was a whole
> 1.2Mpps-- somehow that seems ridiculously to me (I said so at the mic
> and that's also when I introduced XDP to whole IETF :-) ). If that's
> the best we can do the Internet is in a world hurt. DDOS mitigation
> alone is probably a sufficient motivation to look at XDP. We need
> something that drops bad packets as quickly as possible when under
> attack, we need this to be integrated into the stack, we need it to be
> programmable to deal with the increasing savvy of attackers, and we
> don't want to be forced to be dependent on HW solutions. This is why
> we created XDP!

I totally understand that. But in my reply to David in this thread I
mentioned DNS apex processing as being problematic which is actually
being referred in your linked slide deck on page 9 ("What do floods look
like") and the problematic of parsing DNS packets in XDP due to string
processing and looping inside eBPF.

Not to mention the fact that you might have to deal with fragments in
the Internet. Some DOS mitigations were already abused to generate
blackholes for other users. Filtering such stuff is quite complicated.

I argued also under the aspect of what Thomas said, that the outside
world of the cluster is already protected by a CDN.

Bye,
Hannes

^ permalink raw reply

* Re: [patch net-next v3 11/12] mlxsw: spectrum_router: Request a dump of FIB tables during init
From: Ido Schimmel @ 2016-12-01 21:21 UTC (permalink / raw)
  To: Hannes Frederic Sowa
  Cc: David Miller, jiri, netdev, idosch, eladr, yotamg, nogahf,
	arkadis, ogerlitz, roopa, dsa, nikolay, andy, vivien.didelot,
	andrew, f.fainelli, alexander.h.duyck, kaber
In-Reply-To: <343eadfa-f872-788d-748c-a195c0c4d03a@stressinduktion.org>

On Thu, Dec 01, 2016 at 10:09:19PM +0100, Hannes Frederic Sowa wrote:
> On 01.12.2016 21:54, Ido Schimmel wrote:
> > On Thu, Dec 01, 2016 at 09:40:48PM +0100, Hannes Frederic Sowa wrote:
> >> On 01.12.2016 21:04, David Miller wrote:
> >>>
> >>> Hannes and Ido,
> >>>
> >>> It looks like we are very close to having this in mergable shape, can
> >>> you guys work out this final issue and figure out if it really is
> >>> a merge stopped or not?
> >>
> >> Sure, if the fib notification register could be done under protection of
> >> the sequence counter I don't see any more problems.
> > 
> > Did you maybe miss my reply yesterday? Because I was trying to
> > understand what "ordering" you're referring to, but didn't receive a
> > reply from you.
> 
> Oh, strange, I am pretty sure I replied to that. Let me resend it.

:)

I did get this reply, and then replied myself here:
https://marc.info/?l=linux-netdev&m=148053017425465&w=2

^ permalink raw reply

* Re: [patch net-next v3 11/12] mlxsw: spectrum_router: Request a dump of FIB tables during init
From: Hannes Frederic Sowa @ 2016-12-01 21:09 UTC (permalink / raw)
  To: Ido Schimmel
  Cc: David Miller, jiri, netdev, idosch, eladr, yotamg, nogahf,
	arkadis, ogerlitz, roopa, dsa, nikolay, andy, vivien.didelot,
	andrew, f.fainelli, alexander.h.duyck, kaber
In-Reply-To: <20161201205457.xf5evjphjj6mytkf@splinter>

On 01.12.2016 21:54, Ido Schimmel wrote:
> On Thu, Dec 01, 2016 at 09:40:48PM +0100, Hannes Frederic Sowa wrote:
>> On 01.12.2016 21:04, David Miller wrote:
>>>
>>> Hannes and Ido,
>>>
>>> It looks like we are very close to having this in mergable shape, can
>>> you guys work out this final issue and figure out if it really is
>>> a merge stopped or not?
>>
>> Sure, if the fib notification register could be done under protection of
>> the sequence counter I don't see any more problems.
> 
> Did you maybe miss my reply yesterday? Because I was trying to
> understand what "ordering" you're referring to, but didn't receive a
> reply from you.

Oh, strange, I am pretty sure I replied to that. Let me resend it.

>> The sync handler is nice to have and can be done in a later patch series.
> 
> Sync handler?

I was talking about SWITCHDEV_SYNC.

Bye,
Hannes

^ permalink raw reply

* Re: [patch] net: renesas: ravb: unintialized return value
From: Sergei Shtylyov @ 2016-12-01 21:13 UTC (permalink / raw)
  To: Dan Carpenter, Johan Hovold
  Cc: David S. Miller, Yoshihiro Kaneko, Kazuya Mizuguchi, Simon Horman,
	Wolfram Sang, Andrew Lunn, Philippe Reynes, Niklas Söderlund,
	Arnd Bergmann, netdev, linux-renesas-soc, kernel-janitors
In-Reply-To: <20161201205744.GB10701@mwanda>

Hello!

On 12/01/2016 11:57 PM, Dan Carpenter wrote:

> We want to set the other "err" variable here so that we can return it
> later.  My version of GCC misses this issue but I caught it with a
> static checker.
>
> Fixes: 9f70eb339f52 ("net: ethernet: renesas: ravb: fix fixed-link phydev leaks")

    Hm, I somehow missed this one, probably due to the horrific CC list. :-(

> Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>

Acked-by: Sergei Shtylyov <sergei.shtylyov@cogentembedded.com>

MBR, Sergei

^ permalink raw reply

* Re: [flamebait] xdp, well meaning but pointless
From: Tom Herbert @ 2016-12-01 21:12 UTC (permalink / raw)
  To: Hannes Frederic Sowa
  Cc: Thomas Graf, Florian Westphal, Linux Kernel Network Developers
In-Reply-To: <583b8947-3395-8529-933b-08e1a86a0778@stressinduktion.org>

On Thu, Dec 1, 2016 at 12:44 PM, Hannes Frederic Sowa
<hannes@stressinduktion.org> wrote:
> Hello,
>
> this is a good conversation and I simply want to bring my worries
> across. I don't have good solutions for the problems XDP tries to solve
> but I fear we could get caught up in maintenance problems in the long
> term given the ideas floating around on how to evolve XDP currently.
>
> On 01.12.2016 17:28, Thomas Graf wrote:
>> On 12/01/16 at 04:52pm, Hannes Frederic Sowa wrote:
>>> First of all, this is a rant targeted at XDP and not at eBPF as a whole.
>>> XDP manipulates packets at free will and thus all security guarantees
>>> are off as well as in any user space solution.
>>>
>>> Secondly user space provides policy, acl, more controlled memory
>>> protection, restartability and better debugability. If I had multi
>>> tenant workloads I would definitely put more complex "business/acl"
>>> logic into user space, so I can make use of LSM and other features to
>>> especially prevent a network facing service to attack the tenants. If
>>> stuff gets put into the kernel you run user controlled code in the
>>> kernel exposing a much bigger attack vector.
>>>
>>> What use case do you see in XDP specifically e.g. for container networking?
>>
>> DDOS mitigation to protect distributed applications in large clusters.
>> Relying on CDN works to protect API gateways and frontends (as long as
>> they don't throw you out of their network) but offers no protection
>> beyond that, e.g. a noisy/hostile neighbour. Doing this at the server
>> level and allowing the mitigation capability to scale up with the number
>> of servers is natural and cheap.
>
> So far we e.g. always considered L2 attacks a problem of the network
> admin to correctly protect the environment. Are you talking about
> protecting the L3 data plane? Are there custom proprietary protocols in
> place which need custom protocol parsers that need involvement of the
> kernel before it could verify the packet?
>
> In the past we tried to protect the L3 data plane as good as we can in
> Linux to allow the plain old server admin to set an IP address on an
> interface and install whatever software in user space. We try not only
> to protect it but also try to achieve fairness by adding a lot of
> counters everywhere. Are protections missing right now or are we talking
> about better performance?
>
The technical plenary at last IETF on Seoul a couple of weeks ago was
exclusively focussed on DDOS in light of the recent attack against
Dyn. There were speakers form Cloudflare and Dyn. The Cloudflare
presentation by Nick Sullivan
(https://www.ietf.org/proceedings/97/slides/slides-97-ietf-sessb-how-to-stay-online-harsh-realities-of-operating-in-a-hostile-network-nick-sullivan-01.pdf)
alluded to some implementation of DDOS mitigation. In particular, on
slide 6 Nick gave some numbers for drop rates in DDOS. The "kernel"
numbers he gave we're based in iptables+BPF and that was a whole
1.2Mpps-- somehow that seems ridiculously to me (I said so at the mic
and that's also when I introduced XDP to whole IETF :-) ). If that's
the best we can do the Internet is in a world hurt. DDOS mitigation
alone is probably a sufficient motivation to look at XDP. We need
something that drops bad packets as quickly as possible when under
attack, we need this to be integrated into the stack, we need it to be
programmable to deal with the increasing savvy of attackers, and we
don't want to be forced to be dependent on HW solutions. This is why
we created XDP!

Tom

> To provide fairness you often have to share validated data within the
> kernel and with XDP. This requires consistent lookup methods for sockets
> in the lower level. Those can be exported to XDP via external functions
> and become part of uAPI which will limit our ability to change those
> functions in future. When the discussion started about early demuxing in
> XDP I became really nervous, because suddenly the XDP program has to
> decide correctly which protocol type it has and look in the correct
> socket table for the socket. Different semantics for sockets can apply
> here, e.g. some sockets are RCU managed, some end up using reference
> counts. A wrong decision here would cause havoc in the kernel (XDP
> considers packet as UDP but kernel stack as TCP). Also, who knows that
> we won't have per-cpu socket tables we would keep that as uAPI (this is
> btw. the dragonflyBSD approach to scaling)? Imagine someone writing a
> SIP rewriter in XDP and depending on a coherent view of all sockets even
> if their hash doesn't fit to the one of the queue? Suddenly something
> which was thought of as being only mutable by one CPU becomes global
> again and because of XDP we need to add locking because of uAPI.
>
> This discussion is parallel to the discussion about trace points, which
> are not considered uAPI. If eBPF functions are not considered uAPI then
> eBPF in the network stack will have much less value, because you
> suddenly depend on specific kernel versions again and cannot simply load
> the code into the kernel. The API checks will become very difficult to
> implement, see also the ongoing MODVERSIONS discussions on LKML some
> days back.
>
>>>> I agree with you if the LB is a software based appliance in either a
>>>> dedicated VM or on dedicated baremetal.
>>>>
>>>> The reality is turning out to be different in many cases though, LB
>>>> needs to be performed not only for north south but east west as well.
>>>> So even if I would handle LB for traffic entering my datacenter in user
>>>> space, I will need the same LB for packets from my applications and
>>>> I definitely don't want to move all of that into user space.
>>>
>>> The open question to me is why is programmability needed here.
>>>
>>> Look at the discussion about ECMP and consistent hashing. It is not very
>>> easy to actually write this code correctly. Why can't we just put C code
>>> into the kernel that implements this once and for all and let user space
>>> update the policies?
>>
>> Whatever LB logic is put in place with native C code now is unlikely the
>> logic we need in two years. We can't really predict the future. If it
>> was the case, networking would have been done long ago and we would all
>> be working on self eating ice cream now.
>
> Did LB algorithms on the networking layer change that much?
>
> There is a long history of using consistent hashing for load balancing,
> as e.g. is done in haproxy or F5.
>
>>> Load balancers have to deal correctly with ICMP packets, e.g. they even
>>> have to be duplicated to every ECMP route. This seems to be problematic
>>> to do in eBPF programs due to looping constructs so you end up with
>>> complicated user space anyway.
>>
>> Feel free to implement such complex LBs in user space or natively. It is
>> not required for the majority of use cases. The most popular LBs for
>> application load balancing have no idea of ECMP and require ECMP aware
>> routers to be made redundant itself.
>
> They are already available and e.g. deployed as part of some kubernetes
> stacks as I wrote above.
>
> It is a generally available algorithm which fits a lot of use cases,
> basically every website that wants to shard its sessions can make use of
> it. Also it is independent of ECMP and mostly is implemented in load
> balancers due to its need for a lot of memory.
>
> New algorithms outdate old ones but the core principles will be the same
> and don't require major changes to the interface, e.g. ipvs scheduler.
>
> If we are talking about security features for early drop inside TCP
> streams, like http, you need to have a proper stream reassembly engine.
> Snort e.g. dropped a complete stream of TCP packets if you send a RST
> with the same quadruple but a wrong sequence number. End system didn't
> consider the RST but non synchronized solutions ended up not inspecting
> this flow anymore. How do you handle diverting views on meta data in
> networking protocols? Also look how hard it is to keep e.g. the fib
> table synchronized to the hardware.
>
> In retrospect, I think Tom Herbert's move putting ILA stateless
> translation into the XDP hook wasn't that bad after all. ILA maybe
> hopefully becomes a standard and its implementation is already in the
> kernel so why keep its translator not part of the kernel, too?
>
> TLDR; what I'm trying to argue is that evolution of the network stack is
> problematic with a programmable backplane in the kernel which locks out
> future modifications of the stack in some places. On the other side, if
> we don't add those features we will have a half baked solution and
> people will simply prefer netmap or DPDK.
>
> Bye,
> Hannes
>

^ permalink raw reply

* Re: [patch net-next v3 11/12] mlxsw: spectrum_router: Request a dump of FIB tables during init
From: Hannes Frederic Sowa @ 2016-12-01 21:09 UTC (permalink / raw)
  To: Ido Schimmel
  Cc: Jiri Pirko, netdev, davem, idosch, eladr, yotamg, nogahf, arkadis,
	ogerlitz, roopa, dsa, nikolay, andy, vivien.didelot, andrew,
	f.fainelli, alexander.h.duyck, kaber
In-Reply-To: <20161130163229.rkxvuwukgg35ktrx@splinter.mtl.com>

On 30.11.2016 17:32, Ido Schimmel wrote:
> Hi Hannes,
> 
> On Wed, Nov 30, 2016 at 04:37:48PM +0100, Hannes Frederic Sowa wrote:
>> On 30.11.2016 11:09, Jiri Pirko wrote:
>>> From: Ido Schimmel <idosch@mellanox.com>
>>>
>>> Make sure the device has a complete view of the FIB tables by invoking
>>> their dump during module init.
>>>
>>> Signed-off-by: Ido Schimmel <idosch@mellanox.com>
>>> Signed-off-by: Jiri Pirko <jiri@mellanox.com>
>>> ---
>>>  .../net/ethernet/mellanox/mlxsw/spectrum_router.c  | 23 ++++++++++++++++++++++
>>>  1 file changed, 23 insertions(+)
>>>
>>> diff --git a/drivers/net/ethernet/mellanox/mlxsw/spectrum_router.c b/drivers/net/ethernet/mellanox/mlxsw/spectrum_router.c
>>> index 14bed1d..d176047 100644
>>> --- a/drivers/net/ethernet/mellanox/mlxsw/spectrum_router.c
>>> +++ b/drivers/net/ethernet/mellanox/mlxsw/spectrum_router.c
>>> @@ -2027,8 +2027,23 @@ static int mlxsw_sp_router_fib_event(struct notifier_block *nb,
>>>  	return NOTIFY_DONE;
>>>  }
>>>  
>>> +static void mlxsw_sp_router_fib_dump_flush(struct notifier_block *nb)
>>> +{
>>> +	struct mlxsw_sp *mlxsw_sp = container_of(nb, struct mlxsw_sp, fib_nb);
>>> +
>>> +	/* Flush pending FIB notifications and then flush the device's
>>> +	 * table before requesting another dump. Do that with RTNL held,
>>> +	 * as FIB notification block is already registered.
>>> +	 */
>>> +	mlxsw_core_flush_owq();
>>> +	rtnl_lock();
>>> +	mlxsw_sp_router_fib_flush(mlxsw_sp);
>>> +	rtnl_unlock();
>>> +}
>>> +
>>>  int mlxsw_sp_router_init(struct mlxsw_sp *mlxsw_sp)
>>>  {
>>> +	fib_dump_cb_t *cb = mlxsw_sp_router_fib_dump_flush;
>>>  	int err;
>>>  
>>>  	INIT_LIST_HEAD(&mlxsw_sp->router.nexthop_neighs_list);
>>> @@ -2048,8 +2063,16 @@ int mlxsw_sp_router_init(struct mlxsw_sp *mlxsw_sp)
>>>  
>>>  	mlxsw_sp->fib_nb.notifier_call = mlxsw_sp_router_fib_event;
>>>  	register_fib_notifier(&mlxsw_sp->fib_nb);
>>
>> Sorry to pick in here again:
>>
>> There is a race here. You need to protect the registration of the fib
>> notifier as well by the sequence counter. Updates here are not ordered
>> in relation to this code below.
> 
> You mean updates that can be received after you registered the notifier
> and until the dump started? I'm aware of that and that's OK. This
> listener should be able to handle duplicates.

I am not concerned about duplicates, but about ordering deletes and
getting an add from the RCU code you will add the node to hw while it is
deleted in the software path. You probably will ignore the delete
because nothing is installed in hw and later add the node which was
actually deleted but just reordered which happend on another CPU, no?

> I've a follow up patchset that introduces a new event in switchdev
> notification chain called SWITCHDEV_SYNC, which is sent when port
> netdevs are enslaved / released  from a master device (points in time
> where kernel<->device can get out of sync). It will invoke
> re-propagation of configuration from different parts of the stack
> (e.g. bridge driver, 8021q driver, fib/neigh code), which can result
> in duplicates.

Okay, understood. I wonder how we can protect against accidentally abort
calls actually. E.g. if I start to inject routes into my routing domain
how can I make sure the box doesn't die after I try to insert enough
routes. Do we need to touch quagga etc?

Thanks,
Hannes

^ permalink raw reply

* [patch] net: renesas: ravb: unintialized return value
From: Dan Carpenter @ 2016-12-01 20:57 UTC (permalink / raw)
  To: Sergei Shtylyov, Johan Hovold
  Cc: David S. Miller, Yoshihiro Kaneko, Kazuya Mizuguchi, Simon Horman,
	Wolfram Sang, Andrew Lunn, Philippe Reynes, Niklas Söderlund,
	Arnd Bergmann, netdev, linux-renesas-soc, kernel-janitors

We want to set the other "err" variable here so that we can return it
later.  My version of GCC misses this issue but I caught it with a
static checker.

Fixes: 9f70eb339f52 ("net: ethernet: renesas: ravb: fix fixed-link phydev leaks")
Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
---
Applies to the net tree for 4.10.

diff --git a/drivers/net/ethernet/renesas/ravb_main.c b/drivers/net/ethernet/renesas/ravb_main.c
index 2c0357c..92d7692 100644
--- a/drivers/net/ethernet/renesas/ravb_main.c
+++ b/drivers/net/ethernet/renesas/ravb_main.c
@@ -1016,8 +1016,6 @@ static int ravb_phy_init(struct net_device *ndev)
 	 * at this time.
 	 */
 	if (priv->chip_id == RCAR_GEN3) {
-		int err;
-
 		err = phy_set_max_speed(phydev, SPEED_100);
 		if (err) {
 			netdev_err(ndev, "failed to limit PHY to 100Mbit/s\n");

^ permalink raw reply related

* Re: [patch net-next v3 11/12] mlxsw: spectrum_router: Request a dump of FIB tables during init
From: Ido Schimmel @ 2016-12-01 20:54 UTC (permalink / raw)
  To: Hannes Frederic Sowa
  Cc: David Miller, jiri, netdev, idosch, eladr, yotamg, nogahf,
	arkadis, ogerlitz, roopa, dsa, nikolay, andy, vivien.didelot,
	andrew, f.fainelli, alexander.h.duyck, kaber
In-Reply-To: <d70996c1-be4e-136b-6325-1a8c152e44ce@stressinduktion.org>

On Thu, Dec 01, 2016 at 09:40:48PM +0100, Hannes Frederic Sowa wrote:
> On 01.12.2016 21:04, David Miller wrote:
> > 
> > Hannes and Ido,
> > 
> > It looks like we are very close to having this in mergable shape, can
> > you guys work out this final issue and figure out if it really is
> > a merge stopped or not?
> 
> Sure, if the fib notification register could be done under protection of
> the sequence counter I don't see any more problems.

Did you maybe miss my reply yesterday? Because I was trying to
understand what "ordering" you're referring to, but didn't receive a
reply from you.

> The sync handler is nice to have and can be done in a later patch series.

Sync handler?

^ permalink raw reply

* Re: [PATCH net] tcp: warn on bogus MSS and try to amend it
From: marcelo.leitner @ 2016-12-01 20:46 UTC (permalink / raw)
  To: David Miller
  Cc: netdev, jmaxwell37, alexandre.sidorenko, kuznet, jmorris,
	yoshfuji, kaber, tlfalcon, brking, eric.dumazet
In-Reply-To: <20161201.152949.1953888486413180001.davem@davemloft.net>

On Thu, Dec 01, 2016 at 03:29:49PM -0500, David Miller wrote:
> From: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com>
> Date: Wed, 30 Nov 2016 11:14:32 -0200
> 
> > There have been some reports lately about TCP connection stalls caused
> > by NIC drivers that aren't setting gso_size on aggregated packets on rx
> > path. This causes TCP to assume that the MSS is actually the size of the
> > aggregated packet, which is invalid.
> > 
> > Although the proper fix is to be done at each driver, it's often hard
> > and cumbersome for one to debug, come to such root cause and report/fix
> > it.
> > 
> > This patch amends this situation in two ways. First, it adds a warning
> > on when this situation occurs, so it gives a hint to those trying to
> > debug this. It also limit the maximum probed MSS to the adverised MSS,
> > as it should never be any higher than that.
> > 
> > The result is that the connection may not have the best performance ever
> > but it shouldn't stall, and the admin will have a hint on what to look
> > for.
> > 
> > Tested with virtio by forcing gso_size to 0.
> > 
> > Cc: Jonathan Maxwell <jmaxwell37@gmail.com>
> > Signed-off-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com>
> 
> I totally agree with this change, however I think the warning message can
> be improved in two ways:
> 
> >  	len = skb_shinfo(skb)->gso_size ? : skb->len;
> >  	if (len >= icsk->icsk_ack.rcv_mss) {
> > -		icsk->icsk_ack.rcv_mss = len;
> > +		icsk->icsk_ack.rcv_mss = min_t(unsigned int, len,
> > +					       tcp_sk(sk)->advmss);
> > +		if (icsk->icsk_ack.rcv_mss != len)
> > +			pr_warn_once("Seems your NIC driver is doing bad RX acceleration. TCP performance may be compromised.\n");
> 
> We know it's a bad GRO implementation that causes this so let's be specific in the
> message, perhaps something like:
> 
> 	Driver has suspect GRO implementation, TCP performance may be compromised.

Okay.

> 
> Also, we have skb->dev available here most likely, so prefixing the message with
> skb->dev->name would make analyzing this situation even easier for someone hitting
> this.

Nice, yes.
And this skb is mostly non-forwardable as it's bigger than the MTU,
so if someone is using net namespaces and this skb would be routed
through some veth interfaces, it would give a false hint then, but
shouldn't happen. Unless it would fit (a larger) veth mtu, but still,
one probably will simplify things up to debug this.

> 
> I'm not certain if an skb->dev==NULL check is necessary here or not, but it is
> definitely something you need to consider.
> 
> Thanks!
> 

Will check. Thanks!

  Marcelo

^ permalink raw reply

* Re: [flamebait] xdp, well meaning but pointless
From: Hannes Frederic Sowa @ 2016-12-01 20:44 UTC (permalink / raw)
  To: Thomas Graf; +Cc: Florian Westphal, netdev
In-Reply-To: <20161201162814.GA31300@pox.localdomain>

Hello,

this is a good conversation and I simply want to bring my worries
across. I don't have good solutions for the problems XDP tries to solve
but I fear we could get caught up in maintenance problems in the long
term given the ideas floating around on how to evolve XDP currently.

On 01.12.2016 17:28, Thomas Graf wrote:
> On 12/01/16 at 04:52pm, Hannes Frederic Sowa wrote:
>> First of all, this is a rant targeted at XDP and not at eBPF as a whole.
>> XDP manipulates packets at free will and thus all security guarantees
>> are off as well as in any user space solution.
>>
>> Secondly user space provides policy, acl, more controlled memory
>> protection, restartability and better debugability. If I had multi
>> tenant workloads I would definitely put more complex "business/acl"
>> logic into user space, so I can make use of LSM and other features to
>> especially prevent a network facing service to attack the tenants. If
>> stuff gets put into the kernel you run user controlled code in the
>> kernel exposing a much bigger attack vector.
>>
>> What use case do you see in XDP specifically e.g. for container networking?
> 
> DDOS mitigation to protect distributed applications in large clusters.
> Relying on CDN works to protect API gateways and frontends (as long as
> they don't throw you out of their network) but offers no protection
> beyond that, e.g. a noisy/hostile neighbour. Doing this at the server
> level and allowing the mitigation capability to scale up with the number
> of servers is natural and cheap.

So far we e.g. always considered L2 attacks a problem of the network
admin to correctly protect the environment. Are you talking about
protecting the L3 data plane? Are there custom proprietary protocols in
place which need custom protocol parsers that need involvement of the
kernel before it could verify the packet?

In the past we tried to protect the L3 data plane as good as we can in
Linux to allow the plain old server admin to set an IP address on an
interface and install whatever software in user space. We try not only
to protect it but also try to achieve fairness by adding a lot of
counters everywhere. Are protections missing right now or are we talking
about better performance?

To provide fairness you often have to share validated data within the
kernel and with XDP. This requires consistent lookup methods for sockets
in the lower level. Those can be exported to XDP via external functions
and become part of uAPI which will limit our ability to change those
functions in future. When the discussion started about early demuxing in
XDP I became really nervous, because suddenly the XDP program has to
decide correctly which protocol type it has and look in the correct
socket table for the socket. Different semantics for sockets can apply
here, e.g. some sockets are RCU managed, some end up using reference
counts. A wrong decision here would cause havoc in the kernel (XDP
considers packet as UDP but kernel stack as TCP). Also, who knows that
we won't have per-cpu socket tables we would keep that as uAPI (this is
btw. the dragonflyBSD approach to scaling)? Imagine someone writing a
SIP rewriter in XDP and depending on a coherent view of all sockets even
if their hash doesn't fit to the one of the queue? Suddenly something
which was thought of as being only mutable by one CPU becomes global
again and because of XDP we need to add locking because of uAPI.

This discussion is parallel to the discussion about trace points, which
are not considered uAPI. If eBPF functions are not considered uAPI then
eBPF in the network stack will have much less value, because you
suddenly depend on specific kernel versions again and cannot simply load
the code into the kernel. The API checks will become very difficult to
implement, see also the ongoing MODVERSIONS discussions on LKML some
days back.

>>> I agree with you if the LB is a software based appliance in either a
>>> dedicated VM or on dedicated baremetal.
>>>
>>> The reality is turning out to be different in many cases though, LB
>>> needs to be performed not only for north south but east west as well.
>>> So even if I would handle LB for traffic entering my datacenter in user
>>> space, I will need the same LB for packets from my applications and
>>> I definitely don't want to move all of that into user space.
>>
>> The open question to me is why is programmability needed here.
>>
>> Look at the discussion about ECMP and consistent hashing. It is not very
>> easy to actually write this code correctly. Why can't we just put C code
>> into the kernel that implements this once and for all and let user space
>> update the policies?
> 
> Whatever LB logic is put in place with native C code now is unlikely the
> logic we need in two years. We can't really predict the future. If it
> was the case, networking would have been done long ago and we would all
> be working on self eating ice cream now.

Did LB algorithms on the networking layer change that much?

There is a long history of using consistent hashing for load balancing,
as e.g. is done in haproxy or F5.

>> Load balancers have to deal correctly with ICMP packets, e.g. they even
>> have to be duplicated to every ECMP route. This seems to be problematic
>> to do in eBPF programs due to looping constructs so you end up with
>> complicated user space anyway.
> 
> Feel free to implement such complex LBs in user space or natively. It is
> not required for the majority of use cases. The most popular LBs for
> application load balancing have no idea of ECMP and require ECMP aware
> routers to be made redundant itself.

They are already available and e.g. deployed as part of some kubernetes
stacks as I wrote above.

It is a generally available algorithm which fits a lot of use cases,
basically every website that wants to shard its sessions can make use of
it. Also it is independent of ECMP and mostly is implemented in load
balancers due to its need for a lot of memory.

New algorithms outdate old ones but the core principles will be the same
and don't require major changes to the interface, e.g. ipvs scheduler.

If we are talking about security features for early drop inside TCP
streams, like http, you need to have a proper stream reassembly engine.
Snort e.g. dropped a complete stream of TCP packets if you send a RST
with the same quadruple but a wrong sequence number. End system didn't
consider the RST but non synchronized solutions ended up not inspecting
this flow anymore. How do you handle diverting views on meta data in
networking protocols? Also look how hard it is to keep e.g. the fib
table synchronized to the hardware.

In retrospect, I think Tom Herbert's move putting ILA stateless
translation into the XDP hook wasn't that bad after all. ILA maybe
hopefully becomes a standard and its implementation is already in the
kernel so why keep its translator not part of the kernel, too?

TLDR; what I'm trying to argue is that evolution of the network stack is
problematic with a programmable backplane in the kernel which locks out
future modifications of the stack in some places. On the other side, if
we don't add those features we will have a half baked solution and
people will simply prefer netmap or DPDK.

Bye,
Hannes

^ permalink raw reply

* Re: [net PATCH 0/2] Don't use lco_csum to compute IPv4 checksum
From: David Miller @ 2016-12-01 20:41 UTC (permalink / raw)
  To: jeffrey.t.kirsher; +Cc: alexander.h.duyck, netdev, intel-wired-lan, sfr
In-Reply-To: <1480540522.2377.18.camel@intel.com>

From: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
Date: Wed, 30 Nov 2016 13:15:22 -0800

> On Wed, 2016-11-30 at 09:47 -0500, David Miller wrote:
>> From: Alexander Duyck <alexander.h.duyck@intel.com>
>> Date: Mon, 28 Nov 2016 10:42:18 -0500
>> 
>> > When I implemented the GSO partial support in the Intel drivers I was
>> using
>> > lco_csum to compute the checksum that we needed to plug into the IPv4
>> > checksum field in order to cancel out the data that was not a part of
>> the
>> > IPv4 header.  However this didn't take into account that the transport
>> > offset might be pointing to the inner transport header.
>> > 
>> > Instead of using lco_csum I have just coded around it so that we can
>> use
>> > the outer IP header plus the IP header length to determine where we
>> need to
>> > start our checksum and then just call csum_partial ourselves.
>> > 
>> > This should fix the SIT issue reported on igb interfaces as well as
>> simliar
>> > issues that would pop up on other Intel NICs.
>> 
>> Jeff, are you going to send me a pull request with this stuff or would
>> you be OK with my applying these directly to 'net'?
> 
> Go ahead and apply those to your net tree, I do not want to hold this up.

Ok, done, thanks Jeff.

^ permalink raw reply

* Re: [PATCH net-next 3/6] net: dsa: mv88e6xxx: add a software reset op
From: Vivien Didelot @ 2016-12-01 20:41 UTC (permalink / raw)
  To: Andrew Lunn
  Cc: netdev, linux-kernel, kernel, David S. Miller, Florian Fainelli
In-Reply-To: <20161130232633.GS21645@lunn.ch>

Hi Andrew,

Andrew Lunn <andrew@lunn.ch> writes:

>> diff --git a/drivers/net/dsa/mv88e6xxx/mv88e6xxx.h b/drivers/net/dsa/mv88e6xxx/mv88e6xxx.h
>> index ab52c37..9e51405 100644
>> --- a/drivers/net/dsa/mv88e6xxx/mv88e6xxx.h
>> +++ b/drivers/net/dsa/mv88e6xxx/mv88e6xxx.h
>> @@ -765,6 +765,9 @@ struct mv88e6xxx_ops {
>>  	int (*phy_write)(struct mv88e6xxx_chip *chip, int addr, int reg,
>>  			 u16 val);
>>  
>> +	/* Switch Software Reset */
>> +	int (*reset)(struct mv88e6xxx_chip *chip);
>> +
>
> Hi Vivien
>
> In my huge patch series of 6390, i've been using a g1_ prefix for
> functionality which is in global 1, g2_ for global 2, etc.  This has
> worked for everything so far with the exception of setting which
> reserved MAC addresses should be sent to the CPU. Most devices have it
> in g2, but 6390 has it in g1.
>
> Please could you add the prefix.

I don't understand. It looks like you are talking about the second part
of the comment I made on your RFC patchset, about the Rsvd2CPU feature:

https://www.mail-archive.com/netdev@vger.kernel.org/msg139837.html

Switch reset routines are implemented in this patch in global1.c as
mv88e6185_g1_reset and mv88e6352_g1_reset.

6185 and 6352 are implementation references for other switches.

Thanks,

        Vivien

^ permalink raw reply

* Re: [patch net-next v3 11/12] mlxsw: spectrum_router: Request a dump of FIB tables during init
From: Hannes Frederic Sowa @ 2016-12-01 20:40 UTC (permalink / raw)
  To: David Miller, idosch
  Cc: jiri, netdev, idosch, eladr, yotamg, nogahf, arkadis, ogerlitz,
	roopa, dsa, nikolay, andy, vivien.didelot, andrew, f.fainelli,
	alexander.h.duyck, kaber
In-Reply-To: <20161201.150445.558407356269727869.davem@davemloft.net>

On 01.12.2016 21:04, David Miller wrote:
> 
> Hannes and Ido,
> 
> It looks like we are very close to having this in mergable shape, can
> you guys work out this final issue and figure out if it really is
> a merge stopped or not?

Sure, if the fib notification register could be done under protection of
the sequence counter I don't see any more problems.

The sync handler is nice to have and can be done in a later patch series.

^ permalink raw reply

* Re: [PATCH net-next 0/3] sfc: defalconisation fixups
From: David Miller @ 2016-12-01 20:39 UTC (permalink / raw)
  To: ecree; +Cc: linux-net-drivers, bkenward, netdev
In-Reply-To: <c52a0276-e379-7841-8d10-d5a834b81c4e@solarflare.com>

From: Edward Cree <ecree@solarflare.com>
Date: Thu, 1 Dec 2016 16:59:13 +0000

> A bug fix, the Kconfig change, and cleaning up a bit more unused code.
> 
> Edward Cree (3):
>   sfc: fix debug message format string in efx_farch_handle_rx_not_ok
>   sfc: don't select SFC_FALCON
>   sfc: remove RESET_TYPE_RX_RECOVERY

Series applied, thank you.

^ permalink raw reply

* Re: iproute2 public git outdated?
From: Rami Rosen @ 2016-12-01 20:39 UTC (permalink / raw)
  To: Phil Sutter, Netdev, Stephen Hemminger
In-Reply-To: <20161201121806.GA21576@orbyte.nwl.cc>

Hi Phil,
I suggest that you will try again now, it seems that the iproute2 git
repo was updated in the last 2-4 hours, and "git log" in master shows
now a patch from 30 of November (actually it is your "Add notes about
dropped IPv4 route cache" patch)

Regards,
Rami Rosen


On 1 December 2016 at 14:18, Phil Sutter <phil@nwl.cc> wrote:
> Hi,
>
> I am using iproute2's public git repo at this URL:
>
> git://git.kernel.org/pub/scm/linux/kernel/git/shemminger/iproute2.git
>
> To my surprise, neither master nor net-next branches have received new
> commits since end of October. Did the repo location change or was it
> just not updated for a while?
>
> Thanks, Phil

^ permalink raw reply

* Re: Initial thoughts on TXDP
From: Tom Herbert @ 2016-12-01 20:39 UTC (permalink / raw)
  To: Sowmini Varadhan; +Cc: Linux Kernel Network Developers
In-Reply-To: <20161201201324.GJ24547@oracle.com>

On Thu, Dec 1, 2016 at 12:13 PM, Sowmini Varadhan
<sowmini.varadhan@oracle.com> wrote:
> On (12/01/16 11:05), Tom Herbert wrote:
>>
>> Polling does not necessarily imply that networking monopolizes the CPU
>> except when the CPU is otherwise idle. Presumably the application
>> drives the polling when it is ready to receive work.
>
> I'm not grokking that- "if the cpu is idle, we want to busy-poll
> and make it 0% idle"?  Keeping CPU 0% idle has all sorts
> of issues, see slide 20 of
>  http://www.slideshare.net/shemminger/dpdk-performance
>
>> > and one other critical difference from the hot-potato-forwarding
>> > model (the sort of OVS model that DPDK etc might aruguably be a fit for)
>> > does not apply: in order to figure out the ethernet and IP headers
>> > in the response correctly at all times (in the face of things like VRRP,
>> > gw changes, gw's mac addr changes etc) the application should really
>> > be listening on NETLINK sockets for modifications to the networking
>> > state - again points to needing a select() socket set where you can
>> > have both the I/O fds and the netlink socket,
>> >
>> I would think that that is management would not be implemented in a
>> fast path processing thread for an application.
>
> sure, but my point was that *XDP and other stack-bypass methods needs
> to provide a select()able socket: when your use-case is not about just
> networking, you have to snoop on changes to the control plane, and update
> your data path. In the OVS case (pure networking) the OVS control plane
> updates are intrinsic to OVS. For the rest of the request/response world,
> we need a select()able socket set to do this elegantly (not really
> possible in DPDK, for example)
>
I'm not sure that TXDP can be reconciled to help OVS. The point of
TXDP is to drive applications closer to bare metal performance, as I
mentioned this is only going to be worth it if the fast path can be
kept simple and not complicated by a requirement for generalization.
It seems like the second we put OVS in we're doubling the data path
and accepting the performance consequences of a complex path anyway.

TXDP can't over the whole system (any more than DPDK can) and needs to
work in concert with other mechanisms-- the key is how to steer the
work amongst the CPUs. For instance, if a latency critical thread is
running on some CPU we either a dedicated queue for the connections of
the thread (e.g. ntuple filtering or aRFS support) or we need a fast
way to get move unrelated packets received on a queue processed by
that CPU to other CPUs (less efficient, but no special HW support is
needed either).

Tom

>
>> The *SOs are always an interesting question. They make for great
>> benchmarks, but in real life the amount of benefit is somewhat
>> unclear. Under the wrong conditions, like all cwnds have collapsed or
>
> I think Rick's already bringing up this one.
>
> --Sowmini
>

^ permalink raw reply

* pull-request: can-next 2016-12-01
From: Marc Kleine-Budde @ 2016-12-01 20:21 UTC (permalink / raw)
  To: netdev; +Cc: David Miller, kernel@pengutronix.de, linux-can@vger.kernel.org


[-- Attachment #1.1: Type: text/plain, Size: 1907 bytes --]

Hello David,

this is a pull request of 4 patches for net-next/master.

There are two patches by Chris Paterson for the rcar_can and rcar_canfd
device tree binding documentation. And a patch by Geert Uytterhoeven
that corrects the order of interrupt specifiers.

The fourth patch by Colin Ian King fixes a spelling error in the
kvaser_usb driver.


regards,
Marc

---
The following changes since commit 8f679ed88f8860206edddff725e2749b4cdbb0e8:

  driver: ipvlan: Remove useless member mtu_adj of struct ipvl_dev (2016-11-30 15:01:32 -0500)

are available in the git repository at:

  git://git.kernel.org/pub/scm/linux/kernel/git/mkl/linux-can-next.git tags/linux-can-next-for-4.10-20161201

for you to fetch changes up to 0d8f8efd32bace9f222fcc92d4a3132d877e5df6:

  net: can: usb: kvaser_usb: fix spelling mistake of "outstanding" (2016-12-01 14:27:02 +0100)

----------------------------------------------------------------
linux-can-next-for-4.10-20161201

----------------------------------------------------------------
Chris Paterson (2):
      can: rcar_can: Add r8a7796 support
      can: rcar_canfd: Add r8a7796 support

Colin Ian King (1):
      net: can: usb: kvaser_usb: fix spelling mistake of "outstanding"

Geert Uytterhoeven (1):
      can: rcar_canfd: Correct order of interrupt specifiers

 Documentation/devicetree/bindings/net/can/rcar_can.txt   | 12 +++++++-----
 Documentation/devicetree/bindings/net/can/rcar_canfd.txt | 14 ++++++++------
 drivers/net/can/usb/kvaser_usb.c                         |  4 ++--
 3 files changed, 17 insertions(+), 13 deletions(-)

-- 
Pengutronix e.K.                  | Marc Kleine-Budde           |
Industrial Linux Solutions        | Phone: +49-231-2826-924     |
Vertretung West/Dortmund          | Fax:   +49-5121-206917-5555 |
Amtsgericht Hildesheim, HRA 2686  | http://www.pengutronix.de   |


[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 488 bytes --]

^ permalink raw reply

* Re: [PATCH v3 net-next 3/3] openvswitch: Fix skb->protocol for vlan frames.
From: Pravin Shelar @ 2016-12-01 20:31 UTC (permalink / raw)
  To: Jiri Benc; +Cc: Jarno Rajahalme, Linux Kernel Network Developers, Eric Garver
In-Reply-To: <20161130153041.7a9590ef@griffin>

On Wed, Nov 30, 2016 at 6:30 AM, Jiri Benc <jbenc@redhat.com> wrote:
> On Tue, 29 Nov 2016 15:30:53 -0800, Jarno Rajahalme wrote:
>> Do not always set skb->protocol to be the ethertype of the L3 header.
>> For a packet with non-accelerated VLAN tags skb->protocol needs to be
>> the ethertype of the outermost non-accelerated VLAN ethertype.
>
> Well, the current handling of skb->protocol matches what used to be the
> handling of the kernel net stack before Jiri Pirko cleaned up the vlan
> code.
>
> I'm not opposed to changing this but I'm afraid it needs much deeper
> review. Because with this in place, no core kernel functions that
> depend on skb->protocol may be called from within openvswitch.
>
Can you give specific example where it does not work?

>> @@ -361,6 +362,11 @@ static int parse_vlan(struct sk_buff *skb, struct sw_flow_key *key)
>>       if (res <= 0)
>>               return res;
>>
>> +     /* If the outer vlan tag was accelerated, skb->protocol should
>> +      * refelect the inner vlan type. */
>> +     if (!eth_type_vlan(skb->protocol))
>> +             skb->protocol = key->eth.cvlan.tpid;
>
> This should not depend on the current value in skb->protocol which
> could be arbitrary at this point (from the point of view of how this
> patch understands the skb->protocol values). It's easy to fix, though -
> just add a local bool variable tracking whether the skb->protocol has
> been set.
>
skb-protocol value is set by the caller, so it should not be
arbitrary. is it missing in any case?

^ permalink raw reply


This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox