Netdev List


* Re: [PATCH net 2/2] sfc: limit ARFS workitems in flight per channel
From: David Miller @ 2018-04-13 16:14 UTC
  To: ecree; +Cc: linux-net-drivers, netdev
In-Reply-To: <d56e5a40-48cc-d984-c5cd-e18fbc411ed3@solarflare.com>

From: Edward Cree <ecree@solarflare.com>
Date: Fri, 13 Apr 2018 16:59:07 +0100

> On 13/04/18 16:03, David Miller wrote:
>> Whilst you may not be able to program the filter into the hardware
>> synchronously, you should be able to allocate the ID and get all of
>> the software state setup.
> That's what we were doing before commit 3af0f34290f6 ("sfc: replace
>  asynchronous filter operations"), and (as mentioned in that commit)
>  that leads to (or at least the scheme we used had) race conditions
>  which I could not see a way to fix.  If the hardware resets (and
>  thus forgets all its filters) after we've done the software state
>  setup but before we reach the point of finalising the software state
>  after the hardware operation, we don't know what operations we need
>  to do to re-apply the software state to the hardware, because we
>  don't know whether the reset happened before or after the hardware
>  operation.

When an entry is successfully programmed into the chip, you update
the software state.

When the chip resets, you clear all of those state booleans to false.

Indeed, you would have to synchronize these things somehow.

Is the issue that you learn about the hardware reset asynchronously,
and therefore cannot determine if filter insertion programming
happened afterwards and thus is still in the chip?

You must have a table of all the entries, so that you can reprogram
the hardware should it reset.  Or do you not handle things that way
and it's a lossy system?
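
The scheme described above (software state set up synchronously, a per-entry flag set only once the chip confirms, all flags cleared on reset) can be sketched roughly as follows. This is a minimal illustration and not the sfc driver's code: the names filter_table, reset_gen and the helper functions are hypothetical, all paths are assumed to run under one lock, and re-programming a filter that is already in the chip is assumed to be harmless.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define FILTER_SLOTS 8          /* hypothetical table size */

struct filter_entry {
	bool used;              /* software state: slot is allocated */
	bool in_hw;             /* true only once hardware confirmed the insert */
};

struct filter_table {
	unsigned int reset_gen;                 /* bumped on every hardware reset */
	struct filter_entry slot[FILTER_SLOTS];
};

/* Synchronous part: allocate an ID and set up software state only. */
static int filter_alloc(struct filter_table *t)
{
	for (int i = 0; i < FILTER_SLOTS; i++) {
		if (!t->slot[i].used) {
			t->slot[i].used = true;
			t->slot[i].in_hw = false;
			return i;
		}
	}
	return -1;                              /* table full */
}

/* Reset path: the chip forgot everything, so clear every in_hw flag. */
static void filter_hw_reset(struct filter_table *t)
{
	t->reset_gen++;
	for (int i = 0; i < FILTER_SLOTS; i++)
		t->slot[i].in_hw = false;
}

/*
 * Completion path: gen_at_start is reset_gen as sampled before the hardware
 * operation was issued.  If it changed, a reset raced with the insert and we
 * cannot know whether the filter is in the chip, so leave in_hw false.
 */
static bool filter_hw_done(struct filter_table *t, int id,
			   unsigned int gen_at_start)
{
	assert(id >= 0 && id < FILTER_SLOTS && t->slot[id].used);
	if (gen_at_start != t->reset_gen)
		return false;
	t->slot[id].in_hw = true;
	return true;
}

/* Restore path: count the entries that still need (re)programming. */
static int filter_pending_restore(const struct filter_table *t)
{
	int n = 0;

	for (int i = 0; i < FILTER_SLOTS; i++)
		if (t->slot[i].used && !t->slot[i].in_hw)
			n++;
	return n;
}
```

The generation counter is one way to "synchronize these things somehow"; it turns the ambiguous case (reset before or after the hardware operation?) into a conservative "not in hardware" answer.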

> Well, the alternative, even if the software state setup part _could_
>  be made synchronous, is to allow a potentially unbounded queue for
>  the hardware update part (I think there are even still cases in
>  which the exponential growth pathology is possible), causing the
>  filter insertions to be delayed an arbitrarily long time.  Either
>  the flow is still going by that time (in which case the backlog
>  limit approach will get a new ndo_rx_flow_steer request and insert
>  the filter too) or it isn't, in which case getting round to it
>  eventually is no better than dropping it immediately.  In fact it's
>  worse because now you waste time inserting a useless filter which
>  delays new requests even more.
> Besides, I'm fairly confident that the only cases in which you'll
>  even come close to hitting the limit are ones where ARFS wouldn't
>  do you much good anyway, such as:
> * Misconfigured interrupt affinities where ARFS is entirely pointless
> * Many short-lived flows (which greatly diminish the utility of ARFS)
> 
> So for multiple reasons, hitting the limit won't actually make
>  performance worse, although it will often be a sign that performance
>  will be bad for other reasons.

Understood, thanks for explaining.

Please respin your series with the updates you talked about and I'll
apply it.

But generally we do have this issue with various kinds of
configuration programming and async vs. sync.
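
The backlog-limit approach argued for in the quoted text above amounts to refusing new work once a per-channel cap is reached, rather than queueing without bound. A minimal single-threaded sketch, assuming a hypothetical limit MAX_INFLIGHT and hypothetical helper names (a real driver would need atomic counters):

```c
#include <assert.h>
#include <stdbool.h>

#define MAX_INFLIGHT 4   /* hypothetical per-channel limit */

struct channel {
	unsigned int inflight;   /* work items queued but not yet completed */
	unsigned int dropped;    /* requests refused because the limit was hit */
};

/* Steering request path: refuse rather than queue unboundedly. */
static bool arfs_try_queue(struct channel *ch)
{
	if (ch->inflight >= MAX_INFLIGHT) {
		ch->dropped++;   /* a still-live flow will trigger a new request later */
		return false;
	}
	ch->inflight++;
	return true;
}

/* Called when a work item finishes, whether or not the insert succeeded. */
static void arfs_work_done(struct channel *ch)
{
	assert(ch->inflight > 0);
	ch->inflight--;
}
```

Dropping at the cap is cheap because, as argued above, a flow that is still active will generate another ndo_rx_flow_steer request once the backlog drains.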


* Re: [PATCH net 1/3] l2tp: hold reference on tunnels in netlink dumps
From: David Miller @ 2018-04-13 16:15 UTC
  To: g.nault; +Cc: netdev, jchapman
In-Reply-To: <20180413160912.GA1405@alphalink.fr>

From: Guillaume Nault <g.nault@alphalink.fr>
Date: Fri, 13 Apr 2018 18:09:12 +0200

> On Fri, Apr 13, 2018 at 10:57:03AM -0400, David Miller wrote:
>> From: Guillaume Nault <g.nault@alphalink.fr>
>> Date: Thu, 12 Apr 2018 20:50:33 +0200
>> 
>> > l2tp_tunnel_find_nth() is unsafe: no reference is held on the returned
>> > tunnel, therefore it can be freed whenever the caller uses it.
>> > This patch defines l2tp_tunnel_get_nth() which works similarly, but
>> > also takes a reference on the returned tunnel. The caller then has to
>> > drop it after it stops using the tunnel.
>> > 
>> > Convert netlink dumps to make them safe against concurrent tunnel
>> > deletion.
>> > 
>> > Fixes: 309795f4bec2 ("l2tp: Add netlink control API for L2TP")
>> > Signed-off-by: Guillaume Nault <g.nault@alphalink.fr>
>> 
>> During the entire invocation of l2tp_nl_cmd_tunnel_dump(), the RTNL
>> mutex is held.
>> 
>> Therefore no tunnel configuration changes may occur and the tunnel
>> object will persist and is safe to access.
>> 
> Yes, but only for updates done with the genl API. For L2TPv2, the
> tunnel can be created by connecting a PPPOL2TP and a UDP socket.
> Closing these sockets destroys the tunnel without any RTNL
> synchronisation.

Right, that's the part I missed.  Thanks for explaining.
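
The difference between a "find" that returns a bare pointer and a "get" that also takes a reference can be sketched as below. This is a userspace illustration only: the struct and helpers are hypothetical stand-ins, not the l2tp code, and in the kernel the list walk and the reference grab would have to happen under RCU or a lock.

```c
#include <assert.h>
#include <stddef.h>

struct tunnel {
	int refcnt;              /* the list itself holds one reference */
	struct tunnel *next;
};

/* Return the nth tunnel with an extra reference taken, or NULL. */
static struct tunnel *tunnel_get_nth(struct tunnel *head, int nth)
{
	for (struct tunnel *t = head; t; t = t->next) {
		if (nth-- == 0) {
			t->refcnt++;
			return t;
		}
	}
	return NULL;
}

/* Drop a reference; returns 1 when the last one is gone and the tunnel may be freed. */
static int tunnel_put(struct tunnel *t)
{
	assert(t->refcnt > 0);
	return --t->refcnt == 0;
}
```

With the extra reference held, a socket close that drops the list's reference during a dump no longer frees the tunnel under the dumper; the object goes away only when the dumper calls the put helper.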


* Re: [PATCH net 2/2] sfc: limit ARFS workitems in flight per channel
From: Edward Cree @ 2018-04-13 16:24 UTC
  To: David Miller; +Cc: linux-net-drivers, netdev
In-Reply-To: <20180413.121415.787716401343641542.davem@davemloft.net>

On 13/04/18 17:14, David Miller wrote:
> Is the issue that you learn about the hardware reset asynchronously,
> and therefore cannot determine if filter insertion programming
> happened afterwards and thus is still in the chip?
Yes, pretty much.

> You must have a table of all the entries, so that you can reprogram
> the hardware should it reset.
Yes, we do have such a table; 'reprogram the hardware' happens in
 efx_ef10_filter_table_restore().

> Understood, thanks for explaining.
>
> Please respin your series with the updates you talked about and I'll
> apply it.
Will do, thanks.

-Ed


* Re: tcp hang when socket fills up ?
From: Michal Kubecek @ 2018-04-13 16:32 UTC
  To: netdev; +Cc: Dominique Martinet
In-Reply-To: <20180406090720.GA31845@nautica>

On Fri, Apr 06, 2018 at 11:07:20AM +0200, Dominique Martinet wrote:
> 16:49:26.735042 IP <server local ip>.13317 > <client public ip>.31872: Flags [.], seq 70476:71850, ack 4190, win 307, options [nop,nop,TS val 1313937641 ecr 1617129473], length 1374
> 16:49:26.735046 IP <server local ip>.13317 > <client public ip>.31872: Flags [.], seq 71850:73224, ack 4190, win 307, options [nop,nop,TS val 1313937641 ecr 1617129473], length 1374
> 16:49:26.735334 IP <client public ip>.31872 > <server local ip>.13317: Flags [.], ack 41622, win 918, options [nop,nop,TS val 1617129478 ecr 1313937609], length 0
> 16:49:26.736005 IP <client public ip>.31872 > <server local ip>.13317: Flags [.], ack 42996, win 940, options [nop,nop,TS val 1617129478 ecr 1313937609], length 0
> 16:49:26.736402 IP <server local ip>.13317 > <client public ip>.31872: Flags [.], seq 73224:74598, ack 4190, win 307, options [nop,nop,TS val 1313937643 ecr 1617129473], length 1374
> 16:49:26.736408 IP <server local ip>.13317 > <client public ip>.31872: Flags [.], seq 74598:75972, ack 4190, win 307, options [nop,nop,TS val 1313937643 ecr 1617129473], length 1374
> 16:49:26.738561 IP <client public ip>.31872 > <server local ip>.13317: Flags [.], ack 44370, win 963, options [nop,nop,TS val 1617129482 ecr 1313937616], length 0
> 16:49:26.739539 IP <client public ip>.31872 > <server local ip>.13317: Flags [.], ack 45744, win 986, options [nop,nop,TS val 1617129482 ecr 1313937616], length 0
> 16:49:26.739882 IP <client public ip>.31872 > <server local ip>.13317: Flags [.], ack 47118, win 1008, options [nop,nop,TS val 1617129484 ecr 1313937617], length 0
> 16:49:26.740255 IP <client public ip>.31872 > <server local ip>.13317: Flags [.], ack 48492, win 1031, options [nop,nop,TS val 1617129484 ecr 1313937617], length 0
> 16:49:26.746756 IP <client public ip>.31872 > <server local ip>.13317: Flags [.], ack 49866, win 1053, options [nop,nop,TS val 1617129493 ecr 1313937627], length 0
> 16:49:26.747923 IP <client public ip>.31872 > <server local ip>.13317: Flags [.], ack 51240, win 1076, options [nop,nop,TS val 1617129494 ecr 1313937627], length 0
> 16:49:26.749083 IP <client public ip>.31872 > <server local ip>.13317: Flags [.], ack 52614, win 1099, options [nop,nop,TS val 1617129495 ecr 1313937629], length 0
> 16:49:26.750171 IP <client public ip>.31872 > <server local ip>.13317: Flags [.], ack 53988, win 1121, options [nop,nop,TS val 1617129496 ecr 1313937629], length 0
> 16:49:26.750808 IP <client public ip>.31872 > <server local ip>.13317: Flags [.], ack 55362, win 1144, options [nop,nop,TS val 1617129497 ecr 1313937629], length 0
> 16:49:26.754648 IP <client public ip>.31872 > <server local ip>.13317: Flags [.], ack 56736, win 1167, options [nop,nop,TS val 1617129500 ecr 1313937629], length 0
> 16:49:26.755985 IP <client public ip>.31872 > <server local ip>.13317: Flags [.], ack 58110, win 1189, options [nop,nop,TS val 1617129501 ecr 1313937630], length 0
> 16:49:26.758513 IP <client public ip>.31872 > <server local ip>.13317: Flags [.], ack 59484, win 1212, options [nop,nop,TS val 1617129502 ecr 1313937630], length 0
> 16:49:26.759096 IP <client public ip>.31872 > <server local ip>.13317: Flags [.], ack 60858, win 1234, options [nop,nop,TS val 1617129503 ecr 1313937635], length 0
> 16:49:26.759421 IP <client public ip>.31872 > <server local ip>.13317: Flags [.], ack 62232, win 1257, options [nop,nop,TS val 1617129503 ecr 1313937635], length 0
> 16:49:26.759755 IP <client public ip>.31872 > <server local ip>.13317: Flags [.], ack 63606, win 1280, options [nop,nop,TS val 1617129504 ecr 1313937636], length 0
> 16:49:26.760653 IP <client public ip>.31872 > <server local ip>.13317: Flags [.], ack 64980, win 1302, options [nop,nop,TS val 1617129505 ecr 1313937636], length 0
> 16:49:26.761453 IP <client public ip>.31872 > <server local ip>.13317: Flags [.], ack 66354, win 1325, options [nop,nop,TS val 1617129506 ecr 1313937638], length 0
> 16:49:26.762199 IP <client public ip>.31872 > <server local ip>.13317: Flags [.], ack 67728, win 1348, options [nop,nop,TS val 1617129507 ecr 1313937638], length 0
> 16:49:26.763547 IP <client public ip>.31872 > <server local ip>.13317: Flags [P.], seq 4190:4226, ack 67728, win 1348, options [nop,nop,TS val 1617129507 ecr 1313937638], length 36
> 16:49:26.763553 IP <client public ip>.31872 > <server local ip>.13317: Flags [.], ack 70476, win 1393, options [nop,nop,TS val 1617129508 ecr 1313937639], length 0
> 16:49:26.764298 IP <client public ip>.31872 > <server local ip>.13317: Flags [.], ack 73224, win 1438, options [nop,nop,TS val 1617129509 ecr 1313937641], length 0
> 16:49:26.764676 IP <client public ip>.31872 > <server local ip>.13317: Flags [.], ack 75972, win 1444, options [nop,nop,TS val 1617129510 ecr 1313937643], length 0
> 16:49:26.807754 IP <server local ip>.13317 > <client public ip>.31872: Flags [.], seq 75972:77346, ack 4190, win 307, options [nop,nop,TS val 1313937714 ecr 1617129473], length 1374
> 16:49:26.876467 IP <client public ip>.31872 > <server local ip>.13317: Flags [.], ack 77346, win 1444, options [nop,nop,TS val 1617129620 ecr 1313937714], length 0
> 16:49:27.048760 IP <server local ip>.13317 > <client public ip>.31872: Flags [.], seq 32004:33378, ack 4190, win 307, options [nop,nop,TS val 1313937955 ecr 1617129473], length 1374
> 16:49:27.051791 IP <client public ip>.31872 > <server local ip>.13317: Flags [P.], seq 4190:4226, ack 77346, win 1444, options [nop,nop,TS val 1617129762 ecr 1313937714], length 36
> 16:49:27.076444 IP <client public ip>.31872 > <server local ip>.13317: Flags [.], ack 77346, win 1444, options [nop,nop,TS val 1617129822 ecr 1313937714,nop,nop,sack 1 {32004:33378}], length 0
> 16:49:27.371182 IP <client public ip>.31872 > <server local ip>.13317: Flags [P.], seq 4190:4226, ack 77346, win 1444, options [nop,nop,TS val 1617130018 ecr 1313937714], length 36
> 16:49:27.519862 IP <server local ip>.13317 > <client public ip>.31872: Flags [.], seq 32004:33378, ack 4190, win 307, options [nop,nop,TS val 1313938426 ecr 1617129473], length 1374
> 16:49:27.547662 IP <client public ip>.31872 > <server local ip>.13317: Flags [.], ack 77346, win 1444, options [nop,nop,TS val 1617130293 ecr 1313937714,nop,nop,sack 1 {32004:33378}], length 0
> 16:49:27.883372 IP <client public ip>.31872 > <server local ip>.13317: Flags [P.], seq 4190:4226, ack 77346, win 1444, options [nop,nop,TS val 1617130530 ecr 1313937714], length 36
> 16:49:28.511861 IP <server local ip>.13317 > <client public ip>.31872: Flags [.], seq 32004:33378, ack 4190, win 307, options [nop,nop,TS val 1313939418 ecr 1617129473], length 1374
> 16:49:28.538891 IP <client public ip>.31872 > <server local ip>.13317: Flags [.], ack 77346, win 1444, options [nop,nop,TS val 1617131285 ecr 1313937714,nop,nop,sack 1 {32004:33378}], length 0
> 16:49:28.907197 IP <client public ip>.31872 > <server local ip>.13317: Flags [P.], seq 4190:4226, ack 77346, win 1444, options [nop,nop,TS val 1617131554 ecr 1313937714], length 36
> 16:49:30.431864 IP <server local ip>.13317 > <client public ip>.31872: Flags [.], seq 32004:33378, ack 4190, win 307, options [nop,nop,TS val 1313941338 ecr 1617129473], length 1374
> 16:49:30.459127 IP <client public ip>.31872 > <server local ip>.13317: Flags [.], ack 77346, win 1444, options [nop,nop,TS val 1617133204 ecr 1313937714,nop,nop,sack 1 {32004:33378}], length 0
> 16:49:30.955388 IP <client public ip>.31872 > <server local ip>.13317: Flags [P.], seq 4190:4226, ack 77346, win 1444, options [nop,nop,TS val 1617133602 ecr 1313937714], length 36
> 16:49:34.207879 IP <server local ip>.13317 > <client public ip>.31872: Flags [.], seq 32004:33378, ack 4190, win 307, options [nop,nop,TS val 1313945114 ecr 1617129473], length 1374
> 16:49:34.235726 IP <client public ip>.31872 > <server local ip>.13317: Flags [.], ack 77346, win 1444, options [nop,nop,TS val 1617136981 ecr 1313937714,nop,nop,sack 1 {32004:33378}], length 0
> 16:49:35.256285 IP <client public ip>.31872 > <server local ip>.13317: Flags [P.], seq 4190:4226, ack 77346, win 1444, options [nop,nop,TS val 1617137954 ecr 1313937714], length 36
> 16:49:42.143864 IP <server local ip>.13317 > <client public ip>.31872: Flags [.], seq 32004:33378, ack 4190, win 307, options [nop,nop,TS val 1313953050 ecr 1617129473], length 1374
> 16:49:42.171531 IP <client public ip>.31872 > <server local ip>.13317: Flags [.], ack 77346, win 1444, options [nop,nop,TS val 1617144917 ecr 1313937714,nop,nop,sack 1 {32004:33378}], length 0
> 16:49:43.448262 IP <client public ip>.31872 > <server local ip>.13317: Flags [P.], seq 4190:4226, ack 77346, win 1444, options [nop,nop,TS val 1617146146 ecr 1313937714], length 36

The way I read this, server doesn't see anything sent by client since
some point shortly before the dump shown here starts (about 5ms). It
keeps sending data until 16:49:26.807754 (seq 77346) and then keeps
resending first (from its point of view) unacknowledged segment
(32004:33378) in exponentially growing intervals and ignores replies
from the client. Client apparently receives these retransmits and
replies with dupack (with D-SACK for 32004:33378) and retransmits of its
own first unacknowledged segment (4190:4226).
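
The "exponentially growing intervals" can be checked against the capture timestamps of the server's retransmits of 32004:33378. A small sketch, with a hypothetical helper name and an arbitrary tolerance of 1.5x to 2.5x per step:

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* True if every gap between consecutive timestamps (in microseconds) is
 * between 1.5x and 2.5x the previous gap, i.e. roughly doubling. */
static bool backoff_doubles(const long long *ts_us, size_t n)
{
	assert(n >= 3);
	for (size_t i = 2; i < n; i++) {
		long long prev = ts_us[i - 1] - ts_us[i - 2];
		long long cur = ts_us[i] - ts_us[i - 1];

		if (2 * cur < 3 * prev || 2 * cur > 5 * prev)
			return false;
	}
	return true;
}
```

Fed the six retransmit times from the dump (16:49:27.048760 through 16:49:42.143864, expressed as microseconds past 16:49:00), the gaps are about 0.47 s, 0.99 s, 1.92 s, 3.78 s and 7.94 s, which is consistent with a doubling retransmission timer.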

As we can see the client packets in the dump (which was taken on
server), it would mean they are dropped after the point where packet
socket would pass them to libpcap. That might be e.g. netfilter
(conntrack?) or the IP/TCP code detecting them to be invalid for some
reason (which is not obvious to me from the dump above).

There are two strange points:

1. While client apparently responds to all server retransmits, it does
so with TSecr=1313937714 (matching server packet from 16:49:26.807754)
rather than TSval of the packets it dupacks (1313937955 through
1313953050). This doesn't seem to follow the rules of RFC 7323
Section 4.3.

2. Window size values in acks from client grow with each acked packet by
22-23 (which might be ~1400 with scaling factor of 64). I would rather
expect advertised receive window to go down by 1374 with each received
segment and to grow by bigger steps with each read()/recv() call from
application.
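
The arithmetic behind point 2 can be made explicit. The SYN is not part of the capture, so the window scale shift of 6 (factor 64) is only the guess stated above; the helper name is hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Effective receive window: the 16-bit header field shifted left by the
 * window scale negotiated in the SYN (RFC 7323). */
static uint32_t scaled_window(uint16_t raw, unsigned int shift)
{
	assert(shift <= 14);            /* RFC 7323 caps the shift count at 14 */
	return (uint32_t)raw << shift;
}
```

Under that assumption, a step of 22 in the advertised value is 1408 bytes and a step of 23 is 1472 bytes, so each ack grows the window by slightly more than one 1374-byte segment.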

We might get more insight if we saw the same connection on both sides.
From what was presented here, my guess is that

  (1) received packets are dropped somewhere on server side (after they
      are cloned for the packet socket)
  (2) there is something wrong either on client side or between the two
      hosts (there is at least a NAT, IIUC)

Michal Kubecek


* [PATCH iproute2] utils: Do not reset family for default, any, all addresses
From: David Ahern @ 2018-04-13 16:36 UTC
  To: stephen; +Cc: netdev, whissi, David Ahern, Serhey Popovych

Thomas reported a change in behavior with respect to autodetecting
address families. Specifically, 'ip ro add default via fe80::1'
syntax was failing to treat fe80::1 as an IPv6 address as it did in
prior releases. The root cause appears to be a change in family when
the default keyword is parsed.

'default', 'any' and 'all' are relevant outside of AF_INET. Leave the
family arg as is for these when setting addr.

Fixes: 93fa12418dc6 ("utils: Always specify family and ->bytelen in get_prefix_1()")
Reported-by: Thomas Deutschmann <whissi@gentoo.org>
Signed-off-by: David Ahern <dsahern@gmail.com>
Cc: Serhey Popovych <serhe.popovych@gmail.com>
---
 lib/utils.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/lib/utils.c b/lib/utils.c
index 60d7eb14b438..8a0bff0babeb 100644
--- a/lib/utils.c
+++ b/lib/utils.c
@@ -568,7 +568,7 @@ static int __get_addr_1(inet_prefix *addr, const char *name, int family)
 	if (strcmp(name, "default") == 0) {
 		if ((family == AF_DECnet) || (family == AF_MPLS))
 			return -1;
-		addr->family = (family != AF_UNSPEC) ? family : AF_INET;
+		addr->family = family;
 		addr->bytelen = af_byte_len(addr->family);
 		addr->bitlen = -2;
 		addr->flags |= PREFIXLEN_SPECIFIED;
@@ -579,7 +579,7 @@ static int __get_addr_1(inet_prefix *addr, const char *name, int family)
 	    strcmp(name, "any") == 0) {
 		if ((family == AF_DECnet) || (family == AF_MPLS))
 			return -1;
-		addr->family = AF_UNSPEC;
+		addr->family = family;
 		addr->bytelen = 0;
 		addr->bitlen = -2;
 		return 0;
-- 
2.11.0


* Re: SRIOV switchdev mode BoF minutes
From: Samudrala, Sridhar @ 2018-04-13 16:49 UTC
  To: Or Gerlitz
  Cc: David Miller, Anjali Singhai Jain, Andy Gospodarek, Michael Chan,
	Simon Horman, Jakub Kicinski, John Fastabend, Saeed Mahameed,
	Jiri Pirko, Rony Efraim, Linux Netdev List
In-Reply-To: <CAJ3xEMjqj5ENO8NRbcYFFCcS4NGwrNbYo8uydHEYMyXHjCBqhw@mail.gmail.com>

On 4/13/2018 1:57 AM, Or Gerlitz wrote:
> On Fri, Apr 13, 2018 at 11:56 AM, Or Gerlitz <gerlitz.or@gmail.com> wrote:
>> On Thu, Apr 12, 2018 at 11:33 PM, Samudrala, Sridhar
>> <sridhar.samudrala@intel.com> wrote:
>>> On 4/12/2018 1:20 PM, Or Gerlitz wrote:
>>>> On Thu, Apr 12, 2018 at 8:05 PM, Samudrala, Sridhar
>>>> <sridhar.samudrala@intel.com> wrote:
>>>>> On 11/12/2017 11:49 AM, Or Gerlitz wrote:
>>>>>> Hi Dave and all,
>>>>>>
>>>>>> During and after the BoF on SRIOV switchdev mode, we came into a
>>>>>> consensus among the developers from four different HW vendors (CC
>>>>>> audience) that a correct thing to do would be to disallow any new
>>>>>> extensions to the legacy mode.
>>>>>>
>>>>>> The idea is to put focus on the new mode and not add new UAPIs and
>>>>>> kernel code for what turned out to be a wrong design, one that does not
>>>>>> allow for properly offloading a kernel switching SW model to e-switch HW.
>>>>>>
>>>>>> We also had a good session the day after regarding alignment for the
>>>>>> representation model of the uplink (physical port) and PF/s.
>>>>>>
>>>>>> The VF representor netdevs  exist for all drivers that support the new
>>>>>> mode but the representation for the uplink and PF wasn't the same for
>>>>>> all. The decision was to represent the uplink and PFs vports in the
>>>>>> same manner done for VFs, using rep netdevs. This alignment would
>>>>>> provide a more strict and clear view of the kernel model for e-switch
>>>>>> to users and upper layer control plane SW.
>>>>>>
>>>>> I don't see any changes in the Mellanox/other drivers to move to this new
>>>>> model to enable the uplink and PF port representors, any updates?
>>>> Yeah, I worked on that but didn't get to finalize the upstreaming
>>>> so far.  I have resumed the work and plan for the uplink rep in mlx5 to
>>>> replace the PF acting as uplink rep for 4.18.
>>>>
>>>>> It would be really nice to highlight the pros and cons of the old versus
>>>>> the
>>>>> new model.
>>>>>
>>>>> We are looking into adding switchdev support for our new 100Gb ice driver
>>>>> and could use some feedback on the direction we should be taking.
>>>> good news.
>>>>
>>>> The uplink rep is clear cut: it needs to be a rep device representing
>>>> the uplink, just like the vf rep represents the vport toward the vf.
>>>> Please just do it correctly from the beginning.
>>>>
>>> Having an uplink rep will definitely help implement the slow path with
>>> flat/vlan network
>>> scenarios by not having to add PF to the bridge.
>>>
>>> But how do they help with a vxlan overlay scenario? In case of overlays, the
>>> slow path has to go via vxlan -> ip stack -> pf?
>> in  overlay networks scheme, the uplink has the VTEP ip and is not connected
> the uplink rep has the vtep ip
>
>> to the bridge, e.g you use ovs you have vf reps and vxlan ports connected to ovs
>> and the ip stack routes through the uplink rep

This changes the legacy mode behavior of configuring the vtep ip on the pf netdev.
How is host-to-host traffic expected to work when the vtep ip is moved to the uplink rep?


>>
>>> What about pf-rep?

Are you planning to create a pf-rep too? Is the pf also treated similarly to a vf in switchdev mode?
Does all pf traffic go to the pf-rep, and pf-rep traffic go to the pf, by default without any rules
programmed?


* Re: XDP performance regression due to CONFIG_RETPOLINE Spectre V2
From: Christoph Hellwig @ 2018-04-13 16:49 UTC
  To: Jesper Dangaard Brouer
  Cc: Christoph Hellwig, xdp-newbies@vger.kernel.org,
	netdev@vger.kernel.org, David Woodhouse, William Tu,
	Björn Töpel, Karlsson, Magnus, Alexander Duyck,
	Arnaldo Carvalho de Melo
In-Reply-To: <20180412173131.49f01252@redhat.com>

On Thu, Apr 12, 2018 at 05:31:31PM +0200, Jesper Dangaard Brouer wrote:
> > I guess that is because x86 selects it as the default as soon as
> > we have more than 4G memory. 
> 
> I was also confused about why I ended up using SWIOTLB (SoftWare IO-TLB),
> that might explain it. And I'm not hitting the bounce-buffer case.
> 
> How do I control which DMA engine I use? (So, I can play a little)

At the lowest level you control it by:

 (1) setting the dma_ops pointer in struct device
 (2) if that is NULL by choosing what is returned from
     get_arch_dma_ops()
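
The two-level selection just listed can be sketched in userspace as follows. The structs are simplified stand-ins for the kernel's struct device and struct dma_map_ops, and arch_default_ops() is a hypothetical placeholder for whatever get_arch_dma_ops() returns on a given system.

```c
#include <assert.h>
#include <stddef.h>

struct dma_map_ops { const char *name; };   /* stand-in for the kernel struct */

struct device { const struct dma_map_ops *dma_ops; };

static const struct dma_map_ops swiotlb_ops = { "swiotlb" };
static const struct dma_map_ops direct_ops  = { "direct" };

/* Stand-in for the architecture default, step (2) above. */
static const struct dma_map_ops *arch_default_ops(void)
{
	return &swiotlb_ops;
}

/* Step (1) wins when set; otherwise fall back to the architecture default. */
static const struct dma_map_ops *pick_dma_ops(const struct device *dev)
{
	assert(dev != NULL);
	return dev->dma_ops ? dev->dma_ops : arch_default_ops();
}
```

To experiment, the per-device pointer is the knob with the narrowest effect: setting it changes one device, while changing the architecture default changes every device that leaves the pointer NULL.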

> 
> 
> > That should be solveable fairly easily with the per-device dma ops,
> > though.
> 
> I didn't understand this part.

What I mean is that we can start out by setting dma_ops to
dma_direct_ops for everyone on x86 (assuming we don't have an iommu),
and only switch to swiotlb_dma_ops when actually required, either by
a dma_mask that can't address all memory or by some other special
case like SEV or broken bridges.

> I wanted to ask your opinion on a hackish idea I have:
> how to detect whether I can reuse the RX-DMA map address for a TX-DMA
> operation on another device (still/only calling sync_single_for_device).
> 
> With XDP_REDIRECT we are redirecting between net_device's. Usually
> we keep the RX-DMA mapping as we recycle the page. On the redirect to
> TX-device (via ndo_xdp_xmit) we do a new DMA map+unmap for TX.  The
> question is how to avoid this mapping(?).  In some cases, with some DMA
> engines (or lack of) I guess the DMA address is actually the same as
> the RX-DMA mapping dma_addr_t already known, right?  For those cases,
> would it be possible to just (re)use that address for TX?

You can't in any sensible way without breaking a lot of abstractions.
For dma direct ops that mapping will be the same unless the devices
have different dma_offsets in their struct device, or the architecture
overrides phys_to_dma entirely, in which case all bets are off.
If you have an iommu it depends on which devices are behind the same
iommu.


* Creating FOU tunnels to the same destination IP but different port
From: Kostas Peletidis @ 2018-04-13 16:57 UTC
  To: netdev

Hello,

I am having trouble with a particular case of setting up a fou tunnel
and I would really appreciate your help.

I have a remote multihomed host behind a NAT box and I want to create
a fou tunnel for each of its IP addresses, from my machine.

A typical case would be something like that (output from the local machine):

# ip tun
ipudp09602: ip/ip remote 135.196.22.100 local 172.31.0.140 ttl 225
ipudp00101: ip/ip remote 148.252.129.30 local 172.31.0.140 ttl 225
ipudp09604: ip/ip remote 77.247.11.249 local 172.31.0.140 ttl 225
tunl0: any/ip remote any local any ttl inherit nopmtudisc
ipudp00102: ip/ip remote 213.205.194.18 local 172.31.0.140 ttl 225

However, if the remote end has the same IP address as the remote end
of an existing tunnel (but a different remote port), tunnel creation
fails. In this example there is already a tunnel to
135.196.22.100:32270 and I wanted to create a new tunnel to
135.196.22.100:24822, as below:

# ip link add name ipudp09603 mtu 1356 type ipip \
  remote 135.196.22.100 \
  local 172.31.0.140 \
  ttl 225 \
  encap fou \
     encap-sport 4500 \
     encap-dport 24822

RTNETLINK answers: File exists

The remote IP addresses in this case are identical because there is a
NAT box in the way, but the port numbers are different. The source
address and port are the same in all cases.

I noticed that ip_tunnel_find() does not check port numbers - being IP
and all - so I am thinking that a not-so-elegant way to do it is to
get the port numbers from the netlink request and have
ip_tunnel_find() compare them against encap.{sport, dport} of existing
tunnels.
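
The idea above can be illustrated with two match predicates: one that looks at addresses only, and one that also compares the encapsulation ports. The struct and function names here are hypothetical and much simpler than the kernel's ip_tunnel structures.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

struct tunnel_key {
	uint32_t remote;        /* IPv4 addresses, host byte order for simplicity */
	uint32_t local;
	uint16_t encap_sport;   /* 0 when the tunnel has no UDP encapsulation */
	uint16_t encap_dport;
};

/* Build an IPv4 address from dotted-quad octets. */
static uint32_t ip4(unsigned a, unsigned b, unsigned c, unsigned d)
{
	assert(a < 256 && b < 256 && c < 256 && d < 256);
	return ((uint32_t)a << 24) | ((uint32_t)b << 16) | ((uint32_t)c << 8) | d;
}

/* Address-only match: two tunnels to the same NAT box look identical. */
static bool match_addr_only(const struct tunnel_key *a, const struct tunnel_key *b)
{
	return a->remote == b->remote && a->local == b->local;
}

/* Stricter match: addresses and encapsulation ports must all agree. */
static bool match_with_ports(const struct tunnel_key *a, const struct tunnel_key *b)
{
	return match_addr_only(a, b) &&
	       a->encap_sport == b->encap_sport &&
	       a->encap_dport == b->encap_dport;
}
```

With the example above, the existing tunnel (dport 32270) and the new one (dport 24822) collide under the address-only match, which is consistent with the "File exists" error, but are distinct under the port-aware match.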

Is there a better way to create a second fou tunnel to the same IP
address but a different port? Use of keys as unique tunnel IDs maybe?
Any feedback is appreciated. Thank you.

Regards,
Kostas


* Re: XDP performance regression due to CONFIG_RETPOLINE Spectre V2
From: Tushar Dave @ 2018-04-13 17:12 UTC
  To: Christoph Hellwig, Jesper Dangaard Brouer
  Cc: xdp-newbies@vger.kernel.org, netdev@vger.kernel.org,
	David Woodhouse, William Tu, Björn Töpel,
	Karlsson, Magnus, Alexander Duyck, Arnaldo Carvalho de Melo
In-Reply-To: <20180412145653.GA7172@lst.de>



On 04/12/2018 07:56 AM, Christoph Hellwig wrote:
> On Thu, Apr 12, 2018 at 04:51:23PM +0200, Christoph Hellwig wrote:
>> On Thu, Apr 12, 2018 at 03:50:29PM +0200, Jesper Dangaard Brouer wrote:
>>> ---------------
>>> Implement support for keeping the DMA mapping through the XDP return
>>> call, to remove RX map/unmap calls.  Implement bulking for XDP
>>> ndo_xdp_xmit and XDP return frame API.  Bulking allows performing DMA
>>> bulking via scatter-gather DMA calls; XDP TX needs it for DMA
>>> map+unmap. The driver RX DMA-sync (to CPU) per packet calls are harder
>>> to mitigate (via bulk technique). Ask DMA maintainer for a common
>>> case direct call for swiotlb DMA sync call ;-)
>>
>> Why do you even end up in swiotlb code?  Once you bounce buffer your
>> performance is toast anyway..
> 
> I guess that is because x86 selects it as the default as soon as
> we have more than 4G memory. That should be solveable fairly easily
> with the per-device dma ops, though.

I guess there is nothing we need to do!

On x86, if there is no Intel IOMMU or the IOMMU is disabled, you end up
in swiotlb for DMA API calls when the system has more than 4G of memory.
However, AFAICT, for 64-bit DMA capable devices the swiotlb DMA APIs do
not use a bounce buffer unless swiotlb=force is specified on the kernel
command line.

e.g. here is the snip:
dma_addr_t swiotlb_map_page(struct device *dev, struct page *page,
                             unsigned long offset, size_t size,
                             enum dma_data_direction dir,
                             unsigned long attrs)
{
         phys_addr_t map, phys = page_to_phys(page) + offset;
         dma_addr_t dev_addr = phys_to_dma(dev, phys);

         BUG_ON(dir == DMA_NONE);
         /*
          * If the address happens to be in the device's DMA window,
          * we can safely return the device addr and not worry about bounce
          * buffering it.
          */
         if (dma_capable(dev, dev_addr, size) && swiotlb_force != SWIOTLB_FORCE)
                 return dev_addr;


-Tushar


* Re: XDP performance regression due to CONFIG_RETPOLINE Spectre V2
From: Christoph Hellwig @ 2018-04-13 17:26 UTC
  To: Tushar Dave
  Cc: Christoph Hellwig, Jesper Dangaard Brouer,
	xdp-newbies@vger.kernel.org, netdev@vger.kernel.org,
	David Woodhouse, William Tu, Björn Töpel,
	Karlsson, Magnus, Alexander Duyck, Arnaldo Carvalho de Melo
In-Reply-To: <fbcb4cc8-53fb-82cd-bd9d-76c5fdd47918@oracle.com>

On Fri, Apr 13, 2018 at 10:12:41AM -0700, Tushar Dave wrote:
> I guess there is nothing we need to do!
>
> On x86, if there is no Intel IOMMU or the IOMMU is disabled, you end up
> in swiotlb for DMA API calls when the system has more than 4G of memory.
> However, AFAICT, for 64-bit DMA capable devices the swiotlb DMA APIs do
> not use a bounce buffer unless swiotlb=force is specified on the kernel
> command line.

Sure.  But that means every sync_*_for_device and sync_*_for_cpu now
involves an indirect call to do exactly nothing, which in the workload
Jesper is looking at is causing a huge performance degradation due to
retpolines.
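
The cost described above, an indirect call that ends up doing nothing, can be illustrated with a direct-branch fast path in front of the function pointer. This is a userspace sketch with hypothetical names, not a proposal taken from the thread or the kernel's actual DMA code.

```c
#include <assert.h>
#include <stddef.h>

struct dma_ops {
	void (*sync_for_cpu)(void *buf, size_t len);   /* may be a no-op */
};

static int indirect_calls;          /* counts calls made through the pointer */

static void noop_sync(void *buf, size_t len)
{
	(void)buf;
	(void)len;
	indirect_calls++;
}

static const struct dma_ops direct_like_ops = { noop_sync };

/* Skip the indirect call entirely when the ops are known to need no sync work. */
static void sync_for_cpu(const struct dma_ops *ops, void *buf, size_t len)
{
	assert(ops != NULL);
	if (ops == &direct_like_ops)
		return;                     /* plain compare and branch, no thunk */
	ops->sync_for_cpu(buf, len);        /* indirect call, pays the retpoline cost */
}
```

The pointer comparison is a direct branch, so the common no-op case would not go through a retpoline thunk at all; only devices that really need sync work would pay for the indirect call.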


* Re: [PATCH net-next 1/3] net: ethernet: ave: add multiple clocks and resets support as required property
From: Rob Herring @ 2018-04-13 17:26 UTC
  To: Kunihiko Hayashi
  Cc: David Miller, netdev, Andrew Lunn, Florian Fainelli, Mark Rutland,
	linux-arm-kernel, linux-kernel, devicetree, Masahiro Yamada,
	Masami Hiramatsu, Jassi Brar
In-Reply-To: <1523255925-6469-2-git-send-email-hayashi.kunihiko@socionext.com>

On Mon, Apr 09, 2018 at 03:38:43PM +0900, Kunihiko Hayashi wrote:
> When the link comes up on the Pro4 SoC, the kernel stalls due to some
> missing clocks and resets.
> 
> The AVE block for Pro4 is connected to the GIO bus in the SoC.
> Without its clock/reset, the access to the AVE register makes the
> system stall.
> 
> In the same way, another MAC clock for Giga-bit Connection and
> the PHY clock are also required for Pro4 to activate the Giga-bit feature
> and to recognize the PHY.
> 
> To satisfy these requirements, this patch adds support for multiple clocks
> and resets, and adds the clock-names and reset-names to the binding because
> we need to distinguish clock/reset for the AVE main block and the others.
> 
> Also, make the resets a required property. Currently, "reset is
> optional" relies on the bootloader or firmware having deasserted
> the reset before booting the kernel.  Drivers should work without
> such an expectation.
> 
> Fixes: 4c270b55a5af ("net: ethernet: socionext: add AVE ethernet driver")
> Suggested-by: Masahiro Yamada <yamada.masahiro@socionext.com>
> Signed-off-by: Kunihiko Hayashi <hayashi.kunihiko@socionext.com>
> ---
>  .../bindings/net/socionext,uniphier-ave4.txt       |  13 ++-

Reviewed-by: Rob Herring <robh@kernel.org>

>  drivers/net/ethernet/socionext/sni_ave.c           | 108 ++++++++++++++++-----
>  2 files changed, 96 insertions(+), 25 deletions(-)

^ permalink raw reply

* Re: [PATCH net-next 2/3] dt-bindings: net: ave: add syscon-phy-mode property to configure phy-mode setting
From: Rob Herring @ 2018-04-13 17:27 UTC (permalink / raw)
  To: Kunihiko Hayashi
  Cc: David Miller, netdev, Andrew Lunn, Florian Fainelli, Mark Rutland,
	linux-arm-kernel, linux-kernel, devicetree, Masahiro Yamada,
	Masami Hiramatsu, Jassi Brar
In-Reply-To: <1523255925-6469-3-git-send-email-hayashi.kunihiko@socionext.com>

On Mon, Apr 09, 2018 at 03:38:44PM +0900, Kunihiko Hayashi wrote:
> Add "socionext,syscon-phy-mode" property to specify the system controller
> that configures the phy-mode settings.
> 
> Signed-off-by: Kunihiko Hayashi <hayashi.kunihiko@socionext.com>
> ---
>  Documentation/devicetree/bindings/net/socionext,uniphier-ave4.txt | 6 +++++-
>  1 file changed, 5 insertions(+), 1 deletion(-)

Reviewed-by: Rob Herring <robh@kernel.org>

^ permalink raw reply

* Re: [PATCH net] team: avoid adding twice the same option to the event list
From: David Miller @ 2018-04-13 18:07 UTC (permalink / raw)
  To: pabeni; +Cc: netdev, jiri
In-Reply-To: <0e295e62358a68b22c646adece4272a9bd0473f8.1523620752.git.pabeni@redhat.com>

From: Paolo Abeni <pabeni@redhat.com>
Date: Fri, 13 Apr 2018 13:59:25 +0200

> When parsing the options provided by the user space,
> team_nl_cmd_options_set() inserts them in a temporary list to send
> multiple events with a single message.
> While each option's attribute is correctly validated, the code does
> not check for duplicate entries before inserting into the event
> list.
> 
> Exploiting the above, the syzbot was able to trigger the following
> splat:
 ...
> This changeset addresses the issue by avoiding list_add() if the
> current option is already present in the event list.
> 
> Reported-and-tested-by: syzbot+4d4af685432dc0e56c91@syzkaller.appspotmail.com
> Signed-off-by: Paolo Abeni <pabeni@redhat.com>
> Fixes: 2fcdb2c9e659 ("team: allow to send multiple set events in one message")

Looks good to me.

It's too bad that the tmp list entries don't get marked as they are
added, or get unlinked by the list processor.  Either scheme would
make the "already added" test a lot simpler.
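
To illustrate the difference, here is a minimal userspace sketch, not
the team driver code.  The struct, the "queued" marker and all toy_*
names are hypothetical and exist only for this example.  Scheme 1 scans
the temporary list before adding (roughly what the patch has to do),
scheme 2 tests a per-entry marker in constant time.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Toy option instance with an embedded link for the temporary event list. */
struct toy_opt {
	int id;
	struct toy_opt *next;	/* link in the temporary event list */
	bool queued;		/* hypothetical marker, set while on the list */
};

/* Scheme 1: walk the list to see whether the entry is already there. */
static bool toy_list_contains(const struct toy_opt *head,
			      const struct toy_opt *opt)
{
	for (; head; head = head->next)
		if (head == opt)
			return true;
	return false;
}

/* Add with a linear "already added" scan; returns false on a duplicate. */
static bool toy_add_scan(struct toy_opt **head, struct toy_opt *opt)
{
	assert(head && opt);
	if (toy_list_contains(*head, opt))
		return false;
	opt->next = *head;
	*head = opt;
	return true;
}

/* Scheme 2: O(1) test using the marker set when the entry was queued. */
static bool toy_add_marked(struct toy_opt **head, struct toy_opt *opt)
{
	assert(head && opt);
	if (opt->queued)
		return false;
	opt->queued = true;
	opt->next = *head;
	*head = opt;
	return true;
}
```

In both variants the second add of the same entry is refused, so the
list links are never overwritten; the marker variant would additionally
need the list processor to clear "queued" when it is done.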

Jiri, please review before I apply this.

Thanks.

^ permalink raw reply

* ethtool 4.16 released
From: John W. Linville @ 2018-04-13 18:02 UTC (permalink / raw)
  To: netdev

ethtool version 4.16 has been released.

Home page: https://www.kernel.org/pub/software/network/ethtool/
Download link:
https://www.kernel.org/pub/software/network/ethtool/ethtool-4.16.tar.xz

Release notes:

	* Feature: add support for extra RSS contexts and RSS steering filters
	* Feature: Document RSS context control and RSS filters
	* Fix: don't fall back to grxfhindir when context was specified
	* Fix: correct display of VF when showing vf/queue filters
	* Fix: show VF and queue in the help for -N
	* Fix: correct VF index values for the ring_cookie parameter
	* Feature: Add SFF 8636 date code parsing support

John
-- 
John W. Linville		Someday the world will need a hero, and you
linville@tuxdriver.com			might be all we have.  Be ready.

^ permalink raw reply

* [PATCH v2 net 0/3] sfc: ARFS fixes
From: Edward Cree @ 2018-04-13 18:16 UTC (permalink / raw)
  To: linux-net-drivers, David Miller; +Cc: netdev

Three issues introduced by my recent asynchronous filter handling changes:
1. The old filter_rfs_insert would replace a matching filter of equal
   priority; we need to pass the appropriate argument to filter_insert to
   make it do the same.
2. We're lying to the kernel with our return value from ndo_rx_flow_steer,
   so we need to lie consistently when calling rps_may_expire_flow.  This
   is only a partial fix, as the lie still prevents us from steering
   multiple flows with the same ID to different queues; a proper fix that
   stops us lying at all will hopefully follow later.
3. It's possible to cause the kernel to hammer ndo_rx_flow_steer very
   hard, so make sure we don't build up too huge a backlog of workitems.

Possibly it would be better to fix #3 on the kernel side; I have a patch
 which I think does that but it's not a regression in 4.17 so isn't 'net'
 material.
There's also the issue that we come up in the bad configuration that
 triggers #3 by default, but that too is a problem for another time.

Edward Cree (3):
  sfc: insert ARFS filters with replace_equal=true
  sfc: pass the correctly bogus filter_id to rps_may_expire_flow()
  sfc: limit ARFS workitems in flight per channel

 drivers/net/ethernet/sfc/ef10.c       |  3 +-
 drivers/net/ethernet/sfc/farch.c      |  2 +-
 drivers/net/ethernet/sfc/net_driver.h | 25 +++++++++++++++
 drivers/net/ethernet/sfc/rx.c         | 60 ++++++++++++++++++-----------------
 4 files changed, 58 insertions(+), 32 deletions(-)

^ permalink raw reply

* [PATCH] PCI: Add PCIe to pcie_print_link_status() messages
From: Jakub Kicinski @ 2018-04-13 18:16 UTC (permalink / raw)
  To: Bjorn Helgaas
  Cc: oss-drivers, Tal Gilboa, Tariq Toukan, Jacob Keller,
	Ganesh Goudar, Jeff Kirsher, intel-wired-lan, netdev,
	linux-kernel, linux-pci, Jakub Kicinski

Currently pcie_print_link_status() prints PCIe bandwidth and link
width information but does not mention that it pertains to PCIe.
Since this and related functions are used exclusively by networking
drivers today, users may be confused into thinking that it is the
NIC bandwidth that is being reported.  Insert "PCIe" into the
messages.

Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
---
 drivers/pci/pci.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)
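
For reference, a small userspace sketch of how the "%u.%03u Gb/s"
format renders a bandwidth kept in Mb/s, with the new wording.
format_pcie_bw() is a hypothetical helper written for this note, not a
kernel function, and the values passed in are only illustrative.

```c
#include <assert.h>
#include <stdio.h>

/*
 * Format a bandwidth given in Mb/s the same way the patched message does:
 * whole Gb/s, then the remainder as three zero-padded digits.
 * Returns the snprintf() result.
 */
static int format_pcie_bw(char *buf, size_t len, unsigned int bw_mbps,
			  const char *speed, int width)
{
	assert(buf && speed);
	return snprintf(buf, len,
			"%u.%03u Gb/s available PCIe bandwidth (%s x%d link)",
			bw_mbps / 1000, bw_mbps % 1000, speed, width);
}
```

The %03u matters for values such as 2005 Mb/s, which must print as
2.005 and not 2.5.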

diff --git a/drivers/pci/pci.c b/drivers/pci/pci.c
index aa86e904f93c..73a0a4993f6a 100644
--- a/drivers/pci/pci.c
+++ b/drivers/pci/pci.c
@@ -5273,11 +5273,11 @@ void pcie_print_link_status(struct pci_dev *dev)
 	bw_avail = pcie_bandwidth_available(dev, &limiting_dev, &speed, &width);
 
 	if (bw_avail >= bw_cap)
-		pci_info(dev, "%u.%03u Gb/s available bandwidth (%s x%d link)\n",
+		pci_info(dev, "%u.%03u Gb/s available PCIe bandwidth (%s x%d link)\n",
 			 bw_cap / 1000, bw_cap % 1000,
 			 PCIE_SPEED2STR(speed_cap), width_cap);
 	else
-		pci_info(dev, "%u.%03u Gb/s available bandwidth, limited by %s x%d link at %s (capable of %u.%03u Gb/s with %s x%d link)\n",
+		pci_info(dev, "%u.%03u Gb/s available PCIe bandwidth, limited by %s x%d link at %s (capable of %u.%03u Gb/s with %s x%d link)\n",
 			 bw_avail / 1000, bw_avail % 1000,
 			 PCIE_SPEED2STR(speed), width,
 			 limiting_dev ? pci_name(limiting_dev) : "<unknown>",
-- 
2.16.2

^ permalink raw reply related

* [PATCH v2 net 1/3] sfc: insert ARFS filters with replace_equal=true
From: Edward Cree @ 2018-04-13 18:17 UTC (permalink / raw)
  To: linux-net-drivers, David Miller; +Cc: netdev
In-Reply-To: <878265b2-a42a-d49e-0e68-0bbcabbabeaa@solarflare.com>

Necessary to allow redirecting a flow when the application moves.

Fixes: 3af0f34290f6 ("sfc: replace asynchronous filter operations")
Signed-off-by: Edward Cree <ecree@solarflare.com>
---
 drivers/net/ethernet/sfc/rx.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)
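
A toy model of the semantics this relies on, for readers not familiar
with the filter code.  This is a sketch under assumptions, not the sfc
implementation: the table, the toy_* names and the exact errno values
are made up for illustration.  The point is only that an insert which
matches an existing filter of equal priority is refused unless
replace_equal is set.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

#define TOY_TABLE_SIZE 4

/* Toy filter: 'match' stands in for the full match specification. */
struct toy_filter {
	bool used;
	unsigned int match;
	int priority;
	int rxq;
};

static struct toy_filter toy_table[TOY_TABLE_SIZE];

/*
 * Insert a filter; returns the table index or a negative errno.
 * An existing filter with the same match is replaced if the new one has
 * higher priority, or equal priority and replace_equal is set.
 */
static int toy_filter_insert(const struct toy_filter *spec, bool replace_equal)
{
	int free_idx = -1;
	int i;

	assert(spec);
	for (i = 0; i < TOY_TABLE_SIZE; i++) {
		struct toy_filter *cur = &toy_table[i];

		if (!cur->used) {
			if (free_idx < 0)
				free_idx = i;
			continue;
		}
		if (cur->match != spec->match)
			continue;
		if (spec->priority < cur->priority)
			return -EPERM;	/* assumed errno for this sketch */
		if (spec->priority == cur->priority && !replace_equal)
			return -EEXIST;	/* assumed errno for this sketch */
		*cur = *spec;
		cur->used = true;
		return i;
	}
	if (free_idx < 0)
		return -EBUSY;
	toy_table[free_idx] = *spec;
	toy_table[free_idx].used = true;
	return free_idx;
}
```

With replace_equal=false a flow that moves to another queue keeps its
old filter; with replace_equal=true the second insert overwrites the
queue in place.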

diff --git a/drivers/net/ethernet/sfc/rx.c b/drivers/net/ethernet/sfc/rx.c
index 95682831484e..13b0eb71dbf3 100644
--- a/drivers/net/ethernet/sfc/rx.c
+++ b/drivers/net/ethernet/sfc/rx.c
@@ -851,7 +851,7 @@ static void efx_filter_rfs_work(struct work_struct *data)
 	struct efx_channel *channel = efx_get_channel(efx, req->rxq_index);
 	int rc;
 
-	rc = efx->type->filter_insert(efx, &req->spec, false);
+	rc = efx->type->filter_insert(efx, &req->spec, true);
 	if (rc >= 0) {
 		/* Remember this so we can check whether to expire the filter
 		 * later.

^ permalink raw reply related

* [PATCH v2 net 2/3] sfc: pass the correctly bogus filter_id to rps_may_expire_flow()
From: Edward Cree @ 2018-04-13 18:17 UTC (permalink / raw)
  To: linux-net-drivers, David Miller; +Cc: netdev
In-Reply-To: <878265b2-a42a-d49e-0e68-0bbcabbabeaa@solarflare.com>

When we inserted an ARFS filter for ndo_rx_flow_steer(), we didn't know
 what the filter ID would be, so we just returned 0.  Thus, we must also
 pass 0 as the filter ID when calling rps_may_expire_flow() for it, and
 rely on the flow_id to identify what we're talking about.

Fixes: 3af0f34290f6 ("sfc: replace asynchronous filter operations")
Signed-off-by: Edward Cree <ecree@solarflare.com>
---
 drivers/net/ethernet/sfc/ef10.c  | 3 +--
 drivers/net/ethernet/sfc/farch.c | 2 +-
 2 files changed, 2 insertions(+), 3 deletions(-)
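
A simplified model of why the two values have to agree.  This is a
sketch, not the kernel code: the toy_* names are invented, and the real
rps_may_expire_flow() also looks at CPU and queue state, which is left
out here.  The assumption modelled is only that the stack remembers the
filter ID returned by the steering hook and compares it with the one
passed in at expiry time.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define TOY_FLOWS 8

/* What the stack remembers per flow: the filter ID the driver returned. */
static uint16_t toy_recorded_filter[TOY_FLOWS];

/* Stack side: store the driver's return value from its steering hook. */
static void toy_record_steer(uint32_t flow_id, uint16_t returned_filter_id)
{
	assert(flow_id < TOY_FLOWS);
	toy_recorded_filter[flow_id] = returned_filter_id;
}

/*
 * Simplified expiry check: the flow can only be considered "still in use"
 * when the filter ID passed in matches the recorded one.  A mismatch
 * always reports the filter as expirable.
 */
static bool toy_may_expire(uint32_t flow_id, uint16_t filter_id)
{
	assert(flow_id < TOY_FLOWS);
	return toy_recorded_filter[flow_id] != filter_id;
}
```

If the driver returned 0 from the steering hook but later asks about the
real hardware index, the IDs never match and live filters get expired;
passing 0 in both places keeps the two sides consistent.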

diff --git a/drivers/net/ethernet/sfc/ef10.c b/drivers/net/ethernet/sfc/ef10.c
index 50daad0a1482..36f24c7e553a 100644
--- a/drivers/net/ethernet/sfc/ef10.c
+++ b/drivers/net/ethernet/sfc/ef10.c
@@ -4776,8 +4776,7 @@ static bool efx_ef10_filter_rfs_expire_one(struct efx_nic *efx, u32 flow_id,
 		goto out_unlock;
 	}
 
-	if (!rps_may_expire_flow(efx->net_dev, spec->dmaq_id,
-				 flow_id, filter_idx)) {
+	if (!rps_may_expire_flow(efx->net_dev, spec->dmaq_id, flow_id, 0)) {
 		ret = false;
 		goto out_unlock;
 	}
diff --git a/drivers/net/ethernet/sfc/farch.c b/drivers/net/ethernet/sfc/farch.c
index 4a19c7efdf8d..7174ef5e5c5e 100644
--- a/drivers/net/ethernet/sfc/farch.c
+++ b/drivers/net/ethernet/sfc/farch.c
@@ -2912,7 +2912,7 @@ bool efx_farch_filter_rfs_expire_one(struct efx_nic *efx, u32 flow_id,
 	if (test_bit(index, table->used_bitmap) &&
 	    table->spec[index].priority == EFX_FILTER_PRI_HINT &&
 	    rps_may_expire_flow(efx->net_dev, table->spec[index].dmaq_id,
-				flow_id, index)) {
+				flow_id, 0)) {
 		efx_farch_filter_table_clear_entry(efx, table, index);
 		ret = true;
 	}

^ permalink raw reply related

* [PATCH v2 net 3/3] sfc: limit ARFS workitems in flight per channel
From: Edward Cree @ 2018-04-13 18:18 UTC (permalink / raw)
  To: linux-net-drivers, David Miller; +Cc: netdev
In-Reply-To: <878265b2-a42a-d49e-0e68-0bbcabbabeaa@solarflare.com>

A misconfigured system (e.g. with all interrupts affinitised to all CPUs)
 may produce a storm of ARFS steering events.  With the existing sfc ARFS
 implementation, that could create a backlog of workitems that grinds the
 system to a halt.  To prevent this, limit the number of workitems that
 may be in flight for a given SFC device to 8 (EFX_RPS_MAX_IN_FLIGHT), and
 return EBUSY from our ndo_rx_flow_steer method if the limit is reached.
Given this limit, also store the workitems in an array of slots within the
 struct efx_nic, rather than dynamically allocating for each request.
The limit should not negatively impact performance, because it is only
 likely to be hit in cases where ARFS will be ineffective anyway.

Signed-off-by: Edward Cree <ecree@solarflare.com>
---
 drivers/net/ethernet/sfc/net_driver.h | 25 +++++++++++++++
 drivers/net/ethernet/sfc/rx.c         | 58 ++++++++++++++++++-----------------
 2 files changed, 55 insertions(+), 28 deletions(-)
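
The slot scheme can be sketched in userspace as below.  This is an
illustration only: it uses C11 atomics in place of the kernel's
test_and_set_bit()/clear_bit(), and the toy_* names are hypothetical.

```c
#include <assert.h>
#include <errno.h>
#include <stdatomic.h>

#define TOY_MAX_IN_FLIGHT 8

/* Bit n set means slot n is owned by an in-flight request. */
static atomic_ulong toy_slot_map;

/* Claim the first free slot; returns its index or -EBUSY if all are taken. */
static int toy_slot_get(void)
{
	int idx;

	for (idx = 0; idx < TOY_MAX_IN_FLIGHT; idx++) {
		unsigned long bit = 1UL << idx;

		/* fetch_or returns the old value: the slot is ours if the bit was clear */
		if (!(atomic_fetch_or(&toy_slot_map, bit) & bit))
			return idx;
	}
	return -EBUSY;
}

/* Release a slot once its work has completed (or setup has failed). */
static void toy_slot_put(int idx)
{
	assert(idx >= 0 && idx < TOY_MAX_IN_FLIGHT);
	atomic_fetch_and(&toy_slot_map, ~(1UL << idx));
}
```

Because the bitmap bounds the number of outstanding requests, the
request storage can be a fixed array indexed by the slot number, and no
allocation is needed on the steering path.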

diff --git a/drivers/net/ethernet/sfc/net_driver.h b/drivers/net/ethernet/sfc/net_driver.h
index 5e379a83c729..eea3808b3f25 100644
--- a/drivers/net/ethernet/sfc/net_driver.h
+++ b/drivers/net/ethernet/sfc/net_driver.h
@@ -733,6 +733,27 @@ struct efx_rss_context {
 	u32 rx_indir_table[128];
 };
 
+#ifdef CONFIG_RFS_ACCEL
+/**
+ * struct efx_async_filter_insertion - Request to asynchronously insert a filter
+ * @net_dev: Reference to the netdevice
+ * @spec: The filter to insert
+ * @work: Workitem for this request
+ * @rxq_index: Identifies the channel for which this request was made
+ * @flow_id: Identifies the kernel-side flow for which this request was made
+ */
+struct efx_async_filter_insertion {
+	struct net_device *net_dev;
+	struct efx_filter_spec spec;
+	struct work_struct work;
+	u16 rxq_index;
+	u32 flow_id;
+};
+
+/* Maximum number of ARFS workitems that may be in flight on an efx_nic */
+#define EFX_RPS_MAX_IN_FLIGHT	8
+#endif /* CONFIG_RFS_ACCEL */
+
 /**
  * struct efx_nic - an Efx NIC
  * @name: Device name (net device name or bus id before net device registered)
@@ -850,6 +871,8 @@ struct efx_rss_context {
  * @rps_expire_channel: Next channel to check for expiry
  * @rps_expire_index: Next index to check for expiry in
  *	@rps_expire_channel's @rps_flow_id
+ * @rps_slot_map: bitmap of in-flight entries in @rps_slot
+ * @rps_slot: array of ARFS insertion requests for efx_filter_rfs_work()
  * @active_queues: Count of RX and TX queues that haven't been flushed and drained.
  * @rxq_flush_pending: Count of number of receive queues that need to be flushed.
  *	Decremented when the efx_flush_rx_queue() is called.
@@ -1004,6 +1027,8 @@ struct efx_nic {
 	struct mutex rps_mutex;
 	unsigned int rps_expire_channel;
 	unsigned int rps_expire_index;
+	unsigned long rps_slot_map;
+	struct efx_async_filter_insertion rps_slot[EFX_RPS_MAX_IN_FLIGHT];
 #endif
 
 	atomic_t active_queues;
diff --git a/drivers/net/ethernet/sfc/rx.c b/drivers/net/ethernet/sfc/rx.c
index 13b0eb71dbf3..9c593c661cbf 100644
--- a/drivers/net/ethernet/sfc/rx.c
+++ b/drivers/net/ethernet/sfc/rx.c
@@ -827,28 +827,13 @@ MODULE_PARM_DESC(rx_refill_threshold,
 
 #ifdef CONFIG_RFS_ACCEL
 
-/**
- * struct efx_async_filter_insertion - Request to asynchronously insert a filter
- * @net_dev: Reference to the netdevice
- * @spec: The filter to insert
- * @work: Workitem for this request
- * @rxq_index: Identifies the channel for which this request was made
- * @flow_id: Identifies the kernel-side flow for which this request was made
- */
-struct efx_async_filter_insertion {
-	struct net_device *net_dev;
-	struct efx_filter_spec spec;
-	struct work_struct work;
-	u16 rxq_index;
-	u32 flow_id;
-};
-
 static void efx_filter_rfs_work(struct work_struct *data)
 {
 	struct efx_async_filter_insertion *req = container_of(data, struct efx_async_filter_insertion,
 							      work);
 	struct efx_nic *efx = netdev_priv(req->net_dev);
 	struct efx_channel *channel = efx_get_channel(efx, req->rxq_index);
+	int slot_idx = req - efx->rps_slot;
 	int rc;
 
 	rc = efx->type->filter_insert(efx, &req->spec, true);
@@ -878,8 +863,8 @@ static void efx_filter_rfs_work(struct work_struct *data)
 	}
 
 	/* Release references */
+	clear_bit(slot_idx, &efx->rps_slot_map);
 	dev_put(req->net_dev);
-	kfree(req);
 }
 
 int efx_filter_rfs(struct net_device *net_dev, const struct sk_buff *skb,
@@ -888,22 +873,36 @@ int efx_filter_rfs(struct net_device *net_dev, const struct sk_buff *skb,
 	struct efx_nic *efx = netdev_priv(net_dev);
 	struct efx_async_filter_insertion *req;
 	struct flow_keys fk;
+	int slot_idx;
+	int rc;
 
-	if (flow_id == RPS_FLOW_ID_INVALID)
-		return -EINVAL;
+	/* find a free slot */
+	for (slot_idx = 0; slot_idx < EFX_RPS_MAX_IN_FLIGHT; slot_idx++)
+		if (!test_and_set_bit(slot_idx, &efx->rps_slot_map))
+			break;
+	if (slot_idx >= EFX_RPS_MAX_IN_FLIGHT)
+		return -EBUSY;
 
-	if (!skb_flow_dissect_flow_keys(skb, &fk, 0))
-		return -EPROTONOSUPPORT;
+	if (flow_id == RPS_FLOW_ID_INVALID) {
+		rc = -EINVAL;
+		goto out_clear;
+	}
 
-	if (fk.basic.n_proto != htons(ETH_P_IP) && fk.basic.n_proto != htons(ETH_P_IPV6))
-		return -EPROTONOSUPPORT;
-	if (fk.control.flags & FLOW_DIS_IS_FRAGMENT)
-		return -EPROTONOSUPPORT;
+	if (!skb_flow_dissect_flow_keys(skb, &fk, 0)) {
+		rc = -EPROTONOSUPPORT;
+		goto out_clear;
+	}
 
-	req = kmalloc(sizeof(*req), GFP_ATOMIC);
-	if (!req)
-		return -ENOMEM;
+	if (fk.basic.n_proto != htons(ETH_P_IP) && fk.basic.n_proto != htons(ETH_P_IPV6)) {
+		rc = -EPROTONOSUPPORT;
+		goto out_clear;
+	}
+	if (fk.control.flags & FLOW_DIS_IS_FRAGMENT) {
+		rc = -EPROTONOSUPPORT;
+		goto out_clear;
+	}
 
+	req = efx->rps_slot + slot_idx;
 	efx_filter_init_rx(&req->spec, EFX_FILTER_PRI_HINT,
 			   efx->rx_scatter ? EFX_FILTER_FLAG_RX_SCATTER : 0,
 			   rxq_index);
@@ -933,6 +932,9 @@ int efx_filter_rfs(struct net_device *net_dev, const struct sk_buff *skb,
 	req->flow_id = flow_id;
 	schedule_work(&req->work);
 	return 0;
+out_clear:
+	clear_bit(slot_idx, &efx->rps_slot_map);
+	return rc;
 }
 
 bool __efx_filter_rfs_expire(struct efx_nic *efx, unsigned int quota)

^ permalink raw reply related

* Re: [RFC bpf-next v2 4/8] bpf: add documentation for eBPF helpers (23-32)
From: Quentin Monnet @ 2018-04-13 18:18 UTC (permalink / raw)
  To: Alexei Starovoitov; +Cc: daniel, ast, netdev, oss-drivers, linux-doc, linux-man
In-Reply-To: <20180413002838.atu7shp5cuubx32p@ast-mbp.dhcp.thefacebook.com>

2018-04-12 17:28 UTC-0700 ~ Alexei Starovoitov
<alexei.starovoitov@gmail.com>
> On Tue, Apr 10, 2018 at 03:41:53PM +0100, Quentin Monnet wrote:
>> Add documentation for eBPF helper functions to bpf.h user header file.
>> This documentation can be parsed with the Python script provided in
>> another commit of the patch series, in order to provide a RST document
>> that can later be converted into a man page.
>>
>> The objective is to make the documentation easily understandable and
>> accessible to all eBPF developers, including beginners.
>>
>> This patch contains descriptions for the following helper functions, all
>> written by Daniel:
>>
>> - bpf_get_prandom_u32()
>> - bpf_get_smp_processor_id()
>> - bpf_get_cgroup_classid()
>> - bpf_get_route_realm()
>> - bpf_skb_load_bytes()
>> - bpf_csum_diff()
>> - bpf_skb_get_tunnel_opt()
>> - bpf_skb_set_tunnel_opt()
>> - bpf_skb_change_proto()
>> - bpf_skb_change_type()
>>
>> Cc: Daniel Borkmann <daniel@iogearbox.net>
>> Signed-off-by: Quentin Monnet <quentin.monnet@netronome.com>
>> ---
>>  include/uapi/linux/bpf.h | 125 +++++++++++++++++++++++++++++++++++++++++++++++
>>  1 file changed, 125 insertions(+)
>>
>> diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
>> index f3ea8824efbc..d147d9dd6a83 100644
>> --- a/include/uapi/linux/bpf.h
>> +++ b/include/uapi/linux/bpf.h

[...]

>> @@ -604,6 +612,13 @@ union bpf_attr {
>>   * 	Return
>>   * 		0 on success, or a negative error in case of failure.
>>   *
>> + * u32 bpf_get_cgroup_classid(struct sk_buff *skb)
>> + * 	Description
>> + * 		Retrieve the classid for the current task, i.e. for the
>> + * 		net_cls (network classifier) cgroup to which *skb* belongs.
> 
> please add that kernel should be configured with CONFIG_NET_CLS_CGROUP=y|m
> and mention Documentation/cgroup-v1/net_cls.txt

Ok.

> Otherwise 'network classifier' is way too generic.

I am not so familiar with cgroups. What would you suggest instead?

> I'd also mention that placing a task into net_cls controller
> disables all of cgroup-bpf.

Could you please explain a bit more? Placing a task into the controller
is using:

	echo <task_pid>  >  /sys/fs/cgroup/<my_cgroup_name>/tasks

correct? Then if I do this, it disables all of cgroup-bpf. Does this
mean that I lose the possibility to use or add BPF programs to all
cgroup-related attach points for this cgroup? I think I missed something
here.

>> + * 	Return
>> + * 		The classid, or 0 for the default unconfigured classid.
>> + *
>>   * int bpf_skb_vlan_push(struct sk_buff *skb, __be16 vlan_proto, u16 vlan_tci)
>>   * 	Description
>>   * 		Push a *vlan_tci* (VLAN tag control information) of protocol

I have no particular comments on the other items you reported on this
patch, I will fix them. Thanks!

Quentin

^ permalink raw reply

* [next-queue PATCH 1/1] ixgbe: cleanup sparse warnings
From: cathy.zhou @ 2018-04-13 18:28 UTC (permalink / raw)
  To: jeffrey.t.kirsher, intel-wired-lan; +Cc: netdev, shannon.nelson

From: Cathy Zhou <cathy.zhou@oracle.COM>

Sparse complains about valid conversions between restricted types; the
__force attribute is used to avoid those warnings.

Signed-off-by: Cathy Zhou <cathy.zhou@oracle.COM>
Reviewed-by: Shannon Nelson <shannon.nelson@oracle.com>
---
 drivers/net/ethernet/intel/ixgbe/ixgbe_82599.c  | 13 ++++++-----
 drivers/net/ethernet/intel/ixgbe/ixgbe_common.c |  2 +-
 drivers/net/ethernet/intel/ixgbe/ixgbe_fcoe.c   |  2 +-
 drivers/net/ethernet/intel/ixgbe/ixgbe_ipsec.c  | 25 +++++++++++++--------
 drivers/net/ethernet/intel/ixgbe/ixgbe_main.c   | 29 +++++++++++++++----------
 drivers/net/ethernet/intel/ixgbe/ixgbe_model.h  | 16 +++++++-------
 drivers/net/ethernet/intel/ixgbe/ixgbe_x550.c   |  9 ++++----
 7 files changed, 55 insertions(+), 41 deletions(-)
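
For readers unfamiliar with the annotations, a minimal userspace sketch
of the pattern follows.  The TOY_* macros and toy_* functions are
hypothetical names for this note; they mirror how the kernel's
__bitwise and __force expand under sparse (__CHECKER__) and vanish for
a normal compiler.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Under sparse the type is "restricted"; for a normal compiler these are no-ops. */
#ifdef __CHECKER__
#define TOY_BITWISE __attribute__((bitwise))
#define TOY_FORCE   __attribute__((force))
#else
#define TOY_BITWISE
#define TOY_FORCE
#endif

typedef uint16_t TOY_BITWISE toy_be16;

static_assert(sizeof(toy_be16) == sizeof(uint16_t),
	      "the annotation must not change the representation");

/* Real conversion: lay the value out most significant byte first. */
static toy_be16 toy_cpu_to_be16(uint16_t v)
{
	uint8_t b[2] = { (uint8_t)(v >> 8), (uint8_t)(v & 0xff) };
	uint16_t raw;

	memcpy(&raw, b, sizeof(raw));
	return (TOY_FORCE toy_be16)raw;
}

/* Real conversion back to host order. */
static uint16_t toy_be16_to_cpu(toy_be16 v)
{
	uint16_t raw = (TOY_FORCE uint16_t)v;
	uint8_t b[2];

	memcpy(b, &raw, sizeof(raw));
	return (uint16_t)((b[0] << 8) | b[1]);
}

/*
 * No conversion at all: a host-order value is stored in a field that is
 * merely typed big-endian.  The forced cast only silences the checker;
 * the bits are unchanged.
 */
static toy_be16 toy_store_hash(uint32_t bucket_hash)
{
	return (TOY_FORCE toy_be16)(bucket_hash & 0x1FFF);
}
```

The forced cast generates no code, so it should only be used where the
value is already known to be in the right representation.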

diff --git a/drivers/net/ethernet/intel/ixgbe/ixgbe_82599.c b/drivers/net/ethernet/intel/ixgbe/ixgbe_82599.c
index 66a74f4651e8..898b47b1a854 100644
--- a/drivers/net/ethernet/intel/ixgbe/ixgbe_82599.c
+++ b/drivers/net/ethernet/intel/ixgbe/ixgbe_82599.c
@@ -1462,7 +1462,8 @@ void ixgbe_atr_compute_perfect_hash_82599(union ixgbe_atr_input *input,
 {
 
 	u32 hi_hash_dword, lo_hash_dword, flow_vm_vlan;
-	u32 bucket_hash = 0, hi_dword = 0;
+	u32 bucket_hash = 0;
+	__be32 hi_dword = 0;
 	int i;
 
 	/* Apply masks to input data */
@@ -1501,7 +1502,7 @@ void ixgbe_atr_compute_perfect_hash_82599(union ixgbe_atr_input *input,
 	 * Limit hash to 13 bits since max bucket count is 8K.
 	 * Store result at the end of the input stream.
 	 */
-	input->formatted.bkt_hash = bucket_hash & 0x1FFF;
+	input->formatted.bkt_hash = (__force __be16)(bucket_hash & 0x1FFF);
 }
 
 /**
@@ -1610,7 +1611,7 @@ s32 ixgbe_fdir_set_input_mask_82599(struct ixgbe_hw *hw,
 		return IXGBE_ERR_CONFIG;
 	}
 
-	switch (input_mask->formatted.flex_bytes & 0xFFFF) {
+	switch ((__force u16)input_mask->formatted.flex_bytes & 0xFFFF) {
 	case 0x0000:
 		/* Mask Flex Bytes */
 		fdirm |= IXGBE_FDIRM_FLEX;
@@ -1680,13 +1681,13 @@ s32 ixgbe_fdir_write_perfect_filter_82599(struct ixgbe_hw *hw,
 	IXGBE_WRITE_REG(hw, IXGBE_FDIRPORT, fdirport);
 
 	/* record vlan (little-endian) and flex_bytes(big-endian) */
-	fdirvlan = IXGBE_STORE_AS_BE16(input->formatted.flex_bytes);
+	fdirvlan = IXGBE_STORE_AS_BE16((__force u16)input->formatted.flex_bytes);
 	fdirvlan <<= IXGBE_FDIRVLAN_FLEX_SHIFT;
 	fdirvlan |= ntohs(input->formatted.vlan_id);
 	IXGBE_WRITE_REG(hw, IXGBE_FDIRVLAN, fdirvlan);
 
 	/* configure FDIRHASH register */
-	fdirhash = input->formatted.bkt_hash;
+	fdirhash = (__force u32)input->formatted.bkt_hash;
 	fdirhash |= soft_id << IXGBE_FDIRHASH_SIG_SW_INDEX_SHIFT;
 	IXGBE_WRITE_REG(hw, IXGBE_FDIRHASH, fdirhash);
 
@@ -1724,7 +1725,7 @@ s32 ixgbe_fdir_erase_perfect_filter_82599(struct ixgbe_hw *hw,
 	s32 err;
 
 	/* configure FDIRHASH register */
-	fdirhash = input->formatted.bkt_hash;
+	fdirhash = (__force u32)input->formatted.bkt_hash;
 	fdirhash |= soft_id << IXGBE_FDIRHASH_SIG_SW_INDEX_SHIFT;
 	IXGBE_WRITE_REG(hw, IXGBE_FDIRHASH, fdirhash);
 
diff --git a/drivers/net/ethernet/intel/ixgbe/ixgbe_common.c b/drivers/net/ethernet/intel/ixgbe/ixgbe_common.c
index 633be93f3dbb..7db2722366c2 100644
--- a/drivers/net/ethernet/intel/ixgbe/ixgbe_common.c
+++ b/drivers/net/ethernet/intel/ixgbe/ixgbe_common.c
@@ -3652,7 +3652,7 @@ s32 ixgbe_hic_unlocked(struct ixgbe_hw *hw, u32 *buffer, u32 length,
 	 */
 	for (i = 0; i < dword_len; i++)
 		IXGBE_WRITE_REG_ARRAY(hw, IXGBE_FLEX_MNG,
-				      i, cpu_to_le32(buffer[i]));
+				      i, (__force u32)cpu_to_le32(buffer[i]));
 
 	/* Setting this bit tells the ARC that a new command is pending. */
 	IXGBE_WRITE_REG(hw, IXGBE_HICR, hicr | IXGBE_HICR_C);
diff --git a/drivers/net/ethernet/intel/ixgbe/ixgbe_fcoe.c b/drivers/net/ethernet/intel/ixgbe/ixgbe_fcoe.c
index 7a09a40e4472..4e4c5eeda50d 100644
--- a/drivers/net/ethernet/intel/ixgbe/ixgbe_fcoe.c
+++ b/drivers/net/ethernet/intel/ixgbe/ixgbe_fcoe.c
@@ -465,7 +465,7 @@ int ixgbe_fcoe_ddp(struct ixgbe_adapter *adapter,
 	case cpu_to_le32(IXGBE_RXDADV_STAT_FCSTAT_FCPRSP):
 		dma_unmap_sg(&adapter->pdev->dev, ddp->sgl,
 			     ddp->sgc, DMA_FROM_DEVICE);
-		ddp->err = ddp_err;
+		ddp->err = (__force u32)ddp_err;
 		ddp->sgl = NULL;
 		ddp->sgc = 0;
 		/* fall through */
diff --git a/drivers/net/ethernet/intel/ixgbe/ixgbe_ipsec.c b/drivers/net/ethernet/intel/ixgbe/ixgbe_ipsec.c
index 68af127987bc..33e8c588ff51 100644
--- a/drivers/net/ethernet/intel/ixgbe/ixgbe_ipsec.c
+++ b/drivers/net/ethernet/intel/ixgbe/ixgbe_ipsec.c
@@ -43,8 +43,9 @@ static void ixgbe_ipsec_set_tx_sa(struct ixgbe_hw *hw, u16 idx,
 	int i;
 
 	for (i = 0; i < 4; i++)
-		IXGBE_WRITE_REG(hw, IXGBE_IPSTXKEY(i), cpu_to_be32(key[3 - i]));
-	IXGBE_WRITE_REG(hw, IXGBE_IPSTXSALT, cpu_to_be32(salt));
+		IXGBE_WRITE_REG(hw, IXGBE_IPSTXKEY(i),
+				(__force u32)cpu_to_be32(key[3 - i]));
+	IXGBE_WRITE_REG(hw, IXGBE_IPSTXSALT, (__force u32)cpu_to_be32(salt));
 	IXGBE_WRITE_FLUSH(hw);
 
 	reg = IXGBE_READ_REG(hw, IXGBE_IPSTXIDX);
@@ -93,7 +94,8 @@ static void ixgbe_ipsec_set_rx_sa(struct ixgbe_hw *hw, u16 idx, __be32 spi,
 	int i;
 
 	/* store the SPI (in bigendian) and IPidx */
-	IXGBE_WRITE_REG(hw, IXGBE_IPSRXSPI, cpu_to_le32(spi));
+	IXGBE_WRITE_REG(hw, IXGBE_IPSRXSPI,
+			(__force u32)cpu_to_le32((__force u32)spi));
 	IXGBE_WRITE_REG(hw, IXGBE_IPSRXIPIDX, ip_idx);
 	IXGBE_WRITE_FLUSH(hw);
 
@@ -101,8 +103,9 @@ static void ixgbe_ipsec_set_rx_sa(struct ixgbe_hw *hw, u16 idx, __be32 spi,
 
 	/* store the key, salt, and mode */
 	for (i = 0; i < 4; i++)
-		IXGBE_WRITE_REG(hw, IXGBE_IPSRXKEY(i), cpu_to_be32(key[3 - i]));
-	IXGBE_WRITE_REG(hw, IXGBE_IPSRXSALT, cpu_to_be32(salt));
+		IXGBE_WRITE_REG(hw, IXGBE_IPSRXKEY(i),
+				(__force u32)cpu_to_be32(key[3 - i]));
+	IXGBE_WRITE_REG(hw, IXGBE_IPSRXSALT, (__force u32)cpu_to_be32(salt));
 	IXGBE_WRITE_REG(hw, IXGBE_IPSRXMOD, mode);
 	IXGBE_WRITE_FLUSH(hw);
 
@@ -121,7 +124,8 @@ static void ixgbe_ipsec_set_rx_ip(struct ixgbe_hw *hw, u16 idx, __be32 addr[])
 
 	/* store the ip address */
 	for (i = 0; i < 4; i++)
-		IXGBE_WRITE_REG(hw, IXGBE_IPSRXIPADDR(i), cpu_to_le32(addr[i]));
+		IXGBE_WRITE_REG(hw, IXGBE_IPSRXIPADDR(i),
+				(__force u32)cpu_to_le32((__force u32)addr[i]));
 	IXGBE_WRITE_FLUSH(hw);
 
 	ixgbe_ipsec_set_rx_item(hw, idx, ips_rx_ip_tbl);
@@ -391,7 +395,8 @@ static struct xfrm_state *ixgbe_ipsec_find_rx_state(struct ixgbe_ipsec *ipsec,
 	struct xfrm_state *ret = NULL;
 
 	rcu_read_lock();
-	hash_for_each_possible_rcu(ipsec->rx_sa_list, rsa, hlist, spi)
+	hash_for_each_possible_rcu(ipsec->rx_sa_list, rsa, hlist,
+				   (__force u32)spi) {
 		if (spi == rsa->xs->id.spi &&
 		    ((ip4 && *daddr == rsa->xs->id.daddr.a4) ||
 		      (!ip4 && !memcmp(daddr, &rsa->xs->id.daddr.a6,
@@ -401,6 +406,7 @@ static struct xfrm_state *ixgbe_ipsec_find_rx_state(struct ixgbe_ipsec *ipsec,
 			xfrm_state_hold(ret);
 			break;
 		}
+	}
 	rcu_read_unlock();
 	return ret;
 }
@@ -593,7 +599,7 @@ static int ixgbe_ipsec_add_sa(struct xfrm_state *xs)
 
 		/* hash the new entry for faster search in Rx path */
 		hash_add_rcu(ipsec->rx_sa_list, &ipsec->rx_tbl[sa_idx].hlist,
-			     rsa.xs->id.spi);
+			     (__force u64)rsa.xs->id.spi);
 	} else {
 		struct tx_sa tsa;
 
@@ -677,7 +683,8 @@ static void ixgbe_ipsec_del_sa(struct xfrm_state *xs)
 			if (!ipsec->ip_tbl[ipi].ref_cnt) {
 				memset(&ipsec->ip_tbl[ipi], 0,
 				       sizeof(struct rx_ip_sa));
-				ixgbe_ipsec_set_rx_ip(hw, ipi, zerobuf);
+				ixgbe_ipsec_set_rx_ip(hw, ipi,
+						      (__force __be32 *)zerobuf);
 			}
 		}
 
diff --git a/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c b/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c
index afadba99f7b8..ef40b226edfc 100644
--- a/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c
+++ b/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c
@@ -752,8 +752,8 @@ static void ixgbe_dump(struct ixgbe_adapter *adapter)
 					ring_desc = "";
 				pr_info("T [0x%03X]    %016llX %016llX %016llX %08X %p %016llX %p%s",
 					i,
-					le64_to_cpu(u0->a),
-					le64_to_cpu(u0->b),
+					le64_to_cpu((__force __le64)u0->a),
+					le64_to_cpu((__force __le64)u0->b),
 					(u64)dma_unmap_addr(tx_buffer, dma),
 					dma_unmap_len(tx_buffer, len),
 					tx_buffer->next_to_watch,
@@ -864,15 +864,15 @@ static void ixgbe_dump(struct ixgbe_adapter *adapter)
 				/* Descriptor Done */
 				pr_info("RWB[0x%03X]     %016llX %016llX ---------------- %p%s\n",
 					i,
-					le64_to_cpu(u0->a),
-					le64_to_cpu(u0->b),
+					le64_to_cpu((__force __le64)u0->a),
+					le64_to_cpu((__force __le64)u0->b),
 					rx_buffer_info->skb,
 					ring_desc);
 			} else {
 				pr_info("R  [0x%03X]     %016llX %016llX %016llX %p%s\n",
 					i,
-					le64_to_cpu(u0->a),
-					le64_to_cpu(u0->b),
+					le64_to_cpu((__force __le64)u0->a),
+					le64_to_cpu((__force __le64)u0->b),
 					(u64)rx_buffer_info->dma,
 					rx_buffer_info->skb,
 					ring_desc);
@@ -7801,7 +7801,7 @@ static int ixgbe_tso(struct ixgbe_ring *tx_ring,
 
 	/* remove payload length from inner checksum */
 	paylen = skb->len - l4_offset;
-	csum_replace_by_diff(&l4.tcp->check, htonl(paylen));
+	csum_replace_by_diff(&l4.tcp->check, (__force __wsum)htonl(paylen));
 
 	/* update gso size and bytecount with header size */
 	first->gso_segs = skb_shinfo(skb)->gso_segs;
@@ -9109,7 +9109,8 @@ static int ixgbe_clsu32_build_input(struct ixgbe_fdir_filter *input,
 
 		for (j = 0; field_ptr[j].val; j++) {
 			if (field_ptr[j].off == off) {
-				field_ptr[j].val(input, mask, val, m);
+				field_ptr[j].val(input, mask, (__force u32)val,
+						 (__force u32)m);
 				input->filter.formatted.flow_type |=
 					field_ptr[j].type;
 				found_entry = true;
@@ -9118,8 +9119,10 @@ static int ixgbe_clsu32_build_input(struct ixgbe_fdir_filter *input,
 		}
 		if (nexthdr) {
 			if (nexthdr->off == cls->knode.sel->keys[i].off &&
-			    nexthdr->val == cls->knode.sel->keys[i].val &&
-			    nexthdr->mask == cls->knode.sel->keys[i].mask)
+			    nexthdr->val ==
+			    (__force u32)cls->knode.sel->keys[i].val &&
+			    nexthdr->mask ==
+			    (__force u32)cls->knode.sel->keys[i].mask)
 				found_jump_field = true;
 			else
 				continue;
@@ -9223,7 +9226,8 @@ static int ixgbe_configure_clsu32(struct ixgbe_adapter *adapter,
 		for (i = 0; nexthdr[i].jump; i++) {
 			if (nexthdr[i].o != cls->knode.sel->offoff ||
 			    nexthdr[i].s != cls->knode.sel->offshift ||
-			    nexthdr[i].m != cls->knode.sel->offmask)
+			    nexthdr[i].m !=
+			    (__force u32)cls->knode.sel->offmask)
 				return err;
 
 			jump = kzalloc(sizeof(*jump), GFP_KERNEL);
@@ -9970,7 +9974,8 @@ static int ixgbe_xdp_setup(struct net_device *dev, struct bpf_prog *prog)
 		}
 	} else {
 		for (i = 0; i < adapter->num_rx_queues; i++)
-			xchg(&adapter->rx_ring[i]->xdp_prog, adapter->xdp_prog);
+			(void)xchg(&adapter->rx_ring[i]->xdp_prog,
+			    adapter->xdp_prog);
 	}
 
 	if (old_prog)
diff --git a/drivers/net/ethernet/intel/ixgbe/ixgbe_model.h b/drivers/net/ethernet/intel/ixgbe/ixgbe_model.h
index 72446644f9fa..57de21a299ea 100644
--- a/drivers/net/ethernet/intel/ixgbe/ixgbe_model.h
+++ b/drivers/net/ethernet/intel/ixgbe/ixgbe_model.h
@@ -53,8 +53,8 @@ static inline int ixgbe_mat_prgm_sip(struct ixgbe_fdir_filter *input,
 				     union ixgbe_atr_input *mask,
 				     u32 val, u32 m)
 {
-	input->filter.formatted.src_ip[0] = val;
-	mask->formatted.src_ip[0] = m;
+	input->filter.formatted.src_ip[0] = (__force __be32)val;
+	mask->formatted.src_ip[0] = (__force __be32)m;
 	return 0;
 }
 
@@ -62,8 +62,8 @@ static inline int ixgbe_mat_prgm_dip(struct ixgbe_fdir_filter *input,
 				     union ixgbe_atr_input *mask,
 				     u32 val, u32 m)
 {
-	input->filter.formatted.dst_ip[0] = val;
-	mask->formatted.dst_ip[0] = m;
+	input->filter.formatted.dst_ip[0] = (__force __be32)val;
+	mask->formatted.dst_ip[0] = (__force __be32)m;
 	return 0;
 }
 
@@ -79,10 +79,10 @@ static inline int ixgbe_mat_prgm_ports(struct ixgbe_fdir_filter *input,
 				       union ixgbe_atr_input *mask,
 				       u32 val, u32 m)
 {
-	input->filter.formatted.src_port = val & 0xffff;
-	mask->formatted.src_port = m & 0xffff;
-	input->filter.formatted.dst_port = val >> 16;
-	mask->formatted.dst_port = m >> 16;
+	input->filter.formatted.src_port = (__force __be16)(val & 0xffff);
+	mask->formatted.src_port = (__force __be16)(m & 0xffff);
+	input->filter.formatted.dst_port = (__force __be16)(val >> 16);
+	mask->formatted.dst_port = (__force __be16)(m >> 16);
 
 	return 0;
 };
diff --git a/drivers/net/ethernet/intel/ixgbe/ixgbe_x550.c b/drivers/net/ethernet/intel/ixgbe/ixgbe_x550.c
index 3123267dfba9..336f3218177a 100644
--- a/drivers/net/ethernet/intel/ixgbe/ixgbe_x550.c
+++ b/drivers/net/ethernet/intel/ixgbe/ixgbe_x550.c
@@ -898,8 +898,9 @@ static s32 ixgbe_read_ee_hostif_buffer_X550(struct ixgbe_hw *hw,
 		buffer.hdr.req.checksum = FW_DEFAULT_CHECKSUM;
 
 		/* convert offset from words to bytes */
-		buffer.address = cpu_to_be32((offset + current_word) * 2);
-		buffer.length = cpu_to_be16(words_to_read * 2);
+		buffer.address = (__force u32)cpu_to_be32((offset +
+							   current_word) * 2);
+		buffer.length = (__force u16)cpu_to_be16(words_to_read * 2);
 		buffer.pad2 = 0;
 		buffer.pad3 = 0;
 
@@ -1109,9 +1110,9 @@ static s32 ixgbe_read_ee_hostif_X550(struct ixgbe_hw *hw, u16 offset, u16 *data)
 	buffer.hdr.req.checksum = FW_DEFAULT_CHECKSUM;
 
 	/* convert offset from words to bytes */
-	buffer.address = cpu_to_be32(offset * 2);
+	buffer.address = (__force u32)cpu_to_be32(offset * 2);
 	/* one word */
-	buffer.length = cpu_to_be16(sizeof(u16));
+	buffer.length = (__force u16)cpu_to_be16(sizeof(u16));
 
 	status = hw->mac.ops.acquire_swfw_sync(hw, mask);
 	if (status)
-- 
2.11.1


* Re: SRIOV switchdev mode BoF minutes
From: Or Gerlitz @ 2018-04-13 20:16 UTC (permalink / raw)
  To: Samudrala, Sridhar
  Cc: David Miller, Anjali Singhai Jain, Andy Gospodarek, Michael Chan,
	Simon Horman, Jakub Kicinski, John Fastabend, Saeed Mahameed,
	Jiri Pirko, Rony Efraim, Linux Netdev List
In-Reply-To: <bec866ff-2898-72b3-247d-f64189ef6d0f@intel.com>

On Fri, Apr 13, 2018 at 7:49 PM, Samudrala, Sridhar
<sridhar.samudrala@intel.com> wrote:
> On 4/13/2018 1:57 AM, Or Gerlitz wrote:

>>> in  overlay networks scheme, the uplink rep has the VTEP ip and is not connected
>>> to the bridge, e.g you use ovs you have vf reps and vxlan ports connected
>>> to ovs and the ip stack routes through the uplink rep

> This changes the legacy mode behavior of configuring  vtep ip on the pf
> netdev. How does host to host traffic expected to work when vtep ip is moved to uplink rep?

What do you mean by host-to-host traffic? Is that two VFs on the same host?
Control plane software (such as OVS) doesn't apply encapsulation within the same host.

>>>> What about pf-rep?

> Are you planning to create a pf-rep too? Is pf also treated similar to vf in
> switchdev mode?
> All pf traffic goes to pf-rep and pf-rep traffic goes to pf by default
> without any rules programmed?

At the SR-IOV switchdev architecture level, pf/pf-rep would indeed work as you described.

We will have a pf rep for smartnic schemes, where the pf on the host
is not the manager of the eswitch; the smartnic driver instance is.

In a non-smartnic environment, there are some challenges to address
before the pf nic is fully functional for the slow path (what you
described); we will get there down the road if there is a real need.


* Re: v6/sit tunnels and VRFs
From: Jeff Barnhill @ 2018-04-13 20:23 UTC (permalink / raw)
  To: David Ahern; +Cc: netdev
In-Reply-To: <66ebbad0-68e4-ba2d-ffb6-2a8057e72b04@gmail.com>

Thanks for the response, David. I'm not questioning the need to stop
the fib lookup once the end of the VRF table is reached - I agree that
is needed.  I'm concerned with the difference in the response/error
returned from the failed lookup.

For instance, with the VRF "unreachable default" route in place, I get this:
# ip route get 1.1.1.1 vrf vrf_258
RTNETLINK answers: No route to host

Without it (and assuming no match for 1.1.1.1 in local/main/default
tables), I get this:
# ip route get 1.1.1.1 vrf vrf_258
RTNETLINK answers: Network is unreachable

Which is also what happens when not using VRFs at all.

It seems that the ENETUNREACH response is still desirable in the VRF
case since the only difference (when using VRF vs. not) is that the
lookup should be restrained to a specific VRF.
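
To make the two messages above easier to tell apart, here is a minimal
sketch that maps the errno names to the text that iproute2 prints after
"RTNETLINK answers:". The helper name describe() is hypothetical, and the
numeric values and message strings depend on the platform's C library.

```python
import errno
import os


def describe(name):
    """Return (numeric code, C library message) for an errno name."""
    code = getattr(errno, name)
    return code, os.strerror(code)


if __name__ == "__main__":
    # ENETUNREACH: the lookup ran out of tables without finding a route.
    # EHOSTUNREACH: the lookup matched an "unreachable" route.
    for name in ("ENETUNREACH", "EHOSTUNREACH"):
        code, message = describe(name)
        print(f"{name} = {code}: {message}")
```

On Linux the ENETUNREACH value should line up with the "error -101" shown
on the IPv6 unreachable default route quoted below.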

Jeff



On Thu, Apr 12, 2018 at 10:25 PM, David Ahern <dsahern@gmail.com> wrote:
> On 4/12/18 10:54 AM, Jeff Barnhill wrote:
>> Hi David,
>>
>> In the slides referenced, you recommend adding an "unreachable
>> default" route to the end of each VRF route table.  In my testing (for
>> v4) this results in a change to fib lookup failures such that instead
>> of ENETUNREACH being returned, EHOSTUNREACH is returned since the fib
>> finds the unreachable route, versus failing to find a route
>> altogether.
>>
>> Have the implications of this been considered?  I don't see a
>> clean/easy way to achieve the old behavior without affecting non-VRF
>> routing (eg. remove the unreachable route and delete the non-VRF
>> rules).  I'm guessing that programmatically, it may not make much
>> difference, ie. lookup fails, but for debugging or to a user looking
>> at it, the difference matters.  Do you (or anyone else) have any
>> thoughts on this?
>
> We have recommended moving the local table down in the FIB rules:
>
> # ip ru ls
> 1000:   from all lookup [l3mdev-table]
> 32765:  from all lookup local
> 32766:  from all lookup main
> 32767:  from all lookup default
>
> and adding a default route to VRF tables:
>
> # ip ro ls vrf red
> unreachable default  metric 4278198272
> 172.16.2.0/24  proto bgp  metric 20
>         nexthop via 169.254.0.1  dev swp3 weight 1 onlink
>         nexthop via 169.254.0.1  dev swp4 weight 1 onlink
>
> # ip -6 ro ls vrf red
> 2001:db8:2::/64  proto bgp  metric 20
>         nexthop via fe80::202:ff:fe00:e  dev swp3 weight 1
>         nexthop via fe80::202:ff:fe00:f  dev swp4 weight 1
> anycast fe80:: dev lo  proto kernel  metric 0  pref medium
> anycast fe80:: dev lo  proto kernel  metric 0  pref medium
> fe80::/64 dev swp3  proto kernel  metric 256  pref medium
> fe80::/64 dev swp4  proto kernel  metric 256  pref medium
> ff00::/8 dev swp3  metric 256  pref medium
> ff00::/8 dev swp4  metric 256  pref medium
> unreachable default dev lo  metric 4278198272  error -101 pref medium
>
> Over the last 2 years we have not seen any negative side effects to this
> and is what you want to have proper VRF separation.
>
> Without a default route lookups will proceed to the next fib rule which
> means a lookup in the next table and barring other PBR rules will be the
> main table. It will lead to wrong lookups.
>
> Here is an example:
>   ip netns add foo
>   ip netns exec foo bash
>   ip li set lo up
>   ip li add red type vrf table 123
>   ip li set red up
>   ip li add dummy1 type dummy
>   ip addr add 10.100.1.1/24 dev dummy1
>   ip li set dummy1 master red
>   ip li set dummy1 up
>   ip li add dummy2 type dummy
>   ip addr add 10.100.1.1/24 dev dummy2
>   ip li set dummy2 up
>   ip ro get 10.100.2.2
>   ip ro get 10.100.2.2 vrf red
>
> # ip ru ls
> 0:      from all lookup local
> 1000:   from all lookup [l3mdev-table]
> 32766:  from all lookup main
> 32767:  from all lookup default
>
> # ip ro ls
> 10.100.1.0/24 dev dummy2 proto kernel scope link src 10.100.1.1
> 10.100.2.0/24 via 10.100.1.2 dev dummy2
>
> # ip ro ls vrf red
> 10.100.1.0/24 dev dummy1 proto kernel scope link src 10.100.1.1
>
> That's the setup. What happens on route lookups?
> # ip ro get vrf red 10.100.2.1
> 10.100.2.1 via 10.100.1.2 dev dummy2 src 10.100.1.1 uid 0
>     cache
>
> which is clearly wrong. Let's look at the lookup sequence
>
> # perf record -e fib:* ip ro get vrf red 10.100.2.1
> 10.100.2.1 via 10.100.1.2 dev dummy2 src 10.100.1.1 uid 0
>     cache
> [ perf record: Woken up 1 times to write data ]
> [ perf record: Captured and wrote 0.003 MB perf.data (4 samples) ]
>
> #  perf script --fields trace:trace
> table 255 oif 2 iif 1 src 0.0.0.0 dst 10.100.2.1 tos 0 scope 0 flags 4
> table 123 oif 2 iif 1 src 0.0.0.0 dst 10.100.2.1 tos 0 scope 0 flags 4
> table 254 oif 2 iif 1 src 0.0.0.0 dst 10.100.2.1 tos 0 scope 0 flags 4
> nexthop dev dummy2 oif 4 src 10.100.1.1
>
> The first one is because I did not move the local table down.
> The second one is the correct vrf lookup
> The third one is the continuation to the next table - the main table.
>
> Adding a default route:
> # ip ro add vrf red unreachable default
>
> And the lookup is proper:
> # ip ro get vrf red 10.100.2.1
> RTNETLINK answers: No route to host


* Re: v6/sit tunnels and VRFs
From: David Ahern @ 2018-04-13 20:31 UTC (permalink / raw)
  To: Jeff Barnhill; +Cc: netdev
In-Reply-To: <CAL6e_pdihSXvYFQQ0MAXFtw=eUjJPfR9kS9j5QQNSqCNzdQteg@mail.gmail.com>

On 4/13/18 2:23 PM, Jeff Barnhill wrote:
> It seems that the ENETUNREACH response is still desirable in the VRF
> case since the only difference (when using VRF vs. not) is that the
> lookup should be restrained to a specific VRF.

VRF is just policy routing to a table. If the table wants the lookup to
stop, then it needs a default route. What you are referring to is the
case where the lookup goes through all tables, does not find an answer,
and so fails with -ENETUNREACH. I do not know of any way to make that
happen with the existing default route options, and in the past 2+ years
we have not hit any s/w that discriminates -ENETUNREACH from -EHOSTUNREACH.

I take it this is code from your internal code base. Why does it care
about the difference between those two failures?
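
For illustration only, the behaviour described above can be modelled with
a toy rule walk. This is a simplified sketch, not the kernel's FIB code:
the function lookup(), the table layout, and the "unreachable" marker are
all hypothetical, and real rules carry selectors that are ignored here.
Tables are tried in rule order, a table with no match falls through to
the next rule, an unreachable route stops the walk with EHOSTUNREACH, and
running out of rules gives ENETUNREACH.

```python
import errno
import ipaddress


def lookup(rules, tables, dst):
    """Walk tables in rule order; return (errno or 0, matched action)."""
    addr = ipaddress.ip_address(dst)
    for table_name in rules:
        best = None
        for prefix, action in tables.get(table_name, []):
            net = ipaddress.ip_network(prefix)
            # Longest-prefix match within a single table.
            if addr in net and (best is None or net.prefixlen > best[0].prefixlen):
                best = (net, action)
        if best is None:
            continue  # no match here: fall through to the next rule
        if best[1] == "unreachable":
            return errno.EHOSTUNREACH, None  # lookup stops in this table
        return 0, best[1]
    return errno.ENETUNREACH, None  # every table was tried, nothing matched


# Tables loosely based on the quoted example (VRF "red" plus main).
tables = {
    "red": [("10.100.1.0/24", "dev dummy1")],
    "main": [
        ("10.100.1.0/24", "dev dummy2"),
        ("10.100.2.0/24", "via 10.100.1.2 dev dummy2"),
    ],
}
rules = ["red", "main"]

# Without a default route in "red", the lookup leaks into main.
leaked = lookup(rules, tables, "10.100.2.1")

# With an unreachable default, the lookup stops inside the VRF table.
tables["red"].append(("0.0.0.0/0", "unreachable"))
contained = lookup(rules, tables, "10.100.2.1")
```

In this model the only way to get ENETUNREACH is to fall off the end of
the rule list, which is exactly what the unreachable default prevents.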

