* Re: [PATCH] RDS: Heap OOB write in rds_message_alloc_sgs()
From: David Miller @ 2018-01-03 16:24 UTC (permalink / raw)
To: simo.ghannam; +Cc: netdev
In-Reply-To: <5a4be135.1296df0a.90c9.baea@mx.google.com>
From: simo.ghannam@gmail.com
Date: Tue, 2 Jan 2018 19:44:34 +0000
> From: Mohamed Ghannam <simo.ghannam@gmail.com>
>
> When args->nr_local is 0, nr_pages gets also 0 due some size
> calculation via rds_rm_size(), which is later used to allocate
> pages for DMA, this bug produces a heap Out-Of-Bound write access
> to a specific memory region.
>
> Signed-off-by: Mohamed Ghannam <simo.ghannam@gmail.com>
Applied and queued up for -stable, thanks.
^ permalink raw reply
* Re: [net 0/2][pull request] Intel Wired LAN Driver Updates 2018-01-02
From: David Miller @ 2018-01-03 16:26 UTC (permalink / raw)
To: jeffrey.t.kirsher; +Cc: netdev, nhorman, sassmann, jogreene
In-Reply-To: <20180102202045.72780-1-jeffrey.t.kirsher@intel.com>
From: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
Date: Tue, 2 Jan 2018 12:20:43 -0800
> This series contains fixes for e1000 and e1000e.
>
> Tushar Dave adds a check to the driver so that it won't attempt to disable a
> device that is already disabled for e1000.
>
> Benjamin Poirier provides a fix to e1000e, where a previous commit that
> Benjamin submitted changed the meaning of the return value for
> "check_for_link" for copper media and not all the instances were properly
> updated. Benjamin fixes the remaining instances that needed the change.
Pulled, thanks Jeff.
^ permalink raw reply
* Re: [PATCH net] sctp: fix error path in sctp_stream_init
From: David Miller @ 2018-01-03 16:30 UTC (permalink / raw)
To: marcelo.leitner; +Cc: netdev, linux-sctp, lucien.xin, vyasevich, nhorman
In-Reply-To: <74b65f56d6912704f3ef79a46242cc2ca567e4e3.1514928664.git.marcelo.leitner@gmail.com>
From: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com>
Date: Tue, 2 Jan 2018 19:44:37 -0200
> syzbot noticed a NULL pointer dereference panic in sctp_stream_free()
> which was caused by an incomplete error handling in sctp_stream_init().
> By not clearing stream->outcnt, it made a for() in sctp_stream_free()
> think that it had elements to free, but not, leading to the panic.
>
> As suggested by Xin Long, this patch also simplifies the error path by
> moving it to the only if() that uses it.
>
> See-also: https://www.spinics.net/lists/netdev/msg473756.html
> See-also: https://www.spinics.net/lists/netdev/msg465024.html
> Reported-by: syzbot <syzkaller@googlegroups.com>
> Fixes: f952be79cebd ("sctp: introduce struct sctp_stream_out_ext")
> Signed-off-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com>
Applied, thank you.
^ permalink raw reply
* Re: [PATCH net] vxlan: trivial indenting fix.
From: David Miller @ 2018-01-03 16:33 UTC (permalink / raw)
To: u9012063; +Cc: netdev
In-Reply-To: <1514930719-57528-1-git-send-email-u9012063@gmail.com>
From: William Tu <u9012063@gmail.com>
Date: Tue, 2 Jan 2018 14:05:19 -0800
> Fix indentation of reserved_flags2 field in vxlanhdr_gpe.
>
> Signed-off-by: William Tu <u9012063@gmail.com>
Applied, thanks William.
^ permalink raw reply
* Re: [PATCH net-next 2/2] l2tp: add peer_offset parameter
From: Guillaume Nault @ 2018-01-03 16:35 UTC (permalink / raw)
To: Lorenzo Bianconi; +Cc: James Chapman, David S. Miller, netdev, Hangbin Liu
In-Reply-To: <CAJ0CqmWkeb3sKGWy+xgiwCxP=MHA9Zs7cq9-F+6tTAON47R4Jg@mail.gmail.com>
On Wed, Jan 03, 2018 at 04:06:28PM +0100, Lorenzo Bianconi wrote:
> I agree to remove offset parameter in this case. What about (as
> already suggested by James) to take into account possible alignment
> issues with previous version of L2TPv3 protocol using 'L2 specific
> sublayer'?
>
I think James was refering to the general architecture of L2TPv3, where
such issues should be handled by pseudo-wire specific headers. I don't
think he was talking about implementing arbitrary padding using an
L2 specific sublayer. None of the standardised headers allow arbitrary
padding. And implementing our own would make us imcompatible with any
other implementation.
> I guess, on the kernel side (we will need to patch iproute2 on
> userspace side), we need just to properly initialized the 'l2specific'
> field to 0 since otherwise we will have the same memleak issue there
> if assume we can have l2specific_len != {0,4}.
>
That would produce the same frame format as what the 'offset' option
was supposed to produce (if it did properly initialise its padding
bits). That is, we'd have an arbitrary number of padding bits inserted
between the l2-specific header and the l2 frame (L2TP's payload). These
frames are invalid, and doing that brings us nothing compared to
keeping the offset option.
> Moreover does it worth to add some sanity checks in netlink code to
> enforce the relation between l2specific_len and l2specific_type?
>
Yes, certainly.
> At
> the moment there are no guarantee that if l2specific_type is set to
> L2TP_L2SPECTYPE_DEFAULT, l2specific_len will be grater or equal than
> 4.
>
Thanks for pointing this out. That needs to be fixed. We don't support
anything but the default l2spec layer (or no l2spec layer at all). So
we should only accept L2TP_L2SPECTYPE_NONE and L2TP_L2SPECTYPE_DEFAULT.
And we shouldn't need to use l2specific_len at all. The default l2spec
header is always four bytes long, so we don't need to rely on a user
supplied value. Looking at the code, it seems that invalid usage of
L2TP_ATTR_L2SPEC_LEN allows leaking memory on the network or sending
corrupted frames (depending on if its value is too small or too big).
Do you want to take care of this?
^ permalink raw reply
* Re: [PATCH net-next 1/2] virtio_net: Introduce VIRTIO_NET_F_MASTER feature bit
From: David Miller @ 2018-01-03 16:36 UTC (permalink / raw)
To: sridhar.samudrala
Cc: mst, stephen, netdev, virtualization, virtio-dev,
jesse.brandeburg
In-Reply-To: <1514939738-22423-2-git-send-email-sridhar.samudrala@intel.com>
From: Sridhar Samudrala <sridhar.samudrala@intel.com>
Date: Tue, 2 Jan 2018 16:35:37 -0800
> diff --git a/include/uapi/linux/virtio_net.h b/include/uapi/linux/virtio_net.h
> index fc353b518288..a9b4e0836786 100644
> --- a/include/uapi/linux/virtio_net.h
> +++ b/include/uapi/linux/virtio_net.h
> @@ -56,6 +56,7 @@
> #define VIRTIO_NET_F_MQ 22 /* Device supports Receive Flow
> * Steering */
> #define VIRTIO_NET_F_CTRL_MAC_ADDR 23 /* Set MAC address */
> +#define VIRTIO_NET_F_MASTER 62 /* act as master for a VF device */
Please respin this series with something in this commit message which
explains why we are using "62" rather than something like "24" here.
Thank you.
^ permalink raw reply
* Re: [PATCH v2 net,stable 0/2] net: fec: clean up in the cases of probe error
From: David Miller @ 2018-01-03 16:40 UTC (permalink / raw)
To: fugang.duan; +Cc: festevam, netdev, troy.kisky, andrew
In-Reply-To: <1514947170-8887-1-git-send-email-fugang.duan@nxp.com>
From: Fugang Duan <fugang.duan@nxp.com>
Date: Wed, 3 Jan 2018 10:39:28 +0800
> The simple patches just clean up in the cases of probe error like restore dev_id and
> handle the defer probe when regulator is still not ready.
>
> v2:
> * Fabio Estevam's comment to suggest split v1 to separate patches.
Series applied and queued up for -stable, thanks.
^ permalink raw reply
* Re: [PATCHv2] 3c59x: fix missing dma_mapping_error check and bad ring refill logic
From: Neil Horman @ 2018-01-03 16:41 UTC (permalink / raw)
To: David Miller; +Cc: nhorman, netdev, tedheadster, klassert
In-Reply-To: <20180103.102831.1474605321968855786.davem@davemloft.net>
On Wed, Jan 03, 2018 at 10:28:31AM -0500, David Miller wrote:
> From: David Miller <davem@davemloft.net>
> Date: Wed, 03 Jan 2018 10:26:06 -0500 (EST)
>
> > From: Neil Horman <nhorman@redhat.com>
> > Date: Wed, 3 Jan 2018 10:13:33 -0500
> >
> >> Thats exactly what this patch does, instead of creating a second loop to
> >> traverse all the emptied ring buffers, now I:
> >>
> >> 1) Pre-allocate a new skb when I know I'm going to receive the in-place skb
> >> 2) Map the skb into the appropriate dma device domain
> >> 3) If (1) and (2) succede, then I swap the newly allocate skb and dma address
> >> with the old one and recieve the old into the network stack
> >> 4) If (1) or (2) fail, then I goto clear_complete, which leaves the old skb and
> >> dma address in place, sets the buffer status back to 0 (indicating completion),
> >> and write the new ring status back to the hardware
> >>
> >> This is what you wanted, a pre-allocate and swap-if-successful, recycle-if-not
> >> approach, rather than the leave-a-hole-in-the-ring approach that is there
> >> currently, no? Or did I miss something else?
> >
> > I misread the code sorry, you're absolutely right.
> >
> > I'll apply this patch, thanks Neil. :)
>
> Hmmm, maybe we need a V3 after all :)
>
> CC [M] drivers/net/ethernet/3com/3c59x.o
> drivers/net/ethernet/3com/3c59x.c: In function ‘boomerang_rx’:
> drivers/net/ethernet/3com/3c59x.c:2605:13: warning: unused variable ‘dma’ [-Wunused-variable]
> dma_addr_t dma;
> ^~~
Thats....odd, I built it twice here, and it didn't bomb out. I must not have
-Werror enabled, apologies. I'll respin
Neil
^ permalink raw reply
* Re: [RFC PATCH net-next 03/19] ipv6: Clear nexthop flags upon netdev up
From: Ido Schimmel @ 2018-01-03 16:43 UTC (permalink / raw)
To: David Ahern; +Cc: Ido Schimmel, netdev, davem, roopa, nicolas.dichtel, mlxsw
In-Reply-To: <df0a7a4f-8e7b-e97a-edc7-572c74d9e7ef@gmail.com>
On Wed, Jan 03, 2018 at 08:32:51AM -0700, David Ahern wrote:
> On 1/3/18 12:44 AM, Ido Schimmel wrote:
> >>> diff --git a/net/ipv6/addrconf.c b/net/ipv6/addrconf.c
> >>> index ed06b1190f05..b6405568ed7b 100644
> >>> --- a/net/ipv6/addrconf.c
> >>> +++ b/net/ipv6/addrconf.c
> >>> @@ -3484,6 +3484,9 @@ static int addrconf_notify(struct notifier_block *this, unsigned long event,
> >>> if (run_pending)
> >>> addrconf_dad_run(idev);
> >>>
> >>> + /* Device has an address by now */
> >>> + rt6_sync_up(dev, RTNH_F_DEAD);
> >>> +
> >>
> >> Seems like this should be in the NETDEV_UP section, say after
> >> addrconf_permanent_addr.
> >
> > Unless the `keep_addr_on_down` sysctl is set, then at this stage the
> > netdev doesn't have an IP address and we shouldn't clear the dead flag
> > just yet.
> >
> > This is consistent with IPv4 that clears the dead flag from nexthops in
> > a multipath route only if the nexthop device has an IP address. When the
> > last IPv4 address is removed from a netdev all the routes using it are
> > flushed and there's nothing to clear upon NETDEV_UP.
>
> I have a bug about that IPv4 handling from the FRR team:
>
> $ ip link add dummy1 type dummy
> $ ip li set dummy1 up
> $ ip route add 1.1.1.0/24 dev dummy1
>
> $ ip addr add dev dummy1 2.2.2.1/24
> $ ip ro ls | grep dummy1
> 1.1.1.0/24 dev dummy1 scope link
> 2.2.2.0/24 dev dummy1 proto kernel scope link src 2.2.2.1
>
> $ ip addr del dev dummy1 2.2.2.1/24
> $ ip ro ls | grep dummy1
> <no outpu>
>
> The 1.1.1.0/24 route was removed as well the 2.2.2.0 connected route.
If you're going to skip the flushing in this case, at least mark the
nexthops as dead.
And this is my second reason to have rt6_sync_up() where I put it. I'm
preparing another set which sends FIB_EVENT_NH_ADD events from
rt6_sync_up() similar to what we've in fib_sync_up(). When mlxsw (others
in the future) processes the event it needs to add the nexthop back to
the forwarding plane. To do that, it needs to have a RIF for the
nexthop device. For the nexthop device to have a RIF, it needs at least
one IP address configured on the netdev.
Agree / disagree?
^ permalink raw reply
* Re: [PATCHv4 0/2] capability controlled user-namespaces
From: Eric W. Biederman @ 2018-01-03 16:44 UTC (permalink / raw)
To: Mahesh Bandewar
Cc: LKML, James Morris, Netdev, Kernel-hardening, Linux API,
Linux Security, Serge Hallyn, Michael Kerrisk, Kees Cook,
Eric Dumazet, David Miller, Mahesh Bandewar
In-Reply-To: <20180103072642.161742-1-mahesh@bandewar.net>
Mahesh Bandewar <mahesh@bandewar.net> writes:
> From: Mahesh Bandewar <maheshb@google.com>
>
> TL;DR version
> -------------
> Creating a sandbox environment with namespaces is challenging
> considering what these sandboxed processes can engage into. e.g.
> CVE-2017-6074, CVE-2017-7184, CVE-2017-7308 etc. just to name few.
> Current form of user-namespaces, however, if changed a bit can allow
> us to create a sandbox environment without locking down user-
> namespaces.
In other conversations it appears it has been pointed out that user
namespaces are not necessarily safe under no_new_privs. In theory
user namespaces should be safe but in practice not so much.
So let me ask. Would your concerns be addressed if we simply made
creation and joining of user namespaces impossible in a no_new_privs
sandbox?
Eric
>
> Detailed version
> ----------------
>
> Problem
> -------
> User-namespaces in the current form have increased the attack surface as
> any process can acquire capabilities which are not available to them (by
> default) by performing combination of clone()/unshare()/setns() syscalls.
>
> #define _GNU_SOURCE
> #include <stdio.h>
> #include <sched.h>
> #include <netinet/in.h>
>
> int main(int ac, char **av)
> {
> int sock = -1;
>
> printf("Attempting to open RAW socket before unshare()...\n");
> sock = socket(AF_INET6, SOCK_RAW, IPPROTO_RAW);
> if (sock < 0) {
> perror("socket() SOCK_RAW failed: ");
> } else {
> printf("Successfully opened RAW-Sock before unshare().\n");
> close(sock);
> sock = -1;
> }
>
> if (unshare(CLONE_NEWUSER | CLONE_NEWNET) < 0) {
> perror("unshare() failed: ");
> return 1;
> }
>
> printf("Attempting to open RAW socket after unshare()...\n");
> sock = socket(AF_INET6, SOCK_RAW, IPPROTO_RAW);
> if (sock < 0) {
> perror("socket() SOCK_RAW failed: ");
> } else {
> printf("Successfully opened RAW-Sock after unshare().\n");
> close(sock);
> sock = -1;
> }
>
> return 0;
> }
>
> The above example shows how easy it is to acquire NET_RAW capabilities
> and once acquired, these processes could take benefit of above mentioned
> or similar issues discovered/undiscovered with malicious intent. Note
> that this is just an example and the problem/solution is not limited
> to NET_RAW capability *only*.
>
> The easiest fix one can apply here is to lock-down user-namespaces which
> many of the distros do (i.e. don't allow users to create user namespaces),
> but unfortunately that prevents everyone from using them.
>
> Approach
> --------
> Introduce a notion of 'controlled' user-namespaces. Every process on
> the host is allowed to create user-namespaces (governed by the limit
> imposed by per-ns sysctl) however, mark user-namespaces created by
> sandboxed processes as 'controlled'. Use this 'mark' at the time of
> capability check in conjunction with a global capability whitelist.
> If the capability is not whitelisted, processes that belong to
> controlled user-namespaces will not be allowed.
>
> Processes that do not have CAP_SYS_ADMIN in init-ns can *only* create
> controlled user-namespaces. In other words, user-namespaces created by
> privileged processes (those which have CAP_SYS_ADMIN in init-ns) are
> not controlled. A hierarchy underneath any controlled user-ns is always
> controlled.
>
> A global whitelist is list of capabilities governed by a sysctl
> (kernel.controlled_userns_caps_whitelist) which is available to
> (privileged) user in init-ns to modify while it's applicable to all
> controlled user-namespaces on the host irrespective of when that user-ns
> was created.
>
> Marking user-namespaces controlled without modifying the whitelist is
> equivalent of the current behavior. The default value of whitelist includes
> all capabilities so that the compatibility is maintained. However it gives
> admins fine-grained ability to control various capabilities system wide
> without locking down user-namespaces.
>
> Example
> -------
> Here is the example that demonstrates the behavior of a kernel that has
> this patch-set applied. It uses the same c-code from this commit-log and
> is called acquire_raw.c -
>
> (a) The 'root' user has all the capabilities all the time (before and
> after taking capability).
>
> root@vm0:~# id
> uid=0(root) gid=0(root) groups=0(root)
>
> root@vm0:~# sysctl -q kernel.controlled_userns_caps_whitelist
> kernel.controlled_userns_caps_whitelist = 1f,ffffffff
>
> root@vm0:~# ./acquire_raw
> Attempting to open RAW socket before unshare()...
> Successfully opened RAW-Sock before unshare().
> Attempting to open RAW socket after unshare()...
> Successfully opened RAW-Sock after unshare().
>
> root@vm0:~# sysctl -w kernel.controlled_userns_caps_whitelist=1f,ffffdfff
> kernel.controlled_userns_caps_whitelist = 1f,ffffdfff
>
> root@vm0:~# ./acquire_raw
> Attempting to open RAW socket before unshare()...
> Successfully opened RAW-Sock before unshare().
> Attempting to open RAW socket after unshare()...
> Successfully opened RAW-Sock after unshare().
>
> (b) Unprivileged user cannot change the mask.
>
> mahesh@vm0:~$ id
> uid=1000(mahesh) gid=1000(mahesh)
> groups=1000(mahesh),4(adm),24(cdrom),27(sudo),30(dip),46(plugdev),118(lpadmin),128(sambashare)
>
> mahesh@vm0:~$ sysctl -q kernel.controlled_userns_caps_whitelist
> kernel.controlled_userns_caps_whitelist = 1f,ffffffff
>
> mahesh@vm0:~$ sysctl -w kernel.controlled_userns_caps_whitelist=1f,ffffdfff
> sysctl: permission denied on key 'kernel.controlled_userns_caps_whitelist'
>
> (c) Unprivileged user does not have CAP_NET_RAW in init-ns but can get
> that capability inside child-user-ns when the controlled_userns_caps
> mask is unchanged (current behavior).
>
> mahesh@vm0:~$ sysctl -q kernel.controlled_userns_caps_whitelist
> kernel.controlled_userns_caps_whitelist = 1f,ffffffff
>
> mahesh@vm0:~$ ./acquire_raw
> Attempting to open RAW socket before unshare()...
> socket() SOCK_RAW failed: : Operation not permitted
> Attempting to open RAW socket after unshare()...
> Successfully opened RAW-Sock after unshare().
>
> (d) Changing the controlled_userns_caps_whitelist mask will prevent user
> for acquiring 'controlled capability' inside user-namespace.
>
> mahesh@vm0:~$ sysctl -q kernel.controlled_userns_caps_whitelist
> kernel.controlled_userns_caps_whitelist = 1f,ffffdfff
> mahesh@vm0:~$ ./acquire_raw
> Attempting to open RAW socket before unshare()...
> socket() SOCK_RAW failed: : Operation not permitted
> Attempting to open RAW socket after unshare()...
> socket() SOCK_RAW failed: : Operation not permitted
>
>
> Please see individual patches in this series.
>
> Mahesh Bandewar (2):
> capability: introduce sysctl for controlled user-ns capability whitelist
> userns: control capabilities of some user namespaces
>
> Documentation/sysctl/kernel.txt | 21 +++++++++++++++++
> include/linux/capability.h | 7 ++++++
> include/linux/user_namespace.h | 25 ++++++++++++++++++++
> kernel/capability.c | 52 +++++++++++++++++++++++++++++++++++++++++
> kernel/sysctl.c | 5 ++++
> kernel/user_namespace.c | 4 ++++
> security/commoncap.c | 8 +++++++
> 7 files changed, 122 insertions(+)
^ permalink raw reply
* Re: [RFC PATCH net-next 03/19] ipv6: Clear nexthop flags upon netdev up
From: David Ahern @ 2018-01-03 16:56 UTC (permalink / raw)
To: Ido Schimmel; +Cc: Ido Schimmel, netdev, davem, roopa, nicolas.dichtel, mlxsw
In-Reply-To: <20180103164331.GA4926@splinter>
On 1/3/18 9:43 AM, Ido Schimmel wrote:
> On Wed, Jan 03, 2018 at 08:32:51AM -0700, David Ahern wrote:
>> On 1/3/18 12:44 AM, Ido Schimmel wrote:
>>>>> diff --git a/net/ipv6/addrconf.c b/net/ipv6/addrconf.c
>>>>> index ed06b1190f05..b6405568ed7b 100644
>>>>> --- a/net/ipv6/addrconf.c
>>>>> +++ b/net/ipv6/addrconf.c
>>>>> @@ -3484,6 +3484,9 @@ static int addrconf_notify(struct notifier_block *this, unsigned long event,
>>>>> if (run_pending)
>>>>> addrconf_dad_run(idev);
>>>>>
>>>>> + /* Device has an address by now */
>>>>> + rt6_sync_up(dev, RTNH_F_DEAD);
>>>>> +
>>>>
>>>> Seems like this should be in the NETDEV_UP section, say after
>>>> addrconf_permanent_addr.
>>>
>>> Unless the `keep_addr_on_down` sysctl is set, then at this stage the
>>> netdev doesn't have an IP address and we shouldn't clear the dead flag
>>> just yet.
>>>
>>> This is consistent with IPv4 that clears the dead flag from nexthops in
>>> a multipath route only if the nexthop device has an IP address. When the
>>> last IPv4 address is removed from a netdev all the routes using it are
>>> flushed and there's nothing to clear upon NETDEV_UP.
>>
>> I have a bug about that IPv4 handling from the FRR team:
>>
>> $ ip link add dummy1 type dummy
>> $ ip li set dummy1 up
>> $ ip route add 1.1.1.0/24 dev dummy1
>>
>> $ ip addr add dev dummy1 2.2.2.1/24
>> $ ip ro ls | grep dummy1
>> 1.1.1.0/24 dev dummy1 scope link
>> 2.2.2.0/24 dev dummy1 proto kernel scope link src 2.2.2.1
>>
>> $ ip addr del dev dummy1 2.2.2.1/24
>> $ ip ro ls | grep dummy1
>> <no outpu>
>>
>> The 1.1.1.0/24 route was removed as well the 2.2.2.0 connected route.
>
> If you're going to skip the flushing in this case, at least mark the
> nexthops as dead.
On a down event, yes. If the device is still up then a route such as:
$ ip route add 1.1.1.0/24 dev dummy1
should still be usable even without an address on it.
>
> And this is my second reason to have rt6_sync_up() where I put it. I'm
> preparing another set which sends FIB_EVENT_NH_ADD events from
> rt6_sync_up() similar to what we've in fib_sync_up(). When mlxsw (others
On a tangent here, but I have been meaning to ask why you have
FIB_EVENT_NH_ADD events as opposed to handling netdev events. What does
a FIB_EVENT_NH_ADD provide that you can't do from a netdev event handler?
> in the future) processes the event it needs to add the nexthop back to
> the forwarding plane. To do that, it needs to have a RIF for the
> nexthop device. For the nexthop device to have a RIF, it needs at least
> one IP address configured on the netdev.
Why is that?
$ ip addr sh dev swp1s0.51
44: swp1s0.51@swp1s0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc
noqueue master vrf1101 state UP group default qlen 1000
link/ether 7c:fe:90:e8:3a:7d brd ff:ff:ff:ff:ff:ff
$ ip ro add vrf vrf1101 1.1.1.0/24 dev swp1s0.51
$ ip ro ls vrf vrf1101
unreachable default metric 8192
1.1.1.0/24 dev swp1s0.51 scope link offload
In this case, I take it mlxsw allocates a rif because of the vlan. The
above does not work on just swp1s0 -- ie., that route is not offloaded:
$ # ip ro ls
...
1.1.1.0/24 dev swp1s0 scope link
...
Interesting.
^ permalink raw reply
* Re: [PATCH net-next 0/2] Enable virtio to act as a master for a passthru device
From: Alexander Duyck @ 2018-01-03 16:59 UTC (permalink / raw)
To: Jakub Kicinski
Cc: Brandeburg, Jesse, Sridhar Samudrala, Michael S. Tsirkin,
Stephen Hemminger, Netdev, virtualization, virtio-dev,
Alexander Duyck
In-Reply-To: <20180102181603.7fe4fd1e@cakuba.netronome.com>
On Tue, Jan 2, 2018 at 6:16 PM, Jakub Kicinski <kubakici@wp.pl> wrote:
> On Tue, 2 Jan 2018 16:35:36 -0800, Sridhar Samudrala wrote:
>> This patch series enables virtio to switch over to a VF datapath when a VF
>> netdev is present with the same MAC address. It allows live migration of a VM
>> with a direct attached VF without the need to setup a bond/team between a
>> VF and virtio net device in the guest.
>>
>> The hypervisor needs to unplug the VF device from the guest on the source
>> host and reset the MAC filter of the VF to initiate failover of datapath to
>> virtio before starting the migration. After the migration is completed, the
>> destination hypervisor sets the MAC filter on the VF and plugs it back to
>> the guest to switch over to VF datapath.
>>
>> It is based on netvsc implementation and it may be possible to make this code
>> generic and move it to a common location that can be shared by netvsc and virtio.
>>
>> This patch series is based on the discussion initiated by Jesse on this thread.
>> https://marc.info/?l=linux-virtualization&m=151189725224231&w=2
>
> How does the notion of a device which is both a bond and a leg of a
> bond fit with Alex's recent discussions about feature propagation?
> Which propagation rules will apply to VirtIO master? Meaning of the
> flags on a software upper device may be different. Why muddy the
> architecture like this and not introduce a synthetic bond device?
It doesn't really fit with the notion I had. I think there may have
been a bit of a disconnect as I have been out for the last week or so
for the holidays.
My thought on this was that the feature bit should be spawning a new
para-virtual bond device and that bond should have the virto and the
VF as slaves. Also I thought there was some discussion about trying to
reuse as much of the netvsc code as possible for this so that we could
avoid duplication of effort and have the two drivers use the same
approach. It seems like it should be pretty straight forward since you
would have the feature bit in the case of virto, and netvsc just does
this sort of thing by default if I am not mistaken.
- Alex
^ permalink raw reply
* RE: [EXT] Re: [PATCH net-next 5/6] arm64: dts: marvell: mcbin: enable the fourth network interface
From: Stefan Chulski @ 2018-01-03 17:00 UTC (permalink / raw)
To: Russell King - ARM Linux
Cc: Thomas Petazzoni, Andrew Lunn, Florian Fainelli, Yan Markman,
Jason Cooper, netdev, Antoine Tenart,
linux-kernel@vger.kernel.org, kishon@ti.com, Nadav Haklai,
Miquèl Raynal, Gregory Clément, Marcin Wojtas,
David S. Miller, linux-arm-kernel@lists.infradead.org,
Sebastian Hesselbarth
In-Reply-To: <20180101132520.GB28752@n2100.armlinux.org.uk>
> > > -----Original Message-----
> > > Hi Russell,
> > >
> > > Indeed. RGMII MAC behaves same way, although it shouldn't be named
> > > as 'in- band' to be on par with the specifications. Anyway - this
> > > one is rather a stub for being able to work with ACPI, so once the
> > > MDIO bus works there, this will be out of any concerns.
> >
> > Hi Marcin,
> >
> > This is correct.
> > "in-band" supported only for SGMII mode.
> > IRQ link interrupt depend on "in-band"' auto negation only if "in-band"'
> enabled.
> > But IRQ link interrupt could be triggered with "in-band", "out-band" or with
> specific fixed speed/duplex/flow_contol.
>
> Hi Stefan,
>
> How does this work in RGMII mode - is this handled by the PP2 polling the PHY
> to get the speed, duplex and flow control settings?
>
IRQ interrupt doesn't handled speed, duplex and flow control settings.
It's just raised if number of criterions met:
1) Physical signal detected by MAC
2) MAC auto negotiation succeeded(valid only auto negotiation enabled in MAC and "in-band" bypass disabled)
So if auto negotiation mechanism disabled in MAC or bypassed, link status would changes to up and IRQ interrupt be triggered.
In case of RGMII mode obviously we don't have "in-band" auto negotiation and "out-band" cannot be used in Kernel(due to missed locks).
So auto negotiation should be disabled on MAC level and speed/duplex/flow_contol would be negotiate by PHY.
phylink/phylib infrastructure should provide speed/duplex/flow_contol(that were agreed between PHY's) to ppv2 driver.
Stefan.
^ permalink raw reply
* [PATCH] sh_eth: fix TSU resource handling
From: Sergei Shtylyov @ 2018-01-03 17:09 UTC (permalink / raw)
To: netdev, linux-renesas-soc; +Cc: linux-sh, Nobuhiro Iwamatsu, Sergei Shtylyov
[-- Attachment #1: sh_eth-fix-TSU-resource-handling.patch --]
[-- Type: text/plain, Size: 2103 bytes --]
When switching the driver to the managed device API, I managed to break
the case of a dual Ether devices sharing a single TSU: the 2nd Ether port
wouldn't probe. Iwamatsu-san has tried to fix this but his patch was buggy
and he then dropped the ball...
The solution is to limit calling devm_request_mem_region() to the first
of the two ports sharing the same TSU, so devm_ioremap_resource() can't
be used anymore for the TSU resource...
Fixes: d5e07e69218f ("sh_eth: use managed device API")
Reported-by: Nobuhiro Iwamatsu <nobuhiro.iwamatsu.yj@renesas.com>
Signed-off-by: Sergei Shtylyov <sergei.shtylyov@cogentembedded.com>
---
The patch is against Dave Miller's 'net.git' repo.
drivers/net/ethernet/renesas/sh_eth.c | 25 ++++++++++++++++++++++---
1 file changed, 22 insertions(+), 3 deletions(-)
Index: net/drivers/net/ethernet/renesas/sh_eth.c
===================================================================
--- net.orig/drivers/net/ethernet/renesas/sh_eth.c
+++ net/drivers/net/ethernet/renesas/sh_eth.c
@@ -3225,10 +3225,29 @@ static int sh_eth_drv_probe(struct platf
/* ioremap the TSU registers */
if (mdp->cd->tsu) {
struct resource *rtsu;
+
rtsu = platform_get_resource(pdev, IORESOURCE_MEM, 1);
- mdp->tsu_addr = devm_ioremap_resource(&pdev->dev, rtsu);
- if (IS_ERR(mdp->tsu_addr)) {
- ret = PTR_ERR(mdp->tsu_addr);
+ if (!rtsu) {
+ dev_err(&pdev->dev, "no TSU resource\n");
+ ret = -ENODEV;
+ goto out_release;
+ }
+ /* We can only request the TSU region for the first port
+ * of the two sharing this TSU for the probe to succeed...
+ */
+ if (devno % 2 == 0 &&
+ !devm_request_mem_region(&pdev->dev, rtsu->start,
+ resource_size(rtsu),
+ dev_name(&pdev->dev))) {
+ dev_err(&pdev->dev, "can't request TSU resource.\n");
+ ret = -EBUSY;
+ goto out_release;
+ }
+ mdp->tsu_addr = devm_ioremap(&pdev->dev, rtsu->start,
+ resource_size(rtsu));
+ if (!mdp->tsu_addr) {
+ dev_err(&pdev->dev, "TSU region ioremap() failed.\n");
+ ret = -ENOMEM;
goto out_release;
}
mdp->port = devno % 2;
^ permalink raw reply
* Re: [PATCH net-next 0/3] nfp: flower: repr link state
From: David Miller @ 2018-01-03 17:18 UTC (permalink / raw)
To: jakub.kicinski; +Cc: netdev, oss-drivers
In-Reply-To: <20180103031901.4165-1-jakub.kicinski@netronome.com>
From: Jakub Kicinski <jakub.kicinski@netronome.com>
Date: Tue, 2 Jan 2018 19:18:58 -0800
> Dirk says:
>
> This series provides two updates towards the link state of reprs in
> the flower nfp app.
>
> Patch #1 improves the way link state is reported for reprs. Instead of
> starting with an assumed 'UP' state, always assume the link state is
> 'DOWN' and then modify this only on events received from firmware.
>
> Patch #2 adds a new nfp_app hook, repr_preclean. This callback is
> executed before reprs are removed from the app context and is executed
> per repr.
>
> Patch #3 implements the new REIFY control message, used to indicate
> when reprs are created and destroyed. Firmware uses these messages
> to prevent communication about any particular port when the driver
> doesn't know about the repr yet or anymore.
Series applied, thanks.
^ permalink raw reply
* Re: [PATCH net v2] cxgb4: Fix FW flash errors
From: David Miller @ 2018-01-03 17:20 UTC (permalink / raw)
To: ganeshgr; +Cc: netdev, nirranjan, indranil, venkatesh, arjun, leedom
In-Reply-To: <1514960047-25727-1-git-send-email-ganeshgr@chelsio.com>
From: Ganesh Goudar <ganeshgr@chelsio.com>
Date: Wed, 3 Jan 2018 11:44:07 +0530
> From: Arjun Vynipadath <arjun@chelsio.com>
>
> commit 96ac18f14a5a ("cxgb4: Add support for new flash parts")
> removed initialization of adapter->params.sf_fw_start causing issues
> while flashing firmware to card. We no longer need sf_fw_start
> in adapter->params as we already have macros defined for FW flash
> addresses.
>
> Fixes: 96ac18f14a5a ("cxgb4: Add support for new flash parts")
> Signed-off-by: Arjun Vynipadath <arjun@chelsio.com>
> Signed-off-by: Casey Leedom <leedom@chelsio.com>
> Signed-off-by: Ganesh Goudar <ganeshgr@chelsio.com>
> ---
> V2: Corrected the commit log
Applied, thank you.
^ permalink raw reply
* Re: [PATCH net V2 0/2] bug fixes for ENA Ethernet driver
From: David Miller @ 2018-01-03 17:21 UTC (permalink / raw)
To: netanel
Cc: netdev, dwmw, zorik, matua, saeedb, msw, aliguori, nafea, evgenys,
gtzalik
In-Reply-To: <1514960250-107256-1-git-send-email-netanel@amazon.com>
From: <netanel@amazon.com>
Date: Wed, 3 Jan 2018 06:17:28 +0000
> From: Netanel Belgazal <netanel@amazon.com>
>
> Changes from V1:
> Revome incorrect "ena: invoke netif_carrier_off() only after netdev
> registered" patch
>
> This patchset contains 2 bug fixes:
> * handle rare race condition during MSI-X initialization
> * fix error processing in ena_down()
Series applied, thanks.
^ permalink raw reply
* Re: [patch net-next v4 00/10] net: sched: allow qdiscs to share filter block instances
From: Jiri Pirko @ 2018-01-03 17:22 UTC (permalink / raw)
To: David Ahern
Cc: netdev, davem, jhs, xiyou.wangcong, mlxsw, andrew, vivien.didelot,
f.fainelli, michael.chan, ganeshgr, saeedm, matanb, leonro,
idosch, jakub.kicinski, simon.horman, pieter.jansenvanvuuren,
john.hurley, alexander.h.duyck, ogerlitz, john.fastabend, daniel
In-Reply-To: <d6cff5a4-b240-2a19-2af0-8b60de9f5435@gmail.com>
Wed, Jan 03, 2018 at 04:57:23PM CET, dsahern@gmail.com wrote:
>On 1/3/18 2:40 AM, Jiri Pirko wrote:
>> Wed, Jan 03, 2018 at 03:07:36AM CET, dsahern@gmail.com wrote:
>>> On 1/2/18 12:49 PM, Jiri Pirko wrote:
>>>> DaveA, please consider following example:
>>>>
>>>> $ tc qdisc add dev ens7 ingress
>>>> $ tc qdisc
>>>> qdisc ingress ffff: dev ens7 parent ffff:fff1 block 1
>>>>
>>>> Now I have one device with one qdisc attached.
>>>>
>>>> I will add some filters, for example:
>>>> $ tc filter add dev ens7 ingress protocol ip pref 25 flower dst_ip 192.168.0.0/16 action drop
>>>>
>>>> No sharing is happening. The user is doing what he is used to do.
>>>>
>>>> Now user decides to share this filters with another device. As you can
>>>> see above, the block created for ens7 qdisc instance has id "1".
>>>> User can simply do:
>>>>
>>>> tc qdisc add dev ens8 ingress block 1
>>>>
>>>> And the block gets shared among ens7 ingress qdisc instance and ens8
>>>> ingress qdisc instance.
>>>>
>>>> What is wrong with this? The approach you suggest would disallow this
>>>
>>> Conceptually, absolutely nothing. We all agree that a shared block
>>> feature is needed. So no argument on sharing the filters across devices.
>>>
>>> The disagreement is in how they should be managed. I think my last
>>> response concisely captures my concerns -- the principle of least surprise.
>>>
>>> So with the initial commands above, all is fine. Then someone is
>>> debugging a problem or wants to add another filter to ens8, so they run:
>>>
>>> $ tc filter add dev ens8 ingress protocol ip pref 25 flower dst_ip
>>> 192.168.1.0/16 action drop
>>>
>>> Then traffic flows through ens7 break and some other user is struggling
>>> to understand what just happened. That the new filter magically appears
>>> on ens7 when the user operated on ens8 is a surprise. Nothing about that
>>> last command acknowledges that it is changing a shared resource.
>>
>> Given the fact that user configured sharing between ens7 and ens8 and he
>> can easily see that by "$ tc qdisc show" I don't see anything wrong
>> about it, no surprise. Either the user knows what is he doing or not.
>
>tc is one of the most difficult commands for users to understand and get
>right. The API behind the command even more so. There seems to be a
>general agreement on this.
>
>To someone like you who is well versed in tc semantics this may seem
>obvious, but I contend that even you would slip up here at some point.
>There is too much distance between the filter management and the qdisc
>listing a part of which shows a block id - not that it is shared or
>anything else, just of the many words in the output there is 'block N'.
>
>>>
>>> Consider the commands being run by different people, and a time span
>>> between. Allowing the shared block to be configured by any device using
>>> the block is just setting up users for errors and confusion.
>>
>> No confusion. Everything is visible, all info is in the manpage. The
>> same story as always.
>>
>>
>>>
>>>> forcing user to explicitly create some block entity and then to attach
>>>> it to qdisc instances. I don't really see good reason for it. Could you
>>>> please clear this up for me?
>>>
>>> It forces the user to acknowledge it is changing a resource that may be
>>> shared by more than one device.
>>>
>>> $ tc filter add dev ens8 ingress protocol ip pref 25 flower dst_ip
>>> 192.168.1.0/16 action drop
>>> Error: This qdisc is a shared block. Use the block API to configure.
>>>
>>> $ tc qdisc show dev ens8
>>> qdisc ingress ffff: dev ens7 parent ffff:fff1 block 1
>>>
>>> $ tc filter add block 1 protocol ip pref 25 flower dst_ip 192.168.1.0/16
>>> action drop
>>>
>>> Now there are no surprises. I have to know that ens8 is using block 1,
>>> and I have to specify that block when adding a filter.
>>
>> On contrary. This is surprising! Consider my original example extended
>> by your approach and limitations:
>
>Nope, I was not extending your approach; I was using your examples to
>show why I disagree with the approach. As I mentioned in past responses,
>I believe the block lifecycle should be independent of any device.
>
>$ tc qdisc add block 1
qdisc add block seems odd as in this point, block 1 has nothing to do
with any qdisc
>$ tc filter add block 1 ....
>
>$ tc qdisc add dev ens7 ingress block 1
>$ tc qdisc add dev ens8 ingress block 1
So when I do just:
tc qdisc add dev ens7 ingress
there won't be a block id created. I would like to be able to have it
done on one device and share on the second when I please to. With your
limitation, I have remove all, add block, add all back again.
>
>$ tc filter show block 1
>(filters listed)
>
>$ tc filter show dev ens7 ingress
So the command will return 0 as everything is ok. List is empty, yet
there are still filters being processed on this qdisc. That is surprise
from where I stand. I as a user would expect a list of all filters being
used here. This what you propose is clearly a breakage.
I think that both of us have different expectations about how this
should work. I said already that "tc filter add block 1 ...." and
"tc filter show block 1" from your approach is fine with me. I even
think that the explicit creation and destruction of block ("tc qdisc add
block 1" and "tc qdisc del block 1") is fine.
However I don't agree about breaking the existing filter add and show
and also imposibility to make not-shared block shared in the runtime
before defining it first.
Now, I suggest I will implement the block api extensions you suggest
without the limitations and send it as a part of this patchset. Would
you be ok with that?
Thanks!
>Info: This qdisc is shared. Use the block api to list filters.
>
>(This is very similar to how I am handling nexthop objects for routes. I
>can show some examples if desired, but don't want this tangent to go too
>far.)
>
>>
>> $ tc qdisc add dev ens7 ingress
>> $ tc qdisc
>> qdisc ingress ffff: dev ens7 parent ffff:fff1 block 1
>> $ tc filter add dev ens7 ingress protocol ip pref 25 flower dst_ip 192.168.0.0/16 action drop
>>
>> So far, everything is good. Now I add qdisc with block 1 to ens8:
>> $ tc qdisc add dev ens8 ingress block 1
>>
>> And I do:
>> $ tc filter add dev ens7 ingress protocol ip pref 25 flower dst_ip 192.168.0.0/16 action drop
>> Should it Error out or pass by your limitations?
>>
>> Assume it should pass.
>> I do:
>> $ tc filter add dev ens8 ingress protocol ip pref 25 flower dst_ip 192.168.1.0/16 action drop
>> Error: This qdisc is a shared block. Use the block API to configure.
>>
>> This will error out as you wrote. Now I do show:
>>
>> $ tc qdisc show dev ens8
>> qdisc ingress ffff: dev ens7 parent ffff:fff1 block 1
>>
>> As you wrote, there is "ens7" in output of ens8 qdisc. That is
>> surprising.
>
>that's a typo on my part with copy-paste-modify of commands while
>writing an email; that was not intentional to show ens7 on an ens8 device.
ok.
>
>
>>
>> What would following commands show with your limitations:
>> $ tc filter show dev ens7 ingress
>> $ tc filter show dev ens8 ingress
>
>see above
>
>>
>> All filters should be listed under ens7 and ens8 should be blank? I
>> cannot add filters to ens8 with your limitations so I guess the show for
>> it should be blank. But there are actually rules there! That is another
>> surprise and breakage!
>>
>> Now I continue and remove the qdisc from ens7:
>> $ tc qdisc add dev ens7 ingress
>>
>> The block 1 is still there for ens8. So what happens now? What is the
>> output of "filter show dev ens8 ingress" and "qdisc show dev ens8"?
>> Will "add dev ens8 ingress" magically start to work now? This is another
>> set of surprises and breakages.
>>
>> So as I see it with your limitations, there is a lot of surprises
>> introduced.
>>
>> Note that I gave a lot of thoughts to all this. The approach I suggest
>> is the cleanest and does not break anything. Also, it is easily
>> extendable by adding the block handle to add/del/list the filters.
>> But the current commands should not be broken. Please.
>>
>> If you want, I can implement the block handle extension as a part of this
>> patchset. I wanted to do it as a follow-up to limit the number of
>> patches in the set so DaveM would not have reason to hate me :)
>>
>>
>>>
>>>
>>> BTW, is there an option to list all devices using the same shared block
>>> - short of listing all and grepping?
>>
>> $ tc qdisc show
>>
>>
>
^ permalink raw reply
* Re: [PATCH net-next 1/6] phy: add 2.5G SGMII mode to the phy_mode enum
From: Russell King - ARM Linux @ 2018-01-03 17:29 UTC (permalink / raw)
To: Andrew Lunn
Cc: Antoine Tenart, Florian Fainelli, thomas.petazzoni, ymarkman,
jason, netdev, linux-kernel, kishon, nadavh, miquel.raynal,
gregory.clement, stefanc, mw, davem, linux-arm-kernel,
sebastian.hesselbarth
In-Reply-To: <20180103150830.GB3401@lunn.ch>
On Wed, Jan 03, 2018 at 04:08:30PM +0100, Andrew Lunn wrote:
> > > >>> diff --git a/include/linux/phy/phy.h b/include/linux/phy/phy.h
> > > >>> index 4f8423a948d5..70459a28f3a1 100644
> > > >>> --- a/include/linux/phy/phy.h
> > > >>> +++ b/include/linux/phy/phy.h
> > > >>> @@ -28,6 +28,7 @@ enum phy_mode {
> > > >>> PHY_MODE_USB_DEVICE,
> > > >>> PHY_MODE_USB_OTG,
> > > >>> PHY_MODE_SGMII,
> > > >>> + PHY_MODE_SGMII_2_5G,
> > > >>> PHY_MODE_10GKR,
> > > >>> PHY_MODE_UFS_HS_A,
> > > >>> PHY_MODE_UFS_HS_B,
> > > >>
> > > >> There was a discussion maybe last month about adding 2.5G SGMII. I
> > > >> would prefer 2500SGMII. Putting the number first makes it uniform with
> > > >> the other defines, 1000BASEX, 25000BASEX, 10GKR.
> > > >
> > > > Good to know. I wasn't completely sure how to name this mode properly,
> > > > but I'm fine with PHY_MODE_2500SGMII. I'll update the patches and send a
> > > > v2 (without the dt part).
> > >
> > > And since you are respinning, please make sure you update phy_modes() in
> > > the same header file as well as
> > > Documentation/devicetree/bindings/net/ethernet.txt with the newly added
> > > PHY interface mode.
> >
> > Actually it's a generic PHY mode I'm adding, not a network PHY mode.
> > There's no phy_modes() function for generic PHYs (and this 2500BaseX
> > mode already is supported in the network PHY modes).
>
> Hi Antoine
>
> Don't you need it in both include/linux/phy/phy.h and
> include/linux/phy.h?
Is it really 2.5G SGMII or 2.5G Base-X - iow, which configuration word
is permissible at 2.5G speeds for the pp2? We already have 2500BASEX
in linux/phy.h, which is the 802.3z compliant 16-bit configuration
word rather than the Cisco variant.
--
RMK's Patch system: http://www.armlinux.org.uk/developer/patches/
FTTC broadband for 0.8mile line in suburbia: sync at 8.8Mbps down 630kbps up
According to speedtest.net: 8.21Mbps down 510kbps up
^ permalink raw reply
* Re: [PATCH net-next v2 0/7] Resolve races in phy accessors
From: Russell King - ARM Linux @ 2018-01-03 17:32 UTC (permalink / raw)
To: David Miller; +Cc: andrew, f.fainelli, netdev
In-Reply-To: <20180103.110431.2081799093348577405.davem@davemloft.net>
On Wed, Jan 03, 2018 at 11:04:31AM -0500, David Miller wrote:
> From: Russell King - ARM Linux <linux@armlinux.org.uk>
> Date: Tue, 2 Jan 2018 10:52:18 +0000
>
> > This series resolves races with various accesses to PHY registers.
> > The first five patches are necessary before we add phylink support
> > to mvneta, the remaining three are merely cleanups for unobserved
> > races, and hence are less critical.
>
> Series applied.
aieee, this should have been applied before the mvneta patch set (as
mentioned in the covering email to that set, as well as the cover
message you quoted above) because the mvneta patch set makes these
races visible. Well, I guess folk are going to have to put up with
bisect issues if they hit a commit between the two merges in your tree.
Well, I guess what's done is done, I did my best to avoid it being
visible.
--
RMK's Patch system: http://www.armlinux.org.uk/developer/patches/
FTTC broadband for 0.8mile line in suburbia: sync at 8.8Mbps down 630kbps up
According to speedtest.net: 8.21Mbps down 510kbps up
^ permalink raw reply
* [net 0/4][pull request] Intel Wired LAN Driver Updates 2018-01-03
From: Jeff Kirsher @ 2018-01-03 17:33 UTC (permalink / raw)
To: davem; +Cc: Jeff Kirsher, netdev, nhorman, sassmann, jogreene
This series contains fixes for i40e and i40evf.
Amritha removes the UDP support for big buffer cloud filters since it is
not supported and having UDP enabled is a bug.
Alex fixes a bug in the __i40e_chk_linearize() which did not take into
account large (16K or larger) fragments that are split over 2 descriptors,
which could result in a transmit hang.
Jake fixes an issue where a devices own MAC address could be removed from
the unicast address list, so force a check on every address sync to ensure
removal does not happen.
Jiri Pirko fixes the return value when a filter configuration is not
supported, do not return "invalid" but return "not supported" so that
the core can react correctly.
The following are changes since commit c095508770aebf1b9218e77026e48345d719b17c:
RDS: Heap OOB write in rds_message_alloc_sgs()
and are available in the git repository at:
git://git.kernel.org/pub/scm/linux/kernel/git/jkirsher/net-queue 40GbE
Alexander Duyck (1):
i40e/i40evf: Account for frags split over multiple descriptors in
check linearize
Amritha Nambiar (1):
i40e: Remove UDP support for big buffer
Jacob Keller (1):
i40e: don't remove netdev->dev_addr when syncing uc list
Jiri Pirko (1):
i40e: flower: Fix return value for unsupported offload
drivers/net/ethernet/intel/i40e/i40e_main.c | 37 +++++++++++++++++++--------
drivers/net/ethernet/intel/i40e/i40e_txrx.c | 26 ++++++++++++++++---
drivers/net/ethernet/intel/i40evf/i40e_txrx.c | 26 ++++++++++++++++---
3 files changed, 72 insertions(+), 17 deletions(-)
--
2.15.1
^ permalink raw reply
* [net 2/4] i40e/i40evf: Account for frags split over multiple descriptors in check linearize
From: Jeff Kirsher @ 2018-01-03 17:33 UTC (permalink / raw)
To: davem; +Cc: Alexander Duyck, netdev, nhorman, sassmann, jogreene,
Jeff Kirsher
In-Reply-To: <20180103173317.72430-1-jeffrey.t.kirsher@intel.com>
From: Alexander Duyck <alexander.h.duyck@intel.com>
The original code for __i40e_chk_linearize didn't take into account the
fact that if a fragment is 16K in size or larger it has to be split over 2
descriptors and the smaller of those 2 descriptors will be on the trailing
edge of the transmit. As a result we can get into situations where we didn't
catch requests that could result in a Tx hang.
This patch takes care of that by subtracting the length of all but the
trailing edge of the stale fragment before we test for sum. By doing this
we can guarantee that we have all cases covered, including the case of a
fragment that spans multiple descriptors. We don't need to worry about
checking the inner portions of this since 12K is the maximum aligned DMA
size and that is larger than any MSS will ever be since the MTU limit for
jumbos is something on the order of 9K.
Signed-off-by: Alexander Duyck <alexander.h.duyck@intel.com>
Tested-by: Andrew Bowers <andrewx.bowers@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
---
drivers/net/ethernet/intel/i40e/i40e_txrx.c | 26 +++++++++++++++++++++++---
drivers/net/ethernet/intel/i40evf/i40e_txrx.c | 26 +++++++++++++++++++++++---
2 files changed, 46 insertions(+), 6 deletions(-)
diff --git a/drivers/net/ethernet/intel/i40e/i40e_txrx.c b/drivers/net/ethernet/intel/i40e/i40e_txrx.c
index 4566d66ffc7c..5bc2748ac468 100644
--- a/drivers/net/ethernet/intel/i40e/i40e_txrx.c
+++ b/drivers/net/ethernet/intel/i40e/i40e_txrx.c
@@ -3047,10 +3047,30 @@ bool __i40e_chk_linearize(struct sk_buff *skb)
/* Walk through fragments adding latest fragment, testing it, and
* then removing stale fragments from the sum.
*/
- stale = &skb_shinfo(skb)->frags[0];
- for (;;) {
+ for (stale = &skb_shinfo(skb)->frags[0];; stale++) {
+ int stale_size = skb_frag_size(stale);
+
sum += skb_frag_size(frag++);
+ /* The stale fragment may present us with a smaller
+ * descriptor than the actual fragment size. To account
+ * for that we need to remove all the data on the front and
+ * figure out what the remainder would be in the last
+ * descriptor associated with the fragment.
+ */
+ if (stale_size > I40E_MAX_DATA_PER_TXD) {
+ int align_pad = -(stale->page_offset) &
+ (I40E_MAX_READ_REQ_SIZE - 1);
+
+ sum -= align_pad;
+ stale_size -= align_pad;
+
+ do {
+ sum -= I40E_MAX_DATA_PER_TXD_ALIGNED;
+ stale_size -= I40E_MAX_DATA_PER_TXD_ALIGNED;
+ } while (stale_size > I40E_MAX_DATA_PER_TXD);
+ }
+
/* if sum is negative we failed to make sufficient progress */
if (sum < 0)
return true;
@@ -3058,7 +3078,7 @@ bool __i40e_chk_linearize(struct sk_buff *skb)
if (!nr_frags--)
break;
- sum -= skb_frag_size(stale++);
+ sum -= stale_size;
}
return false;
diff --git a/drivers/net/ethernet/intel/i40evf/i40e_txrx.c b/drivers/net/ethernet/intel/i40evf/i40e_txrx.c
index 50864f99446d..1ba29bb85b67 100644
--- a/drivers/net/ethernet/intel/i40evf/i40e_txrx.c
+++ b/drivers/net/ethernet/intel/i40evf/i40e_txrx.c
@@ -2012,10 +2012,30 @@ bool __i40evf_chk_linearize(struct sk_buff *skb)
/* Walk through fragments adding latest fragment, testing it, and
* then removing stale fragments from the sum.
*/
- stale = &skb_shinfo(skb)->frags[0];
- for (;;) {
+ for (stale = &skb_shinfo(skb)->frags[0];; stale++) {
+ int stale_size = skb_frag_size(stale);
+
sum += skb_frag_size(frag++);
+ /* The stale fragment may present us with a smaller
+ * descriptor than the actual fragment size. To account
+ * for that we need to remove all the data on the front and
+ * figure out what the remainder would be in the last
+ * descriptor associated with the fragment.
+ */
+ if (stale_size > I40E_MAX_DATA_PER_TXD) {
+ int align_pad = -(stale->page_offset) &
+ (I40E_MAX_READ_REQ_SIZE - 1);
+
+ sum -= align_pad;
+ stale_size -= align_pad;
+
+ do {
+ sum -= I40E_MAX_DATA_PER_TXD_ALIGNED;
+ stale_size -= I40E_MAX_DATA_PER_TXD_ALIGNED;
+ } while (stale_size > I40E_MAX_DATA_PER_TXD);
+ }
+
/* if sum is negative we failed to make sufficient progress */
if (sum < 0)
return true;
@@ -2023,7 +2043,7 @@ bool __i40evf_chk_linearize(struct sk_buff *skb)
if (!nr_frags--)
break;
- sum -= skb_frag_size(stale++);
+ sum -= stale_size;
}
return false;
--
2.15.1
^ permalink raw reply related
* [net 3/4] i40e: don't remove netdev->dev_addr when syncing uc list
From: Jeff Kirsher @ 2018-01-03 17:33 UTC (permalink / raw)
To: davem; +Cc: Jacob Keller, netdev, nhorman, sassmann, jogreene, Jeff Kirsher
In-Reply-To: <20180103173317.72430-1-jeffrey.t.kirsher@intel.com>
From: Jacob Keller <jacob.e.keller@intel.com>
In some circumstances, such as with bridging, it is possible that the
stack will add a devices own MAC address to its unicast address list.
If, later, the stack deletes this address, then the i40e driver will
receive a request to remove this address.
The driver stores its current MAC address as part of the MAC/VLAN hash
array, since it is convenient and matches exactly how the hardware
expects to be told which traffic to receive.
This causes a problem, since for more devices, the MAC address is stored
separately, and requests to delete a unicast address should not have the
ability to remove the filter for the MAC address.
Fix this by forcing a check on every address sync to ensure we do not
remove the device address.
There is a very narrow possibility of a race between .set_mac and
.set_rx_mode, if we don't change netdev->dev_addr before updating our
internal MAC list in .set_mac. This might be possible if .set_rx_mode is
going to remove MAC "XYZ" from the list, at the same time as .set_mac
changes our dev_addr to MAC "XYZ", we might possibly queue a delete,
then an add in .set_mac, then queue a delete in .set_rx_mode's
dev_uc_sync and then update netdev->dev_addr. We can avoid this by
moving the copy into dev_addr prior to the changes to the MAC filter
list.
A similar race on the other side does not cause problems, as if we're
changing our MAC form A to B, and we race with .set_rx_mode, it could
queue a delete from A, we'd update our address, and allow the delete.
This seems like a race, but in reality we're about to queue a delete of
A anyways, so it would not cause any issues.
A race in the initialization code is unlikely because the netdevice has
not yet been fully initialized and the stack should not be adding or
removing addresses yet.
Note that we don't (yet) need similar code for the VF driver because it
does not make use of __dev_uc_sync and __dev_mc_sync, but instead roles
its own method for handling updates to the MAC/VLAN list, which already
has code to protect against removal of the hardware address.
Signed-off-by: Jacob Keller <jacob.e.keller@intel.com>
Tested-by: Andrew Bowers <andrewx.bowers@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
---
drivers/net/ethernet/intel/i40e/i40e_main.c | 17 ++++++++++++++++-
1 file changed, 16 insertions(+), 1 deletion(-)
diff --git a/drivers/net/ethernet/intel/i40e/i40e_main.c b/drivers/net/ethernet/intel/i40e/i40e_main.c
index fffd4868defb..9e4b78e447f8 100644
--- a/drivers/net/ethernet/intel/i40e/i40e_main.c
+++ b/drivers/net/ethernet/intel/i40e/i40e_main.c
@@ -1573,11 +1573,18 @@ static int i40e_set_mac(struct net_device *netdev, void *p)
else
netdev_info(netdev, "set new mac address %pM\n", addr->sa_data);
+ /* Copy the address first, so that we avoid a possible race with
+ * .set_rx_mode(). If we copy after changing the address in the filter
+ * list, we might open ourselves to a narrow race window where
+ * .set_rx_mode could delete our dev_addr filter and prevent traffic
+ * from passing.
+ */
+ ether_addr_copy(netdev->dev_addr, addr->sa_data);
+
spin_lock_bh(&vsi->mac_filter_hash_lock);
i40e_del_mac_filter(vsi, netdev->dev_addr);
i40e_add_mac_filter(vsi, addr->sa_data);
spin_unlock_bh(&vsi->mac_filter_hash_lock);
- ether_addr_copy(netdev->dev_addr, addr->sa_data);
if (vsi->type == I40E_VSI_MAIN) {
i40e_status ret;
@@ -1923,6 +1930,14 @@ static int i40e_addr_unsync(struct net_device *netdev, const u8 *addr)
struct i40e_netdev_priv *np = netdev_priv(netdev);
struct i40e_vsi *vsi = np->vsi;
+ /* Under some circumstances, we might receive a request to delete
+ * our own device address from our uc list. Because we store the
+ * device address in the VSI's MAC/VLAN filter list, we need to ignore
+ * such requests and not delete our device address from this list.
+ */
+ if (ether_addr_equal(addr, netdev->dev_addr))
+ return 0;
+
i40e_del_mac_filter(vsi, addr);
return 0;
--
2.15.1
^ permalink raw reply related
* [net 4/4] i40e: flower: Fix return value for unsupported offload
From: Jeff Kirsher @ 2018-01-03 17:33 UTC (permalink / raw)
To: davem; +Cc: Jiri Pirko, netdev, nhorman, sassmann, jogreene, Jeff Kirsher
In-Reply-To: <20180103173317.72430-1-jeffrey.t.kirsher@intel.com>
From: Jiri Pirko <jiri@mellanox.com>
When filter configuration is not supported, drivers should return
-EOPNOTSUPP so the core can react correctly.
Fixes: 2f4b411a3d67 ("i40e: Enable cloud filters via tc-flower")
Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Tested-by: Andrew Bowers <andrewx.bowers@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
---
drivers/net/ethernet/intel/i40e/i40e_main.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/drivers/net/ethernet/intel/i40e/i40e_main.c b/drivers/net/ethernet/intel/i40e/i40e_main.c
index 9e4b78e447f8..42dcaefc4c19 100644
--- a/drivers/net/ethernet/intel/i40e/i40e_main.c
+++ b/drivers/net/ethernet/intel/i40e/i40e_main.c
@@ -7371,7 +7371,7 @@ static int i40e_configure_clsflower(struct i40e_vsi *vsi,
if (tc < 0) {
dev_err(&vsi->back->pdev->dev, "Invalid traffic class\n");
- return -EINVAL;
+ return -EOPNOTSUPP;
}
if (test_bit(__I40E_RESET_RECOVERY_PENDING, pf->state) ||
--
2.15.1
^ permalink raw reply related
* [net 1/4] i40e: Remove UDP support for big buffer
From: Jeff Kirsher @ 2018-01-03 17:33 UTC (permalink / raw)
To: davem; +Cc: Amritha Nambiar, netdev, nhorman, sassmann, jogreene,
Jeff Kirsher
In-Reply-To: <20180103173317.72430-1-jeffrey.t.kirsher@intel.com>
From: Amritha Nambiar <amritha.nambiar@intel.com>
Since UDP based filters are not supported via big buffer cloud
filters, remove UDP support. Also change a few return types to
indicate unsupported vs invalid configuration.
Signed-off-by: Amritha Nambiar <amritha.nambiar@intel.com>
Acked-by: Alexander Duyck <alexander.h.duyck@intel.com>
Tested-by: Andrew Bowers <andrewx.bowers@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
---
drivers/net/ethernet/intel/i40e/i40e_main.c | 18 +++++++++---------
1 file changed, 9 insertions(+), 9 deletions(-)
diff --git a/drivers/net/ethernet/intel/i40e/i40e_main.c b/drivers/net/ethernet/intel/i40e/i40e_main.c
index 321d8be80871..fffd4868defb 100644
--- a/drivers/net/ethernet/intel/i40e/i40e_main.c
+++ b/drivers/net/ethernet/intel/i40e/i40e_main.c
@@ -6038,8 +6038,8 @@ static int i40e_validate_and_set_switch_mode(struct i40e_vsi *vsi)
/* Set Bit 7 to be valid */
mode = I40E_AQ_SET_SWITCH_BIT7_VALID;
- /* Set L4type to both TCP and UDP support */
- mode |= I40E_AQ_SET_SWITCH_L4_TYPE_BOTH;
+ /* Set L4type for TCP support */
+ mode |= I40E_AQ_SET_SWITCH_L4_TYPE_TCP;
/* Set cloud filter mode */
mode |= I40E_AQ_SET_SWITCH_MODE_NON_TUNNEL;
@@ -6969,18 +6969,18 @@ static int i40e_add_del_cloud_filter_big_buf(struct i40e_vsi *vsi,
is_valid_ether_addr(filter->src_mac)) ||
(is_multicast_ether_addr(filter->dst_mac) &&
is_multicast_ether_addr(filter->src_mac)))
- return -EINVAL;
+ return -EOPNOTSUPP;
- /* Make sure port is specified, otherwise bail out, for channel
- * specific cloud filter needs 'L4 port' to be non-zero
+ /* Big buffer cloud filter needs 'L4 port' to be non-zero. Also, UDP
+ * ports are not supported via big buffer now.
*/
- if (!filter->dst_port)
- return -EINVAL;
+ if (!filter->dst_port || filter->ip_proto == IPPROTO_UDP)
+ return -EOPNOTSUPP;
/* adding filter using src_port/src_ip is not supported at this stage */
if (filter->src_port || filter->src_ipv4 ||
!ipv6_addr_any(&filter->ip.v6.src_ip6))
- return -EINVAL;
+ return -EOPNOTSUPP;
/* copy element needed to add cloud filter from filter */
i40e_set_cld_element(filter, &cld_filter.element);
@@ -6991,7 +6991,7 @@ static int i40e_add_del_cloud_filter_big_buf(struct i40e_vsi *vsi,
is_multicast_ether_addr(filter->src_mac)) {
/* MAC + IP : unsupported mode */
if (filter->dst_ipv4)
- return -EINVAL;
+ return -EOPNOTSUPP;
/* since we validated that L4 port must be valid before
* we get here, start with respective "flags" value
--
2.15.1
^ permalink raw reply related
page: next (older) | prev (newer) | latest
- recent:[subjects (threaded)|topics (new)|topics (active)]
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox