* Re: [PATCH rdma-next 0/3] RDMA RX RoCE Steering Support
From: Doug Ledford @ 2019-08-22 15:29 UTC (permalink / raw)
To: Leon Romanovsky
Cc: Jason Gunthorpe, RDMA mailing list, Mark Bloch, Mark Zhang,
Saeed Mahameed, linux-netdev
In-Reply-To: <20190821140204.GG4459@mtr-leonro.mtl.com>
[-- Attachment #1: Type: text/plain, Size: 1197 bytes --]
On Wed, 2019-08-21 at 17:02 +0300, Leon Romanovsky wrote:
> On Tue, Aug 20, 2019 at 01:54:59PM -0400, Doug Ledford wrote:
> > On Mon, 2019-08-19 at 14:36 +0300, Leon Romanovsky wrote:
> > > From: Leon Romanovsky <leonro@mellanox.com>
> > >
> > > Hi,
> > >
> > > This series from Mark extends mlx5 with RDMA_RX RoCE flow steering
> > > support
> > > for DEVX and QP objects.
> > >
> > > Thanks
> > >
> > > Mark Zhang (3):
> > > net/mlx5: Add per-namespace flow table default miss action
> > > support
> > > net/mlx5: Create bypass and loopback flow steering namespaces
> > > for
> > > RDMA
> > > RX
> > > RDMA/mlx5: RDMA_RX flow type support for user applications
> >
> > I have no objection to this series.
>
> Thanks, first two patches were applied to mlx5-next
>
> e6806e9a63a7 net/mlx5: Create bypass and loopback flow steering
> namespaces for RDMA RX
> f66ad830b114 net/mlx5: Add per-namespace flow table default miss
> action support
mlx5-next merged into for-next, final patch applied, thanks.
--
Doug Ledford <dledford@redhat.com>
GPG KeyID: B826A3330E572FDD
Fingerprint = AE6B 1BDA 122B 23B4 265B 1274 B826 A333 0E57 2FDD
[-- Attachment #2: This is a digitally signed message part --]
[-- Type: application/pgp-signature, Size: 833 bytes --]
^ permalink raw reply
* Re: [RFC bpf-next 4/5] iproute2: Allow compiling against libbpf
From: Toke Høiland-Jørgensen @ 2019-08-22 15:28 UTC (permalink / raw)
To: Daniel Borkmann, Stephen Hemminger, Alexei Starovoitov
Cc: Martin KaFai Lau, Song Liu, Yonghong Song, David Miller,
Jesper Dangaard Brouer, netdev, bpf, andrii.nakryiko
In-Reply-To: <2cf18fed-f2ec-bea3-658e-07ba8124148a@iogearbox.net>
Daniel Borkmann <daniel@iogearbox.net> writes:
> On 8/22/19 3:38 PM, Toke Høiland-Jørgensen wrote:
>> Daniel Borkmann <daniel@iogearbox.net> writes:
>>> On 8/22/19 2:04 PM, Toke Høiland-Jørgensen wrote:
>>>> Daniel Borkmann <daniel@iogearbox.net> writes:
>>>>> On 8/22/19 12:43 PM, Toke Høiland-Jørgensen wrote:
>>>>>> Daniel Borkmann <daniel@iogearbox.net> writes:
>>>>>>> On 8/20/19 1:47 PM, Toke Høiland-Jørgensen wrote:
>>>>>>>> This adds a configure check for libbpf and renames functions to allow
>>>>>>>> lib/bpf.c to be compiled with it present. This makes it possible to
>>>>>>>> port functionality piecemeal to use libbpf.
>>>>>>>>
>>>>>>>> Signed-off-by: Toke Høiland-Jørgensen <toke@redhat.com>
>>>>>>>> ---
>>>>>>>> configure | 16 ++++++++++++++++
>>>>>>>> include/bpf_util.h | 6 +++---
>>>>>>>> ip/ipvrf.c | 4 ++--
>>>>>>>> lib/bpf.c | 33 +++++++++++++++++++--------------
>>>>>>>> 4 files changed, 40 insertions(+), 19 deletions(-)
>>>>>>>>
>>>>>>>> diff --git a/configure b/configure
>>>>>>>> index 45fcffb6..5a89ee9f 100755
>>>>>>>> --- a/configure
>>>>>>>> +++ b/configure
>>>>>>>> @@ -238,6 +238,19 @@ check_elf()
>>>>>>>> fi
>>>>>>>> }
>>>>>>>>
>>>>>>>> +check_libbpf()
>>>>>>>> +{
>>>>>>>> + if ${PKG_CONFIG} libbpf --exists; then
>>>>>>>> + echo "HAVE_LIBBPF:=y" >>$CONFIG
>>>>>>>> + echo "yes"
>>>>>>>> +
>>>>>>>> + echo 'CFLAGS += -DHAVE_LIBBPF' `${PKG_CONFIG} libbpf --cflags` >> $CONFIG
>>>>>>>> + echo 'LDLIBS += ' `${PKG_CONFIG} libbpf --libs` >>$CONFIG
>>>>>>>> + else
>>>>>>>> + echo "no"
>>>>>>>> + fi
>>>>>>>> +}
>>>>>>>> +
>>>>>>>> check_selinux()
>>>>>>>
>>>>>>> More of an implementation detail at this point in time, but want to
>>>>>>> make sure this doesn't get missed along the way: as discussed at
>>>>>>> bpfconf [0] best for iproute2 to handle libbpf support would be the
>>>>>>> same way of integration as pahole does, that is, to integrate it via
>>>>>>> submodule [1] to allow kernel and libbpf features to be in sync with
>>>>>>> iproute2 releases and therefore easily consume extensions we're adding
>>>>>>> to libbpf to aide iproute2 integration.
>>>>>>
>>>>>> I can sorta see the point wrt keeping in sync with kernel features. But
>>>>>> how will this work with distros that package libbpf as a regular
>>>>>> library? Have you guys given up on regular library symbol versioning for
>>>>>> libbpf?
>>>>>
>>>>> Not at all, and I hope you know that. ;-)
>>>>
>>>> Good! Didn't really expect you had, just checking ;)
>>>>
>>>>> The reason I added lib/bpf.c integration into iproute2 directly back
>>>>> then was exactly such that users can start consuming BPF for tc and
>>>>> XDP via iproute2 /everywhere/ with only a simple libelf dependency
>>>>> which is also available on all distros since pretty much forever. If
>>>>> it was an external library, we could have waited till hell freezes
>>>>> over and initial distro adoption would have pretty much taken forever:
>>>>> to pick one random example here wrt the pace of some downstream
>>>>> distros [0]. The main rationale is pretty much the same as with added
>>>>> kernel features that land complementary iproute2 patches for that
>>>>> kernel release and as libbpf is developed alongside it is reasonable
>>>>> to guarantee user expectations that iproute2 released for kernel
>>>>> version x can make use of BPF features added to kernel x with same
>>>>> loader support from x.
>>>>
>>>> Well, for iproute2 I would expect this to be solved by version
>>>> dependencies. I.e. iproute2 version X would depend on libbpf version Y+
>>>> (like, I dunno, the version of libbpf included in the same kernel source
>>>> tree as the kernel version iproute2 is targeting? :)).
>>>
>>> This sounds nice in theory, but from what I've seen major (!) distros
>>> already seem to have a hard time releasing kernel x along with iproute2
>>> package x, concrete example was that distro kernel was on 4.13 and its
>>> official iproute2 package on 4.9,
>>
>> If the iproute2 package is not being updated at all I don't really see
>> how it would make any difference whether libbpf is vendored or not? :)
>>
>>> adding yet another variable that needs to be in sync with kernel is
>>> simply impractical especially for a _core_ package like iproute2 that
>>> should have as little dependencies as possible. I also don't want to
>>> make a bet on whether libbpf will be available on every distro that
>>> also ships iproute2. Hence approach that pahole (and also bcc by the
>>> way) takes is most reasonable to have the best user experience.
>>
>> Most users are going to get iproute2 from their distro packages anyway,
>> so if distros are incompetent at packaging, my bet is that you're going
>> to run into issues one way or another.
>>
>> But OK, if you think it is easier to work around bad distros by
>> vendoring, you guys are the maintainers, so that's up to you. But can we
>> at least put in the version dependency and let the build system pick up
>> a system libbpf if it exists and is compatible? That way distros that
>> *are* competent can still link it dynamically...
>
> Yeah that would be fine by me to use this as a fallback, and I think that
> iproute2's configure script should be able to easily handle this
> situation.
Cool, I can live with that :)
-Toke
^ permalink raw reply
* RFC: very rough draft of a bpf permission model
From: Andy Lutomirski @ 2019-08-22 15:17 UTC (permalink / raw)
To: Andy Lutomirski
Cc: Daniel Borkmann, Alexei Starovoitov, Song Liu, Kees Cook,
Networking, bpf, Alexei Starovoitov, Kernel Team, Lorenz Bauer,
Jann Horn, Greg KH, Linux API, LSM List, Chenbo Feng
In-Reply-To: <CALCETrUWQbPK3Pc6P5i_UqHPXJmZVyvuYXfq+VRtD6A3emaRhw@mail.gmail.com>
BPF security strawman, v0.1
This is very rough. Most of this, especially the API details, needs
work before it's ready to implement. The whole concept also needs
review.
= Goals =
The overall goal is to make it possible to use eBPF without having
what is effectively administrator access. For example, an eBPF user
should not be able to directly tamper with other processes (unless
this permission is explicitly granted) and should not be able to
read or write other users' eBPF maps.
It should be possible to use eBPF inside a user namespace without breaking
the userns security model.
Due to the risk of speculation attacks and such being carried out via
eBPF, it should not become possible to use too much of eBPF without the
administrator's permission. (NB: it is already possible to use
*classic* BPF without any permission, and classic BPF is translated
internally to eBPF, so this goal can only be met to a limited extent.)
= Definitions =
Global capability: A capability bit in the caller's effective mask, so
long as the caller is in the root user namespace. Tasks in non-root
user namespaces never have global capabilibies. This is what capable()
checks.
Namespace capability: A capability over a specific user namespace.
Tasks in a user namespace have all the capabilities in their effective
mask over their user namespace. A namespace capability generally
indicates that the capability applies to the user namespace itself and
to all non-user namespaces that live in the user namespace. For
example, CAP_NET_ADMIN means that you can configure all networks
namespaces in the current user namespace. This is what ns_capable()
checks.
Anything that requires a global capability will not work in a non-root
user namespace.
= unprivileged_bpf_disabled =
Nothing in here supercedes unprivileged_bpf_disabled. If
unprivileged_bpf_disabled = 1, then these proposals should not allow anything
that is disallowed today. The idea is to make unprivileged_bpf_disabled=0
both safer and more useful.
= Test runs =
Global CAP_SYS_ADMIN is needed to test-run a program. Test-running a program
exposes its own attack surface. It's also the only way to run a program at
all if you merely have permission to load the program but not to attach it
anywhere. Some of the proposed changes below will make it possible to load
most program types without
= Access to programs and maps =
There are two basic security concerns when accessing programs and maps:
the attack surface against the kernel and the ability to access other
people's maps.
Unprivileged processes may read a map if they have an FMODE_READ descriptor
for the map. Unprivileged processes may write a map if they have an
FMODE_WRITE descriptor to the map. Unprivileged processes may open a
persistent map with a mode consistent with the permissions in bpffs.
Unprivileged processes may create a bpffs inode for an existing map
if the have an RW file descriptor for the map. (This is a change to
current behavior. Daniel, Alexei thought the current behavior was
intentional. Do you recall whether this is the case?)
The _BY_ID map APIs inherently have no concept of ownership of maps. These
APIs will continue to require global CAP_SYS_ADMIN.
The small number of things that currently require the _BY_ID APIs, e.g.,
reading maps of maps, can be addressed if needed with new APIs that
return fds instead of ids. Otherwise using them will continue to require
global capabilities.
Unprivileged processes may create exactly the set of maps that they can
create today. Future proposals may extend this by a variety of means;
this current proposal makes no changes.
= Program loading =
Loading a program carries the following risks:
- It exposes the attack surface in the program verifier itself. That is
possible, although unlikely, that merely verifying a malicious program
could crash or otherwise cause a kernel malfunction.
- It exposes the attack surface of insufficient checks in the verifier.
That is, a verifier bug could allow a malicious program that is dangerous
when run.
- It exposes all of the functions that the program type can call.
Some functions, e.g. bpf_probe_read(), should require privilege to call.
- It exposes resource attacks. Currently, privileged users can load programs
that use more resources than unprivileged users can load.
- It exposes pointer-to-integer conversions. This requires global
capabilities.
- The program could contain speculation attack gadgets.
- Loading a program is a prerequisite to attaching the program.
I propose the following:
Flag functions that require privilege as such. Loading a program that calls
such a function will require a global capability. The privileged functions are
mainly used for tracing, I think, and kernel tracing should require global
capabilities.
Loading a program that uses privileged verifier features (function calls or
pointer-to-integer-conversions) will continue to require privileges.
Loading a function that uses excessive resources can continue to require
global capabilities or it could use a new set of cgroup settings that
adjust the bpf complexity limits.
Loading a function that bypasses the various speculation attack hardening
features (e.g. constant blinding) requires global capabilities.
Other than this, bpf program types can have a new flag set to allow
them to be loaded without any privileges. Some bpf program types
may need additional care, e.g. perf bpf events. They can be attached
without privilege even in current kernels, and this might need to change.
(optional) Add an API to load a program where the program source comes from a
file specified by id instead of in memory. This would allow LSMs to require
that bpf() programs be appropriately labeled. If LSMs require use of this
API, it will make it much harder to exploit the verifier or speculation bugs.
As a possible future extension, a way to selectively grant the ability to
use specific program types without privilege could be useful. This
could be done with a cgroup option, for example.
= Cgroup attach =
Cgroups have their own hierarchy that does not necessarily follow the
namespace hierarchy. Unless cgroups integrate with namespaces in ways
that they currently don't, namespace capabilites cannot be used to grant
permission to operate on cgroups.
I propose that attaching and detaching bpf programs to cgroups use a
permission model similar to the model for changing non-bpf cgroup
settings. In particular, each bpf_attach_type will get a new file in a
cgroup directory. So there will be
/sys/fs/cgroup/cgroup_name/bpf.inet_ingress, bpf.inet_egress, etc.
A new API will be added to bpf() to attach and detach programs. The new
API will take an fd to the bpf.attach_type file instead of to the cgroup
directory. It will require FMODE_WRITE. This API will *not* require
any capability.
To prevent anyone with a delegated cgroup from automatically being
able to use all bpf program types, the new bpf.attach_type files
will be opt-in as part of the hierarchy. This could be done by writing
"+bpf.*" or "+bpf.inet_ingress" to cgroup.subtree_control to make
all the bpf.attach_type files or just bpf.inet_ingress available
in descendents of the cgroup in question. This could alternatively
be a new bpf.subtree_control file if that seems better.
The result of these changes will be that root can use the old
attach API or the new attach API. Unprivileged programs cannot
use the old attach API. Unprivileged programs can use the new
attach API if they are explicitly granted permission by all their
ancestor cgroup managers.
= Additional mitigations =
Optional: there may be cases where a user can load a bpf program
but can't attach or otherwise execute it. Nonetheless, it's plausible
that such a program could be speculatively executed. The kernel could
mitigate this by only marking a JITted bpf program executable when it
is first attached or test-run.
^ permalink raw reply
* Re: [PATCH v2 bpf-next 1/4] bpf: unprivileged BPF access via /dev/bpf
From: Andy Lutomirski @ 2019-08-22 15:16 UTC (permalink / raw)
To: Daniel Borkmann
Cc: Andy Lutomirski, Alexei Starovoitov, Song Liu, Kees Cook,
Networking, bpf, Alexei Starovoitov, Kernel Team, Lorenz Bauer,
Jann Horn, Greg KH, Linux API, LSM List, Chenbo Feng
In-Reply-To: <98fee747-795a-ff10-fa98-10ddb5afcc03@iogearbox.net>
On Thu, Aug 22, 2019 at 7:17 AM Daniel Borkmann <daniel@iogearbox.net> wrote:
>
> On 8/7/19 7:24 AM, Andy Lutomirski wrote:
> > On Mon, Aug 5, 2019 at 6:11 PM Alexei Starovoitov
> > <alexei.starovoitov@gmail.com> wrote:
> >> On Mon, Aug 05, 2019 at 02:25:35PM -0700, Andy Lutomirski wrote:
> >>> It tries to make the kernel respect the access modes for fds. Without
> >>> this patch, there seem to be some holes: nothing looked at program fds
> >>> and, unless I missed something, you could take a readonly fd for a
> >>> program, pin the program, and reopen it RW.
> >>
> >> I think it's by design. iirc Daniel had a use case for something like this.
> >
> > That seems odd. Daniel, can you elaborate?
>
> [ ... catching up late. ]
>
> Not from my side, the change was added by Chenbo back then for Android
> use-case to replace xt_qtaguid and xt_owner with BPF programs and to
> allow unprivileged applications to read maps. More on their architecture:
>
> https://source.android.com/devices/tech/datausage/ebpf-traffic-monitor
>
> From the cover-letter:
>
> [...]
> The network-control daemon (netd) creates and loads an eBPF object for
> network packet filtering and analysis. It passes the object FD to an
> unprivileged network monitor app (netmonitor), which is not allowed to
> create, modify or load eBPF objects, but is allowed to read the traffic
> stats from the map.
> [...]
I suspect that this use case is, in fact, mostly broken in current
kernels. An unprivileged process with a read-only fd to a bpf map can
BPF_OBJ_PIN the map and then re-open it read-write. As far as I can
tell, the only thing mitigating this is that it won't work unless the
attacker has write access to some directory in bpffs.
> > Trusted by whom? In a non-nested container, the container manager
> > *might* be trusted by the outside world. In a *nested* container,
> > unless the inner container management is controlled from outside the
> > outer container, it's not trusted. I don't know much about how
> > Facebook's containers work, but the LXC/LXD/Podman world is moving
> > very strongly toward user namespaces and maximally-untrusted
> > containers, and I think bpf() should work in that context.
>
> [...] and if we opt-in with CAP_NET_ADMIN, for example, then it should
> ideally be possible for that container to install BPF programs for
> mangling, dropping, forwarding etc as long as it's only affecting it's
> /own/ netns like the rest of networking subsystem controls that work
> in that case. I would actually like to get to this at some point and
> make it more approachable as long as there is a way for an admin to
> /opt into it/ via policy (aka not by default).
For better or for worse, I think this would need a massive
re-architecting of the way bpf filtering works. bpf filters attach to
cgroups, which aren't scoped to network namespaces at all. So we need
a different permission model.
> Thinking out loud, I'd
> love some sort of a hybrid, that is, a mixture of CAP_BPF_ADMIN and
> customizable seccomp policy. Meaning, there could be several CAP_BPF
> type sub-policies e.g. from allowing everything (equivalent to the
> /dev/bpf on/off handle or CAP_SYS_ADMIN we have today) down to
> programmable user defined policy that can be tailored to specific
> needs like granting apps to BPF_OBJ_GET and BPF_MAP_LOOKUP elements
> or granting to load+mangle a specific subset of maps (e.g. BPF_MAP_TYPE_{ARRAY,
> HASH,LRU_HASH,LPM_TRIE}) and prog types (...) plus attaching them to
> their own netns, and if that is untrusted, then same restrictions/
> mitigations could be done by the verifier as with (current) unprivileged
> BPF, enabled via programmable policy as well. We wouldn't make any
> static/fixed assumptions, but allow users to define them based on their
> own use-cases. Haven't looked how feasible this would be, but something
> to take into consideration when we rework the current [admittedly
> suboptimal all-or-nothing] model we have. Is this something you had in
> mind as well for your wip proposal, Andy?
>
Hmm. Fine-grained seccomp stuff like this is very much in scope for
the seccomp discussion that's happening at LPC this year.
Unfortunately, I'm not there, but I'm participating via the mailing
list.
I also finally finished typing a very rough draft of my bpf ideas.
I'll email them out momentarily in a separate email. I think it
should come fairly close to doing what you want.
^ permalink raw reply
* Re: [PATCH net-next v2 1/3] net: ethernet: mediatek: Add basic PHYLINK support
From: René van Dorst @ 2019-08-22 15:11 UTC (permalink / raw)
To: Russell King - ARM Linux admin
Cc: John Crispin, Sean Wang, Nelson Chang, David S . Miller,
Matthias Brugger, Frank Wunderlich, netdev, linux-mips,
linux-mediatek, Stefan Roese, linux-arm-kernel
In-Reply-To: <20190822142739.GS13294@shell.armlinux.org.uk>
Hi Russell,
Quoting Russell King - ARM Linux admin <linux@armlinux.org.uk>:
> On Wed, Aug 21, 2019 at 04:43:34PM +0200, René van Dorst wrote:
>> +static void mtk_mac_link_down(struct phylink_config *config,
>> unsigned int mode,
>> + phy_interface_t interface)
>> +{
>> + struct mtk_mac *mac = container_of(config, struct mtk_mac,
>> + phylink_config);
>>
>> - return 0;
>> + mtk_w32(mac->hw, MAC_MCR_FORCE_LINK_DOWN, MTK_MAC_MCR(mac->id));
>> }
>
> You set the MAC_MCR_FORCE_MODE bit here...
>
>> +static void mtk_mac_link_up(struct phylink_config *config,
>> unsigned int mode,
>> + phy_interface_t interface,
>> + struct phy_device *phy)
>> {
>> + struct mtk_mac *mac = container_of(config, struct mtk_mac,
>> + phylink_config);
>> + u32 mcr = mtk_r32(mac->hw, MTK_MAC_MCR(mac->id));
>>
>> + mcr |= MAC_MCR_TX_EN | MAC_MCR_RX_EN;
>> + mtk_w32(mac->hw, mcr, MTK_MAC_MCR(mac->id));
>> +}
>
> Looking at this, a link_down() followed by a link_up() would result in
> this register containing MAC_MCR_FORCE_MODE | MAC_MCR_TX_EN |
> MAC_MCR_RX_EN ? Is that actually correct? (MAC_MCR_FORCE_LINK isn't
> set, so it looks to me like it still forces the link down.)
Thanks for reviewing.
Probably not.
I assumed that mac_config() is always called before link_up()
I simply can make it the opposite of link_up()
like this:
static void mtk_mac_link_down(struct phylink_config *config, unsigned
int mode,
phy_interface_t interface)
{
struct mtk_mac *mac = container_of(config, struct mtk_mac,
phylink_config);
u32 mcr = mtk_r32(mac->hw, MTK_MAC_MCR(mac->id));
mcr &= (MAC_MCR_TX_EN | MAC_MCR_RX_EN);
mtk_w32(mac->hw, mcr, MTK_MAC_MCR(mac->id));
}
>
> Note that link up/down forcing should not be done for in-band AN.
>
This means that mac_config() of the SGMII patch is also incorrect?
mac_config() always sets the MAC in a force mode.
But the SGMII block is set in AN.
Mainline code seems to do the same.
Puts the SGMII block in AN or forced mode and always set the MAC in
forced mode.
>> +static void mtk_validate(struct phylink_config *config,
>> + unsigned long *supported,
>> + struct phylink_link_state *state)
>> +{
>> + struct mtk_mac *mac = container_of(config, struct mtk_mac,
>> + phylink_config);
>> + __ETHTOOL_DECLARE_LINK_MODE_MASK(mask) = { 0, };
>>
>> + if (state->interface != PHY_INTERFACE_MODE_NA &&
>> + state->interface != PHY_INTERFACE_MODE_MII &&
>> + state->interface != PHY_INTERFACE_MODE_GMII &&
>> + !(MTK_HAS_CAPS(mac->hw->soc->caps, MTK_RGMII) &&
>> + phy_interface_mode_is_rgmii(state->interface)) &&
>> + !(MTK_HAS_CAPS(mac->hw->soc->caps, MTK_TRGMII) &&
>> + !mac->id && state->interface == PHY_INTERFACE_MODE_TRGMII)) {
>> + linkmode_zero(supported);
>> + return;
>> }
>>
>> + phylink_set_port_modes(mask);
>> + phylink_set(mask, Autoneg);
>>
>> + if (state->interface == PHY_INTERFACE_MODE_TRGMII) {
>> + phylink_set(mask, 1000baseT_Full);
>> + } else {
>> + phylink_set(mask, 10baseT_Half);
>> + phylink_set(mask, 10baseT_Full);
>> + phylink_set(mask, 100baseT_Half);
>> + phylink_set(mask, 100baseT_Full);
>> +
>> + if (state->interface != PHY_INTERFACE_MODE_MII) {
>> + phylink_set(mask, 1000baseT_Half);
>> + phylink_set(mask, 1000baseT_Full);
>> + phylink_set(mask, 1000baseX_Full);
>> + }
>> + }
>>
>> + phylink_set(mask, Pause);
>> + phylink_set(mask, Asym_Pause);
>>
>> + linkmode_and(supported, supported, mask);
>> + linkmode_and(state->advertising, state->advertising, mask);
>> }
>
> This looks fine.
OK.
> Thanks.
>
> --
> RMK's Patch system: https://www.armlinux.org.uk/developer/patches/
> FTTC broadband for 0.8mile line in suburbia: sync at 12.1Mbps down 622kbps up
> According to speedtest.net: 11.9Mbps down 500kbps up
Greats,
René
^ permalink raw reply
* Re: [PATCH spi for-5.4 0/5] Deterministic SPI latency with NXP DSPI driver
From: Vladimir Oltean @ 2019-08-22 14:58 UTC (permalink / raw)
To: Richard Cochran
Cc: Mark Brown, Hubert Feurstein, Miroslav Lichvar, Andrew Lunn,
Florian Fainelli, linux-spi, netdev
In-Reply-To: <20190822141641.GB1437@localhost>
On Thu, 22 Aug 2019 at 17:16, Richard Cochran <richardcochran@gmail.com> wrote:
>
> On Wed, Aug 21, 2019 at 11:17:23PM +0300, Vladimir Oltean wrote:
> > Of course PPS with a dedicated hardware receiver that can take input
> > compare timestamps is always preferable. However non-Ethernet
> > synchronization in the field looks to me like "make do with whatever
> > you can". I'm not sure a plain GPIO that raises an interrupt is better
> > than an interrupt-driven serial protocol controller - it's (mostly)
> > the interrupts that throw off the precision of the software timestamp.
> > And use Miroslav's pps-gpio-poll module and you're back from where you
> > started (try to make a sw timestamp as precise as possible).
>
> Right, it might be better, might not. You can consider hacking a
> local time stamp into the ISR. Also, if one of your MACs has a input
> event pin, you can feed the switch's PPS output in there.
>
> > wouldn't be my first choice. But DSA could have that built-in, and
> > with the added latency benefit of a MAC-to-MAC connection.
> > Too bad the mv88e6xxx driver can't do loopback timestamping, that's
> > already 50% of the DSA drivers that support PTP at all. An embedded
> > solution for this is less compelling now.
>
> Let me back track on my statement about mv88e6xxx. At the time, I
> didn't see any practical way to use the CPU port for synchronization,
> but I forget exactly the details. Maybe it is indeed possible,
> somehow. If you can find a way that will work on your switch and on
> the Marvell, then I'd like to hear about it.
>
> Thinking back...
>
> One problem is this. PTP requires a delay measurement. You can send
> a delay request from the host, but there will never be a reply.
>
I don't think I understand the problem here.
You need to think about this as a sort of degenerate PTP where the
master and the slave are under the same device's management, not the
full stack. I never actually said I want to make ptp4l work over the
CPU port.
So instead of the typical:
Master (device A) Slave (device B)
| | Tstamps known
t1 |------\ Sync | to slave
| \-----\ |
| \-----\ |
| \----->| t2 t2
|------\ Follow_up |
| \-----\ |
| \-----\ |
| \----->| t1 t1, t2
| |
| Delay_req /------| t3 t1, t2, t3
| /-----/ |
| /-----/ |
t4 |<-----/ |
| |
|------\ Follow_up |
| \-----\ |
| \-----\ |
| \----->| t1, t2, t3, t4
| |
| |
| |
| |
v time v
You'd have something like this:
Master (DSA master port) Slave (switch CPU port)
| | Tstamps known
| | to slave
| Local_sync_req |
t1 |------\ | t1
| \-----\ |
| \-----\ |
| \----->| t2 t1, t2
| |
| Local_sync_resp /------| t3 t1, t2, t3
| /-----/ |
| /-----/ |
t4 |<-----/ | t1, t2, t3, t4
| |
| |
v time v
As far as I understand PTP, the other messages are just protocol
blah-blah because the slave doesn't know what the master knows, which
is clearly not applicable here.
t1, t2, t3, t4 still keep the same definitions though (master TX
timestamp, slave RX timestamp, slave TX timestamp, master RX
timestamp).
I'm 90% certain the sja1105 can take an RX timestamp for a management
frame (e.g. one for which a TX timestamp was requested) sent in
loopback.
> Another problem is this. A Sync message arriving on an external port
> is time stamped there, but then it is encapsulated as a tagged DSA
> management message and delivered out the CPU port. At this point, it
> is no longer a PTP frame and will not be time stamped at the CPU port
> on egress.
>
But you don't mean a TX timestamp at the egress of swp4 here, do you?
+---------------------------------+
| Management CPU |
| |
| DSA master |
| +-----+ |
| | | |
| | | |
+-------------+-----+-------------+
eth0 ^ RX tstamp
|
| TX tstamp
(swp4) CPU port
+-------------+-----+-------------+
| Switch | | |
| | | |
| +-----+ |
| ^ T |
| | |
| +-----------+ |
| | |
| | |
| +-----+ +-----+ +-----+ +-----+ |
| | | | | | | | | |
| | | | | | | | | |
+-+-----+-+-----+-+-----+-+-----+-+
swp0 swp1 swp2 swp3
^
| RX tstamp
Why would that matter?
Or just the RX timestamp at eth0?
If you mean the latter, then yes, having HWTSTAMP_FILTER_ALL in the
rx_filter of the DSA master is a hard requirement.
> Thanks,
> Richard
^ permalink raw reply
* Re: [PATCH spi for-5.4 0/5] Deterministic SPI latency with NXP DSPI driver
From: Andrew Lunn @ 2019-08-22 14:56 UTC (permalink / raw)
To: Richard Cochran
Cc: Vladimir Oltean, Mark Brown, Hubert Feurstein, Miroslav Lichvar,
Florian Fainelli, linux-spi, netdev
In-Reply-To: <20190822141641.GB1437@localhost>
> Thinking back...
>
> One problem is this. PTP requires a delay measurement. You can send
> a delay request from the host, but there will never be a reply.
>
> Another problem is this. A Sync message arriving on an external port
> is time stamped there, but then it is encapsulated as a tagged DSA
> management message and delivered out the CPU port. At this point, it
> is no longer a PTP frame and will not be time stamped at the CPU port
> on egress.
I think so that both the host interface and the CPU port recognize the
frame and time stamp it, it needs to be untagged. Otherwise, as you
said, the hardware does not recognise it. I've never tried sending
untagged frames to the CPU port. I expect they are just dropped.
However, somebody might want to play with the TCAM. The TCAM can
redirect a packet out any port. I've no idea what the pipeline
ordering is, but it might be possible for the TCAM to redirect a frame
back to the host interface, before it gets dropped because it does not
have DSA tags? But is the TCAM before or after PTP in the pipeline?
Could you then get 4 timestamps for the same frame? Host egress,
switch ingress, switch egress, host ingress?
But how do you make this generic? Can other switches also loop a frame
back like this and do the same time stamping? How do you actually get
access to these time stamps split over two blocks of hardware?
So in theory, this might be possible, but in practice?
Andrew
^ permalink raw reply
* Re: WARNING in rollback_registered_many (2)
From: Andrey Konovalov @ 2019-08-22 14:54 UTC (permalink / raw)
To: syzbot, USB list
Cc: Larry Finger, avagin, David S. Miller, devel, Eric W . Biederman,
Eric Dumazet, florian.c.schilhabel, Greg Kroah-Hartman,
Kai Heng Feng, ktkhai, LKML, netdev, straube.linux,
syzkaller-bugs, tyhicks, Matthew Wilcox, Oliver Neukum,
Alan Stern
In-Reply-To: <CAAeHK+y-2DZ1sWUE5bESrd=dUAaGrHXzR5+gFJFgiAaWo+D2dw@mail.gmail.com>
On Thu, Aug 22, 2019 at 3:07 PM Andrey Konovalov <andreyknvl@google.com> wrote:
>
> On Wed, Aug 7, 2019 at 4:03 PM Andrey Konovalov <andreyknvl@google.com> wrote:
> >
> > On Fri, Apr 12, 2019 at 1:32 PM Andrey Konovalov <andreyknvl@google.com> wrote:
> > >
> > > On Fri, Apr 12, 2019 at 1:29 AM syzbot
> > > <syzbot+40918e4d826fb2ff9b96@syzkaller.appspotmail.com> wrote:
> > > >
> > > > syzbot has found a reproducer for the following crash on:
> > > >
> > > > HEAD commit: 9a33b369 usb-fuzzer: main usb gadget fuzzer driver
> > > > git tree: https://github.com/google/kasan/tree/usb-fuzzer
> > > > console output: https://syzkaller.appspot.com/x/log.txt?x=10d552b7200000
> > > > kernel config: https://syzkaller.appspot.com/x/.config?x=23e37f59d94ddd15
> > > > dashboard link: https://syzkaller.appspot.com/bug?extid=40918e4d826fb2ff9b96
> > > > compiler: gcc (GCC) 9.0.0 20181231 (experimental)
> > > > syz repro: https://syzkaller.appspot.com/x/repro.syz?x=17a4c1af200000
> > > > C reproducer: https://syzkaller.appspot.com/x/repro.c?x=121b274b200000
> > > >
> > > > IMPORTANT: if you fix the bug, please add the following tag to the commit:
> > > > Reported-by: syzbot+40918e4d826fb2ff9b96@syzkaller.appspotmail.com
> > > >
> > > > usb 1-1: r8712u: MAC Address from efuse = 00:e0:4c:87:00:00
> > > > usb 1-1: r8712u: Loading firmware from "rtlwifi/rtl8712u.bin"
> > > > usb 1-1: USB disconnect, device number 2
> > > > usb 1-1: Direct firmware load for rtlwifi/rtl8712u.bin failed with error -2
> > > > usb 1-1: r8712u: Firmware request failed
> > > > WARNING: CPU: 0 PID: 575 at net/core/dev.c:8152
> > > > rollback_registered_many+0x1f3/0xe70 net/core/dev.c:8152
> > > > Kernel panic - not syncing: panic_on_warn set ...
> > > > CPU: 0 PID: 575 Comm: kworker/0:4 Not tainted 5.1.0-rc4-319354-g9a33b36 #3
> > > > Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS
> > > > Google 01/01/2011
> > > > Workqueue: usb_hub_wq hub_event
> > > > Call Trace:
> > > > __dump_stack lib/dump_stack.c:77 [inline]
> > > > dump_stack+0xe8/0x16e lib/dump_stack.c:113
> > > > panic+0x29d/0x5f2 kernel/panic.c:214
> > > > __warn.cold+0x20/0x48 kernel/panic.c:571
> > > > report_bug+0x262/0x2a0 lib/bug.c:186
> > > > fixup_bug arch/x86/kernel/traps.c:179 [inline]
> > > > fixup_bug arch/x86/kernel/traps.c:174 [inline]
> > > > do_error_trap+0x130/0x1f0 arch/x86/kernel/traps.c:272
> > > > do_invalid_op+0x37/0x40 arch/x86/kernel/traps.c:291
> > > > invalid_op+0x14/0x20 arch/x86/entry/entry_64.S:973
> > > > RIP: 0010:rollback_registered_many+0x1f3/0xe70 net/core/dev.c:8152
> > > > Code: 05 00 00 31 ff 44 89 fe e8 5a 15 f3 f4 45 84 ff 0f 85 49 ff ff ff e8
> > > > 1c 14 f3 f4 0f 1f 44 00 00 e8 12 14 f3 f4 e8 0d 14 f3 f4 <0f> 0b 4c 89 e7
> > > > e8 33 72 f2 f6 31 ff 41 89 c4 89 c6 e8 27 15 f3 f4
> > > > RSP: 0018:ffff88809d087698 EFLAGS: 00010293
> > > > RAX: ffff88809d058000 RBX: ffff888096240000 RCX: ffffffff8c7eb146
> > > > RDX: 0000000000000000 RSI: ffffffff8c7eb163 RDI: 0000000000000001
> > > > RBP: ffff88809d0877c8 R08: ffff88809d058000 R09: fffffbfff2708111
> > > > R10: fffffbfff2708110 R11: ffffffff93840887 R12: ffff888096240070
> > > > R13: dffffc0000000000 R14: ffff88809d087758 R15: 0000000000000000
> > > > rollback_registered+0xf7/0x1c0 net/core/dev.c:8228
> > > > unregister_netdevice_queue net/core/dev.c:9275 [inline]
> > > > unregister_netdevice_queue+0x1dc/0x2b0 net/core/dev.c:9268
> > > > unregister_netdevice include/linux/netdevice.h:2655 [inline]
> > > > unregister_netdev+0x1d/0x30 net/core/dev.c:9316
> > > > r871xu_dev_remove+0xe7/0x223 drivers/staging/rtl8712/usb_intf.c:604
> > > > usb_unbind_interface+0x1c9/0x980 drivers/usb/core/driver.c:423
> > > > __device_release_driver drivers/base/dd.c:1082 [inline]
> > > > device_release_driver_internal+0x436/0x4f0 drivers/base/dd.c:1113
> > > > bus_remove_device+0x302/0x5c0 drivers/base/bus.c:556
> > > > device_del+0x467/0xb90 drivers/base/core.c:2269
> > > > usb_disable_device+0x242/0x790 drivers/usb/core/message.c:1235
> > > > usb_disconnect+0x298/0x870 drivers/usb/core/hub.c:2197
> > > > hub_port_connect drivers/usb/core/hub.c:4940 [inline]
> > > > hub_port_connect_change drivers/usb/core/hub.c:5204 [inline]
> > > > port_event drivers/usb/core/hub.c:5350 [inline]
> > > > hub_event+0xcd2/0x3b00 drivers/usb/core/hub.c:5432
> > > > process_one_work+0x90f/0x1580 kernel/workqueue.c:2269
> > > > process_scheduled_works kernel/workqueue.c:2331 [inline]
> > > > worker_thread+0x7b0/0xe20 kernel/workqueue.c:2417
> > > > kthread+0x313/0x420 kernel/kthread.c:253
> > > > ret_from_fork+0x3a/0x50 arch/x86/entry/entry_64.S:352
> > > > Kernel Offset: disabled
> > > > Rebooting in 86400 seconds..
> > > >
> > >
> > > +linux-usb mailing list
> >
> > This USB bug is the most frequently triggered one right now with over
> > 27k kernel crashes.
>
> OK, this report is confusing. It was initially reported on the
> upstream instance a long time ago, but since then has stopped
> happening, as it was probably fixed. Then when we launched the USB
> fuzzing instance, it has started producing similarly named reports
> (with a different root cause though), and they were bucketed into this
> bug by syzkaller. I've improved parsing titles of such reports in
> syzkaller, so I'm invalidating this one, and syzbot should send a
> properly attributed USB report soon.
>
> #syz invalid
This has been reported as:
WARNING in r871xu_dev_remove
https://syzkaller.appspot.com/bug?extid=80899a8a8efe8968cde7
^ permalink raw reply
* general protection fault in sctp_inq_pop
From: syzbot @ 2019-08-22 14:53 UTC (permalink / raw)
To: davem, linux-kernel, linux-sctp, marcelo.leitner, netdev, nhorman,
syzkaller-bugs, vyasevich
Hello,
syzbot found the following crash on:
HEAD commit: 20e79a0a net: hns: add phy_attached_info() to the hns driver
git tree: net-next
console output: https://syzkaller.appspot.com/x/log.txt?x=10d6dfba600000
kernel config: https://syzkaller.appspot.com/x/.config?x=ce5e88233f2f83b0
dashboard link: https://syzkaller.appspot.com/bug?extid=4a0643a653ac375612d1
compiler: gcc (GCC) 9.0.0 20181231 (experimental)
Unfortunately, I don't have any reproducer for this crash yet.
IMPORTANT: if you fix the bug, please add the following tag to the commit:
Reported-by: syzbot+4a0643a653ac375612d1@syzkaller.appspotmail.com
kasan: CONFIG_KASAN_INLINE enabled
kasan: GPF could be caused by NULL-ptr deref or user memory access
general protection fault: 0000 [#1] PREEMPT SMP KASAN
CPU: 0 PID: 15509 Comm: syz-executor.1 Not tainted 5.3.0-rc3+ #139
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS
Google 01/01/2011
RIP: 0010:sctp_inq_pop+0x294/0xd80 net/sctp/inqueue.c:201
Code: 89 f9 48 c1 e9 03 80 3c 01 00 0f 85 e3 08 00 00 49 8d 7d 02 4d 89 6c
24 60 48 b8 00 00 00 00 00 fc ff df 48 89 f9 48 c1 e9 03 <0f> b6 0c 01 48
89 f8 83 e0 07 83 c0 01 38 c8 7c 08 84 c9 0f 85 4b
RSP: 0018:ffff888097d9ee40 EFLAGS: 00010202
RAX: dffffc0000000000 RBX: ffff8880968405d8 RCX: 0000000000000001
RDX: 0000000000005803 RSI: ffffffff86b236aa RDI: 000000000000000a
RBP: ffff888097d9ee90 R08: ffff88808fd44240 R09: fffffbfff14a914f
R10: fffffbfff14a914e R11: ffffffff8a548a77 R12: ffff888096840580
R13: 0000000000000008 R14: 0000000000000000 R15: ffff888097d9f478
FS: 00007f59f27b8700(0000) GS:ffff8880ae800000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 00007f40b77fd000 CR3: 0000000063e8e000 CR4: 00000000001406f0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
Call Trace:
sctp_endpoint_bh_rcv+0x184/0x8d0 net/sctp/endpointola.c:385
sctp_inq_push+0x1e4/0x280 net/sctp/inqueue.c:80
sctp_rcv+0x2807/0x3590 net/sctp/input.c:256
sctp6_rcv+0x17/0x30 net/sctp/ipv6.c:1049
ip6_protocol_deliver_rcu+0x2fe/0x1660 net/ipv6/ip6_input.c:397
ip6_input_finish+0x84/0x170 net/ipv6/ip6_input.c:438
NF_HOOK include/linux/netfilter.h:305 [inline]
NF_HOOK include/linux/netfilter.h:299 [inline]
ip6_input+0xe4/0x3f0 net/ipv6/ip6_input.c:447
dst_input include/net/dst.h:442 [inline]
ip6_sublist_rcv_finish+0x98/0x1e0 net/ipv6/ip6_input.c:84
ip6_list_rcv_finish net/ipv6/ip6_input.c:118 [inline]
ip6_sublist_rcv+0x80c/0xcf0 net/ipv6/ip6_input.c:282
ipv6_list_rcv+0x373/0x4b0 net/ipv6/ip6_input.c:316
__netif_receive_skb_list_ptype net/core/dev.c:5049 [inline]
__netif_receive_skb_list_core+0x1a2/0x9d0 net/core/dev.c:5087
__netif_receive_skb_list net/core/dev.c:5149 [inline]
netif_receive_skb_list_internal+0x7eb/0xe60 net/core/dev.c:5244
gro_normal_list.part.0+0x1e/0xb0 net/core/dev.c:5757
gro_normal_list net/core/dev.c:5755 [inline]
gro_normal_one net/core/dev.c:5769 [inline]
napi_frags_finish net/core/dev.c:5782 [inline]
napi_gro_frags+0xa6a/0xea0 net/core/dev.c:5855
tun_get_user+0x2e98/0x3fa0 drivers/net/tun.c:1974
tun_chr_write_iter+0xbd/0x156 drivers/net/tun.c:2020
call_write_iter include/linux/fs.h:1870 [inline]
do_iter_readv_writev+0x5f8/0x8f0 fs/read_write.c:693
do_iter_write fs/read_write.c:970 [inline]
do_iter_write+0x184/0x610 fs/read_write.c:951
vfs_writev+0x1b3/0x2f0 fs/read_write.c:1015
do_writev+0x15b/0x330 fs/read_write.c:1058
__do_sys_writev fs/read_write.c:1131 [inline]
__se_sys_writev fs/read_write.c:1128 [inline]
__x64_sys_writev+0x75/0xb0 fs/read_write.c:1128
do_syscall_64+0xfd/0x6a0 arch/x86/entry/common.c:296
entry_SYSCALL_64_after_hwframe+0x49/0xbe
RIP: 0033:0x4596e1
Code: 75 14 b8 14 00 00 00 0f 05 48 3d 01 f0 ff ff 0f 83 34 b9 fb ff c3 48
83 ec 08 e8 fa 2c 00 00 48 89 04 24 b8 14 00 00 00 0f 05 <48> 8b 3c 24 48
89 c2 e8 43 2d 00 00 48 89 d0 48 83 c4 08 48 3d 01
RSP: 002b:00007f59f27b7ba0 EFLAGS: 00000293 ORIG_RAX: 0000000000000014
RAX: ffffffffffffffda RBX: 000000000000010c RCX: 00000000004596e1
RDX: 0000000000000001 RSI: 00007f59f27b7c00 RDI: 00000000000000f0
RBP: 000000000075bf20 R08: 0000000000000000 R09: 0000000000000000
R10: 0000000000000000 R11: 0000000000000293 R12: 00007f59f27b86d4
R13: 00000000004c8783 R14: 00000000004df5a0 R15: 00000000ffffffff
Modules linked in:
---[ end trace 4d09ea96a0c7705b ]---
RIP: 0010:sctp_inq_pop+0x294/0xd80 net/sctp/inqueue.c:201
Code: 89 f9 48 c1 e9 03 80 3c 01 00 0f 85 e3 08 00 00 49 8d 7d 02 4d 89 6c
24 60 48 b8 00 00 00 00 00 fc ff df 48 89 f9 48 c1 e9 03 <0f> b6 0c 01 48
89 f8 83 e0 07 83 c0 01 38 c8 7c 08 84 c9 0f 85 4b
RSP: 0018:ffff888097d9ee40 EFLAGS: 00010202
RAX: dffffc0000000000 RBX: ffff8880968405d8 RCX: 0000000000000001
RDX: 0000000000005803 RSI: ffffffff86b236aa RDI: 000000000000000a
RBP: ffff888097d9ee90 R08: ffff88808fd44240 R09: fffffbfff14a914f
R10: fffffbfff14a914e R11: ffffffff8a548a77 R12: ffff888096840580
R13: 0000000000000008 R14: 0000000000000000 R15: ffff888097d9f478
FS: 00007f59f27b8700(0000) GS:ffff8880ae800000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 00007f40b77fd000 CR3: 0000000063e8e000 CR4: 00000000001406f0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
---
This bug is generated by a bot. It may contain errors.
See https://goo.gl/tpsmEJ for more information about syzbot.
syzbot engineers can be reached at syzkaller@googlegroups.com.
syzbot will keep track of this bug report. See:
https://goo.gl/tpsmEJ#status for how to communicate with syzbot.
^ permalink raw reply
* [PATCH net-next] net: hns3: Fix -Wunused-const-variable warning
From: YueHaibing @ 2019-08-22 14:49 UTC (permalink / raw)
To: yisen.zhuang, salil.mehta, davem, lipeng321, tanhuazhong,
shenjian15, linyunsheng, liuzhongzhu, huangguangbin2, liweihang,
yuehaibing
Cc: netdev, linux-kernel
drivers/net/ethernet/hisilicon/hns3/hns3pf/hclge_main.h:542:30:
warning: meta_data_key_info defined but not used [-Wunused-const-variable=]
drivers/net/ethernet/hisilicon/hns3/hns3pf/hclge_main.h:553:30:
warning: tuple_key_info defined but not used [-Wunused-const-variable=]
The two variable is only used in hclge_main.c,
so just move the definition over there.
Reported-by: Hulk Robot <hulkci@huawei.com>
Signed-off-by: YueHaibing <yuehaibing@huawei.com>
---
.../ethernet/hisilicon/hns3/hns3pf/hclge_main.c | 44 ++++++++++++++++++++++
.../ethernet/hisilicon/hns3/hns3pf/hclge_main.h | 44 ----------------------
2 files changed, 44 insertions(+), 44 deletions(-)
diff --git a/drivers/net/ethernet/hisilicon/hns3/hns3pf/hclge_main.c b/drivers/net/ethernet/hisilicon/hns3/hns3pf/hclge_main.c
index 9d64c43..dde17be 100644
--- a/drivers/net/ethernet/hisilicon/hns3/hns3pf/hclge_main.c
+++ b/drivers/net/ethernet/hisilicon/hns3/hns3pf/hclge_main.c
@@ -364,6 +364,50 @@ static const enum hclge_opcode_type hclge_dfx_reg_opcode_list[] = {
HCLGE_OPC_DFX_SSU_REG_2
};
+static const struct key_info meta_data_key_info[] = {
+ { PACKET_TYPE_ID, 6},
+ { IP_FRAGEMENT, 1},
+ { ROCE_TYPE, 1},
+ { NEXT_KEY, 5},
+ { VLAN_NUMBER, 2},
+ { SRC_VPORT, 12},
+ { DST_VPORT, 12},
+ { TUNNEL_PACKET, 1},
+};
+
+static const struct key_info tuple_key_info[] = {
+ { OUTER_DST_MAC, 48},
+ { OUTER_SRC_MAC, 48},
+ { OUTER_VLAN_TAG_FST, 16},
+ { OUTER_VLAN_TAG_SEC, 16},
+ { OUTER_ETH_TYPE, 16},
+ { OUTER_L2_RSV, 16},
+ { OUTER_IP_TOS, 8},
+ { OUTER_IP_PROTO, 8},
+ { OUTER_SRC_IP, 32},
+ { OUTER_DST_IP, 32},
+ { OUTER_L3_RSV, 16},
+ { OUTER_SRC_PORT, 16},
+ { OUTER_DST_PORT, 16},
+ { OUTER_L4_RSV, 32},
+ { OUTER_TUN_VNI, 24},
+ { OUTER_TUN_FLOW_ID, 8},
+ { INNER_DST_MAC, 48},
+ { INNER_SRC_MAC, 48},
+ { INNER_VLAN_TAG_FST, 16},
+ { INNER_VLAN_TAG_SEC, 16},
+ { INNER_ETH_TYPE, 16},
+ { INNER_L2_RSV, 16},
+ { INNER_IP_TOS, 8},
+ { INNER_IP_PROTO, 8},
+ { INNER_SRC_IP, 32},
+ { INNER_DST_IP, 32},
+ { INNER_L3_RSV, 16},
+ { INNER_SRC_PORT, 16},
+ { INNER_DST_PORT, 16},
+ { INNER_L4_RSV, 32},
+};
+
static int hclge_mac_update_stats_defective(struct hclge_dev *hdev)
{
#define HCLGE_MAC_CMD_NUM 21
diff --git a/drivers/net/ethernet/hisilicon/hns3/hns3pf/hclge_main.h b/drivers/net/ethernet/hisilicon/hns3/hns3pf/hclge_main.h
index 7c28933..7ff03b9 100644
--- a/drivers/net/ethernet/hisilicon/hns3/hns3pf/hclge_main.h
+++ b/drivers/net/ethernet/hisilicon/hns3/hns3pf/hclge_main.h
@@ -539,50 +539,6 @@ struct key_info {
u8 key_length; /* use bit as unit */
};
-static const struct key_info meta_data_key_info[] = {
- { PACKET_TYPE_ID, 6},
- { IP_FRAGEMENT, 1},
- { ROCE_TYPE, 1},
- { NEXT_KEY, 5},
- { VLAN_NUMBER, 2},
- { SRC_VPORT, 12},
- { DST_VPORT, 12},
- { TUNNEL_PACKET, 1},
-};
-
-static const struct key_info tuple_key_info[] = {
- { OUTER_DST_MAC, 48},
- { OUTER_SRC_MAC, 48},
- { OUTER_VLAN_TAG_FST, 16},
- { OUTER_VLAN_TAG_SEC, 16},
- { OUTER_ETH_TYPE, 16},
- { OUTER_L2_RSV, 16},
- { OUTER_IP_TOS, 8},
- { OUTER_IP_PROTO, 8},
- { OUTER_SRC_IP, 32},
- { OUTER_DST_IP, 32},
- { OUTER_L3_RSV, 16},
- { OUTER_SRC_PORT, 16},
- { OUTER_DST_PORT, 16},
- { OUTER_L4_RSV, 32},
- { OUTER_TUN_VNI, 24},
- { OUTER_TUN_FLOW_ID, 8},
- { INNER_DST_MAC, 48},
- { INNER_SRC_MAC, 48},
- { INNER_VLAN_TAG_FST, 16},
- { INNER_VLAN_TAG_SEC, 16},
- { INNER_ETH_TYPE, 16},
- { INNER_L2_RSV, 16},
- { INNER_IP_TOS, 8},
- { INNER_IP_PROTO, 8},
- { INNER_SRC_IP, 32},
- { INNER_DST_IP, 32},
- { INNER_L3_RSV, 16},
- { INNER_SRC_PORT, 16},
- { INNER_DST_PORT, 16},
- { INNER_L4_RSV, 32},
-};
-
#define MAX_KEY_LENGTH 400
#define MAX_KEY_DWORDS DIV_ROUND_UP(MAX_KEY_LENGTH / 8, 4)
#define MAX_KEY_BYTES (MAX_KEY_DWORDS * 4)
--
2.7.4
^ permalink raw reply related
* Re: [PATCH net-next v2 2/3] net: ethernet: mediatek: Re-add support SGMII
From: Russell King - ARM Linux admin @ 2019-08-22 14:44 UTC (permalink / raw)
To: René van Dorst
Cc: John Crispin, Sean Wang, Nelson Chang, David S . Miller,
Matthias Brugger, Frank Wunderlich, netdev, linux-mips,
linux-mediatek, Stefan Roese, linux-arm-kernel
In-Reply-To: <20190821144336.9259-3-opensource@vdorst.com>
On Wed, Aug 21, 2019 at 04:43:35PM +0200, René van Dorst wrote:
> + if (MTK_HAS_CAPS(mac->hw->soc->caps, MTK_SGMII)) {
> + if (state->interface != PHY_INTERFACE_MODE_2500BASEX) {
> phylink_set(mask, 1000baseT_Full);
> phylink_set(mask, 1000baseX_Full);
> + } else {
> + phylink_set(mask, 2500baseT_Full);
> + phylink_set(mask, 2500baseX_Full);
> + }
If you can dynamically switch between 1000BASE-X and 2500BASE-X, then
you need to have both set. See mvneta.c:
if (pp->comphy || state->interface != PHY_INTERFACE_MODE_2500BASEX) {
phylink_set(mask, 1000baseT_Full);
phylink_set(mask, 1000baseX_Full);
}
if (pp->comphy || state->interface == PHY_INTERFACE_MODE_2500BASEX) {
phylink_set(mask, 2500baseT_Full);
phylink_set(mask, 2500baseX_Full);
}
What this is saying is, if we have a comphy (which is the serdes lane
facing component, where the data rate is setup) then we can support
both speeds (and so mask ends up with all four bits set.) Otherwise,
we only support a single-speed (1000Gbps for non-2500BASE-X etc.)
> + } else {
> + if (state->interface == PHY_INTERFACE_MODE_TRGMII) {
> + phylink_set(mask, 1000baseT_Full);
> + } else {
> + phylink_set(mask, 10baseT_Half);
> + phylink_set(mask, 10baseT_Full);
> + phylink_set(mask, 100baseT_Half);
> + phylink_set(mask, 100baseT_Full);
> +
> + if (state->interface != PHY_INTERFACE_MODE_MII) {
> + phylink_set(mask, 1000baseT_Half);
> + phylink_set(mask, 1000baseT_Full);
> + phylink_set(mask, 1000baseX_Full);
> + }
I'm also wondering about the "MTK_HAS_CAPS(mac->hw->soc->caps,
MTK_SGMII)" above.
(Here comes a reason why using SGMII to cover all single-lane serdes
modes causes confusion - unfortunately, some folk use SGMII to describe
all these modes. So, I'm going to use the terminology "Cisco SGMII"
to mean exactly the SGMII format published by Cisco, "802.3 1000BASE-X"
to mean the original IEEE 802.3 format running at 1.25Gbps, and
"up-clocked 2500BASE-X" to mean the 3.125Gbps version of the 802.3
1000BASE-X protocol.)
Isn't this set for Cisco SGMII as well as for 802.3 1000BASE-X and
the up-clocked 2500BASE-X modes?
If so, is there a reason why 10Mbps and 100Mbps speeds aren't
supported on Cisco SGMII links?
--
RMK's Patch system: https://www.armlinux.org.uk/developer/patches/
FTTC broadband for 0.8mile line in suburbia: sync at 12.1Mbps down 622kbps up
According to speedtest.net: 11.9Mbps down 500kbps up
^ permalink raw reply
* RE: [PATCH] bonding: force enable lacp port after link state recovery for 802.3ad
From: zhangsha (A) @ 2019-08-22 14:33 UTC (permalink / raw)
To: Jay Vosburgh
Cc: vfalico@gmail.com, andy@greyhouse.net, davem@davemloft.net,
netdev@vger.kernel.org, linux-kernel@vger.kernel.org, yuehaibing,
hunongda, Chenzhendong (alex)
In-Reply-To: <27042.1566342874@famine>
> -----Original Message-----
> From: Jay Vosburgh [mailto:jay.vosburgh@canonical.com]
> Sent: 2019年8月21日 7:15
> To: zhangsha (A) <zhangsha.zhang@huawei.com>
> Cc: vfalico@gmail.com; andy@greyhouse.net; davem@davemloft.net;
> netdev@vger.kernel.org; linux-kernel@vger.kernel.org; yuehaibing
> <yuehaibing@huawei.com>; hunongda <hunongda@huawei.com>;
> Chenzhendong (alex) <alex.chen@huawei.com>
> Subject: Re: [PATCH] bonding: force enable lacp port after link state recovery
> for 802.3ad
>
> <zhangsha.zhang@huawei.com> wrote:
>
> >From: Sha Zhang <zhangsha.zhang@huawei.com>
> >
> >After the commit 334031219a84 ("bonding/802.3ad: fix slave link
> >initialization transition states") merged, the slave's link status will
> >be changed to BOND_LINK_FAIL from BOND_LINK_DOWN in the following
> >scenario:
> >- Driver reports loss of carrier and
> > bonding driver receives NETDEV_CHANGE notifier
> >- slave's duplex and speed is zerod and
> > its port->is_enabled is cleard to 'false';
> >- Driver reports link recovery and
> > bonding driver receives NETDEV_UP notifier;
> >- If speed/duplex getting failed here, the link status
> > will be changed to BOND_LINK_FAIL;
> >- The MII monotor later recover the slave's speed/duplex
> > and set link status to BOND_LINK_UP, but remains
> > the 'port->is_enabled' to 'false'.
> >
> >In this scenario, the lacp port will not be enabled even its speed and
> >duplex are valid. The bond will not send LACPDU's, and its state is
> >'AD_STATE_DEFAULTED' forever. The simplest fix I think is to force
> >enable lacp after port slave speed check in bond_miimon_commit. As
> >enabled, the lacp port can run its state machine normally after link
> >recovery.
> >
> >Signed-off-by: Sha Zhang <zhangsha.zhang@huawei.com>
> >---
> > drivers/net/bonding/bond_main.c | 8 +++++++-
> > 1 file changed, 7 insertions(+), 1 deletion(-)
> >
> >diff --git a/drivers/net/bonding/bond_main.c
> >b/drivers/net/bonding/bond_main.c index 931d9d9..379253a 100644
> >--- a/drivers/net/bonding/bond_main.c
> >+++ b/drivers/net/bonding/bond_main.c
> >@@ -2194,6 +2194,7 @@ static void bond_miimon_commit(struct bonding
> >*bond) {
> > struct list_head *iter;
> > struct slave *slave, *primary;
> >+ struct port *port;
> >
> > bond_for_each_slave(bond, slave, iter) {
> > switch (slave->new_link) {
> >@@ -2205,8 +2206,13 @@ static void bond_miimon_commit(struct bonding
> *bond)
> > * link status
> > */
> > if (BOND_MODE(bond) == BOND_MODE_8023AD &&
> >- slave->link == BOND_LINK_UP)
> >+ slave->link == BOND_LINK_UP) {
> >
> bond_3ad_adapter_speed_duplex_changed(slave);
> >+ if (slave->duplex == DUPLEX_FULL) {
> >+ port = &(SLAVE_AD_INFO(slave)-
> >port);
> >+ port->is_enabled = true;
> >+ }
> >+ }
>
> I don't believe that testing duplex here is correct; is_enabled is not
> controlled by duplex, but by carrier state. Duplex does affect whether or not
> a port is permitted to aggregate, but that's entirely separate logic (the
> AD_PORT_LACP_ENABLED flag).
>
> Would it be better to call bond_3ad_handle_link_change() here,
> instead of manually testing duplex and setting is_enabled?
>
> -J
Hi, Jay,
Thanks for the reply and I think bond_3ad_handle_link_change is indeed a better way here.
I will send a new patch later after having it tested.
>
> > continue;
> >
> > case BOND_LINK_UP:
> >--
> >1.8.3.1
> >
>
> ---
> -Jay Vosburgh, jay.vosburgh@canonical.com
^ permalink raw reply
* Re: [PATCH net-next v2 1/3] net: ethernet: mediatek: Add basic PHYLINK support
From: Russell King - ARM Linux admin @ 2019-08-22 14:27 UTC (permalink / raw)
To: René van Dorst
Cc: John Crispin, Sean Wang, Nelson Chang, David S . Miller,
Matthias Brugger, Frank Wunderlich, netdev, linux-mips,
linux-mediatek, Stefan Roese, linux-arm-kernel
In-Reply-To: <20190821144336.9259-2-opensource@vdorst.com>
On Wed, Aug 21, 2019 at 04:43:34PM +0200, René van Dorst wrote:
> +static void mtk_mac_link_down(struct phylink_config *config, unsigned int mode,
> + phy_interface_t interface)
> +{
> + struct mtk_mac *mac = container_of(config, struct mtk_mac,
> + phylink_config);
>
> - return 0;
> + mtk_w32(mac->hw, MAC_MCR_FORCE_LINK_DOWN, MTK_MAC_MCR(mac->id));
> }
You set the MAC_MCR_FORCE_MODE bit here...
> +static void mtk_mac_link_up(struct phylink_config *config, unsigned int mode,
> + phy_interface_t interface,
> + struct phy_device *phy)
> {
> + struct mtk_mac *mac = container_of(config, struct mtk_mac,
> + phylink_config);
> + u32 mcr = mtk_r32(mac->hw, MTK_MAC_MCR(mac->id));
>
> + mcr |= MAC_MCR_TX_EN | MAC_MCR_RX_EN;
> + mtk_w32(mac->hw, mcr, MTK_MAC_MCR(mac->id));
> +}
Looking at this, a link_down() followed by a link_up() would result in
this register containing MAC_MCR_FORCE_MODE | MAC_MCR_TX_EN |
MAC_MCR_RX_EN ? Is that actually correct? (MAC_MCR_FORCE_LINK isn't
set, so it looks to me like it still forces the link down.)
Note that link up/down forcing should not be done for in-band AN.
> +static void mtk_validate(struct phylink_config *config,
> + unsigned long *supported,
> + struct phylink_link_state *state)
> +{
> + struct mtk_mac *mac = container_of(config, struct mtk_mac,
> + phylink_config);
> + __ETHTOOL_DECLARE_LINK_MODE_MASK(mask) = { 0, };
>
> + if (state->interface != PHY_INTERFACE_MODE_NA &&
> + state->interface != PHY_INTERFACE_MODE_MII &&
> + state->interface != PHY_INTERFACE_MODE_GMII &&
> + !(MTK_HAS_CAPS(mac->hw->soc->caps, MTK_RGMII) &&
> + phy_interface_mode_is_rgmii(state->interface)) &&
> + !(MTK_HAS_CAPS(mac->hw->soc->caps, MTK_TRGMII) &&
> + !mac->id && state->interface == PHY_INTERFACE_MODE_TRGMII)) {
> + linkmode_zero(supported);
> + return;
> }
>
> + phylink_set_port_modes(mask);
> + phylink_set(mask, Autoneg);
>
> + if (state->interface == PHY_INTERFACE_MODE_TRGMII) {
> + phylink_set(mask, 1000baseT_Full);
> + } else {
> + phylink_set(mask, 10baseT_Half);
> + phylink_set(mask, 10baseT_Full);
> + phylink_set(mask, 100baseT_Half);
> + phylink_set(mask, 100baseT_Full);
> +
> + if (state->interface != PHY_INTERFACE_MODE_MII) {
> + phylink_set(mask, 1000baseT_Half);
> + phylink_set(mask, 1000baseT_Full);
> + phylink_set(mask, 1000baseX_Full);
> + }
> + }
>
> + phylink_set(mask, Pause);
> + phylink_set(mask, Asym_Pause);
>
> + linkmode_and(supported, supported, mask);
> + linkmode_and(state->advertising, state->advertising, mask);
> }
This looks fine.
Thanks.
--
RMK's Patch system: https://www.armlinux.org.uk/developer/patches/
FTTC broadband for 0.8mile line in suburbia: sync at 12.1Mbps down 622kbps up
According to speedtest.net: 11.9Mbps down 500kbps up
^ permalink raw reply
* Re: [PATCH v12 3/5] dt-bindings: can: tcan4x5x: Add DT bindings for TCAN4x5X driver
From: Dan Murphy @ 2019-08-22 14:20 UTC (permalink / raw)
To: Marc Kleine-Budde, wg, davem; +Cc: linux-can, netdev, linux-kernel
In-Reply-To: <bdf06ead-a2e8-09a9-8cdd-49b54ec9da72@pengutronix.de>
Marc
On 8/16/19 4:38 AM, Marc Kleine-Budde wrote:
> On 5/9/19 6:11 PM, Dan Murphy wrote:
>> DT binding documentation for TI TCAN4x5x driver.
>>
>> Signed-off-by: Dan Murphy <dmurphy@ti.com>
>> ---
>>
>> v12 - No changes - https://lore.kernel.org/patchwork/patch/1052300/
>>
>> v11 - No changes - https://lore.kernel.org/patchwork/patch/1051178/
>> v10 - No changes - https://lore.kernel.org/patchwork/patch/1050488/
>> v9 - No Changes - https://lore.kernel.org/patchwork/patch/1050118/
>> v8 - No Changes - https://lore.kernel.org/patchwork/patch/1047981/
>> v7 - Made device state optional - https://lore.kernel.org/patchwork/patch/1047218/
>> v6 - No changes - https://lore.kernel.org/patchwork/patch/1042445/
>>
>> .../devicetree/bindings/net/can/tcan4x5x.txt | 37 +++++++++++++++++++
>> 1 file changed, 37 insertions(+)
>> create mode 100644 Documentation/devicetree/bindings/net/can/tcan4x5x.txt
>>
>> diff --git a/Documentation/devicetree/bindings/net/can/tcan4x5x.txt b/Documentation/devicetree/bindings/net/can/tcan4x5x.txt
>> new file mode 100644
>> index 000000000000..c388f7d9feb1
>> --- /dev/null
>> +++ b/Documentation/devicetree/bindings/net/can/tcan4x5x.txt
>> @@ -0,0 +1,37 @@
>> +Texas Instruments TCAN4x5x CAN Controller
>> +================================================
>> +
>> +This file provides device node information for the TCAN4x5x interface contains.
>> +
>> +Required properties:
>> + - compatible: "ti,tcan4x5x"
>> + - reg: 0
>> + - #address-cells: 1
>> + - #size-cells: 0
>> + - spi-max-frequency: Maximum frequency of the SPI bus the chip can
>> + operate at should be less than or equal to 18 MHz.
>> + - data-ready-gpios: Interrupt GPIO for data and error reporting.
>> + - device-wake-gpios: Wake up GPIO to wake up the TCAN device.
>> +
>> +See Documentation/devicetree/bindings/net/can/m_can.txt for additional
>> +required property details.
>> +
>> +Optional properties:
>> + - reset-gpios: Hardwired output GPIO. If not defined then software
>> + reset.
>> + - device-state-gpios: Input GPIO that indicates if the device is in
>> + a sleep state or if the device is active.
>> +
>> +Example:
>> +tcan4x5x: tcan4x5x@0 {
>> + compatible = "ti,tcan4x5x";
>> + reg = <0>;
>> + #address-cells = <1>;
>> + #size-cells = <1>;
>> + spi-max-frequency = <10000000>;
>> + bosch,mram-cfg = <0x0 0 0 32 0 0 1 1>;
>> + data-ready-gpios = <&gpio1 14 GPIO_ACTIVE_LOW>;
> Can you convert this into a proper interrupt property? E.g.:
OK. Do you want v13 or do you want patches on top for net-next?
Dan
>
>> interrupt-parent = <&gpio4>;
>> interrupts = <13 0x2>;
> See:
> https://elixir.bootlin.com/linux/latest/source/Documentation/devicetree/bindings/net/can/microchip,mcp251x.txt#L21
> https://git.kernel.org/pub/scm/linux/kernel/git/mkl/linux-can-next.git/tree/drivers/net/can/spi/mcp251x.c?h=mcp251x#n945
This second link says it invalid
Dan
>
>> + device-state-gpios = <&gpio3 21 GPIO_ACTIVE_HIGH>;
>> + device-wake-gpios = <&gpio1 15 GPIO_ACTIVE_HIGH>;
>> + reset-gpios = <&gpio1 27 GPIO_ACTIVE_LOW>;
>> +};
> Marc
>
^ permalink raw reply
* [PATCHv4 net 2/2] xfrm/xfrm_policy: fix dst dev null pointer dereference in collect_md mode
From: Hangbin Liu @ 2019-08-22 14:19 UTC (permalink / raw)
To: netdev
Cc: Stefano Brivio, wenxu, Alexei Starovoitov, David S . Miller,
Eric Dumazet, Julian Anastasov, Hangbin Liu
In-Reply-To: <20190822141949.29561-1-liuhangbin@gmail.com>
In decode_session{4,6} there is a possibility that the skb dst dev is NULL,
e,g, with tunnel collect_md mode, which will cause kernel crash.
Here is what the code path looks like, for GRE:
- ip6gre_tunnel_xmit
- ip6gre_xmit_ipv6
- __gre6_xmit
- ip6_tnl_xmit
- if skb->len - t->tun_hlen - eth_hlen > mtu; return -EMSGSIZE
- icmpv6_send
- icmpv6_route_lookup
- xfrm_decode_session_reverse
- decode_session4
- oif = skb_dst(skb)->dev->ifindex; <-- here
- decode_session6
- oif = skb_dst(skb)->dev->ifindex; <-- here
The reason is __metadata_dst_init() init dst->dev to NULL by default.
We could not fix it in __metadata_dst_init() as there is no dev supplied.
On the other hand, the skb_dst(skb)->dev is actually not needed as we
called decode_session{4,6} via xfrm_decode_session_reverse(), so oif is not
used by: fl4->flowi4_oif = reverse ? skb->skb_iif : oif;
So make a dst dev check here should be clean and safe.
v4: No changes.
v3: No changes.
v2: fix the issue in decode_session{4,6} instead of updating shared dst dev
in {ip_md, ip6}_tunnel_xmit.
Fixes: 8d79266bc48c ("ip6_tunnel: add collect_md mode to IPv6 tunnels")
Signed-off-by: Hangbin Liu <liuhangbin@gmail.com>
---
net/xfrm/xfrm_policy.c | 4 ++--
1 file changed, 2 insertions(+), 2 deletions(-)
diff --git a/net/xfrm/xfrm_policy.c b/net/xfrm/xfrm_policy.c
index 8ca637a72697..ec94f5795ea4 100644
--- a/net/xfrm/xfrm_policy.c
+++ b/net/xfrm/xfrm_policy.c
@@ -3269,7 +3269,7 @@ decode_session4(struct sk_buff *skb, struct flowi *fl, bool reverse)
struct flowi4 *fl4 = &fl->u.ip4;
int oif = 0;
- if (skb_dst(skb))
+ if (skb_dst(skb) && skb_dst(skb)->dev)
oif = skb_dst(skb)->dev->ifindex;
memset(fl4, 0, sizeof(struct flowi4));
@@ -3387,7 +3387,7 @@ decode_session6(struct sk_buff *skb, struct flowi *fl, bool reverse)
nexthdr = nh[nhoff];
- if (skb_dst(skb))
+ if (skb_dst(skb) && skb_dst(skb)->dev)
oif = skb_dst(skb)->dev->ifindex;
memset(fl6, 0, sizeof(struct flowi6));
--
2.19.2
^ permalink raw reply related
* [PATCHv4 0/2] fix dev null pointer dereference when send packets larger than mtu in collect_md mode
From: Hangbin Liu @ 2019-08-22 14:19 UTC (permalink / raw)
To: netdev
Cc: Stefano Brivio, wenxu, Alexei Starovoitov, David S . Miller,
Eric Dumazet, Julian Anastasov, Hangbin Liu
In-Reply-To: <20190815060904.19426-1-liuhangbin@gmail.com>
When we send a packet larger than PMTU, we need to reply with
icmp_send(ICMP_FRAG_NEEDED) or icmpv6_send(ICMPV6_PKT_TOOBIG).
But with collect_md mode, kernel will crash while accessing the dst dev
as __metadata_dst_init() init dst->dev to NULL by default. Here is what
the code path looks like, for GRE:
- ip6gre_tunnel_xmit
- ip6gre_xmit_ipv4
- __gre6_xmit
- ip6_tnl_xmit
- if skb->len - t->tun_hlen - eth_hlen > mtu; return -EMSGSIZE
- icmp_send
- net = dev_net(rt->dst.dev); <-- here
- ip6gre_xmit_ipv6
- __gre6_xmit
- ip6_tnl_xmit
- if skb->len - t->tun_hlen - eth_hlen > mtu; return -EMSGSIZE
- icmpv6_send
...
- decode_session4
- oif = skb_dst(skb)->dev->ifindex; <-- here
- decode_session6
- oif = skb_dst(skb)->dev->ifindex; <-- here
We could not fix it in __metadata_dst_init() as there is no dev supplied.
Look in to the __icmp_send()/decode_session{4,6} code we could find the dst
dev is actually not needed. In __icmp_send(), we could get the net by skb->dev.
For decode_session{4,6}, as it was called by xfrm_decode_session_reverse()
in this scenario, the oif is not used by
fl4->flowi4_oif = reverse ? skb->skb_iif : oif;
The reproducer is easy:
ovs-vsctl add-br br0
ip link set br0 up
ovs-vsctl add-port br0 gre0 -- set interface gre0 type=gre options:remote_ip=$dst_addr
ip link set gre0 up
ip addr add ${local_gre6}/64 dev br0
ping6 $remote_gre6 -s 1500
The kernel will crash like
[40595.821651] BUG: kernel NULL pointer dereference, address: 0000000000000108
[40595.822411] #PF: supervisor read access in kernel mode
[40595.822949] #PF: error_code(0x0000) - not-present page
[40595.823492] PGD 0 P4D 0
[40595.823767] Oops: 0000 [#1] SMP PTI
[40595.824139] CPU: 0 PID: 2831 Comm: handler12 Not tainted 5.2.0 #57
[40595.824788] Hardware name: Red Hat KVM, BIOS 1.11.1-3.module+el8.1.0+2983+b2ae9c0a 04/01/2014
[40595.825680] RIP: 0010:__xfrm_decode_session+0x6b/0x930
[40595.826219] Code: b7 c0 00 00 00 b8 06 00 00 00 66 85 d2 0f b7 ca 48 0f 45 c1 44 0f b6 2c 06 48 8b 47 58 48 83 e0 fe 0f 84 f4 04 00 00 48 8b 00 <44> 8b 80 08 01 00 00 41 f6 c4 01 4c 89 e7
ba 58 00 00 00 0f 85 47
[40595.828155] RSP: 0018:ffffc90000a73438 EFLAGS: 00010286
[40595.828705] RAX: 0000000000000000 RBX: ffff8881329d7100 RCX: 0000000000000000
[40595.829450] RDX: 0000000000000000 RSI: ffff8881339e70ce RDI: ffff8881329d7100
[40595.830191] RBP: ffffc90000a73470 R08: 0000000000000000 R09: 000000000000000a
[40595.830936] R10: 0000000000000000 R11: 0000000000000000 R12: ffffc90000a73490
[40595.831682] R13: 000000000000002c R14: ffff888132ff1301 R15: ffff8881329d7100
[40595.832427] FS: 00007f5bfcfd6700(0000) GS:ffff88813ba00000(0000) knlGS:0000000000000000
[40595.833266] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[40595.833883] CR2: 0000000000000108 CR3: 000000013a368000 CR4: 00000000000006f0
[40595.834633] Call Trace:
[40595.835392] ? rt6_multipath_hash+0x4c/0x390
[40595.835853] icmpv6_route_lookup+0xcb/0x1d0
[40595.836296] ? icmpv6_xrlim_allow+0x3e/0x140
[40595.836751] icmp6_send+0x537/0x840
[40595.837125] icmpv6_send+0x20/0x30
[40595.837494] tnl_update_pmtu.isra.27+0x19d/0x2a0 [ip_tunnel]
[40595.838088] ip_md_tunnel_xmit+0x1b6/0x510 [ip_tunnel]
[40595.838633] gre_tap_xmit+0x10c/0x160 [ip_gre]
[40595.839103] dev_hard_start_xmit+0x93/0x200
[40595.839551] sch_direct_xmit+0x101/0x2d0
[40595.839967] __dev_queue_xmit+0x69f/0x9c0
[40595.840399] do_execute_actions+0x1717/0x1910 [openvswitch]
[40595.840987] ? validate_set.isra.12+0x2f5/0x3d0 [openvswitch]
[40595.841596] ? reserve_sfa_size+0x31/0x130 [openvswitch]
[40595.842154] ? __ovs_nla_copy_actions+0x1b4/0xad0 [openvswitch]
[40595.842778] ? __kmalloc_reserve.isra.50+0x2e/0x80
[40595.843285] ? should_failslab+0xa/0x20
[40595.843696] ? __kmalloc+0x188/0x220
[40595.844078] ? __alloc_skb+0x97/0x270
[40595.844472] ovs_execute_actions+0x47/0x120 [openvswitch]
[40595.845041] ovs_packet_cmd_execute+0x27d/0x2b0 [openvswitch]
[40595.845648] genl_family_rcv_msg+0x3a8/0x430
[40595.846101] genl_rcv_msg+0x47/0x90
[40595.846476] ? __alloc_skb+0x83/0x270
[40595.846866] ? genl_family_rcv_msg+0x430/0x430
[40595.847335] netlink_rcv_skb+0xcb/0x100
[40595.847777] genl_rcv+0x24/0x40
[40595.848113] netlink_unicast+0x17f/0x230
[40595.848535] netlink_sendmsg+0x2ed/0x3e0
[40595.848951] sock_sendmsg+0x4f/0x60
[40595.849323] ___sys_sendmsg+0x2bd/0x2e0
[40595.849733] ? sock_poll+0x6f/0xb0
[40595.850098] ? ep_scan_ready_list.isra.14+0x20b/0x240
[40595.850634] ? _cond_resched+0x15/0x30
[40595.851032] ? ep_poll+0x11b/0x440
[40595.851401] ? _copy_to_user+0x22/0x30
[40595.851799] __sys_sendmsg+0x58/0xa0
[40595.852180] do_syscall_64+0x5b/0x190
[40595.852574] entry_SYSCALL_64_after_hwframe+0x44/0xa9
[40595.853105] RIP: 0033:0x7f5c00038c7d
[40595.853489] Code: c7 20 00 00 75 10 b8 2e 00 00 00 0f 05 48 3d 01 f0 ff ff 73 31 c3 48 83 ec 08 e8 8e f7 ff ff 48 89 04 24 b8 2e 00 00 00 0f 05 <48> 8b 3c 24 48 89 c2 e8 d7 f7 ff ff 48 89
d0 48 83 c4 08 48 3d 01
[40595.855443] RSP: 002b:00007f5bfcf73c00 EFLAGS: 00003293 ORIG_RAX: 000000000000002e
[40595.856244] RAX: ffffffffffffffda RBX: 00007f5bfcf74a60 RCX: 00007f5c00038c7d
[40595.856990] RDX: 0000000000000000 RSI: 00007f5bfcf73c60 RDI: 0000000000000015
[40595.857736] RBP: 0000000000000004 R08: 0000000000000b7c R09: 0000000000000110
[40595.858613] R10: 0001000800050004 R11: 0000000000003293 R12: 000055c2d8329da0
[40595.859401] R13: 00007f5bfcf74120 R14: 0000000000000347 R15: 00007f5bfcf73c60
[40595.860185] Modules linked in: ip_gre ip_tunnel gre openvswitch nsh nf_conncount nf_nat nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 sunrpc bochs_drm ttm drm_kms_helper drm pcspkr joydev i2c_piix4 qemu_fw_cfg xfs libcrc32c virtio_net net_failover serio_raw failover ata_generic virtio_blk pata_acpi floppy
[40595.863155] CR2: 0000000000000108
[40595.863551] ---[ end trace 22209bbcacb4addd ]---
v4: Julian Anastasov remind skb->dev also could be NULL in icmp_send. We'd
better still use dst.dev and do a check to avoid crash.
v3: only replace pkg to packets in cover letter. So I didn't update the version
info in the follow up patches.
v2: fix it in __icmp_send() and decode_session{4,6} separately instead of
updating shared dst dev in {ip_md, ip6}_tunnel_xmit.
Hangbin Liu (2):
ipv4/icmp: fix rt dst dev null pointer dereference
xfrm/xfrm_policy: fix dst dev null pointer dereference in collect_md
mode
net/ipv4/icmp.c | 8 +++++++-
net/xfrm/xfrm_policy.c | 4 ++--
2 files changed, 9 insertions(+), 3 deletions(-)
--
2.19.2
^ permalink raw reply
* [PATCHv4 net 1/2] ipv4/icmp: fix rt dst dev null pointer dereference
From: Hangbin Liu @ 2019-08-22 14:19 UTC (permalink / raw)
To: netdev
Cc: Stefano Brivio, wenxu, Alexei Starovoitov, David S . Miller,
Eric Dumazet, Julian Anastasov, Hangbin Liu
In-Reply-To: <20190822141949.29561-1-liuhangbin@gmail.com>
In __icmp_send() there is a possibility that the rt->dst.dev is NULL,
e,g, with tunnel collect_md mode, which will cause kernel crash.
Here is what the code path looks like, for GRE:
- ip6gre_tunnel_xmit
- ip6gre_xmit_ipv4
- __gre6_xmit
- ip6_tnl_xmit
- if skb->len - t->tun_hlen - eth_hlen > mtu; return -EMSGSIZE
- icmp_send
- net = dev_net(rt->dst.dev); <-- here
The reason is __metadata_dst_init() init dst->dev to NULL by default.
We could not fix it in __metadata_dst_init() as there is no dev supplied.
On the other hand, the reason we need rt->dst.dev is to get the net.
So we can just try get it from skb->dev when rt->dst.dev is NULL.
v4: Julian Anastasov remind skb->dev also could be NULL. We'd better
still use dst.dev and do a check to avoid crash.
v3: No changes.
v2: fix the issue in __icmp_send() instead of updating shared dst dev
in {ip_md, ip6}_tunnel_xmit.
Fixes: c8b34e680a09 ("ip_tunnel: Add tnl_update_pmtu in ip_md_tunnel_xmit")
Signed-off-by: Hangbin Liu <liuhangbin@gmail.com>
---
net/ipv4/icmp.c | 8 +++++++-
1 file changed, 7 insertions(+), 1 deletion(-)
diff --git a/net/ipv4/icmp.c b/net/ipv4/icmp.c
index 1510e951f451..001f03f76bc4 100644
--- a/net/ipv4/icmp.c
+++ b/net/ipv4/icmp.c
@@ -582,7 +582,13 @@ void __icmp_send(struct sk_buff *skb_in, int type, int code, __be32 info,
if (!rt)
goto out;
- net = dev_net(rt->dst.dev);
+
+ if (rt->dst.dev)
+ net = dev_net(rt->dst.dev);
+ else if (skb_in->dev)
+ net = dev_net(skb_in->dev);
+ else
+ goto out;
/*
* Find the original header. It is expected to be valid, of course.
--
2.19.2
^ permalink raw reply related
* Re: [PATCH v2 bpf-next 1/4] bpf: unprivileged BPF access via /dev/bpf
From: Daniel Borkmann @ 2019-08-22 14:17 UTC (permalink / raw)
To: Andy Lutomirski, Alexei Starovoitov
Cc: Song Liu, Kees Cook, Networking, bpf, Alexei Starovoitov,
Kernel Team, Lorenz Bauer, Jann Horn, Greg KH, Linux API,
LSM List, Chenbo Feng
In-Reply-To: <CALCETrXEHL3+NAY6P6vUj7Pvd9ZpZsYC6VCLXOaNxb90a_POGw@mail.gmail.com>
On 8/7/19 7:24 AM, Andy Lutomirski wrote:
> On Mon, Aug 5, 2019 at 6:11 PM Alexei Starovoitov
> <alexei.starovoitov@gmail.com> wrote:
>> On Mon, Aug 05, 2019 at 02:25:35PM -0700, Andy Lutomirski wrote:
>>> It tries to make the kernel respect the access modes for fds. Without
>>> this patch, there seem to be some holes: nothing looked at program fds
>>> and, unless I missed something, you could take a readonly fd for a
>>> program, pin the program, and reopen it RW.
>>
>> I think it's by design. iirc Daniel had a use case for something like this.
>
> That seems odd. Daniel, can you elaborate?
[ ... catching up late. ]
Not from my side, the change was added by Chenbo back then for Android
use-case to replace xt_qtaguid and xt_owner with BPF programs and to
allow unprivileged applications to read maps. More on their architecture:
https://source.android.com/devices/tech/datausage/ebpf-traffic-monitor
From the cover-letter:
[...]
The network-control daemon (netd) creates and loads an eBPF object for
network packet filtering and analysis. It passes the object FD to an
unprivileged network monitor app (netmonitor), which is not allowed to
create, modify or load eBPF objects, but is allowed to read the traffic
stats from the map.
[...]
Iuuc, netd would be in charge with the ability to r/w into maps and
pin them, but with the ability to to hand off some map fds as r/o to
unprivileged applications in order for them to query data.
>> Hence unprivileged bpf is actually something that can be deprecated.
There is actually a publicly known use-case on unprivileged bpf wrt
socket filters, see the SO_ATTACH_BPF on sockets section as an example:
https://blog.cloudflare.com/cloudflare-architecture-and-how-bpf-eats-the-world/
If I'd have to take a good guess, I'd think it's major use-case is in
SO_ATTACH_REUSEPORT_EBPF in the wild, I don't think the sysctl can be
outright flipped or deprecated w/o breaking existing applications unless
it's cleanly modeled into some sort of customizable CAP_BPF* type policy
(more below) where this would be the lowest common denominator.
> I hope not. There are a couple setsockopt uses right now, and and
> seccomp will surely want it someday. And the bpf-inside-container use
> case really is unprivileged bpf -- containers are, in many (most?)
> cases, explicitly not trusted by the host.
[...]
>> Inside containers and inside nested containers we need to start processes
>> that will use bpf. All of the processes are trusted.
>
> Trusted by whom? In a non-nested container, the container manager
> *might* be trusted by the outside world. In a *nested* container,
> unless the inner container management is controlled from outside the
> outer container, it's not trusted. I don't know much about how
> Facebook's containers work, but the LXC/LXD/Podman world is moving
> very strongly toward user namespaces and maximally-untrusted
> containers, and I think bpf() should work in that context.
[...] and if we opt-in with CAP_NET_ADMIN, for example, then it should
ideally be possible for that container to install BPF programs for
mangling, dropping, forwarding etc as long as it's only affecting it's
/own/ netns like the rest of networking subsystem controls that work
in that case. I would actually like to get to this at some point and
make it more approachable as long as there is a way for an admin to
/opt into it/ via policy (aka not by default). Thinking out loud, I'd
love some sort of a hybrid, that is, a mixture of CAP_BPF_ADMIN and
customizable seccomp policy. Meaning, there could be several CAP_BPF
type sub-policies e.g. from allowing everything (equivalent to the
/dev/bpf on/off handle or CAP_SYS_ADMIN we have today) down to
programmable user defined policy that can be tailored to specific
needs like granting apps to BPF_OBJ_GET and BPF_MAP_LOOKUP elements
or granting to load+mangle a specific subset of maps (e.g. BPF_MAP_TYPE_{ARRAY,
HASH,LRU_HASH,LPM_TRIE}) and prog types (...) plus attaching them to
their own netns, and if that is untrusted, then same restrictions/
mitigations could be done by the verifier as with (current) unprivileged
BPF, enabled via programmable policy as well. We wouldn't make any
static/fixed assumptions, but allow users to define them based on their
own use-cases. Haven't looked how feasible this would be, but something
to take into consideration when we rework the current [admittedly
suboptimal all-or-nothing] model we have. Is this something you had in
mind as well for your wip proposal, Andy?
Thanks,
Daniel
^ permalink raw reply
* Re: [PATCH spi for-5.4 0/5] Deterministic SPI latency with NXP DSPI driver
From: Richard Cochran @ 2019-08-22 14:16 UTC (permalink / raw)
To: Vladimir Oltean
Cc: Mark Brown, Hubert Feurstein, Miroslav Lichvar, Andrew Lunn,
Florian Fainelli, linux-spi, netdev
In-Reply-To: <CA+h21hrtzU1XL-0m+BG5TYZvVh8WN6hgcM7CV5taHyq2MsR5dw@mail.gmail.com>
On Wed, Aug 21, 2019 at 11:17:23PM +0300, Vladimir Oltean wrote:
> Of course PPS with a dedicated hardware receiver that can take input
> compare timestamps is always preferable. However non-Ethernet
> synchronization in the field looks to me like "make do with whatever
> you can". I'm not sure a plain GPIO that raises an interrupt is better
> than an interrupt-driven serial protocol controller - it's (mostly)
> the interrupts that throw off the precision of the software timestamp.
> And use Miroslav's pps-gpio-poll module and you're back from where you
> started (try to make a sw timestamp as precise as possible).
Right, it might be better, might not. You can consider hacking a
local time stamp into the ISR. Also, if one of your MACs has a input
event pin, you can feed the switch's PPS output in there.
> wouldn't be my first choice. But DSA could have that built-in, and
> with the added latency benefit of a MAC-to-MAC connection.
> Too bad the mv88e6xxx driver can't do loopback timestamping, that's
> already 50% of the DSA drivers that support PTP at all. An embedded
> solution for this is less compelling now.
Let me back track on my statement about mv88e6xxx. At the time, I
didn't see any practical way to use the CPU port for synchronization,
but I forget exactly the details. Maybe it is indeed possible,
somehow. If you can find a way that will work on your switch and on
the Marvell, then I'd like to hear about it.
Thinking back...
One problem is this. PTP requires a delay measurement. You can send
a delay request from the host, but there will never be a reply.
Another problem is this. A Sync message arriving on an external port
is time stamped there, but then it is encapsulated as a tagged DSA
management message and delivered out the CPU port. At this point, it
is no longer a PTP frame and will not be time stamped at the CPU port
on egress.
Thanks,
Richard
^ permalink raw reply
* Re: [net] devlink: Add method for time-stamp on reporter's dump
From: Andrew Lunn @ 2019-08-22 14:06 UTC (permalink / raw)
To: Aya Levin; +Cc: David S. Miller, Jiri Pirko, netdev, linux-kernel
In-Reply-To: <1566461871-21992-1-git-send-email-ayal@mellanox.com>
On Thu, Aug 22, 2019 at 11:17:51AM +0300, Aya Levin wrote:
> When setting the dump's time-stamp, use ktime_get_real in addition to
> jiffies. This simplifies the user space implementation and bypasses
> some inconsistent behavior with translating jiffies to current time.
Hi Aya
Is this year 2038 safe? I don't know enough about this to answer the
question myself.
Thanks
Andrew
^ permalink raw reply
* Re: WARNING in rollback_registered_many (2)
From: Oliver Neukum @ 2019-08-22 14:06 UTC (permalink / raw)
To: Andrey Konovalov, syzbot, USB list
Cc: Kai Heng Feng, tyhicks, David S. Miller, devel, straube.linux,
Eric Dumazet, syzkaller-bugs, florian.c.schilhabel,
Matthew Wilcox, Greg Kroah-Hartman, Larry Finger, Alan Stern,
LKML, netdev, avagin, ktkhai, Eric W . Biederman
In-Reply-To: <CAAeHK+w+asSQ3axWymToQ+uzPfEAYS2QimVBL85GuJRBtxkjDA@mail.gmail.com>
Am Mittwoch, den 07.08.2019, 16:03 +0200 schrieb Andrey Konovalov:
I may offer a preliminary analysis.
Regards
Oliver
> On Fri, Apr 12, 2019 at 1:32 PM Andrey Konovalov <andreyknvl@google.com> wrote:
> >
> > On Fri, Apr 12, 2019 at 1:29 AM syzbot
> > <syzbot+40918e4d826fb2ff9b96@syzkaller.appspotmail.com> wrote:
> > >
> > > syzbot has found a reproducer for the following crash on:
> > >
> > > HEAD commit: 9a33b369 usb-fuzzer: main usb gadget fuzzer driver
> > > git tree: https://github.com/google/kasan/tree/usb-fuzzer
> > > console output: https://syzkaller.appspot.com/x/log.txt?x=10d552b7200000
> > > kernel config: https://syzkaller.appspot.com/x/.config?x=23e37f59d94ddd15
> > > dashboard link: https://syzkaller.appspot.com/bug?extid=40918e4d826fb2ff9b96
> > > compiler: gcc (GCC) 9.0.0 20181231 (experimental)
> > > syz repro: https://syzkaller.appspot.com/x/repro.syz?x=17a4c1af200000
> > > C reproducer: https://syzkaller.appspot.com/x/repro.c?x=121b274b200000
> > >
> > > IMPORTANT: if you fix the bug, please add the following tag to the commit:
> > > Reported-by: syzbot+40918e4d826fb2ff9b96@syzkaller.appspotmail.com
> > >
> > > usb 1-1: r8712u: MAC Address from efuse = 00:e0:4c:87:00:00
> > > usb 1-1: r8712u: Loading firmware from "rtlwifi/rtl8712u.bin"
> > > usb 1-1: USB disconnect, device number 2
Disconnect will run which leads to
static void r871xu_dev_remove(struct usb_interface *pusb_intf)
{
struct net_device *pnetdev = usb_get_intfdata(pusb_intf);
struct usb_device *udev = interface_to_usbdev(pusb_intf);
if (pnetdev) {
^^^ This is supposed to save us
struct _adapter *padapter = netdev_priv(pnetdev);
usb_set_intfdata(pusb_intf, NULL);
release_firmware(padapter->fw);
/* never exit with a firmware callback pending */
wait_for_completion(&padapter->rtl8712_fw_ready);
if (drvpriv.drv_registered)
padapter->surprise_removed = true;
unregister_netdev(pnetdev); /* will call netdev_close() */
So we will call unregister_netdev()
> > > usb 1-1: Direct firmware load for rtlwifi/rtl8712u.bin failed with error -2
> > > usb 1-1: r8712u: Firmware request failed
So we ran into the error handling of:
static void rtl871x_load_fw_cb(const struct firmware *firmware, void *context)
{
struct _adapter *adapter = context;
complete(&adapter->rtl8712_fw_ready);
if (!firmware) {
struct usb_device *udev = adapter->dvobjpriv.pusbdev;
struct usb_interface *usb_intf = adapter->pusb_intf;
dev_err(&udev->dev, "r8712u: Firmware request failed\n");
usb_put_dev(udev);
usb_set_intfdata(usb_intf, NULL);
^^^ This is supposed to save us from deregistering an unregistered device
but it comes too late. We have already called complete.
return;
}
adapter->fw = firmware;
/* firmware available - start netdev */
register_netdev(adapter->pnetdev);
register_netdev() is not called.
> > > Kernel panic - not syncing: panic_on_warn set ...
> > > CPU: 0 PID: 575 Comm: kworker/0:4 Not tainted 5.1.0-rc4-319354-g9a33b36 #3
> > > Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS
> > > Google 01/01/2011
> > > Workqueue: usb_hub_wq hub_event
> > > Call Trace:
> > > __dump_stack lib/dump_stack.c:77 [inline]
> > > dump_stack+0xe8/0x16e lib/dump_stack.c:113
> > > panic+0x29d/0x5f2 kernel/panic.c:214
> > > __warn.cold+0x20/0x48 kernel/panic.c:571
> > > report_bug+0x262/0x2a0 lib/bug.c:186
> > > fixup_bug arch/x86/kernel/traps.c:179 [inline]
> > > fixup_bug arch/x86/kernel/traps.c:174 [inline]
> > > do_error_trap+0x130/0x1f0 arch/x86/kernel/traps.c:272
> > > do_invalid_op+0x37/0x40 arch/x86/kernel/traps.c:291
> > > invalid_op+0x14/0x20 arch/x86/entry/entry_64.S:973
This kills us.
> > > RIP: 0010:rollback_registered_many+0x1f3/0xe70 net/core/dev.c:8152
> > > Code: 05 00 00 31 ff 44 89 fe e8 5a 15 f3 f4 45 84 ff 0f 85 49 ff ff ff e8
> > > 1c 14 f3 f4 0f 1f 44 00 00 e8 12 14 f3 f4 e8 0d 14 f3 f4 <0f> 0b 4c 89 e7
> > > e8 33 72 f2 f6 31 ff 41 89 c4 89 c6 e8 27 15 f3 f4
> > > RSP: 0018:ffff88809d087698 EFLAGS: 00010293
> > > RAX: ffff88809d058000 RBX: ffff888096240000 RCX: ffffffff8c7eb146
> > > RDX: 0000000000000000 RSI: ffffffff8c7eb163 RDI: 0000000000000001
> > > RBP: ffff88809d0877c8 R08: ffff88809d058000 R09: fffffbfff2708111
> > > R10: fffffbfff2708110 R11: ffffffff93840887 R12: ffff888096240070
> > > R13: dffffc0000000000 R14: ffff88809d087758 R15: 0000000000000000
> > > rollback_registered+0xf7/0x1c0 net/core/dev.c:8228
> > > unregister_netdevice_queue net/core/dev.c:9275 [inline]
> > > unregister_netdevice_queue+0x1dc/0x2b0 net/core/dev.c:9268
> > > unregister_netdevice include/linux/netdevice.h:2655 [inline]
> > > unregister_netdev+0x1d/0x30 net/core/dev.c:9316
> > > r871xu_dev_remove+0xe7/0x223 drivers/staging/rtl8712/usb_intf.c:604
We end up here:
static void rollback_registered_many(struct list_head *head)
{
struct net_device *dev, *tmp;
LIST_HEAD(close_head);
BUG_ON(dev_boot_phase);
ASSERT_RTNL();
list_for_each_entry_safe(dev, tmp, head, unreg_list) {
/* Some devices call without registering
* for initialization unwind. Remove those
* devices and proceed with the remaining.
*/
if (dev->reg_state == NETREG_UNINITIALIZED) {
pr_debug("unregister_netdevice: device %s/%p never was registered\n",
dev->name, dev);
WARN_ON(1);
^ permalink raw reply
* Re: [PATCH bpf-next 1/4] xsk: avoid store-tearing when assigning queues
From: Björn Töpel @ 2019-08-22 13:51 UTC (permalink / raw)
To: Hillf Danton
Cc: ast@kernel.org, daniel@iogearbox.net, netdev@vger.kernel.org,
Björn Töpel, magnus.karlsson@intel.com,
magnus.karlsson@gmail.com, bpf@vger.kernel.org,
jonathan.lemon@gmail.com,
syzbot+c82697e3043781e08802@syzkaller.appspotmail.com,
i.maximets@samsung.com
In-Reply-To: <5d5e980f.1c69fb81.f8d9b.71f2SMTPIN_ADDED_MISSING@mx.google.com>
On Thu, 22 Aug 2019 at 15:26, Hillf Danton <hdanton@sina.com> wrote:
>
> >
>
> > /* Make sure queue is ready before it can be seen by others */
>
> > smp_wmb();
>
>
>
> Hehe, who put mb here and for what?
>
That was from an earlier commit, and it's a barrier paired with the
lock-less reading of queues in xsk_mmap. Uhm... not sure I answered
your question?
Björn
>
>
> >- *queue = q;
>
> >+ WRITE_ONCE(*queue, q);
>
> > return 0;
>
>
^ permalink raw reply
* Re: [RFC bpf-next 4/5] iproute2: Allow compiling against libbpf
From: Daniel Borkmann @ 2019-08-22 13:45 UTC (permalink / raw)
To: Toke Høiland-Jørgensen, Stephen Hemminger,
Alexei Starovoitov
Cc: Martin KaFai Lau, Song Liu, Yonghong Song, David Miller,
Jesper Dangaard Brouer, netdev, bpf, andrii.nakryiko
In-Reply-To: <87zhk1nwvc.fsf@toke.dk>
On 8/22/19 3:38 PM, Toke Høiland-Jørgensen wrote:
> Daniel Borkmann <daniel@iogearbox.net> writes:
>> On 8/22/19 2:04 PM, Toke Høiland-Jørgensen wrote:
>>> Daniel Borkmann <daniel@iogearbox.net> writes:
>>>> On 8/22/19 12:43 PM, Toke Høiland-Jørgensen wrote:
>>>>> Daniel Borkmann <daniel@iogearbox.net> writes:
>>>>>> On 8/20/19 1:47 PM, Toke Høiland-Jørgensen wrote:
>>>>>>> This adds a configure check for libbpf and renames functions to allow
>>>>>>> lib/bpf.c to be compiled with it present. This makes it possible to
>>>>>>> port functionality piecemeal to use libbpf.
>>>>>>>
>>>>>>> Signed-off-by: Toke Høiland-Jørgensen <toke@redhat.com>
>>>>>>> ---
>>>>>>> configure | 16 ++++++++++++++++
>>>>>>> include/bpf_util.h | 6 +++---
>>>>>>> ip/ipvrf.c | 4 ++--
>>>>>>> lib/bpf.c | 33 +++++++++++++++++++--------------
>>>>>>> 4 files changed, 40 insertions(+), 19 deletions(-)
>>>>>>>
>>>>>>> diff --git a/configure b/configure
>>>>>>> index 45fcffb6..5a89ee9f 100755
>>>>>>> --- a/configure
>>>>>>> +++ b/configure
>>>>>>> @@ -238,6 +238,19 @@ check_elf()
>>>>>>> fi
>>>>>>> }
>>>>>>>
>>>>>>> +check_libbpf()
>>>>>>> +{
>>>>>>> + if ${PKG_CONFIG} libbpf --exists; then
>>>>>>> + echo "HAVE_LIBBPF:=y" >>$CONFIG
>>>>>>> + echo "yes"
>>>>>>> +
>>>>>>> + echo 'CFLAGS += -DHAVE_LIBBPF' `${PKG_CONFIG} libbpf --cflags` >> $CONFIG
>>>>>>> + echo 'LDLIBS += ' `${PKG_CONFIG} libbpf --libs` >>$CONFIG
>>>>>>> + else
>>>>>>> + echo "no"
>>>>>>> + fi
>>>>>>> +}
>>>>>>> +
>>>>>>> check_selinux()
>>>>>>
>>>>>> More of an implementation detail at this point in time, but want to
>>>>>> make sure this doesn't get missed along the way: as discussed at
>>>>>> bpfconf [0] best for iproute2 to handle libbpf support would be the
>>>>>> same way of integration as pahole does, that is, to integrate it via
>>>>>> submodule [1] to allow kernel and libbpf features to be in sync with
>>>>>> iproute2 releases and therefore easily consume extensions we're adding
>>>>>> to libbpf to aide iproute2 integration.
>>>>>
>>>>> I can sorta see the point wrt keeping in sync with kernel features. But
>>>>> how will this work with distros that package libbpf as a regular
>>>>> library? Have you guys given up on regular library symbol versioning for
>>>>> libbpf?
>>>>
>>>> Not at all, and I hope you know that. ;-)
>>>
>>> Good! Didn't really expect you had, just checking ;)
>>>
>>>> The reason I added lib/bpf.c integration into iproute2 directly back
>>>> then was exactly such that users can start consuming BPF for tc and
>>>> XDP via iproute2 /everywhere/ with only a simple libelf dependency
>>>> which is also available on all distros since pretty much forever. If
>>>> it was an external library, we could have waited till hell freezes
>>>> over and initial distro adoption would have pretty much taken forever:
>>>> to pick one random example here wrt the pace of some downstream
>>>> distros [0]. The main rationale is pretty much the same as with added
>>>> kernel features that land complementary iproute2 patches for that
>>>> kernel release and as libbpf is developed alongside it is reasonable
>>>> to guarantee user expectations that iproute2 released for kernel
>>>> version x can make use of BPF features added to kernel x with same
>>>> loader support from x.
>>>
>>> Well, for iproute2 I would expect this to be solved by version
>>> dependencies. I.e. iproute2 version X would depend on libbpf version Y+
>>> (like, I dunno, the version of libbpf included in the same kernel source
>>> tree as the kernel version iproute2 is targeting? :)).
>>
>> This sounds nice in theory, but from what I've seen major (!) distros
>> already seem to have a hard time releasing kernel x along with iproute2
>> package x, concrete example was that distro kernel was on 4.13 and its
>> official iproute2 package on 4.9,
>
> If the iproute2 package is not being updated at all I don't really see
> how it would make any difference whether libbpf is vendored or not? :)
>
>> adding yet another variable that needs to be in sync with kernel is
>> simply impractical especially for a _core_ package like iproute2 that
>> should have as little dependencies as possible. I also don't want to
>> make a bet on whether libbpf will be available on every distro that
>> also ships iproute2. Hence approach that pahole (and also bcc by the
>> way) takes is most reasonable to have the best user experience.
>
> Most users are going to get iproute2 from their distro packages anyway,
> so if distros are incompetent at packaging, my bet is that you're going
> to run into issues one way or another.
>
> But OK, if you think it is easier to work around bad distros by
> vendoring, you guys are the maintainers, so that's up to you. But can we
> at least put in the version dependency and let the build system pick up
> a system libbpf if it exists and is compatible? That way distros that
> *are* competent can still link it dynamically...
Yeah that would be fine by me to use this as a fallback, and I think that
iproute2's configure script should be able to easily handle this situation.
That way we don't compromise on user experience.
Thanks,
Daniel
^ permalink raw reply
* Re: [RFC bpf-next 4/5] iproute2: Allow compiling against libbpf
From: Toke Høiland-Jørgensen @ 2019-08-22 13:38 UTC (permalink / raw)
To: Daniel Borkmann, Stephen Hemminger, Alexei Starovoitov
Cc: Martin KaFai Lau, Song Liu, Yonghong Song, David Miller,
Jesper Dangaard Brouer, netdev, bpf, andrii.nakryiko
In-Reply-To: <4ca44e39-9b32-909f-df8d-f565eae57498@iogearbox.net>
Daniel Borkmann <daniel@iogearbox.net> writes:
> On 8/22/19 2:04 PM, Toke Høiland-Jørgensen wrote:
>> Daniel Borkmann <daniel@iogearbox.net> writes:
>>> On 8/22/19 12:43 PM, Toke Høiland-Jørgensen wrote:
>>>> Daniel Borkmann <daniel@iogearbox.net> writes:
>>>>> On 8/20/19 1:47 PM, Toke Høiland-Jørgensen wrote:
>>>>>> This adds a configure check for libbpf and renames functions to allow
>>>>>> lib/bpf.c to be compiled with it present. This makes it possible to
>>>>>> port functionality piecemeal to use libbpf.
>>>>>>
>>>>>> Signed-off-by: Toke Høiland-Jørgensen <toke@redhat.com>
>>>>>> ---
>>>>>> configure | 16 ++++++++++++++++
>>>>>> include/bpf_util.h | 6 +++---
>>>>>> ip/ipvrf.c | 4 ++--
>>>>>> lib/bpf.c | 33 +++++++++++++++++++--------------
>>>>>> 4 files changed, 40 insertions(+), 19 deletions(-)
>>>>>>
>>>>>> diff --git a/configure b/configure
>>>>>> index 45fcffb6..5a89ee9f 100755
>>>>>> --- a/configure
>>>>>> +++ b/configure
>>>>>> @@ -238,6 +238,19 @@ check_elf()
>>>>>> fi
>>>>>> }
>>>>>>
>>>>>> +check_libbpf()
>>>>>> +{
>>>>>> + if ${PKG_CONFIG} libbpf --exists; then
>>>>>> + echo "HAVE_LIBBPF:=y" >>$CONFIG
>>>>>> + echo "yes"
>>>>>> +
>>>>>> + echo 'CFLAGS += -DHAVE_LIBBPF' `${PKG_CONFIG} libbpf --cflags` >> $CONFIG
>>>>>> + echo 'LDLIBS += ' `${PKG_CONFIG} libbpf --libs` >>$CONFIG
>>>>>> + else
>>>>>> + echo "no"
>>>>>> + fi
>>>>>> +}
>>>>>> +
>>>>>> check_selinux()
>>>>>
>>>>> More of an implementation detail at this point in time, but want to
>>>>> make sure this doesn't get missed along the way: as discussed at
>>>>> bpfconf [0] best for iproute2 to handle libbpf support would be the
>>>>> same way of integration as pahole does, that is, to integrate it via
>>>>> submodule [1] to allow kernel and libbpf features to be in sync with
>>>>> iproute2 releases and therefore easily consume extensions we're adding
>>>>> to libbpf to aide iproute2 integration.
>>>>
>>>> I can sorta see the point wrt keeping in sync with kernel features. But
>>>> how will this work with distros that package libbpf as a regular
>>>> library? Have you guys given up on regular library symbol versioning for
>>>> libbpf?
>>>
>>> Not at all, and I hope you know that. ;-)
>>
>> Good! Didn't really expect you had, just checking ;)
>>
>>> The reason I added lib/bpf.c integration into iproute2 directly back
>>> then was exactly such that users can start consuming BPF for tc and
>>> XDP via iproute2 /everywhere/ with only a simple libelf dependency
>>> which is also available on all distros since pretty much forever. If
>>> it was an external library, we could have waited till hell freezes
>>> over and initial distro adoption would have pretty much taken forever:
>>> to pick one random example here wrt the pace of some downstream
>>> distros [0]. The main rationale is pretty much the same as with added
>>> kernel features that land complementary iproute2 patches for that
>>> kernel release and as libbpf is developed alongside it is reasonable
>>> to guarantee user expectations that iproute2 released for kernel
>>> version x can make use of BPF features added to kernel x with same
>>> loader support from x.
>>
>> Well, for iproute2 I would expect this to be solved by version
>> dependencies. I.e. iproute2 version X would depend on libbpf version Y+
>> (like, I dunno, the version of libbpf included in the same kernel source
>> tree as the kernel version iproute2 is targeting? :)).
>
> This sounds nice in theory, but from what I've seen major (!) distros
> already seem to have a hard time releasing kernel x along with iproute2
> package x, concrete example was that distro kernel was on 4.13 and its
> official iproute2 package on 4.9,
If the iproute2 package is not being updated at all I don't really see
how it would make any difference whether libbpf is vendored or not? :)
> adding yet another variable that needs to be in sync with kernel is
> simply impractical especially for a _core_ package like iproute2 that
> should have as little dependencies as possible. I also don't want to
> make a bet on whether libbpf will be available on every distro that
> also ships iproute2. Hence approach that pahole (and also bcc by the
> way) takes is most reasonable to have the best user experience.
Most users are going to get iproute2 from their distro packages anyway,
so if distros are incompetent at packaging, my bet is that you're going
to run into issues one way or another.
But OK, if you think it is easier to work around bad distros by
vendoring, you guys are the maintainers, so that's up to you. But can we
at least put in the version dependency and let the build system pick up
a system libbpf if it exists and is compatible? That way distros that
*are* competent can still link it dynamically...
-Toke
^ permalink raw reply
* Re: [PATCH 4.14.y stable] xfrm: policy: remove pcpu policy cache
From: Florian Westphal @ 2019-08-22 13:37 UTC (permalink / raw)
To: Greg KH
Cc: Florian Westphal, stable, vakul.garg, netdev, Kristian Evensen,
Steffen Klassert
In-Reply-To: <20190822130941.GA15754@kroah.com>
Greg KH <greg@kroah.com> wrote:
> On Thu, Aug 22, 2019 at 01:21:09PM +0200, Florian Westphal wrote:
> > commit e4db5b61c572475bbbcf63e3c8a2606bfccf2c9d upstream.
> >
> > Kristian Evensen says:
> > In a project I am involved in, we are running ipsec (Strongswan) on
> > different mt7621-based routers. Each router is configured as an
> > initiator and has around ~30 tunnels to different responders (running
> > on misc. devices). Before the flow cache was removed (kernel 4.9), we
> > got a combined throughput of around 70Mbit/s for all tunnels on one
> > router. However, we recently switched to kernel 4.14 (4.14.48), and
> > the total throughput is somewhere around 57Mbit/s (best-case). I.e., a
> > drop of around 20%. Reverting the flow cache removal restores, as
> > expected, performance levels to that of kernel 4.9.
> >
> > When pcpu xdst exists, it has to be validated first before it can be
> > used.
> >
> > A negative hit thus increases cost vs. no-cache.
> >
> > As number of tunnels increases, hit rate decreases so this pcpu caching
> > isn't a viable strategy.
> >
> > Furthermore, the xdst cache also needs to run with BH off, so when
> > removing this the bh disable/enable pairs can be removed too.
> >
> > Kristian tested a 4.14.y backport of this change and reported
> > increased performance:
> >
> > In our tests, the throughput reduction has been reduced from around -20%
> > to -5%. We also see that the overall throughput is independent of the
> > number of tunnels, while before the throughput was reduced as the number
> > of tunnels increased.
> >
> > Reported-by: Kristian Evensen <kristian.evensen@gmail.com>
> > Signed-off-by: Florian Westphal <fw@strlen.de>
> > Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
> > ---
> > Vakul Garg reports traffic going via ipsec tunnels will cause the kernel
> > to spin in an infinite loop due to xfrm policy reference count
> > overflowing and becoming 0.
> > The refcount leak is in the pcpu cache. Instead of fixing this, just
> > remove the pcpu cache -- its not present in any other stable release.
> > Vakul reported that this patch fixes the problem.
> >
> > There are no major deviations from the upstream revert; conflicts
> > were only due to context.
>
> Now queued up, does 4.9.y also need this?
No, 4.14 was the first kernel with this thing and its already gone in
4.19.
^ permalink raw reply
page: next (older) | prev (newer) | latest
- recent:[subjects (threaded)|topics (new)|topics (active)]
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox