Netdev List

Netdev List
 help / color / mirror / Atom feed

* Re: [PATCH net-next] net/tls: prevent skb_orphan() from leaking TLS plain text with offload
From: Jakub Kicinski @ 2019-07-31 18:43 UTC (permalink / raw)
  To: Boris Pismenny
  Cc: davem@davemloft.net, netdev@vger.kernel.org,
	oss-drivers@netronome.com, edumazet@google.com, Aviad Yehezkel,
	davejwatson@fb.com, john.fastabend@gmail.com,
	daniel@iogearbox.net, willemb@google.com,
	xiyou.wangcong@gmail.com, fw@strlen.de,
	alexei.starovoitov@gmail.com
In-Reply-To: <c86943db-f098-3ce8-f0d1-4f04ba676d40@mellanox.com>

On Wed, 31 Jul 2019 13:57:26 +0000, Boris Pismenny wrote:
> > diff --git a/Documentation/networking/tls-offload.rst b/Documentation/networking/tls-offload.rst
> > index 048e5ca44824..2bc3ab5515d8 100644
> > --- a/Documentation/networking/tls-offload.rst
> > +++ b/Documentation/networking/tls-offload.rst
> > @@ -499,15 +499,6 @@ offloads, old connections will remain active after flags are cleared.
> >   Known bugs
> >   ==========
> >   
> > -skb_orphan() leaks clear text
> > ------------------------------
> > -
> > -Currently drivers depend on the :c:member:`sk` member of
> > -:c:type:`struct sk_buff <sk_buff>` to identify segments requiring
> > -encryption. Any operation which removes or does not preserve the socket
> > -association such as :c:func:`skb_orphan` or :c:func:`skb_clone`
> > -will cause the driver to miss the packets and lead to clear text leaks.
> > -
> >   Redirects leak clear text
> >   -------------------------  
> Doesn't this patch cover the redirect case as well?

Ah, good catch! I thought this entry was for socket redirect, which
will be a separate fix, but it seems we don't have an entry for that
one. 

> > diff --git a/include/net/sock.h b/include/net/sock.h
> > index 228db3998e46..90f3f552c789 100644
> > --- a/include/net/sock.h
> > +++ b/include/net/sock.h
> > @@ -814,6 +814,7 @@ enum sock_flags {
> >   	SOCK_TXTIME,
> >   	SOCK_XDP, /* XDP is attached */
> >   	SOCK_TSTAMP_NEW, /* Indicates 64 bit timestamps always */
> > +	SOCK_CRYPTO_TX_PLAIN_TEXT, /* Generate skbs with decrypted flag set */
> >   };
> >   
> >   #define SK_FLAGS_TIMESTAMP ((1UL << SOCK_TIMESTAMP) | (1UL << SOCK_TIMESTAMPING_RX_SOFTWARE))
> > @@ -2480,8 +2481,26 @@ static inline bool sk_fullsock(const struct sock *sk)
> >   	return (1 << sk->sk_state) & ~(TCPF_TIME_WAIT | TCPF_NEW_SYN_RECV);
> >   }
> >   
> > +static inline void sk_set_tx_crypto(const struct sock *sk, struct sk_buff *skb)
> > +{
> > +#ifdef CONFIG_TLS_DEVICE
> > +	skb->decrypted = sock_flag(sk, SOCK_CRYPTO_TX_PLAIN_TEXT);
> > +#endif
> > +}  
> 
> Have you considered the following alternative to calling 
> sk_set_tx_crypto() - Add a new MSG_DECRYPTE flag that will set the 
> skb->decrypted bit in the do_tcp_sendpages function.
> 
> Then, the rest of the TCP code can propagate this bit from the existing skb.

That'd also work marking the socket as crypto seemed easy enough. I was
planning on using sk_rx_crypto_match() for socket redirect also, so
it'd be symmetrical. Is there a preference for using the internal flags?

> > diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c
> > index f62f0e7e3cdd..dee93eab02fe 100644
> > --- a/net/ipv4/tcp.c
> > +++ b/net/ipv4/tcp.c
> > @@ -984,6 +984,7 @@ ssize_t do_tcp_sendpages(struct sock *sk, struct page *page, int offset,
> >   			if (!skb)
> >   				goto wait_for_memory;
> >   
> > +			sk_set_tx_crypto(sk, skb);
> >   			skb_entail(sk, skb);
> >   			copy = size_goal;
> >   		}
> > @@ -993,7 +994,8 @@ ssize_t do_tcp_sendpages(struct sock *sk, struct page *page, int offset,
> >   
> >   		i = skb_shinfo(skb)->nr_frags;
> >   		can_coalesce = skb_can_coalesce(skb, i, page, offset);
> > -		if (!can_coalesce && i >= sysctl_max_skb_frags) {
> > +		if ((!can_coalesce && i >= sysctl_max_skb_frags) ||
> > +		    !sk_tx_crypto_match(sk, skb)) {  
> 
> I see that sk_tx_crypto_match is called only here to handle cases where 
> the socket expected crypto offload, while the skb is not marked 
> decrypted. AFAIU, this should not be possible, because we set the 
> skb->eor bit for the last plaintext skb before sending any traffic that 
> expects crypto offload.

Ack, I missed the tcp_skb_can_collapse_to() above.

^ permalink raw reply

* Re: [PATCH bpf-next v10 06/10] bpf,landlock: Add a new map type: inode
From: Mickaël Salaün @ 2019-07-31 18:46 UTC (permalink / raw)
  To: Alexei Starovoitov, Mickaël Salaün
  Cc: linux-kernel, Alexander Viro, Alexei Starovoitov, Andrew Morton,
	Andy Lutomirski, Arnaldo Carvalho de Melo, Casey Schaufler,
	Daniel Borkmann, David Drysdale, David S . Miller,
	Eric W . Biederman, James Morris, Jann Horn, John Johansen,
	Jonathan Corbet, Kees Cook, Michael Kerrisk, Paul Moore,
	Sargun Dhillon, Serge E . Hallyn, Shuah Khan, Stephen Smalley,
	Tejun Heo, Tetsuo Handa, Thomas Graf, Tycho Andersen, Will Drewry,
	kernel-hardening, linux-api, linux-fsdevel, linux-security-module,
	netdev
In-Reply-To: <20190727014048.3czy3n2hi6hfdy3m@ast-mbp.dhcp.thefacebook.com>


On 27/07/2019 03:40, Alexei Starovoitov wrote:
> On Sun, Jul 21, 2019 at 11:31:12PM +0200, Mickaël Salaün wrote:
>> FIXME: 64-bits in the doc

FYI, this FIXME was fixed, just not removed from this message. :)

>>
>> This new map store arbitrary values referenced by inode keys.  The map
>> can be updated from user space with file descriptor pointing to inodes
>> tied to a file system.  From an eBPF (Landlock) program point of view,
>> such a map is read-only and can only be used to retrieved a value tied
>> to a given inode.  This is useful to recognize an inode tagged by user
>> space, without access right to this inode (i.e. no need to have a write
>> access to this inode).
>>
>> Add dedicated BPF functions to handle this type of map:
>> * bpf_inode_htab_map_update_elem()
>> * bpf_inode_htab_map_lookup_elem()
>> * bpf_inode_htab_map_delete_elem()
>>
>> This new map require a dedicated helper inode_map_lookup_elem() because
>> of the key which is a pointer to an opaque data (only provided by the
>> kernel).  This act like a (physical or cryptographic) key, which is why
>> it is also not allowed to get the next key.
>>
>> Signed-off-by: Mickaël Salaün <mic@digikod.net>
>
> there are too many things to comment on.
> Let's do this patch.
>
> imo inode_map concept is interesting, but see below...
>
>> +
>> +    /*
>> +     * Limit number of entries in an inode map to the maximum number of
>> +     * open files for the current process. The maximum number of file
>> +     * references (including all inode maps) for a process is then
>> +     * (RLIMIT_NOFILE - 1) * RLIMIT_NOFILE. If the process' RLIMIT_NOFILE
>> +     * is 0, then any entry update is forbidden.
>> +     *
>> +     * An eBPF program can inherit all the inode map FD. The worse case is
>> +     * to fill a bunch of arraymaps, create an eBPF program, close the
>> +     * inode map FDs, and start again. The maximum number of inode map
>> +     * entries can then be close to RLIMIT_NOFILE^3.
>> +     */
>> +    if (attr->max_entries > rlimit(RLIMIT_NOFILE))
>> +            return -EMFILE;
>
> rlimit is checked, but no fd are consumed.
> Once created such inode map_fd can be passed to a different process.
> map_fd can be pinned into bpffs.
> etc.
> what the value of the check?

I was looking for the most meaningful limit for a process, and rlimit is
the best I found. As the limit of open FD per processes, rlimit is not
perfect, but I think the semantic is close here (e.g. a process can also
pass FD through unix socket).

>
>> +
>> +    /* decorelate UAPI from kernel API */
>> +    attr->key_size = sizeof(struct inode *);
>> +
>> +    return htab_map_alloc_check(attr);
>> +}
>> +
>> +static void inode_htab_put_key(void *key)
>> +{
>> +    struct inode **inode = key;
>> +
>> +    if ((*inode)->i_state & I_FREEING)
>> +            return;
>
> checking the state without take a lock? isn't it racy?

This should only trigger when called from security_inode_free(). I'll
add a comment.

>
>> +    iput(*inode);
>> +}
>> +
>> +/* called from syscall or (never) from eBPF program */
>> +static int map_get_next_no_key(struct bpf_map *map, void *key, void *next_key)
>> +{
>> +    /* do not leak a file descriptor */
>
> what this comment suppose to mean?

Because a key is a reference to an inode, a possible return value for
this function could be a file descriptor pointing to this inode (the
same way a file descriptor is use to add an element). For now, I don't
want to implement a way for a process with such a map to extract such
inode, which I compare to a possible leak (of information, not kernel
memory nor object). This could be implemented in the future if there is
value in it (and probably some additional safeguards), though.

>
>> +    return -ENOTSUPP;
>> +}
>> +
>> +/* must call iput(inode) after this call */
>> +static struct inode *inode_from_fd(int ufd, bool check_access)
>> +{
>> +    struct inode *ret;
>> +    struct fd f;
>> +    int deny;
>> +
>> +    f = fdget(ufd);
>> +    if (unlikely(!f.file))
>> +            return ERR_PTR(-EBADF);
>> +    /* TODO?: add this check when called from an eBPF program too (already
>> +    * checked by the LSM parent hooks anyway) */
>> +    if (unlikely(IS_PRIVATE(file_inode(f.file)))) {
>> +            ret = ERR_PTR(-EINVAL);
>> +            goto put_fd;
>> +    }
>> +    /* check if the FD is tied to a mount point */
>> +    /* TODO?: add this check when called from an eBPF program too */
>> +    if (unlikely(f.file->f_path.mnt->mnt_flags & MNT_INTERNAL)) {
>> +            ret = ERR_PTR(-EINVAL);
>> +            goto put_fd;
>> +    }
>
> a bunch of TODOs do not inspire confidence.

I think the current implement is good, but these TODOs are here to draw
attention on particular points for which I would like external review
and opinion (hence the "?").

>
>> +    if (check_access) {
>> +            /*
>> +            * must be allowed to access attributes from this file to then
>> +            * be able to compare an inode to its map entry
>> +            */
>> +            deny = security_inode_getattr(&f.file->f_path);
>> +            if (deny) {
>> +                    ret = ERR_PTR(deny);
>> +                    goto put_fd;
>> +            }
>> +    }
>> +    ret = file_inode(f.file);
>> +    ihold(ret);
>> +
>> +put_fd:
>> +    fdput(f);
>> +    return ret;
>> +}
>> +
>> +/*
>> + * The key is a FD when called from a syscall, but an inode address when called
>> + * from an eBPF program.
>> + */
>> +
>> +/* called from syscall */
>> +int bpf_inode_fd_htab_map_lookup_elem(struct bpf_map *map, int *key, void *value)
>> +{
>> +    void *ptr;
>> +    struct inode *inode;
>> +    int ret;
>> +
>> +    /* check inode access */
>> +    inode = inode_from_fd(*key, true);
>> +    if (IS_ERR(inode))
>> +            return PTR_ERR(inode);
>> +
>> +    rcu_read_lock();
>> +    ptr = htab_map_lookup_elem(map, &inode);
>> +    iput(inode);
>> +    if (IS_ERR(ptr)) {
>> +            ret = PTR_ERR(ptr);
>> +    } else if (!ptr) {
>> +            ret = -ENOENT;
>> +    } else {
>> +            ret = 0;
>> +            copy_map_value(map, value, ptr);
>> +    }
>> +    rcu_read_unlock();
>> +    return ret;
>> +}
>> +
>> +/* called from kernel */
>
> wrong comment?
> kernel side cannot call it, right?

This is called from bpf_inode_fd_htab_map_delete_elem() (code just
beneath), and from
kernel/bpf/syscall.c:bpf_inode_ptr_unlocked_htab_map_delet_elem() which
can be called by security_inode_free() (hook_inode_free_security).

>
>> +int bpf_inode_ptr_locked_htab_map_delete_elem(struct bpf_map *map,
>> +            struct inode **key, bool remove_in_inode)
>> +{
>> +    if (remove_in_inode)
>> +            landlock_inode_remove_map(*key, map);
>> +    return htab_map_delete_elem(map, key);
>> +}
>> +
>> +/* called from syscall */
>> +int bpf_inode_fd_htab_map_delete_elem(struct bpf_map *map, int *key)
>> +{
>> +    struct inode *inode;
>> +    int ret;
>> +
>> +    /* do not check inode access (similar to directory check) */
>> +    inode = inode_from_fd(*key, false);
>> +    if (IS_ERR(inode))
>> +            return PTR_ERR(inode);
>> +    ret = bpf_inode_ptr_locked_htab_map_delete_elem(map, &inode, true);
>> +    iput(inode);
>> +    return ret;
>> +}
>> +
>> +/* called from syscall */
>> +int bpf_inode_fd_htab_map_update_elem(struct bpf_map *map, int *key, void *value,
>> +            u64 map_flags)
>> +{
>> +    struct inode *inode;
>> +    int ret;
>> +
>> +    WARN_ON_ONCE(!rcu_read_lock_held());
>> +
>> +    /* check inode access */
>> +    inode = inode_from_fd(*key, true);
>> +    if (IS_ERR(inode))
>> +            return PTR_ERR(inode);
>> +    ret = htab_map_update_elem(map, &inode, value, map_flags);
>> +    if (!ret)
>> +            ret = landlock_inode_add_map(inode, map);
>> +    iput(inode);
>> +    return ret;
>> +}
>> +
>> +static void inode_htab_map_free(struct bpf_map *map)
>> +{
>> +    struct bpf_htab *htab = container_of(map, struct bpf_htab, map);
>> +    struct hlist_nulls_node *n;
>> +    struct hlist_nulls_head *head;
>> +    struct htab_elem *l;
>> +    int i;
>> +
>> +    for (i = 0; i < htab->n_buckets; i++) {
>> +            head = select_bucket(htab, i);
>> +            hlist_nulls_for_each_entry_safe(l, n, head, hash_node) {
>> +                    landlock_inode_remove_map(*((struct inode **)l->key), map);
>> +            }
>> +    }
>> +    htab_map_free(map);
>> +}
>
> user space can delete the map.
> that will trigger inode_htab_map_free() which will call
> landlock_inode_remove_map().
> which will simply itereate the list and delete from the list.

landlock_inode_remove_map() removes the reference to the map (being
freed) from the inode (with an RCU lock).

>
> While in parallel inode can be destoyed and hook_inode_free_security()
> will be called.
> I think nothing that protects from this race.

According to security_inode_free(), the inode is effectively freed after
the RCU grace period. However, I forgot to call bpf_map_inc() in
landlock_inode_add_map(), which would prevent the map to be freed
outside of the security_inode_free(). I'll fix that.

>
>> +
>> +/*
>> + * We need a dedicated helper to deal with inode maps because the key is a
>> + * pointer to an opaque data, only provided by the kernel.  This really act
>> + * like a (physical or cryptographic) key, which is why it is also not allowed
>> + * to get the next key with map_get_next_key().
>
> inode pointer is like cryptographic key? :)

I wanted to highlight the fact that, contrary to other map key types,
the value of this one should not be readable, only usable. A "secret
value" is more appropriate but still confusing. I'll rephrase that.

>
>> + */
>> +BPF_CALL_2(bpf_inode_map_lookup_elem, struct bpf_map *, map, void *, key)
>> +{
>> +    WARN_ON_ONCE(!rcu_read_lock_held());
>> +    return (unsigned long)htab_map_lookup_elem(map, &key);
>> +}
>> +
>> +const struct bpf_func_proto bpf_inode_map_lookup_elem_proto = {
>> +    .func           = bpf_inode_map_lookup_elem,
>> +    .gpl_only       = false,
>> +    .pkt_access     = true,
>
> pkt_access ? :)

This slipped in with this rebase, I'll remove it. :)

>
>> +    .ret_type       = RET_PTR_TO_MAP_VALUE_OR_NULL,
>> +    .arg1_type      = ARG_CONST_MAP_PTR,
>> +    .arg2_type      = ARG_PTR_TO_INODE,
>> +};
>> diff --git a/kernel/bpf/syscall.c b/kernel/bpf/syscall.c
>> index b2a8cb14f28e..e46441c42b68 100644
>> --- a/kernel/bpf/syscall.c
>> +++ b/kernel/bpf/syscall.c
>> @@ -801,6 +801,8 @@ static int map_lookup_elem(union bpf_attr *attr)
>>      } else if (map->map_type == BPF_MAP_TYPE_QUEUE ||
>>                 map->map_type == BPF_MAP_TYPE_STACK) {
>>              err = map->ops->map_peek_elem(map, value);
>> +    } else if (map->map_type == BPF_MAP_TYPE_INODE) {
>> +            err = bpf_inode_fd_htab_map_lookup_elem(map, key, value);
>>      } else {
>>              rcu_read_lock();
>>              if (map->ops->map_lookup_elem_sys_only)
>> @@ -951,6 +953,10 @@ static int map_update_elem(union bpf_attr *attr)
>>      } else if (map->map_type == BPF_MAP_TYPE_QUEUE ||
>>                 map->map_type == BPF_MAP_TYPE_STACK) {
>>              err = map->ops->map_push_elem(map, value, attr->flags);
>> +    } else if (map->map_type == BPF_MAP_TYPE_INODE) {
>> +            rcu_read_lock();
>> +            err = bpf_inode_fd_htab_map_update_elem(map, key, value, attr->flags);
>> +            rcu_read_unlock();
>>      } else {
>>              rcu_read_lock();
>>              err = map->ops->map_update_elem(map, key, value, attr->flags);
>> @@ -1006,7 +1012,10 @@ static int map_delete_elem(union bpf_attr *attr)
>>      preempt_disable();
>>      __this_cpu_inc(bpf_prog_active);
>>      rcu_read_lock();
>> -    err = map->ops->map_delete_elem(map, key);
>> +    if (map->map_type == BPF_MAP_TYPE_INODE)
>> +            err = bpf_inode_fd_htab_map_delete_elem(map, key);
>> +    else
>> +            err = map->ops->map_delete_elem(map, key);
>>      rcu_read_unlock();
>>      __this_cpu_dec(bpf_prog_active);
>>      preempt_enable();
>> @@ -1018,6 +1027,22 @@ static int map_delete_elem(union bpf_attr *attr)
>>      return err;
>>  }
>>
>> +int bpf_inode_ptr_unlocked_htab_map_delete_elem(struct bpf_map *map,
>> +                                            struct inode **key, bool remove_in_inode)
>> +{
>> +    int err;
>> +
>> +    preempt_disable();
>> +    __this_cpu_inc(bpf_prog_active);
>> +    rcu_read_lock();
>> +    err = bpf_inode_ptr_locked_htab_map_delete_elem(map, key, remove_in_inode);
>> +    rcu_read_unlock();
>> +    __this_cpu_dec(bpf_prog_active);
>> +    preempt_enable();
>> +    maybe_wait_bpf_programs(map);
>
> if that function was actually doing synchronize_rcu() the consequences
> would have been unpleasant. Fortunately it's a nop in this case.
> Please read the code carefully before copy-paste.
> Also what do you think the reason of bpf_prog_active above?
> What is the reason of rcu_read_lock above?

The RCU is used as for every map modifications (usually from userspace).
I wasn't sure about the other protections so I kept the same (generic)
checks as in map_delete_elem() (just above) because this function follow
the same semantic. What can I safely remove?

>
> I think the patch set needs to shrink at least in half to be reviewable.
> The way you tie seccomp and lsm is probably the biggest obstacle
> than any of the bugs above.
> Can you drop seccomp ? and do it as normal lsm ?

The seccomp/enforcement part is needed to have a minimum viable product,
i.e. a process able to sandbox itself. Are you suggesting to first merge
a version when it is only possible to create inode maps but not use them
in an useful way (i.e. for sandboxing)? I can do it if it's OK with you,
and I hope it will not be a problem for the security folks if it can
help to move forward.

--
Mickaël Salaün
ANSSI/SDE/ST/LAM

Les données à caractère personnel recueillies et traitées dans le cadre de cet échange, le sont à seule fin d’exécution d’une relation professionnelle et s’opèrent dans cette seule finalité et pour la durée nécessaire à cette relation. Si vous souhaitez faire usage de vos droits de consultation, de rectification et de suppression de vos données, veuillez contacter contact.rgpd@sgdsn.gouv.fr. Si vous avez reçu ce message par erreur, nous vous remercions d’en informer l’expéditeur et de détruire le message. The personal data collected and processed during this exchange aims solely at completing a business relationship and is limited to the necessary duration of that relationship. If you wish to use your rights of consultation, rectification and deletion of your data, please contact: contact.rgpd@sgdsn.gouv.fr. If you have received this message in error, we thank you for informing the sender and destroying the message.

^ permalink raw reply

* [PATCH] net: mdio-octeon: Fix build error and Kconfig warning
From: Nathan Chancellor @ 2019-07-31 18:50 UTC (permalink / raw)
  To: davem
  Cc: andrew, broonie, devel, f.fainelli, gregkh, hkallweit1,
	kernel-build-reports, linux-arm-kernel, linux-next,
	natechancellor, netdev, willy, kbuild test robot, Randy Dunlap
In-Reply-To: <20190731.094150.851749535529247096.davem@davemloft.net>

arm allyesconfig warns:

WARNING: unmet direct dependencies detected for MDIO_OCTEON
  Depends on [n]: NETDEVICES [=y] && MDIO_DEVICE [=y] && MDIO_BUS [=y]
&& 64BIT && HAS_IOMEM [=y] && OF_MDIO [=y]
  Selected by [y]:
  - OCTEON_ETHERNET [=y] && STAGING [=y] && (CAVIUM_OCTEON_SOC &&
NETDEVICES [=y] || COMPILE_TEST [=y])

and errors:

In file included from ../drivers/net/phy/mdio-octeon.c:14:
../drivers/net/phy/mdio-octeon.c: In function 'octeon_mdiobus_probe':
../drivers/net/phy/mdio-cavium.h:111:36: error: implicit declaration of
function 'writeq'; did you mean 'writeb'?
[-Werror=implicit-function-declaration]
  111 | #define oct_mdio_writeq(val, addr) writeq(val, (void *)addr)
      |                                    ^~~~~~
../drivers/net/phy/mdio-octeon.c:56:2: note: in expansion of macro
'oct_mdio_writeq'
   56 |  oct_mdio_writeq(smi_en.u64, bus->register_base + SMI_EN);
      |  ^~~~~~~~~~~~~~~
cc1: some warnings being treated as errors

This allows MDIO_OCTEON to be built with COMPILE_TEST as well and
includes the proper header for readq/writeq. This does not address
the several -Wint-to-pointer-cast and -Wpointer-to-int-cast warnings
that appeared as a result of commit 171a9bae68c7 ("staging/octeon:
Allow test build on !MIPS") in these files.

Fixes: 171a9bae68c7 ("staging/octeon: Allow test build on !MIPS")
Reported-by: kbuild test robot <lkp@intel.com>
Reported-by: Mark Brown <broonie@kernel.org>
Reported-by: Randy Dunlap <rdunlap@infradead.org>
Signed-off-by: Nathan Chancellor <natechancellor@gmail.com>
---
 drivers/net/phy/Kconfig       | 2 +-
 drivers/net/phy/mdio-cavium.h | 2 ++
 2 files changed, 3 insertions(+), 1 deletion(-)

diff --git a/drivers/net/phy/Kconfig b/drivers/net/phy/Kconfig
index 20f14c5fbb7e..ed2edf4b5b0e 100644
--- a/drivers/net/phy/Kconfig
+++ b/drivers/net/phy/Kconfig
@@ -159,7 +159,7 @@ config MDIO_MSCC_MIIM
 
 config MDIO_OCTEON
 	tristate "Octeon and some ThunderX SOCs MDIO buses"
-	depends on 64BIT
+	depends on 64BIT || COMPILE_TEST
 	depends on HAS_IOMEM && OF_MDIO
 	select MDIO_CAVIUM
 	help
diff --git a/drivers/net/phy/mdio-cavium.h b/drivers/net/phy/mdio-cavium.h
index ed5f9bb5448d..b7f89ad27465 100644
--- a/drivers/net/phy/mdio-cavium.h
+++ b/drivers/net/phy/mdio-cavium.h
@@ -108,6 +108,8 @@ static inline u64 oct_mdio_readq(u64 addr)
 	return cvmx_read_csr(addr);
 }
 #else
+#include <linux/io-64-nonatomic-lo-hi.h>
+
 #define oct_mdio_writeq(val, addr)	writeq(val, (void *)addr)
 #define oct_mdio_readq(addr)		readq((void *)addr)
 #endif
-- 
2.22.0


^ permalink raw reply related

* Re: [PATCH bpf-next v10 06/10] bpf,landlock: Add a new map type: inode
From: Alexei Starovoitov @ 2019-07-31 18:58 UTC (permalink / raw)
  To: Mickaël Salaün
  Cc: Mickaël Salaün, LKML, Alexander Viro,
	Alexei Starovoitov, Andrew Morton, Andy Lutomirski,
	Arnaldo Carvalho de Melo, Casey Schaufler, Daniel Borkmann,
	David Drysdale, David S . Miller, Eric W . Biederman,
	James Morris, Jann Horn, John Johansen, Jonathan Corbet,
	Kees Cook, Michael Kerrisk, Paul Moore, Sargun Dhillon,
	Serge E . Hallyn, Shuah Khan, Stephen Smalley, Tejun Heo,
	Tetsuo Handa, Thomas Graf, Tycho Andersen, Will Drewry,
	Kernel Hardening, Linux API, Linux-Fsdevel, LSM List,
	Network Development
In-Reply-To: <a870c2c9-d2f7-e0fa-c8cc-35dbf8b5b87d@ssi.gouv.fr>

On Wed, Jul 31, 2019 at 11:46 AM Mickaël Salaün
<mickael.salaun@ssi.gouv.fr> wrote:
> >> +    for (i = 0; i < htab->n_buckets; i++) {
> >> +            head = select_bucket(htab, i);
> >> +            hlist_nulls_for_each_entry_safe(l, n, head, hash_node) {
> >> +                    landlock_inode_remove_map(*((struct inode **)l->key), map);
> >> +            }
> >> +    }
> >> +    htab_map_free(map);
> >> +}
> >
> > user space can delete the map.
> > that will trigger inode_htab_map_free() which will call
> > landlock_inode_remove_map().
> > which will simply itereate the list and delete from the list.
>
> landlock_inode_remove_map() removes the reference to the map (being
> freed) from the inode (with an RCU lock).

I'm going to ignore everything else for now and focus only on this bit,
since it's fundamental issue to address before this discussion can
go any further.
rcu_lock is not a spin_lock. I'm pretty sure you know this.
But you're arguing that it's somehow protecting from the race
I mentioned above?

^ permalink raw reply

* [PATCH] net/mlx5e: always initialize frag->last_in_page
From: Qian Cai @ 2019-07-31 19:02 UTC (permalink / raw)
  To: davem; +Cc: saeedm, tariqt, netdev, linux-kernel, Qian Cai

The commit 069d11465a80 ("net/mlx5e: RX, Enhance legacy Receive Queue
memory scheme") introduced an undefined behaviour below due to
"frag->last_in_page" is only initialized in
mlx5e_init_frags_partition() when,

if (next_frag.offset + frag_info[f].frag_stride > PAGE_SIZE)

or after bailed out the loop,

for (i = 0; i < mlx5_wq_cyc_get_size(&rq->wqe.wq); i++)

As the result, there could be some "frag" have uninitialized
value of "last_in_page".

Later, get_frag() obtains those "frag" and check "rag->last_in_page" in
mlx5e_put_rx_frag() and triggers the error during boot. Fix it by always
initializing "frag->last_in_page" to "false" in
mlx5e_init_frags_partition().

UBSAN: Undefined behaviour in
drivers/net/ethernet/mellanox/mlx5/core/en_rx.c:325:12
load of value 170 is not a valid value for type 'bool' (aka '_Bool')
Call trace:
 dump_backtrace+0x0/0x264
 show_stack+0x20/0x2c
 dump_stack+0xb0/0x104
 __ubsan_handle_load_invalid_value+0x104/0x128
 mlx5e_handle_rx_cqe+0x8e8/0x12cc [mlx5_core]
 mlx5e_poll_rx_cq+0xca8/0x1a94 [mlx5_core]
 mlx5e_napi_poll+0x17c/0xa30 [mlx5_core]
 net_rx_action+0x248/0x940
 __do_softirq+0x350/0x7b8
 irq_exit+0x200/0x26c
 __handle_domain_irq+0xc8/0x128
 gic_handle_irq+0x138/0x228
 el1_irq+0xb8/0x140
 arch_cpu_idle+0x1a4/0x348
 do_idle+0x114/0x1b0
 cpu_startup_entry+0x24/0x28
 rest_init+0x1ac/0x1dc
 arch_call_rest_init+0x10/0x18
 start_kernel+0x4d4/0x57c

Fixes: 069d11465a80 ("net/mlx5e: RX, Enhance legacy Receive Queue memory scheme")
Signed-off-by: Qian Cai <cai@lca.pw>
---
 drivers/net/ethernet/mellanox/mlx5/core/en_main.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en_main.c b/drivers/net/ethernet/mellanox/mlx5/core/en_main.c
index 47eea6b3a1c3..96f5110a9b43 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/en_main.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en_main.c
@@ -336,6 +336,7 @@ static void mlx5e_init_frags_partition(struct mlx5e_rq *rq)
 
 	next_frag.di = &rq->wqe.di[0];
 	next_frag.offset = 0;
+	next_frag.last_in_page = false;
 	prev = NULL;
 
 	for (i = 0; i < mlx5_wq_cyc_get_size(&rq->wqe.wq); i++) {
-- 
1.8.3.1


^ permalink raw reply related

* Re: [oss-drivers] Re: [PATCH net-next] net/tls: prevent skb_orphan() from leaking TLS plain text with offload
From: Willem de Bruijn @ 2019-07-31 19:07 UTC (permalink / raw)
  To: Jakub Kicinski
  Cc: David Miller, Network Development, oss-drivers, Eric Dumazet,
	borisp, aviadye, davejwatson, John Fastabend, Daniel Borkmann,
	Cong Wang, Florian Westphal, Alexei Starovoitov
In-Reply-To: <20190731111213.35237adc@cakuba.netronome.com>

On Wed, Jul 31, 2019 at 2:12 PM Jakub Kicinski
<jakub.kicinski@netronome.com> wrote:
>
> On Wed, 31 Jul 2019 11:57:10 -0400, Willem de Bruijn wrote:
> > On Tue, Jul 30, 2019 at 5:13 PM Jakub Kicinski wrote:
> > > sk_validate_xmit_skb() and drivers depend on the sk member of
> > > struct sk_buff to identify segments requiring encryption.
> > > Any operation which removes or does not preserve the original TLS
> > > socket such as skb_orphan() or skb_clone() will cause clear text
> > > leaks.
> > >
> > > Make the TCP socket underlying an offloaded TLS connection
> > > mark all skbs as decrypted, if TLS TX is in offload mode.
> > > Then in sk_validate_xmit_skb() catch skbs which have no socket
> > > (or a socket with no validation) and decrypted flag set.
> > >
> > > Note that CONFIG_SOCK_VALIDATE_XMIT, CONFIG_TLS_DEVICE and
> > > sk->sk_validate_xmit_skb are slightly interchangeable right now,
> > > they all imply TLS offload. The new checks are guarded by
> > > CONFIG_TLS_DEVICE because that's the option guarding the
> > > sk_buff->decrypted member.
> > >
> > > Second, smaller issue with orphaning is that it breaks
> > > the guarantee that packets will be delivered to device
> > > queues in-order. All TLS offload drivers depend on that
> > > scheduling property. This means skb_orphan_partial()'s
> > > trick of preserving partial socket references will cause
> > > issues in the drivers. We need a full orphan, and as a
> > > result netem delay/throttling will cause all TLS offload
> > > skbs to be dropped.
> > >
> > > Reusing the sk_buff->decrypted flag also protects from
> > > leaking clear text when incoming, decrypted skb is redirected
> > > (e.g. by TC).
> > >
> > > Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
>
> > > diff --git a/net/core/sock.c b/net/core/sock.c
> > > index d57b0cc995a0..b0c10b518e65 100644
> > > --- a/net/core/sock.c
> > > +++ b/net/core/sock.c
> > > @@ -1992,6 +1992,22 @@ void skb_set_owner_w(struct sk_buff *skb, struct sock *sk)
> > >  }
> > >  EXPORT_SYMBOL(skb_set_owner_w);
> > >
> > > +static bool can_skb_orphan_partial(const struct sk_buff *skb)
> > > +{
> > > +#ifdef CONFIG_TLS_DEVICE
> > > +       /* Drivers depend on in-order delivery for crypto offload,
> > > +        * partial orphan breaks out-of-order-OK logic.
> > > +        */
> > > +       if (skb->decrypted)
> > > +               return false;
> > > +#endif
> > > +#ifdef CONFIG_INET
> > > +       if (skb->destructor == tcp_wfree)
> > > +               return true;
> > > +#endif
> > > +       return skb->destructor == sock_wfree;
> > > +}
> > > +
> >
> > Just insert the skb->decrypted check into skb_orphan_partial for less
> > code churn?
>
> Okie.. skb_orphan_partial() is a little ugly but will do.
>
> > I also think that this is an independent concern from leaking plain
> > text, so perhaps could be a separate patch.

Just a suggestion and very much depending on how much uglier it
becomes otherwise ;)

> Do you mean the out-of-order stuff is a separate concern?
>
> It is, I had them separate at the first try, but GSO code looks at
> the destructor and IIRC only copies the socket if its still tcp_wfree.
> If we let partial orphan be we have to do temporary hairy stuff in
> tcp_gso_segment(). It's easier to just deal with partial orphan here.

Okay, sounds good.

^ permalink raw reply

* Re: [PATCH v2 bpf-next 1/4] bpf: unprivileged BPF access via /dev/bpf
From: Andy Lutomirski @ 2019-07-31 19:09 UTC (permalink / raw)
  To: Song Liu
  Cc: Andy Lutomirski, Kees Cook, Networking, bpf, Alexei Starovoitov,
	Daniel Borkmann, Kernel Team, Lorenz Bauer, Jann Horn, Greg KH,
	Linux API, LSM List
In-Reply-To: <D4040C0C-47D6-4852-933C-59EB53C05242@fb.com>

On Wed, Jul 31, 2019 at 1:10 AM Song Liu <songliubraving@fb.com> wrote:
>
>
>
> > On Jul 30, 2019, at 1:24 PM, Andy Lutomirski <luto@kernel.org> wrote:
> >
> > On Mon, Jul 29, 2019 at 10:07 PM Song Liu <songliubraving@fb.com> wrote:
> >>
> >> Hi Andy,
> >>
> >>> On Jul 27, 2019, at 11:20 AM, Song Liu <songliubraving@fb.com> wrote:
> >>>
> >>> Hi Andy,
> >>>
> >>>
>
> [...]
>
> >>>
> >>
> >> I would like more comments on this.
> >>
> >> Currently, bpf permission is more or less "root or nothing", which we
> >> would like to change.
> >>
> >> The short term goal is to separate bpf from root, in other words, it is
> >> "all or nothing". Special user space utilities, such as systemd, would
> >> benefit from this. Once this is implemented, systemd can call sys_bpf()
> >> when it is not running as root.
> >
> > As generally nasty as Linux capabilities are, this sounds like a good
> > use for CAP_BPF_ADMIN.
>
> I actually agree CAP_BPF_ADMIN makes sense. The hard part is to make
> existing tools (setcap, getcap, etc.) and libraries aware of the new CAP.

It's been done before -- it's not that hard.  IMO the main tricky bit
would be try be somewhat careful about defining exactly what
CAP_BPF_ADMIN does.

> > I don't see why you need to invent a whole new mechanism for this.
> > The entire cgroup ecosystem outside bpf() does just fine using the
> > write permission on files in cgroupfs to control access.  Why can't
> > bpf() do the same thing?
>
> It is easier to use write permission for BPF_PROG_ATTACH. But it is
> not easy to do the same for other bpf commands: BPF_PROG_LOAD and
> BPF_MAP_*. A lot of these commands don't have target concept. Maybe
> we should have target concept for all these commands. But that is a
> much bigger project. OTOH, "all or nothing" model allows all these
> commands at once.

For BPF_PROG_LOAD, I admit I've never understood why permission is
required at all.  I think that CAP_SYS_ADMIN or similar should be
needed to get is_priv in the verifier, but I think that should mainly
be useful for tracing, and that requires lots of privilege anyway.
BPF_MAP_* is probably the trickiest part.  One solution would be some
kind of bpffs, but I'm sure other solutions are possible.

^ permalink raw reply

* Re: [PATCH bpf-next v10 06/10] bpf,landlock: Add a new map type: inode
From: Mickaël Salaün @ 2019-07-31 19:11 UTC (permalink / raw)
  To: Alexei Starovoitov
  Cc: Mickaël Salaün, LKML, Alexander Viro,
	Alexei Starovoitov, Andrew Morton, Andy Lutomirski,
	Arnaldo Carvalho de Melo, Casey Schaufler, Daniel Borkmann,
	David Drysdale, David S . Miller, Eric W . Biederman,
	James Morris, Jann Horn, John Johansen, Jonathan Corbet,
	Kees Cook, Michael Kerrisk, Paul Moore, Sargun Dhillon,
	Serge E . Hallyn, Shuah Khan, Stephen Smalley, Tejun Heo,
	Tetsuo Handa, Thomas Graf, Tycho Andersen, Will Drewry,
	Kernel Hardening, Linux API, Linux-Fsdevel, LSM List,
	Network Development
In-Reply-To: <CAADnVQLqkfVijWoOM29PxCL_yK6K0fr8B89r4c5EKgddevJhGQ@mail.gmail.com>

On 31/07/2019 20:58, Alexei Starovoitov wrote:
> On Wed, Jul 31, 2019 at 11:46 AM Mickaël Salaün
> <mickael.salaun@ssi.gouv.fr> wrote:
>>>> +    for (i = 0; i < htab->n_buckets; i++) {
>>>> +            head = select_bucket(htab, i);
>>>> +            hlist_nulls_for_each_entry_safe(l, n, head, hash_node) {
>>>> +                    landlock_inode_remove_map(*((struct inode **)l->key), map);
>>>> +            }
>>>> +    }
>>>> +    htab_map_free(map);
>>>> +}
>>>
>>> user space can delete the map.
>>> that will trigger inode_htab_map_free() which will call
>>> landlock_inode_remove_map().
>>> which will simply itereate the list and delete from the list.
>>
>> landlock_inode_remove_map() removes the reference to the map (being
>> freed) from the inode (with an RCU lock).
>
> I'm going to ignore everything else for now and focus only on this bit,
> since it's fundamental issue to address before this discussion can
> go any further.
> rcu_lock is not a spin_lock. I'm pretty sure you know this.
> But you're arguing that it's somehow protecting from the race
> I mentioned above?
>

I was just clarifying your comment to avoid misunderstanding about what
is being removed.

As said in the full response, there is currently a race but, if I add a
bpf_map_inc() call when the map is referenced by inode->security, then I
don't see how a race could occur because such added map could only be
freed in a security_inode_free() (as long as it retains a reference to
this inode).

--
Mickaël Salaün
ANSSI/SDE/ST/LAM

Les données à caractère personnel recueillies et traitées dans le cadre de cet échange, le sont à seule fin d’exécution d’une relation professionnelle et s’opèrent dans cette seule finalité et pour la durée nécessaire à cette relation. Si vous souhaitez faire usage de vos droits de consultation, de rectification et de suppression de vos données, veuillez contacter contact.rgpd@sgdsn.gouv.fr. Si vous avez reçu ce message par erreur, nous vous remercions d’en informer l’expéditeur et de détruire le message. The personal data collected and processed during this exchange aims solely at completing a business relationship and is limited to the necessary duration of that relationship. If you wish to use your rights of consultation, rectification and deletion of your data, please contact: contact.rgpd@sgdsn.gouv.fr. If you have received this message in error, we thank you for informing the sender and destroying the message.

^ permalink raw reply

* Re: [PATCH net-next 0/6] flow_offload: add indr-block in nf_table_offload
From: Jiri Pirko @ 2019-07-31 19:21 UTC (permalink / raw)
  To: wenxu; +Cc: pablo, fw, jakub.kicinski, netfilter-devel, netdev
In-Reply-To: <1564560753-32603-1-git-send-email-wenxu@ucloud.cn>

Wed, Jul 31, 2019 at 10:12:27AM CEST, wenxu@ucloud.cn wrote:
>From: wenxu <wenxu@ucloud.cn>
>
>This series patch make nftables offload support the vlan and
>tunnel device offload through indr-block architecture.
>
>The first four patches mv tc indr block to flow offload and
>rename to flow-indr-block.
>Because the new flow-indr-block can't get the tcf_block
>directly. The fifthe patch provide a callback list to get 
>flow_block of each subsystem immediately when the device
>register and contain a block.
>The last patch make nf_tables_offload support flow-indr-block.
>
>wenxu (6):
>  cls_api: modify the tc_indr_block_ing_cmd parameters.
>  cls_api: replace block with flow_block in tc_indr_block_dev
>  cls_api: add flow_indr_block_call function
>  flow_offload: move tc indirect block to flow offload
>  flow_offload: support get flow_block immediately
>  netfilter: nf_tables_offload: support indr block call

Wenxu, this is V5. Please repost is as V5. Also please provide per-patch
changelog, as you did with the previous versions. Thanks!

^ permalink raw reply

* Re: [PATCH v2 bpf-next 02/12] libbpf: implement BPF CO-RE offset relocation algorithm
From: Song Liu @ 2019-07-31 17:46 UTC (permalink / raw)
  To: Andrii Nakryiko
  Cc: Andrii Nakryiko, bpf, netdev@vger.kernel.org, Alexei Starovoitov,
	daniel@iogearbox.net, Yonghong Song, Kernel Team
In-Reply-To: <CAEf4Bzb6swYtf7J_m1bZo6o+aT1AcCXZX5ZBw7Uja=Tne2LCuw@mail.gmail.com>



> On Jul 31, 2019, at 10:18 AM, Andrii Nakryiko <andrii.nakryiko@gmail.com> wrote:
> 
> On Wed, Jul 31, 2019 at 1:30 AM Song Liu <songliubraving@fb.com> wrote:
>> 
>> 
>> 
>>> On Jul 30, 2019, at 11:52 PM, Andrii Nakryiko <andrii.nakryiko@gmail.com> wrote:
>>> 
>>> On Tue, Jul 30, 2019 at 10:19 PM Song Liu <songliubraving@fb.com> wrote:
>>>> 
>>>> 
>>>> 
>>>>> On Jul 30, 2019, at 6:00 PM, Andrii Nakryiko <andrii.nakryiko@gmail.com> wrote:
>>>>> 
>>>>> On Tue, Jul 30, 2019 at 5:39 PM Song Liu <songliubraving@fb.com> wrote:
>>>>>> 
>>>>>> 
>>>>>> 
>>>>>>> On Jul 30, 2019, at 12:53 PM, Andrii Nakryiko <andriin@fb.com> wrote:
>>>>>>> 
>>>>>>> This patch implements the core logic for BPF CO-RE offsets relocations.
>>>>>>> Every instruction that needs to be relocated has corresponding
>>>>>>> bpf_offset_reloc as part of BTF.ext. Relocations are performed by trying
>>>>>>> to match recorded "local" relocation spec against potentially many
>>>>>>> compatible "target" types, creating corresponding spec. Details of the
>>>>>>> algorithm are noted in corresponding comments in the code.
>>>>>>> 
>>>>>>> Signed-off-by: Andrii Nakryiko <andriin@fb.com>
>> 
>> [...]
>> 
>>>>>> 
>>>>> 
>>>>> I just picked the most succinct and non-repetitive form. It's
>>>>> immediately apparent which type it's implicitly converted to, so I
>>>>> felt there is no need to repeat it. Also, just (void *) is much
>>>>> shorter. :)
>>>> 
>>>> _All_ other code in btf.c converts the pointer to the target type.
>>> 
>>> Most in libbpf.c doesn't, though. Also, I try to preserve pointer
>>> constness for uses that don't modify BTF types (pretty much all of
>>> them in libbpf), so it becomes really verbose, despite extremely short
>>> variable names:
>>> 
>>> const struct btf_member *m = (const struct btf_member *)(t + 1);
>> 
>> I don't think being verbose is a big problem here. Overusing
> 
> Problem is too big and strong word to describe this :). It hurts
> readability and will often quite artificially force either wrapping
> the line or unnecessarily splitting declaration and assignment. Void *
> on the other hand is short and usually is in the same line as target
> type declaration, if not, you'll have to find local variable
> declaration to double-check type, if you are unsure.
> 
> Using (void *) + implicit cast to target pointer type is not
> unprecedented in libbpf:
> 
> $ rg ' = \((const )?struct \w+ \*\)' tools/lib/bpf/ | wc -l
> 52
> $ rg ' = \((const )?void \*\)' tools/lib/bpf/  | wc -l
> 35
> 
> 52 vs 35 is majority overall, but not by a landslide.
> 
>> (void *) feels like a bigger problem.
> 
> Why do you feel it's a problem? void * conveys that we have a piece of
> memory that we will need to reinterpret as some concrete pointer type.
> That's what we are doing, skipping btf_type and then interpreting
> memory after common btf_type prefix is some other type, depending on
> actual BTF kind. I don't think void * is misleading in any way.

(void *) hides some problem. For example:

	struct type_a *ptr_a = NULL;
	struct type_b *ptr_b = NULL;

	/* we want this */
	ptr_a = (struct type_a *)data;
	ptr_b = (struct type_b *)(data + offset);

	/* typo, should be ptr_b, compiler will complain */
	ptr_a = (struct type_b *)(data + offset);

	/* typo, should be ptr_b, compiler will ignore */
	ptr_a = (void *)(data + offset);

Such typo is not very rare. And it may be really painful to debug. 

That being said, I think we have spent too much time on this. I will 
let you make the final call. Either way:

Acked-by: Song Liu <songliubraving@fb.com>

^ permalink raw reply

* Re: [patch net-next v2 1/3] net: devlink: allow to change namespaces
From: Jiri Pirko @ 2019-07-31 19:26 UTC (permalink / raw)
  To: Jakub Kicinski; +Cc: netdev, davem, sthemmin, dsahern, mlxsw
In-Reply-To: <20190730153952.73de7f00@cakuba.netronome.com>

Wed, Jul 31, 2019 at 12:39:52AM CEST, jakub.kicinski@netronome.com wrote:
>On Tue, 30 Jul 2019 10:57:32 +0200, Jiri Pirko wrote:
>> From: Jiri Pirko <jiri@mellanox.com>
>> 
>> All devlink instances are created in init_net and stay there for a
>> lifetime. Allow user to be able to move devlink instances into
>> namespaces.
>> 
>> Signed-off-by: Jiri Pirko <jiri@mellanox.com>
>
>I'm no namespace expert, but seems reasonable, so FWIW:
>
>Acked-by: Jakub Kicinski <jakub.kicinski@netronome.com>
>
>If I read things right we will only send the devlink instance
>notification to other namespaces when it moves, but not
>notifications for sub-objects like ports. Is the expectation 
>that the user space dumps the objects it cares about or will
>the other notifications be added as needed (or option 3 - I 
>misread the code).

You are correct. I plan to take care of the notifications of all objects
in the follow-up patchset. But I can do it in this one if needed. Up to
you.


>
>I was also wondering it moving the devlink instance during a 
>multi-part transaction could be an issue. But AFAIU it should 
>be fine - the flashing doesn't release the instance lock, and 
>health stuff should complete correctly even if devlink is moved
>half way through?

Yes, I don't see any issue there as well.


^ permalink raw reply

* Re: [PATCH V2 7/9] vhost: do not use RCU to synchronize MMU notifier with worker
From: Jason Gunthorpe @ 2019-07-31 19:30 UTC (permalink / raw)
  To: Jason Wang; +Cc: mst, kvm, virtualization, netdev, linux-kernel, linux-mm
In-Reply-To: <7555c949-ae6f-f105-6e1d-df21ddae9e4e@redhat.com>

On Wed, Jul 31, 2019 at 09:28:20PM +0800, Jason Wang wrote:
> 
> On 2019/7/31 下午8:39, Jason Gunthorpe wrote:
> > On Wed, Jul 31, 2019 at 04:46:53AM -0400, Jason Wang wrote:
> > > We used to use RCU to synchronize MMU notifier with worker. This leads
> > > calling synchronize_rcu() in invalidate_range_start(). But on a busy
> > > system, there would be many factors that may slow down the
> > > synchronize_rcu() which makes it unsuitable to be called in MMU
> > > notifier.
> > > 
> > > A solution is SRCU but its overhead is obvious with the expensive full
> > > memory barrier. Another choice is to use seqlock, but it doesn't
> > > provide a synchronization method between readers and writers. The last
> > > choice is to use vq mutex, but it need to deal with the worst case
> > > that MMU notifier must be blocked and wait for the finish of swap in.
> > > 
> > > So this patch switches use a counter to track whether or not the map
> > > was used. The counter was increased when vq try to start or finish
> > > uses the map. This means, when it was even, we're sure there's no
> > > readers and MMU notifier is synchronized. When it was odd, it means
> > > there's a reader we need to wait it to be even again then we are
> > > synchronized.
> > You just described a seqlock.
> 
> 
> Kind of, see my explanation below.
> 
> 
> > 
> > We've been talking about providing this as some core service from mmu
> > notifiers because nearly every use of this API needs it.
> 
> 
> That would be very helpful.
> 
> 
> > 
> > IMHO this gets the whole thing backwards, the common pattern is to
> > protect the 'shadow pte' data with a seqlock (usually open coded),
> > such that the mmu notififer side has the write side of that lock and
> > the read side is consumed by the thread accessing or updating the SPTE.
> 
> 
> Yes, I've considered something like that. But the problem is, mmu notifier
> (writer) need to wait for the vhost worker to finish the read before it can
> do things like setting dirty pages and unmapping page.  It looks to me
> seqlock doesn't provide things like this.  

The seqlock is usually used to prevent a 2nd thread from accessing the
VA while it is being changed by the mm. ie you use something seqlocky
instead of the ugly mmu_notifier_unregister/register cycle.

You are supposed to use something simple like a spinlock or mutex
inside the invalidate_range_start to serialized tear down of the SPTEs
with their accessors.

> write_seqcount_begin()
> 
> map = vq->map[X]
> 
> write or read through map->addr directly
> 
> write_seqcount_end()
> 
> 
> There's no rmb() in write_seqcount_begin(), so map could be read before
> write_seqcount_begin(), but it looks to me now that this doesn't harm at
> all, maybe we can try this way.

That is because it is a write side lock, not a read lock. IIRC
seqlocks have weaker barriers because the write side needs to be
serialized in some other way.

The requirement I see is you need invalidate_range_start to block
until another thread exits its critical section (ie stops accessing
the SPTEs). 

That is a spinlock/mutex.

You just can't invent a faster spinlock by open coding something with
barriers, it doesn't work.

Jason

^ permalink raw reply

* Re: [PATCH V2 4/9] vhost: reset invalidate_count in vhost_set_vring_num_addr()
From: Jason Gunthorpe @ 2019-07-31 19:32 UTC (permalink / raw)
  To: Jason Wang; +Cc: mst, kvm, virtualization, netdev, linux-kernel, linux-mm
In-Reply-To: <31ef9ed4-d74a-3454-a57d-fa843a3a802b@redhat.com>

On Wed, Jul 31, 2019 at 09:29:28PM +0800, Jason Wang wrote:
> 
> On 2019/7/31 下午8:41, Jason Gunthorpe wrote:
> > On Wed, Jul 31, 2019 at 04:46:50AM -0400, Jason Wang wrote:
> > > The vhost_set_vring_num_addr() could be called in the middle of
> > > invalidate_range_start() and invalidate_range_end(). If we don't reset
> > > invalidate_count after the un-registering of MMU notifier, the
> > > invalidate_cont will run out of sync (e.g never reach zero). This will
> > > in fact disable the fast accessor path. Fixing by reset the count to
> > > zero.
> > > 
> > > Reported-by: Michael S. Tsirkin <mst@redhat.com>
> > Did Michael report this as well?
> 
> 
> Correct me if I was wrong. I think it's point 4 described in
> https://lkml.org/lkml/2019/7/21/25.

I'm not sure what that is talking about

But this fixes what I described:

https://lkml.org/lkml/2019/7/22/554

Jason

^ permalink raw reply

* Re: [PATCH bpf 1/2] bpf: fix x64 JIT code generation for jmp to 1st insn
From: Song Liu @ 2019-07-31 19:35 UTC (permalink / raw)
  To: Alexei Starovoitov
  Cc: David S. Miller, Daniel Borkmann, netdev@vger.kernel.org,
	bpf@vger.kernel.org, Kernel Team
In-Reply-To: <20190731013827.2445262-2-ast@kernel.org>



> On Jul 30, 2019, at 6:38 PM, Alexei Starovoitov <ast@kernel.org> wrote:
> 
> Introduction of bounded loops exposed old bug in x64 JIT.
> JIT maintains the array of offsets to the end of all instructions to
> compute jmp offsets.
> addrs[0] - offset of the end of the 1st insn (that includes prologue).
> addrs[1] - offset of the end of the 2nd insn.
> JIT didn't keep the offset of the beginning of the 1st insn,
> since classic BPF didn't have backward jumps and valid extended BPF
> couldn't have a branch to 1st insn, because it didn't allow loops.
> With bounded loops it's possible to construct a valid program that
> jumps backwards to the 1st insn.
> Fix JIT by computing:
> addrs[0] - offset of the end of prologue == start of the 1st insn.
> addrs[1] - offset of the end of 1st insn.
> 
> Reported-by: syzbot+35101610ff3e83119b1b@syzkaller.appspotmail.com
> Fixes: 2589726d12a1 ("bpf: introduce bounded loops")
> Fixes: 0a14842f5a3c ("net: filter: Just In Time compiler for x86-64")
> Signed-off-by: Alexei Starovoitov <ast@kernel.org>

Acked-by: Song Liu <songliubraving@fb.com>

Do we need similar fix for x86_32? 

Thanks,
Song

> ---
> arch/x86/net/bpf_jit_comp.c | 7 ++++---
> 1 file changed, 4 insertions(+), 3 deletions(-)
> 
> diff --git a/arch/x86/net/bpf_jit_comp.c b/arch/x86/net/bpf_jit_comp.c
> index eaaed5bfc4a4..a56c95805732 100644
> --- a/arch/x86/net/bpf_jit_comp.c
> +++ b/arch/x86/net/bpf_jit_comp.c
> @@ -390,8 +390,9 @@ static int do_jit(struct bpf_prog *bpf_prog, int *addrs, u8 *image,
> 
> 	emit_prologue(&prog, bpf_prog->aux->stack_depth,
> 		      bpf_prog_was_classic(bpf_prog));
> +	addrs[0] = prog - temp;
> 
> -	for (i = 0; i < insn_cnt; i++, insn++) {
> +	for (i = 1; i <= insn_cnt; i++, insn++) {
> 		const s32 imm32 = insn->imm;
> 		u32 dst_reg = insn->dst_reg;
> 		u32 src_reg = insn->src_reg;
> @@ -1105,7 +1106,7 @@ struct bpf_prog *bpf_int_jit_compile(struct bpf_prog *prog)
> 		extra_pass = true;
> 		goto skip_init_addrs;
> 	}
> -	addrs = kmalloc_array(prog->len, sizeof(*addrs), GFP_KERNEL);
> +	addrs = kmalloc_array(prog->len + 1, sizeof(*addrs), GFP_KERNEL);
> 	if (!addrs) {
> 		prog = orig_prog;
> 		goto out_addrs;
> @@ -1115,7 +1116,7 @@ struct bpf_prog *bpf_int_jit_compile(struct bpf_prog *prog)
> 	 * Before first pass, make a rough estimation of addrs[]
> 	 * each BPF instruction is translated to less than 64 bytes
> 	 */
> -	for (proglen = 0, i = 0; i < prog->len; i++) {
> +	for (proglen = 0, i = 0; i <= prog->len; i++) {
> 		proglen += 64;
> 		addrs[i] = proglen;
> 	}
> -- 
> 2.20.0
> 


^ permalink raw reply

* Re: [PATCH V2 4/9] vhost: reset invalidate_count in vhost_set_vring_num_addr()
From: Michael S. Tsirkin @ 2019-07-31 19:37 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Jason Wang, kvm, virtualization, netdev, linux-kernel, linux-mm
In-Reply-To: <20190731193252.GH3946@ziepe.ca>

On Wed, Jul 31, 2019 at 04:32:52PM -0300, Jason Gunthorpe wrote:
> On Wed, Jul 31, 2019 at 09:29:28PM +0800, Jason Wang wrote:
> > 
> > On 2019/7/31 下午8:41, Jason Gunthorpe wrote:
> > > On Wed, Jul 31, 2019 at 04:46:50AM -0400, Jason Wang wrote:
> > > > The vhost_set_vring_num_addr() could be called in the middle of
> > > > invalidate_range_start() and invalidate_range_end(). If we don't reset
> > > > invalidate_count after the un-registering of MMU notifier, the
> > > > invalidate_cont will run out of sync (e.g never reach zero). This will
> > > > in fact disable the fast accessor path. Fixing by reset the count to
> > > > zero.
> > > > 
> > > > Reported-by: Michael S. Tsirkin <mst@redhat.com>
> > > Did Michael report this as well?
> > 
> > 
> > Correct me if I was wrong. I think it's point 4 described in
> > https://lkml.org/lkml/2019/7/21/25.
> 
> I'm not sure what that is talking about
> 
> But this fixes what I described:
> 
> https://lkml.org/lkml/2019/7/22/554
> 
> Jason

These are two reasons for a possible counter imbalance.
Unsurprisingly they are both fixed if you reset the counter to 0.

-- 
MST

^ permalink raw reply

* Re: [PATCH] net: bridge: Allow bridge to joing multicast groups
From: Allan W. Nielsen @ 2019-07-31 19:38 UTC (permalink / raw)
  To: Andrew Lunn
  Cc: Ido Schimmel, Nikolay Aleksandrov, Horatiu Vultur, roopa, davem,
	bridge, netdev, linux-kernel
In-Reply-To: <20190731134550.GA23028@lunn.ch>

The 07/31/2019 15:45, Andrew Lunn wrote:
> > Here is how I see it:
> > 
> > Teach the SW bridge about non-IP multicast addresses. Initially the switch
> > should forward all MAC multicast frames to the CPU. Today MDB rules can be
> > installed (either static or dynamic by IGMP), which limit the flooding of IPv4/6
> > multicast streams. In the same way, we should have a way to install a rule
> > (FDM/ or MDB) to limit the flooding of L2 multicast frames.
> > 
> > If foreign interfaces (or br0 it self) is part of the destination list, then
> > traffic also needs to go to the CPU.
> > 
> > By doing this, we can for explicitly configured dst mac address:
> > - limit the flooding on the on the SW bridge interfaces
> > - limit the flooding on the on the HW bridge interfaces
> > - prevent them to go to the CPU if they are not needed
> This is all very complex because of all the different corner cases. So
> i don't think we want a user API to do the CPU part, we want the
> network stack to do it. Otherwise the user is going to get is wrong,
> break their network, and then come running to the list for help.
Not sure I really understand what to conclude from this... Their are already
many ways the user can break it (tc has great hooks for that), and I not think
we can really prevent the user in configuring something that break stuff (but
we should not make it too easy either).

Anyway, Horatiu has come a long way in creating a (surprising simple) patch
which allow us to limit the flooding of L2-multicast. It is following the
guidance from Nikolay, it is using the MDB database, and I beleive it is well
aligned with the existing sw-bridge design.

I hope it will be ready tomorrow, then we can have a look at it and see if it is
any good.

> This also fits with how we do things in DSA. There is deliberately no
> user space concept for configuring the DSA CPU port. To user space,
> the switch is just a bunch of Linux interfaces. Everything to do with
> the CPU port is hidden away in the DSA core layer, the DSA drivers,
> and a little bit in the bridge.

Understood, but as far as I understand, in DSA you still have the br0 interface,
which kind-of represent the traffic going to the CPU (like in pure SW bridge,
and SwitchDev offloaded SW-bridge).

/Allan


^ permalink raw reply

* Re: [PATCH bpf 2/2] selftests/bpf: tests for jmp to 1st insn
From: Song Liu @ 2019-07-31 19:38 UTC (permalink / raw)
  To: Alexei Starovoitov
  Cc: David S. Miller, daniel@iogearbox.net, netdev@vger.kernel.org,
	bpf@vger.kernel.org, Kernel Team
In-Reply-To: <20190731013827.2445262-3-ast@kernel.org>



> On Jul 30, 2019, at 6:38 PM, Alexei Starovoitov <ast@kernel.org> wrote:
> 
> Add 2 tests that check JIT code generation to jumps to 1st insn.
> 1st test is similar to syzbot reproducer.
> The backwards branch is never taken at runtime.
> 2nd test has branch to 1st insn that executes.
> The test is written as two bpf functions, since it's not possible
> to construct valid single bpf program that jumps to 1st insn.
> 
> Signed-off-by: Alexei Starovoitov <ast@kernel.org>

Acked-by: Song Liu <songliubraving@fb.com>

> ---
> tools/testing/selftests/bpf/verifier/loops1.c | 28 +++++++++++++++++++
> 1 file changed, 28 insertions(+)
> 
> diff --git a/tools/testing/selftests/bpf/verifier/loops1.c b/tools/testing/selftests/bpf/verifier/loops1.c
> index 5e980a5ab69d..1fc4e61e9f9f 100644
> --- a/tools/testing/selftests/bpf/verifier/loops1.c
> +++ b/tools/testing/selftests/bpf/verifier/loops1.c
> @@ -159,3 +159,31 @@
> 	.errstr = "loop detected",
> 	.prog_type = BPF_PROG_TYPE_TRACEPOINT,
> },
> +{
> +	"not-taken loop with back jump to 1st insn",
> +	.insns = {
> +	BPF_MOV64_IMM(BPF_REG_0, 123),
> +	BPF_JMP_IMM(BPF_JEQ, BPF_REG_0, 4, -2),
> +	BPF_EXIT_INSN(),
> +	},
> +	.result = ACCEPT,
> +	.prog_type = BPF_PROG_TYPE_XDP,
> +	.retval = 123,
> +},
> +{
> +	"taken loop with back jump to 1st insn",
> +	.insns = {
> +	BPF_MOV64_IMM(BPF_REG_1, 10),
> +	BPF_MOV64_IMM(BPF_REG_2, 0),
> +	BPF_RAW_INSN(BPF_JMP | BPF_CALL, 0, 1, 0, 1),
> +	BPF_EXIT_INSN(),
> +	BPF_ALU64_REG(BPF_ADD, BPF_REG_2, BPF_REG_1),
> +	BPF_ALU64_IMM(BPF_SUB, BPF_REG_1, 1),
> +	BPF_JMP_IMM(BPF_JNE, BPF_REG_1, 0, -3),
> +	BPF_MOV64_REG(BPF_REG_0, BPF_REG_2),
> +	BPF_EXIT_INSN(),
> +	},
> +	.result = ACCEPT,
> +	.prog_type = BPF_PROG_TYPE_XDP,
> +	.retval = 55,
> +},
> -- 
> 2.20.0
> 


^ permalink raw reply

* Re: [patch net-next v2 1/3] net: devlink: allow to change namespaces
From: David Ahern @ 2019-07-31 19:41 UTC (permalink / raw)
  To: Jiri Pirko, Jakub Kicinski; +Cc: netdev, davem, sthemmin, mlxsw
In-Reply-To: <20190731192627.GB2324@nanopsycho>

On 7/31/19 1:26 PM, Jiri Pirko wrote:
> Wed, Jul 31, 2019 at 12:39:52AM CEST, jakub.kicinski@netronome.com wrote:
>> On Tue, 30 Jul 2019 10:57:32 +0200, Jiri Pirko wrote:
>>> From: Jiri Pirko <jiri@mellanox.com>
>>>
>>> All devlink instances are created in init_net and stay there for a
>>> lifetime. Allow user to be able to move devlink instances into
>>> namespaces.
>>>
>>> Signed-off-by: Jiri Pirko <jiri@mellanox.com>
>>
>> I'm no namespace expert, but seems reasonable, so FWIW:
>>
>> Acked-by: Jakub Kicinski <jakub.kicinski@netronome.com>
>>
>> If I read things right we will only send the devlink instance
>> notification to other namespaces when it moves, but not
>> notifications for sub-objects like ports. Is the expectation 
>> that the user space dumps the objects it cares about or will
>> the other notifications be added as needed (or option 3 - I 
>> misread the code).
> 
> You are correct. I plan to take care of the notifications of all objects
> in the follow-up patchset. But I can do it in this one if needed. Up to
> you.
> 

seems like it should be a part of this one. If a devlink instance
changes namespaces, all of its associated resources should as well.

Also, seems like you are missing a 'can this devlink instance be moved'
check. e.g., what happens if a resource controller has been configured
for the devlink instance and it is moved to a namespace whose existing
config exceeds those limits?

^ permalink raw reply

* Re: [PATCH] net: usb: pegasus: fix improper read if get_registers() fail
From: Petko Manolov @ 2019-07-31 19:14 UTC (permalink / raw)
  To: Denis Kirjanov; +Cc: davem, netdev
In-Reply-To: <20190731191039.gip2sttd2og2olx6@carbon>

On 19-07-31 22:10:39, Petko Manolov wrote:
> On 19-07-30 15:13:57, Denis Kirjanov wrote:
> > get_registers() may fail with -ENOMEM and in this
> > case we can read a garbage from the status variable tmp.
> > 
> > Reported-by: syzbot+3499a83b2d062ae409d4@syzkaller.appspotmail.com
> > Signed-off-by: Denis Kirjanov <kda@linux-powerpc.org>
> > ---
> >  drivers/net/usb/pegasus.c | 2 +-
> >  1 file changed, 1 insertion(+), 1 deletion(-)
> > 
> > diff --git a/drivers/net/usb/pegasus.c b/drivers/net/usb/pegasus.c
> > index 6d25dea5ad4b..f7d117d80cfb 100644
> > --- a/drivers/net/usb/pegasus.c
> > +++ b/drivers/net/usb/pegasus.c
> > @@ -282,7 +282,7 @@ static void mdio_write(struct net_device *dev, int phy_id, int loc, int val)
> >  static int read_eprom_word(pegasus_t *pegasus, __u8 index, __u16 *retdata)
> >  {
> >  	int i;
> > -	__u8 tmp;
> > +	__u8 tmp = 0;
> >  	__le16 retdatai;
> >  	int ret;
> 
> Unfortunately this patch does not fix anything.  Even if get_registers() fail 
> with -ENOMEM the "for" loop will cover for it and will exit only if the 
> operation was successful or the device got disconnected.  Please read the code 
> carefully.
> 
> So while the patch is harmless it isn't solving a problem.

Actually i am wrong - if "tmp" contains a random value it may accidentally have 
the EPROM_DONE bit set.

Dave, please apply the patch.


thanks,
Petko

^ permalink raw reply

* Re: [PATCH] net: usb: pegasus: fix improper read if get_registers() fail
From: Petko Manolov @ 2019-07-31 19:10 UTC (permalink / raw)
  To: Denis Kirjanov; +Cc: davem, netdev
In-Reply-To: <20190730131357.30697-1-dkirjanov@suse.com>

On 19-07-30 15:13:57, Denis Kirjanov wrote:
> get_registers() may fail with -ENOMEM and in this
> case we can read a garbage from the status variable tmp.
> 
> Reported-by: syzbot+3499a83b2d062ae409d4@syzkaller.appspotmail.com
> Signed-off-by: Denis Kirjanov <kda@linux-powerpc.org>
> ---
>  drivers/net/usb/pegasus.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/drivers/net/usb/pegasus.c b/drivers/net/usb/pegasus.c
> index 6d25dea5ad4b..f7d117d80cfb 100644
> --- a/drivers/net/usb/pegasus.c
> +++ b/drivers/net/usb/pegasus.c
> @@ -282,7 +282,7 @@ static void mdio_write(struct net_device *dev, int phy_id, int loc, int val)
>  static int read_eprom_word(pegasus_t *pegasus, __u8 index, __u16 *retdata)
>  {
>  	int i;
> -	__u8 tmp;
> +	__u8 tmp = 0;
>  	__le16 retdatai;
>  	int ret;

Unfortunately this patch does not fix anything.  Even if get_registers() fail 
with -ENOMEM the "for" loop will cover for it and will exit only if the 
operation was successful or the device got disconnected.  Please read the code 
carefully.

So while the patch is harmless it isn't solving a problem.


cheers,
Petko

^ permalink raw reply

* Re: [patch net-next v2 1/3] net: devlink: allow to change namespaces
From: Jiri Pirko @ 2019-07-31 19:45 UTC (permalink / raw)
  To: David Ahern; +Cc: Jakub Kicinski, netdev, davem, sthemmin, mlxsw
In-Reply-To: <c4f83be2-adee-1595-f241-de4c26ea55ca@gmail.com>

Wed, Jul 31, 2019 at 09:41:10PM CEST, dsahern@gmail.com wrote:
>On 7/31/19 1:26 PM, Jiri Pirko wrote:
>> Wed, Jul 31, 2019 at 12:39:52AM CEST, jakub.kicinski@netronome.com wrote:
>>> On Tue, 30 Jul 2019 10:57:32 +0200, Jiri Pirko wrote:
>>>> From: Jiri Pirko <jiri@mellanox.com>
>>>>
>>>> All devlink instances are created in init_net and stay there for a
>>>> lifetime. Allow user to be able to move devlink instances into
>>>> namespaces.
>>>>
>>>> Signed-off-by: Jiri Pirko <jiri@mellanox.com>
>>>
>>> I'm no namespace expert, but seems reasonable, so FWIW:
>>>
>>> Acked-by: Jakub Kicinski <jakub.kicinski@netronome.com>
>>>
>>> If I read things right we will only send the devlink instance
>>> notification to other namespaces when it moves, but not
>>> notifications for sub-objects like ports. Is the expectation 
>>> that the user space dumps the objects it cares about or will
>>> the other notifications be added as needed (or option 3 - I 
>>> misread the code).
>> 
>> You are correct. I plan to take care of the notifications of all objects
>> in the follow-up patchset. But I can do it in this one if needed. Up to
>> you.
>> 
>
>seems like it should be a part of this one. If a devlink instance
>changes namespaces, all of its associated resources should as well.

Okay. Will do.


>
>Also, seems like you are missing a 'can this devlink instance be moved'

I don't see why not.


>check. e.g., what happens if a resource controller has been configured
>for the devlink instance and it is moved to a namespace whose existing
>config exceeds those limits?

It's moved with all the values. The whole instance is moved.


^ permalink raw reply

* Re: [patch net-next v2 1/3] net: devlink: allow to change namespaces
From: David Ahern @ 2019-07-31 19:46 UTC (permalink / raw)
  To: Jiri Pirko; +Cc: Jakub Kicinski, netdev, davem, sthemmin, mlxsw
In-Reply-To: <20190731194502.GC2324@nanopsycho>

On 7/31/19 1:45 PM, Jiri Pirko wrote:
>> check. e.g., what happens if a resource controller has been configured
>> for the devlink instance and it is moved to a namespace whose existing
>> config exceeds those limits?
> 
> It's moved with all the values. The whole instance is moved.
> 

The values are moved, but the FIB in a namespace could already contain
more routes than the devlink instance allows.


^ permalink raw reply

* [PATCH 00/14] ARM: move lpc32xx and dove to multiplatform
From: Arnd Bergmann @ 2019-07-31 19:56 UTC (permalink / raw)
  To: soc, linux-arm-kernel, Vladimir Zapolskiy, Sylvain Lemieux,
	Russell King, Gregory Clement, Linus Walleij
  Cc: Jason Cooper, Andrew Lunn, Sebastian Hesselbarth, David S. Miller,
	Greg Kroah-Hartman, Alan Stern, Guenter Roeck, linux-gpio, netdev,
	linux-serial, linux-usb, linux-watchdog, Arnd Bergmann

I revisited some older patches here, getting two of the remaining
ARM platforms to build with ARCH_MULTIPLATFORM like most others do.

In case of lpc32xx, I created a new set of patches, which seemed
easier than digging out what I did for an older release many
years ago.

For dove, the patches are basically what I had proposed back in
2015 when all other ARMv6/ARMv7 machines became part of a single
kernel build. I don't know what the state is mach-dove support is,
compared to the DT based support in mach-mvebu for the same
hardware. If they are functionally the same, we could also just
remove mach-dove rather than applying my patches.

I also created patches to remove the w90x900 and ks8695 platforms
that seem to have lost their last users a few years ago.
I will post them separately, but plan to apply them in the same
branch for linux-5.4 if there are no objections.

      Arnd

Arnd Bergmann (14):
  usb: ohci-nxp: enable compile-testing
  usb: udc: lpc32xx: allow compile-testing
  watchdog: pnx4008_wdt: allow compile-testing
  serial: lpc32xx_hs: allow compile-testing
  gpio: lpc32xx: allow building on non-lpc32xx targets
  net: lpc-enet: factor out iram access
  net: lpc-enet: move phy setup into platform code
  net: lpc-enet: allow compile testing
  serial: lpc32xx: allow compile testing
  ARM: lpc32xx: clean up header files
  ARM: lpc32xx: allow multiplatform build
  ARM: dove: clean up mach/*.h headers
  ARM: orion/mvebu: unify debug-ll virtual addresses
  ARM: dove: multiplatform support

 arch/arm/Kconfig                              | 33 +---------
 arch/arm/Kconfig.debug                        |  5 +-
 arch/arm/configs/dove_defconfig               |  2 +
 arch/arm/configs/lpc32xx_defconfig            |  1 +
 arch/arm/mach-dove/Kconfig                    | 16 +++--
 arch/arm/mach-dove/Makefile                   |  2 +
 .../{include/mach => }/bridge-regs.h          |  4 +-
 arch/arm/mach-dove/cm-a510.c                  |  3 +-
 arch/arm/mach-dove/common.c                   |  4 +-
 arch/arm/mach-dove/dove-db-setup.c            |  2 +-
 arch/arm/mach-dove/{include/mach => }/dove.h  | 14 ++---
 arch/arm/mach-dove/include/mach/hardware.h    | 19 ------
 arch/arm/mach-dove/include/mach/uncompress.h  | 36 -----------
 arch/arm/mach-dove/irq.c                      |  5 +-
 arch/arm/mach-dove/{include/mach => }/irqs.h  |  2 -
 arch/arm/mach-dove/mpp.c                      |  2 +-
 arch/arm/mach-dove/pcie.c                     |  4 +-
 arch/arm/mach-dove/{include/mach => }/pm.h    |  4 +-
 arch/arm/mach-lpc32xx/Kconfig                 | 11 ++++
 arch/arm/mach-lpc32xx/common.c                | 24 +++++--
 arch/arm/mach-lpc32xx/common.h                |  1 -
 arch/arm/mach-lpc32xx/include/mach/board.h    | 15 -----
 .../mach-lpc32xx/include/mach/entry-macro.S   | 28 ---------
 arch/arm/mach-lpc32xx/include/mach/hardware.h | 25 --------
 .../mach-lpc32xx/include/mach/uncompress.h    | 50 ---------------
 .../{include/mach/platform.h => lpc32xx.h}    | 18 +++++-
 arch/arm/mach-lpc32xx/pm.c                    |  3 +-
 arch/arm/mach-lpc32xx/serial.c                | 33 +++++++++-
 arch/arm/mach-lpc32xx/suspend.S               |  3 +-
 arch/arm/mach-mv78xx0/mv78xx0.h               |  4 +-
 arch/arm/mach-orion5x/orion5x.h               |  4 +-
 drivers/gpio/Kconfig                          |  8 +++
 drivers/gpio/Makefile                         |  2 +-
 drivers/gpio/gpio-lpc32xx.c                   | 63 ++++++++++++-------
 drivers/net/ethernet/nxp/Kconfig              |  2 +-
 drivers/net/ethernet/nxp/lpc_eth.c            | 30 +++------
 drivers/tty/serial/Kconfig                    |  3 +-
 drivers/tty/serial/lpc32xx_hs.c               | 37 ++---------
 drivers/usb/gadget/udc/Kconfig                |  3 +-
 drivers/usb/gadget/udc/lpc32xx_udc.c          |  2 -
 drivers/usb/host/Kconfig                      |  3 +-
 drivers/usb/host/ohci-nxp.c                   | 25 +++++---
 drivers/watchdog/Kconfig                      |  2 +-
 drivers/watchdog/pnx4008_wdt.c                |  1 -
 include/linux/soc/nxp/lpc32xx-misc.h          | 33 ++++++++++
 45 files changed, 246 insertions(+), 345 deletions(-)
 rename arch/arm/mach-dove/{include/mach => }/bridge-regs.h (96%)
 rename arch/arm/mach-dove/{include/mach => }/dove.h (95%)
 delete mode 100644 arch/arm/mach-dove/include/mach/hardware.h
 delete mode 100644 arch/arm/mach-dove/include/mach/uncompress.h
 rename arch/arm/mach-dove/{include/mach => }/irqs.h (98%)
 rename arch/arm/mach-dove/{include/mach => }/pm.h (97%)
 create mode 100644 arch/arm/mach-lpc32xx/Kconfig
 delete mode 100644 arch/arm/mach-lpc32xx/include/mach/board.h
 delete mode 100644 arch/arm/mach-lpc32xx/include/mach/entry-macro.S
 delete mode 100644 arch/arm/mach-lpc32xx/include/mach/hardware.h
 delete mode 100644 arch/arm/mach-lpc32xx/include/mach/uncompress.h
 rename arch/arm/mach-lpc32xx/{include/mach/platform.h => lpc32xx.h} (98%)
 create mode 100644 include/linux/soc/nxp/lpc32xx-misc.h

-- 
2.20.0


^ permalink raw reply

* Re: [patch net-next v2 1/3] net: devlink: allow to change namespaces
From: David Ahern @ 2019-07-31 19:58 UTC (permalink / raw)
  To: Jiri Pirko; +Cc: Jakub Kicinski, netdev, davem, sthemmin, mlxsw
In-Reply-To: <087f584d-06c5-f4b9-722b-ccb72ce0e5de@gmail.com>

On 7/31/19 1:46 PM, David Ahern wrote:
> On 7/31/19 1:45 PM, Jiri Pirko wrote:
>>> check. e.g., what happens if a resource controller has been configured
>>> for the devlink instance and it is moved to a namespace whose existing
>>> config exceeds those limits?
>>
>> It's moved with all the values. The whole instance is moved.
>>
> 
> The values are moved, but the FIB in a namespace could already contain
> more routes than the devlink instance allows.
> 

From a quick test your recent refactoring to netdevsim broke the
resource controller. It was, and is intended to be, per network namespace.

^ permalink raw reply

* [PATCH 01/14] usb: ohci-nxp: enable compile-testing
From: Arnd Bergmann @ 2019-07-31 19:56 UTC (permalink / raw)
  To: soc, linux-arm-kernel, Vladimir Zapolskiy, Sylvain Lemieux,
	Russell King, Gregory Clement, Linus Walleij, Greg Kroah-Hartman,
	Alan Stern
  Cc: Jason Cooper, Andrew Lunn, Sebastian Hesselbarth, David S. Miller,
	Guenter Roeck, linux-gpio, netdev, linux-serial, linux-usb,
	linux-watchdog, Arnd Bergmann, linux-kernel
In-Reply-To: <20190731195713.3150463-1-arnd@arndb.de>

The driver hardcodes a hardware I/O address the way one should
generally not do, and this prevents both compile-testing, and
moving the platform to CONFIG_ARCH_MULTIPLATFORM.

Change the code to be independent of the machine headers
to allow those two. Removing the hardcoded address would
be hard and is not necessary, so leave that in place for now.

Signed-off-by: Arnd Bergmann <arnd@arndb.de>
---
 drivers/usb/host/Kconfig    |  3 ++-
 drivers/usb/host/ohci-nxp.c | 25 ++++++++++++++++++-------
 2 files changed, 20 insertions(+), 8 deletions(-)

diff --git a/drivers/usb/host/Kconfig b/drivers/usb/host/Kconfig
index 40b5de597112..73d233d3bf4d 100644
--- a/drivers/usb/host/Kconfig
+++ b/drivers/usb/host/Kconfig
@@ -441,7 +441,8 @@ config USB_OHCI_HCD_S3C2410
 
 config USB_OHCI_HCD_LPC32XX
 	tristate "Support for LPC on-chip OHCI USB controller"
-	depends on USB_OHCI_HCD && ARCH_LPC32XX
+	depends on USB_OHCI_HCD
+	depends on ARCH_LPC32XX || COMPILE_TEST
 	depends on USB_ISP1301
 	default y
 	---help---
diff --git a/drivers/usb/host/ohci-nxp.c b/drivers/usb/host/ohci-nxp.c
index f5f532601092..c561881d0e79 100644
--- a/drivers/usb/host/ohci-nxp.c
+++ b/drivers/usb/host/ohci-nxp.c
@@ -29,10 +29,7 @@
 
 #include "ohci.h"
 
-#include <mach/hardware.h>
-
 #define USB_CONFIG_BASE		0x31020000
-#define USB_OTG_STAT_CONTROL	IO_ADDRESS(USB_CONFIG_BASE + 0x110)
 
 /* USB_OTG_STAT_CONTROL bit defines */
 #define TRANSPARENT_I2C_EN	(1 << 7)
@@ -122,19 +119,33 @@ static inline void isp1301_vbus_off(void)
 
 static void ohci_nxp_start_hc(void)
 {
-	unsigned long tmp = __raw_readl(USB_OTG_STAT_CONTROL) | HOST_EN;
+	void __iomem *usb_otg_stat_control = ioremap(USB_CONFIG_BASE + 0x110, 4);
+	unsigned long tmp;
+
+	if (WARN_ON(!usb_otg_stat_control))
+		return;
+
+	tmp = __raw_readl(usb_otg_stat_control) | HOST_EN;
 
-	__raw_writel(tmp, USB_OTG_STAT_CONTROL);
+	__raw_writel(tmp, usb_otg_stat_control);
 	isp1301_vbus_on();
+
+	iounmap(usb_otg_stat_control);
 }
 
 static void ohci_nxp_stop_hc(void)
 {
+	void __iomem *usb_otg_stat_control = ioremap(USB_CONFIG_BASE + 0x110, 4);
 	unsigned long tmp;
 
+	if (WARN_ON(!usb_otg_stat_control))
+		return;
+
 	isp1301_vbus_off();
-	tmp = __raw_readl(USB_OTG_STAT_CONTROL) & ~HOST_EN;
-	__raw_writel(tmp, USB_OTG_STAT_CONTROL);
+	tmp = __raw_readl(usb_otg_stat_control) & ~HOST_EN;
+	__raw_writel(tmp, usb_otg_stat_control);
+
+	iounmap(usb_otg_stat_control);
 }
 
 static int ohci_hcd_nxp_probe(struct platform_device *pdev)
-- 
2.20.0


^ permalink raw reply related

page: next (older) | prev (newer) | latest
- recent:[subjects (threaded)|topics (new)|topics (active)]

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox