Netdev List

Netdev List
 help / color / mirror / Atom feed

* Re: [PATCH bpf-next 1/3] bpf: support cloning sk storage on accept()
From: Martin Lau @ 2019-08-08  7:11 UTC (permalink / raw)
  To: Stanislav Fomichev
  Cc: Yonghong Song, Stanislav Fomichev, netdev@vger.kernel.org,
	bpf@vger.kernel.org, davem@davemloft.net, ast@kernel.org,
	daniel@iogearbox.net
In-Reply-To: <20190808000533.GA2820@mini-arch>

On Wed, Aug 07, 2019 at 05:05:33PM -0700, Stanislav Fomichev wrote:
[ ... ]
> > > +int bpf_sk_storage_clone(const struct sock *sk, struct sock *newsk)
> > > +{
> > > +	struct bpf_sk_storage *new_sk_storage = NULL;
> > > +	struct bpf_sk_storage *sk_storage;
> > > +	struct bpf_sk_storage_elem *selem;
> > > +	int ret;
> > > +
> > > +	RCU_INIT_POINTER(newsk->sk_bpf_storage, NULL);
> > > +
> > > +	rcu_read_lock();
> > > +	sk_storage = rcu_dereference(sk->sk_bpf_storage);
> > > +
> > > +	if (!sk_storage || hlist_empty(&sk_storage->list))
> > > +		goto out;
> > > +
> > > +	hlist_for_each_entry_rcu(selem, &sk_storage->list, snode) {
> > > +		struct bpf_sk_storage_map *smap;
> > > +		struct bpf_sk_storage_elem *copy_selem;
> > > +
> > > +		if (!selem->clone)
> > > +			continue;
> > > +
> > > +		smap = rcu_dereference(SDATA(selem)->smap);
> > > +		if (!smap)
> > > +			continue;
> > > +
> > > +		copy_selem = bpf_sk_storage_clone_elem(newsk, smap, selem);
> > > +		if (IS_ERR(copy_selem)) {
> > > +			ret = PTR_ERR(copy_selem);
> > > +			goto err;
> > > +		}
> > > +
> > > +		if (!new_sk_storage) {
> > > +			ret = sk_storage_alloc(newsk, smap, copy_selem);
> > > +			if (ret) {
> > > +				kfree(copy_selem);
> > > +				atomic_sub(smap->elem_size,
> > > +					   &newsk->sk_omem_alloc);
> > > +				goto err;
> > > +			}
> > > +
> > > +			new_sk_storage = rcu_dereference(copy_selem->sk_storage);
> > > +			continue;
> > > +		}
> > > +
> > > +		raw_spin_lock_bh(&new_sk_storage->lock);
> > > +		selem_link_map(smap, copy_selem);
> > > +		__selem_link_sk(new_sk_storage, copy_selem);
> > > +		raw_spin_unlock_bh(&new_sk_storage->lock);
> > 
> > Considering in this particular case, new socket is not visible to 
> > outside world yet (both kernel and user space), map_delete/map_update
> > operations are not applicable in this situation, so
> > the above raw_spin_lock_bh() probably not needed.
> I agree, it's doing nothing, but __selem_link_sk has the following comment:
> /* sk_storage->lock must be held and sk_storage->list cannot be empty */
I believe I may have forgotten to remove this comment after reusing it
in sk_storage_alloc() which also has a similar no-lock-needed situation
like here.

I would also prefer not to acquire the new_sk_storage->lock here to avoid
confusion when investigating racing bugs in the future.

> 
> Just wanted to keep that invariant for this call site as well (in case
> we add some lockdep enforcement or smth else). WDYT?
> 
> > > +	}
> > > +
> > > +out:
> > > +	rcu_read_unlock();
> > > +	return 0;
> > > +
> > > +err:
> > > +	rcu_read_unlock();
> > > +
> > > +	bpf_sk_storage_free(newsk);
> > > +	return ret;
> > > +}
> > > +
> > >   BPF_CALL_4(bpf_sk_storage_get, struct bpf_map *, map, struct sock *, sk,
> > >   	   void *, value, u64, flags)
> > >   {
> > >   	struct bpf_sk_storage_data *sdata;
> > >   
> > > -	if (flags > BPF_SK_STORAGE_GET_F_CREATE)
> > > +	if (flags & ~BPF_SK_STORAGE_GET_F_MASK)
> > > +		return (unsigned long)NULL;
> > > +
> > > +	if ((flags & BPF_SK_STORAGE_GET_F_CLONE) &&
> > > +	    !(flags & BPF_SK_STORAGE_GET_F_CREATE))
> > >   		return (unsigned long)NULL;
> > >   
> > >   	sdata = sk_storage_lookup(sk, map, true);
> > >   	if (sdata)
> > >   		return (unsigned long)sdata->data;
> > >   
> > > -	if (flags == BPF_SK_STORAGE_GET_F_CREATE &&
> > > +	if ((flags & BPF_SK_STORAGE_GET_F_CREATE) &&
> > >   	    /* Cannot add new elem to a going away sk.
> > >   	     * Otherwise, the new elem may become a leak
> > >   	     * (and also other memory issues during map
> > > @@ -762,6 +853,9 @@ BPF_CALL_4(bpf_sk_storage_get, struct bpf_map *, map, struct sock *, sk,
> > >   		/* sk must be a fullsock (guaranteed by verifier),
> > >   		 * so sock_gen_put() is unnecessary.
> > >   		 */
> > > +		if (!IS_ERR(sdata))
> > > +			SELEM(sdata)->clone =
> > > +				!!(flags & BPF_SK_STORAGE_GET_F_CLONE);
> > >   		sock_put(sk);
> > >   		return IS_ERR(sdata) ?
> > >   			(unsigned long)NULL : (unsigned long)sdata->data;
> > > diff --git a/net/core/sock.c b/net/core/sock.c
> > > index d57b0cc995a0..f5e801a9cea4 100644
> > > --- a/net/core/sock.c
> > > +++ b/net/core/sock.c
> > > @@ -1851,9 +1851,12 @@ struct sock *sk_clone_lock(const struct sock *sk, const gfp_t priority)
> > >   			goto out;
> > >   		}
> > >   		RCU_INIT_POINTER(newsk->sk_reuseport_cb, NULL);
> > > -#ifdef CONFIG_BPF_SYSCALL
> > > -		RCU_INIT_POINTER(newsk->sk_bpf_storage, NULL);
> > > -#endif
> > > +
> > > +		if (bpf_sk_storage_clone(sk, newsk)) {
> > > +			sk_free_unlock_clone(newsk);
> > > +			newsk = NULL;
> > > +			goto out;
> > > +		}
> > >   
> > >   		newsk->sk_err	   = 0;
> > >   		newsk->sk_err_soft = 0;
> > > 

^ permalink raw reply

* Re: [PATCH v2] Fix non-kerneldoc comment in realtek/rtlwifi/usb.c
From: Kalle Valo @ 2019-08-08  7:10 UTC (permalink / raw)
  To: Larry Finger
  Cc: Valdis Klētnieks, Ping-Ke Shih, David S. Miller,
	linux-wireless, netdev, linux-kernel
In-Reply-To: <15df2564-8815-f351-8fb2-b46611a90234@lwfinger.net>

Larry Finger <Larry.Finger@lwfinger.net> writes:

> On 8/7/19 8:51 PM, Valdis Klētnieks wrote:
>> Fix spurious warning message when building with W=1:
>>
>>    CC [M]  drivers/net/wireless/realtek/rtlwifi/usb.o
>> drivers/net/wireless/realtek/rtlwifi/usb.c:243: warning: Cannot understand  * on line 243 - I thought it was a doc line
>> drivers/net/wireless/realtek/rtlwifi/usb.c:760: warning: Cannot understand  * on line 760 - I thought it was a doc line
>> drivers/net/wireless/realtek/rtlwifi/usb.c:790: warning: Cannot understand  * on line 790 - I thought it was a doc line
>>
>> Clean up the comment format.
>>
>> Signed-off-by: Valdis Kletnieks <valdis.kletnieks@vt.edu>
>>
>> ---
>> Changes since v1:  Larry Finger pointed out the patch wasn't checkpatch-clean.
>>
>> diff --git a/drivers/net/wireless/realtek/rtlwifi/usb.c b/drivers/net/wireless/realtek/rtlwifi/usb.c
>> index 34d68dbf4b4c..4b59f3b46b28 100644
>> --- a/drivers/net/wireless/realtek/rtlwifi/usb.c
>> +++ b/drivers/net/wireless/realtek/rtlwifi/usb.c
>> @@ -239,10 +239,7 @@ static void _rtl_usb_io_handler_release(struct ieee80211_hw *hw)
>>   	mutex_destroy(&rtlpriv->io.bb_mutex);
>>   }
>>   -/**
>> - *
>> - *	Default aggregation handler. Do nothing and just return the oldest skb.
>> - */
>> +/*	Default aggregation handler. Do nothing and just return the oldest skb.  */
>>   static struct sk_buff *_none_usb_tx_aggregate_hdl(struct ieee80211_hw *hw,
>>   						  struct sk_buff_head *list)
>>   {
>> @@ -756,11 +753,6 @@ static int rtl_usb_start(struct ieee80211_hw *hw)
>>   	return err;
>>   }
>>   -/**
>> - *
>> - *
>> - */
>> -
>>   /*=======================  tx =========================================*/
>>   static void rtl_usb_cleanup(struct ieee80211_hw *hw)
>>   {
>> @@ -786,11 +778,7 @@ static void rtl_usb_cleanup(struct ieee80211_hw *hw)
>>   	usb_kill_anchored_urbs(&rtlusb->tx_submitted);
>>   }
>>   -/**
>> - *
>> - * We may add some struct into struct rtl_usb later. Do deinit here.
>> - *
>> - */
>> +/* We may add some struct into struct rtl_usb later. Do deinit here.  */
>>   static void rtl_usb_deinit(struct ieee80211_hw *hw)
>>   {
>>   	rtl_usb_cleanup(hw);
>
> I missed that the subject line should be "rtwifi: Fix ....". Otherwise it is OK.

I can fix the subject during commit.

-- 
https://wireless.wiki.kernel.org/en/developers/documentation/submittingpatches

^ permalink raw reply

* Re: [bpf-next PATCH 2/3] samples/bpf: make xdp_fwd more practically usable via devmap lookup
From: Jesper Dangaard Brouer @ 2019-08-08  6:51 UTC (permalink / raw)
  To: Y Song
  Cc: netdev, Daniel Borkmann, Alexei Starovoitov, Xdp, a.s.protopopov,
	David Ahern, brouer
In-Reply-To: <CAH3MdRUf_2Sk8v2dPeQ_+LfKPPwX9N3QoMDMCGFehd5JQVktcw@mail.gmail.com>

On Wed, 7 Aug 2019 11:04:17 -0700
Y Song <ys114321@gmail.com> wrote:

> On Wed, Aug 7, 2019 at 5:37 AM Jesper Dangaard Brouer <brouer@redhat.com> wrote:
> >
> > This address the TODO in samples/bpf/xdp_fwd_kern.c, which points out
> > that the chosen egress index should be checked for existence in the
> > devmap. This can now be done via taking advantage of Toke's work in
> > commit 0cdbb4b09a06 ("devmap: Allow map lookups from eBPF").
> >
> > This change makes xdp_fwd more practically usable, as this allows for
> > a mixed environment, where IP-forwarding fallback to network stack, if
> > the egress device isn't configured to use XDP.
> >
> > Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
> > ---
> >  samples/bpf/xdp_fwd_kern.c |   20 ++++++++++++++------
> >  samples/bpf/xdp_fwd_user.c |   36 +++++++++++++++++++++++++-----------
> >  2 files changed, 39 insertions(+), 17 deletions(-)
> >
> > diff --git a/samples/bpf/xdp_fwd_kern.c b/samples/bpf/xdp_fwd_kern.c
> > index e6ffc4ea06f4..4a5ad381ed2a 100644
> > --- a/samples/bpf/xdp_fwd_kern.c
> > +++ b/samples/bpf/xdp_fwd_kern.c
> > @@ -104,13 +104,21 @@ static __always_inline int xdp_fwd_flags(struct xdp_md *ctx, u32 flags)
> >
> >         rc = bpf_fib_lookup(ctx, &fib_params, sizeof(fib_params), flags);
> >
> > -       /* verify egress index has xdp support
> > -        * TO-DO bpf_map_lookup_elem(&tx_port, &key) fails with
> > -        *       cannot pass map_type 14 into func bpf_map_lookup_elem#1:
> > -        * NOTE: without verification that egress index supports XDP
> > -        *       forwarding packets are dropped.
> > -        */
> >         if (rc == 0) {
> > +               int *val;
> > +
> > +               /* Verify egress index has been configured as TX-port.
> > +                * (Note: User can still have inserted an egress ifindex that
> > +                * doesn't support XDP xmit, which will result in packet drops).
> > +                *
> > +                * Note: lookup in devmap supported since 0cdbb4b09a0.
> > +                * If not supported will fail with:
> > +                *  cannot pass map_type 14 into func bpf_map_lookup_elem#1:
> > +                */
> > +               val = bpf_map_lookup_elem(&tx_port, &fib_params.ifindex);  
> 
> It should be "xdp_tx_ports". Otherwise, you will have compilation errors.

Ups. This happened in my rebase, where I moved the rename patch [1/3]
before this one.  Thanks for catching this!

> > +               if (!val)
> > +                       return XDP_PASS;  
> 
> Also, maybe we can do
>          if (!bpf_map_lookup_elem(&tx_port, &fib_params.ifindex))
>             return XDP_PASS;
> so we do not need to define val at all.

I had it this way, because I also checked the contents (of the pointer
*val) to check if this was the correct ifindex, but I removed that
check again (as user side always insert correctly).  So, I guess I
could take your suggestion now.


> > +
> >                 if (h_proto == htons(ETH_P_IP))
> >                         ip_decrease_ttl(iph);
> >                 else if (h_proto == htons(ETH_P_IPV6))
> > diff --git a/samples/bpf/xdp_fwd_user.c b/samples/bpf/xdp_fwd_user.c
> > index ba012d9f93dd..20951bc27477 100644
> > --- a/samples/bpf/xdp_fwd_user.c
> > +++ b/samples/bpf/xdp_fwd_user.c
> > @@ -27,14 +27,20 @@
> >  #include "libbpf.h"
> >  #include <bpf/bpf.h>
> >
> > -
> > -static int do_attach(int idx, int fd, const char *name)
> > +static int do_attach(int idx, int prog_fd, int map_fd, const char
> > *name) {
> >         int err;
> >
> > -       err = bpf_set_link_xdp_fd(idx, fd, 0);
> > -       if (err < 0)
> > +       err = bpf_set_link_xdp_fd(idx, prog_fd, 0);
> > +       if (err < 0) {
> >                 printf("ERROR: failed to attach program to %s\n",
> > name);
> > +               return err;
> > +       }
> > +
> > +       /* Adding ifindex as a possible egress TX port */
> > +       err = bpf_map_update_elem(map_fd, &idx, &idx, 0);
> > +       if (err)
> > +               printf("ERROR: failed using device %s as
> > TX-port\n", name);
> >
> >         return err;
> >  }
> > @@ -47,6 +53,9 @@ static int do_detach(int idx, const char *name)
> >         if (err < 0)
> >                 printf("ERROR: failed to detach program from %s\n",
> > name);
> >
> > +       /* TODO: Remember to cleanup map, when adding use of shared
> > map
> > +        *  bpf_map_delete_elem((map_fd, &idx);
> > +        */
> >         return err;
> >  }
> >
> > @@ -67,10 +76,10 @@ int main(int argc, char **argv)
> >         };
> >         const char *prog_name = "xdp_fwd";
> >         struct bpf_program *prog;
> > +       int prog_fd, map_fd = -1;
> >         char filename[PATH_MAX];
> >         struct bpf_object *obj;
> >         int opt, i, idx, err;
> > -       int prog_fd, map_fd;
> >         int attach = 1;
> >         int ret = 0;
> >
> > @@ -103,8 +112,17 @@ int main(int argc, char **argv)
> >                         return 1;
> >                 }
> >
> > -               if (bpf_prog_load_xattr(&prog_load_attr, &obj,
> > &prog_fd))
> > +               err = bpf_prog_load_xattr(&prog_load_attr, &obj,
> > &prog_fd);
> > +               if (err) {
> > +                       if (err == -22) {  
> 
> -EINVAL?

Yes.

> For -EINVAL, many things could go wrong. But maybe the blow error
> is the most common one so I am fine with that.

Yes, it is rather sad, that we don't have better/more return codes,
such that we can react better to these.

E.g. if this was part of a open source project (external to the
kernel), I could have two XDP-BPF programs, one that use this feature
and one that don't.  If I had a more specific return code, then I could
load the other if the first failed.


> > +                               printf("Does kernel support devmap lookup?\n");
> > +                               /* If not, the error message will be:
> > +                                * "cannot pass map_type 14 into func
> > +                                * bpf_map_lookup_elem#1"
> > +                                */
> > +                       }
> >                         return 1;
> > +               }
> >
> >                 prog = bpf_object__find_program_by_title(obj, prog_name);
> >                 prog_fd = bpf_program__fd(prog);
> > @@ -119,10 +137,6 @@ int main(int argc, char **argv)
> >                         return 1;
> >                 }
> >         }
> > -       if (attach) {
> > -               for (i = 1; i < 64; ++i)
> > -                       bpf_map_update_elem(map_fd, &i, &i, 0);
> > -       }
> >
> >         for (i = optind; i < argc; ++i) {
> >                 idx = if_nametoindex(argv[i]);
> > @@ -138,7 +152,7 @@ int main(int argc, char **argv)
> >                         if (err)
> >                                 ret = err;
> >                 } else {
> > -                       err = do_attach(idx, prog_fd, argv[i]);
> > +                       err = do_attach(idx, prog_fd, map_fd, argv[i]);
> >                         if (err)
> >                                 ret = err;
> >                 }
> >  

I'll send a V2

-- 
Best regards,
  Jesper Dangaard Brouer
  MSc.CS, Principal Kernel Engineer at Red Hat
  LinkedIn: http://www.linkedin.com/in/brouer

^ permalink raw reply

* Re: [PATCH bpf-next 1/3] bpf: support cloning sk storage on accept()
From: Martin Lau @ 2019-08-08  6:39 UTC (permalink / raw)
  To: Stanislav Fomichev
  Cc: netdev@vger.kernel.org, bpf@vger.kernel.org, davem@davemloft.net,
	ast@kernel.org, daniel@iogearbox.net
In-Reply-To: <20190807154720.260577-2-sdf@google.com>

On Wed, Aug 07, 2019 at 08:47:18AM -0700, Stanislav Fomichev wrote:
> Add new helper bpf_sk_storage_clone which optionally clones sk storage
> and call it from bpf_sk_storage_clone. Reuse the gap in
> bpf_sk_storage_elem to store clone/non-clone flag.
> 
> Cc: Martin KaFai Lau <kafai@fb.com>
> Signed-off-by: Stanislav Fomichev <sdf@google.com>
> ---
>  include/net/bpf_sk_storage.h |  10 ++++
>  include/uapi/linux/bpf.h     |   1 +
>  net/core/bpf_sk_storage.c    | 102 +++++++++++++++++++++++++++++++++--
>  net/core/sock.c              |   9 ++--
>  4 files changed, 115 insertions(+), 7 deletions(-)
> 
> diff --git a/include/net/bpf_sk_storage.h b/include/net/bpf_sk_storage.h
> index b9dcb02e756b..8e4f831d2e52 100644
> --- a/include/net/bpf_sk_storage.h
> +++ b/include/net/bpf_sk_storage.h
> @@ -10,4 +10,14 @@ void bpf_sk_storage_free(struct sock *sk);
>  extern const struct bpf_func_proto bpf_sk_storage_get_proto;
>  extern const struct bpf_func_proto bpf_sk_storage_delete_proto;
>  
> +#ifdef CONFIG_BPF_SYSCALL
> +int bpf_sk_storage_clone(const struct sock *sk, struct sock *newsk);
> +#else
> +static inline int bpf_sk_storage_clone(const struct sock *sk,
> +				       struct sock *newsk)
> +{
> +	return 0;
> +}
> +#endif
> +
>  #endif /* _BPF_SK_STORAGE_H */
> diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
> index 4393bd4b2419..00459ca4c8cf 100644
> --- a/include/uapi/linux/bpf.h
> +++ b/include/uapi/linux/bpf.h
> @@ -2931,6 +2931,7 @@ enum bpf_func_id {
>  
>  /* BPF_FUNC_sk_storage_get flags */
>  #define BPF_SK_STORAGE_GET_F_CREATE	(1ULL << 0)
> +#define BPF_SK_STORAGE_GET_F_CLONE	(1ULL << 1)
It is only used in bpf_sk_storage_get().
What if the elem is created from bpf_fd_sk_storage_update_elem()
i.e. from the syscall API ?

What may be the use case for a map to have both CLONE and non-CLONE
elements?  If it is not the case, would it be better to add
BPF_F_CLONE to bpf_attr->map_flags?

>  
>  /* Mode for BPF_FUNC_skb_adjust_room helper. */
>  enum bpf_adj_room_mode {
> diff --git a/net/core/bpf_sk_storage.c b/net/core/bpf_sk_storage.c
> index 94c7f77ecb6b..b6dea67965bc 100644
> --- a/net/core/bpf_sk_storage.c
> +++ b/net/core/bpf_sk_storage.c
> @@ -12,6 +12,9 @@
>  
>  static atomic_t cache_idx;
>  
> +#define BPF_SK_STORAGE_GET_F_MASK	(BPF_SK_STORAGE_GET_F_CREATE | \
> +					 BPF_SK_STORAGE_GET_F_CLONE)
> +
>  struct bucket {
>  	struct hlist_head list;
>  	raw_spinlock_t lock;
> @@ -66,7 +69,8 @@ struct bpf_sk_storage_elem {
>  	struct hlist_node snode;	/* Linked to bpf_sk_storage */
>  	struct bpf_sk_storage __rcu *sk_storage;
>  	struct rcu_head rcu;
> -	/* 8 bytes hole */
> +	u8 clone:1;
> +	/* 7 bytes hole */
>  	/* The data is stored in aother cacheline to minimize
>  	 * the number of cachelines access during a cache hit.
>  	 */
> @@ -509,7 +513,7 @@ static int sk_storage_delete(struct sock *sk, struct bpf_map *map)
>  	return 0;
>  }
>  
> -/* Called by __sk_destruct() */
> +/* Called by __sk_destruct() & bpf_sk_storage_clone() */
>  void bpf_sk_storage_free(struct sock *sk)
>  {
>  	struct bpf_sk_storage_elem *selem;
> @@ -739,19 +743,106 @@ static int bpf_fd_sk_storage_delete_elem(struct bpf_map *map, void *key)
>  	return err;
>  }
>  
> +static struct bpf_sk_storage_elem *
> +bpf_sk_storage_clone_elem(struct sock *newsk,
> +			  struct bpf_sk_storage_map *smap,
> +			  struct bpf_sk_storage_elem *selem)
> +{
> +	struct bpf_sk_storage_elem *copy_selem;
> +
> +	copy_selem = selem_alloc(smap, newsk, NULL, true);
> +	if (!copy_selem)
> +		return ERR_PTR(-ENOMEM);
nit.
may be just return NULL as selem_alloc() does.

> +
> +	if (map_value_has_spin_lock(&smap->map))
> +		copy_map_value_locked(&smap->map, SDATA(copy_selem)->data,
> +				      SDATA(selem)->data, true);
> +	else
> +		copy_map_value(&smap->map, SDATA(copy_selem)->data,
> +			       SDATA(selem)->data);
> +
> +	return copy_selem;
> +}
> +
> +int bpf_sk_storage_clone(const struct sock *sk, struct sock *newsk)
> +{
> +	struct bpf_sk_storage *new_sk_storage = NULL;
> +	struct bpf_sk_storage *sk_storage;
> +	struct bpf_sk_storage_elem *selem;
> +	int ret;
> +
> +	RCU_INIT_POINTER(newsk->sk_bpf_storage, NULL);
> +
> +	rcu_read_lock();
> +	sk_storage = rcu_dereference(sk->sk_bpf_storage);
> +
> +	if (!sk_storage || hlist_empty(&sk_storage->list))
> +		goto out;
> +
> +	hlist_for_each_entry_rcu(selem, &sk_storage->list, snode) {
> +		struct bpf_sk_storage_map *smap;
> +		struct bpf_sk_storage_elem *copy_selem;
> +
> +		if (!selem->clone)
> +			continue;
> +
> +		smap = rcu_dereference(SDATA(selem)->smap);
> +		if (!smap)
smap should not be NULL.

> +			continue;
> +
> +		copy_selem = bpf_sk_storage_clone_elem(newsk, smap, selem);
> +		if (IS_ERR(copy_selem)) {
> +			ret = PTR_ERR(copy_selem);
> +			goto err;
> +		}
> +
> +		if (!new_sk_storage) {
> +			ret = sk_storage_alloc(newsk, smap, copy_selem);
> +			if (ret) {
> +				kfree(copy_selem);
> +				atomic_sub(smap->elem_size,
> +					   &newsk->sk_omem_alloc);
> +				goto err;
> +			}
> +
> +			new_sk_storage = rcu_dereference(copy_selem->sk_storage);
> +			continue;
> +		}
> +
> +		raw_spin_lock_bh(&new_sk_storage->lock);
> +		selem_link_map(smap, copy_selem);
Unlike the existing selem-update use-cases in bpf_sk_storage.c,
the smap->map.refcnt has not been held here.  Reading the smap
is fine.  However, adding a new selem to a deleting smap is an issue.
Hence, I think bpf_map_inc_not_zero() should be done first.

> +		__selem_link_sk(new_sk_storage, copy_selem);
> +		raw_spin_unlock_bh(&new_sk_storage->lock);
> +	}
> +
> +out:
> +	rcu_read_unlock();
> +	return 0;
> +
> +err:
> +	rcu_read_unlock();
> +
> +	bpf_sk_storage_free(newsk);
> +	return ret;
> +}
> +
>  BPF_CALL_4(bpf_sk_storage_get, struct bpf_map *, map, struct sock *, sk,
>  	   void *, value, u64, flags)
>  {
>  	struct bpf_sk_storage_data *sdata;
>  
> -	if (flags > BPF_SK_STORAGE_GET_F_CREATE)
> +	if (flags & ~BPF_SK_STORAGE_GET_F_MASK)
> +		return (unsigned long)NULL;
> +
> +	if ((flags & BPF_SK_STORAGE_GET_F_CLONE) &&
> +	    !(flags & BPF_SK_STORAGE_GET_F_CREATE))
>  		return (unsigned long)NULL;
>  
>  	sdata = sk_storage_lookup(sk, map, true);
>  	if (sdata)
>  		return (unsigned long)sdata->data;
>  
> -	if (flags == BPF_SK_STORAGE_GET_F_CREATE &&
> +	if ((flags & BPF_SK_STORAGE_GET_F_CREATE) &&
>  	    /* Cannot add new elem to a going away sk.
>  	     * Otherwise, the new elem may become a leak
>  	     * (and also other memory issues during map
> @@ -762,6 +853,9 @@ BPF_CALL_4(bpf_sk_storage_get, struct bpf_map *, map, struct sock *, sk,
>  		/* sk must be a fullsock (guaranteed by verifier),
>  		 * so sock_gen_put() is unnecessary.
>  		 */
> +		if (!IS_ERR(sdata))
> +			SELEM(sdata)->clone =
> +				!!(flags & BPF_SK_STORAGE_GET_F_CLONE);
>  		sock_put(sk);
>  		return IS_ERR(sdata) ?
>  			(unsigned long)NULL : (unsigned long)sdata->data;
> diff --git a/net/core/sock.c b/net/core/sock.c
> index d57b0cc995a0..f5e801a9cea4 100644
> --- a/net/core/sock.c
> +++ b/net/core/sock.c
> @@ -1851,9 +1851,12 @@ struct sock *sk_clone_lock(const struct sock *sk, const gfp_t priority)
>  			goto out;
>  		}
>  		RCU_INIT_POINTER(newsk->sk_reuseport_cb, NULL);
> -#ifdef CONFIG_BPF_SYSCALL
> -		RCU_INIT_POINTER(newsk->sk_bpf_storage, NULL);
> -#endif
> +
> +		if (bpf_sk_storage_clone(sk, newsk)) {
> +			sk_free_unlock_clone(newsk);
> +			newsk = NULL;
> +			goto out;
> +		}
>  
>  		newsk->sk_err	   = 0;
>  		newsk->sk_err_soft = 0;
> -- 
> 2.22.0.770.g0f2c4a37fd-goog
> 

^ permalink raw reply

* Re: [PATCH net-next] net: openvswitch: Set OvS recirc_id from tc chain index
From: Paul Blakey @ 2019-08-08  6:39 UTC (permalink / raw)
  To: Marcelo Ricardo Leitner
  Cc: netdev@vger.kernel.org, David S. Miller, Justin Pettit,
	Pravin B Shelar, Simon Horman, Vlad Buslov, Jiri Pirko, Roi Dayan,
	Yossi Kuperman, Rony Efraim, Oz Shlomo
In-Reply-To: <20190807150026.GE21609@localhost.localdomain>


On 8/7/2019 6:00 PM, Marcelo Ricardo Leitner wrote:
> On Wed, Aug 07, 2019 at 03:08:42PM +0300, Paul Blakey wrote:
>> Offloaded OvS datapath rules are translated one to one to tc rules,
>> for example the following simplified OvS rule:
>>
>> recirc_id(0),in_port(dev1),eth_type(0x0800),ct_state(-trk) actions:ct(),recirc(2)
>>
>> Will be translated to the following tc rule:
>>
>> $ tc filter add dev dev1 ingress \
>> 	    prio 1 chain 0 proto ip \
>> 		flower tcp ct_state -trk \
>> 		action ct pipe \
>> 		action goto chain 2
>>
>> Received packets will first travel though tc, and if they aren't stolen
>> by it, like in the above rule, they will continue to OvS datapath.
>> Since we already did some actions (action ct in this case) which might
>> modify the packets, and updated action stats, we would like to continue
>> the proccessing with the correct recirc_id in OvS (here recirc_id(2))
>> where we left off.
>>
>> To support this, introduce a new skb extension for tc, which
>> will be used for translating tc chain to ovs recirc_id to
>> handle these miss cases. Last tc chain index will be set
>> by tc goto chain action and read by OvS datapath.
>>
>> Signed-off-by: Paul Blakey <paulb@mellanox.com>
>> Signed-off-by: Vlad Buslov <vladbu@mellanox.com>
>> Acked-by: Jiri Pirko <jiri@mellanox.com>
> Reviewed-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com>

Thanks!


>> ---
>>   include/linux/skbuff.h    | 13 +++++++++++++
>>   include/net/sch_generic.h |  5 ++++-
>>   net/core/skbuff.c         |  6 ++++++
>>   net/openvswitch/flow.c    |  9 +++++++++
>>   net/sched/Kconfig         | 13 +++++++++++++
>>   net/sched/act_api.c       |  1 +
>>   net/sched/cls_api.c       | 12 ++++++++++++
>>   7 files changed, 58 insertions(+), 1 deletion(-)
>>
>> diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h
>> index 3aef8d8..fb2a792 100644
>> --- a/include/linux/skbuff.h
>> +++ b/include/linux/skbuff.h
>> @@ -279,6 +279,16 @@ struct nf_bridge_info {
>>   };
>>   #endif
>>   
>> +#if IS_ENABLED(CONFIG_NET_TC_SKB_EXT)
>> +/* Chain in tc_skb_ext will be used to share the tc chain with
>> + * ovs recirc_id. It will be set to the current chain by tc
>> + * and read by ovs to recirc_id.
>> + */
>> +struct tc_skb_ext {
>> +	__u32 chain;
>> +};
>> +#endif
>> +
>>   struct sk_buff_head {
>>   	/* These two members must be first. */
>>   	struct sk_buff	*next;
>> @@ -4050,6 +4060,9 @@ enum skb_ext_id {
>>   #ifdef CONFIG_XFRM
>>   	SKB_EXT_SEC_PATH,
>>   #endif
>> +#if IS_ENABLED(CONFIG_NET_TC_SKB_EXT)
>> +	TC_SKB_EXT,
>> +#endif
>>   	SKB_EXT_NUM, /* must be last */
>>   };
>>   
>> diff --git a/include/net/sch_generic.h b/include/net/sch_generic.h
>> index 6b6b012..871feea 100644
>> --- a/include/net/sch_generic.h
>> +++ b/include/net/sch_generic.h
>> @@ -275,7 +275,10 @@ struct tcf_result {
>>   			unsigned long	class;
>>   			u32		classid;
>>   		};
>> -		const struct tcf_proto *goto_tp;
>> +		struct {
>> +			const struct tcf_proto *goto_tp;
>> +			u32 goto_index;
>> +		};
>>   
>>   		/* used in the skb_tc_reinsert function */
>>   		struct {
>> diff --git a/net/core/skbuff.c b/net/core/skbuff.c
>> index ea8e8d3..2b40b5a 100644
>> --- a/net/core/skbuff.c
>> +++ b/net/core/skbuff.c
>> @@ -4087,6 +4087,9 @@ int skb_gro_receive(struct sk_buff *p, struct sk_buff *skb)
>>   #ifdef CONFIG_XFRM
>>   	[SKB_EXT_SEC_PATH] = SKB_EXT_CHUNKSIZEOF(struct sec_path),
>>   #endif
>> +#if IS_ENABLED(CONFIG_NET_TC_SKB_EXT)
>> +	[TC_SKB_EXT] = SKB_EXT_CHUNKSIZEOF(struct tc_skb_ext),
>> +#endif
>>   };
>>   
>>   static __always_inline unsigned int skb_ext_total_length(void)
>> @@ -4098,6 +4101,9 @@ static __always_inline unsigned int skb_ext_total_length(void)
>>   #ifdef CONFIG_XFRM
>>   		skb_ext_type_len[SKB_EXT_SEC_PATH] +
>>   #endif
>> +#if IS_ENABLED(CONFIG_NET_TC_SKB_EXT)
>> +		skb_ext_type_len[TC_SKB_EXT] +
>> +#endif
>>   		0;
>>   }
>>   
>> diff --git a/net/openvswitch/flow.c b/net/openvswitch/flow.c
>> index bc89e16..0287ead 100644
>> --- a/net/openvswitch/flow.c
>> +++ b/net/openvswitch/flow.c
>> @@ -816,6 +816,9 @@ static int key_extract_mac_proto(struct sk_buff *skb)
>>   int ovs_flow_key_extract(const struct ip_tunnel_info *tun_info,
>>   			 struct sk_buff *skb, struct sw_flow_key *key)
>>   {
>> +#if IS_ENABLED(CONFIG_NET_TC_SKB_EXT)
>> +	struct tc_skb_ext *tc_ext;
>> +#endif
>>   	int res, err;
>>   
>>   	/* Extract metadata from packet. */
>> @@ -848,7 +851,13 @@ int ovs_flow_key_extract(const struct ip_tunnel_info *tun_info,
>>   	if (res < 0)
>>   		return res;
>>   	key->mac_proto = res;
>> +
>> +#if IS_ENABLED(CONFIG_NET_TC_SKB_EXT)
>> +	tc_ext = skb_ext_find(skb, TC_SKB_EXT);
>> +	key->recirc_id = tc_ext ? tc_ext->chain : 0;
>> +#else
>>   	key->recirc_id = 0;
>> +#endif
>>   
>>   	err = key_extract(skb, key);
>>   	if (!err)
>> diff --git a/net/sched/Kconfig b/net/sched/Kconfig
>> index afd2ba1..b3faafe 100644
>> --- a/net/sched/Kconfig
>> +++ b/net/sched/Kconfig
>> @@ -963,6 +963,19 @@ config NET_IFE_SKBTCINDEX
>>           tristate "Support to encoding decoding skb tcindex on IFE action"
>>           depends on NET_ACT_IFE
>>   
>> +config NET_TC_SKB_EXT
>> +	bool "TC recirculation support"
>> +	depends on NET_CLS_ACT
>> +	default y if NET_CLS_ACT
>> +	select SKB_EXTENSIONS
>> +
>> +	help
>> +	  Say Y here to allow tc chain misses to continue in OvS datapath in
>> +	  the correct recirc_id, and hardware chain misses to continue in
>> +	  the correct chain in tc software datapath.
>> +
>> +	  Say N here if you won't be using tc<->ovs offload or tc chains offload.
>> +
>>   endif # NET_SCHED
>>   
>>   config NET_SCH_FIFO
>> diff --git a/net/sched/act_api.c b/net/sched/act_api.c
>> index 3397122..c393604 100644
>> --- a/net/sched/act_api.c
>> +++ b/net/sched/act_api.c
>> @@ -27,6 +27,7 @@ static void tcf_action_goto_chain_exec(const struct tc_action *a,
>>   {
>>   	const struct tcf_chain *chain = rcu_dereference_bh(a->goto_chain);
>>   
>> +	res->goto_index = chain->index;
>>   	res->goto_tp = rcu_dereference_bh(chain->filter_chain);
>>   }
>>   
>> diff --git a/net/sched/cls_api.c b/net/sched/cls_api.c
>> index 3565d9a..b0b829a 100644
>> --- a/net/sched/cls_api.c
>> +++ b/net/sched/cls_api.c
>> @@ -1660,6 +1660,18 @@ int tcf_classify(struct sk_buff *skb, const struct tcf_proto *tp,
>>   			goto reset;
>>   		} else if (unlikely(TC_ACT_EXT_CMP(err, TC_ACT_GOTO_CHAIN))) {
>>   			first_tp = res->goto_tp;
>> +
>> +#if IS_ENABLED(CONFIG_NET_TC_SKB_EXT)
>> +			{
>> +				struct tc_skb_ext *ext;
>> +
>> +				ext = skb_ext_add(skb, TC_SKB_EXT);
>> +				if (WARN_ON_ONCE(!ext))
>> +					return TC_ACT_SHOT;
>> +
>> +				ext->chain = res->goto_index;
>> +			}
>> +#endif
>>   			goto reset;
>>   		}
>>   #endif
>> -- 
>> 1.8.3.1
>>

^ permalink raw reply

* Re: [PATCH 11/17] qca: no need to check return value of debugfs_create functions
From: Michael Heimpold @ 2019-08-08  6:30 UTC (permalink / raw)
  To: Greg Kroah-Hartman; +Cc: netdev, David S. Miller, Stefan Wahren, Yangtao Li
In-Reply-To: <20190806161128.31232-12-gregkh@linuxfoundation.org>

Am Dienstag, 6. August 2019, 18:11:22 CEST schrieb Greg Kroah-Hartman:
> When calling debugfs functions, there is no need to ever check the
> return value.  The function can work or not, but the code logic should
> never do something different based on this.
> 
> Cc: "David S. Miller" <davem@davemloft.net>
> Cc: Stefan Wahren <stefan.wahren@i2se.com>
> Cc: Michael Heimpold <michael.heimpold@i2se.com>
> Cc: Yangtao Li <tiny.windzz@gmail.com>
> Cc: netdev@vger.kernel.org
> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
> ---
>  drivers/net/ethernet/qualcomm/qca_debug.c | 13 +++----------
>  1 file changed, 3 insertions(+), 10 deletions(-)
> 
> diff --git a/drivers/net/ethernet/qualcomm/qca_debug.c
> b/drivers/net/ethernet/qualcomm/qca_debug.c index
> bcb890b18a94..702aa217a27a 100644
> --- a/drivers/net/ethernet/qualcomm/qca_debug.c
> +++ b/drivers/net/ethernet/qualcomm/qca_debug.c
> @@ -131,17 +131,10 @@ DEFINE_SHOW_ATTRIBUTE(qcaspi_info);
>  void
>  qcaspi_init_device_debugfs(struct qcaspi *qca)
>  {
> -	struct dentry *device_root;
> +	qca->device_root = debugfs_create_dir(dev_name(&qca->net_dev->dev),
> +					      NULL);
> 
> -	device_root = debugfs_create_dir(dev_name(&qca->net_dev->dev), NULL);
> -	qca->device_root = device_root;
> -
> -	if (IS_ERR(device_root) || !device_root) {
> -		pr_warn("failed to create debugfs directory for %s\n",
> -			dev_name(&qca->net_dev->dev));
> -		return;
> -	}
> -	debugfs_create_file("info", S_IFREG | 0444, device_root, qca,
> +	debugfs_create_file("info", S_IFREG | 0444, qca->device_root, qca,
>  			    &qcaspi_info_fops);
>  }

Acked-by: Michael Heimpold <michael.heimpold@i2se.com>





^ permalink raw reply

* [PATCH v2] team: Add vlan tx offload to hw_enc_features
From: YueHaibing @ 2019-08-08  6:22 UTC (permalink / raw)
  To: j.vosburgh, vfalico, andy, davem, jiri, jay.vosburgh
  Cc: linux-kernel, netdev, YueHaibing
In-Reply-To: <20190807023808.51976-1-yuehaibing@huawei.com>

We should also enable team's vlan tx offload in hw_enc_features,
pass the vlan packets to the slave devices with vlan tci, let the
slave handle vlan tunneling offload implementation.

Fixes: 3268e5cb494d ("team: Advertise tunneling offload features")
Signed-off-by: YueHaibing <yuehaibing@huawei.com>
---
v2: fix commit log typo
---
 drivers/net/team/team.c | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/drivers/net/team/team.c b/drivers/net/team/team.c
index abfa0da..e8089de 100644
--- a/drivers/net/team/team.c
+++ b/drivers/net/team/team.c
@@ -1004,6 +1004,8 @@ static void __team_compute_features(struct team *team)
 
 	team->dev->vlan_features = vlan_features;
 	team->dev->hw_enc_features = enc_features | NETIF_F_GSO_ENCAP_ALL |
+				     NETIF_F_HW_VLAN_CTAG_TX |
+				     NETIF_F_HW_VLAN_STAG_TX |
 				     NETIF_F_GSO_UDP_L4;
 	team->dev->hard_header_len = max_hard_header_len;
 
-- 
2.7.4



^ permalink raw reply related

* Re: [PATCH net] net: phy: rtl8211f: do a double read to get real time link status
From: Yonglong Liu @ 2019-08-08  6:21 UTC (permalink / raw)
  To: Heiner Kallweit, davem, andrew
  Cc: netdev, linux-kernel, linuxarm, salil.mehta, yisen.zhuang,
	shiju.jose
In-Reply-To: <080b68c7-abe6-d142-da4b-26e8a7d4dc19@gmail.com>



On 2019/8/8 14:11, Heiner Kallweit wrote:
> On 08.08.2019 03:15, Yonglong Liu wrote:
>>
>>
>> On 2019/8/8 0:47, Heiner Kallweit wrote:
>>> On 07.08.2019 15:16, Yonglong Liu wrote:
>>>> [   27.232781] hns3 0000:bd:00.3 eth7: net open
>>>> [   27.237303] 8021q: adding VLAN 0 to HW filter on device eth7
>>>> [   27.242972] IPv6: ADDRCONF(NETDEV_CHANGE): eth7: link becomes ready
>>>> [   27.244449] hns3 0000:bd:00.3: invalid speed (-1)
>>>> [   27.253904] hns3 0000:bd:00.3 eth7: failed to adjust link.
>>>> [   27.259379] RTL8211F Gigabit Ethernet mii-0000:bd:00.3:07: PHY state change UP -> RUNNING
>>>> [   27.924903] hns3 0000:bd:00.3 eth7: link up
>>>> [   28.280479] RTL8211F Gigabit Ethernet mii-0000:bd:00.3:07: PHY state change RUNNING -> NOLINK
>>>> [   29.208452] hns3 0000:bd:00.3 eth7: link down
>>>> [   32.376745] RTL8211F Gigabit Ethernet mii-0000:bd:00.3:07: PHY state change NOLINK -> RUNNING
>>>> [   33.208448] hns3 0000:bd:00.3 eth7: link up
>>>> [   35.253821] hns3 0000:bd:00.3 eth7: net stop
>>>> [   35.258270] hns3 0000:bd:00.3 eth7: link down
>>>>
>>>> When using rtl8211f in polling mode, may get a invalid speed,
>>>> because of reading a fake link up and autoneg complete status
>>>> immediately after starting autoneg:
>>>>
>>>>         ifconfig-1176  [007] ....    27.232763: mdio_access: mii-0000:bd:00.3 read  phy:0x07 reg:0x00 val:0x1040
>>>>   kworker/u257:1-670   [015] ....    27.232805: mdio_access: mii-0000:bd:00.3 read  phy:0x07 reg:0x04 val:0x01e1
>>>>   kworker/u257:1-670   [015] ....    27.232815: mdio_access: mii-0000:bd:00.3 write phy:0x07 reg:0x04 val:0x05e1
>>>>   kworker/u257:1-670   [015] ....    27.232869: mdio_access: mii-0000:bd:00.3 read  phy:0x07 reg:0x01 val:0x79ad
>>>>   kworker/u257:1-670   [015] ....    27.232904: mdio_access: mii-0000:bd:00.3 read  phy:0x07 reg:0x09 val:0x0200
>>>>   kworker/u257:1-670   [015] ....    27.232940: mdio_access: mii-0000:bd:00.3 read  phy:0x07 reg:0x00 val:0x1040
>>>>   kworker/u257:1-670   [015] ....    27.232949: mdio_access: mii-0000:bd:00.3 write phy:0x07 reg:0x00 val:0x1240
>>>>   kworker/u257:1-670   [015] ....    27.233003: mdio_access: mii-0000:bd:00.3 read  phy:0x07 reg:0x01 val:0x79ad
>>>>   kworker/u257:1-670   [015] ....    27.233039: mdio_access: mii-0000:bd:00.3 read  phy:0x07 reg:0x0a val:0x3002
>>>>   kworker/u257:1-670   [015] ....    27.233074: mdio_access: mii-0000:bd:00.3 read  phy:0x07 reg:0x09 val:0x0200
>>>>   kworker/u257:1-670   [015] ....    27.233110: mdio_access: mii-0000:bd:00.3 read  phy:0x07 reg:0x05 val:0x0000
>>>>   kworker/u257:1-670   [000] ....    28.280475: mdio_access: mii-0000:bd:00.3 read  phy:0x07 reg:0x01 val:0x7989
>>>>   kworker/u257:1-670   [000] ....    29.304471: mdio_access: mii-0000:bd:00.3 read  phy:0x07 reg:0x01 val:0x7989
>>>>
>>>> According to the datasheet of rtl8211f, to get the real time
>>>> link status, need to read MII_BMSR twice.
>>>>
>>>> This patch add a read_status hook for rtl8211f, and do a fake
>>>> phy_read before genphy_read_status(), so that can get real link
>>>> status in genphy_read_status().
>>>>
>>>> Signed-off-by: Yonglong Liu <liuyonglong@huawei.com>
>>>> ---
>>>>  drivers/net/phy/realtek.c | 13 +++++++++++++
>>>>  1 file changed, 13 insertions(+)
>>>>
>>> Is this an accidental resubmit? Because we discussed this in
>>> https://marc.info/?t=156413509900003&r=1&w=2 and a fix has
>>> been applied already.
>>>
>>> Heiner
>>>
>>> .
>>>
>>
>> In https://marc.info/?t=156413509900003&r=1&w=2 , the invalid speed
>> recurrence rate is almost 100%, and I had test the solution about
>> 5 times and it works. But yesterday it happen again suddenly, and than
>> I fount that the recurrence rate reduce to 10%. This time we get 0x79ad
>> after autoneg started which is not 0x798d from last discussion.
>>
>>
>>
> OK, I'll have a look.
> However the approach is wrong. The double read is related to the latching
> of link-down events. This is done by all PHY's and not specific to RT8211F.
> Also it's not related to the problem. I assume any sufficient delay would
> do instead of the read.
> 
> .
> 

So you will send a new patch to fix this problem? I am waiting for it,
and can do a full test this time.


^ permalink raw reply

* Re: [PATCH v2 2/2] tcp: Update TCP_BASE_MSS comment
From: Eric Dumazet @ 2019-08-08  6:13 UTC (permalink / raw)
  To: Josh Hunt, netdev; +Cc: davem, edumazet, ncardwell
In-Reply-To: <1565221950-1376-2-git-send-email-johunt@akamai.com>



On 8/8/19 1:52 AM, Josh Hunt wrote:
> TCP_BASE_MSS is used as the default initial MSS value when MTU probing is
> enabled. Update the comment to reflect this.
> 
> Suggested-by: Neal Cardwell <ncardwell@google.com>
> Signed-off-by: Josh Hunt <johunt@akamai.com>
> ---

Signed-off-by: Eric Dumazet <edumazet@google.com>


^ permalink raw reply

* Re: [PATCH v2 1/2] tcp: add new tcp_mtu_probe_floor sysctl
From: Eric Dumazet @ 2019-08-08  6:12 UTC (permalink / raw)
  To: Josh Hunt, netdev; +Cc: davem, edumazet, ncardwell
In-Reply-To: <1565221950-1376-1-git-send-email-johunt@akamai.com>



On 8/8/19 1:52 AM, Josh Hunt wrote:
> The current implementation of TCP MTU probing can considerably
> underestimate the MTU on lossy connections allowing the MSS to get down to
> 48. We have found that in almost all of these cases on our networks these
> paths can handle much larger MTUs meaning the connections are being
> artificially limited. Even though TCP MTU probing can raise the MSS back up
> we have seen this not to be the case causing connections to be "stuck" with
> an MSS of 48 when heavy loss is present.
> 
> Prior to pushing out this change we could not keep TCP MTU probing enabled
> b/c of the above reasons. Now with a reasonble floor set we've had it
> enabled for the past 6 months.

I am still sad to see you do not share what is a reasonable value and let
everybody guess. 

It seems to be a top-secret value.

> 
> The new sysctl will still default to TCP_MIN_SND_MSS (48), but gives
> administrators the ability to control the floor of MSS probing.
> 
> Signed-off-by: Josh Hunt <johunt@akamai.com>

Signed-off-by: Eric Dumazet <edumazet@google.com>

^ permalink raw reply

* Re: [PATCH net] net: phy: rtl8211f: do a double read to get real time link status
From: Heiner Kallweit @ 2019-08-08  6:11 UTC (permalink / raw)
  To: Yonglong Liu, davem, andrew
  Cc: netdev, linux-kernel, linuxarm, salil.mehta, yisen.zhuang,
	shiju.jose
In-Reply-To: <e663235c-93eb-702d-5a9c-8f781d631c42@huawei.com>

On 08.08.2019 03:15, Yonglong Liu wrote:
> 
> 
> On 2019/8/8 0:47, Heiner Kallweit wrote:
>> On 07.08.2019 15:16, Yonglong Liu wrote:
>>> [   27.232781] hns3 0000:bd:00.3 eth7: net open
>>> [   27.237303] 8021q: adding VLAN 0 to HW filter on device eth7
>>> [   27.242972] IPv6: ADDRCONF(NETDEV_CHANGE): eth7: link becomes ready
>>> [   27.244449] hns3 0000:bd:00.3: invalid speed (-1)
>>> [   27.253904] hns3 0000:bd:00.3 eth7: failed to adjust link.
>>> [   27.259379] RTL8211F Gigabit Ethernet mii-0000:bd:00.3:07: PHY state change UP -> RUNNING
>>> [   27.924903] hns3 0000:bd:00.3 eth7: link up
>>> [   28.280479] RTL8211F Gigabit Ethernet mii-0000:bd:00.3:07: PHY state change RUNNING -> NOLINK
>>> [   29.208452] hns3 0000:bd:00.3 eth7: link down
>>> [   32.376745] RTL8211F Gigabit Ethernet mii-0000:bd:00.3:07: PHY state change NOLINK -> RUNNING
>>> [   33.208448] hns3 0000:bd:00.3 eth7: link up
>>> [   35.253821] hns3 0000:bd:00.3 eth7: net stop
>>> [   35.258270] hns3 0000:bd:00.3 eth7: link down
>>>
>>> When using rtl8211f in polling mode, may get a invalid speed,
>>> because of reading a fake link up and autoneg complete status
>>> immediately after starting autoneg:
>>>
>>>         ifconfig-1176  [007] ....    27.232763: mdio_access: mii-0000:bd:00.3 read  phy:0x07 reg:0x00 val:0x1040
>>>   kworker/u257:1-670   [015] ....    27.232805: mdio_access: mii-0000:bd:00.3 read  phy:0x07 reg:0x04 val:0x01e1
>>>   kworker/u257:1-670   [015] ....    27.232815: mdio_access: mii-0000:bd:00.3 write phy:0x07 reg:0x04 val:0x05e1
>>>   kworker/u257:1-670   [015] ....    27.232869: mdio_access: mii-0000:bd:00.3 read  phy:0x07 reg:0x01 val:0x79ad
>>>   kworker/u257:1-670   [015] ....    27.232904: mdio_access: mii-0000:bd:00.3 read  phy:0x07 reg:0x09 val:0x0200
>>>   kworker/u257:1-670   [015] ....    27.232940: mdio_access: mii-0000:bd:00.3 read  phy:0x07 reg:0x00 val:0x1040
>>>   kworker/u257:1-670   [015] ....    27.232949: mdio_access: mii-0000:bd:00.3 write phy:0x07 reg:0x00 val:0x1240
>>>   kworker/u257:1-670   [015] ....    27.233003: mdio_access: mii-0000:bd:00.3 read  phy:0x07 reg:0x01 val:0x79ad
>>>   kworker/u257:1-670   [015] ....    27.233039: mdio_access: mii-0000:bd:00.3 read  phy:0x07 reg:0x0a val:0x3002
>>>   kworker/u257:1-670   [015] ....    27.233074: mdio_access: mii-0000:bd:00.3 read  phy:0x07 reg:0x09 val:0x0200
>>>   kworker/u257:1-670   [015] ....    27.233110: mdio_access: mii-0000:bd:00.3 read  phy:0x07 reg:0x05 val:0x0000
>>>   kworker/u257:1-670   [000] ....    28.280475: mdio_access: mii-0000:bd:00.3 read  phy:0x07 reg:0x01 val:0x7989
>>>   kworker/u257:1-670   [000] ....    29.304471: mdio_access: mii-0000:bd:00.3 read  phy:0x07 reg:0x01 val:0x7989
>>>
>>> According to the datasheet of rtl8211f, to get the real time
>>> link status, need to read MII_BMSR twice.
>>>
>>> This patch add a read_status hook for rtl8211f, and do a fake
>>> phy_read before genphy_read_status(), so that can get real link
>>> status in genphy_read_status().
>>>
>>> Signed-off-by: Yonglong Liu <liuyonglong@huawei.com>
>>> ---
>>>  drivers/net/phy/realtek.c | 13 +++++++++++++
>>>  1 file changed, 13 insertions(+)
>>>
>> Is this an accidental resubmit? Because we discussed this in
>> https://marc.info/?t=156413509900003&r=1&w=2 and a fix has
>> been applied already.
>>
>> Heiner
>>
>> .
>>
> 
> In https://marc.info/?t=156413509900003&r=1&w=2 , the invalid speed
> recurrence rate is almost 100%, and I had test the solution about
> 5 times and it works. But yesterday it happen again suddenly, and than
> I fount that the recurrence rate reduce to 10%. This time we get 0x79ad
> after autoneg started which is not 0x798d from last discussion.
> 
> 
> 
OK, I'll have a look.
However the approach is wrong. The double read is related to the latching
of link-down events. This is done by all PHY's and not specific to RT8211F.
Also it's not related to the problem. I assume any sufficient delay would
do instead of the read.

^ permalink raw reply

* Re: [PATCH bpf-next] btf: expose BTF info through sysfs
From: Greg KH @ 2019-08-08  6:09 UTC (permalink / raw)
  To: Andrii Nakryiko
  Cc: bpf, netdev, ast, daniel, yhs, andrii.nakryiko, kernel-team,
	Masahiro Yamada, Arnaldo Carvalho de Melo, Jiri Olsa
In-Reply-To: <20190807183821.138728-1-andriin@fb.com>

On Wed, Aug 07, 2019 at 11:38:21AM -0700, Andrii Nakryiko wrote:
> Make .BTF section allocated and expose its contents through sysfs.

Ah, found the original of this, sorry for the noise on the previous
response...

Still need Documentation/ABI/ entries though :)

thanks,

greg k-h

^ permalink raw reply

* Re: [PATCH v2 bpf-next] btf: expose BTF info through sysfs
From: Greg KH @ 2019-08-08  6:08 UTC (permalink / raw)
  To: Yonghong Song
  Cc: Andrii Nakryiko, bpf@vger.kernel.org, netdev@vger.kernel.org,
	Alexei Starovoitov, daniel@iogearbox.net,
	andrii.nakryiko@gmail.com, Kernel Team, Masahiro Yamada,
	Arnaldo Carvalho de Melo, Jiri Olsa, Sam Ravnborg
In-Reply-To: <89a6e282-0250-4264-128d-469be99073e9@fb.com>

On Thu, Aug 08, 2019 at 04:24:25AM +0000, Yonghong Song wrote:
> 
> 
> On 8/7/19 5:32 PM, Andrii Nakryiko wrote:
> > Make .BTF section allocated and expose its contents through sysfs.

Was this original patch not on bpf@vger?  I can't find it in my
archive.  Anyway...

> > /sys/kernel/btf directory is created to contain all the BTFs present
> > inside kernel. Currently there is only kernel's main BTF, represented as
> > /sys/kernel/btf/kernel file. Once kernel modules' BTFs are supported,
> > each module will expose its BTF as /sys/kernel/btf/<module-name> file.

Why are you using sysfs for this?  Who uses "BTF"s?  Are these debugging
images that only people working on developing bpf programs are going to
need, or are these things that you are going to need on a production
system?

I ask as maybe debugfs is the best place for this if they are not needed
on production systems.


> > 
> > Current approach relies on a few pieces coming together:
> > 1. pahole is used to take almost final vmlinux image (modulo .BTF and
> >     kallsyms) and generate .BTF section by converting DWARF info into
> >     BTF. This section is not allocated and not mapped to any segment,
> >     though, so is not yet accessible from inside kernel at runtime.
> > 2. objcopy dumps .BTF contents into binary file and subsequently
> >     convert binary file into linkable object file with automatically
> >     generated symbols _binary__btf_kernel_bin_start and
> >     _binary__btf_kernel_bin_end, pointing to start and end, respectively,
> >     of BTF raw data.
> > 3. final vmlinux image is generated by linking this object file (and
> >     kallsyms, if necessary). sysfs_btf.c then creates
> >     /sys/kernel/btf/kernel file and exposes embedded BTF contents through
> >     it. This allows, e.g., libbpf and bpftool access BTF info at
> >     well-known location, without resorting to searching for vmlinux image
> >     on disk (location of which is not standardized and vmlinux image
> >     might not be even available in some scenarios, e.g., inside qemu
> >     during testing).
> > 
> > Alternative approach using .incbin assembler directive to embed BTF
> > contents directly was attempted but didn't work, because sysfs_proc.o is
> > not re-compiled during link-vmlinux.sh stage. This is required, though,
> > to update embedded BTF data (initially empty data is embedded, then
> > pahole generates BTF info and we need to regenerate sysfs_btf.o with
> > updated contents, but it's too late at that point).
> > 
> > If BTF couldn't be generated due to missing or too old pahole,
> > sysfs_btf.c handles that gracefully by detecting that
> > _binary__btf_kernel_bin_start (weak symbol) is 0 and not creating
> > /sys/kernel/btf at all.
> > 
> > v1->v2:
> > - allow kallsyms stage to re-use vmlinux generated by gen_btf();
> > 
> > Cc: Masahiro Yamada <yamada.masahiro@socionext.com>
> > Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
> > Cc: Jiri Olsa <jolsa@kernel.org>
> > Cc: Sam Ravnborg <sam@ravnborg.org>
> > Signed-off-by: Andrii Nakryiko <andriin@fb.com>
> > ---
> >   kernel/bpf/Makefile     |  3 +++
> >   kernel/bpf/sysfs_btf.c  | 52 ++++++++++++++++++++++++++++++++++++++
> >   scripts/link-vmlinux.sh | 55 +++++++++++++++++++++++++++--------------
> >   3 files changed, 91 insertions(+), 19 deletions(-)
> >   create mode 100644 kernel/bpf/sysfs_btf.c

First rule, you can't create new sysfs files without a matching
Documentation/ABI/ set of entries.  Please do that for the next version
of this patch so we can properly check to see if what you are
documenting lines up with the code.  Otherwise we just have to guess as
to what the entries you are creating actually do.

> > 
> > diff --git a/kernel/bpf/Makefile b/kernel/bpf/Makefile
> > index 29d781061cd5..e1d9adb212f9 100644
> > --- a/kernel/bpf/Makefile
> > +++ b/kernel/bpf/Makefile
> > @@ -22,3 +22,6 @@ obj-$(CONFIG_CGROUP_BPF) += cgroup.o
> >   ifeq ($(CONFIG_INET),y)
> >   obj-$(CONFIG_BPF_SYSCALL) += reuseport_array.o
> >   endif
> > +ifeq ($(CONFIG_SYSFS),y)
> > +obj-$(CONFIG_DEBUG_INFO_BTF) += sysfs_btf.o
> > +endif
> > diff --git a/kernel/bpf/sysfs_btf.c b/kernel/bpf/sysfs_btf.c
> > new file mode 100644
> > index 000000000000..ac06ce1d62e8
> > --- /dev/null
> > +++ b/kernel/bpf/sysfs_btf.c
> > @@ -0,0 +1,52 @@
> > +// SPDX-License-Identifier: GPL-2.0
> > +/*
> > + * Provide kernel BTF information for introspection and use by eBPF tools.
> > + */
> > +#include <linux/kernel.h>
> > +#include <linux/module.h>
> > +#include <linux/kobject.h>
> > +#include <linux/init.h>
> > +
> > +/* See scripts/link-vmlinux.sh, gen_btf() func for details */
> > +extern char __weak _binary__btf_kernel_bin_start[];
> > +extern char __weak _binary__btf_kernel_bin_end[];
> > +
> > +static ssize_t
> > +btf_kernel_read(struct file *file, struct kobject *kobj,
> > +		struct bin_attribute *bin_attr,
> > +		char *buf, loff_t off, size_t len)
> > +{
> > +	memcpy(buf, _binary__btf_kernel_bin_start + off, len);
> > +	return len;
> > +}
> > +
> > +static struct bin_attribute btf_kernel_attr __ro_after_init = {
> > +	.attr = {
> > +		.name = "kernel",
> > +		.mode = 0444,
> > +	},
> > +	.read = btf_kernel_read,
> > +};

BIN_ATTR_RO()?

> > +
> > +static struct bin_attribute *btf_attrs[] __ro_after_init = {
> > +	&btf_kernel_attr,
> > +	NULL,
> > +};
> > +
> > +static struct attribute_group btf_group_attr __ro_after_init = {
> > +	.name = "btf",
> > +	.bin_attrs = btf_attrs,
> > +};
> > +
> > +static int __init btf_kernel_init(void)
> > +{
> > +	if (!_binary__btf_kernel_bin_start)
> > +		return 0;
> > +
> > +	btf_kernel_attr.size = _binary__btf_kernel_bin_end -
> > +			       _binary__btf_kernel_bin_start;
> > +
> > +	return sysfs_create_group(kernel_kobj, &btf_group_attr);

You are nesting directories here without a "real" kobject in the middle.
Are you _sure_ you want to do that?  It's going to get really tricky
later on based on your comments above about creating multiple files in
that directory over time once "modules" are allowed.

thanks,

greg k-h

^ permalink raw reply

* Re: [PATCH] team: Add vlan tx offload to hw_enc_features
From: Jesse Brandeburg @ 2019-08-08  5:48 UTC (permalink / raw)
  To: YueHaibing
  Cc: j.vosburgh, vfalico, andy, davem, jiri, jay.vosburgh,
	linux-kernel, netdev, jesse.brandeburg
In-Reply-To: <20190807023808.51976-1-yuehaibing@huawei.com>

On Wed, 7 Aug 2019 10:38:08 +0800
YueHaibing <yuehaibing@huawei.com> wrote:

> We should also enable bonding's vlan tx offload in hw_enc_features,

You mean team's vlan tx offload?

> pass the vlan packets to the slave devices with vlan tci, let them

s/let them to/let the slave/

> to handle vlan tunneling offload implementation.
> 
> Fixes: 3268e5cb494d ("team: Advertise tunneling offload features")
> Signed-off-by: YueHaibing <yuehaibing@huawei.com>


^ permalink raw reply

* [PATCH] net/netfilter/nf_nat_proto.c - make tables static
From: Valdis Klētnieks @ 2019-08-08  5:43 UTC (permalink / raw)
  To: Pablo Neira Ayuso, Jozsef Kadlecsik, Florian Westphal
  Cc: netfilter-devel, coreteam, netdev, linux-kernel

Sparse warns about two tables not being declared.

  CHECK   net/netfilter/nf_nat_proto.c
net/netfilter/nf_nat_proto.c:725:26: warning: symbol 'nf_nat_ipv4_ops' was not declared. Should it be static?
net/netfilter/nf_nat_proto.c:964:26: warning: symbol 'nf_nat_ipv6_ops' was not declared. Should it be static?

And in fact they can indeed be static.

Signed-off-by: Valdis Kletnieks <valdis.kletnieks@vt.edu>

diff --git a/net/netfilter/nf_nat_proto.c b/net/netfilter/nf_nat_proto.c
index 7ac733ebd060..0a59c14b5177 100644
--- a/net/netfilter/nf_nat_proto.c
+++ b/net/netfilter/nf_nat_proto.c
@@ -722,7 +722,7 @@ nf_nat_ipv4_local_fn(void *priv, struct sk_buff *skb,
 	return ret;
 }
 
-const struct nf_hook_ops nf_nat_ipv4_ops[] = {
+static const struct nf_hook_ops nf_nat_ipv4_ops[] = {
 	/* Before packet filtering, change destination */
 	{
 		.hook		= nf_nat_ipv4_in,
@@ -961,7 +961,7 @@ nf_nat_ipv6_local_fn(void *priv, struct sk_buff *skb,
 	return ret;
 }
 
-const struct nf_hook_ops nf_nat_ipv6_ops[] = {
+static const struct nf_hook_ops nf_nat_ipv6_ops[] = {
 	/* Before packet filtering, change destination */
 	{
 		.hook		= nf_nat_ipv6_in,


^ permalink raw reply related

* Re: [PATCH v3 38/41] powerpc: convert put_page() to put_user_page*()
From: Michael Ellerman @ 2019-08-08  5:42 UTC (permalink / raw)
  To: john.hubbard, Andrew Morton
  Cc: Christoph Hellwig, Dan Williams, Dave Chinner, Dave Hansen,
	Ira Weiny, Jan Kara, Jason Gunthorpe, Jérôme Glisse,
	LKML, amd-gfx, ceph-devel, devel, devel, dri-devel, intel-gfx,
	kvm, linux-arm-kernel, linux-block, linux-crypto, linux-fbdev,
	linux-fsdevel, linux-media, linux-mm, linux-nfs, linux-rdma,
	linux-rpi-kernel, linux-xfs, netdev, rds-devel, sparclinux, x86,
	xen-devel, John Hubbard, Benjamin Herrenschmidt,
	Christoph Hellwig, linuxppc-dev
In-Reply-To: <20190807013340.9706-39-jhubbard@nvidia.com>

Hi John,

john.hubbard@gmail.com writes:
> diff --git a/arch/powerpc/mm/book3s64/iommu_api.c b/arch/powerpc/mm/book3s64/iommu_api.c
> index b056cae3388b..e126193ba295 100644
> --- a/arch/powerpc/mm/book3s64/iommu_api.c
> +++ b/arch/powerpc/mm/book3s64/iommu_api.c
> @@ -203,6 +202,7 @@ static void mm_iommu_unpin(struct mm_iommu_table_group_mem_t *mem)
>  {
>  	long i;
>  	struct page *page = NULL;
> +	bool dirty = false;

I don't think you need that initialisation do you?

>  	if (!mem->hpas)
>  		return;
> @@ -215,10 +215,9 @@ static void mm_iommu_unpin(struct mm_iommu_table_group_mem_t *mem)
>  		if (!page)
>  			continue;
>  
> -		if (mem->hpas[i] & MM_IOMMU_TABLE_GROUP_PAGE_DIRTY)
> -			SetPageDirty(page);
> +		dirty = mem->hpas[i] & MM_IOMMU_TABLE_GROUP_PAGE_DIRTY;
> -		put_page(page);
> +		put_user_pages_dirty_lock(&page, 1, dirty);
>  		mem->hpas[i] = 0;
>  	}
>  }

cheers

^ permalink raw reply

* Re: [PATCH] myri10ge: remove unneeded variable
From: Jesse Brandeburg @ 2019-08-08  5:31 UTC (permalink / raw)
  To: Ding Xiang; +Cc: christopher.lee, davem, netdev, linux-kernel, jesse.brandeburg
In-Reply-To: <1564563226-13367-1-git-send-email-dingxiang@cmss.chinamobile.com>

On Wed, 31 Jul 2019 16:53:46 +0800
Ding Xiang <dingxiang@cmss.chinamobile.com> wrote:

> "error" is unneeded,just return 0
> 
> Signed-off-by: Ding Xiang <dingxiang@cmss.chinamobile.com>

Reviewed-by: Jesse Brandeburg <jesse.brandeburg@intel.com>

^ permalink raw reply

* Re: [PATCH 2/2] net: ag71xx: Use GFP_KERNEL instead of GFP_ATOMIC in 'ag71xx_rings_init()'
From: Jesse Brandeburg @ 2019-08-08  5:30 UTC (permalink / raw)
  To: Christophe JAILLET
  Cc: jcliburn, davem, chris.snook, netdev, linux-kernel,
	kernel-janitors, jesse.brandeburg
In-Reply-To: <75ee52ae065ce9cb1f32d88939b166773316dc45.1564560130.git.christophe.jaillet@wanadoo.fr>

On Wed, 31 Jul 2019 10:06:48 +0200
Christophe JAILLET <christophe.jaillet@wanadoo.fr> wrote:

> There is no need to use GFP_ATOMIC here, GFP_KERNEL should be enough.
> The 'kcalloc()' just a few lines above, already uses GFP_KERNEL.
> 
> Signed-off-by: Christophe JAILLET <christophe.jaillet@wanadoo.fr>

Reviewed-by: Jesse Brandeburg <jesse.brandeburg@intel.com>

^ permalink raw reply

* Re: [PATCH 1/2] net: ag71xx: Slighly simplify code in 'ag71xx_rings_init()'
From: Jesse Brandeburg @ 2019-08-08  5:29 UTC (permalink / raw)
  To: Christophe JAILLET
  Cc: jcliburn, davem, chris.snook, netdev, linux-kernel,
	kernel-janitors, jesse.brandeburg
In-Reply-To: <08fbcfe0f913644fe538656221a15790a1a83f1d.1564560130.git.christophe.jaillet@wanadoo.fr>

On Wed, 31 Jul 2019 10:06:38 +0200
Christophe JAILLET <christophe.jaillet@wanadoo.fr> wrote:

> A few lines above, we have:
>    tx_size = BIT(tx->order);
> 
> So use 'tx_size' directly to be consistent with the way 'rx->descs_cpu' and
> 'rx->descs_dma' are computed below.
> 
> Signed-off-by: Christophe JAILLET <christophe.jaillet@wanadoo.fr>

Reviewed-by: Jesse Brandeburg <jesse.brandeburg@intel.com>

^ permalink raw reply

* [PATCH] net/netfilter - add missing prototypes.
From: Valdis Klētnieks @ 2019-08-08  5:28 UTC (permalink / raw)
  To: Pablo Neira Ayuso, Jozsef Kadlecsik, Florian Westphal
  Cc: netfilter-devel, coreteam, netdev, linux-kernel

Sparse rightly complains about undeclared symbols.

  CHECK   net/netfilter/nft_set_hash.c
net/netfilter/nft_set_hash.c:647:21: warning: symbol 'nft_set_rhash_type' was not declared. Should it be static?
net/netfilter/nft_set_hash.c:670:21: warning: symbol 'nft_set_hash_type' was not declared. Should it be static?
net/netfilter/nft_set_hash.c:690:21: warning: symbol 'nft_set_hash_fast_type' was not declared. Should it be static?
  CHECK   net/netfilter/nft_set_bitmap.c
net/netfilter/nft_set_bitmap.c:296:21: warning: symbol 'nft_set_bitmap_type' was not declared. Should it be static?
  CHECK   net/netfilter/nft_set_rbtree.c
net/netfilter/nft_set_rbtree.c:470:21: warning: symbol 'nft_set_rbtree_type' was not declared. Should it be static?

Include nf_tables_core.h rather than nf_tables.h to pick up the additional definitions.

Signed-off-by: Valdis Kletnieks <valdis.kletnieks@vt.edu>

diff --git a/net/netfilter/nft_set_bitmap.c b/net/netfilter/nft_set_bitmap.c
index b5aeccdddb22..087a056e34d1 100644
--- a/net/netfilter/nft_set_bitmap.c
+++ b/net/netfilter/nft_set_bitmap.c
@@ -10,7 +10,7 @@
 #include <linux/netlink.h>
 #include <linux/netfilter.h>
 #include <linux/netfilter/nf_tables.h>
-#include <net/netfilter/nf_tables.h>
+#include <net/netfilter/nf_tables_core.h>

 struct nft_bitmap_elem {
 	struct list_head	head;
diff --git a/net/netfilter/nft_set_hash.c b/net/netfilter/nft_set_hash.c
index 6e8d20c03e3d..c490451fcebf 100644
--- a/net/netfilter/nft_set_hash.c
+++ b/net/netfilter/nft_set_hash.c
@@ -16,7 +16,7 @@
 #include <linux/rhashtable.h>
 #include <linux/netfilter.h>
 #include <linux/netfilter/nf_tables.h>
-#include <net/netfilter/nf_tables.h>
+#include <net/netfilter/nf_tables_core.h>

 /* We target a hash table size of 4, element hint is 75% of final size */
 #define NFT_RHASH_ELEMENT_HINT 3
diff --git a/net/netfilter/nft_set_rbtree.c b/net/netfilter/nft_set_rbtree.c
index 419d58ef802b..57123259452f 100644
--- a/net/netfilter/nft_set_rbtree.c
+++ b/net/netfilter/nft_set_rbtree.c
@@ -13,7 +13,7 @@
 #include <linux/netlink.h>
 #include <linux/netfilter.h>
 #include <linux/netfilter/nf_tables.h>
-#include <net/netfilter/nf_tables.h>
+#include <net/netfilter/nf_tables_core.h>

 struct nft_rbtree {
 	struct rb_root		root;

^ permalink raw reply related

* Re: [PATCH] net: ethernet: et131x: Use GFP_KERNEL instead of GFP_ATOMIC when allocating tx_ring->tcb_ring
From: Jesse Brandeburg @ 2019-08-08  5:23 UTC (permalink / raw)
  To: Christophe JAILLET
  Cc: mark.einon, davem, willy, f.fainelli, andrew, netdev,
	linux-kernel, kernel-janitors, jesse.brandeburg
In-Reply-To: <20190731073842.16948-1-christophe.jaillet@wanadoo.fr>

On Wed, 31 Jul 2019 09:38:42 +0200
Christophe JAILLET <christophe.jaillet@wanadoo.fr> wrote:

> There is no good reason to use GFP_ATOMIC here. Other memory allocations
> are performed with GFP_KERNEL (see other 'dma_alloc_coherent()' below and
> 'kzalloc()' in 'et131x_rx_dma_memory_alloc()')
> 
> Use GFP_KERNEL which should be enough.
> 
> Signed-off-by: Christophe JAILLET <christophe.jaillet@wanadoo.fr>

Sure, but generally I'd say GFP_ATOMIC is ok if you're in an init path
and you can afford to have the allocation thread sleep while memory is
being found by the kernel.

Reviewed-by: Jesse Brandeburg <jesse.brandeburg@intel.com>

^ permalink raw reply

* Re: [PATCH net 2/2] mlxsw: spectrum_buffers: Further reduce pool size on Spectrum-2
From: Jesse Brandeburg @ 2019-08-08  5:19 UTC (permalink / raw)
  To: Ido Schimmel
  Cc: netdev, davem, jiri, petrm, mlxsw, Ido Schimmel, jesse.brandeburg
In-Reply-To: <20190731063315.9381-3-idosch@idosch.org>

On Wed, 31 Jul 2019 09:33:15 +0300
Ido Schimmel <idosch@idosch.org> wrote:

> From: Petr Machata <petrm@mellanox.com>
> 
> In commit e891ce1dd2a5 ("mlxsw: spectrum_buffers: Reduce pool size on
> Spectrum-2"), pool size was reduced to mitigate a problem in port buffer
> usage of ports split four ways. It turns out that this work around does not
> solve the issue, and a further reduction is required.
> 
> Thus reduce the size of pool 0 by another 2.7 MiB, and round down to the
> whole number of cells.
> 
> Fixes: e891ce1dd2a5 ("mlxsw: spectrum_buffers: Reduce pool size on Spectrum-2")
> Signed-off-by: Petr Machata <petrm@mellanox.com>
> Signed-off-by: Ido Schimmel <idosch@mellanox.com>

Reviewed-by: Jesse Brandeburg <jesse.brandeburg@intel.com>

^ permalink raw reply

* Re: [PATCH net 1/2] mlxsw: spectrum: Fix error path in mlxsw_sp_module_init()
From: Jesse Brandeburg @ 2019-08-08  5:19 UTC (permalink / raw)
  To: Ido Schimmel
  Cc: netdev, davem, jiri, petrm, mlxsw, Ido Schimmel, jesse.brandeburg
In-Reply-To: <20190731063315.9381-2-idosch@idosch.org>

On Wed, 31 Jul 2019 09:33:14 +0300
Ido Schimmel <idosch@idosch.org> wrote:

> From: Jiri Pirko <jiri@mellanox.com>
> 
> In case of sp2 pci driver registration fail, fix the error path to
> start with sp1 pci driver unregister.
> 
> Fixes: c3ab435466d5 ("mlxsw: spectrum: Extend to support Spectrum-2 ASIC")
> Signed-off-by: Jiri Pirko <jiri@mellanox.com>
> Signed-off-by: Ido Schimmel <idosch@mellanox.com>

Reviewed-by: Jesse Brandeburg <jesse.brandeburg@intel.com>

^ permalink raw reply

* Re: [PATCH v5 bpf-next] BPF: helpers: New helper to obtain namespace data from current task
From: Yonghong Song @ 2019-08-08  5:09 UTC (permalink / raw)
  To: Carlos Antonio Neira Bustos, netdev@vger.kernel.org
  Cc: ebiederm@xmission.com, brouer@redhat.com,
	quentin.monnet@netronome.com
In-Reply-To: <20190808012240.htbgpv2mhktvig5h@dev00>



On 8/7/19 6:22 PM, Carlos Antonio Neira Bustos wrote:
> The code has been modified to avoid syscalls that could sleep.
> Please let me know if any other modification is needed.
> 
>  From be0384c0fa209a78c1567936e8db4e35b9a7c0f8 Mon Sep 17 00:00:00 2001
> From: Carlos <cneirabustos@gmail.com>
> Date: Wed, 7 Aug 2019 20:04:30 -0400
> Subject: [PATCH] [PATCH v5 bpf-next] BPF: New helper to obtain namespace data
>   from current task
> 
> This helper obtains the active namespace from current and returns pid, tgid,
> device and namespace id as seen from that namespace, allowing to instrument
> a process inside a container.
> Device is read from /proc/self/ns/pid, as in the future it's possible that
> different pid_ns files may belong to different devices, according
> to the discussion between Eric Biederman and Yonghong in 2017 linux plumbers
> conference.
> Currently bpf_get_current_pid_tgid(), is used to do pid filtering in bcc's
> scripts but this helper returns the pid as seen by the root namespace which is
> fine when a bcc script is not executed inside a container.
> When the process of interest is inside a container, pid filtering will not work
> if bpf_get_current_pid_tgid() is used. This helper addresses this limitation
> returning the pid as it's seen by the current namespace where the script is
> executing.
> 
> This helper has the same use cases as bpf_get_current_pid_tgid() as it can be
> used to do pid filtering even inside a container.
> 
> For example a bcc script using bpf_get_current_pid_tgid() (tools/funccount.py):
> 
>          u32 pid = bpf_get_current_pid_tgid() >> 32;
>          if (pid != <pid_arg_passed_in>)
>                  return 0;
> Could be modified to use bpf_get_current_pidns_info() as follows:
> 
>          struct bpf_pidns pidns;
>          bpf_get_current_pidns_info(&pidns, sizeof(struct bpf_pidns));
>          u32 pid = pidns.tgid;
>          u32 nsid = pidns.nsid;
>          if ((pid != <pid_arg_passed_in>) && (nsid != <nsid_arg_passed_in>))
>                  return 0;
> 
> To find out the name PID namespace id of a process, you could use this command:
> 
> $ ps -h -o pidns -p <pid_of_interest>
> 
> Or this other command:
> 
> $ ls -Li /proc/<pid_of_interest>/ns/pid
> 
> Signed-off-by: Carlos Neira <cneirabustos@gmail.com>
> ---
>   fs/namei.c                                         |   2 +-
>   include/linux/bpf.h                                |   1 +
>   include/linux/namei.h                              |   4 +
>   include/uapi/linux/bpf.h                           |  29 ++++-
>   kernel/bpf/core.c                                  |   1 +
>   kernel/bpf/helpers.c                               |  78 ++++++++++++
>   kernel/trace/bpf_trace.c                           |   2 +
>   samples/bpf/Makefile                               |   3 +
>   samples/bpf/trace_ns_info_user.c                   |  35 ++++++
>   samples/bpf/trace_ns_info_user_kern.c              |  44 +++++++
>   tools/include/uapi/linux/bpf.h                     |  29 ++++-
>   tools/testing/selftests/bpf/Makefile               |   2 +-
>   tools/testing/selftests/bpf/bpf_helpers.h          |   3 +
>   .../testing/selftests/bpf/progs/test_pidns_kern.c  |  51 ++++++++
>   tools/testing/selftests/bpf/test_pidns.c           | 138 +++++++++++++++++++++
>   15 files changed, 418 insertions(+), 4 deletions(-)
>   create mode 100644 samples/bpf/trace_ns_info_user.c
>   create mode 100644 samples/bpf/trace_ns_info_user_kern.c
>   create mode 100644 tools/testing/selftests/bpf/progs/test_pidns_kern.c
>   create mode 100644 tools/testing/selftests/bpf/test_pidns.c
> 
> diff --git a/fs/namei.c b/fs/namei.c
> index 209c51a5226c..d1eca36972d2 100644
> --- a/fs/namei.c
> +++ b/fs/namei.c
> @@ -19,7 +19,6 @@
>   #include <linux/export.h>
>   #include <linux/kernel.h>
>   #include <linux/slab.h>
> -#include <linux/fs.h>
>   #include <linux/namei.h>
>   #include <linux/pagemap.h>
>   #include <linux/fsnotify.h>
> @@ -2355,6 +2354,7 @@ int filename_lookup(int dfd, struct filename *name, unsigned flags,
>   	putname(name);
>   	return retval;
>   }
> +EXPORT_SYMBOL(filename_lookup);

No need to export symbols. bpf uses it and bpf is in the core, not in 
modules.

>   
>   /* Returns 0 and nd will be valid on success; Retuns error, otherwise. */
>   static int path_parentat(struct nameidata *nd, unsigned flags,
> diff --git a/include/linux/bpf.h b/include/linux/bpf.h
> index f9a506147c8a..e4adf5e05afd 100644
> --- a/include/linux/bpf.h
> +++ b/include/linux/bpf.h
> @@ -1050,6 +1050,7 @@ extern const struct bpf_func_proto bpf_get_local_storage_proto;
>   extern const struct bpf_func_proto bpf_strtol_proto;
>   extern const struct bpf_func_proto bpf_strtoul_proto;
>   extern const struct bpf_func_proto bpf_tcp_sock_proto;
> +extern const struct bpf_func_proto bpf_get_current_pidns_info_proto;
>   
>   /* Shared helpers among cBPF and eBPF. */
>   void bpf_user_rnd_init_once(void);
> diff --git a/include/linux/namei.h b/include/linux/namei.h
> index 9138b4471dbf..2c24e8c71d46 100644
> --- a/include/linux/namei.h
> +++ b/include/linux/namei.h
> @@ -6,6 +6,7 @@
>   #include <linux/path.h>
>   #include <linux/fcntl.h>
>   #include <linux/errno.h>
> +#include <linux/fs.h>
>   
>   enum { MAX_NESTED_LINKS = 8 };
>   
> @@ -97,6 +98,9 @@ extern void unlock_rename(struct dentry *, struct dentry *);
>   
>   extern void nd_jump_link(struct path *path);
>   
> +extern int filename_lookup(int dfd, struct filename *name, unsigned int flags,
> +		    struct path *path, struct path *root);

The previous definition in fs/internal.h should be removed.

> +
>   static inline void nd_terminate_link(void *name, size_t len, size_t maxlen)
>   {
>   	((char *) name)[min(len, maxlen)] = '\0';
> diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
> index 4393bd4b2419..6f601f7106e2 100644
> --- a/include/uapi/linux/bpf.h
> +++ b/include/uapi/linux/bpf.h
> @@ -2741,6 +2741,26 @@ union bpf_attr {
>    *		**-EOPNOTSUPP** kernel configuration does not enable SYN cookies
>    *
>    *		**-EPROTONOSUPPORT** IP packet version is not 4 or 6
> + *
> + * int bpf_get_current_pidns_info(struct bpf_pidns_info *pidns, u32 size_of_pidns)
> + *	Description
> + *		Copies into *pidns* pid, namespace id and tgid as seen by the
> + *		current namespace and also device from /proc/self/ns/pid.
> + *		*size_of_pidns* must be the size of *pidns*
> + *
> + *		This helper is used when pid filtering is needed inside a
> + *		container as bpf_get_current_tgid() helper returns always the
> + *		pid id as seen by the root namespace.
> + *	Return
> + *		0 on success
> + *
> + *		**-EINVAL**  if unable to get ns, pid or tgid of current task.
> + *		Or if size_of_pidns is not valid.

Maybe reword by following the code sequence.
    if *size_of_pidns* is not valid or unable to get ns, pid or tgid of
    the current task.

> + *
> + *		**-ENOMEM**  if allocation fails.

Maybe some other error codes in filename_lookup() function?

> + *
> + *		If unable to get the inode from /proc/self/ns/pid an error code
> + *		will be returned.

You do not need this. The description of error code cases should cover this.

>    */
>   #define __BPF_FUNC_MAPPER(FN)		\
>   	FN(unspec),			\
> @@ -2853,7 +2873,8 @@ union bpf_attr {
>   	FN(sk_storage_get),		\
>   	FN(sk_storage_delete),		\
>   	FN(send_signal),		\
> -	FN(tcp_gen_syncookie),
> +	FN(tcp_gen_syncookie),		\
> +	FN(get_current_pidns_info),
>   
>   /* integer value in 'imm' field of BPF_CALL instruction selects which helper
>    * function eBPF program intends to call
> @@ -3604,4 +3625,10 @@ struct bpf_sockopt {
>   	__s32	retval;
>   };
>   
> +struct bpf_pidns_info {
> +	__u32 dev;
> +	__u32 nsid;
> +	__u32 tgid;
> +	__u32 pid;
> +};
>   #endif /* _UAPI__LINUX_BPF_H__ */
> diff --git a/kernel/bpf/core.c b/kernel/bpf/core.c
> index 8191a7db2777..3159f2a0188c 100644
> --- a/kernel/bpf/core.c
> +++ b/kernel/bpf/core.c
> @@ -2038,6 +2038,7 @@ const struct bpf_func_proto bpf_get_current_uid_gid_proto __weak;
>   const struct bpf_func_proto bpf_get_current_comm_proto __weak;
>   const struct bpf_func_proto bpf_get_current_cgroup_id_proto __weak;
>   const struct bpf_func_proto bpf_get_local_storage_proto __weak;
> +const struct bpf_func_proto bpf_get_current_pidns_info __weak;
>   
>   const struct bpf_func_proto * __weak bpf_get_trace_printk_proto(void)
>   {
> diff --git a/kernel/bpf/helpers.c b/kernel/bpf/helpers.c
> index 5e28718928ca..571f24077db2 100644
> --- a/kernel/bpf/helpers.c
> +++ b/kernel/bpf/helpers.c
> @@ -11,6 +11,12 @@
>   #include <linux/uidgid.h>
>   #include <linux/filter.h>
>   #include <linux/ctype.h>
> +#include <linux/pid_namespace.h>
> +#include <linux/major.h>
> +#include <linux/stat.h>
> +#include <linux/namei.h>
> +#include <linux/version.h>
> +
>   
>   #include "../../lib/kstrtox.h"
>   
> @@ -312,6 +318,78 @@ void copy_map_value_locked(struct bpf_map *map, void *dst, void *src,
>   	preempt_enable();
>   }
>   
> +BPF_CALL_2(bpf_get_current_pidns_info, struct bpf_pidns_info *, pidns_info, u32,
> +	 size)
> +{
> +	const char *name = "/proc/self/ns/pid";

maybe rename this variable to pidns_path?

> +	struct pid_namespace *pidns = NULL;
> +	struct filename *tmp = NULL;

Maybe rename this variable to name?

> +	int len = strlen(name) + 1;

We can delay this assignment later until it is needed.

> +	struct inode *inode;
> +	struct path kp;
> +	pid_t tgid = 0;
> +	pid_t pid = 0;
> +	int ret;
> +
> +	if (unlikely(size != sizeof(struct bpf_pidns_info)))
> +		return -EINVAL;
> +
> +	pidns = task_active_pid_ns(current);
> +

we can save an empty line here.

> +	if (unlikely(!pidns))
> +		goto clear;
> +
> +	pidns_info->nsid =  pidns->ns.inum;
> +	pid = task_pid_nr_ns(current, pidns);
> +

We can save an empty line here.

> +	if (unlikely(!pid))
> +		goto clear;
> +
> +	tgid = task_tgid_nr_ns(current, pidns);
> +
ditto. save an empty line.
> +	if (unlikely(!tgid))
> +		goto clear;
> +
> +	pidns_info->tgid = (u32) tgid;
> +	pidns_info->pid = (u32) pid;
> +
> +	tmp = kmem_cache_alloc(names_cachep, GFP_ATOMIC);
> +	if (unlikely(!tmp)) {
> +		memset((void *)pidns_info, 0, (size_t) size);
> +		return -ENOMEM;
> +	}
> +
> +	memcpy((char *)tmp->name, name, len);
> +	tmp->uptr = NULL;
> +	tmp->aname = NULL;
> +	tmp->refcnt = 1;
> +
ditto. save an empty line.
> +	ret = filename_lookup(AT_FDCWD, tmp, 0, &kp, NULL);
> +
ditto. save an empty line.
> +	if (ret) {
> +		memset((void *)pidns_info, 0, (size_t) size);
> +		return ret;
> +	}
> +
> +	inode = d_backing_inode(kp.dentry);
> +	pidns_info->dev = inode->i_sb->s_dev;
> +
> +	return 0;
> +
> +clear:
> +	memset((void *)pidns_info, 0, (size_t) size);
> +
save an empty line.
> +	return -EINVAL;
> +}
> +
> +const struct bpf_func_proto bpf_get_current_pidns_info_proto = {
> +	.func	= bpf_get_current_pidns_info,
make the "= " aligned with others?
> +	.gpl_only	= false,
> +	.ret_type	= RET_INTEGER,
> +	.arg1_type	= ARG_PTR_TO_UNINIT_MEM,
> +	.arg2_type	= ARG_CONST_SIZE,
> +};
> +
>   #ifdef CONFIG_CGROUPS
>   BPF_CALL_0(bpf_get_current_cgroup_id)
>   {
> diff --git a/kernel/trace/bpf_trace.c b/kernel/trace/bpf_trace.c
> index ca1255d14576..5e1dc22765a5 100644
> --- a/kernel/trace/bpf_trace.c
> +++ b/kernel/trace/bpf_trace.c
> @@ -709,6 +709,8 @@ tracing_func_proto(enum bpf_func_id func_id, const struct bpf_prog *prog)
>   #endif
>   	case BPF_FUNC_send_signal:
>   		return &bpf_send_signal_proto;
> +	case BPF_FUNC_get_current_pidns_info:
> +		return &bpf_get_current_pidns_info_proto;
>   	default:
>   		return NULL;
>   	}
> diff --git a/samples/bpf/Makefile b/samples/bpf/Makefile
> index 1d9be26b4edd..238453ff27d2 100644
> --- a/samples/bpf/Makefile
> +++ b/samples/bpf/Makefile
> @@ -53,6 +53,7 @@ hostprogs-y += task_fd_query
>   hostprogs-y += xdp_sample_pkts
>   hostprogs-y += ibumad
>   hostprogs-y += hbm
> +hostprogs-y += trace_ns_info
[...]

^ permalink raw reply

* [PATCH] liquidio: Use pcie_flr() instead of reimplementing it
From: Denis Efremov @ 2019-08-08  4:57 UTC (permalink / raw)
  To: Bjorn Helgaas
  Cc: Denis Efremov, Derek Chickles, Satanand Burla, Felix Manlunas,
	netdev, linux-pci, linux-kernel

octeon_mbox_process_cmd() directly writes the PCI_EXP_DEVCTL_BCR_FLR
bit, which bypasses timing requirements imposed by the PCIe spec.
This patch fixes the function to use the pcie_flr() interface instead.

Signed-off-by: Denis Efremov <efremov@linux.com>
---
 drivers/net/ethernet/cavium/liquidio/octeon_mailbox.c | 4 +---
 1 file changed, 1 insertion(+), 3 deletions(-)

diff --git a/drivers/net/ethernet/cavium/liquidio/octeon_mailbox.c b/drivers/net/ethernet/cavium/liquidio/octeon_mailbox.c
index 021d99cd1665..614d07be7181 100644
--- a/drivers/net/ethernet/cavium/liquidio/octeon_mailbox.c
+++ b/drivers/net/ethernet/cavium/liquidio/octeon_mailbox.c
@@ -260,9 +260,7 @@ static int octeon_mbox_process_cmd(struct octeon_mbox *mbox,
 		dev_info(&oct->pci_dev->dev,
 			 "got a request for FLR from VF that owns DPI ring %u\n",
 			 mbox->q_no);
-		pcie_capability_set_word(
-			oct->sriov_info.dpiring_to_vfpcidev_lut[mbox->q_no],
-			PCI_EXP_DEVCTL, PCI_EXP_DEVCTL_BCR_FLR);
+		pcie_flr(oct->sriov_info.dpiring_to_vfpcidev_lut[mbox->q_no]);
 		break;
 
 	case OCTEON_PF_CHANGED_VF_MACADDR:
-- 
2.21.0


^ permalink raw reply related

page: next (older) | prev (newer) | latest
- recent:[subjects (threaded)|topics (new)|topics (active)]

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox