Netdev List

Netdev List
 help / color / mirror / Atom feed

* Re: [PATCH net v2] net/tls: prevent skb_orphan() from leaking TLS plain text with offload
From: Jakub Kicinski @ 2019-08-07 22:49 UTC (permalink / raw)
  To: Willem de Bruijn
  Cc: David Miller, Network Development, davejwatson, borisp, aviadye,
	John Fastabend, Daniel Borkmann, Eric Dumazet, Alexei Starovoitov,
	oss-drivers
In-Reply-To: <CA+FuTSeR1QqAZVTLQ-mJ8iHi+h+ghbrGyT6TWATTecQSbQP6sA@mail.gmail.com>

On Wed, 7 Aug 2019 14:46:23 -0400, Willem de Bruijn wrote:
> > > > @@ -984,6 +984,9 @@ ssize_t do_tcp_sendpages(struct sock *sk, struct page *page, int offset,
> > > >                         if (!skb)
> > > >                                 goto wait_for_memory;
> > > >
> > > > +#ifdef CONFIG_TLS_DEVICE
> > > > +                       skb->decrypted = !!(flags & MSG_SENDPAGE_DECRYPTED);
> > > > +#endif  
> > >
> > > Nothing is stopping userspace from passing this new flag. In send
> > > (tcp_sendmsg_locked) it is ignored. But can it reach do_tcp_sendpages
> > > through tcp_bpf_sendmsg?  
> >
> > Ah, I think you're right, thanks for checking that :( I don't entirely
> > follow how 0608c69c9a80 ("bpf: sk_msg, sock{map|hash} redirect through
> > ULP") is safe then.
> >
> > One option would be to clear the flags kernel would previously ignore
> > in tcp_bpf_sendmsg(). But I feel like we should just go back to marking
> > the socket, since we don't need the per-message flexibility of a flag.
> >
> > WDYT?  
> 
> I don't feel strongly either way. Passing flags from send through
> tcp_bpf_sendmsg is probably unintentional, so should probably be
> addressed anyway? Then this is a bit simpler.

FWIW I had a closer look at the tcp_bpf_sendmsg() flags, and
MSG_SENDPAGE_NOPOLICY should be okay to let though there.

That flag is only meaningful to tls in case sockmap is layered 
on top of tls and we'd always set it before calling tls.

v3 coming soon..

^ permalink raw reply

* Re: [PATCH bpf-next 1/3] bpf: support cloning sk storage on accept()
From: Yonghong Song @ 2019-08-07 23:03 UTC (permalink / raw)
  To: Stanislav Fomichev, netdev@vger.kernel.org, bpf@vger.kernel.org
  Cc: davem@davemloft.net, ast@kernel.org, daniel@iogearbox.net,
	Martin Lau
In-Reply-To: <20190807154720.260577-2-sdf@google.com>



On 8/7/19 8:47 AM, Stanislav Fomichev wrote:
> Add new helper bpf_sk_storage_clone which optionally clones sk storage
> and call it from bpf_sk_storage_clone. Reuse the gap in
> bpf_sk_storage_elem to store clone/non-clone flag.
> 
> Cc: Martin KaFai Lau <kafai@fb.com>
> Signed-off-by: Stanislav Fomichev <sdf@google.com>

I tried to see whether I can find any missing race conditions in
the code but I failed. So except a minor comments below,
Acked-by: Yonghong Song <yhs@fb.com>

> ---
>   include/net/bpf_sk_storage.h |  10 ++++
>   include/uapi/linux/bpf.h     |   1 +
>   net/core/bpf_sk_storage.c    | 102 +++++++++++++++++++++++++++++++++--
>   net/core/sock.c              |   9 ++--
>   4 files changed, 115 insertions(+), 7 deletions(-)
> 
> diff --git a/include/net/bpf_sk_storage.h b/include/net/bpf_sk_storage.h
> index b9dcb02e756b..8e4f831d2e52 100644
> --- a/include/net/bpf_sk_storage.h
> +++ b/include/net/bpf_sk_storage.h
> @@ -10,4 +10,14 @@ void bpf_sk_storage_free(struct sock *sk);
>   extern const struct bpf_func_proto bpf_sk_storage_get_proto;
>   extern const struct bpf_func_proto bpf_sk_storage_delete_proto;
>   
> +#ifdef CONFIG_BPF_SYSCALL
> +int bpf_sk_storage_clone(const struct sock *sk, struct sock *newsk);
> +#else
> +static inline int bpf_sk_storage_clone(const struct sock *sk,
> +				       struct sock *newsk)
> +{
> +	return 0;
> +}
> +#endif
> +
>   #endif /* _BPF_SK_STORAGE_H */
> diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
> index 4393bd4b2419..00459ca4c8cf 100644
> --- a/include/uapi/linux/bpf.h
> +++ b/include/uapi/linux/bpf.h
> @@ -2931,6 +2931,7 @@ enum bpf_func_id {
>   
>   /* BPF_FUNC_sk_storage_get flags */
>   #define BPF_SK_STORAGE_GET_F_CREATE	(1ULL << 0)
> +#define BPF_SK_STORAGE_GET_F_CLONE	(1ULL << 1)
>   
>   /* Mode for BPF_FUNC_skb_adjust_room helper. */
>   enum bpf_adj_room_mode {
> diff --git a/net/core/bpf_sk_storage.c b/net/core/bpf_sk_storage.c
> index 94c7f77ecb6b..b6dea67965bc 100644
> --- a/net/core/bpf_sk_storage.c
> +++ b/net/core/bpf_sk_storage.c
> @@ -12,6 +12,9 @@
>   
>   static atomic_t cache_idx;
>   
> +#define BPF_SK_STORAGE_GET_F_MASK	(BPF_SK_STORAGE_GET_F_CREATE | \
> +					 BPF_SK_STORAGE_GET_F_CLONE)
> +
>   struct bucket {
>   	struct hlist_head list;
>   	raw_spinlock_t lock;
> @@ -66,7 +69,8 @@ struct bpf_sk_storage_elem {
>   	struct hlist_node snode;	/* Linked to bpf_sk_storage */
>   	struct bpf_sk_storage __rcu *sk_storage;
>   	struct rcu_head rcu;
> -	/* 8 bytes hole */
> +	u8 clone:1;
> +	/* 7 bytes hole */
>   	/* The data is stored in aother cacheline to minimize
>   	 * the number of cachelines access during a cache hit.
>   	 */
> @@ -509,7 +513,7 @@ static int sk_storage_delete(struct sock *sk, struct bpf_map *map)
>   	return 0;
>   }
>   
> -/* Called by __sk_destruct() */
> +/* Called by __sk_destruct() & bpf_sk_storage_clone() */
>   void bpf_sk_storage_free(struct sock *sk)
>   {
>   	struct bpf_sk_storage_elem *selem;
> @@ -739,19 +743,106 @@ static int bpf_fd_sk_storage_delete_elem(struct bpf_map *map, void *key)
>   	return err;
>   }
>   
> +static struct bpf_sk_storage_elem *
> +bpf_sk_storage_clone_elem(struct sock *newsk,
> +			  struct bpf_sk_storage_map *smap,
> +			  struct bpf_sk_storage_elem *selem)
> +{
> +	struct bpf_sk_storage_elem *copy_selem;
> +
> +	copy_selem = selem_alloc(smap, newsk, NULL, true);
> +	if (!copy_selem)
> +		return ERR_PTR(-ENOMEM);
> +
> +	if (map_value_has_spin_lock(&smap->map))
> +		copy_map_value_locked(&smap->map, SDATA(copy_selem)->data,
> +				      SDATA(selem)->data, true);
> +	else
> +		copy_map_value(&smap->map, SDATA(copy_selem)->data,
> +			       SDATA(selem)->data);
> +
> +	return copy_selem;
> +}
> +
> +int bpf_sk_storage_clone(const struct sock *sk, struct sock *newsk)
> +{
> +	struct bpf_sk_storage *new_sk_storage = NULL;
> +	struct bpf_sk_storage *sk_storage;
> +	struct bpf_sk_storage_elem *selem;
> +	int ret;
> +
> +	RCU_INIT_POINTER(newsk->sk_bpf_storage, NULL);
> +
> +	rcu_read_lock();
> +	sk_storage = rcu_dereference(sk->sk_bpf_storage);
> +
> +	if (!sk_storage || hlist_empty(&sk_storage->list))
> +		goto out;
> +
> +	hlist_for_each_entry_rcu(selem, &sk_storage->list, snode) {
> +		struct bpf_sk_storage_map *smap;
> +		struct bpf_sk_storage_elem *copy_selem;
> +
> +		if (!selem->clone)
> +			continue;
> +
> +		smap = rcu_dereference(SDATA(selem)->smap);
> +		if (!smap)
> +			continue;
> +
> +		copy_selem = bpf_sk_storage_clone_elem(newsk, smap, selem);
> +		if (IS_ERR(copy_selem)) {
> +			ret = PTR_ERR(copy_selem);
> +			goto err;
> +		}
> +
> +		if (!new_sk_storage) {
> +			ret = sk_storage_alloc(newsk, smap, copy_selem);
> +			if (ret) {
> +				kfree(copy_selem);
> +				atomic_sub(smap->elem_size,
> +					   &newsk->sk_omem_alloc);
> +				goto err;
> +			}
> +
> +			new_sk_storage = rcu_dereference(copy_selem->sk_storage);
> +			continue;
> +		}
> +
> +		raw_spin_lock_bh(&new_sk_storage->lock);
> +		selem_link_map(smap, copy_selem);
> +		__selem_link_sk(new_sk_storage, copy_selem);
> +		raw_spin_unlock_bh(&new_sk_storage->lock);

Considering in this particular case, new socket is not visible to 
outside world yet (both kernel and user space), map_delete/map_update
operations are not applicable in this situation, so
the above raw_spin_lock_bh() probably not needed.


> +	}
> +
> +out:
> +	rcu_read_unlock();
> +	return 0;
> +
> +err:
> +	rcu_read_unlock();
> +
> +	bpf_sk_storage_free(newsk);
> +	return ret;
> +}
> +
>   BPF_CALL_4(bpf_sk_storage_get, struct bpf_map *, map, struct sock *, sk,
>   	   void *, value, u64, flags)
>   {
>   	struct bpf_sk_storage_data *sdata;
>   
> -	if (flags > BPF_SK_STORAGE_GET_F_CREATE)
> +	if (flags & ~BPF_SK_STORAGE_GET_F_MASK)
> +		return (unsigned long)NULL;
> +
> +	if ((flags & BPF_SK_STORAGE_GET_F_CLONE) &&
> +	    !(flags & BPF_SK_STORAGE_GET_F_CREATE))
>   		return (unsigned long)NULL;
>   
>   	sdata = sk_storage_lookup(sk, map, true);
>   	if (sdata)
>   		return (unsigned long)sdata->data;
>   
> -	if (flags == BPF_SK_STORAGE_GET_F_CREATE &&
> +	if ((flags & BPF_SK_STORAGE_GET_F_CREATE) &&
>   	    /* Cannot add new elem to a going away sk.
>   	     * Otherwise, the new elem may become a leak
>   	     * (and also other memory issues during map
> @@ -762,6 +853,9 @@ BPF_CALL_4(bpf_sk_storage_get, struct bpf_map *, map, struct sock *, sk,
>   		/* sk must be a fullsock (guaranteed by verifier),
>   		 * so sock_gen_put() is unnecessary.
>   		 */
> +		if (!IS_ERR(sdata))
> +			SELEM(sdata)->clone =
> +				!!(flags & BPF_SK_STORAGE_GET_F_CLONE);
>   		sock_put(sk);
>   		return IS_ERR(sdata) ?
>   			(unsigned long)NULL : (unsigned long)sdata->data;
> diff --git a/net/core/sock.c b/net/core/sock.c
> index d57b0cc995a0..f5e801a9cea4 100644
> --- a/net/core/sock.c
> +++ b/net/core/sock.c
> @@ -1851,9 +1851,12 @@ struct sock *sk_clone_lock(const struct sock *sk, const gfp_t priority)
>   			goto out;
>   		}
>   		RCU_INIT_POINTER(newsk->sk_reuseport_cb, NULL);
> -#ifdef CONFIG_BPF_SYSCALL
> -		RCU_INIT_POINTER(newsk->sk_bpf_storage, NULL);
> -#endif
> +
> +		if (bpf_sk_storage_clone(sk, newsk)) {
> +			sk_free_unlock_clone(newsk);
> +			newsk = NULL;
> +			goto out;
> +		}
>   
>   		newsk->sk_err	   = 0;
>   		newsk->sk_err_soft = 0;
> 

^ permalink raw reply

* Re: [PATCH bpf-next 2/3] bpf: sync bpf.h to tools/
From: Yonghong Song @ 2019-08-07 23:04 UTC (permalink / raw)
  To: Stanislav Fomichev, netdev@vger.kernel.org, bpf@vger.kernel.org
  Cc: davem@davemloft.net, ast@kernel.org, daniel@iogearbox.net,
	Martin Lau
In-Reply-To: <20190807154720.260577-3-sdf@google.com>



On 8/7/19 8:47 AM, Stanislav Fomichev wrote:
> Sync new sk storage clone flag.
> 
> Cc: Martin KaFai Lau <kafai@fb.com>
> Signed-off-by: Stanislav Fomichev <sdf@google.com>

Acked-by: Yonghong Song <yhs@fb.com>

> ---
>   tools/include/uapi/linux/bpf.h | 1 +
>   1 file changed, 1 insertion(+)
> 
> diff --git a/tools/include/uapi/linux/bpf.h b/tools/include/uapi/linux/bpf.h
> index 4393bd4b2419..00459ca4c8cf 100644
> --- a/tools/include/uapi/linux/bpf.h
> +++ b/tools/include/uapi/linux/bpf.h
> @@ -2931,6 +2931,7 @@ enum bpf_func_id {
>   
>   /* BPF_FUNC_sk_storage_get flags */
>   #define BPF_SK_STORAGE_GET_F_CREATE	(1ULL << 0)
> +#define BPF_SK_STORAGE_GET_F_CLONE	(1ULL << 1)
>   
>   /* Mode for BPF_FUNC_skb_adjust_room helper. */
>   enum bpf_adj_room_mode {
> 

^ permalink raw reply

* Re: [PATCH bpf-next 3/3] selftests/bpf: add sockopt clone/inheritance test
From: Yonghong Song @ 2019-08-07 23:14 UTC (permalink / raw)
  To: Stanislav Fomichev, netdev@vger.kernel.org, bpf@vger.kernel.org
  Cc: davem@davemloft.net, ast@kernel.org, daniel@iogearbox.net,
	Martin Lau
In-Reply-To: <20190807154720.260577-4-sdf@google.com>



On 8/7/19 8:47 AM, Stanislav Fomichev wrote:
> Add a test that calls setsockopt on the listener socket which triggers
> BPF program. This BPF program writes to the sk storage and sets
> clone flag. Make sure that sk storage is cloned for a newly
> accepted connection.
> 
> We have two cloned maps in the tests to make sure we hit both cases
> in bpf_sk_storage_clone: first element (sk_storage_alloc) and
> non-first element(s) (selem_link_map).
> 
> Cc: Martin KaFai Lau <kafai@fb.com>
> Signed-off-by: Stanislav Fomichev <sdf@google.com>
> ---
>   tools/testing/selftests/bpf/.gitignore        |   1 +
>   tools/testing/selftests/bpf/Makefile          |   3 +-
>   .../selftests/bpf/progs/sockopt_inherit.c     | 102 +++++++
>   .../selftests/bpf/test_sockopt_inherit.c      | 252 ++++++++++++++++++
>   4 files changed, 357 insertions(+), 1 deletion(-)
>   create mode 100644 tools/testing/selftests/bpf/progs/sockopt_inherit.c
>   create mode 100644 tools/testing/selftests/bpf/test_sockopt_inherit.c
> 
> diff --git a/tools/testing/selftests/bpf/.gitignore b/tools/testing/selftests/bpf/.gitignore
> index 90f70d2c7c22..60c9338cd9b4 100644
> --- a/tools/testing/selftests/bpf/.gitignore
> +++ b/tools/testing/selftests/bpf/.gitignore
> @@ -42,4 +42,5 @@ xdping
>   test_sockopt
>   test_sockopt_sk
>   test_sockopt_multi
> +test_sockopt_inherit
>   test_tcp_rtt
> diff --git a/tools/testing/selftests/bpf/Makefile b/tools/testing/selftests/bpf/Makefile
> index 3bd0f4a0336a..c875763a851a 100644
> --- a/tools/testing/selftests/bpf/Makefile
> +++ b/tools/testing/selftests/bpf/Makefile
> @@ -29,7 +29,7 @@ TEST_GEN_PROGS = test_verifier test_tag test_maps test_lru_map test_lpm_map test
>   	test_cgroup_storage test_select_reuseport test_section_names \
>   	test_netcnt test_tcpnotify_user test_sock_fields test_sysctl test_hashmap \
>   	test_btf_dump test_cgroup_attach xdping test_sockopt test_sockopt_sk \
> -	test_sockopt_multi test_tcp_rtt
> +	test_sockopt_multi test_sockopt_inherit test_tcp_rtt
>   
>   BPF_OBJ_FILES = $(patsubst %.c,%.o, $(notdir $(wildcard progs/*.c)))
>   TEST_GEN_FILES = $(BPF_OBJ_FILES)
> @@ -110,6 +110,7 @@ $(OUTPUT)/test_cgroup_attach: cgroup_helpers.c
>   $(OUTPUT)/test_sockopt: cgroup_helpers.c
>   $(OUTPUT)/test_sockopt_sk: cgroup_helpers.c
>   $(OUTPUT)/test_sockopt_multi: cgroup_helpers.c
> +$(OUTPUT)/test_sockopt_inherit: cgroup_helpers.c
>   $(OUTPUT)/test_tcp_rtt: cgroup_helpers.c
>   
>   .PHONY: force
> diff --git a/tools/testing/selftests/bpf/progs/sockopt_inherit.c b/tools/testing/selftests/bpf/progs/sockopt_inherit.c
> new file mode 100644
> index 000000000000..357fc9db5874
> --- /dev/null
> +++ b/tools/testing/selftests/bpf/progs/sockopt_inherit.c
> @@ -0,0 +1,102 @@
> +// SPDX-License-Identifier: GPL-2.0
> +#include <linux/bpf.h>
> +#include "bpf_helpers.h"
> +
> +char _license[] SEC("license") = "GPL";
> +__u32 _version SEC("version") = 1;
> +
> +#define SOL_CUSTOM			0xdeadbeef
> +#define CUSTOM_INHERIT1			0
> +#define CUSTOM_INHERIT2			1
> +#define CUSTOM_LISTENER			2
> +
> +struct sockopt_inherit {
> +	__u8 val;
> +};
> +
> +struct bpf_map_def SEC("maps") cloned1_map = {
> +	.type = BPF_MAP_TYPE_SK_STORAGE,
> +	.key_size = sizeof(int),
> +	.value_size = sizeof(struct sockopt_inherit),
> +	.map_flags = BPF_F_NO_PREALLOC,
> +};
> +BPF_ANNOTATE_KV_PAIR(cloned1_map, int, struct sockopt_inherit);
> +
> +struct bpf_map_def SEC("maps") cloned2_map = {
> +	.type = BPF_MAP_TYPE_SK_STORAGE,
> +	.key_size = sizeof(int),
> +	.value_size = sizeof(struct sockopt_inherit),
> +	.map_flags = BPF_F_NO_PREALLOC,
> +};
> +BPF_ANNOTATE_KV_PAIR(cloned2_map, int, struct sockopt_inherit);
> +
> +struct bpf_map_def SEC("maps") listener_map = {
> +	.type = BPF_MAP_TYPE_SK_STORAGE,
> +	.key_size = sizeof(int),
> +	.value_size = sizeof(struct sockopt_inherit),
> +	.map_flags = BPF_F_NO_PREALLOC,
> +};
> +BPF_ANNOTATE_KV_PAIR(listener_map, int, struct sockopt_inherit);

Your still use the old way for map definitions. Is this possible for you
to use new map definitions (in section ".maps")?

> +
> +static __inline struct sockopt_inherit *get_storage(struct bpf_sockopt *ctx)
> +{
> +	if (ctx->optname == CUSTOM_INHERIT1)
> +		return bpf_sk_storage_get(&cloned1_map, ctx->sk, 0,
> +					  BPF_SK_STORAGE_GET_F_CREATE |
> +					  BPF_SK_STORAGE_GET_F_CLONE);
> +	else if (ctx->optname == CUSTOM_INHERIT2)
> +		return bpf_sk_storage_get(&cloned2_map, ctx->sk, 0,
> +					  BPF_SK_STORAGE_GET_F_CREATE |
> +					  BPF_SK_STORAGE_GET_F_CLONE);
> +	else
> +		return bpf_sk_storage_get(&listener_map, ctx->sk, 0,
> +					  BPF_SK_STORAGE_GET_F_CREATE);
> +}
> +
[.....]> diff --git a/tools/testing/selftests/bpf/test_sockopt_inherit.c 
b/tools/testing/selftests/bpf/test_sockopt_inherit.c
> new file mode 100644
> index 000000000000..e47b9c28d743
> --- /dev/null
> +++ b/tools/testing/selftests/bpf/test_sockopt_inherit.c
> @@ -0,0 +1,252 @@
> +// SPDX-License-Identifier: GPL-2.0
> +#include <error.h>
> +#include <errno.h>
> +#include <stdio.h>
> +#include <unistd.h>
> +#include <sys/types.h>
> +#include <sys/socket.h>
> +#include <netinet/in.h>
> +#include <pthread.h>
> +
> +#include <linux/filter.h>
> +#include <bpf/bpf.h>
> +#include <bpf/libbpf.h>
> +
> +#include "bpf_rlimit.h"
> +#include "bpf_util.h"
> +#include "cgroup_helpers.h"
> +
> +#define CG_PATH				"/sockopt_inherit"
> +#define SOL_CUSTOM			0xdeadbeef
> +#define CUSTOM_INHERIT1			0
> +#define CUSTOM_INHERIT2			1
> +#define CUSTOM_LISTENER			2
> +
> +static int connect_to_server(int server_fd)
> +{
> +	struct sockaddr_storage addr;
> +	socklen_t len = sizeof(addr);
> +	int fd;
> +
> +	fd = socket(AF_INET, SOCK_STREAM, 0);
> +	if (fd < 0) {
> +		log_err("Failed to create client socket");
> +		return -1;
> +	}
> +
> +	if (getsockname(server_fd, (struct sockaddr *)&addr, &len)) {
> +		log_err("Failed to get server addr");
> +		goto out;
> +	}
> +
> +	if (connect(fd, (const struct sockaddr *)&addr, len) < 0) {
> +		log_err("Fail to connect to server");
> +		goto out;
> +	}
> +
> +	return fd;
> +
> +out:
> +	close(fd);
> +	return -1;
> +}
> +
> +static int verify_sockopt(int fd, int optname, const char *msg, char expected)
> +{
> +	socklen_t optlen = 1;
> +	char buf = 0;
> +	int err;
> +
> +	err = getsockopt(fd, SOL_CUSTOM, optname, &buf, &optlen);
> +	if (err) {
> +		log_err("%s: failed to call getsockopt", msg);
> +		return 1;
> +	}
> +
> +	log_err("%s %d: got=0x%x ? expected=0x%x", msg, optname, buf, expected);

There may not be error here.

> +
> +	if (buf != expected) {
> +		log_err("%s: unexpected getsockopt value %d != %d", msg,
> +			buf, expected);
> +		return 1;
> +	}
> +
> +	return 0;
> +}
> +
> +static void *server_thread(void *arg)
> +{
> +	struct sockaddr_storage addr;
> +	socklen_t len = sizeof(addr);
> +	int fd = *(int *)arg;
> +	int client_fd;
> +	int err = 0;
> +
> +	if (listen(fd, 1) < 0)
> +		error(1, errno, "Failed to listed on socket");
> +
> +	err += verify_sockopt(fd, CUSTOM_INHERIT1, "listen", 1);
> +	err += verify_sockopt(fd, CUSTOM_INHERIT2, "listen", 1);
> +	err += verify_sockopt(fd, CUSTOM_LISTENER, "listen", 1);
> +
> +	client_fd = accept(fd, (struct sockaddr *)&addr, &len);
> +	if (client_fd < 0)
> +		error(1, errno, "Failed to accept client");
> +
> +	err += verify_sockopt(client_fd, CUSTOM_INHERIT1, "accept", 1);
> +	err += verify_sockopt(client_fd, CUSTOM_INHERIT2, "accept", 1);
> +	err += verify_sockopt(client_fd, CUSTOM_LISTENER, "accept", 0);
> +
> +	close(client_fd);
> +
> +	return (void *)(long)err;
> +}
> +
> +static int start_server(void)
> +{
> +	struct sockaddr_in addr = {
> +		.sin_family = AF_INET,
> +		.sin_addr.s_addr = htonl(INADDR_LOOPBACK),
> +	};
> +	char buf;
> +	int err;
> +	int fd;
> +	int i;
> +
> +	fd = socket(AF_INET, SOCK_STREAM, 0);
> +	if (fd < 0) {
> +		log_err("Failed to create server socket");
> +		return -1;
> +	}
> +
> +	for (i = CUSTOM_INHERIT1; i <= CUSTOM_LISTENER; i++) {
> +		buf = 0x01;
> +		err = setsockopt(fd, SOL_CUSTOM, i, &buf, 1);
> +		if (err) {
> +			log_err("Failed to call setsockopt(%d)", i);
> +			close(fd);
> +			return -1;
> +		}
> +	}
> +
> +	if (bind(fd, (const struct sockaddr *)&addr, sizeof(addr)) < 0) {
> +		log_err("Failed to bind socket");
> +		close(fd);
> +		return -1;
> +	}
> +
> +	return fd;
> +}
> +
> +static int prog_attach(struct bpf_object *obj, int cgroup_fd, const char *title)
> +{
> +	enum bpf_attach_type attach_type;
> +	enum bpf_prog_type prog_type;
> +	struct bpf_program *prog;
> +	int err;
> +
> +	err = libbpf_prog_type_by_name(title, &prog_type, &attach_type);
> +	if (err) {
> +		log_err("Failed to deduct types for %s BPF program", title);
> +		return -1;
> +	}
> +
> +	prog = bpf_object__find_program_by_title(obj, title);
> +	if (!prog) {
> +		log_err("Failed to find %s BPF program", title);
> +		return -1;
> +	}
> +
> +	err = bpf_prog_attach(bpf_program__fd(prog), cgroup_fd,
> +			      attach_type, 0);
> +	if (err) {
> +		log_err("Failed to attach %s BPF program", title);
> +		return -1;
> +	}
> +
> +	return 0;
> +}
> +
> +static int run_test(int cgroup_fd)
> +{
> +	struct bpf_prog_load_attr attr = {
> +		.file = "./sockopt_inherit.o",
> +	};
> +	int server_fd = -1, client_fd;
> +	struct bpf_object *obj;
> +	void *server_err;
> +	pthread_t tid;
> +	int ignored;
> +	int err;
> +
> +	err = bpf_prog_load_xattr(&attr, &obj, &ignored);
> +	if (err) {
> +		log_err("Failed to load BPF object");
> +		return -1;
> +	}
> +
> +	err = prog_attach(obj, cgroup_fd, "cgroup/getsockopt");
> +	if (err)
> +		goto close_bpf_object;
> +
> +	err = prog_attach(obj, cgroup_fd, "cgroup/setsockopt");
> +	if (err)
> +		goto close_bpf_object;
> +
> +	server_fd = start_server();
> +	if (server_fd < 0) {
> +		err = -1;
> +		goto close_bpf_object;
> +	}
> +
> +	pthread_create(&tid, NULL, server_thread, (void *)&server_fd);
> +
> +	client_fd = connect_to_server(server_fd);
> +	if (client_fd < 0) {
> +		err = -1;
> +		goto close_bpf_object;
> +	}
> +
> +	err += verify_sockopt(client_fd, CUSTOM_INHERIT1, "connect", 0);
> +	err += verify_sockopt(client_fd, CUSTOM_INHERIT2, "connect", 0);
> +	err += verify_sockopt(client_fd, CUSTOM_LISTENER, "connect", 0);
> +
> +	pthread_join(tid, &server_err);
> +
> +	err += (int)(long)server_err;
> +
> +	close(client_fd);
> +
> +close_bpf_object:
> +	bpf_object__close(obj);
> +	close(server_fd);

server_fd could be -1 here.

> +	return err;
> +}
> +
> +int main(int args, char **argv)
> +{
> +	int cgroup_fd;
> +	int err = EXIT_SUCCESS;
> +
> +	if (setup_cgroup_environment())
> +		return err;
> +
> +	cgroup_fd = create_and_get_cgroup(CG_PATH);
> +	if (cgroup_fd < 0)
> +		goto cleanup_cgroup_env;
> +
> +	if (join_cgroup(CG_PATH))
> +		goto cleanup_cgroup;
> +
> +	if (run_test(cgroup_fd))
> +		err = EXIT_FAILURE;
> +
> +	printf("test_sockopt_inherit: %s\n",
> +	       err == EXIT_SUCCESS ? "PASSED" : "FAILED");
> +
> +cleanup_cgroup:
> +	close(cgroup_fd);
> +cleanup_cgroup_env:
> +	cleanup_cgroup_environment();
> +	return err;
> +}
> 

^ permalink raw reply

* [PATCH v2 1/2] tcp: add new tcp_mtu_probe_floor sysctl
From: Josh Hunt @ 2019-08-07 23:52 UTC (permalink / raw)
  To: netdev; +Cc: davem, edumazet, ncardwell, Josh Hunt

The current implementation of TCP MTU probing can considerably
underestimate the MTU on lossy connections allowing the MSS to get down to
48. We have found that in almost all of these cases on our networks these
paths can handle much larger MTUs meaning the connections are being
artificially limited. Even though TCP MTU probing can raise the MSS back up
we have seen this not to be the case causing connections to be "stuck" with
an MSS of 48 when heavy loss is present.

Prior to pushing out this change we could not keep TCP MTU probing enabled
b/c of the above reasons. Now with a reasonble floor set we've had it
enabled for the past 6 months.

The new sysctl will still default to TCP_MIN_SND_MSS (48), but gives
administrators the ability to control the floor of MSS probing.

Signed-off-by: Josh Hunt <johunt@akamai.com>
---
 Documentation/networking/ip-sysctl.txt | 6 ++++++
 include/net/netns/ipv4.h               | 1 +
 net/ipv4/sysctl_net_ipv4.c             | 9 +++++++++
 net/ipv4/tcp_ipv4.c                    | 1 +
 net/ipv4/tcp_timer.c                   | 2 +-
 5 files changed, 18 insertions(+), 1 deletion(-)

diff --git a/Documentation/networking/ip-sysctl.txt b/Documentation/networking/ip-sysctl.txt
index df33674799b5..49e95f438ed7 100644
--- a/Documentation/networking/ip-sysctl.txt
+++ b/Documentation/networking/ip-sysctl.txt
@@ -256,6 +256,12 @@ tcp_base_mss - INTEGER
 	Path MTU discovery (MTU probing).  If MTU probing is enabled,
 	this is the initial MSS used by the connection.
 
+tcp_mtu_probe_floor - INTEGER
+	If MTU probing is enabled this caps the minimum MSS used for search_low
+	for the connection.
+
+	Default : 48
+
 tcp_min_snd_mss - INTEGER
 	TCP SYN and SYNACK messages usually advertise an ADVMSS option,
 	as described in RFC 1122 and RFC 6691.
diff --git a/include/net/netns/ipv4.h b/include/net/netns/ipv4.h
index bc24a8ec1ce5..c0c0791b1912 100644
--- a/include/net/netns/ipv4.h
+++ b/include/net/netns/ipv4.h
@@ -116,6 +116,7 @@ struct netns_ipv4 {
 	int sysctl_tcp_l3mdev_accept;
 #endif
 	int sysctl_tcp_mtu_probing;
+	int sysctl_tcp_mtu_probe_floor;
 	int sysctl_tcp_base_mss;
 	int sysctl_tcp_min_snd_mss;
 	int sysctl_tcp_probe_threshold;
diff --git a/net/ipv4/sysctl_net_ipv4.c b/net/ipv4/sysctl_net_ipv4.c
index 0b980e841927..59ded25acd04 100644
--- a/net/ipv4/sysctl_net_ipv4.c
+++ b/net/ipv4/sysctl_net_ipv4.c
@@ -820,6 +820,15 @@ static struct ctl_table ipv4_net_table[] = {
 		.extra2		= &tcp_min_snd_mss_max,
 	},
 	{
+		.procname	= "tcp_mtu_probe_floor",
+		.data		= &init_net.ipv4.sysctl_tcp_mtu_probe_floor,
+		.maxlen		= sizeof(int),
+		.mode		= 0644,
+		.proc_handler	= proc_dointvec_minmax,
+		.extra1		= &tcp_min_snd_mss_min,
+		.extra2		= &tcp_min_snd_mss_max,
+	},
+	{
 		.procname	= "tcp_probe_threshold",
 		.data		= &init_net.ipv4.sysctl_tcp_probe_threshold,
 		.maxlen		= sizeof(int),
diff --git a/net/ipv4/tcp_ipv4.c b/net/ipv4/tcp_ipv4.c
index d57641cb3477..e0a372676329 100644
--- a/net/ipv4/tcp_ipv4.c
+++ b/net/ipv4/tcp_ipv4.c
@@ -2637,6 +2637,7 @@ static int __net_init tcp_sk_init(struct net *net)
 	net->ipv4.sysctl_tcp_min_snd_mss = TCP_MIN_SND_MSS;
 	net->ipv4.sysctl_tcp_probe_threshold = TCP_PROBE_THRESHOLD;
 	net->ipv4.sysctl_tcp_probe_interval = TCP_PROBE_INTERVAL;
+	net->ipv4.sysctl_tcp_mtu_probe_floor = TCP_MIN_SND_MSS;
 
 	net->ipv4.sysctl_tcp_keepalive_time = TCP_KEEPALIVE_TIME;
 	net->ipv4.sysctl_tcp_keepalive_probes = TCP_KEEPALIVE_PROBES;
diff --git a/net/ipv4/tcp_timer.c b/net/ipv4/tcp_timer.c
index c801cd37cc2a..dbd9d2d0ee63 100644
--- a/net/ipv4/tcp_timer.c
+++ b/net/ipv4/tcp_timer.c
@@ -154,7 +154,7 @@ static void tcp_mtu_probing(struct inet_connection_sock *icsk, struct sock *sk)
 	} else {
 		mss = tcp_mtu_to_mss(sk, icsk->icsk_mtup.search_low) >> 1;
 		mss = min(net->ipv4.sysctl_tcp_base_mss, mss);
-		mss = max(mss, 68 - tcp_sk(sk)->tcp_header_len);
+		mss = max(mss, net->ipv4.sysctl_tcp_mtu_probe_floor);
 		mss = max(mss, net->ipv4.sysctl_tcp_min_snd_mss);
 		icsk->icsk_mtup.search_low = tcp_mss_to_mtu(sk, mss);
 	}
-- 
2.7.4


^ permalink raw reply related

* [PATCH v2 2/2] tcp: Update TCP_BASE_MSS comment
From: Josh Hunt @ 2019-08-07 23:52 UTC (permalink / raw)
  To: netdev; +Cc: davem, edumazet, ncardwell, Josh Hunt
In-Reply-To: <1565221950-1376-1-git-send-email-johunt@akamai.com>

TCP_BASE_MSS is used as the default initial MSS value when MTU probing is
enabled. Update the comment to reflect this.

Suggested-by: Neal Cardwell <ncardwell@google.com>
Signed-off-by: Josh Hunt <johunt@akamai.com>
---
 include/net/tcp.h | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/include/net/tcp.h b/include/net/tcp.h
index 81e8ade1e6e4..9e9fbfaf052b 100644
--- a/include/net/tcp.h
+++ b/include/net/tcp.h
@@ -64,7 +64,7 @@ void tcp_time_wait(struct sock *sk, int state, int timeo);
 /* Minimal accepted MSS. It is (60+60+8) - (20+20). */
 #define TCP_MIN_MSS		88U
 
-/* The least MTU to use for probing */
+/* The initial MTU to use for probing */
 #define TCP_BASE_MSS		1024
 
 /* probing interval, default to 10 minutes as per RFC4821 */
-- 
2.7.4


^ permalink raw reply related

* [PATCH net v3] net/tls: prevent skb_orphan() from leaking TLS plain text with offload
From: Jakub Kicinski @ 2019-08-08  0:03 UTC (permalink / raw)
  To: davem
  Cc: netdev, davejwatson, borisp, aviadye, john.fastabend, daniel,
	willemb, edumazet, alexei.starovoitov, oss-drivers,
	Jakub Kicinski

sk_validate_xmit_skb() and drivers depend on the sk member of
struct sk_buff to identify segments requiring encryption.
Any operation which removes or does not preserve the original TLS
socket such as skb_orphan() or skb_clone() will cause clear text
leaks.

Make the TCP socket underlying an offloaded TLS connection
mark all skbs as decrypted, if TLS TX is in offload mode.
Then in sk_validate_xmit_skb() catch skbs which have no socket
(or a socket with no validation) and decrypted flag set.

Note that CONFIG_SOCK_VALIDATE_XMIT, CONFIG_TLS_DEVICE and
sk->sk_validate_xmit_skb are slightly interchangeable right now,
they all imply TLS offload. The new checks are guarded by
CONFIG_TLS_DEVICE because that's the option guarding the
sk_buff->decrypted member.

Second, smaller issue with orphaning is that it breaks
the guarantee that packets will be delivered to device
queues in-order. All TLS offload drivers depend on that
scheduling property. This means skb_orphan_partial()'s
trick of preserving partial socket references will cause
issues in the drivers. We need a full orphan, and as a
result netem delay/throttling will cause all TLS offload
skbs to be dropped.

Reusing the sk_buff->decrypted flag also protects from
leaking clear text when incoming, decrypted skb is redirected
(e.g. by TC).

See commit 0608c69c9a80 ("bpf: sk_msg, sock{map|hash} redirect
through ULP") for justification why the internal flag is safe.
The only location which could leak the flag in is tcp_bpf_sendmsg(),
which is taken care of by clearing the previously unused bit.

v2:
 - remove superfluous decrypted mark copy (Willem);
 - remove the stale doc entry (Boris);
 - rely entirely on EOR marking to prevent coalescing (Boris);
 - use an internal sendpages flag instead of marking the socket
   (Boris).
v3 (Willem):
 - reorganize the can_skb_orphan_partial() condition;
 - fix the flag leak-in through tcp_bpf_sendmsg.

Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
---
 Documentation/networking/tls-offload.rst | 18 ------------------
 include/linux/skbuff.h                   |  8 ++++++++
 include/linux/socket.h                   |  3 +++
 include/net/sock.h                       | 10 +++++++++-
 net/core/sock.c                          | 19 ++++++++++++++-----
 net/ipv4/tcp.c                           |  3 +++
 net/ipv4/tcp_bpf.c                       |  6 +++++-
 net/ipv4/tcp_output.c                    |  3 +++
 net/tls/tls_device.c                     |  9 +++++++--
 9 files changed, 52 insertions(+), 27 deletions(-)

diff --git a/Documentation/networking/tls-offload.rst b/Documentation/networking/tls-offload.rst
index b70b70dc4524..0dd3f748239f 100644
--- a/Documentation/networking/tls-offload.rst
+++ b/Documentation/networking/tls-offload.rst
@@ -506,21 +506,3 @@ Drivers should ignore the changes to TLS the device feature flags.
 These flags will be acted upon accordingly by the core ``ktls`` code.
 TLS device feature flags only control adding of new TLS connection
 offloads, old connections will remain active after flags are cleared.
-
-Known bugs
-==========
-
-skb_orphan() leaks clear text
------------------------------
-
-Currently drivers depend on the :c:member:`sk` member of
-:c:type:`struct sk_buff <sk_buff>` to identify segments requiring
-encryption. Any operation which removes or does not preserve the socket
-association such as :c:func:`skb_orphan` or :c:func:`skb_clone`
-will cause the driver to miss the packets and lead to clear text leaks.
-
-Redirects leak clear text
--------------------------
-
-In the RX direction, if segment has already been decrypted by the device
-and it gets redirected or mirrored - clear text will be transmitted out.
diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h
index d8af86d995d6..ba5583522d24 100644
--- a/include/linux/skbuff.h
+++ b/include/linux/skbuff.h
@@ -1374,6 +1374,14 @@ static inline void skb_copy_hash(struct sk_buff *to, const struct sk_buff *from)
 	to->l4_hash = from->l4_hash;
 };
 
+static inline void skb_copy_decrypted(struct sk_buff *to,
+				      const struct sk_buff *from)
+{
+#ifdef CONFIG_TLS_DEVICE
+	to->decrypted = from->decrypted;
+#endif
+}
+
 #ifdef NET_SKBUFF_DATA_USES_OFFSET
 static inline unsigned char *skb_end_pointer(const struct sk_buff *skb)
 {
diff --git a/include/linux/socket.h b/include/linux/socket.h
index 97523818cb14..fc0bed59fc84 100644
--- a/include/linux/socket.h
+++ b/include/linux/socket.h
@@ -292,6 +292,9 @@ struct ucred {
 #define MSG_BATCH	0x40000 /* sendmmsg(): more messages coming */
 #define MSG_EOF         MSG_FIN
 #define MSG_NO_SHARED_FRAGS 0x80000 /* sendpage() internal : page frags are not shared */
+#define MSG_SENDPAGE_DECRYPTED	0x100000 /* sendpage() internal : page may carry
+					  * plain text and require encryption
+					  */
 
 #define MSG_ZEROCOPY	0x4000000	/* Use user data in kernel path */
 #define MSG_FASTOPEN	0x20000000	/* Send data in TCP SYN */
diff --git a/include/net/sock.h b/include/net/sock.h
index 228db3998e46..2c53f1a1d905 100644
--- a/include/net/sock.h
+++ b/include/net/sock.h
@@ -2482,6 +2482,7 @@ static inline bool sk_fullsock(const struct sock *sk)
 
 /* Checks if this SKB belongs to an HW offloaded socket
  * and whether any SW fallbacks are required based on dev.
+ * Check decrypted mark in case skb_orphan() cleared socket.
  */
 static inline struct sk_buff *sk_validate_xmit_skb(struct sk_buff *skb,
 						   struct net_device *dev)
@@ -2489,8 +2490,15 @@ static inline struct sk_buff *sk_validate_xmit_skb(struct sk_buff *skb,
 #ifdef CONFIG_SOCK_VALIDATE_XMIT
 	struct sock *sk = skb->sk;
 
-	if (sk && sk_fullsock(sk) && sk->sk_validate_xmit_skb)
+	if (sk && sk_fullsock(sk) && sk->sk_validate_xmit_skb) {
 		skb = sk->sk_validate_xmit_skb(sk, dev, skb);
+#ifdef CONFIG_TLS_DEVICE
+	} else if (unlikely(skb->decrypted)) {
+		pr_warn_ratelimited("unencrypted skb with no associated socket - dropping\n");
+		kfree_skb(skb);
+		skb = NULL;
+#endif
+	}
 #endif
 
 	return skb;
diff --git a/net/core/sock.c b/net/core/sock.c
index d57b0cc995a0..6d08553f885c 100644
--- a/net/core/sock.c
+++ b/net/core/sock.c
@@ -1992,6 +1992,19 @@ void skb_set_owner_w(struct sk_buff *skb, struct sock *sk)
 }
 EXPORT_SYMBOL(skb_set_owner_w);
 
+static bool can_skb_orphan_partial(const struct sk_buff *skb)
+{
+#ifdef CONFIG_TLS_DEVICE
+	/* Drivers depend on in-order delivery for crypto offload,
+	 * partial orphan breaks out-of-order-OK logic.
+	 */
+	if (skb->decrypted)
+		return false;
+#endif
+	return (skb->destructor == sock_wfree ||
+		(IS_ENABLED(CONFIG_INET) && skb->destructor == tcp_wfree));
+}
+
 /* This helper is used by netem, as it can hold packets in its
  * delay queue. We want to allow the owner socket to send more
  * packets, as if they were already TX completed by a typical driver.
@@ -2003,11 +2016,7 @@ void skb_orphan_partial(struct sk_buff *skb)
 	if (skb_is_tcp_pure_ack(skb))
 		return;
 
-	if (skb->destructor == sock_wfree
-#ifdef CONFIG_INET
-	    || skb->destructor == tcp_wfree
-#endif
-		) {
+	if (can_skb_orphan_partial(skb)) {
 		struct sock *sk = skb->sk;
 
 		if (refcount_inc_not_zero(&sk->sk_refcnt)) {
diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c
index 776905899ac0..77b485d60b9d 100644
--- a/net/ipv4/tcp.c
+++ b/net/ipv4/tcp.c
@@ -984,6 +984,9 @@ ssize_t do_tcp_sendpages(struct sock *sk, struct page *page, int offset,
 			if (!skb)
 				goto wait_for_memory;
 
+#ifdef CONFIG_TLS_DEVICE
+			skb->decrypted = !!(flags & MSG_SENDPAGE_DECRYPTED);
+#endif
 			skb_entail(sk, skb);
 			copy = size_goal;
 		}
diff --git a/net/ipv4/tcp_bpf.c b/net/ipv4/tcp_bpf.c
index 3d1e15401384..8a56e09cfb0e 100644
--- a/net/ipv4/tcp_bpf.c
+++ b/net/ipv4/tcp_bpf.c
@@ -398,10 +398,14 @@ static int tcp_bpf_send_verdict(struct sock *sk, struct sk_psock *psock,
 static int tcp_bpf_sendmsg(struct sock *sk, struct msghdr *msg, size_t size)
 {
 	struct sk_msg tmp, *msg_tx = NULL;
-	int flags = msg->msg_flags | MSG_NO_SHARED_FRAGS;
 	int copied = 0, err = 0;
 	struct sk_psock *psock;
 	long timeo;
+	int flags;
+
+	/* Don't let internal do_tcp_sendpages() flags through */
+	flags = (msg->msg_flags & ~MSG_SENDPAGE_DECRYPTED);
+	flags |= MSG_NO_SHARED_FRAGS;
 
 	psock = sk_psock_get(sk);
 	if (unlikely(!psock))
diff --git a/net/ipv4/tcp_output.c b/net/ipv4/tcp_output.c
index 6e4afc48d7bb..979520e46e33 100644
--- a/net/ipv4/tcp_output.c
+++ b/net/ipv4/tcp_output.c
@@ -1320,6 +1320,7 @@ int tcp_fragment(struct sock *sk, enum tcp_queue tcp_queue,
 	buff = sk_stream_alloc_skb(sk, nsize, gfp, true);
 	if (!buff)
 		return -ENOMEM; /* We'll just try again later. */
+	skb_copy_decrypted(buff, skb);
 
 	sk->sk_wmem_queued += buff->truesize;
 	sk_mem_charge(sk, buff->truesize);
@@ -1874,6 +1875,7 @@ static int tso_fragment(struct sock *sk, struct sk_buff *skb, unsigned int len,
 	buff = sk_stream_alloc_skb(sk, 0, gfp, true);
 	if (unlikely(!buff))
 		return -ENOMEM;
+	skb_copy_decrypted(buff, skb);
 
 	sk->sk_wmem_queued += buff->truesize;
 	sk_mem_charge(sk, buff->truesize);
@@ -2143,6 +2145,7 @@ static int tcp_mtu_probe(struct sock *sk)
 	sk_mem_charge(sk, nskb->truesize);
 
 	skb = tcp_send_head(sk);
+	skb_copy_decrypted(nskb, skb);
 
 	TCP_SKB_CB(nskb)->seq = TCP_SKB_CB(skb)->seq;
 	TCP_SKB_CB(nskb)->end_seq = TCP_SKB_CB(skb)->seq + probe_size;
diff --git a/net/tls/tls_device.c b/net/tls/tls_device.c
index 7c0b2b778703..43922d86e510 100644
--- a/net/tls/tls_device.c
+++ b/net/tls/tls_device.c
@@ -373,9 +373,9 @@ static int tls_push_data(struct sock *sk,
 	struct tls_context *tls_ctx = tls_get_ctx(sk);
 	struct tls_prot_info *prot = &tls_ctx->prot_info;
 	struct tls_offload_context_tx *ctx = tls_offload_ctx_tx(tls_ctx);
-	int tls_push_record_flags = flags | MSG_SENDPAGE_NOTLAST;
 	int more = flags & (MSG_SENDPAGE_NOTLAST | MSG_MORE);
 	struct tls_record_info *record = ctx->open_record;
+	int tls_push_record_flags;
 	struct page_frag *pfrag;
 	size_t orig_size = size;
 	u32 max_open_record_len;
@@ -390,6 +390,9 @@ static int tls_push_data(struct sock *sk,
 	if (sk->sk_err)
 		return -sk->sk_err;
 
+	flags |= MSG_SENDPAGE_DECRYPTED;
+	tls_push_record_flags = flags | MSG_SENDPAGE_NOTLAST;
+
 	timeo = sock_sndtimeo(sk, flags & MSG_DONTWAIT);
 	if (tls_is_partially_sent_record(tls_ctx)) {
 		rc = tls_push_partial_record(sk, tls_ctx, flags);
@@ -576,7 +579,9 @@ void tls_device_write_space(struct sock *sk, struct tls_context *ctx)
 		gfp_t sk_allocation = sk->sk_allocation;
 
 		sk->sk_allocation = GFP_ATOMIC;
-		tls_push_partial_record(sk, ctx, MSG_DONTWAIT | MSG_NOSIGNAL);
+		tls_push_partial_record(sk, ctx,
+					MSG_DONTWAIT | MSG_NOSIGNAL |
+					MSG_SENDPAGE_DECRYPTED);
 		sk->sk_allocation = sk_allocation;
 	}
 }
-- 
2.21.0


^ permalink raw reply related

* Re: [PATCH bpf-next 1/3] bpf: support cloning sk storage on accept()
From: Stanislav Fomichev @ 2019-08-08  0:05 UTC (permalink / raw)
  To: Yonghong Song
  Cc: Stanislav Fomichev, netdev@vger.kernel.org, bpf@vger.kernel.org,
	davem@davemloft.net, ast@kernel.org, daniel@iogearbox.net,
	Martin Lau
In-Reply-To: <9bd56e49-c38d-e1c4-1ff3-8250531d0d48@fb.com>

On 08/07, Yonghong Song wrote:
> 
> 
> On 8/7/19 8:47 AM, Stanislav Fomichev wrote:
> > Add new helper bpf_sk_storage_clone which optionally clones sk storage
> > and call it from bpf_sk_storage_clone. Reuse the gap in
> > bpf_sk_storage_elem to store clone/non-clone flag.
> > 
> > Cc: Martin KaFai Lau <kafai@fb.com>
> > Signed-off-by: Stanislav Fomichev <sdf@google.com>
> 
> I tried to see whether I can find any missing race conditions in
> the code but I failed. So except a minor comments below,
Thanks for a review!

> Acked-by: Yonghong Song <yhs@fb.com>
> 
> > ---
> >   include/net/bpf_sk_storage.h |  10 ++++
> >   include/uapi/linux/bpf.h     |   1 +
> >   net/core/bpf_sk_storage.c    | 102 +++++++++++++++++++++++++++++++++--
> >   net/core/sock.c              |   9 ++--
> >   4 files changed, 115 insertions(+), 7 deletions(-)
> > 
> > diff --git a/include/net/bpf_sk_storage.h b/include/net/bpf_sk_storage.h
> > index b9dcb02e756b..8e4f831d2e52 100644
> > --- a/include/net/bpf_sk_storage.h
> > +++ b/include/net/bpf_sk_storage.h
> > @@ -10,4 +10,14 @@ void bpf_sk_storage_free(struct sock *sk);
> >   extern const struct bpf_func_proto bpf_sk_storage_get_proto;
> >   extern const struct bpf_func_proto bpf_sk_storage_delete_proto;
> >   
> > +#ifdef CONFIG_BPF_SYSCALL
> > +int bpf_sk_storage_clone(const struct sock *sk, struct sock *newsk);
> > +#else
> > +static inline int bpf_sk_storage_clone(const struct sock *sk,
> > +				       struct sock *newsk)
> > +{
> > +	return 0;
> > +}
> > +#endif
> > +
> >   #endif /* _BPF_SK_STORAGE_H */
> > diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
> > index 4393bd4b2419..00459ca4c8cf 100644
> > --- a/include/uapi/linux/bpf.h
> > +++ b/include/uapi/linux/bpf.h
> > @@ -2931,6 +2931,7 @@ enum bpf_func_id {
> >   
> >   /* BPF_FUNC_sk_storage_get flags */
> >   #define BPF_SK_STORAGE_GET_F_CREATE	(1ULL << 0)
> > +#define BPF_SK_STORAGE_GET_F_CLONE	(1ULL << 1)
> >   
> >   /* Mode for BPF_FUNC_skb_adjust_room helper. */
> >   enum bpf_adj_room_mode {
> > diff --git a/net/core/bpf_sk_storage.c b/net/core/bpf_sk_storage.c
> > index 94c7f77ecb6b..b6dea67965bc 100644
> > --- a/net/core/bpf_sk_storage.c
> > +++ b/net/core/bpf_sk_storage.c
> > @@ -12,6 +12,9 @@
> >   
> >   static atomic_t cache_idx;
> >   
> > +#define BPF_SK_STORAGE_GET_F_MASK	(BPF_SK_STORAGE_GET_F_CREATE | \
> > +					 BPF_SK_STORAGE_GET_F_CLONE)
> > +
> >   struct bucket {
> >   	struct hlist_head list;
> >   	raw_spinlock_t lock;
> > @@ -66,7 +69,8 @@ struct bpf_sk_storage_elem {
> >   	struct hlist_node snode;	/* Linked to bpf_sk_storage */
> >   	struct bpf_sk_storage __rcu *sk_storage;
> >   	struct rcu_head rcu;
> > -	/* 8 bytes hole */
> > +	u8 clone:1;
> > +	/* 7 bytes hole */
> >   	/* The data is stored in aother cacheline to minimize
> >   	 * the number of cachelines access during a cache hit.
> >   	 */
> > @@ -509,7 +513,7 @@ static int sk_storage_delete(struct sock *sk, struct bpf_map *map)
> >   	return 0;
> >   }
> >   
> > -/* Called by __sk_destruct() */
> > +/* Called by __sk_destruct() & bpf_sk_storage_clone() */
> >   void bpf_sk_storage_free(struct sock *sk)
> >   {
> >   	struct bpf_sk_storage_elem *selem;
> > @@ -739,19 +743,106 @@ static int bpf_fd_sk_storage_delete_elem(struct bpf_map *map, void *key)
> >   	return err;
> >   }
> >   
> > +static struct bpf_sk_storage_elem *
> > +bpf_sk_storage_clone_elem(struct sock *newsk,
> > +			  struct bpf_sk_storage_map *smap,
> > +			  struct bpf_sk_storage_elem *selem)
> > +{
> > +	struct bpf_sk_storage_elem *copy_selem;
> > +
> > +	copy_selem = selem_alloc(smap, newsk, NULL, true);
> > +	if (!copy_selem)
> > +		return ERR_PTR(-ENOMEM);
> > +
> > +	if (map_value_has_spin_lock(&smap->map))
> > +		copy_map_value_locked(&smap->map, SDATA(copy_selem)->data,
> > +				      SDATA(selem)->data, true);
> > +	else
> > +		copy_map_value(&smap->map, SDATA(copy_selem)->data,
> > +			       SDATA(selem)->data);
> > +
> > +	return copy_selem;
> > +}
> > +
> > +int bpf_sk_storage_clone(const struct sock *sk, struct sock *newsk)
> > +{
> > +	struct bpf_sk_storage *new_sk_storage = NULL;
> > +	struct bpf_sk_storage *sk_storage;
> > +	struct bpf_sk_storage_elem *selem;
> > +	int ret;
> > +
> > +	RCU_INIT_POINTER(newsk->sk_bpf_storage, NULL);
> > +
> > +	rcu_read_lock();
> > +	sk_storage = rcu_dereference(sk->sk_bpf_storage);
> > +
> > +	if (!sk_storage || hlist_empty(&sk_storage->list))
> > +		goto out;
> > +
> > +	hlist_for_each_entry_rcu(selem, &sk_storage->list, snode) {
> > +		struct bpf_sk_storage_map *smap;
> > +		struct bpf_sk_storage_elem *copy_selem;
> > +
> > +		if (!selem->clone)
> > +			continue;
> > +
> > +		smap = rcu_dereference(SDATA(selem)->smap);
> > +		if (!smap)
> > +			continue;
> > +
> > +		copy_selem = bpf_sk_storage_clone_elem(newsk, smap, selem);
> > +		if (IS_ERR(copy_selem)) {
> > +			ret = PTR_ERR(copy_selem);
> > +			goto err;
> > +		}
> > +
> > +		if (!new_sk_storage) {
> > +			ret = sk_storage_alloc(newsk, smap, copy_selem);
> > +			if (ret) {
> > +				kfree(copy_selem);
> > +				atomic_sub(smap->elem_size,
> > +					   &newsk->sk_omem_alloc);
> > +				goto err;
> > +			}
> > +
> > +			new_sk_storage = rcu_dereference(copy_selem->sk_storage);
> > +			continue;
> > +		}
> > +
> > +		raw_spin_lock_bh(&new_sk_storage->lock);
> > +		selem_link_map(smap, copy_selem);
> > +		__selem_link_sk(new_sk_storage, copy_selem);
> > +		raw_spin_unlock_bh(&new_sk_storage->lock);
> 
> Considering in this particular case, new socket is not visible to 
> outside world yet (both kernel and user space), map_delete/map_update
> operations are not applicable in this situation, so
> the above raw_spin_lock_bh() probably not needed.
I agree, it's doing nothing, but __selem_link_sk has the following comment:
/* sk_storage->lock must be held and sk_storage->list cannot be empty */

Just wanted to keep that invariant for this call site as well (in case
we add some lockdep enforcement or smth else). WDYT?

> > +	}
> > +
> > +out:
> > +	rcu_read_unlock();
> > +	return 0;
> > +
> > +err:
> > +	rcu_read_unlock();
> > +
> > +	bpf_sk_storage_free(newsk);
> > +	return ret;
> > +}
> > +
> >   BPF_CALL_4(bpf_sk_storage_get, struct bpf_map *, map, struct sock *, sk,
> >   	   void *, value, u64, flags)
> >   {
> >   	struct bpf_sk_storage_data *sdata;
> >   
> > -	if (flags > BPF_SK_STORAGE_GET_F_CREATE)
> > +	if (flags & ~BPF_SK_STORAGE_GET_F_MASK)
> > +		return (unsigned long)NULL;
> > +
> > +	if ((flags & BPF_SK_STORAGE_GET_F_CLONE) &&
> > +	    !(flags & BPF_SK_STORAGE_GET_F_CREATE))
> >   		return (unsigned long)NULL;
> >   
> >   	sdata = sk_storage_lookup(sk, map, true);
> >   	if (sdata)
> >   		return (unsigned long)sdata->data;
> >   
> > -	if (flags == BPF_SK_STORAGE_GET_F_CREATE &&
> > +	if ((flags & BPF_SK_STORAGE_GET_F_CREATE) &&
> >   	    /* Cannot add new elem to a going away sk.
> >   	     * Otherwise, the new elem may become a leak
> >   	     * (and also other memory issues during map
> > @@ -762,6 +853,9 @@ BPF_CALL_4(bpf_sk_storage_get, struct bpf_map *, map, struct sock *, sk,
> >   		/* sk must be a fullsock (guaranteed by verifier),
> >   		 * so sock_gen_put() is unnecessary.
> >   		 */
> > +		if (!IS_ERR(sdata))
> > +			SELEM(sdata)->clone =
> > +				!!(flags & BPF_SK_STORAGE_GET_F_CLONE);
> >   		sock_put(sk);
> >   		return IS_ERR(sdata) ?
> >   			(unsigned long)NULL : (unsigned long)sdata->data;
> > diff --git a/net/core/sock.c b/net/core/sock.c
> > index d57b0cc995a0..f5e801a9cea4 100644
> > --- a/net/core/sock.c
> > +++ b/net/core/sock.c
> > @@ -1851,9 +1851,12 @@ struct sock *sk_clone_lock(const struct sock *sk, const gfp_t priority)
> >   			goto out;
> >   		}
> >   		RCU_INIT_POINTER(newsk->sk_reuseport_cb, NULL);
> > -#ifdef CONFIG_BPF_SYSCALL
> > -		RCU_INIT_POINTER(newsk->sk_bpf_storage, NULL);
> > -#endif
> > +
> > +		if (bpf_sk_storage_clone(sk, newsk)) {
> > +			sk_free_unlock_clone(newsk);
> > +			newsk = NULL;
> > +			goto out;
> > +		}
> >   
> >   		newsk->sk_err	   = 0;
> >   		newsk->sk_err_soft = 0;
> > 

^ permalink raw reply

* [PATCH V38 23/29] bpf: Restrict bpf when kernel lockdown is in confidentiality mode
From: Matthew Garrett @ 2019-08-08  0:07 UTC (permalink / raw)
  To: jmorris
  Cc: linux-security-module, linux-kernel, linux-api, David Howells,
	Alexei Starovoitov, Matthew Garrett, Kees Cook, netdev,
	Chun-Yi Lee, Daniel Borkmann
In-Reply-To: <20190808000721.124691-1-matthewgarrett@google.com>

From: David Howells <dhowells@redhat.com>

bpf_read() and bpf_read_str() could potentially be abused to (eg) allow
private keys in kernel memory to be leaked. Disable them if the kernel
has been locked down in confidentiality mode.

Suggested-by: Alexei Starovoitov <alexei.starovoitov@gmail.com>
Signed-off-by: Matthew Garrett <mjg59@google.com>
Reviewed-by: Kees Cook <keescook@chromium.org>
cc: netdev@vger.kernel.org
cc: Chun-Yi Lee <jlee@suse.com>
cc: Alexei Starovoitov <alexei.starovoitov@gmail.com>
Cc: Daniel Borkmann <daniel@iogearbox.net>
---
 include/linux/security.h     |  1 +
 kernel/trace/bpf_trace.c     | 10 ++++++++++
 security/lockdown/lockdown.c |  1 +
 3 files changed, 12 insertions(+)

diff --git a/include/linux/security.h b/include/linux/security.h
index 987d8427f091..8dd1741a52cd 100644
--- a/include/linux/security.h
+++ b/include/linux/security.h
@@ -118,6 +118,7 @@ enum lockdown_reason {
 	LOCKDOWN_INTEGRITY_MAX,
 	LOCKDOWN_KCORE,
 	LOCKDOWN_KPROBES,
+	LOCKDOWN_BPF_READ,
 	LOCKDOWN_CONFIDENTIALITY_MAX,
 };
 
diff --git a/kernel/trace/bpf_trace.c b/kernel/trace/bpf_trace.c
index ca1255d14576..492a8bfaae98 100644
--- a/kernel/trace/bpf_trace.c
+++ b/kernel/trace/bpf_trace.c
@@ -142,8 +142,13 @@ BPF_CALL_3(bpf_probe_read, void *, dst, u32, size, const void *, unsafe_ptr)
 {
 	int ret;
 
+	ret = security_locked_down(LOCKDOWN_BPF_READ);
+	if (ret < 0)
+		goto out;
+
 	ret = probe_kernel_read(dst, unsafe_ptr, size);
 	if (unlikely(ret < 0))
+out:
 		memset(dst, 0, size);
 
 	return ret;
@@ -569,6 +574,10 @@ BPF_CALL_3(bpf_probe_read_str, void *, dst, u32, size,
 {
 	int ret;
 
+	ret = security_locked_down(LOCKDOWN_BPF_READ);
+	if (ret < 0)
+		goto out;
+
 	/*
 	 * The strncpy_from_unsafe() call will likely not fill the entire
 	 * buffer, but that's okay in this circumstance as we're probing
@@ -580,6 +589,7 @@ BPF_CALL_3(bpf_probe_read_str, void *, dst, u32, size,
 	 */
 	ret = strncpy_from_unsafe(dst, unsafe_ptr, size);
 	if (unlikely(ret < 0))
+out:
 		memset(dst, 0, size);
 
 	return ret;
diff --git a/security/lockdown/lockdown.c b/security/lockdown/lockdown.c
index 6b123cbf3748..1b89d3e8e54d 100644
--- a/security/lockdown/lockdown.c
+++ b/security/lockdown/lockdown.c
@@ -33,6 +33,7 @@ static char *lockdown_reasons[LOCKDOWN_CONFIDENTIALITY_MAX+1] = {
 	[LOCKDOWN_INTEGRITY_MAX] = "integrity",
 	[LOCKDOWN_KCORE] = "/proc/kcore access",
 	[LOCKDOWN_KPROBES] = "use of kprobes",
+	[LOCKDOWN_BPF_READ] = "use of bpf to read kernel RAM",
 	[LOCKDOWN_CONFIDENTIALITY_MAX] = "confidentiality",
 };
 
-- 
2.22.0.770.g0f2c4a37fd-goog


^ permalink raw reply related

* Re: [PATCH bpf-next 3/3] selftests/bpf: add sockopt clone/inheritance test
From: Stanislav Fomichev @ 2019-08-08  0:08 UTC (permalink / raw)
  To: Yonghong Song
  Cc: Stanislav Fomichev, netdev@vger.kernel.org, bpf@vger.kernel.org,
	davem@davemloft.net, ast@kernel.org, daniel@iogearbox.net,
	Martin Lau
In-Reply-To: <5a18a8ed-ab1b-de15-5dff-2b4a068bbe56@fb.com>

On 08/07, Yonghong Song wrote:
> 
> 
> On 8/7/19 8:47 AM, Stanislav Fomichev wrote:
> > Add a test that calls setsockopt on the listener socket which triggers
> > BPF program. This BPF program writes to the sk storage and sets
> > clone flag. Make sure that sk storage is cloned for a newly
> > accepted connection.
> > 
> > We have two cloned maps in the tests to make sure we hit both cases
> > in bpf_sk_storage_clone: first element (sk_storage_alloc) and
> > non-first element(s) (selem_link_map).
> > 
> > Cc: Martin KaFai Lau <kafai@fb.com>
> > Signed-off-by: Stanislav Fomichev <sdf@google.com>
> > ---
> >   tools/testing/selftests/bpf/.gitignore        |   1 +
> >   tools/testing/selftests/bpf/Makefile          |   3 +-
> >   .../selftests/bpf/progs/sockopt_inherit.c     | 102 +++++++
> >   .../selftests/bpf/test_sockopt_inherit.c      | 252 ++++++++++++++++++
> >   4 files changed, 357 insertions(+), 1 deletion(-)
> >   create mode 100644 tools/testing/selftests/bpf/progs/sockopt_inherit.c
> >   create mode 100644 tools/testing/selftests/bpf/test_sockopt_inherit.c
> > 
> > diff --git a/tools/testing/selftests/bpf/.gitignore b/tools/testing/selftests/bpf/.gitignore
> > index 90f70d2c7c22..60c9338cd9b4 100644
> > --- a/tools/testing/selftests/bpf/.gitignore
> > +++ b/tools/testing/selftests/bpf/.gitignore
> > @@ -42,4 +42,5 @@ xdping
> >   test_sockopt
> >   test_sockopt_sk
> >   test_sockopt_multi
> > +test_sockopt_inherit
> >   test_tcp_rtt
> > diff --git a/tools/testing/selftests/bpf/Makefile b/tools/testing/selftests/bpf/Makefile
> > index 3bd0f4a0336a..c875763a851a 100644
> > --- a/tools/testing/selftests/bpf/Makefile
> > +++ b/tools/testing/selftests/bpf/Makefile
> > @@ -29,7 +29,7 @@ TEST_GEN_PROGS = test_verifier test_tag test_maps test_lru_map test_lpm_map test
> >   	test_cgroup_storage test_select_reuseport test_section_names \
> >   	test_netcnt test_tcpnotify_user test_sock_fields test_sysctl test_hashmap \
> >   	test_btf_dump test_cgroup_attach xdping test_sockopt test_sockopt_sk \
> > -	test_sockopt_multi test_tcp_rtt
> > +	test_sockopt_multi test_sockopt_inherit test_tcp_rtt
> >   
> >   BPF_OBJ_FILES = $(patsubst %.c,%.o, $(notdir $(wildcard progs/*.c)))
> >   TEST_GEN_FILES = $(BPF_OBJ_FILES)
> > @@ -110,6 +110,7 @@ $(OUTPUT)/test_cgroup_attach: cgroup_helpers.c
> >   $(OUTPUT)/test_sockopt: cgroup_helpers.c
> >   $(OUTPUT)/test_sockopt_sk: cgroup_helpers.c
> >   $(OUTPUT)/test_sockopt_multi: cgroup_helpers.c
> > +$(OUTPUT)/test_sockopt_inherit: cgroup_helpers.c
> >   $(OUTPUT)/test_tcp_rtt: cgroup_helpers.c
> >   
> >   .PHONY: force
> > diff --git a/tools/testing/selftests/bpf/progs/sockopt_inherit.c b/tools/testing/selftests/bpf/progs/sockopt_inherit.c
> > new file mode 100644
> > index 000000000000..357fc9db5874
> > --- /dev/null
> > +++ b/tools/testing/selftests/bpf/progs/sockopt_inherit.c
> > @@ -0,0 +1,102 @@
> > +// SPDX-License-Identifier: GPL-2.0
> > +#include <linux/bpf.h>
> > +#include "bpf_helpers.h"
> > +
> > +char _license[] SEC("license") = "GPL";
> > +__u32 _version SEC("version") = 1;
> > +
> > +#define SOL_CUSTOM			0xdeadbeef
> > +#define CUSTOM_INHERIT1			0
> > +#define CUSTOM_INHERIT2			1
> > +#define CUSTOM_LISTENER			2
> > +
> > +struct sockopt_inherit {
> > +	__u8 val;
> > +};
> > +
> > +struct bpf_map_def SEC("maps") cloned1_map = {
> > +	.type = BPF_MAP_TYPE_SK_STORAGE,
> > +	.key_size = sizeof(int),
> > +	.value_size = sizeof(struct sockopt_inherit),
> > +	.map_flags = BPF_F_NO_PREALLOC,
> > +};
> > +BPF_ANNOTATE_KV_PAIR(cloned1_map, int, struct sockopt_inherit);
> > +
> > +struct bpf_map_def SEC("maps") cloned2_map = {
> > +	.type = BPF_MAP_TYPE_SK_STORAGE,
> > +	.key_size = sizeof(int),
> > +	.value_size = sizeof(struct sockopt_inherit),
> > +	.map_flags = BPF_F_NO_PREALLOC,
> > +};
> > +BPF_ANNOTATE_KV_PAIR(cloned2_map, int, struct sockopt_inherit);
> > +
> > +struct bpf_map_def SEC("maps") listener_map = {
> > +	.type = BPF_MAP_TYPE_SK_STORAGE,
> > +	.key_size = sizeof(int),
> > +	.value_size = sizeof(struct sockopt_inherit),
> > +	.map_flags = BPF_F_NO_PREALLOC,
> > +};
> > +BPF_ANNOTATE_KV_PAIR(listener_map, int, struct sockopt_inherit);
> 
> Your still use the old way for map definitions. Is this possible for you
> to use new map definitions (in section ".maps")?
Ah, my bad, I'm not used to the new defs. Will fix!

> > +
> > +static __inline struct sockopt_inherit *get_storage(struct bpf_sockopt *ctx)
> > +{
> > +	if (ctx->optname == CUSTOM_INHERIT1)
> > +		return bpf_sk_storage_get(&cloned1_map, ctx->sk, 0,
> > +					  BPF_SK_STORAGE_GET_F_CREATE |
> > +					  BPF_SK_STORAGE_GET_F_CLONE);
> > +	else if (ctx->optname == CUSTOM_INHERIT2)
> > +		return bpf_sk_storage_get(&cloned2_map, ctx->sk, 0,
> > +					  BPF_SK_STORAGE_GET_F_CREATE |
> > +					  BPF_SK_STORAGE_GET_F_CLONE);
> > +	else
> > +		return bpf_sk_storage_get(&listener_map, ctx->sk, 0,
> > +					  BPF_SK_STORAGE_GET_F_CREATE);
> > +}
> > +
> [.....]> diff --git a/tools/testing/selftests/bpf/test_sockopt_inherit.c 
> b/tools/testing/selftests/bpf/test_sockopt_inherit.c
> > new file mode 100644
> > index 000000000000..e47b9c28d743
> > --- /dev/null
> > +++ b/tools/testing/selftests/bpf/test_sockopt_inherit.c
> > @@ -0,0 +1,252 @@
> > +// SPDX-License-Identifier: GPL-2.0
> > +#include <error.h>
> > +#include <errno.h>
> > +#include <stdio.h>
> > +#include <unistd.h>
> > +#include <sys/types.h>
> > +#include <sys/socket.h>
> > +#include <netinet/in.h>
> > +#include <pthread.h>
> > +
> > +#include <linux/filter.h>
> > +#include <bpf/bpf.h>
> > +#include <bpf/libbpf.h>
> > +
> > +#include "bpf_rlimit.h"
> > +#include "bpf_util.h"
> > +#include "cgroup_helpers.h"
> > +
> > +#define CG_PATH				"/sockopt_inherit"
> > +#define SOL_CUSTOM			0xdeadbeef
> > +#define CUSTOM_INHERIT1			0
> > +#define CUSTOM_INHERIT2			1
> > +#define CUSTOM_LISTENER			2
> > +
> > +static int connect_to_server(int server_fd)
> > +{
> > +	struct sockaddr_storage addr;
> > +	socklen_t len = sizeof(addr);
> > +	int fd;
> > +
> > +	fd = socket(AF_INET, SOCK_STREAM, 0);
> > +	if (fd < 0) {
> > +		log_err("Failed to create client socket");
> > +		return -1;
> > +	}
> > +
> > +	if (getsockname(server_fd, (struct sockaddr *)&addr, &len)) {
> > +		log_err("Failed to get server addr");
> > +		goto out;
> > +	}
> > +
> > +	if (connect(fd, (const struct sockaddr *)&addr, len) < 0) {
> > +		log_err("Fail to connect to server");
> > +		goto out;
> > +	}
> > +
> > +	return fd;
> > +
> > +out:
> > +	close(fd);
> > +	return -1;
> > +}
> > +
> > +static int verify_sockopt(int fd, int optname, const char *msg, char expected)
> > +{
> > +	socklen_t optlen = 1;
> > +	char buf = 0;
> > +	int err;
> > +
> > +	err = getsockopt(fd, SOL_CUSTOM, optname, &buf, &optlen);
> > +	if (err) {
> > +		log_err("%s: failed to call getsockopt", msg);
> > +		return 1;
> > +	}
> > +
> > +	log_err("%s %d: got=0x%x ? expected=0x%x", msg, optname, buf, expected);
> 
> There may not be error here.
Good point, will switch to simple printf.

> > +
> > +	if (buf != expected) {
> > +		log_err("%s: unexpected getsockopt value %d != %d", msg,
> > +			buf, expected);
> > +		return 1;
> > +	}
> > +
> > +	return 0;
> > +}
> > +
> > +static void *server_thread(void *arg)
> > +{
> > +	struct sockaddr_storage addr;
> > +	socklen_t len = sizeof(addr);
> > +	int fd = *(int *)arg;
> > +	int client_fd;
> > +	int err = 0;
> > +
> > +	if (listen(fd, 1) < 0)
> > +		error(1, errno, "Failed to listed on socket");
> > +
> > +	err += verify_sockopt(fd, CUSTOM_INHERIT1, "listen", 1);
> > +	err += verify_sockopt(fd, CUSTOM_INHERIT2, "listen", 1);
> > +	err += verify_sockopt(fd, CUSTOM_LISTENER, "listen", 1);
> > +
> > +	client_fd = accept(fd, (struct sockaddr *)&addr, &len);
> > +	if (client_fd < 0)
> > +		error(1, errno, "Failed to accept client");
> > +
> > +	err += verify_sockopt(client_fd, CUSTOM_INHERIT1, "accept", 1);
> > +	err += verify_sockopt(client_fd, CUSTOM_INHERIT2, "accept", 1);
> > +	err += verify_sockopt(client_fd, CUSTOM_LISTENER, "accept", 0);
> > +
> > +	close(client_fd);
> > +
> > +	return (void *)(long)err;
> > +}
> > +
> > +static int start_server(void)
> > +{
> > +	struct sockaddr_in addr = {
> > +		.sin_family = AF_INET,
> > +		.sin_addr.s_addr = htonl(INADDR_LOOPBACK),
> > +	};
> > +	char buf;
> > +	int err;
> > +	int fd;
> > +	int i;
> > +
> > +	fd = socket(AF_INET, SOCK_STREAM, 0);
> > +	if (fd < 0) {
> > +		log_err("Failed to create server socket");
> > +		return -1;
> > +	}
> > +
> > +	for (i = CUSTOM_INHERIT1; i <= CUSTOM_LISTENER; i++) {
> > +		buf = 0x01;
> > +		err = setsockopt(fd, SOL_CUSTOM, i, &buf, 1);
> > +		if (err) {
> > +			log_err("Failed to call setsockopt(%d)", i);
> > +			close(fd);
> > +			return -1;
> > +		}
> > +	}
> > +
> > +	if (bind(fd, (const struct sockaddr *)&addr, sizeof(addr)) < 0) {
> > +		log_err("Failed to bind socket");
> > +		close(fd);
> > +		return -1;
> > +	}
> > +
> > +	return fd;
> > +}
> > +
> > +static int prog_attach(struct bpf_object *obj, int cgroup_fd, const char *title)
> > +{
> > +	enum bpf_attach_type attach_type;
> > +	enum bpf_prog_type prog_type;
> > +	struct bpf_program *prog;
> > +	int err;
> > +
> > +	err = libbpf_prog_type_by_name(title, &prog_type, &attach_type);
> > +	if (err) {
> > +		log_err("Failed to deduct types for %s BPF program", title);
> > +		return -1;
> > +	}
> > +
> > +	prog = bpf_object__find_program_by_title(obj, title);
> > +	if (!prog) {
> > +		log_err("Failed to find %s BPF program", title);
> > +		return -1;
> > +	}
> > +
> > +	err = bpf_prog_attach(bpf_program__fd(prog), cgroup_fd,
> > +			      attach_type, 0);
> > +	if (err) {
> > +		log_err("Failed to attach %s BPF program", title);
> > +		return -1;
> > +	}
> > +
> > +	return 0;
> > +}
> > +
> > +static int run_test(int cgroup_fd)
> > +{
> > +	struct bpf_prog_load_attr attr = {
> > +		.file = "./sockopt_inherit.o",
> > +	};
> > +	int server_fd = -1, client_fd;
> > +	struct bpf_object *obj;
> > +	void *server_err;
> > +	pthread_t tid;
> > +	int ignored;
> > +	int err;
> > +
> > +	err = bpf_prog_load_xattr(&attr, &obj, &ignored);
> > +	if (err) {
> > +		log_err("Failed to load BPF object");
> > +		return -1;
> > +	}
> > +
> > +	err = prog_attach(obj, cgroup_fd, "cgroup/getsockopt");
> > +	if (err)
> > +		goto close_bpf_object;
> > +
> > +	err = prog_attach(obj, cgroup_fd, "cgroup/setsockopt");
> > +	if (err)
> > +		goto close_bpf_object;
> > +
> > +	server_fd = start_server();
> > +	if (server_fd < 0) {
> > +		err = -1;
> > +		goto close_bpf_object;
> > +	}
> > +
> > +	pthread_create(&tid, NULL, server_thread, (void *)&server_fd);
> > +
> > +	client_fd = connect_to_server(server_fd);
> > +	if (client_fd < 0) {
> > +		err = -1;
> > +		goto close_bpf_object;
> > +	}
> > +
> > +	err += verify_sockopt(client_fd, CUSTOM_INHERIT1, "connect", 0);
> > +	err += verify_sockopt(client_fd, CUSTOM_INHERIT2, "connect", 0);
> > +	err += verify_sockopt(client_fd, CUSTOM_LISTENER, "connect", 0);
> > +
> > +	pthread_join(tid, &server_err);
> > +
> > +	err += (int)(long)server_err;
> > +
> > +	close(client_fd);
> > +
> > +close_bpf_object:
> > +	bpf_object__close(obj);
> > +	close(server_fd);
> 
> server_fd could be -1 here.
I've initialized it to -1 so we can close(-1) here and not close(some
random data). Shouldn't be a problem, right?

The order is backwards though, should be:
close(server_fd);
bpf_object__close(obj);

I can probably add a label for bpf_object__close case to avoid this close(-1).
Will do for a v2.

> > +	return err;
> > +}
> > +
> > +int main(int args, char **argv)
> > +{
> > +	int cgroup_fd;
> > +	int err = EXIT_SUCCESS;
> > +
> > +	if (setup_cgroup_environment())
> > +		return err;
> > +
> > +	cgroup_fd = create_and_get_cgroup(CG_PATH);
> > +	if (cgroup_fd < 0)
> > +		goto cleanup_cgroup_env;
> > +
> > +	if (join_cgroup(CG_PATH))
> > +		goto cleanup_cgroup;
> > +
> > +	if (run_test(cgroup_fd))
> > +		err = EXIT_FAILURE;
> > +
> > +	printf("test_sockopt_inherit: %s\n",
> > +	       err == EXIT_SUCCESS ? "PASSED" : "FAILED");
> > +
> > +cleanup_cgroup:
> > +	close(cgroup_fd);
> > +cleanup_cgroup_env:
> > +	cleanup_cgroup_environment();
> > +	return err;
> > +}
> > 

^ permalink raw reply

* Re: [PATCH bpf-next 1/3] bpf: support cloning sk storage on accept()
From: Yonghong Song @ 2019-08-08  0:18 UTC (permalink / raw)
  To: Stanislav Fomichev
  Cc: Stanislav Fomichev, netdev@vger.kernel.org, bpf@vger.kernel.org,
	davem@davemloft.net, ast@kernel.org, daniel@iogearbox.net,
	Martin Lau
In-Reply-To: <20190808000533.GA2820@mini-arch>



On 8/7/19 5:05 PM, Stanislav Fomichev wrote:
> On 08/07, Yonghong Song wrote:
>>
>>
>> On 8/7/19 8:47 AM, Stanislav Fomichev wrote:
>>> Add new helper bpf_sk_storage_clone which optionally clones sk storage
>>> and call it from bpf_sk_storage_clone. Reuse the gap in
>>> bpf_sk_storage_elem to store clone/non-clone flag.
>>>
>>> Cc: Martin KaFai Lau <kafai@fb.com>
>>> Signed-off-by: Stanislav Fomichev <sdf@google.com>
>>
>> I tried to see whether I can find any missing race conditions in
>> the code but I failed. So except a minor comments below,
> Thanks for a review!
> 
>> Acked-by: Yonghong Song <yhs@fb.com>
>>
>>> ---
>>>    include/net/bpf_sk_storage.h |  10 ++++
>>>    include/uapi/linux/bpf.h     |   1 +
>>>    net/core/bpf_sk_storage.c    | 102 +++++++++++++++++++++++++++++++++--
>>>    net/core/sock.c              |   9 ++--
>>>    4 files changed, 115 insertions(+), 7 deletions(-)
>>>
>>> diff --git a/include/net/bpf_sk_storage.h b/include/net/bpf_sk_storage.h
>>> index b9dcb02e756b..8e4f831d2e52 100644
>>> --- a/include/net/bpf_sk_storage.h
>>> +++ b/include/net/bpf_sk_storage.h
>>> @@ -10,4 +10,14 @@ void bpf_sk_storage_free(struct sock *sk);
>>>    extern const struct bpf_func_proto bpf_sk_storage_get_proto;
>>>    extern const struct bpf_func_proto bpf_sk_storage_delete_proto;
>>>    
>>> +#ifdef CONFIG_BPF_SYSCALL
>>> +int bpf_sk_storage_clone(const struct sock *sk, struct sock *newsk);
>>> +#else
>>> +static inline int bpf_sk_storage_clone(const struct sock *sk,
>>> +				       struct sock *newsk)
>>> +{
>>> +	return 0;
>>> +}
>>> +#endif
>>> +
>>>    #endif /* _BPF_SK_STORAGE_H */
>>> diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
>>> index 4393bd4b2419..00459ca4c8cf 100644
>>> --- a/include/uapi/linux/bpf.h
>>> +++ b/include/uapi/linux/bpf.h
>>> @@ -2931,6 +2931,7 @@ enum bpf_func_id {
>>>    
>>>    /* BPF_FUNC_sk_storage_get flags */
>>>    #define BPF_SK_STORAGE_GET_F_CREATE	(1ULL << 0)
>>> +#define BPF_SK_STORAGE_GET_F_CLONE	(1ULL << 1)
>>>    
>>>    /* Mode for BPF_FUNC_skb_adjust_room helper. */
>>>    enum bpf_adj_room_mode {
>>> diff --git a/net/core/bpf_sk_storage.c b/net/core/bpf_sk_storage.c
>>> index 94c7f77ecb6b..b6dea67965bc 100644
>>> --- a/net/core/bpf_sk_storage.c
>>> +++ b/net/core/bpf_sk_storage.c
>>> @@ -12,6 +12,9 @@
>>>    
>>>    static atomic_t cache_idx;
>>>    
>>> +#define BPF_SK_STORAGE_GET_F_MASK	(BPF_SK_STORAGE_GET_F_CREATE | \
>>> +					 BPF_SK_STORAGE_GET_F_CLONE)
>>> +
>>>    struct bucket {
>>>    	struct hlist_head list;
>>>    	raw_spinlock_t lock;
>>> @@ -66,7 +69,8 @@ struct bpf_sk_storage_elem {
>>>    	struct hlist_node snode;	/* Linked to bpf_sk_storage */
>>>    	struct bpf_sk_storage __rcu *sk_storage;
>>>    	struct rcu_head rcu;
>>> -	/* 8 bytes hole */
>>> +	u8 clone:1;
>>> +	/* 7 bytes hole */
>>>    	/* The data is stored in aother cacheline to minimize
>>>    	 * the number of cachelines access during a cache hit.
>>>    	 */
>>> @@ -509,7 +513,7 @@ static int sk_storage_delete(struct sock *sk, struct bpf_map *map)
>>>    	return 0;
>>>    }
>>>    
>>> -/* Called by __sk_destruct() */
>>> +/* Called by __sk_destruct() & bpf_sk_storage_clone() */
>>>    void bpf_sk_storage_free(struct sock *sk)
>>>    {
>>>    	struct bpf_sk_storage_elem *selem;
>>> @@ -739,19 +743,106 @@ static int bpf_fd_sk_storage_delete_elem(struct bpf_map *map, void *key)
>>>    	return err;
>>>    }
>>>    
>>> +static struct bpf_sk_storage_elem *
>>> +bpf_sk_storage_clone_elem(struct sock *newsk,
>>> +			  struct bpf_sk_storage_map *smap,
>>> +			  struct bpf_sk_storage_elem *selem)
>>> +{
>>> +	struct bpf_sk_storage_elem *copy_selem;
>>> +
>>> +	copy_selem = selem_alloc(smap, newsk, NULL, true);
>>> +	if (!copy_selem)
>>> +		return ERR_PTR(-ENOMEM);
>>> +
>>> +	if (map_value_has_spin_lock(&smap->map))
>>> +		copy_map_value_locked(&smap->map, SDATA(copy_selem)->data,
>>> +				      SDATA(selem)->data, true);
>>> +	else
>>> +		copy_map_value(&smap->map, SDATA(copy_selem)->data,
>>> +			       SDATA(selem)->data);
>>> +
>>> +	return copy_selem;
>>> +}
>>> +
>>> +int bpf_sk_storage_clone(const struct sock *sk, struct sock *newsk)
>>> +{
>>> +	struct bpf_sk_storage *new_sk_storage = NULL;
>>> +	struct bpf_sk_storage *sk_storage;
>>> +	struct bpf_sk_storage_elem *selem;
>>> +	int ret;
>>> +
>>> +	RCU_INIT_POINTER(newsk->sk_bpf_storage, NULL);
>>> +
>>> +	rcu_read_lock();
>>> +	sk_storage = rcu_dereference(sk->sk_bpf_storage);
>>> +
>>> +	if (!sk_storage || hlist_empty(&sk_storage->list))
>>> +		goto out;
>>> +
>>> +	hlist_for_each_entry_rcu(selem, &sk_storage->list, snode) {
>>> +		struct bpf_sk_storage_map *smap;
>>> +		struct bpf_sk_storage_elem *copy_selem;
>>> +
>>> +		if (!selem->clone)
>>> +			continue;
>>> +
>>> +		smap = rcu_dereference(SDATA(selem)->smap);
>>> +		if (!smap)
>>> +			continue;
>>> +
>>> +		copy_selem = bpf_sk_storage_clone_elem(newsk, smap, selem);
>>> +		if (IS_ERR(copy_selem)) {
>>> +			ret = PTR_ERR(copy_selem);
>>> +			goto err;
>>> +		}
>>> +
>>> +		if (!new_sk_storage) {
>>> +			ret = sk_storage_alloc(newsk, smap, copy_selem);
>>> +			if (ret) {
>>> +				kfree(copy_selem);
>>> +				atomic_sub(smap->elem_size,
>>> +					   &newsk->sk_omem_alloc);
>>> +				goto err;
>>> +			}
>>> +
>>> +			new_sk_storage = rcu_dereference(copy_selem->sk_storage);
>>> +			continue;
>>> +		}
>>> +
>>> +		raw_spin_lock_bh(&new_sk_storage->lock);
>>> +		selem_link_map(smap, copy_selem);
>>> +		__selem_link_sk(new_sk_storage, copy_selem);
>>> +		raw_spin_unlock_bh(&new_sk_storage->lock);
>>
>> Considering in this particular case, new socket is not visible to
>> outside world yet (both kernel and user space), map_delete/map_update
>> operations are not applicable in this situation, so
>> the above raw_spin_lock_bh() probably not needed.
> I agree, it's doing nothing, but __selem_link_sk has the following comment:
> /* sk_storage->lock must be held and sk_storage->list cannot be empty */
> 
> Just wanted to keep that invariant for this call site as well (in case
> we add some lockdep enforcement or smth else). WDYT?

Agree. Let us keep the locking to be consistent with other uses in
the same file. This is not the critical path.


^ permalink raw reply

* High-Speed WiFi Booster Takes the World By Storm
From:  Julie Rogers @ 2019-08-08  0:06 UTC (permalink / raw)
  To: netdev

Internet providers make big money by overcharging you for faster internet lines. Could this little device really be the one solution that we've all been waiting for?

Frequency: 2.4Ghz
Wireless Rate: 300Mbps
Interface: 1 10/100Mbps WAN/LAN RJ45 Ports

Amazing new technology find out here! http://bit.ly/asdf411gt

------------------------
If you'd prefer not to receive future emails, Unsubscribe Here http://bit.ly/itr4fgy
11041 Santa Monica Blvd. #301 Los Angeles, CA 90025

^ permalink raw reply

* Re: WARNING: ODEBUG bug in netdev_freemem (2)
From: syzbot @ 2019-08-08  0:25 UTC (permalink / raw)
  To: alexander.h.duyck, amritha.nambiar, andriy.shevchenko, avagin,
	davem, dmitry.torokhov, dvyukov, eric.dumazet, f.fainelli, gregkh,
	idosch, jiri, kimbrownkd, linux-kernel, netdev, syzkaller-bugs,
	tglx, tyhicks, wanghai26, yuehaibing
In-Reply-To: <000000000000d6a8ba058c0df076@google.com>

syzbot has found a reproducer for the following crash on:

HEAD commit:    13dfb3fa Merge git://git.kernel.org/pub/scm/linux/kernel/g..
git tree:       net-next
console output: https://syzkaller.appspot.com/x/log.txt?x=1671e69a600000
kernel config:  https://syzkaller.appspot.com/x/.config?x=d4cf1ffb87d590d7
dashboard link: https://syzkaller.appspot.com/bug?extid=c4521ac872a4ccc3afec
compiler:       gcc (GCC) 9.0.0 20181231 (experimental)
syz repro:      https://syzkaller.appspot.com/x/repro.syz?x=170542c2600000

IMPORTANT: if you fix the bug, please add the following tag to the commit:
Reported-by: syzbot+c4521ac872a4ccc3afec@syzkaller.appspotmail.com

bond0 (unregistering): (slave bond_slave_1): Releasing backup interface
bond0 (unregistering): (slave bond_slave_0): Releasing backup interface
bond0 (unregistering): Released all slaves
------------[ cut here ]------------
ODEBUG: free active (active state 0) object type: timer_list hint:  
delayed_work_timer_fn+0x0/0x90 arch/x86/include/asm/paravirt.h:768
WARNING: CPU: 0 PID: 9919 at lib/debugobjects.c:481  
debug_print_object+0x168/0x250 lib/debugobjects.c:481
Kernel panic - not syncing: panic_on_warn set ...
CPU: 0 PID: 9919 Comm: kworker/u4:6 Not tainted 5.3.0-rc3+ #122
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS  
Google 01/01/2011
Workqueue: netns cleanup_net
Call Trace:
  __dump_stack lib/dump_stack.c:77 [inline]
  dump_stack+0x172/0x1f0 lib/dump_stack.c:113
  panic+0x2dc/0x755 kernel/panic.c:219
  __warn.cold+0x20/0x4c kernel/panic.c:576
  report_bug+0x263/0x2b0 lib/bug.c:186
  fixup_bug arch/x86/kernel/traps.c:179 [inline]
  fixup_bug arch/x86/kernel/traps.c:174 [inline]
  do_error_trap+0x11b/0x200 arch/x86/kernel/traps.c:272
  do_invalid_op+0x37/0x50 arch/x86/kernel/traps.c:291
  invalid_op+0x23/0x30 arch/x86/entry/entry_64.S:1028
RIP: 0010:debug_print_object+0x168/0x250 lib/debugobjects.c:481
Code: dd e0 32 c6 87 48 89 fa 48 c1 ea 03 80 3c 02 00 0f 85 b5 00 00 00 48  
8b 14 dd e0 32 c6 87 48 c7 c7 e0 27 c6 87 e8 70 cd 05 fe <0f> 0b 83 05 a3  
7c 67 06 01 48 83 c4 20 5b 41 5c 41 5d 41 5e 5d c3
RSP: 0018:ffff8880898ff838 EFLAGS: 00010086
RAX: 0000000000000000 RBX: 0000000000000003 RCX: 0000000000000000
RDX: 0000000000000000 RSI: ffffffff815c3ba6 RDI: ffffed101131fef9
RBP: ffff8880898ff878 R08: ffff88808ece42c0 R09: ffffed1015d04101
R10: ffffed1015d04100 R11: ffff8880ae820807 R12: 0000000000000001
R13: ffffffff88db6660 R14: ffffffff8161da40 R15: ffff88808f639af0
  __debug_check_no_obj_freed lib/debugobjects.c:963 [inline]
  debug_check_no_obj_freed+0x2d4/0x43f lib/debugobjects.c:994
  kfree+0xf8/0x2c0 mm/slab.c:3755
  kvfree+0x61/0x70 mm/util.c:488
  netdev_freemem+0x4c/0x60 net/core/dev.c:9093
  netdev_release+0x86/0xb0 net/core/net-sysfs.c:1635
  device_release+0x7a/0x210 drivers/base/core.c:1064
  kobject_cleanup lib/kobject.c:693 [inline]
  kobject_release lib/kobject.c:722 [inline]
  kref_put include/linux/kref.h:65 [inline]
  kobject_put.cold+0x289/0x2e6 lib/kobject.c:739
  netdev_run_todo+0x53b/0x7b0 net/core/dev.c:8998
  rtnl_unlock+0xe/0x10 net/core/rtnetlink.c:112
  default_device_exit_batch+0x358/0x410 net/core/dev.c:9781
  ops_exit_list.isra.0+0xfc/0x150 net/core/net_namespace.c:175
  cleanup_net+0x4e2/0xa70 net/core/net_namespace.c:594
  process_one_work+0x9af/0x1740 kernel/workqueue.c:2269
  worker_thread+0x98/0xe40 kernel/workqueue.c:2415
  kthread+0x361/0x430 kernel/kthread.c:255
  ret_from_fork+0x24/0x30 arch/x86/entry/entry_64.S:352
Kernel Offset: disabled
Rebooting in 86400 seconds..


^ permalink raw reply

* [PATCH bpf-next] tools/bpf: fix core_reloc.c compilation error
From: Yonghong Song @ 2019-08-08  0:38 UTC (permalink / raw)
  To: bpf, netdev
  Cc: Alexei Starovoitov, Daniel Borkmann, kernel-team, Yonghong Song,
	Andrii Nakryiko

On my local machine, I have the following compilation errors:
=====
  In file included from prog_tests/core_reloc.c:3:0:
  ./progs/core_reloc_types.h:517:46: error: expected ‘=’, ‘,’, ‘;’, ‘asm’ or ‘__attribute__’ before ‘fancy_char_ptr_t’
 typedef const char * const volatile restrict fancy_char_ptr_t;
                                              ^
  ./progs/core_reloc_types.h:527:2: error: unknown type name ‘fancy_char_ptr_t’
    fancy_char_ptr_t d;
    ^
=====

I am using gcc 4.8.5. Later compilers may change their behavior not emitting the
error. Nevertheless, let us fix the issue. "restrict" can be tested
without typedef.

Fixes: 9654e2ae908e ("selftests/bpf: add CO-RE relocs modifiers/typedef tests")
Cc: Andrii Nakryiko <andriin@fb.com>
Signed-off-by: Yonghong Song <yhs@fb.com>
---
 tools/testing/selftests/bpf/progs/core_reloc_types.h | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/tools/testing/selftests/bpf/progs/core_reloc_types.h b/tools/testing/selftests/bpf/progs/core_reloc_types.h
index 10a252b6da55..f686a8138d90 100644
--- a/tools/testing/selftests/bpf/progs/core_reloc_types.h
+++ b/tools/testing/selftests/bpf/progs/core_reloc_types.h
@@ -514,7 +514,7 @@ typedef arr1_t arr2_t;
 typedef arr2_t arr3_t;
 typedef arr3_t arr4_t;
 
-typedef const char * const volatile restrict fancy_char_ptr_t;
+typedef const char * const volatile fancy_char_ptr_t;
 
 typedef core_reloc_mods_substruct_t core_reloc_mods_substruct_tt;
 
-- 
2.17.1


^ permalink raw reply related

* [PATCH] vhost: do not reference a file that does not exist
From: egranata @ 2019-08-08  0:52 UTC (permalink / raw)
  To: linux-kernel
  Cc: mst, jasowang, kvm, virtualization, netdev, trivial, egranata

From: Enrico Granata <egranata@google.com>

lguest was removed from the mainline kernel in late 2017.

Signed-off-by: Enrico Granata <egranata@google.com>
---
 drivers/vhost/vhost.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/drivers/vhost/vhost.c b/drivers/vhost/vhost.c
index 0536f8526359..2c376cb66971 100644
--- a/drivers/vhost/vhost.c
+++ b/drivers/vhost/vhost.c
@@ -4,8 +4,8 @@
  *
  * Author: Michael S. Tsirkin <mst@redhat.com>
  *
- * Inspiration, some code, and most witty comments come from
- * Documentation/virtual/lguest/lguest.c, by Rusty Russell
+ * Inspiration, some code, and most witty comments come from lguest.c
+ * by Rusty Russell
  *
  * Generic code for virtio server in host kernel.
  */
-- 
2.22.0.770.g0f2c4a37fd-goog


^ permalink raw reply related

* Re: [PATCH] Fix non-kerneldoc comment in realtek/rtlwifi/usb.c
From: Larry Finger @ 2019-08-08  0:57 UTC (permalink / raw)
  To: Valdis Klētnieks, Ping-Ke Shih, Kalle Valo, David S. Miller
  Cc: linux-wireless, netdev, linux-kernel
In-Reply-To: <5924.1565217560@turing-police>

On 8/7/19 5:39 PM, Valdis Klētnieks wrote:

When this driver was originally entered, a line with "/*" was flagged by 
checkpatch.pl. In fact, when I make your change, I get

WARNING: networking block comments don't use an empty /* line, use /* Comment...
#243: FILE: drivers/net/wireless/realtek/rtlwifi/usb.c:243:
+/*
+ *

To avoid a loop of "fixing" compiler/checkpatch warnings, you need to put the 
first real line of the comment on the line of the "/*". For the first of your 
patches, that results in

diff --git a/drivers/net/wireless/realtek/rtlwifi/usb.c 
b/drivers/net/wireless/realtek/rtlwifi/usb.c
index 34d68dbf4b4c..f89ceac25eff 100644
--- a/drivers/net/wireless/realtek/rtlwifi/usb.c
+++ b/drivers/net/wireless/realtek/rtlwifi/usb.c
@@ -239,10 +239,7 @@ static void _rtl_usb_io_handler_release(struct ieee80211_hw 
*hw)
         mutex_destroy(&rtlpriv->io.bb_mutex);
  }
-/**
- *
- *     Default aggregation handler. Do nothing and just return the oldest skb.
- */
+/* Default aggregation handler. Do nothing and just return the oldest skb.  */
  static struct sk_buff *_none_usb_tx_aggregate_hdl(struct ieee80211_hw *hw,
                                                   struct sk_buff_head *list)

Because you merely shift the warning to a different tool,

NACK.

Larry

^ permalink raw reply related

* Re: [PATCH net] net: phy: rtl8211f: do a double read to get real time link status
From: Yonglong Liu @ 2019-08-08  1:15 UTC (permalink / raw)
  To: Heiner Kallweit, davem, andrew
  Cc: netdev, linux-kernel, linuxarm, salil.mehta, yisen.zhuang,
	shiju.jose
In-Reply-To: <d67831ab-8902-a653-3db9-b2f55adacabd@gmail.com>



On 2019/8/8 0:47, Heiner Kallweit wrote:
> On 07.08.2019 15:16, Yonglong Liu wrote:
>> [   27.232781] hns3 0000:bd:00.3 eth7: net open
>> [   27.237303] 8021q: adding VLAN 0 to HW filter on device eth7
>> [   27.242972] IPv6: ADDRCONF(NETDEV_CHANGE): eth7: link becomes ready
>> [   27.244449] hns3 0000:bd:00.3: invalid speed (-1)
>> [   27.253904] hns3 0000:bd:00.3 eth7: failed to adjust link.
>> [   27.259379] RTL8211F Gigabit Ethernet mii-0000:bd:00.3:07: PHY state change UP -> RUNNING
>> [   27.924903] hns3 0000:bd:00.3 eth7: link up
>> [   28.280479] RTL8211F Gigabit Ethernet mii-0000:bd:00.3:07: PHY state change RUNNING -> NOLINK
>> [   29.208452] hns3 0000:bd:00.3 eth7: link down
>> [   32.376745] RTL8211F Gigabit Ethernet mii-0000:bd:00.3:07: PHY state change NOLINK -> RUNNING
>> [   33.208448] hns3 0000:bd:00.3 eth7: link up
>> [   35.253821] hns3 0000:bd:00.3 eth7: net stop
>> [   35.258270] hns3 0000:bd:00.3 eth7: link down
>>
>> When using rtl8211f in polling mode, may get a invalid speed,
>> because of reading a fake link up and autoneg complete status
>> immediately after starting autoneg:
>>
>>         ifconfig-1176  [007] ....    27.232763: mdio_access: mii-0000:bd:00.3 read  phy:0x07 reg:0x00 val:0x1040
>>   kworker/u257:1-670   [015] ....    27.232805: mdio_access: mii-0000:bd:00.3 read  phy:0x07 reg:0x04 val:0x01e1
>>   kworker/u257:1-670   [015] ....    27.232815: mdio_access: mii-0000:bd:00.3 write phy:0x07 reg:0x04 val:0x05e1
>>   kworker/u257:1-670   [015] ....    27.232869: mdio_access: mii-0000:bd:00.3 read  phy:0x07 reg:0x01 val:0x79ad
>>   kworker/u257:1-670   [015] ....    27.232904: mdio_access: mii-0000:bd:00.3 read  phy:0x07 reg:0x09 val:0x0200
>>   kworker/u257:1-670   [015] ....    27.232940: mdio_access: mii-0000:bd:00.3 read  phy:0x07 reg:0x00 val:0x1040
>>   kworker/u257:1-670   [015] ....    27.232949: mdio_access: mii-0000:bd:00.3 write phy:0x07 reg:0x00 val:0x1240
>>   kworker/u257:1-670   [015] ....    27.233003: mdio_access: mii-0000:bd:00.3 read  phy:0x07 reg:0x01 val:0x79ad
>>   kworker/u257:1-670   [015] ....    27.233039: mdio_access: mii-0000:bd:00.3 read  phy:0x07 reg:0x0a val:0x3002
>>   kworker/u257:1-670   [015] ....    27.233074: mdio_access: mii-0000:bd:00.3 read  phy:0x07 reg:0x09 val:0x0200
>>   kworker/u257:1-670   [015] ....    27.233110: mdio_access: mii-0000:bd:00.3 read  phy:0x07 reg:0x05 val:0x0000
>>   kworker/u257:1-670   [000] ....    28.280475: mdio_access: mii-0000:bd:00.3 read  phy:0x07 reg:0x01 val:0x7989
>>   kworker/u257:1-670   [000] ....    29.304471: mdio_access: mii-0000:bd:00.3 read  phy:0x07 reg:0x01 val:0x7989
>>
>> According to the datasheet of rtl8211f, to get the real time
>> link status, need to read MII_BMSR twice.
>>
>> This patch add a read_status hook for rtl8211f, and do a fake
>> phy_read before genphy_read_status(), so that can get real link
>> status in genphy_read_status().
>>
>> Signed-off-by: Yonglong Liu <liuyonglong@huawei.com>
>> ---
>>  drivers/net/phy/realtek.c | 13 +++++++++++++
>>  1 file changed, 13 insertions(+)
>>
> Is this an accidental resubmit? Because we discussed this in
> https://marc.info/?t=156413509900003&r=1&w=2 and a fix has
> been applied already.
> 
> Heiner
> 
> .
> 

In https://marc.info/?t=156413509900003&r=1&w=2 , the invalid speed
recurrence rate is almost 100%, and I had test the solution about
5 times and it works. But yesterday it happen again suddenly, and than
I fount that the recurrence rate reduce to 10%. This time we get 0x79ad
after autoneg started which is not 0x798d from last discussion.



^ permalink raw reply

* [PATCH v5 bpf-next] BPF: helpers: New helper to obtain namespace data from current task
From: Carlos Antonio Neira Bustos @ 2019-08-08  1:22 UTC (permalink / raw)
  To: netdev; +Cc: yhs, ebiederm, brouer, quentin.monnet, cneirabustos

The code has been modified to avoid syscalls that could sleep.
Please let me know if any other modification is needed.

From be0384c0fa209a78c1567936e8db4e35b9a7c0f8 Mon Sep 17 00:00:00 2001
From: Carlos <cneirabustos@gmail.com>
Date: Wed, 7 Aug 2019 20:04:30 -0400
Subject: [PATCH] [PATCH v5 bpf-next] BPF: New helper to obtain namespace data 
 from current task

This helper obtains the active namespace from current and returns pid, tgid,
device and namespace id as seen from that namespace, allowing to instrument
a process inside a container.
Device is read from /proc/self/ns/pid, as in the future it's possible that
different pid_ns files may belong to different devices, according
to the discussion between Eric Biederman and Yonghong in 2017 linux plumbers
conference.
Currently bpf_get_current_pid_tgid(), is used to do pid filtering in bcc's
scripts but this helper returns the pid as seen by the root namespace which is
fine when a bcc script is not executed inside a container.
When the process of interest is inside a container, pid filtering will not work
if bpf_get_current_pid_tgid() is used. This helper addresses this limitation
returning the pid as it's seen by the current namespace where the script is
executing.

This helper has the same use cases as bpf_get_current_pid_tgid() as it can be
used to do pid filtering even inside a container.

For example a bcc script using bpf_get_current_pid_tgid() (tools/funccount.py):

        u32 pid = bpf_get_current_pid_tgid() >> 32;
        if (pid != <pid_arg_passed_in>)
                return 0;
Could be modified to use bpf_get_current_pidns_info() as follows:

        struct bpf_pidns pidns;
        bpf_get_current_pidns_info(&pidns, sizeof(struct bpf_pidns));
        u32 pid = pidns.tgid;
        u32 nsid = pidns.nsid;
        if ((pid != <pid_arg_passed_in>) && (nsid != <nsid_arg_passed_in>))
                return 0;

To find out the name PID namespace id of a process, you could use this command:

$ ps -h -o pidns -p <pid_of_interest>

Or this other command:

$ ls -Li /proc/<pid_of_interest>/ns/pid

Signed-off-by: Carlos Neira <cneirabustos@gmail.com>
---
 fs/namei.c                                         |   2 +-
 include/linux/bpf.h                                |   1 +
 include/linux/namei.h                              |   4 +
 include/uapi/linux/bpf.h                           |  29 ++++-
 kernel/bpf/core.c                                  |   1 +
 kernel/bpf/helpers.c                               |  78 ++++++++++++
 kernel/trace/bpf_trace.c                           |   2 +
 samples/bpf/Makefile                               |   3 +
 samples/bpf/trace_ns_info_user.c                   |  35 ++++++
 samples/bpf/trace_ns_info_user_kern.c              |  44 +++++++
 tools/include/uapi/linux/bpf.h                     |  29 ++++-
 tools/testing/selftests/bpf/Makefile               |   2 +-
 tools/testing/selftests/bpf/bpf_helpers.h          |   3 +
 .../testing/selftests/bpf/progs/test_pidns_kern.c  |  51 ++++++++
 tools/testing/selftests/bpf/test_pidns.c           | 138 +++++++++++++++++++++
 15 files changed, 418 insertions(+), 4 deletions(-)
 create mode 100644 samples/bpf/trace_ns_info_user.c
 create mode 100644 samples/bpf/trace_ns_info_user_kern.c
 create mode 100644 tools/testing/selftests/bpf/progs/test_pidns_kern.c
 create mode 100644 tools/testing/selftests/bpf/test_pidns.c

diff --git a/fs/namei.c b/fs/namei.c
index 209c51a5226c..d1eca36972d2 100644
--- a/fs/namei.c
+++ b/fs/namei.c
@@ -19,7 +19,6 @@
 #include <linux/export.h>
 #include <linux/kernel.h>
 #include <linux/slab.h>
-#include <linux/fs.h>
 #include <linux/namei.h>
 #include <linux/pagemap.h>
 #include <linux/fsnotify.h>
@@ -2355,6 +2354,7 @@ int filename_lookup(int dfd, struct filename *name, unsigned flags,
 	putname(name);
 	return retval;
 }
+EXPORT_SYMBOL(filename_lookup);
 
 /* Returns 0 and nd will be valid on success; Retuns error, otherwise. */
 static int path_parentat(struct nameidata *nd, unsigned flags,
diff --git a/include/linux/bpf.h b/include/linux/bpf.h
index f9a506147c8a..e4adf5e05afd 100644
--- a/include/linux/bpf.h
+++ b/include/linux/bpf.h
@@ -1050,6 +1050,7 @@ extern const struct bpf_func_proto bpf_get_local_storage_proto;
 extern const struct bpf_func_proto bpf_strtol_proto;
 extern const struct bpf_func_proto bpf_strtoul_proto;
 extern const struct bpf_func_proto bpf_tcp_sock_proto;
+extern const struct bpf_func_proto bpf_get_current_pidns_info_proto;
 
 /* Shared helpers among cBPF and eBPF. */
 void bpf_user_rnd_init_once(void);
diff --git a/include/linux/namei.h b/include/linux/namei.h
index 9138b4471dbf..2c24e8c71d46 100644
--- a/include/linux/namei.h
+++ b/include/linux/namei.h
@@ -6,6 +6,7 @@
 #include <linux/path.h>
 #include <linux/fcntl.h>
 #include <linux/errno.h>
+#include <linux/fs.h>
 
 enum { MAX_NESTED_LINKS = 8 };
 
@@ -97,6 +98,9 @@ extern void unlock_rename(struct dentry *, struct dentry *);
 
 extern void nd_jump_link(struct path *path);
 
+extern int filename_lookup(int dfd, struct filename *name, unsigned int flags,
+		    struct path *path, struct path *root);
+
 static inline void nd_terminate_link(void *name, size_t len, size_t maxlen)
 {
 	((char *) name)[min(len, maxlen)] = '\0';
diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
index 4393bd4b2419..6f601f7106e2 100644
--- a/include/uapi/linux/bpf.h
+++ b/include/uapi/linux/bpf.h
@@ -2741,6 +2741,26 @@ union bpf_attr {
  *		**-EOPNOTSUPP** kernel configuration does not enable SYN cookies
  *
  *		**-EPROTONOSUPPORT** IP packet version is not 4 or 6
+ *
+ * int bpf_get_current_pidns_info(struct bpf_pidns_info *pidns, u32 size_of_pidns)
+ *	Description
+ *		Copies into *pidns* pid, namespace id and tgid as seen by the
+ *		current namespace and also device from /proc/self/ns/pid.
+ *		*size_of_pidns* must be the size of *pidns*
+ *
+ *		This helper is used when pid filtering is needed inside a
+ *		container as bpf_get_current_tgid() helper returns always the
+ *		pid id as seen by the root namespace.
+ *	Return
+ *		0 on success
+ *
+ *		**-EINVAL**  if unable to get ns, pid or tgid of current task.
+ *		Or if size_of_pidns is not valid.
+ *
+ *		**-ENOMEM**  if allocation fails.
+ *
+ *		If unable to get the inode from /proc/self/ns/pid an error code
+ *		will be returned.
  */
 #define __BPF_FUNC_MAPPER(FN)		\
 	FN(unspec),			\
@@ -2853,7 +2873,8 @@ union bpf_attr {
 	FN(sk_storage_get),		\
 	FN(sk_storage_delete),		\
 	FN(send_signal),		\
-	FN(tcp_gen_syncookie),
+	FN(tcp_gen_syncookie),		\
+	FN(get_current_pidns_info),
 
 /* integer value in 'imm' field of BPF_CALL instruction selects which helper
  * function eBPF program intends to call
@@ -3604,4 +3625,10 @@ struct bpf_sockopt {
 	__s32	retval;
 };
 
+struct bpf_pidns_info {
+	__u32 dev;
+	__u32 nsid;
+	__u32 tgid;
+	__u32 pid;
+};
 #endif /* _UAPI__LINUX_BPF_H__ */
diff --git a/kernel/bpf/core.c b/kernel/bpf/core.c
index 8191a7db2777..3159f2a0188c 100644
--- a/kernel/bpf/core.c
+++ b/kernel/bpf/core.c
@@ -2038,6 +2038,7 @@ const struct bpf_func_proto bpf_get_current_uid_gid_proto __weak;
 const struct bpf_func_proto bpf_get_current_comm_proto __weak;
 const struct bpf_func_proto bpf_get_current_cgroup_id_proto __weak;
 const struct bpf_func_proto bpf_get_local_storage_proto __weak;
+const struct bpf_func_proto bpf_get_current_pidns_info __weak;
 
 const struct bpf_func_proto * __weak bpf_get_trace_printk_proto(void)
 {
diff --git a/kernel/bpf/helpers.c b/kernel/bpf/helpers.c
index 5e28718928ca..571f24077db2 100644
--- a/kernel/bpf/helpers.c
+++ b/kernel/bpf/helpers.c
@@ -11,6 +11,12 @@
 #include <linux/uidgid.h>
 #include <linux/filter.h>
 #include <linux/ctype.h>
+#include <linux/pid_namespace.h>
+#include <linux/major.h>
+#include <linux/stat.h>
+#include <linux/namei.h>
+#include <linux/version.h>
+
 
 #include "../../lib/kstrtox.h"
 
@@ -312,6 +318,78 @@ void copy_map_value_locked(struct bpf_map *map, void *dst, void *src,
 	preempt_enable();
 }
 
+BPF_CALL_2(bpf_get_current_pidns_info, struct bpf_pidns_info *, pidns_info, u32,
+	 size)
+{
+	const char *name = "/proc/self/ns/pid";
+	struct pid_namespace *pidns = NULL;
+	struct filename *tmp = NULL;
+	int len = strlen(name) + 1;
+	struct inode *inode;
+	struct path kp;
+	pid_t tgid = 0;
+	pid_t pid = 0;
+	int ret;
+
+	if (unlikely(size != sizeof(struct bpf_pidns_info)))
+		return -EINVAL;
+
+	pidns = task_active_pid_ns(current);
+
+	if (unlikely(!pidns))
+		goto clear;
+
+	pidns_info->nsid =  pidns->ns.inum;
+	pid = task_pid_nr_ns(current, pidns);
+
+	if (unlikely(!pid))
+		goto clear;
+
+	tgid = task_tgid_nr_ns(current, pidns);
+
+	if (unlikely(!tgid))
+		goto clear;
+
+	pidns_info->tgid = (u32) tgid;
+	pidns_info->pid = (u32) pid;
+
+	tmp = kmem_cache_alloc(names_cachep, GFP_ATOMIC);
+	if (unlikely(!tmp)) {
+		memset((void *)pidns_info, 0, (size_t) size);
+		return -ENOMEM;
+	}
+
+	memcpy((char *)tmp->name, name, len);
+	tmp->uptr = NULL;
+	tmp->aname = NULL;
+	tmp->refcnt = 1;
+
+	ret = filename_lookup(AT_FDCWD, tmp, 0, &kp, NULL);
+
+	if (ret) {
+		memset((void *)pidns_info, 0, (size_t) size);
+		return ret;
+	}
+
+	inode = d_backing_inode(kp.dentry);
+	pidns_info->dev = inode->i_sb->s_dev;
+
+	return 0;
+
+clear:
+	memset((void *)pidns_info, 0, (size_t) size);
+
+	return -EINVAL;
+}
+
+const struct bpf_func_proto bpf_get_current_pidns_info_proto = {
+	.func	= bpf_get_current_pidns_info,
+	.gpl_only	= false,
+	.ret_type	= RET_INTEGER,
+	.arg1_type	= ARG_PTR_TO_UNINIT_MEM,
+	.arg2_type	= ARG_CONST_SIZE,
+};
+
 #ifdef CONFIG_CGROUPS
 BPF_CALL_0(bpf_get_current_cgroup_id)
 {
diff --git a/kernel/trace/bpf_trace.c b/kernel/trace/bpf_trace.c
index ca1255d14576..5e1dc22765a5 100644
--- a/kernel/trace/bpf_trace.c
+++ b/kernel/trace/bpf_trace.c
@@ -709,6 +709,8 @@ tracing_func_proto(enum bpf_func_id func_id, const struct bpf_prog *prog)
 #endif
 	case BPF_FUNC_send_signal:
 		return &bpf_send_signal_proto;
+	case BPF_FUNC_get_current_pidns_info:
+		return &bpf_get_current_pidns_info_proto;
 	default:
 		return NULL;
 	}
diff --git a/samples/bpf/Makefile b/samples/bpf/Makefile
index 1d9be26b4edd..238453ff27d2 100644
--- a/samples/bpf/Makefile
+++ b/samples/bpf/Makefile
@@ -53,6 +53,7 @@ hostprogs-y += task_fd_query
 hostprogs-y += xdp_sample_pkts
 hostprogs-y += ibumad
 hostprogs-y += hbm
+hostprogs-y += trace_ns_info
 
 # Libbpf dependencies
 LIBBPF = $(TOOLS_PATH)/lib/bpf/libbpf.a
@@ -109,6 +110,7 @@ task_fd_query-objs := bpf_load.o task_fd_query_user.o $(TRACE_HELPERS)
 xdp_sample_pkts-objs := xdp_sample_pkts_user.o $(TRACE_HELPERS)
 ibumad-objs := bpf_load.o ibumad_user.o $(TRACE_HELPERS)
 hbm-objs := bpf_load.o hbm.o $(CGROUP_HELPERS)
+trace_ns_info-objs := bpf_load.o trace_ns_info_user.o
 
 # Tell kbuild to always build the programs
 always := $(hostprogs-y)
@@ -170,6 +172,7 @@ always += xdp_sample_pkts_kern.o
 always += ibumad_kern.o
 always += hbm_out_kern.o
 always += hbm_edt_kern.o
+always += trace_ns_info_user_kern.o
 
 KBUILD_HOSTCFLAGS += -I$(objtree)/usr/include
 KBUILD_HOSTCFLAGS += -I$(srctree)/tools/lib/bpf/
diff --git a/samples/bpf/trace_ns_info_user.c b/samples/bpf/trace_ns_info_user.c
new file mode 100644
index 000000000000..e06d08db6f30
--- /dev/null
+++ b/samples/bpf/trace_ns_info_user.c
@@ -0,0 +1,35 @@
+// SPDX-License-Identifier: GPL-2.0
+/* Copyright (c) 2018 Carlos Neira cneirabustos@gmail.com
+ *
+ * This program is free software; you can redistribute it and/or
+ * modify it under the terms of version 2 of the GNU General Public
+ * License as published by the Free Software Foundation.
+ */
+
+#include <stdio.h>
+#include <linux/bpf.h>
+#include <unistd.h>
+#include "bpf/libbpf.h"
+#include "bpf_load.h"
+
+/* This code was taken verbatim from tracex1_user.c, it's used
+ * to exercize bpf_get_current_pidns_info() helper call.
+ */
+int main(int ac, char **argv)
+{
+	FILE *f;
+	char filename[256];
+
+	snprintf(filename, sizeof(filename), "%s_user_kern.o", argv[0]);
+	printf("loading %s\n", filename);
+
+	if (load_bpf_file(filename)) {
+		printf("%s", bpf_log_buf);
+		return 1;
+	}
+
+	f = popen("taskset 1 ping  localhost", "r");
+	(void) f;
+	read_trace_pipe();
+	return 0;
+}
diff --git a/samples/bpf/trace_ns_info_user_kern.c b/samples/bpf/trace_ns_info_user_kern.c
new file mode 100644
index 000000000000..96675e02b707
--- /dev/null
+++ b/samples/bpf/trace_ns_info_user_kern.c
@@ -0,0 +1,44 @@
+// SPDX-License-Identifier: GPL-2.0
+/* Copyright (c) 2018 Carlos Neira cneirabustos@gmail.com
+ *
+ * This program is free software; you can redistribute it and/or
+ * modify it under the terms of version 2 of the GNU General Public
+ * License as published by the Free Software Foundation.
+ */
+#include <linux/skbuff.h>
+#include <linux/netdevice.h>
+#include <linux/version.h>
+#include <uapi/linux/bpf.h>
+#include "bpf_helpers.h"
+
+typedef __u64 u64;
+typedef __u32 u32;
+
+
+/* kprobe is NOT a stable ABI
+ * kernel functions can be removed, renamed or completely change semantics.
+ * Number of arguments and their positions can change, etc.
+ * In such case this bpf+kprobe example will no longer be meaningful
+ */
+
+/* This will call bpf_get_current_pidns_info() to display pid and ns values
+ * as seen by the current namespace, on the far left you will see the pid as
+ * seen as by the root namespace.
+ */
+
+SEC("kprobe/__netif_receive_skb_core")
+int bpf_prog1(struct pt_regs *ctx)
+{
+	char fmt[] = "nsid:%u, dev: %u,  pid:%u\n";
+	struct bpf_pidns_info nsinfo;
+	int ok = 0;
+
+	ok = bpf_get_current_pidns_info(&nsinfo, sizeof(nsinfo));
+	if (ok == 0)
+		bpf_trace_printk(fmt, sizeof(fmt), (u32)nsinfo.nsid,
+				 (u32) nsinfo.dev, (u32)nsinfo.pid);
+
+	return 0;
+}
+char _license[] SEC("license") = "GPL";
+u32 _version SEC("version") = LINUX_VERSION_CODE;
diff --git a/tools/include/uapi/linux/bpf.h b/tools/include/uapi/linux/bpf.h
index 4393bd4b2419..6f601f7106e2 100644
--- a/tools/include/uapi/linux/bpf.h
+++ b/tools/include/uapi/linux/bpf.h
@@ -2741,6 +2741,26 @@ union bpf_attr {
  *		**-EOPNOTSUPP** kernel configuration does not enable SYN cookies
  *
  *		**-EPROTONOSUPPORT** IP packet version is not 4 or 6
+ *
+ * int bpf_get_current_pidns_info(struct bpf_pidns_info *pidns, u32 size_of_pidns)
+ *	Description
+ *		Copies into *pidns* pid, namespace id and tgid as seen by the
+ *		current namespace and also device from /proc/self/ns/pid.
+ *		*size_of_pidns* must be the size of *pidns*
+ *
+ *		This helper is used when pid filtering is needed inside a
+ *		container as bpf_get_current_tgid() helper returns always the
+ *		pid id as seen by the root namespace.
+ *	Return
+ *		0 on success
+ *
+ *		**-EINVAL**  if unable to get ns, pid or tgid of current task.
+ *		Or if size_of_pidns is not valid.
+ *
+ *		**-ENOMEM**  if allocation fails.
+ *
+ *		If unable to get the inode from /proc/self/ns/pid an error code
+ *		will be returned.
  */
 #define __BPF_FUNC_MAPPER(FN)		\
 	FN(unspec),			\
@@ -2853,7 +2873,8 @@ union bpf_attr {
 	FN(sk_storage_get),		\
 	FN(sk_storage_delete),		\
 	FN(send_signal),		\
-	FN(tcp_gen_syncookie),
+	FN(tcp_gen_syncookie),		\
+	FN(get_current_pidns_info),
 
 /* integer value in 'imm' field of BPF_CALL instruction selects which helper
  * function eBPF program intends to call
@@ -3604,4 +3625,10 @@ struct bpf_sockopt {
 	__s32	retval;
 };
 
+struct bpf_pidns_info {
+	__u32 dev;
+	__u32 nsid;
+	__u32 tgid;
+	__u32 pid;
+};
 #endif /* _UAPI__LINUX_BPF_H__ */
diff --git a/tools/testing/selftests/bpf/Makefile b/tools/testing/selftests/bpf/Makefile
index 3bd0f4a0336a..1f97b571b581 100644
--- a/tools/testing/selftests/bpf/Makefile
+++ b/tools/testing/selftests/bpf/Makefile
@@ -29,7 +29,7 @@ TEST_GEN_PROGS = test_verifier test_tag test_maps test_lru_map test_lpm_map test
 	test_cgroup_storage test_select_reuseport test_section_names \
 	test_netcnt test_tcpnotify_user test_sock_fields test_sysctl test_hashmap \
 	test_btf_dump test_cgroup_attach xdping test_sockopt test_sockopt_sk \
-	test_sockopt_multi test_tcp_rtt
+	test_sockopt_multi test_tcp_rtt test_pidns
 
 BPF_OBJ_FILES = $(patsubst %.c,%.o, $(notdir $(wildcard progs/*.c)))
 TEST_GEN_FILES = $(BPF_OBJ_FILES)
diff --git a/tools/testing/selftests/bpf/bpf_helpers.h b/tools/testing/selftests/bpf/bpf_helpers.h
index 120aa86c58d3..c96795a9d983 100644
--- a/tools/testing/selftests/bpf/bpf_helpers.h
+++ b/tools/testing/selftests/bpf/bpf_helpers.h
@@ -231,6 +231,9 @@ static int (*bpf_send_signal)(unsigned sig) = (void *)BPF_FUNC_send_signal;
 static long long (*bpf_tcp_gen_syncookie)(struct bpf_sock *sk, void *ip,
 					  int ip_len, void *tcp, int tcp_len) =
 	(void *) BPF_FUNC_tcp_gen_syncookie;
+static int (*bpf_get_current_pidns_info)(struct bpf_pidns_info *buf,
+					 unsigned int buf_size) =
+	(void *) BPF_FUNC_get_current_pidns_info;
 
 /* llvm builtin functions that eBPF C program may use to
  * emit BPF_LD_ABS and BPF_LD_IND instructions
diff --git a/tools/testing/selftests/bpf/progs/test_pidns_kern.c b/tools/testing/selftests/bpf/progs/test_pidns_kern.c
new file mode 100644
index 000000000000..e1d2facfa762
--- /dev/null
+++ b/tools/testing/selftests/bpf/progs/test_pidns_kern.c
@@ -0,0 +1,51 @@
+// SPDX-License-Identifier: GPL-2.0
+/* Copyright (c) 2018 Carlos Neira cneirabustos@gmail.com
+ *
+ * This program is free software; you can redistribute it and/or
+ * modify it under the terms of version 2 of the GNU General Public
+ * License as published by the Free Software Foundation.
+ */
+
+#include <linux/bpf.h>
+#include <errno.h>
+#include "bpf_helpers.h"
+
+struct bpf_map_def SEC("maps") nsidmap = {
+	.type = BPF_MAP_TYPE_ARRAY,
+	.key_size = sizeof(__u32),
+	.value_size = sizeof(__u32),
+	.max_entries = 1,
+};
+
+struct bpf_map_def SEC("maps") pidmap = {
+	.type = BPF_MAP_TYPE_ARRAY,
+	.key_size = sizeof(__u32),
+	.value_size = sizeof(__u32),
+	.max_entries = 1,
+};
+
+SEC("tracepoint/syscalls/sys_enter_nanosleep")
+int trace(void *ctx)
+{
+	struct bpf_pidns_info nsinfo;
+	__u32 key = 0, *expected_pid, *val;
+	char fmt[] = "ERROR nspid:%d\n";
+
+	if (bpf_get_current_pidns_info(&nsinfo, sizeof(nsinfo)))
+		return -EINVAL;
+
+	expected_pid = bpf_map_lookup_elem(&pidmap, &key);
+
+
+	if (!expected_pid || *expected_pid != nsinfo.pid)
+		return 0;
+
+	val = bpf_map_lookup_elem(&nsidmap, &key);
+	if (val)
+		*val = nsinfo.nsid;
+
+	return 0;
+}
+
+char _license[] SEC("license") = "GPL";
+__u32 _version SEC("version") = 1;
diff --git a/tools/testing/selftests/bpf/test_pidns.c b/tools/testing/selftests/bpf/test_pidns.c
new file mode 100644
index 000000000000..a7254055f294
--- /dev/null
+++ b/tools/testing/selftests/bpf/test_pidns.c
@@ -0,0 +1,138 @@
+// SPDX-License-Identifier: GPL-2.0
+/* Copyright (c) 2018 Carlos Neira cneirabustos@gmail.com
+ *
+ * This program is free software; you can redistribute it and/or
+ * modify it under the terms of version 2 of the GNU General Public
+ * License as published by the Free Software Foundation.
+ */
+
+#include <stdio.h>
+#include <stdlib.h>
+#include <string.h>
+#include <errno.h>
+#include <fcntl.h>
+#include <syscall.h>
+#include <unistd.h>
+#include <linux/perf_event.h>
+#include <sys/ioctl.h>
+#include <sys/time.h>
+#include <sys/types.h>
+#include <sys/stat.h>
+
+#include <linux/bpf.h>
+#include <bpf/bpf.h>
+#include <bpf/libbpf.h>
+
+#include "cgroup_helpers.h"
+#include "bpf_rlimit.h"
+
+#define CHECK(condition, tag, format...) ({		\
+	int __ret = !!(condition);			\
+	if (__ret) {					\
+		printf("%s:FAIL:%s ", __func__, tag);	\
+		printf(format);				\
+	} else {					\
+		printf("%s:PASS:%s\n", __func__, tag);	\
+	}						\
+	__ret;						\
+})
+
+static int bpf_find_map(const char *test, struct bpf_object *obj,
+			const char *name)
+{
+	struct bpf_map *map;
+
+	map = bpf_object__find_map_by_name(obj, name);
+	if (!map)
+		return -1;
+	return bpf_map__fd(map);
+}
+
+
+int main(int argc, char **argv)
+{
+	const char *probe_name = "syscalls/sys_enter_nanosleep";
+	const char *file = "test_pidns_kern.o";
+	int err, bytes, efd, prog_fd, pmu_fd;
+	int pidmap_fd, nsidmap_fd;
+	struct perf_event_attr attr = {};
+	struct bpf_object *obj;
+	__u32 knsid = 0;
+	__u32 key = 0, pid;
+	int exit_code = 1;
+	struct stat st;
+	char buf[256];
+
+	err = bpf_prog_load(file, BPF_PROG_TYPE_TRACEPOINT, &obj, &prog_fd);
+	if (CHECK(err, "bpf_prog_load", "err %d errno %d\n", err, errno))
+		goto cleanup_cgroup_env;
+
+	nsidmap_fd = bpf_find_map(__func__, obj, "nsidmap");
+	if (CHECK(nsidmap_fd < 0, "bpf_find_map", "err %d errno %d\n",
+		  nsidmap_fd, errno))
+		goto close_prog;
+
+	pidmap_fd = bpf_find_map(__func__, obj, "pidmap");
+	if (CHECK(pidmap_fd < 0, "bpf_find_map", "err %d errno %d\n",
+		  pidmap_fd, errno))
+		goto close_prog;
+
+	pid = getpid();
+	bpf_map_update_elem(pidmap_fd, &key, &pid, 0);
+
+	snprintf(buf, sizeof(buf),
+		 "/sys/kernel/debug/tracing/events/%s/id", probe_name);
+	efd = open(buf, O_RDONLY, 0);
+	if (CHECK(efd < 0, "open", "err %d errno %d\n", efd, errno))
+		goto close_prog;
+	bytes = read(efd, buf, sizeof(buf));
+	close(efd);
+	if (CHECK(bytes <= 0 || bytes >= sizeof(buf), "read",
+		  "bytes %d errno %d\n", bytes, errno))
+		goto close_prog;
+
+	attr.config = strtol(buf, NULL, 0);
+	attr.type = PERF_TYPE_TRACEPOINT;
+	attr.sample_type = PERF_SAMPLE_RAW;
+	attr.sample_period = 1;
+	attr.wakeup_events = 1;
+
+	pmu_fd = syscall(__NR_perf_event_open, &attr, getpid(), -1, -1, 0);
+	if (CHECK(pmu_fd < 0, "perf_event_open", "err %d errno %d\n", pmu_fd,
+		  errno))
+		goto close_prog;
+
+	err = ioctl(pmu_fd, PERF_EVENT_IOC_ENABLE, 0);
+	if (CHECK(err, "perf_event_ioc_enable", "err %d errno %d\n", err,
+		  errno))
+		goto close_pmu;
+
+	err = ioctl(pmu_fd, PERF_EVENT_IOC_SET_BPF, prog_fd);
+	if (CHECK(err, "perf_event_ioc_set_bpf", "err %d errno %d\n", err,
+		  errno))
+		goto close_pmu;
+
+	/* trigger some syscalls */
+	sleep(1);
+
+	err = bpf_map_lookup_elem(nsidmap_fd, &key, &knsid);
+	if (CHECK(err, "bpf_map_lookup_elem", "err %d errno %d\n", err, errno))
+		goto close_pmu;
+
+	if (stat("/proc/self/ns/pid", &st))
+		goto close_pmu;
+
+	if (CHECK(knsid != (__u32) st.st_ino, "compare_namespace_id",
+		  "kern knsid %u user unsid %u\n", knsid, (__u32) st.st_ino))
+		goto close_pmu;
+
+	exit_code = 0;
+	printf("%s:PASS\n", argv[0]);
+
+close_pmu:
+	close(pmu_fd);
+close_prog:
+	bpf_object__close(obj);
+cleanup_cgroup_env:
+	return exit_code;
+}
-- 
2.11.0


^ permalink raw reply related

* Re: [PATCH bpf-next] tools/bpf: fix core_reloc.c compilation error
From: Alexei Starovoitov @ 2019-08-08  1:25 UTC (permalink / raw)
  To: Yonghong Song
  Cc: bpf, Network Development, Alexei Starovoitov, Daniel Borkmann,
	Kernel Team, Andrii Nakryiko
In-Reply-To: <20190808003856.555097-1-yhs@fb.com>

On Wed, Aug 7, 2019 at 5:42 PM Yonghong Song <yhs@fb.com> wrote:
>
> On my local machine, I have the following compilation errors:
> =====
>   In file included from prog_tests/core_reloc.c:3:0:
>   ./progs/core_reloc_types.h:517:46: error: expected ‘=’, ‘,’, ‘;’, ‘asm’ or ‘__attribute__’ before ‘fancy_char_ptr_t’
>  typedef const char * const volatile restrict fancy_char_ptr_t;
>                                               ^
>   ./progs/core_reloc_types.h:527:2: error: unknown type name ‘fancy_char_ptr_t’
>     fancy_char_ptr_t d;
>     ^
> =====
>
> I am using gcc 4.8.5. Later compilers may change their behavior not emitting the
> error. Nevertheless, let us fix the issue. "restrict" can be tested
> without typedef.
>
> Fixes: 9654e2ae908e ("selftests/bpf: add CO-RE relocs modifiers/typedef tests")
> Cc: Andrii Nakryiko <andriin@fb.com>
> Signed-off-by: Yonghong Song <yhs@fb.com>

Applied. Thanks.

^ permalink raw reply

* RE: [PATCH net-next 5/5] r8152: change rx_frag_head_sz and rx_max_agg_num dynamically
From: Hayes Wang @ 2019-08-08  1:40 UTC (permalink / raw)
  To: Maciej Fijalkowski
  Cc: Jakub Kicinski, netdev@vger.kernel.org, nic_swsd,
	linux-kernel@vger.kernel.org, linux-usb@vger.kernel.org
In-Reply-To: <20190807144346.00005d2b@gmail.com>

Maciej Fijalkowski [mailto:maciejromanfijalkowski@gmail.com]
> Sent: Wednesday, August 07, 2019 8:44 PM
[...]
> > Excuse me.
> > I find struct ethtool_tunable for ETHTOOL_RX_COPYBREAK.
> > How about the descriptor count?
> 
> Look how drivers implement ethtool's set_ringparam ops.

I would check it. Thanks.


Best Regards,
Hayes



^ permalink raw reply

* [PATCH v2] Fix non-kerneldoc comment in realtek/rtlwifi/usb.c
From: Valdis Klētnieks @ 2019-08-08  1:51 UTC (permalink / raw)
  To: Ping-Ke Shih, Kalle Valo, David S. Miller
  Cc: linux-wireless, netdev, linux-kernel

Fix spurious warning message when building with W=1:

  CC [M]  drivers/net/wireless/realtek/rtlwifi/usb.o
drivers/net/wireless/realtek/rtlwifi/usb.c:243: warning: Cannot understand  * on line 243 - I thought it was a doc line
drivers/net/wireless/realtek/rtlwifi/usb.c:760: warning: Cannot understand  * on line 760 - I thought it was a doc line
drivers/net/wireless/realtek/rtlwifi/usb.c:790: warning: Cannot understand  * on line 790 - I thought it was a doc line

Clean up the comment format.

Signed-off-by: Valdis Kletnieks <valdis.kletnieks@vt.edu>

---
Changes since v1:  Larry Finger pointed out the patch wasn't checkpatch-clean.

diff --git a/drivers/net/wireless/realtek/rtlwifi/usb.c b/drivers/net/wireless/realtek/rtlwifi/usb.c
index 34d68dbf4b4c..4b59f3b46b28 100644
--- a/drivers/net/wireless/realtek/rtlwifi/usb.c
+++ b/drivers/net/wireless/realtek/rtlwifi/usb.c
@@ -239,10 +239,7 @@ static void _rtl_usb_io_handler_release(struct ieee80211_hw *hw)
 	mutex_destroy(&rtlpriv->io.bb_mutex);
 }
 
-/**
- *
- *	Default aggregation handler. Do nothing and just return the oldest skb.
- */
+/*	Default aggregation handler. Do nothing and just return the oldest skb.  */
 static struct sk_buff *_none_usb_tx_aggregate_hdl(struct ieee80211_hw *hw,
 						  struct sk_buff_head *list)
 {
@@ -756,11 +753,6 @@ static int rtl_usb_start(struct ieee80211_hw *hw)
 	return err;
 }
 
-/**
- *
- *
- */
-
 /*=======================  tx =========================================*/
 static void rtl_usb_cleanup(struct ieee80211_hw *hw)
 {
@@ -786,11 +778,7 @@ static void rtl_usb_cleanup(struct ieee80211_hw *hw)
 	usb_kill_anchored_urbs(&rtlusb->tx_submitted);
 }
 
-/**
- *
- * We may add some struct into struct rtl_usb later. Do deinit here.
- *
- */
+/* We may add some struct into struct rtl_usb later. Do deinit here.  */
 static void rtl_usb_deinit(struct ieee80211_hw *hw)
 {
 	rtl_usb_cleanup(hw);


^ permalink raw reply related

* Re: [PATCH] Fix non-kerneldoc comment in realtek/rtlwifi/usb.c
From: Randy Dunlap @ 2019-08-08  1:54 UTC (permalink / raw)
  To: Valdis Klētnieks, Ping-Ke Shih, Kalle Valo, David S. Miller
  Cc: linux-wireless, netdev, linux-kernel
In-Reply-To: <5924.1565217560@turing-police>

On 8/7/19 3:39 PM, Valdis Klētnieks wrote:
> Fix spurious warning message when building with W=1:
> 
>   CC [M]  drivers/net/wireless/realtek/rtlwifi/usb.o
> drivers/net/wireless/realtek/rtlwifi/usb.c:243: warning: Cannot understand  * on line 243 - I thought it was a doc line
> drivers/net/wireless/realtek/rtlwifi/usb.c:760: warning: Cannot understand  * on line 760 - I thought it was a doc line
> drivers/net/wireless/realtek/rtlwifi/usb.c:790: warning: Cannot understand  * on line 790 - I thought it was a doc line
> 
> Change the comment so gcc doesn't think it's a kerneldoc comment block

s:gcc:scripts/kernel-doc/

and as Larry pointed out, networking comment style is "different" from the rest
of the kernel.  :(

> 
> Signed-off-by: Valdis Kletnieks <valdis.kletnieks@vt.edu>
> 
> diff --git a/drivers/net/wireless/realtek/rtlwifi/usb.c b/drivers/net/wireless/realtek/rtlwifi/usb.c
> index e24fda5e9087..9478cc0d4f8b 100644
> --- a/drivers/net/wireless/realtek/rtlwifi/usb.c
> +++ b/drivers/net/wireless/realtek/rtlwifi/usb.c
> @@ -239,7 +239,7 @@ static void _rtl_usb_io_handler_release(struct ieee80211_hw *hw)
>  	mutex_destroy(&rtlpriv->io.bb_mutex);
>  }
>  
> -/**
> +/*
>   *
>   *	Default aggregation handler. Do nothing and just return the oldest skb.
>   */
> @@ -756,7 +756,7 @@ static int rtl_usb_start(struct ieee80211_hw *hw)
>  	return err;
>  }
>  
> -/**
> +/*
>   *
>   *
>   */
> @@ -786,7 +786,7 @@ static void rtl_usb_cleanup(struct ieee80211_hw *hw)
>  	usb_kill_anchored_urbs(&rtlusb->tx_submitted);
>  }
>  
> -/**
> +/*
>   *
>   * We may add some struct into struct rtl_usb later. Do deinit here.
>   *
> 


-- 
~Randy

^ permalink raw reply

* Re: [Intel-wired-lan] [PATCH v2 1/1] ixgbe: sync the first fragment unconditionally
From: Firo Yang @ 2019-08-08  1:55 UTC (permalink / raw)
  To: Maciej Fijalkowski
  Cc: Alexander Duyck, davem@davemloft.net, Alexander Duyck,
	intel-wired-lan, Jacob Wen, LKML, Netdev
In-Reply-To: <CAKgT0UfEh8cvTht3yceyXqwReJOQkcpJV8j0vHSJwookTWhn_Q@mail.gmail.com>

The 08/07/2019 09:06, Alexander Duyck wrote:
> On Wed, Aug 7, 2019 at 7:09 AM Maciej Fijalkowski
> <maciejromanfijalkowski@gmail.com> wrote:
> >
> > On Wed, 7 Aug 2019 08:38:43 +0000
> > Firo Yang <firo.yang@suse.com> wrote:
> >
> > > The 08/07/2019 15:56, Jacob Wen wrote:
> > > > I think the description is not correct. Consider using something like below.
> > > Thank you for comments.
> > >
> > > >
> > > > In Xen environment, due to memory fragmentation ixgbe may allocate a 'DMA'
> > > > buffer with pages that are not physically contiguous.
> > > Actually, I didn't look into the reason why ixgbe got a DMA buffer which
> > > was mapped to Xen-swiotlb area.
> >
> > I think that neither of these descriptions are telling us what was truly
> > broken. They lack what Alexander wrote on v1 thread of this patch.
> >
> > ixgbe_dma_sync_frag is called only on case when the current descriptor has EOP
> > bit set, skb was already allocated and you'll be adding a current buffer as a
> > frag. DMA unmapping for the first frag was intentionally skipped and we will be
> > unmapping it here, in ixgbe_dma_sync_frag. As Alex said, we're using the
> > DMA_ATTR_SKIP_CPU_SYNC attribute which obliges us to perform a sync manually
> > and it was missing.
> >
> > So IMHO the commit description should include descriptions from both xen and
> > ixgbe worlds, the v2 lacks info about ixgbe case.
Thank you. Will submit v3 with what Alexander worte on v1.

Thanks,
Firo
> >
> > BTW Alex, what was the initial reason for holding off with unmapping the first
> > buffer? Asking because IIRC the i40e and ice behave a bit different here. We
> > don't look there for EOP at all when building/constructing skb and not delaying
> > the unmap of non-eop buffers.
> >
> > Thanks,
> > Maciej
> 
> The reason why we have to hold off on unmapping the first buffer is
> because in the case of Receive Side Coalescing (RSC), also known as Large
> Receive Offload (LRO), the header of the packet is updated for each
> additional frame that is added. As such you can end up with the device
> writing data, header, data, header, data, header where each data write
> would update a new descriptor, but the header will only ever update the
> first descriptor in the chain. As such if we unmapped it earlier it would
> result in an IOMMU fault because the device would be rewriting the header
> after it had been unmapped.
> 
> The devices supported by the ixgbe driver are the only ones that have
> RSC/LRO support. As such this behavior is present for ixgbe, but not for
> other Intel NIC drivers including igb, igbvf, ixgbevf, i40e, etc.
> 
> - Alex
> 

^ permalink raw reply

* Re: [PATCH v2 1/1] ixgbe: sync the first fragment unconditionally
From: Jacob Wen @ 2019-08-08  1:56 UTC (permalink / raw)
  To: Firo Yang
  Cc: davem@davemloft.net, jeffrey.t.kirsher@intel.com,
	alexander.h.duyck@linux.intel.com,
	intel-wired-lan@lists.osuosl.org, linux-kernel@vger.kernel.org,
	netdev@vger.kernel.org
In-Reply-To: <20190807083831.GA6811@linux-6qg8>


On 8/7/19 4:38 PM, Firo Yang wrote:
> The 08/07/2019 15:56, Jacob Wen wrote:
>> I think the description is not correct. Consider using something like below.
> Thank you for comments.
>
>> In Xen environment, due to memory fragmentation ixgbe may allocate a 'DMA'
>> buffer with pages that are not physically contiguous.
> Actually, I didn't look into the reason why ixgbe got a DMA buffer which
> was mapped to Xen-swiotlb area.
Yes. I was wrong. You don't need to tell the exact reason.
>
> But I don't think this issue relates to phsical memory contiguity because, in
> our case, one ixgbe_rx_buffer only associates at most one page.

This is interesting.

I guess the performance of the NIC in your environment is not good due 
to the extra cost of bounce buffer.

>
> If you take a look at the related code, you will find there are several reasons
> for mapping a DMA buffer to Xen-swiotlb area:
> static dma_addr_t xen_swiotlb_map_page(struct device *dev, struct page *page,
>           */
>          if (dma_capable(dev, dev_addr, size) &&
>              !range_straddles_page_boundary(phys, size) &&
>                  !xen_arch_need_swiotlb(dev, phys, dev_addr) &&
>                  swiotlb_force != SWIOTLB_FORCE)
>                  goto done;
>
> // Firo
>> A NIC doesn't support directly write such buffer. So xen-swiotlb would use
>> the pages, which are physically contiguous, from the swiotlb buffer for the
>> NIC.
>>
>> The unmap operation is used to copy the swiotlb buffer to the pages that are
>> allocated by ixgbe.
>>
>> On 8/7/19 10:49 AM, Firo Yang wrote:
>>> In Xen environment, if Xen-swiotlb is enabled, ixgbe driver
>>> could possibly allocate a page, DMA memory buffer, for the first
>>> fragment which is not suitable for Xen-swiotlb to do DMA operations.
>>> Xen-swiotlb have to internally allocate another page for doing DMA
>>> operations. It requires syncing between those two pages. However,
>>> since commit f3213d932173 ("ixgbe: Update driver to make use of DMA
>>> attributes in Rx path"), the unmap operation is performed with
>>> DMA_ATTR_SKIP_CPU_SYNC. As a result, the sync is not performed.
>>>
>>> To fix this problem, always sync before possibly performing a page
>>> unmap operation.
>>>
>>> Fixes: f3213d932173 ("ixgbe: Update driver to make use of DMA
>>> attributes in Rx path")
>>> Reviewed-by: Alexander Duyck <alexander.h.duyck@linux.intel.com>
>>> Signed-off-by: Firo Yang <firo.yang@suse.com>
>>> ---
>>>
>>> Changes from v1:
>>>    * Imporved the patch description.
>>>    * Added Reviewed-by: and Fixes: as suggested by Alexander Duyck
>>>
>>>    drivers/net/ethernet/intel/ixgbe/ixgbe_main.c | 16 +++++++++-------
>>>    1 file changed, 9 insertions(+), 7 deletions(-)
>>>
>>> diff --git a/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c b/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c
>>> index cbaf712d6529..200de9838096 100644
>>> --- a/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c
>>> +++ b/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c
>>> @@ -1825,13 +1825,7 @@ static void ixgbe_pull_tail(struct ixgbe_ring *rx_ring,
>>>    static void ixgbe_dma_sync_frag(struct ixgbe_ring *rx_ring,
>>>    				struct sk_buff *skb)
>>>    {
>>> -	/* if the page was released unmap it, else just sync our portion */
>>> -	if (unlikely(IXGBE_CB(skb)->page_released)) {
>>> -		dma_unmap_page_attrs(rx_ring->dev, IXGBE_CB(skb)->dma,
>>> -				     ixgbe_rx_pg_size(rx_ring),
>>> -				     DMA_FROM_DEVICE,
>>> -				     IXGBE_RX_DMA_ATTR);
>>> -	} else if (ring_uses_build_skb(rx_ring)) {
>>> +	if (ring_uses_build_skb(rx_ring)) {
>>>    		unsigned long offset = (unsigned long)(skb->data) & ~PAGE_MASK;
>>>    		dma_sync_single_range_for_cpu(rx_ring->dev,
>>> @@ -1848,6 +1842,14 @@ static void ixgbe_dma_sync_frag(struct ixgbe_ring *rx_ring,
>>>    					      skb_frag_size(frag),
>>>    					      DMA_FROM_DEVICE);
>>>    	}
>>> +
>>> +	/* If the page was released, just unmap it. */
>>> +	if (unlikely(IXGBE_CB(skb)->page_released)) {
>>> +		dma_unmap_page_attrs(rx_ring->dev, IXGBE_CB(skb)->dma,
>>> +				     ixgbe_rx_pg_size(rx_ring),
>>> +				     DMA_FROM_DEVICE,
>>> +				     IXGBE_RX_DMA_ATTR);
>>> +	}
>>>    }
>>>    /**

^ permalink raw reply

* RE: [PATCH net-next] r8169: allocate rx buffers using alloc_pages_node
From: Hayes Wang @ 2019-08-08  2:00 UTC (permalink / raw)
  To: Heiner Kallweit, nic_swsd, David Miller; +Cc: netdev@vger.kernel.org
In-Reply-To: <7d79d794-b41c-101f-0720-59eea88bf9ab@gmail.com>

> From: Heiner Kallweit [mailto:hkallweit1@gmail.com]
> Sent: Thursday, August 08, 2019 3:38 AM
> To: nic_swsd; David Miller
> Cc: netdev@vger.kernel.org
> Subject: [PATCH net-next] r8169: allocate rx buffers using alloc_pages_node
> 
> We allocate 16kb per rx buffer, so we can avoid some overhead by using
> alloc_pages_node directly instead of bothering kmalloc_node. Due to
> this change buffers are page-aligned now, therefore the alignment check
> can be removed.
> 
> Signed-off-by: Heiner Kallweit <hkallweit1@gmail.com>

Acked-by: Hayes Wang <hayeswang@realtek.com>


^ permalink raw reply

page: next (older) | prev (newer) | latest
- recent:[subjects (threaded)|topics (new)|topics (active)]

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox