Netdev List


* [PATCH net-next 1/2] liquidio: move macro definition to a proper place
From: Felix Manlunas @ 2017-08-18 18:35 UTC
  To: davem
  Cc: netdev, raghu.vatsavayi, derek.chickles, satananda.burla,
	veerasenareddy.burru
In-Reply-To: <20170818183432.GA4487@felix-thinkpad.cavium.com>

The macro LIO_CMD_WAIT_TM is not specific to the PF driver; it can be used
by the VF driver too, so move its definition from a PF-specific header file
to one that's common to PF and VF.

Signed-off-by: Veerasenareddy Burru <veerasenareddy.burru@cavium.com>
Signed-off-by: Felix Manlunas <felix.manlunas@cavium.com>
---
 drivers/net/ethernet/cavium/liquidio/cn23xx_pf_device.h | 2 --
 drivers/net/ethernet/cavium/liquidio/liquidio_common.h  | 2 ++
 2 files changed, 2 insertions(+), 2 deletions(-)

diff --git a/drivers/net/ethernet/cavium/liquidio/cn23xx_pf_device.h b/drivers/net/ethernet/cavium/liquidio/cn23xx_pf_device.h
index dee6046..2aba524 100644
--- a/drivers/net/ethernet/cavium/liquidio/cn23xx_pf_device.h
+++ b/drivers/net/ethernet/cavium/liquidio/cn23xx_pf_device.h
@@ -24,8 +24,6 @@
 
 #include "cn23xx_pf_regs.h"
 
-#define LIO_CMD_WAIT_TM 100
-
 /* Register address and configuration for a CN23XX devices.
  * If device specific changes need to be made then add a struct to include
  * device specific fields as shown in the commented section
diff --git a/drivers/net/ethernet/cavium/liquidio/liquidio_common.h b/drivers/net/ethernet/cavium/liquidio/liquidio_common.h
index 18d2955..a2274e6 100644
--- a/drivers/net/ethernet/cavium/liquidio/liquidio_common.h
+++ b/drivers/net/ethernet/cavium/liquidio/liquidio_common.h
@@ -238,6 +238,8 @@ static inline void add_sg_size(struct octeon_sg_entry *sg_entry,
 #define   OCTNET_CMD_VLAN_FILTER_ENABLE 0x1
 #define   OCTNET_CMD_VLAN_FILTER_DISABLE 0x0
 
+#define   LIO_CMD_WAIT_TM 100
+
 /* RX(packets coming from wire) Checksum verification flags */
 /* TCP/UDP csum */
 #define   CNNIC_L4SUM_VERIFIED             0x1
-- 
2.9.0


* [PATCH net-next 0/2] liquidio: VF driver will notify NIC firmware of MTU change
From: Felix Manlunas @ 2017-08-18 18:34 UTC
  To: davem
  Cc: netdev, raghu.vatsavayi, derek.chickles, satananda.burla,
	veerasenareddy.burru

From: Veerasenareddy Burru <veerasenareddy.burru@cavium.com>

Make VF driver notify NIC firmware of MTU change.  Firmware needs this
information for MTU propagation and enforcement.

The first patch in this series moves a macro definition to a proper place
to prevent a build error in the second patch which has the code that sends
the notification.

Veerasenareddy Burru (2):
  liquidio: move macro definition to a proper place
  liquidio: make VF driver notify NIC firmware of MTU change

 .../ethernet/cavium/liquidio/cn23xx_pf_device.h    |  2 --
 drivers/net/ethernet/cavium/liquidio/lio_vf_main.c | 22 ++++++++++++++++++----
 .../net/ethernet/cavium/liquidio/liquidio_common.h |  2 ++
 3 files changed, 20 insertions(+), 6 deletions(-)

-- 
2.9.0


* Re: [net-next PATCH 06/10] bpf: sockmap with sk redirect support
From: Alexei Starovoitov @ 2017-08-18 18:32 UTC
  To: John Fastabend, davem, daniel; +Cc: tgraf, netdev, tom
In-Reply-To: <599698D0.9040803@gmail.com>

On 8/18/17 12:35 AM, John Fastabend wrote:
> From an API perspective having all socks in a sockmap inherit the same
> BPF programs is useful when working with cgroups. It keeps things consistent
> and is pretty effective for applying policy to cgroup sockets.

agree. it's all clear, see modified proposal below.

> But,
> in some cases it breaks down a bit and that is where the map_flags
> and BPF_SOCKMAP_STRPARSER entered the picture. After this discussion
> I think we can clean this up. Here is my proposal, let me know what you
> think.
>
> First I would like to continue allowing socks to inherit BPF programs
> from sock map if users set this up. To clean it up and make it extensible
> though,
>
>   - instead of doing the attach_fd2 which would break quickly if we need
>     fd(3,4,...) use two separate attach types,
>
>          BPF_SMAP_STREAM_PARSER
>          BPF_SMAP_STREAM_VERDICT
>
>     the target fd is the map fd just as before.
>
>     This allows us to easily extend as needed by adding another type and
>     the map space is a u32 so we have plenty of room for extensions.
>
>  - implement the detach for the above to remove the programs
>
> Next let's just remove the map_flags BPF_SOCKMAP_STRPARSER. The UAPI is
> simplified this way and the inheritance rule is clear. If BPF programs
> are attached to the map they are inherited. If there is no BPF program
> attached the socks do not use strparser/verdict logic and are purely
> for redirect actions.
>
> A sock may be in multiple maps but can only inherit a single BPF
> stream/verdict program. Otherwise we would have no way to "know"
> which stream parser to run.
>
> Future extensions could provide an API for doing per sock attach operations
> and I see no reason they would not be compatible. By adding two more
> attach types,
>
>         BPF_SOCK_STREAM_PARSER
>         BPF_SOCK_STREAM_VERDICT
>
> we can provide specific sock BPF programs. With verifier work we could
> even make bpf helpers
>
>         bpf_sock_prog_attach(skops, prog, type, flags)
>         bpf_sock_map_attach(sockmap, key, prog, type, flags)
>
> I think both this and the above work together nicely, and the code can
> support this with some additional work. To summarize the API with the
> above changes,
>
>  syscall:
>
>   bpf_create_map(BPF_MAP_TYPE_SOCKMAP, .... )
>   bpf_prog_attach(verdict_prog, map_fd, BPF_SMAP_STREAM_VERDICT, 0);
>   bpf_prog_attach(parse_prog, map_fd, BPF_SMAP_STREAM_PARSER, 0);
>   bpf_map_update_elem(map_fd, key, sock_fd, BPF_ANY)
>   bpf_map_delete_elem(map_fd, key)
>
>  helpers:
>   to insert sock from sock ops program
>       bpf_sock_map_update(skops, map, key, flags);
>   to redirect skb to a sock in a sockmap
>       bpf_sk_redirect_map(map, key, flags)
>
>  future work:
>   bpf_prog_attach(verdict_prog, map_fd, BPF_SOCK_STREAM_VERDICT, 0)
>   bpf_prog_attach(parse_prog, map_fd, BPF_SOCK_STREAM_PARSER, 0)
>
> How does this look? I think it will be both extensible and very usable
> now.

Above sounds much better than the present situation.
Can we take it even further and split psock from sockmap?
My understanding is that psock->key is there only because you tied
psock to the map and are using the map as storage for the rx socket.
imo separating rx and tx sockets will make it cleaner.
Like we can have a new syscall cmd that creates a psock that holds
strparser, verdict and potentially other programs.
Later sock ops program will use a helper:
bpf_psock_update(skops, psock_obj_handle, flags);
to assign single skops socket into this psock object.
The programs (strparser, verdict) will be applied to this skops socket,
so your inheritance requirement is satisfied.
And use sockmap only for TX sockets. Either user space via syscall
will store them in there or sockops program will store them into the map
via bpf_sock_map_update(skops, sockmap, key, flags); helper.
Later the verdict program will use
bpf_sk_redirect_map(sockmap, key, flags);
and the program author does not need to worry about the 'type' of socket
in the sockmap. All sockets in there are TX sockets to redirect to.
And the same verdict program can use multiple sockmaps.
Similarly user space can create multiple psock objects with
the same or different strparser+verdict programs, and the sockops prog
can pick and choose which psock to assign an RX socket to.

Another alternative:
Instead of new psock object to store single socket (like current
implementation does), we can do two types of sockmap.
One for a set of RX sockets. All of them will have the same
strparser+verdict progs and psock with skbuff queue will be part
of this sockmap type.
And another sockmap type for TX sockets that don't have skbuff queues
at all and can only be used to redirect the RX socket into.
So bpf_rx_sock_map_update() helper will be used only on RX_SOCKMAP map
and bpf_tx_sock_map_update() helper will be used only on TX_SOCKMAP,
while bpf_sk_redirect_map() can only be used on TX_SOCKMAP.
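
To make the pairing in this second alternative concrete, here is a minimal
sketch of which helper would be allowed on which map type. Every name in it
(the enum values and sockmap_helper_allowed()) is hypothetical and made up
for illustration only; nothing like this exists in the posted patches.

```c
#include <stdbool.h>

/* Hypothetical map types for the two-sockmap alternative. */
enum sockmap_kind {
	RX_SOCKMAP,	/* sockets that run strparser+verdict, with skbuff queue */
	TX_SOCKMAP,	/* redirect targets only, no skbuff queue */
};

/* Hypothetical identifiers for the three helpers named above. */
enum sockmap_helper {
	HELPER_RX_SOCK_MAP_UPDATE,	/* bpf_rx_sock_map_update() */
	HELPER_TX_SOCK_MAP_UPDATE,	/* bpf_tx_sock_map_update() */
	HELPER_SK_REDIRECT_MAP,		/* bpf_sk_redirect_map() */
};

/* Return true if the helper may be used on a map of the given kind. */
static inline bool sockmap_helper_allowed(enum sockmap_helper helper,
					  enum sockmap_kind kind)
{
	switch (helper) {
	case HELPER_RX_SOCK_MAP_UPDATE:
		return kind == RX_SOCKMAP;
	case HELPER_TX_SOCK_MAP_UPDATE:
	case HELPER_SK_REDIRECT_MAP:
		return kind == TX_SOCKMAP;
	}
	return false;
}
```

With this shape, a check of this kind at program load time would be enough
to reject a verdict program that tries to redirect into an RX map.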

Or do you have cases where two RX sockets need to redirect into each
other and strparser+verdict need to run in both cases?
In that case we need to allow bpf_sk_redirect_map() to be used on an
RX_SOCKMAP map as well,
but looking at the current implementation you only allow one psock per map,
so two sockets forwarding to each other cannot work due to only one queue.
Am I missing anything from what you want to achieve?
Thoughts?


* Does the kernel have a function to parse a text IPv6 address?
From: David Howells @ 2017-08-18 18:32 UTC
  To: netdev; +Cc: dhowells

Does the kernel have a function to parse a text IPv6 address of the form
"x:y:..::z" and put it into a struct sockaddr_in6?

David


* [PATCH net-next 0/2] bpf: Allow selecting numa node during map creation
From: Martin KaFai Lau @ 2017-08-18 18:27 UTC
  To: netdev; +Cc: Alexei Starovoitov, Daniel Borkmann, kernel-team

This series allows the user to pick the numa node during map creation.
The first patch has the details.

Martin KaFai Lau (2):
  bpf: Allow selecting numa node during map creation
  bpf: Allow numa selection in INNER_LRU_HASH_PREALLOC test of
    map_perf_test

 include/linux/bpf.h                       | 10 +++++++++-
 include/uapi/linux/bpf.h                  | 10 +++++++++-
 kernel/bpf/arraymap.c                     |  7 +++++--
 kernel/bpf/devmap.c                       |  9 ++++++---
 kernel/bpf/hashtab.c                      | 19 ++++++++++++++----
 kernel/bpf/lpm_trie.c                     |  9 +++++++--
 kernel/bpf/sockmap.c                      | 10 +++++++---
 kernel/bpf/stackmap.c                     |  8 +++++---
 kernel/bpf/syscall.c                      | 14 ++++++++++----
 samples/bpf/bpf_load.c                    | 21 ++++++++++++--------
 samples/bpf/bpf_load.h                    |  1 +
 samples/bpf/map_perf_test_kern.c          |  2 ++
 samples/bpf/map_perf_test_user.c          | 12 +++++++++---
 tools/include/uapi/linux/bpf.h            | 10 +++++++++-
 tools/lib/bpf/bpf.c                       | 32 +++++++++++++++++++++++++++----
 tools/lib/bpf/bpf.h                       |  6 ++++++
 tools/testing/selftests/bpf/bpf_helpers.h |  1 +
 17 files changed, 142 insertions(+), 39 deletions(-)

-- 
2.9.5


* [PATCH net-next 1/2] bpf: Allow selecting numa node during map creation
From: Martin KaFai Lau @ 2017-08-18 18:28 UTC
  To: netdev; +Cc: Alexei Starovoitov, Daniel Borkmann, kernel-team
In-Reply-To: <20170818182801.2518162-1-kafai@fb.com>

The current map creation API does not allow the user to provide a numa
node preference.  The memory usually comes from the node where the
map-creation process is running.  The performance is not ideal if the
bpf_prog is known to always run in a numa node different from that of
the map-creation process.

One use case is sharding by CPU across different LRU maps (i.e.
an array of LRU maps).  Here is the test result of map_perf_test on
the INNER_LRU_HASH_PREALLOC test if we force the lru map used by
CPU0 to be allocated from a remote numa node:

[ The machine has 20 cores. CPU0-9 at node 0. CPU10-19 at node 1 ]

># taskset -c 10 ./map_perf_test 512 8 1260000 8000000
5:inner_lru_hash_map_perf pre-alloc 1628380 events per sec
4:inner_lru_hash_map_perf pre-alloc 1626396 events per sec
3:inner_lru_hash_map_perf pre-alloc 1626144 events per sec
6:inner_lru_hash_map_perf pre-alloc 1621657 events per sec
2:inner_lru_hash_map_perf pre-alloc 1621534 events per sec
1:inner_lru_hash_map_perf pre-alloc 1620292 events per sec
7:inner_lru_hash_map_perf pre-alloc 1613305 events per sec
0:inner_lru_hash_map_perf pre-alloc 1239150 events per sec  #<<<

After specifying numa node:
># taskset -c 10 ./map_perf_test 512 8 1260000 8000000
5:inner_lru_hash_map_perf pre-alloc 1629627 events per sec
3:inner_lru_hash_map_perf pre-alloc 1628057 events per sec
1:inner_lru_hash_map_perf pre-alloc 1623054 events per sec
6:inner_lru_hash_map_perf pre-alloc 1616033 events per sec
2:inner_lru_hash_map_perf pre-alloc 1614630 events per sec
4:inner_lru_hash_map_perf pre-alloc 1612651 events per sec
7:inner_lru_hash_map_perf pre-alloc 1609337 events per sec
0:inner_lru_hash_map_perf pre-alloc 1619340 events per sec #<<<

This patch adds one field, numa_node, to the bpf_attr.  Since numa node 0
is a valid node, a new flag BPF_F_NUMA_NODE is also added.  The numa_node
field is honored if and only if the BPF_F_NUMA_NODE flag is set.
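
As a minimal sketch of that rule, the selection logic can be modeled in
plain C outside the kernel.  The BPF_F_NUMA_NODE bit and the NUMA_NO_NODE
value mirror the definitions used by this patch; struct map_create_attr and
attr_numa_node() are stand-ins invented for illustration and are not the
UAPI.

```c
#include <stdint.h>

#define BPF_F_NUMA_NODE	(1U << 2)	/* same bit as added by this patch */
#define NUMA_NO_NODE	(-1)		/* kernel value for "no preference" */

/* Illustrative stand-in for the BPF_MAP_CREATE part of union bpf_attr. */
struct map_create_attr {
	uint32_t map_flags;
	uint32_t numa_node;
};

/* numa_node is honored if and only if BPF_F_NUMA_NODE is set, so a
 * zeroed numa_node field is not mistaken for a request for node 0.
 */
static inline int attr_numa_node(const struct map_create_attr *attr)
{
	return (attr->map_flags & BPF_F_NUMA_NODE) ?
		(int)attr->numa_node : NUMA_NO_NODE;
}
```

An attr with map_flags == 0 and numa_node == 0 therefore yields
NUMA_NO_NODE, while setting BPF_F_NUMA_NODE with numa_node == 0 yields
node 0.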

Numa node selection is not supported for percpu maps.

This patch does not convert every kmalloc.  For example,
'htab = kzalloc()' is not changed since the object
is small enough to stay in the cache.

Signed-off-by: Martin KaFai Lau <kafai@fb.com>
Acked-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Alexei Starovoitov <ast@fb.com>
---
 include/linux/bpf.h      | 10 +++++++++-
 include/uapi/linux/bpf.h | 10 +++++++++-
 kernel/bpf/arraymap.c    |  7 +++++--
 kernel/bpf/devmap.c      |  9 ++++++---
 kernel/bpf/hashtab.c     | 19 +++++++++++++++----
 kernel/bpf/lpm_trie.c    |  9 +++++++--
 kernel/bpf/sockmap.c     | 10 +++++++---
 kernel/bpf/stackmap.c    |  8 +++++---
 kernel/bpf/syscall.c     | 14 ++++++++++----
 9 files changed, 73 insertions(+), 23 deletions(-)

diff --git a/include/linux/bpf.h b/include/linux/bpf.h
index 1cc6c5ff61ec..55b88e329804 100644
--- a/include/linux/bpf.h
+++ b/include/linux/bpf.h
@@ -51,6 +51,7 @@ struct bpf_map {
 	u32 map_flags;
 	u32 pages;
 	u32 id;
+	int numa_node;
 	struct user_struct *user;
 	const struct bpf_map_ops *ops;
 	struct work_struct work;
@@ -264,7 +265,7 @@ struct bpf_map * __must_check bpf_map_inc(struct bpf_map *map, bool uref);
 void bpf_map_put_with_uref(struct bpf_map *map);
 void bpf_map_put(struct bpf_map *map);
 int bpf_map_precharge_memlock(u32 pages);
-void *bpf_map_area_alloc(size_t size);
+void *bpf_map_area_alloc(size_t size, int numa_node);
 void bpf_map_area_free(void *base);
 
 extern int sysctl_unprivileged_bpf_disabled;
@@ -316,6 +317,13 @@ struct net_device  *__dev_map_lookup_elem(struct bpf_map *map, u32 key);
 void __dev_map_insert_ctx(struct bpf_map *map, u32 index);
 void __dev_map_flush(struct bpf_map *map);
 
+/* Return map's numa specified by userspace */
+static inline int bpf_map_attr_numa_node(const union bpf_attr *attr)
+{
+	return (attr->map_flags & BPF_F_NUMA_NODE) ?
+		attr->numa_node : NUMA_NO_NODE;
+}
+
 #else
 static inline struct bpf_prog *bpf_prog_get(u32 ufd)
 {
diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
index 5ecbe812a2cc..843818dff96d 100644
--- a/include/uapi/linux/bpf.h
+++ b/include/uapi/linux/bpf.h
@@ -165,6 +165,7 @@ enum bpf_attach_type {
 #define BPF_NOEXIST	1 /* create new element if it didn't exist */
 #define BPF_EXIST	2 /* update existing element */
 
+/* flags for BPF_MAP_CREATE command */
 #define BPF_F_NO_PREALLOC	(1U << 0)
 /* Instead of having one common LRU list in the
  * BPF_MAP_TYPE_LRU_[PERCPU_]HASH map, use a percpu LRU list
@@ -173,6 +174,8 @@ enum bpf_attach_type {
  * across different LRU lists.
  */
 #define BPF_F_NO_COMMON_LRU	(1U << 1)
+/* Specify numa node during map creation */
+#define BPF_F_NUMA_NODE		(1U << 2)
 
 union bpf_attr {
 	struct { /* anonymous struct used by BPF_MAP_CREATE command */
@@ -180,8 +183,13 @@ union bpf_attr {
 		__u32	key_size;	/* size of key in bytes */
 		__u32	value_size;	/* size of value in bytes */
 		__u32	max_entries;	/* max number of entries in a map */
-		__u32	map_flags;	/* prealloc or not */
+		__u32	map_flags;	/* BPF_MAP_CREATE related
+					 * flags defined above.
+					 */
 		__u32	inner_map_fd;	/* fd pointing to the inner map */
+		__u32	numa_node;	/* numa node (effective only if
+					 * BPF_F_NUMA_NODE is set).
+					 */
 	};
 
 	struct { /* anonymous struct used by BPF_MAP_*_ELEM commands */
diff --git a/kernel/bpf/arraymap.c b/kernel/bpf/arraymap.c
index d771a3872500..96e9c5c1dfc9 100644
--- a/kernel/bpf/arraymap.c
+++ b/kernel/bpf/arraymap.c
@@ -49,13 +49,15 @@ static int bpf_array_alloc_percpu(struct bpf_array *array)
 static struct bpf_map *array_map_alloc(union bpf_attr *attr)
 {
 	bool percpu = attr->map_type == BPF_MAP_TYPE_PERCPU_ARRAY;
+	int numa_node = bpf_map_attr_numa_node(attr);
 	struct bpf_array *array;
 	u64 array_size;
 	u32 elem_size;
 
 	/* check sanity of attributes */
 	if (attr->max_entries == 0 || attr->key_size != 4 ||
-	    attr->value_size == 0 || attr->map_flags)
+	    attr->value_size == 0 || attr->map_flags & ~BPF_F_NUMA_NODE ||
+	    (percpu && numa_node != NUMA_NO_NODE))
 		return ERR_PTR(-EINVAL);
 
 	if (attr->value_size > KMALLOC_MAX_SIZE)
@@ -77,7 +79,7 @@ static struct bpf_map *array_map_alloc(union bpf_attr *attr)
 		return ERR_PTR(-ENOMEM);
 
 	/* allocate all map elements and zero-initialize them */
-	array = bpf_map_area_alloc(array_size);
+	array = bpf_map_area_alloc(array_size, numa_node);
 	if (!array)
 		return ERR_PTR(-ENOMEM);
 
@@ -87,6 +89,7 @@ static struct bpf_map *array_map_alloc(union bpf_attr *attr)
 	array->map.value_size = attr->value_size;
 	array->map.max_entries = attr->max_entries;
 	array->map.map_flags = attr->map_flags;
+	array->map.numa_node = numa_node;
 	array->elem_size = elem_size;
 
 	if (!percpu)
diff --git a/kernel/bpf/devmap.c b/kernel/bpf/devmap.c
index 18a72a8add43..67f4f00ce33a 100644
--- a/kernel/bpf/devmap.c
+++ b/kernel/bpf/devmap.c
@@ -80,7 +80,7 @@ static struct bpf_map *dev_map_alloc(union bpf_attr *attr)
 
 	/* check sanity of attributes */
 	if (attr->max_entries == 0 || attr->key_size != 4 ||
-	    attr->value_size != 4 || attr->map_flags)
+	    attr->value_size != 4 || attr->map_flags & ~BPF_F_NUMA_NODE)
 		return ERR_PTR(-EINVAL);
 
 	dtab = kzalloc(sizeof(*dtab), GFP_USER);
@@ -93,6 +93,7 @@ static struct bpf_map *dev_map_alloc(union bpf_attr *attr)
 	dtab->map.value_size = attr->value_size;
 	dtab->map.max_entries = attr->max_entries;
 	dtab->map.map_flags = attr->map_flags;
+	dtab->map.numa_node = bpf_map_attr_numa_node(attr);
 
 	err = -ENOMEM;
 
@@ -119,7 +120,8 @@ static struct bpf_map *dev_map_alloc(union bpf_attr *attr)
 		goto free_dtab;
 
 	dtab->netdev_map = bpf_map_area_alloc(dtab->map.max_entries *
-					      sizeof(struct bpf_dtab_netdev *));
+					      sizeof(struct bpf_dtab_netdev *),
+					      dtab->map.numa_node);
 	if (!dtab->netdev_map)
 		goto free_dtab;
 
@@ -344,7 +346,8 @@ static int dev_map_update_elem(struct bpf_map *map, void *key, void *value,
 	if (!ifindex) {
 		dev = NULL;
 	} else {
-		dev = kmalloc(sizeof(*dev), GFP_ATOMIC | __GFP_NOWARN);
+		dev = kmalloc_node(sizeof(*dev), GFP_ATOMIC | __GFP_NOWARN,
+				   map->numa_node);
 		if (!dev)
 			return -ENOMEM;
 
diff --git a/kernel/bpf/hashtab.c b/kernel/bpf/hashtab.c
index 4fb463172aa8..47ae748c3a49 100644
--- a/kernel/bpf/hashtab.c
+++ b/kernel/bpf/hashtab.c
@@ -18,6 +18,9 @@
 #include "bpf_lru_list.h"
 #include "map_in_map.h"
 
+#define HTAB_CREATE_FLAG_MASK \
+	(BPF_F_NO_PREALLOC | BPF_F_NO_COMMON_LRU | BPF_F_NUMA_NODE)
+
 struct bucket {
 	struct hlist_nulls_head head;
 	raw_spinlock_t lock;
@@ -138,7 +141,8 @@ static int prealloc_init(struct bpf_htab *htab)
 	if (!htab_is_percpu(htab) && !htab_is_lru(htab))
 		num_entries += num_possible_cpus();
 
-	htab->elems = bpf_map_area_alloc(htab->elem_size * num_entries);
+	htab->elems = bpf_map_area_alloc(htab->elem_size * num_entries,
+					 htab->map.numa_node);
 	if (!htab->elems)
 		return -ENOMEM;
 
@@ -233,6 +237,7 @@ static struct bpf_map *htab_map_alloc(union bpf_attr *attr)
 	 */
 	bool percpu_lru = (attr->map_flags & BPF_F_NO_COMMON_LRU);
 	bool prealloc = !(attr->map_flags & BPF_F_NO_PREALLOC);
+	int numa_node = bpf_map_attr_numa_node(attr);
 	struct bpf_htab *htab;
 	int err, i;
 	u64 cost;
@@ -248,7 +253,7 @@ static struct bpf_map *htab_map_alloc(union bpf_attr *attr)
 		 */
 		return ERR_PTR(-EPERM);
 
-	if (attr->map_flags & ~(BPF_F_NO_PREALLOC | BPF_F_NO_COMMON_LRU))
+	if (attr->map_flags & ~HTAB_CREATE_FLAG_MASK)
 		/* reserved bits should not be used */
 		return ERR_PTR(-EINVAL);
 
@@ -258,6 +263,9 @@ static struct bpf_map *htab_map_alloc(union bpf_attr *attr)
 	if (lru && !prealloc)
 		return ERR_PTR(-ENOTSUPP);
 
+	if (numa_node != NUMA_NO_NODE && (percpu || percpu_lru))
+		return ERR_PTR(-EINVAL);
+
 	htab = kzalloc(sizeof(*htab), GFP_USER);
 	if (!htab)
 		return ERR_PTR(-ENOMEM);
@@ -268,6 +276,7 @@ static struct bpf_map *htab_map_alloc(union bpf_attr *attr)
 	htab->map.value_size = attr->value_size;
 	htab->map.max_entries = attr->max_entries;
 	htab->map.map_flags = attr->map_flags;
+	htab->map.numa_node = numa_node;
 
 	/* check sanity of attributes.
 	 * value_size == 0 may be allowed in the future to use map as a set
@@ -346,7 +355,8 @@ static struct bpf_map *htab_map_alloc(union bpf_attr *attr)
 
 	err = -ENOMEM;
 	htab->buckets = bpf_map_area_alloc(htab->n_buckets *
-					   sizeof(struct bucket));
+					   sizeof(struct bucket),
+					   htab->map.numa_node);
 	if (!htab->buckets)
 		goto free_htab;
 
@@ -689,7 +699,8 @@ static struct htab_elem *alloc_htab_elem(struct bpf_htab *htab, void *key,
 				atomic_dec(&htab->count);
 				return ERR_PTR(-E2BIG);
 			}
-		l_new = kmalloc(htab->elem_size, GFP_ATOMIC | __GFP_NOWARN);
+		l_new = kmalloc_node(htab->elem_size, GFP_ATOMIC | __GFP_NOWARN,
+				     htab->map.numa_node);
 		if (!l_new)
 			return ERR_PTR(-ENOMEM);
 	}
diff --git a/kernel/bpf/lpm_trie.c b/kernel/bpf/lpm_trie.c
index b09185f0f17d..1b767844a76f 100644
--- a/kernel/bpf/lpm_trie.c
+++ b/kernel/bpf/lpm_trie.c
@@ -244,7 +244,8 @@ static struct lpm_trie_node *lpm_trie_node_alloc(const struct lpm_trie *trie,
 	if (value)
 		size += trie->map.value_size;
 
-	node = kmalloc(size, GFP_ATOMIC | __GFP_NOWARN);
+	node = kmalloc_node(size, GFP_ATOMIC | __GFP_NOWARN,
+			    trie->map.numa_node);
 	if (!node)
 		return NULL;
 
@@ -405,6 +406,8 @@ static int trie_delete_elem(struct bpf_map *map, void *key)
 #define LPM_KEY_SIZE_MAX	LPM_KEY_SIZE(LPM_DATA_SIZE_MAX)
 #define LPM_KEY_SIZE_MIN	LPM_KEY_SIZE(LPM_DATA_SIZE_MIN)
 
+#define LPM_CREATE_FLAG_MASK	(BPF_F_NO_PREALLOC | BPF_F_NUMA_NODE)
+
 static struct bpf_map *trie_alloc(union bpf_attr *attr)
 {
 	struct lpm_trie *trie;
@@ -416,7 +419,8 @@ static struct bpf_map *trie_alloc(union bpf_attr *attr)
 
 	/* check sanity of attributes */
 	if (attr->max_entries == 0 ||
-	    attr->map_flags != BPF_F_NO_PREALLOC ||
+	    !(attr->map_flags & BPF_F_NO_PREALLOC) ||
+	    attr->map_flags & ~LPM_CREATE_FLAG_MASK ||
 	    attr->key_size < LPM_KEY_SIZE_MIN ||
 	    attr->key_size > LPM_KEY_SIZE_MAX ||
 	    attr->value_size < LPM_VAL_SIZE_MIN ||
@@ -433,6 +437,7 @@ static struct bpf_map *trie_alloc(union bpf_attr *attr)
 	trie->map.value_size = attr->value_size;
 	trie->map.max_entries = attr->max_entries;
 	trie->map.map_flags = attr->map_flags;
+	trie->map.numa_node = bpf_map_attr_numa_node(attr);
 	trie->data_size = attr->key_size -
 			  offsetof(struct bpf_lpm_trie_key, data);
 	trie->max_prefixlen = trie->data_size * 8;
diff --git a/kernel/bpf/sockmap.c b/kernel/bpf/sockmap.c
index 39de541fbcdc..78b2bb9370ac 100644
--- a/kernel/bpf/sockmap.c
+++ b/kernel/bpf/sockmap.c
@@ -443,7 +443,9 @@ static struct smap_psock *smap_init_psock(struct sock *sock,
 {
 	struct smap_psock *psock;
 
-	psock = kzalloc(sizeof(struct smap_psock), GFP_ATOMIC | __GFP_NOWARN);
+	psock = kzalloc_node(sizeof(struct smap_psock),
+			     GFP_ATOMIC | __GFP_NOWARN,
+			     stab->map.numa_node);
 	if (!psock)
 		return ERR_PTR(-ENOMEM);
 
@@ -465,7 +467,7 @@ static struct bpf_map *sock_map_alloc(union bpf_attr *attr)
 
 	/* check sanity of attributes */
 	if (attr->max_entries == 0 || attr->key_size != 4 ||
-	    attr->value_size != 4 || attr->map_flags)
+	    attr->value_size != 4 || attr->map_flags & ~BPF_F_NUMA_NODE)
 		return ERR_PTR(-EINVAL);
 
 	if (attr->value_size > KMALLOC_MAX_SIZE)
@@ -481,6 +483,7 @@ static struct bpf_map *sock_map_alloc(union bpf_attr *attr)
 	stab->map.value_size = attr->value_size;
 	stab->map.max_entries = attr->max_entries;
 	stab->map.map_flags = attr->map_flags;
+	stab->map.numa_node = bpf_map_attr_numa_node(attr);
 
 	/* make sure page count doesn't overflow */
 	cost = (u64) stab->map.max_entries * sizeof(struct sock *);
@@ -495,7 +498,8 @@ static struct bpf_map *sock_map_alloc(union bpf_attr *attr)
 		goto free_stab;
 
 	stab->sock_map = bpf_map_area_alloc(stab->map.max_entries *
-					    sizeof(struct sock *));
+					    sizeof(struct sock *),
+					    stab->map.numa_node);
 	if (!stab->sock_map)
 		goto free_stab;
 
diff --git a/kernel/bpf/stackmap.c b/kernel/bpf/stackmap.c
index 31147d730abf..135be433e9a0 100644
--- a/kernel/bpf/stackmap.c
+++ b/kernel/bpf/stackmap.c
@@ -31,7 +31,8 @@ static int prealloc_elems_and_freelist(struct bpf_stack_map *smap)
 	u32 elem_size = sizeof(struct stack_map_bucket) + smap->map.value_size;
 	int err;
 
-	smap->elems = bpf_map_area_alloc(elem_size * smap->map.max_entries);
+	smap->elems = bpf_map_area_alloc(elem_size * smap->map.max_entries,
+					 smap->map.numa_node);
 	if (!smap->elems)
 		return -ENOMEM;
 
@@ -59,7 +60,7 @@ static struct bpf_map *stack_map_alloc(union bpf_attr *attr)
 	if (!capable(CAP_SYS_ADMIN))
 		return ERR_PTR(-EPERM);
 
-	if (attr->map_flags)
+	if (attr->map_flags & ~BPF_F_NUMA_NODE)
 		return ERR_PTR(-EINVAL);
 
 	/* check sanity of attributes */
@@ -75,7 +76,7 @@ static struct bpf_map *stack_map_alloc(union bpf_attr *attr)
 	if (cost >= U32_MAX - PAGE_SIZE)
 		return ERR_PTR(-E2BIG);
 
-	smap = bpf_map_area_alloc(cost);
+	smap = bpf_map_area_alloc(cost, bpf_map_attr_numa_node(attr));
 	if (!smap)
 		return ERR_PTR(-ENOMEM);
 
@@ -91,6 +92,7 @@ static struct bpf_map *stack_map_alloc(union bpf_attr *attr)
 	smap->map.map_flags = attr->map_flags;
 	smap->n_buckets = n_buckets;
 	smap->map.pages = round_up(cost, PAGE_SIZE) >> PAGE_SHIFT;
+	smap->map.numa_node = bpf_map_attr_numa_node(attr);
 
 	err = bpf_map_precharge_memlock(smap->map.pages);
 	if (err)
diff --git a/kernel/bpf/syscall.c b/kernel/bpf/syscall.c
index d2f2bdf71ffa..693da918e84e 100644
--- a/kernel/bpf/syscall.c
+++ b/kernel/bpf/syscall.c
@@ -105,7 +105,7 @@ static struct bpf_map *find_and_alloc_map(union bpf_attr *attr)
 	return map;
 }
 
-void *bpf_map_area_alloc(size_t size)
+void *bpf_map_area_alloc(size_t size, int numa_node)
 {
 	/* We definitely need __GFP_NORETRY, so OOM killer doesn't
 	 * trigger under memory pressure as we really just want to
@@ -115,12 +115,13 @@ void *bpf_map_area_alloc(size_t size)
 	void *area;
 
 	if (size <= (PAGE_SIZE << PAGE_ALLOC_COSTLY_ORDER)) {
-		area = kmalloc(size, GFP_USER | flags);
+		area = kmalloc_node(size, GFP_USER | flags, numa_node);
 		if (area != NULL)
 			return area;
 	}
 
-	return __vmalloc(size, GFP_KERNEL | flags, PAGE_KERNEL);
+	return __vmalloc_node_flags_caller(size, numa_node, GFP_KERNEL | flags,
+					   __builtin_return_address(0));
 }
 
 void bpf_map_area_free(void *area)
@@ -309,10 +310,11 @@ int bpf_map_new_fd(struct bpf_map *map)
 		   offsetof(union bpf_attr, CMD##_LAST_FIELD) - \
 		   sizeof(attr->CMD##_LAST_FIELD)) != NULL
 
-#define BPF_MAP_CREATE_LAST_FIELD inner_map_fd
+#define BPF_MAP_CREATE_LAST_FIELD numa_node
 /* called via syscall */
 static int map_create(union bpf_attr *attr)
 {
+	int numa_node = bpf_map_attr_numa_node(attr);
 	struct bpf_map *map;
 	int err;
 
@@ -320,6 +322,10 @@ static int map_create(union bpf_attr *attr)
 	if (err)
 		return -EINVAL;
 
+	if (numa_node != NUMA_NO_NODE &&
+	    (numa_node >= nr_node_ids || !node_online(numa_node)))
+		return -EINVAL;
+
 	/* find map type and init map: hashtable vs rbtree vs bloom vs ... */
 	map = find_and_alloc_map(attr);
 	if (IS_ERR(map))
-- 
2.9.5


* [PATCH net-next 2/2] bpf: Allow numa selection in INNER_LRU_HASH_PREALLOC test of map_perf_test
From: Martin KaFai Lau @ 2017-08-18 18:28 UTC
  To: netdev; +Cc: Alexei Starovoitov, Daniel Borkmann, kernel-team
In-Reply-To: <20170818182801.2518162-1-kafai@fb.com>

This patch makes the needed changes to allow each process of
the INNER_LRU_HASH_PREALLOC test to provide its numa node id
when creating the lru map.
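
For illustration, one way a process can find its own numa node id is the
getcpu system call, which the map_perf_test_user.c change in this patch
uses.  The stand-alone sketch below is Linux only; the function name
current_numa_node() is hypothetical, and the map creation itself is left
out because it depends on tools/lib/bpf.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE		/* for syscall() */
#endif
#include <stddef.h>
#include <sys/syscall.h>
#include <unistd.h>

/* Return the numa node the caller is currently running on, or -1 on
 * error.  Hypothetical helper for illustration.
 */
static inline int current_numa_node(void)
{
	unsigned int cpu, node;

	if (syscall(SYS_getcpu, &cpu, &node, NULL) != 0)
		return -1;

	/* The caller would pass this value as the numa node argument
	 * when creating its map.
	 */
	return (int)node;
}
```

The returned value is only meaningful if the task stays on that node, for
example when it is pinned to a CPU before the call.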

Signed-off-by: Martin KaFai Lau <kafai@fb.com>
Acked-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Alexei Starovoitov <ast@fb.com>
---
 samples/bpf/bpf_load.c                    | 21 ++++++++++++--------
 samples/bpf/bpf_load.h                    |  1 +
 samples/bpf/map_perf_test_kern.c          |  2 ++
 samples/bpf/map_perf_test_user.c          | 12 +++++++++---
 tools/include/uapi/linux/bpf.h            | 10 +++++++++-
 tools/lib/bpf/bpf.c                       | 32 +++++++++++++++++++++++++++----
 tools/lib/bpf/bpf.h                       |  6 ++++++
 tools/testing/selftests/bpf/bpf_helpers.h |  1 +
 8 files changed, 69 insertions(+), 16 deletions(-)

diff --git a/samples/bpf/bpf_load.c b/samples/bpf/bpf_load.c
index a8552b8a2ab6..6aa50098dfb8 100644
--- a/samples/bpf/bpf_load.c
+++ b/samples/bpf/bpf_load.c
@@ -201,7 +201,7 @@ static int load_and_attach(const char *event, struct bpf_insn *prog, int size)
 static int load_maps(struct bpf_map_data *maps, int nr_maps,
 		     fixup_map_cb fixup_map)
 {
-	int i;
+	int i, numa_node;
 
 	for (i = 0; i < nr_maps; i++) {
 		if (fixup_map) {
@@ -213,21 +213,26 @@ static int load_maps(struct bpf_map_data *maps, int nr_maps,
 			}
 		}
 
+		numa_node = maps[i].def.map_flags & BPF_F_NUMA_NODE ?
+			maps[i].def.numa_node : -1;
+
 		if (maps[i].def.type == BPF_MAP_TYPE_ARRAY_OF_MAPS ||
 		    maps[i].def.type == BPF_MAP_TYPE_HASH_OF_MAPS) {
 			int inner_map_fd = map_fd[maps[i].def.inner_map_idx];
 
-			map_fd[i] = bpf_create_map_in_map(maps[i].def.type,
+			map_fd[i] = bpf_create_map_in_map_node(maps[i].def.type,
 							maps[i].def.key_size,
 							inner_map_fd,
 							maps[i].def.max_entries,
-							maps[i].def.map_flags);
+							maps[i].def.map_flags,
+							numa_node);
 		} else {
-			map_fd[i] = bpf_create_map(maps[i].def.type,
-						   maps[i].def.key_size,
-						   maps[i].def.value_size,
-						   maps[i].def.max_entries,
-						   maps[i].def.map_flags);
+			map_fd[i] = bpf_create_map_node(maps[i].def.type,
+							maps[i].def.key_size,
+							maps[i].def.value_size,
+							maps[i].def.max_entries,
+							maps[i].def.map_flags,
+							numa_node);
 		}
 		if (map_fd[i] < 0) {
 			printf("failed to create a map: %d %s\n",
diff --git a/samples/bpf/bpf_load.h b/samples/bpf/bpf_load.h
index ca0563d04744..453e3226b4ce 100644
--- a/samples/bpf/bpf_load.h
+++ b/samples/bpf/bpf_load.h
@@ -13,6 +13,7 @@ struct bpf_map_def {
 	unsigned int max_entries;
 	unsigned int map_flags;
 	unsigned int inner_map_idx;
+	unsigned int numa_node;
 };
 
 struct bpf_map_data {
diff --git a/samples/bpf/map_perf_test_kern.c b/samples/bpf/map_perf_test_kern.c
index 245165817fbe..ca3b22ed577a 100644
--- a/samples/bpf/map_perf_test_kern.c
+++ b/samples/bpf/map_perf_test_kern.c
@@ -40,6 +40,8 @@ struct bpf_map_def SEC("maps") inner_lru_hash_map = {
 	.key_size = sizeof(u32),
 	.value_size = sizeof(long),
 	.max_entries = MAX_ENTRIES,
+	.map_flags = BPF_F_NUMA_NODE,
+	.numa_node = 0,
 };
 
 struct bpf_map_def SEC("maps") array_of_lru_hashs = {
diff --git a/samples/bpf/map_perf_test_user.c b/samples/bpf/map_perf_test_user.c
index 1a8894b5ac51..bccbf8478e43 100644
--- a/samples/bpf/map_perf_test_user.c
+++ b/samples/bpf/map_perf_test_user.c
@@ -97,14 +97,20 @@ static void do_test_lru(enum test_type test, int cpu)
 
 	if (test == INNER_LRU_HASH_PREALLOC) {
 		int outer_fd = map_fd[array_of_lru_hashs_idx];
+		unsigned int mycpu, mynode;
 
 		assert(cpu < MAX_NR_CPUS);
 
 		if (cpu) {
+			ret = syscall(__NR_getcpu, &mycpu, &mynode, NULL);
+			assert(!ret);
+
 			inner_lru_map_fds[cpu] =
-				bpf_create_map(BPF_MAP_TYPE_LRU_HASH,
-					       sizeof(uint32_t), sizeof(long),
-					       inner_lru_hash_size, 0);
+				bpf_create_map_node(BPF_MAP_TYPE_LRU_HASH,
+						    sizeof(uint32_t),
+						    sizeof(long),
+						    inner_lru_hash_size, 0,
+						    mynode);
 			if (inner_lru_map_fds[cpu] == -1) {
 				printf("cannot create BPF_MAP_TYPE_LRU_HASH %s(%d)\n",
 				       strerror(errno), errno);
diff --git a/tools/include/uapi/linux/bpf.h b/tools/include/uapi/linux/bpf.h
index 2d97dd27c8f6..f8f6377fd541 100644
--- a/tools/include/uapi/linux/bpf.h
+++ b/tools/include/uapi/linux/bpf.h
@@ -168,6 +168,7 @@ enum bpf_sockmap_flags {
 #define BPF_NOEXIST	1 /* create new element if it didn't exist */
 #define BPF_EXIST	2 /* update existing element */
 
+/* flags for BPF_MAP_CREATE command */
 #define BPF_F_NO_PREALLOC	(1U << 0)
 /* Instead of having one common LRU list in the
  * BPF_MAP_TYPE_LRU_[PERCPU_]HASH map, use a percpu LRU list
@@ -176,6 +177,8 @@ enum bpf_sockmap_flags {
  * across different LRU lists.
  */
 #define BPF_F_NO_COMMON_LRU	(1U << 1)
+/* Specify numa node during map creation */
+#define BPF_F_NUMA_NODE		(1U << 2)
 
 union bpf_attr {
 	struct { /* anonymous struct used by BPF_MAP_CREATE command */
@@ -183,8 +186,13 @@ union bpf_attr {
 		__u32	key_size;	/* size of key in bytes */
 		__u32	value_size;	/* size of value in bytes */
 		__u32	max_entries;	/* max number of entries in a map */
-		__u32	map_flags;	/* prealloc or not */
+		__u32	map_flags;	/* BPF_MAP_CREATE related
+					 * flags defined above.
+					 */
 		__u32	inner_map_fd;	/* fd pointing to the inner map */
+		__u32	numa_node;	/* numa node (effective only if
+					 * BPF_F_NUMA_NODE is set).
+					 */
 	};
 
 	struct { /* anonymous struct used by BPF_MAP_*_ELEM commands */
diff --git a/tools/lib/bpf/bpf.c b/tools/lib/bpf/bpf.c
index 77660157a684..a0717610b116 100644
--- a/tools/lib/bpf/bpf.c
+++ b/tools/lib/bpf/bpf.c
@@ -57,8 +57,9 @@ static inline int sys_bpf(enum bpf_cmd cmd, union bpf_attr *attr,
 	return syscall(__NR_bpf, cmd, attr, size);
 }
 
-int bpf_create_map(enum bpf_map_type map_type, int key_size,
-		   int value_size, int max_entries, __u32 map_flags)
+int bpf_create_map_node(enum bpf_map_type map_type, int key_size,
+			int value_size, int max_entries, __u32 map_flags,
+			int node)
 {
 	union bpf_attr attr;
 
@@ -69,12 +70,24 @@ int bpf_create_map(enum bpf_map_type map_type, int key_size,
 	attr.value_size = value_size;
 	attr.max_entries = max_entries;
 	attr.map_flags = map_flags;
+	if (node >= 0) {
+		attr.map_flags |= BPF_F_NUMA_NODE;
+		attr.numa_node = node;
+	}
 
 	return sys_bpf(BPF_MAP_CREATE, &attr, sizeof(attr));
 }
 
-int bpf_create_map_in_map(enum bpf_map_type map_type, int key_size,
-			  int inner_map_fd, int max_entries, __u32 map_flags)
+int bpf_create_map(enum bpf_map_type map_type, int key_size,
+		   int value_size, int max_entries, __u32 map_flags)
+{
+	return bpf_create_map_node(map_type, key_size, value_size,
+				   max_entries, map_flags, -1);
+}
+
+int bpf_create_map_in_map_node(enum bpf_map_type map_type, int key_size,
+			       int inner_map_fd, int max_entries,
+			       __u32 map_flags, int node)
 {
 	union bpf_attr attr;
 
@@ -86,10 +99,21 @@ int bpf_create_map_in_map(enum bpf_map_type map_type, int key_size,
 	attr.inner_map_fd = inner_map_fd;
 	attr.max_entries = max_entries;
 	attr.map_flags = map_flags;
+	if (node >= 0) {
+		attr.map_flags |= BPF_F_NUMA_NODE;
+		attr.numa_node = node;
+	}
 
 	return sys_bpf(BPF_MAP_CREATE, &attr, sizeof(attr));
 }
 
+int bpf_create_map_in_map(enum bpf_map_type map_type, int key_size,
+			  int inner_map_fd, int max_entries, __u32 map_flags)
+{
+	return bpf_create_map_in_map_node(map_type, key_size, inner_map_fd,
+					  max_entries, map_flags, -1);
+}
+
 int bpf_load_program(enum bpf_prog_type type, const struct bpf_insn *insns,
 		     size_t insns_cnt, const char *license,
 		     __u32 kern_version, char *log_buf, size_t log_buf_sz)
diff --git a/tools/lib/bpf/bpf.h b/tools/lib/bpf/bpf.h
index eaee585c1cea..90e9d4e85d08 100644
--- a/tools/lib/bpf/bpf.h
+++ b/tools/lib/bpf/bpf.h
@@ -24,8 +24,14 @@
 #include <linux/bpf.h>
 #include <stddef.h>
 
+int bpf_create_map_node(enum bpf_map_type map_type, int key_size,
+			int value_size, int max_entries, __u32 map_flags,
+			int node);
 int bpf_create_map(enum bpf_map_type map_type, int key_size, int value_size,
 		   int max_entries, __u32 map_flags);
+int bpf_create_map_in_map_node(enum bpf_map_type map_type, int key_size,
+			       int inner_map_fd, int max_entries,
+			       __u32 map_flags, int node);
 int bpf_create_map_in_map(enum bpf_map_type map_type, int key_size,
 			  int inner_map_fd, int max_entries, __u32 map_flags);
 
diff --git a/tools/testing/selftests/bpf/bpf_helpers.h b/tools/testing/selftests/bpf/bpf_helpers.h
index 73092d4a898e..98f3be26d390 100644
--- a/tools/testing/selftests/bpf/bpf_helpers.h
+++ b/tools/testing/selftests/bpf/bpf_helpers.h
@@ -94,6 +94,7 @@ struct bpf_map_def {
 	unsigned int max_entries;
 	unsigned int map_flags;
 	unsigned int inner_map_idx;
+	unsigned int numa_node;
 };
 
 static int (*bpf_skb_load_bytes)(void *ctx, int off, void *to, int len) =
-- 
2.9.5
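
The node handling that this patch adds to bpf_create_map_node() can be modelled in userspace without issuing the bpf() syscall. The sketch below is an illustration only: `struct map_attr_model` and `fill_map_attr()` are hypothetical names, not part of libbpf, and the flag value is copied from the uapi hunk above.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Value taken from the uapi hunk in this patch. */
#define BPF_F_NUMA_NODE (1U << 2)

/* Hypothetical stand-in for the BPF_MAP_CREATE fields of union bpf_attr. */
struct map_attr_model {
	uint32_t map_flags;
	uint32_t numa_node;
};

/* Mirrors the patch logic: a negative node means "no NUMA preference". */
static void fill_map_attr(struct map_attr_model *attr, uint32_t map_flags,
			  int node)
{
	assert(attr != NULL);
	memset(attr, 0, sizeof(*attr));
	attr->map_flags = map_flags;
	if (node >= 0) {
		/* The flag is set implicitly whenever a node is given. */
		attr->map_flags |= BPF_F_NUMA_NODE;
		attr->numa_node = (uint32_t)node;
	}
}
```

Passing -1, as the bpf_create_map() wrapper does in the patch, leaves both the flag and the node field untouched.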

^ permalink raw reply related

* Re: pull request: bluetooth-next 2017-08-18
From: David Miller @ 2017-08-18 18:08 UTC (permalink / raw)
  To: johan.hedberg; +Cc: linux-bluetooth, netdev
In-Reply-To: <20170818161623.GA22628@x1c.home>

From: Johan Hedberg <johan.hedberg@gmail.com>
Date: Fri, 18 Aug 2017 19:16:23 +0300

> Here's one more bluetooth-next pull request for the 4.14 kernel:
> 
>  - Multiple fixes for Broadcom controllers
>  - Fixes to the bluecard HCI driver
>  - New USB ID for Realtek RTL8723BE controller
>  - Fix static analyzer warning with kfree
> 
> Please let me know if there are any issues pulling. Thanks.

Pulled, thank you.

^ permalink raw reply

* Re: [PATCH net] bpf, doc: improve sysctl knob description
From: David Miller @ 2017-08-18 18:03 UTC (permalink / raw)
  To: daniel; +Cc: ast, mpe, netdev
In-Reply-To: <5590c8b2426b46f038184153918663019c3bc7c7.1503068601.git.daniel@iogearbox.net>

From: Daniel Borkmann <daniel@iogearbox.net>
Date: Fri, 18 Aug 2017 17:11:06 +0200

> Current context speaking of tcpdump filters is out of date these
> days, so let's improve the sysctl description for the BPF knobs
> a bit.
> 
> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>

Applied, thanks Daniel.

^ permalink raw reply

* Re: [PATCH v2 net-next] ipv4: convert dst_metrics.refcnt from atomic_t to refcount_t
From: Eric Dumazet @ 2017-08-18 18:01 UTC (permalink / raw)
  To: Cong Wang; +Cc: David Miller, netdev
In-Reply-To: <CAM_iQpWbxs6aa-axQrwN5pipaCDX3ws_08LaQON8p4WfrJ7SDg@mail.gmail.com>

On Fri, 2017-08-18 at 10:15 -0700, Cong Wang wrote:

> #include linux/refcount.h explicitly?

Sure, I will send a v3, thanks.

^ permalink raw reply

* Re: [PATCH] netxen: fix incorrect loop counter decrement
From: David Miller @ 2017-08-18 17:59 UTC (permalink / raw)
  To: colin.king
  Cc: manish.chopra, rahul.verma, Dept-GELinuxNICDev, netdev,
	linux-kernel
In-Reply-To: <20170818131206.15417-1-colin.king@canonical.com>

From: Colin King <colin.king@canonical.com>
Date: Fri, 18 Aug 2017 14:12:06 +0100

> From: Colin Ian King <colin.king@canonical.com>
> 
> The loop counter k is currently being decremented from zero which
> is incorrect. Fix this by incrementing k instead
> 
> Detected by CoverityScan, CID#401847 ("Infinite loop")
> 
> Fixes: 83f18a557c6d ("netxen_nic: fw dump support")
> Signed-off-by: Colin Ian King <colin.king@canonical.com>

Applied.
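
For context, the class of bug described above can be sketched in a few lines of plain C. This is not the netxen code; `sum_words()` is a hypothetical helper that only illustrates why a counter starting at zero must count upward.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: sums cnt words from buf, counting k upward. */
static unsigned int sum_words(const unsigned int *buf, int cnt)
{
	unsigned int sum = 0;
	int k;

	assert(buf != NULL || cnt == 0);
	/*
	 * Buggy shape, not executed here:
	 *     for (k = 0; k < cnt; k--)
	 * k goes negative, so "k < cnt" stays true and buf[k] is read
	 * out of bounds.
	 */
	for (k = 0; k < cnt; k++)
		sum += buf[k];
	return sum;
}
```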

^ permalink raw reply

* Re: [PATCH net v2] datagram: When peeking datagrams with offset < 0 don't skip empty skbs
From: Willem de Bruijn @ 2017-08-18 17:56 UTC (permalink / raw)
  To: Matthew Dawson; +Cc: Paolo Abeni, Network Development, Macieira, Thiago
In-Reply-To: <1789862.HUSStbeWG9@ring00>

>> > +   if (flags & MSG_PEEK && *off >= 0) {
>> > +           peek_at_off = true;
>> > +           _off = *off;
>> > +   }
>>
>> I think that unlikely() will fit the above condition
> Sounds good.

Doesn't the compiler implicitly mark branches as unlikely if they
do not have an else clause?

^ permalink raw reply

* Re: [PATCH net v2] datagram: When peeking datagrams with offset < 0 don't skip empty skbs
From: Willem de Bruijn @ 2017-08-18 17:52 UTC (permalink / raw)
  To: Matthew Dawson; +Cc: Network Development, Macieira, Thiago, Paolo Abeni
In-Reply-To: <20170818021157.20070-1-matthew@mjdsystems.ca>

On Thu, Aug 17, 2017 at 10:11 PM, Matthew Dawson <matthew@mjdsystems.ca> wrote:
> Due to commit e6afc8ace6dd5cef5e812f26c72579da8806f5ac ("udp: remove
> headers from UDP packets before queueing"), when udp packets are being
> peeked the requested extra offset is always 0 as there is no need to skip
> the udp header.  However, when the offset is 0 and the next skb is
> of length 0, it is only returned once.  The behaviour can be seen with
> the following python script:
>
> from socket import *;
> f=socket(AF_INET6, SOCK_DGRAM | SOCK_NONBLOCK, 0);
> g=socket(AF_INET6, SOCK_DGRAM | SOCK_NONBLOCK, 0);
> f.bind(('::', 0));
> addr=('::1', f.getsockname()[1]);
> g.sendto(b'', addr)
> g.sendto(b'b', addr)
> print(f.recvfrom(10, MSG_PEEK));
> print(f.recvfrom(10, MSG_PEEK));
>
> Where the expected output should be the empty string twice.
>
> Instead, make sk_peek_offset return negative values, and pass those values
> to __skb_try_recv_datagram/__skb_try_recv_from_queue.  If the passed offset
> to __skb_try_recv_from_queue is negative, the checked skb is never skipped.
> __skb_try_recv_from_queue will then ensure the offset is reset back to 0
> if a peek is requested without an offset, unless no packets are found.
>
> Also simplify the if condition in __skb_try_recv_from_queue.  If _off is
> greater than 0, and off is greater than or equal to skb->len, then
> (_off || skb->len) must always be true assuming skb->len >= 0 is always
> true.
>
> Also remove a redundant check around a call to sk_peek_offset in af_unix.c,
> as it double checked if MSG_PEEK was set in the flags.
>
> V2:
>  - Moved the negative fixup into __skb_try_recv_from_queue, and remove now
> redundant checks
>  - Fix peeking in udp{,v6}_recvmsg to report the right value when the
> offset is 0
>
> Signed-off-by: Matthew Dawson <matthew@mjdsystems.ca>

Acked-by: Willem de Bruijn <willemb@google.com>
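
The skip rule described in the commit message can be modelled in userspace as a rough sketch. Everything here is an assumption made for illustration: `peek_index()` is a hypothetical function, the queue is reduced to an array of datagram lengths, and the condition is written from the description above rather than copied from the patch.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Hypothetical model: returns the index of the datagram a peek would
 * return, or -1 if the offset runs past the queue. A negative off
 * means "no peek offset in use", so nothing is ever skipped.
 */
static int peek_index(const int *lens, size_t n, int off)
{
	size_t i;

	assert(lens != NULL || n == 0);
	for (i = 0; i < n; i++) {
		/* Skip only with a real offset that covers this datagram;
		 * an empty datagram at offset 0 is not skipped. */
		if (off >= 0 && off >= lens[i] && (off || lens[i])) {
			off -= lens[i];
			continue;
		}
		return (int)i;
	}
	return -1;
}
```

With lengths {0, 1}, matching the two sendto() calls in the script above, an offset of -1 or 0 yields index 0 on every call, which corresponds to the empty datagram being peeked repeatedly.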

^ permalink raw reply

* Re: [PATCH net v2] datagram: When peeking datagrams with offset < 0 don't skip empty skbs
From: Willem de Bruijn @ 2017-08-18 17:52 UTC (permalink / raw)
  To: Paolo Abeni; +Cc: Matthew Dawson, Network Development, Macieira, Thiago
In-Reply-To: <1503078168.3344.54.camel@redhat.com>

On Fri, Aug 18, 2017 at 1:42 PM, Paolo Abeni <pabeni@redhat.com> wrote:
> On Fri, 2017-08-18 at 12:39 -0400, Matthew Dawson wrote:
>> On Friday, August 18, 2017 10:05:18 AM EDT Paolo Abeni wrote:
>> > On Thu, 2017-08-17 at 22:11 -0400, Matthew Dawson wrote:
>> > > diff --git a/net/unix/af_unix.c b/net/unix/af_unix.c
>> > > index 7b52a380d710..be8982b4f8c0 100644
>> > > --- a/net/unix/af_unix.c
>> > > +++ b/net/unix/af_unix.c
>> > > @@ -2304,10 +2304,7 @@ static int unix_stream_read_generic(struct
>> > > unix_stream_read_state *state,>
>> > >    */
>> > >
>> > >   mutex_lock(&u->iolock);
>> > >
>> > > - if (flags & MSG_PEEK)
>> > > -         skip = sk_peek_offset(sk, flags);
>> > > - else
>> > > -         skip = 0;
>> > > + skip = max(sk_peek_offset(sk, flags), 0);
>> > >
>> > >   do {
>> > >
>> > >           int chunk;
>> >
>> > later we have:
>> >
>> >     chunk = min_t(unsigned int, unix_skb_len(skb) - skip, size);
>> >
>> > without any call to __skb_try_recv_from_queue(), so we will get
>> > bad/unexpected values from the above assignment when 'skip' is
>> > negative.
>>
>> The assignment to skip should ensure it is never less than zero, thanks to the
>> max(sk...(), 0).  Thus that shouldn't be an issue?
>
> Right, I missed the max() call. Thanks for pointing it out.
> I'm fine with the above.
>
>> >
>> > Overall I still think that adding/using an explicit MSG_PEEK_OFF bit
>> > would produce simpler code, but that is just a personal preference.
>>
>> I don't mind either way, that just seemed to be the preference I saw from the
>> discussion around the patch.  I think either way will work, so whatever the
>> list prefers I'm happy with.
>
> I'm ok either way. Probably it's worth continuing this way.

I don't think anyone cares too much, as long as this is fixed.

I have a slight subjective preference for directly passing the sk_peek_off
variable and relying on the negative value as signal for whether the
SO_PEEK_OFF mode is enabled or not, simply because that signal
already exists and we avoid an intermediate conversion step.

That said, I have a follow-on bug fix for __sk_queue_drop_skb where
that signal is not available, but the flags argument is. I believe that
we will need to take the same action in both cases, so that this is moot.
Just mentioning it in case that would sway opinion the other way.

As said, I'm fine with both. Will Ack this.

^ permalink raw reply

* Re: [net-next PATCH] ipv6: fix false-postive maybe-uninitialized warning
From: David Miller @ 2017-08-18 17:49 UTC (permalink / raw)
  To: arnd
  Cc: kuznet, yoshfuji, fw, dsahern, kafai, weiwan, xiyou.wangcong,
	netdev, linux-kernel
In-Reply-To: <20170818113434.3037484-1-arnd@arndb.de>

From: Arnd Bergmann <arnd@arndb.de>
Date: Fri, 18 Aug 2017 13:34:22 +0200

> Adding a lock around one of the assignments prevents gcc from
> tracking the state of the local 'fibmatch' variable, so it can no
> longer prove that 'dst' is always initialized, leading to a bogus
> warning:
> 
> net/ipv6/route.c: In function 'inet6_rtm_getroute':
> net/ipv6/route.c:3659:2: error: 'dst' may be used uninitialized in this function [-Werror=maybe-uninitialized]
> 
> This moves the other assignment into the same lock to shut up the
> warning.
> 
> Fixes: 121622dba8da ("ipv6: route: make rtm_getroute not assume rtnl is locked")
> Signed-off-by: Arnd Bergmann <arnd@arndb.de>
> ---
>  net/ipv6/route.c | 6 ++++--
>  1 file changed, 4 insertions(+), 2 deletions(-)
> 
> This kind of warning involving an unlock between variable initialization
> and use is relatively frequent for false-positives. I should try to
> seek clarification from the gcc developers on whether this can be
> improved.

This will have to do for now I suppose.

I guess the issue is that if the local variable ever sits on the stack
then the memory barriers in the locks block the full dataflow
analysis.

But this makes no sense from a dataflow perspective.  Even if the
local variable has a stack slot, there is no "escapability" of that
memory address to foreign modifications.

If I had a nickel for every uninitialized variable warning we had to
work around....

^ permalink raw reply

* Re: [PATCH net v2] datagram: When peeking datagrams with offset < 0 don't skip empty skbs
From: Paolo Abeni @ 2017-08-18 17:42 UTC (permalink / raw)
  To: Matthew Dawson; +Cc: netdev, Macieira, Thiago, willemdebruijn.kernel
In-Reply-To: <1789862.HUSStbeWG9@ring00>

On Fri, 2017-08-18 at 12:39 -0400, Matthew Dawson wrote:
> On Friday, August 18, 2017 10:05:18 AM EDT Paolo Abeni wrote:
> > On Thu, 2017-08-17 at 22:11 -0400, Matthew Dawson wrote:
> > > diff --git a/net/unix/af_unix.c b/net/unix/af_unix.c
> > > index 7b52a380d710..be8982b4f8c0 100644
> > > --- a/net/unix/af_unix.c
> > > +++ b/net/unix/af_unix.c
> > > @@ -2304,10 +2304,7 @@ static int unix_stream_read_generic(struct
> > > unix_stream_read_state *state,> 
> > >  	 */
> > >  	
> > >  	mutex_lock(&u->iolock);
> > > 
> > > -	if (flags & MSG_PEEK)
> > > -		skip = sk_peek_offset(sk, flags);
> > > -	else
> > > -		skip = 0;
> > > +	skip = max(sk_peek_offset(sk, flags), 0);
> > > 
> > >  	do {
> > >  	
> > >  		int chunk;
> > 
> > later we have:
> > 
> > 	chunk = min_t(unsigned int, unix_skb_len(skb) - skip, size);
> > 
> > without any call to __skb_try_recv_from_queue(), so we will get
> > bad/unexpected values from the above assignment when 'skip' is
> > negative.
> 
> The assignment to skip should ensure it is never less than zero, thanks to the
> max(sk...(), 0).  Thus that shouldn't be an issue?

Right, I missed the max() call. Thanks for pointing it out. 
I'm fine with the above.

> > 
> > Overall I still think that adding/using an explicit MSG_PEEK_OFF bit
> > would produce simpler code, but that is just a personal preference.
> 
> I don't mind either way, that just seemed to be the preference I saw from the 
> discussion around the patch.  I think either way will work, so whatever the 
> list prefers I'm happy with.

I'm ok either way. Probably it's worth continuing this way.

Paolo
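
The concern about a negative 'skip' and the effect of the max() clamp can be shown with a small userspace sketch. `chunk_len()` is a hypothetical function that stands in for the min_t() expression quoted above. It is not the af_unix code.

```c
#include <assert.h>

/*
 * Hypothetical model of:
 *     chunk = min_t(unsigned int, unix_skb_len(skb) - skip, size);
 * with skip clamped to zero first, as max(sk_peek_offset(sk, flags), 0)
 * guarantees in the patch.
 */
static unsigned int chunk_len(unsigned int skb_len, int skip,
			      unsigned int size)
{
	unsigned int avail;

	if (skip < 0)
		skip = 0;
	assert((unsigned int)skip <= skb_len);
	/*
	 * Without the clamp, skip == -1 converts to UINT_MAX and
	 * 10 - UINT_MAX wraps to 11, one byte more than the skb holds.
	 */
	avail = skb_len - (unsigned int)skip;
	return avail < size ? avail : size;
}
```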

^ permalink raw reply

* Re: [PATCH net-next 0/3] Misc. Bug fixes for HNS3 Ethernet Driver
From: David Miller @ 2017-08-18 17:32 UTC (permalink / raw)
  To: salil.mehta
  Cc: yisen.zhuang, lipeng321, dan.carpenter, mehta.salil.lnk, netdev,
	linux-kernel, linux-rdma, linuxarm
In-Reply-To: <20170818113139.153200-1-salil.mehta@huawei.com>

From: Salil Mehta <salil.mehta@huawei.com>
Date: Fri, 18 Aug 2017 12:31:36 +0100

> This patch-set fixes various bugs reported by community.

Series applied.

^ permalink raw reply

* Re: [patch net-next] net/sched: Fix the logic error to decide the ingress qdisc
From: David Miller @ 2017-08-18 17:29 UTC (permalink / raw)
  To: chrism; +Cc: netdev, jiri
In-Reply-To: <1503055460-36795-1-git-send-email-chrism@mellanox.com>

From: Chris Mi <chrism@mellanox.com>
Date: Fri, 18 Aug 2017 07:24:20 -0400

> The offending commit used a newly added helper function.
> But the logic is wrong. Without this fix, the affected NICs
> can't do HW offload. Error -EOPNOTSUPP will be returned directly.
> 
> Fixes: a2e8da9378cc ("net/sched: use newly added classid identity helpers")
> Signed-off-by: Chris Mi <chrism@mellanox.com>
> Acked-by: Jiri Pirko <jiri@mellanox.com>

Applied.

^ permalink raw reply

* Re: [PATCH] nfp: fix infinite loop on umapping cleanup
From: David Miller @ 2017-08-18 17:29 UTC (permalink / raw)
  To: colin.king
  Cc: simon.horman, daniel, oss-drivers, netdev, jakub.kicinski,
	linux-kernel
In-Reply-To: <20170818111150.13716-1-colin.king@canonical.com>

From: Colin King <colin.king@canonical.com>
Date: Fri, 18 Aug 2017 12:11:50 +0100

> From: Colin Ian King <colin.king@canonical.com>
> 
> The while loop that performs the dma page unmapping never decrements
> index counter f and hence loops forever. Fix this with a pre-decrement
> on f.
> 
> Detected by CoverityScan, CID#1357309 ("Infinite loop")
> 
> Fixes: 4c3523623dc0 ("net: add driver for Netronome NFP4000/NFP6000 NIC VFs")
> Signed-off-by: Colin Ian King <colin.king@canonical.com>

Applied and queued up for -stable.
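
The unwind pattern that the fix restores can be sketched as follows. This is a userspace model, not the nfp driver: `map_frags()` and `NFRAGS` are hypothetical, and an int array stands in for DMA mappings.

```c
#include <assert.h>
#include <stddef.h>

#define NFRAGS 4

/*
 * Hypothetical model: "maps" fragments in order and fails at index
 * fail_at (pass -1 for no failure). On failure, every fragment mapped
 * so far is unmapped again.
 */
static int map_frags(int mapped[NFRAGS], int fail_at)
{
	int f;

	assert(mapped != NULL);
	for (f = 0; f < NFRAGS; f++) {
		if (f == fail_at)
			goto err_unmap;
		mapped[f] = 1;
	}
	return 0;

err_unmap:
	/* The pre-decrement moves f toward zero; without it the loop
	 * condition never changes and the loop cannot terminate. */
	while (--f >= 0)
		mapped[f] = 0;
	return -1;
}
```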

^ permalink raw reply

* Re: [PATCH net-next 0/7] s390/net: more updates for 4.14
From: David Miller @ 2017-08-18 17:22 UTC (permalink / raw)
  To: jwi; +Cc: netdev, linux-s390, schwidefsky, heiko.carstens, raspl, ubraun
In-Reply-To: <20170818081910.48869-1-jwi@linux.vnet.ibm.com>

From: Julian Wiedmann <jwi@linux.vnet.ibm.com>
Date: Fri, 18 Aug 2017 10:19:03 +0200

> please apply another batch of qeth patches for net-next.
> This reworks the xmit path for L2 OSAs to use skb_cow_head() instead of
> skb_realloc_headroom().

Series applied, thanks Julian.

^ permalink raw reply

* Re: [patch net] net: sched: fix p_filter_chain check in tcf_chain_flush
From: David Miller @ 2017-08-18 17:19 UTC (permalink / raw)
  To: jiri; +Cc: netdev, jhs, xiyou.wangcong, mlxsw
In-Reply-To: <20170818081043.2130-1-jiri@resnulli.us>

From: Jiri Pirko <jiri@resnulli.us>
Date: Fri, 18 Aug 2017 10:10:43 +0200

> From: Jiri Pirko <jiri@mellanox.com>
> 
> The dereference before check is wrong and leads to an oops when
> p_filter_chain is NULL. The check needs to be done on the pointer to
> prevent NULL dereference.
> 
> Fixes: f93e1cdcf42c ("net/sched: fix filter flushing")
> Signed-off-by: Jiri Pirko <jiri@mellanox.com>

Applied, thanks Jiri.
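
The difference between checking the pointer and checking what it points to can be sketched in userspace. `struct chain_model` and `chain_flush_model()` are hypothetical and only borrow the field name from the description above.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical model of a chain holding a pointer to a list head slot. */
struct chain_model {
	void **p_filter_chain;
};

/* Clears the head slot if the chain has one; returns 1 if it did. */
static int chain_flush_model(struct chain_model *chain)
{
	assert(chain != NULL);
	/*
	 * Testing "*chain->p_filter_chain" here instead would dereference
	 * the pointer before knowing it is non-NULL.
	 */
	if (chain->p_filter_chain) {
		*chain->p_filter_chain = NULL;
		return 1;
	}
	return 0;
}
```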

^ permalink raw reply

* Re: [PATCH net-next] bpf: fix a return in sockmap_get_from_fd()
From: David Miller @ 2017-08-18 17:18 UTC (permalink / raw)
  To: dan.carpenter; +Cc: ast, john.fastabend, daniel, netdev, kernel-janitors
In-Reply-To: <20170818071210.wyq37kura6wz6bx6@mwanda>

From: Dan Carpenter <dan.carpenter@oracle.com>
Date: Fri, 18 Aug 2017 10:27:02 +0300

> "map" is a valid pointer.  We wanted to return "err" instead.  Also
> let's return a zero literal at the end.
> 
> Fixes: 174a79ff9515 ("bpf: sockmap with sk redirect support")
> Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>

Applied.

^ permalink raw reply

* Re: [PATCH v2 net-next] ipv4: convert dst_metrics.refcnt from atomic_t to refcount_t
From: Cong Wang @ 2017-08-18 17:15 UTC (permalink / raw)
  To: Eric Dumazet; +Cc: David Miller, netdev
In-Reply-To: <1502925599.4936.153.camel@edumazet-glaptop3.roam.corp.google.com>

On Wed, Aug 16, 2017 at 4:19 PM, Eric Dumazet <eric.dumazet@gmail.com> wrote:
> From: Eric Dumazet <edumazet@google.com>
>
> refcount_t type and corresponding API should be
> used instead of atomic_t when the variable is used as
> a reference counter. This allows to avoid accidental
> refcounter overflows that might lead to use-after-free
> situations.
>
> Signed-off-by: Eric Dumazet <edumazet@google.com>
> ---
> v2: fix a missing change in net/ipv4/fib_semantics.c
>
>  include/net/dst.h        |    2 +-
>  net/core/dst.c           |    6 +++---
>  net/ipv4/fib_semantics.c |    4 ++--
>  net/ipv4/route.c         |    4 ++--
>  4 files changed, 8 insertions(+), 8 deletions(-)
>
> diff --git a/include/net/dst.h b/include/net/dst.h
> index f73611ec401754d4f52b5310a24da53566dafce6..dd38177c3a61f5c4e48be9d57d4d10d6b7d14672 100644
> --- a/include/net/dst.h
> +++ b/include/net/dst.h
> @@ -107,7 +107,7 @@ struct dst_entry {
>
>  struct dst_metrics {
>         u32             metrics[RTAX_MAX];
> -       atomic_t        refcnt;
> +       refcount_t      refcnt;
>  };
>  extern const struct dst_metrics dst_default_metrics;


#include linux/refcount.h explicitly?
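
The overflow argument in the commit message can be illustrated with a simplified, non-atomic userspace model. It is not the kernel's refcount_t implementation: the `refcount_model_*` names are hypothetical, and saturating at UINT_MAX is an assumption chosen for the sketch.

```c
#include <assert.h>
#include <limits.h>
#include <stddef.h>

/* Hypothetical, single-threaded model of a saturating reference count. */
struct refcount_model {
	unsigned int refs;
};

static void refcount_model_inc(struct refcount_model *r)
{
	assert(r != NULL);
	/* Saturate instead of wrapping to zero, which a plain counter
	 * would do and which could later free an object still in use. */
	if (r->refs != UINT_MAX)
		r->refs++;
}

/* Returns 1 when the last reference was dropped. */
static int refcount_model_dec_and_test(struct refcount_model *r)
{
	assert(r != NULL);
	if (r->refs == UINT_MAX)
		return 0;	/* saturated: leak the object, never free it */
	assert(r->refs > 0);
	return --r->refs == 0;
}
```

A saturated counter stays pinned, so the object is leaked rather than freed while references may still exist.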

^ permalink raw reply

* Re: [PATCH net-next 0/2] liquidio: initialization fixes for embedded firmware
From: David Miller @ 2017-08-18 17:15 UTC (permalink / raw)
  To: felix.manlunas
  Cc: netdev, raghu.vatsavayi, derek.chickles, satananda.burla,
	ricardo.farrington
In-Reply-To: <20170818061037.GA4024@felix-thinkpad.cavium.com>

From: Felix Manlunas <felix.manlunas@cavium.com>
Date: Thu, 17 Aug 2017 23:10:37 -0700

> From: Rick Farrington <ricardo.farrington@cavium.com>
> 
> Fix problems when using an adapter w/embedded f/w (param "fw_type=none").
> 
> 1. Add support for PF FLR when exiting.
> 2. Skip some initialization (don't try to load f/w, activate consoles).
> 3. Issue credits BEFORE enabling DROQs.

Series applied, thanks.

^ permalink raw reply

* Re: [PATCH v3 3/4] net: stmmac: register parent MDIO node for sun8i-h3-emac
From: Chen-Yu Tsai @ 2017-08-18 17:05 UTC (permalink / raw)
  To: Corentin Labbe
  Cc: Rob Herring, Mark Rutland, Russell King, Maxime Ripard,
	Chen-Yu Tsai, Giuseppe Cavallaro, Alexandre Torgue, devicetree,
	linux-arm-kernel, linux-kernel, netdev
In-Reply-To: <20170818122118.4925-4-clabbe.montjoie@gmail.com>

On Fri, Aug 18, 2017 at 8:21 PM, Corentin Labbe
<clabbe.montjoie@gmail.com> wrote:
> In case of a MDIO switch, the registered MDIO node should be
> the parent of the PHY. Otherwise of_phy_connect will fail.
>
> Signed-off-by: Corentin Labbe <clabbe.montjoie@gmail.com>
> ---
>  drivers/net/ethernet/stmicro/stmmac/stmmac_platform.c | 12 ++++++++++--
>  1 file changed, 10 insertions(+), 2 deletions(-)
>
> diff --git a/drivers/net/ethernet/stmicro/stmmac/stmmac_platform.c b/drivers/net/ethernet/stmicro/stmmac/stmmac_platform.c
> index a366b3747eeb..ca3cc99d8960 100644
> --- a/drivers/net/ethernet/stmicro/stmmac/stmmac_platform.c
> +++ b/drivers/net/ethernet/stmicro/stmmac/stmmac_platform.c
> @@ -312,10 +312,12 @@ static int stmmac_dt_phy(struct plat_stmmacenet_data *plat,
>         static const struct of_device_id need_mdio_ids[] = {
>                 { .compatible = "snps,dwc-qos-ethernet-4.10" },
>                 { .compatible = "allwinner,sun8i-a83t-emac" },
> -               { .compatible = "allwinner,sun8i-h3-emac" },
>                 { .compatible = "allwinner,sun8i-v3s-emac" },
>                 { .compatible = "allwinner,sun50i-a64-emac" },
>         };
> +       static const struct of_device_id need_mdio_mux_ids[] = {
> +               { .compatible = "allwinner,sun8i-h3-emac" },
> +       };
>
>         /* If phy-handle property is passed from DT, use it as the PHY */
>         plat->phy_node = of_parse_phandle(np, "phy-handle", 0);
> @@ -332,7 +334,13 @@ static int stmmac_dt_phy(struct plat_stmmacenet_data *plat,
>                 mdio = false;
>         }
>
> -       if (of_match_node(need_mdio_ids, np)) {
> +       /*
> +        * In case of a MDIO switch/mux, the registered MDIO node should be
> +        * the parent of the PHY. Otherwise of_phy_connect will fail.
> +        */
> +       if (of_match_node(need_mdio_mux_ids, np)) {
> +                plat->mdio_node =  of_get_parent(plat->phy_node);

Extra space before of_get_parent.

Also this is going to fail horribly if a fixed link is used.

ChenYu

> +       } else if (of_match_node(need_mdio_ids, np)) {
>                 plat->mdio_node = of_get_child_by_name(np, "mdio");
>         } else {
>                 /**
> --
> 2.13.0
>

^ permalink raw reply
