Netdev List
 help / color / mirror / Atom feed
* Re: [PATCH net] mlx4_core: restore optimal ICM memory allocation
From: Eric Dumazet @ 2018-05-30 20:50 UTC (permalink / raw)
  To: Qing Huang
  Cc: David Miller, netdev, Eric Dumazet, John Sperbeck, Tarick Bedeir,
	Daniel Jurgens, Zhu Yanjun, Tariq Toukan
In-Reply-To: <c87db444-266d-eb76-d8d5-d8a0c11038b1@oracle.com>

On Wed, May 30, 2018 at 4:30 PM Qing Huang <qing.huang@oracle.com> wrote:
>
>
>
> On 5/29/2018 9:11 PM, Eric Dumazet wrote:
> > Commit 1383cb8103bb ("mlx4_core: allocate ICM memory in page size chunks")
> > brought a regression caught in our regression suite, thanks to KASAN.
>
> If KASAN reported issue was really caused by smaller chunk sizes,
> changing allocation
> order dynamically will eventually hit the same issue.

Sigh, you have little idea of what your patch really did...

The KASAN part only shows the tip of the iceberg, but our main concern
is an increase of memory overhead.

Alternative is to revert your patch, since we are now very late in 4.17 cycle.

Memory usage has grown a lot with your patch, since each 4KB page needs a full
struct mlx4_icm_chunk (256 bytes of overhead !)

Really we have no choice here, your patch went too far and increased
memory consumption quite a lot.

My patch is simply the best way to address your original concern, and
not increase overall overhead.

( each struct mlx4_icm_chunk should be able to store
MLX4_ICM_CHUNK_LEN pages, instead of one page of 4KB )

^ permalink raw reply

* Re: [PATCH][next] bpf: devmap: remove redundant assignment of dev = dev
From: Alexei Starovoitov @ 2018-05-30 21:07 UTC (permalink / raw)
  To: Colin King
  Cc: Alexei Starovoitov, Daniel Borkmann, netdev, kernel-janitors,
	linux-kernel
In-Reply-To: <20180530150916.16808-1-colin.king@canonical.com>

On Wed, May 30, 2018 at 04:09:16PM +0100, Colin King wrote:
> From: Colin Ian King <colin.king@canonical.com>
> 
> The assignment dev = dev is redundant and should be removed.
> 
> Detected by CoverityScan, CID#1469486 ("Evaluation order violation")
> 
> Signed-off-by: Colin Ian King <colin.king@canonical.com>

Applied, Thanks.

Please use [PATCH bpf-next] in the subject next time.
See Documentation/bpf/bpf_devel_QA.rst

^ permalink raw reply

* Re: [PATCH net] mlx4_core: restore optimal ICM memory allocation
From: Qing Huang @ 2018-05-30 21:08 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: David Miller, netdev, Eric Dumazet, John Sperbeck, Tarick Bedeir,
	Daniel Jurgens, Zhu Yanjun, Tariq Toukan, linux-rdma,
	santosh.shilimkar@oracle.com
In-Reply-To: <CANn89i+KRHvsxH9Gs4QhtMt4Ex2OH2h120gxjYkBf1_27O27_g@mail.gmail.com>



On 5/30/2018 1:50 PM, Eric Dumazet wrote:
> On Wed, May 30, 2018 at 4:30 PM Qing Huang <qing.huang@oracle.com> wrote:
>>
>>
>> On 5/29/2018 9:11 PM, Eric Dumazet wrote:
>>> Commit 1383cb8103bb ("mlx4_core: allocate ICM memory in page size chunks")
>>> brought a regression caught in our regression suite, thanks to KASAN.
>> If KASAN reported issue was really caused by smaller chunk sizes,
>> changing allocation
>> order dynamically will eventually hit the same issue.
> Sigh, you have little idea of what your patch really did...
>
> The KASAN part only shows the tip of the iceberg, but our main concern
> is an increase of memory overhead.

Well, the commit log only mentioned KASAN and but the change here didn't 
seem to solve
the issue.

>
> Alternative is to revert your patch, since we are now very late in 4.17 cycle.
>
> Memory usage has grown a lot with your patch, since each 4KB page needs a full
> struct mlx4_icm_chunk (256 bytes of overhead !)

Going to smaller chunks will have some overhead. It depends on the 
application though.
What's the total increased memory consumption in your env?

>
> Really we have no choice here, your patch went too far and increased
> memory consumption quite a lot.



>
> My patch is simply the best way to address your original concern, and
> not increase overall overhead.
>
> ( each struct mlx4_icm_chunk should be able to store
> MLX4_ICM_CHUNK_LEN pages, instead of one page of 4KB )

^ permalink raw reply

* Re: [PATCH bpf-next] bpf: Change bpf_fib_lookup to return -EAFNOSUPPORT for unsupported address families
From: Alexei Starovoitov @ 2018-05-30 21:19 UTC (permalink / raw)
  To: dsahern; +Cc: netdev, borkmann, ast, David Ahern
In-Reply-To: <20180530192417.9945-1-dsahern@kernel.org>

On Wed, May 30, 2018 at 12:24:17PM -0700, dsahern@kernel.org wrote:
> From: David Ahern <dsahern@gmail.com>
> 
> Update bpf_fib_lookup to return -EAFNOSUPPORT for unsupported address
> families. Allows userspace to probe for support as more are added
> (e.g., AF_MPLS).
> 
> Signed-off-by: David Ahern <dsahern@gmail.com>

Applied, Thanks

^ permalink raw reply

* Re: [PATCH net-next 1/3] net: Add support to configure SR-IOV VF minimum and maximum queues.
From: Samudrala, Sridhar @ 2018-05-30 21:23 UTC (permalink / raw)
  To: Jakub Kicinski, Michael Chan; +Cc: David Miller, Netdev, Or Gerlitz
In-Reply-To: <CAJpBn1yyyJueWrtR3frO1NkntTCZxsEZ4heW_sx5--7vpfF_Qw@mail.gmail.com>

On 5/29/2018 11:33 PM, Jakub Kicinski wrote:
> On Tue, 29 May 2018 23:08:11 -0700, Michael Chan wrote:
>> On Tue, May 29, 2018 at 10:56 PM, Jakub Kicinski wrote:
>>> On Tue, 29 May 2018 20:19:54 -0700, Michael Chan wrote:
>>>> On Tue, May 29, 2018 at 1:46 PM, Samudrala, Sridhar wrote:
>>>>> Isn't ndo_set_vf_xxx() considered a legacy interface and not planned to be
>>>>> extended?
>>> +1 it's painful to see this feature being added to the legacy
>>> API :(  Another duplicated configuration knob.
>>>
>>>> I didn't know about that.
>>>>
>>>>> Shouldn't we enable this via ethtool on the port representor netdev?
>>>> We discussed about this.  ethtool on the VF representor will only work
>>>> in switchdev mode and also will not support min/max values.
>>> Ethtool channel API may be overdue a rewrite in devlink anyway, but I
>>> feel like implementing switchdev mode and rewriting features in devlink
>>> may be too much to ask.
>> Totally agreed.  And switchdev mode doesn't seem to be that widely
>> used at the moment.  Do you have other suggestions besides NDO?
> At some points you (Broadcom) were working whole bunch of devlink
> configuration options for the PCIe side of the ASIC.  The number of
> queues relates to things like number of allocated MSI-X vectors, which
> if memory serves me was in your devlink patch set.  In an ideal world
> we would try to keep all those in one place :)
>
> For PCIe config there is always the question of what can be configured
> at runtime, and what requires a HW reset.  Therefore that devlink API
> which could configure current as well as persistent device settings was
> quite nice.  I'm not sure if reallocating queues would ever require
> PCIe block reset but maybe...  Certainly it seems the notion of min
> queues would make more sense in PCIe configuration devlink API than
> ethtool channel API to me as well.
>
> Queues are in the grey area between netdev and non-netdev constructs.
> They make sense both from PCIe resource allocation perspective (i.e.
> devlink PCIe settings) and netdev perspective (ethtool) because they
> feed into things like qdisc offloads, maybe per-queue stats etc.
>
> So yes...  IMHO it would be nice to add this to a devlink SR-IOV config
> API and/or switchdev representors.  But neither of those are really an
> option for you today so IDK :)

One reason why 'switchdev' mode is not yet widely used or enabled by default
could be due to the requirement to program the flow rules only via slow path.

Would it make sense to relax this requirement and support a mode where port
representors are created and let the PF driver implement a default policy that
adds flow rules for all the VFs to enable connectivity and let the user
add/modify the rules via port representors?

^ permalink raw reply

* [PATCH bpf] bpf: prevent non-ipv4 socket to be added into sock map
From: Wei Wang @ 2018-05-30 21:28 UTC (permalink / raw)
  To: netdev, John Fastabend; +Cc: Eric Dumazet, Willem de Bruijn, Wei Wang

From: Wei Wang <weiwan@google.com>

Sock map only supports IPv4 socket proto right now.
If a non-IPv4 socket gets stored in the BPF map, sk->sk_prot gets
overwritten with the v4 tcp prot.
It could potentially cause issues when invoking functions from
sk->sk_prot later in the stack.

Fixes: 174a79ff9515 ("bpf: sockmap with sk redirect support")
Signed-off-by: Wei Wang <weiwan@google.com>
Acked-by: Eric Dumazet <edumazet@google.com>
Acked-by: Willem de Bruijn <willemb@google.com>
---
 kernel/bpf/sockmap.c | 5 +++++
 1 file changed, 5 insertions(+)

diff --git a/kernel/bpf/sockmap.c b/kernel/bpf/sockmap.c
index 95a84b2f10ce..1984922f99ee 100644
--- a/kernel/bpf/sockmap.c
+++ b/kernel/bpf/sockmap.c
@@ -1873,6 +1873,11 @@ static int sock_map_update_elem(struct bpf_map *map,
 		return -EOPNOTSUPP;
 	}
 
+	if (skops.sk->sk_family != AF_INET) {
+		fput(socket->file);
+		return -EAFNOSUPPORT;
+	}
+
 	err = sock_map_ctx_update_elem(&skops, map, key, flags);
 	fput(socket->file);
 	return err;
-- 
2.17.1.1185.g55be947832-goog

^ permalink raw reply related

* [PATCH bpf-next] bpf: prevent non-IPv4 socket to be added into sock hash
From: Wei Wang @ 2018-05-30 21:29 UTC (permalink / raw)
  To: netdev, John Fastabend; +Cc: Eric Dumazet, Willem de Bruijn, Wei Wang

From: Wei Wang <weiwan@google.com>

Sock hash only supports IPv4 socket proto right now.
If a non-IPv4 socket gets stored in the BPF map, sk->sk_prot gets
overwritten with the v4 tcp prot.

Syskaller reported the following related issue on an IPv6 socket:
BUG: KASAN: slab-out-of-bounds in ip6_dst_idev include/net/ip6_fib.h:203 [inline]
BUG: KASAN: slab-out-of-bounds in ip6_xmit+0x2002/0x23f0 net/ipv6/ip6_output.c:264
Read of size 8 at addr ffff8801b300edb0 by task syz-executor888/4522

CPU: 0 PID: 4522 Comm: syz-executor888 Not tainted 4.17.0-rc4+ #17
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
Call Trace:
 __dump_stack lib/dump_stack.c:77 [inline]
 dump_stack+0x1b9/0x294 lib/dump_stack.c:113
 print_address_description+0x6c/0x20b mm/kasan/report.c:256
 kasan_report_error mm/kasan/report.c:354 [inline]
 kasan_report.cold.7+0x242/0x2fe mm/kasan/report.c:412
 __asan_report_load8_noabort+0x14/0x20 mm/kasan/report.c:433
 ip6_dst_idev include/net/ip6_fib.h:203 [inline]
 ip6_xmit+0x2002/0x23f0 net/ipv6/ip6_output.c:264
 inet6_csk_xmit+0x377/0x630 net/ipv6/inet6_connection_sock.c:139
 tcp_transmit_skb+0x1be0/0x3e40 net/ipv4/tcp_output.c:1159
 tcp_send_syn_data net/ipv4/tcp_output.c:3441 [inline]
 tcp_connect+0x2207/0x45a0 net/ipv4/tcp_output.c:3480
 tcp_v4_connect+0x1934/0x1d50 net/ipv4/tcp_ipv4.c:272
 __inet_stream_connect+0x943/0x1120 net/ipv4/af_inet.c:655
 tcp_sendmsg_fastopen net/ipv4/tcp.c:1162 [inline]
 tcp_sendmsg_locked+0x2859/0x3ee0 net/ipv4/tcp.c:1209
 tcp_sendmsg+0x2f/0x50 net/ipv4/tcp.c:1447
 inet_sendmsg+0x19f/0x690 net/ipv4/af_inet.c:798
 sock_sendmsg_nosec net/socket.c:629 [inline]
 sock_sendmsg+0xd5/0x120 net/socket.c:639
 ___sys_sendmsg+0x805/0x940 net/socket.c:2117
 __sys_sendmsg+0x115/0x270 net/socket.c:2155
 __do_sys_sendmsg net/socket.c:2164 [inline]
 __se_sys_sendmsg net/socket.c:2162 [inline]
 __x64_sys_sendmsg+0x78/0xb0 net/socket.c:2162
 do_syscall_64+0x1b1/0x800 arch/x86/entry/common.c:287
 entry_SYSCALL_64_after_hwframe+0x49/0xbe
RIP: 0033:0x43ff99
RSP: 002b:00007ffc00bd1cf8 EFLAGS: 00000217 ORIG_RAX: 000000000000002e
RAX: ffffffffffffffda RBX: 00000000004002c8 RCX: 000000000043ff99
RDX: 0000000020000000 RSI: 0000000020000580 RDI: 0000000000000003
RBP: 00000000006ca018 R08: 00000000004002c8 R09: 00000000004002c8
R10: 00000000004002c8 R11: 0000000000000217 R12: 00000000004018c0
R13: 0000000000401950 R14: 0000000000000000 R15: 0000000000000000

Fixes: 81110384441a ("bpf: sockmap, add hash map support")
Reported-by: syzbot+5c063698bdbfac19f363@syzkaller.appspotmail.com
Signed-off-by: Wei Wang <weiwan@google.com>
Acked-by: Eric Dumazet <edumazet@google.com>
Acked-by: Willem de Bruijn <willemb@google.com>
---
 kernel/bpf/sockmap.c | 5 +++++
 1 file changed, 5 insertions(+)

diff --git a/kernel/bpf/sockmap.c b/kernel/bpf/sockmap.c
index 3b28955a6383..0e7b88bc3e3f 100644
--- a/kernel/bpf/sockmap.c
+++ b/kernel/bpf/sockmap.c
@@ -2300,6 +2300,11 @@ static int sock_hash_update_elem(struct bpf_map *map,
 		return -EINVAL;
 	}
 
+	if (skops.sk->sk_family != AF_INET) {
+		fput(socket->file);
+		return -EAFNOSUPPORT;
+	}
+
 	err = sock_hash_ctx_update_elem(&skops, map, key, flags);
 	fput(socket->file);
 	return err;
-- 
2.17.1.1185.g55be947832-goog

^ permalink raw reply related

* Re: [PATCH net] mlx4_core: restore optimal ICM memory allocation
From: Eric Dumazet @ 2018-05-30 21:30 UTC (permalink / raw)
  To: Qing Huang
  Cc: David Miller, netdev, Eric Dumazet, John Sperbeck, Tarick Bedeir,
	Daniel Jurgens, Zhu Yanjun, Tariq Toukan, linux-rdma,
	Santosh Shilimkar
In-Reply-To: <2cd40c9e-16a5-ba96-db7c-3aafc5af0957@oracle.com>

On Wed, May 30, 2018 at 5:08 PM Qing Huang <qing.huang@oracle.com> wrote:
>
>
>
> On 5/30/2018 1:50 PM, Eric Dumazet wrote:
> > On Wed, May 30, 2018 at 4:30 PM Qing Huang <qing.huang@oracle.com> wrote:
> >>
> >>
> >> On 5/29/2018 9:11 PM, Eric Dumazet wrote:
> >>> Commit 1383cb8103bb ("mlx4_core: allocate ICM memory in page size chunks")
> >>> brought a regression caught in our regression suite, thanks to KASAN.
> >> If KASAN reported issue was really caused by smaller chunk sizes,
> >> changing allocation
> >> order dynamically will eventually hit the same issue.
> > Sigh, you have little idea of what your patch really did...
> >
> > The KASAN part only shows the tip of the iceberg, but our main concern
> > is an increase of memory overhead.
>
> Well, the commit log only mentioned KASAN and but the change here didn't
> seem to solve
> the issue.

Can you elaborate ?

My patch solves our problems.

Both the memory overhead and KASAN splats are gone.

>
> >
> > Alternative is to revert your patch, since we are now very late in 4.17 cycle.
> >
> > Memory usage has grown a lot with your patch, since each 4KB page needs a full
> > struct mlx4_icm_chunk (256 bytes of overhead !)
>
> Going to smaller chunks will have some overhead. It depends on the
> application though.
> What's the total increased memory consumption in your env?


As I explained, your patch adds 256 bytes of overhead per 4KB.

Your changelog did not mentioned that at all, and we discovered this
the hard way.

That is pretty intolerable, and is a blocker for us, memory is precious.

^ permalink raw reply

* Re: linux-next: build failure after merge of the net-next tree
From: Stephen Rothwell @ 2018-05-30 21:38 UTC (permalink / raw)
  To: David Miller, Networking
  Cc: Linux-Next Mailing List, Linux Kernel Mailing List,
	Jakub Kicinski, Alexei Starovoitov, Daniel Borkmann
In-Reply-To: <20180529132548.7f71c2ff@canb.auug.org.au>

[-- Attachment #1: Type: text/plain, Size: 781 bytes --]

Hi all,

On Tue, 29 May 2018 13:25:48 +1000 Stephen Rothwell <sfr@canb.auug.org.au> wrote:
>
> After merging the net-next tree, today's linux-next build (x86_64
> allmodconfig) failed like this:
> 
> x86_64-linux-ld: unknown architecture of input file `net/bpfilter/bpfilter_umh.o' is incompatible with i386:x86-64 output
> 
> Caused by commit
> 
>   d2ba09c17a06 ("net: add skeleton of bpfilter kernel module")
> 
> In my builds, the host is PowerPC 64 LE ...
> 
> I have reverted that commit along with
> 
>   61a552eb487f ("bpfilter: fix build dependency")
>   13405468f49d ("bpfilter: don't pass O_CREAT when opening console for debug")
> 
> for today.

I am still getting this failure (well, at least yesterday I did).

-- 
Cheers,
Stephen Rothwell

[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 488 bytes --]

^ permalink raw reply

* [PATCH] ieee802154: mcr20a: add missing includes
From: Arnd Bergmann @ 2018-05-30 21:47 UTC (permalink / raw)
  To: Xue Liu, Alexander Aring, Stefan Schmidt, David S. Miller
  Cc: Arnd Bergmann, Colin Ian King, Gustavo A. R. Silva, linux-wpan,
	netdev, linux-kernel

Without CONFIG_GPIOLIB, some headers are not included implicitly,
leading to a build failure:

drivers/net/ieee802154/mcr20a.c: In function 'mcr20a_probe':
drivers/net/ieee802154/mcr20a.c:1347:13: error: implicit declaration of function 'irq_get_trigger_type'; did you mean 'irq_get_irqchip_state'? [-Werror=implicit-function-declaration]

This includes gpio/consumer.h and irq.h directly rather through the
gpiolib header.

Fixes: 8c6ad9cc5157 ("ieee802154: Add NXP MCR20A IEEE 802.15.4 transceiver driver")
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
---
 drivers/net/ieee802154/mcr20a.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/drivers/net/ieee802154/mcr20a.c b/drivers/net/ieee802154/mcr20a.c
index de0d7f28a181..e428277781ac 100644
--- a/drivers/net/ieee802154/mcr20a.c
+++ b/drivers/net/ieee802154/mcr20a.c
@@ -15,10 +15,11 @@
  */
 #include <linux/kernel.h>
 #include <linux/module.h>
-#include <linux/gpio.h>
+#include <linux/gpio/consumer.h>
 #include <linux/spi/spi.h>
 #include <linux/workqueue.h>
 #include <linux/interrupt.h>
+#include <linux/irq.h>
 #include <linux/skbuff.h>
 #include <linux/of_gpio.h>
 #include <linux/regmap.h>
-- 
2.9.0

^ permalink raw reply related

* [PATCH] net: ti: cpsw: include gpio/consumer.h
From: Arnd Bergmann @ 2018-05-30 21:51 UTC (permalink / raw)
  To: David S. Miller, Grygorii Strashko
  Cc: linux-gpio, Florian Fainelli, Arnd Bergmann, Ivan Khoronzhuk,
	Keerthy, linux-omap, netdev, linux-kernel

On platforms that don't always enable CONFIG_GPIOLIB, we run into
a build failure:

drivers/net/ethernet/ti/cpsw.c: In function 'cpsw_probe':
drivers/net/ethernet/ti/cpsw.c:3006:9: error: implicit declaration of function 'devm_gpiod_get_array_optional' [-Werror=implicit-function-declaration]
  mode = devm_gpiod_get_array_optional(&pdev->dev, "mode", GPIOD_OUT_LOW);
         ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~
drivers/net/ethernet/ti/cpsw.c:3006:59: error: 'GPIOD_OUT_LOW' undeclared (first use in this function); did you mean 'GPIOF_INIT_LOW'?
  mode = devm_gpiod_get_array_optional(&pdev->dev, "mode", GPIOD_OUT_LOW);

Since we cannot rely on this to be visible from gpio.h, we have to include
gpio/consumer.h directly.

Fixes: 2652113ff043 ("net: ethernet: ti: Allow most drivers with COMPILE_TEST")
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
---
 drivers/net/ethernet/ti/cpsw.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/net/ethernet/ti/cpsw.c b/drivers/net/ethernet/ti/cpsw.c
index 643cd2d9dfb6..534596ce00d3 100644
--- a/drivers/net/ethernet/ti/cpsw.c
+++ b/drivers/net/ethernet/ti/cpsw.c
@@ -29,7 +29,7 @@
 #include <linux/workqueue.h>
 #include <linux/delay.h>
 #include <linux/pm_runtime.h>
-#include <linux/gpio.h>
+#include <linux/gpio/consumer.h>
 #include <linux/of.h>
 #include <linux/of_mdio.h>
 #include <linux/of_net.h>
-- 
2.9.0

^ permalink raw reply related

* Re: [bpf-next V1 PATCH 1/8] xdp: add flags argument to ndo_xdp_xmit API
From: Song Liu @ 2018-05-30 21:55 UTC (permalink / raw)
  To: Jesper Dangaard Brouer
  Cc: Networking, Daniel Borkmann, Alexei Starovoitov, John Fastabend,
	makita.toshiaki
In-Reply-To: <152770323794.20510.15169206222123614428.stgit@firesoul>

On Wed, May 30, 2018 at 11:00 AM, Jesper Dangaard Brouer
<brouer@redhat.com> wrote:
> This patch only change the API and reject any use of flags. This is an
> intermediate step that allows us to implement the flush flag operation
> later, for each individual driver in a separate patch.
>
> The plan is to implement flush operation via XDP_XMIT_FLUSH flag
> and then remove XDP_XMIT_FLAGS_NONE when done.
>
> Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
> ---
>  drivers/net/ethernet/intel/i40e/i40e_txrx.c   |    6 +++++-
>  drivers/net/ethernet/intel/i40e/i40e_txrx.h   |    3 ++-
>  drivers/net/ethernet/intel/ixgbe/ixgbe_main.c |    5 ++++-
>  drivers/net/tun.c                             |    8 ++++++--
>  drivers/net/virtio_net.c                      |    5 ++++-
>  include/linux/netdevice.h                     |    7 ++++---
>  include/net/xdp.h                             |    5 +++++
>  kernel/bpf/devmap.c                           |    2 +-
>  net/core/filter.c                             |    2 +-
>  9 files changed, 32 insertions(+), 11 deletions(-)
>
> diff --git a/drivers/net/ethernet/intel/i40e/i40e_txrx.c b/drivers/net/ethernet/intel/i40e/i40e_txrx.c
> index 9b698c5acd05..c0451d6e0790 100644
> --- a/drivers/net/ethernet/intel/i40e/i40e_txrx.c
> +++ b/drivers/net/ethernet/intel/i40e/i40e_txrx.c
> @@ -3670,7 +3670,8 @@ netdev_tx_t i40e_lan_xmit_frame(struct sk_buff *skb, struct net_device *netdev)
>   * For error cases, a negative errno code is returned and no-frames
>   * are transmitted (caller must handle freeing frames).
>   **/
> -int i40e_xdp_xmit(struct net_device *dev, int n, struct xdp_frame **frames)
> +int i40e_xdp_xmit(struct net_device *dev, int n, struct xdp_frame **frames,
> +                 u32 flags)
>  {
>         struct i40e_netdev_priv *np = netdev_priv(dev);
>         unsigned int queue_index = smp_processor_id();
> @@ -3684,6 +3685,9 @@ int i40e_xdp_xmit(struct net_device *dev, int n, struct xdp_frame **frames)
>         if (!i40e_enabled_xdp_vsi(vsi) || queue_index >= vsi->num_queue_pairs)
>                 return -ENXIO;
>
> +       if (unlikely(flags & ~XDP_XMIT_FLAGS_NONE))
> +               return -EINVAL;
> +
>         for (i = 0; i < n; i++) {
>                 struct xdp_frame *xdpf = frames[i];
>                 int err;
> diff --git a/drivers/net/ethernet/intel/i40e/i40e_txrx.h b/drivers/net/ethernet/intel/i40e/i40e_txrx.h
> index eb8804b3d7b6..820f76db251b 100644
> --- a/drivers/net/ethernet/intel/i40e/i40e_txrx.h
> +++ b/drivers/net/ethernet/intel/i40e/i40e_txrx.h
> @@ -487,7 +487,8 @@ u32 i40e_get_tx_pending(struct i40e_ring *ring, bool in_sw);
>  void i40e_detect_recover_hung(struct i40e_vsi *vsi);
>  int __i40e_maybe_stop_tx(struct i40e_ring *tx_ring, int size);
>  bool __i40e_chk_linearize(struct sk_buff *skb);
> -int i40e_xdp_xmit(struct net_device *dev, int n, struct xdp_frame **frames);
> +int i40e_xdp_xmit(struct net_device *dev, int n, struct xdp_frame **frames,
> +                 u32 flags);
>  void i40e_xdp_flush(struct net_device *dev);
>
>  /**
> diff --git a/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c b/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c
> index 031d65c4178d..87f088f4af52 100644
> --- a/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c
> +++ b/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c
> @@ -10023,7 +10023,7 @@ static int ixgbe_xdp(struct net_device *dev, struct netdev_bpf *xdp)
>  }
>
>  static int ixgbe_xdp_xmit(struct net_device *dev, int n,
> -                         struct xdp_frame **frames)
> +                         struct xdp_frame **frames, u32 flags)
>  {
>         struct ixgbe_adapter *adapter = netdev_priv(dev);
>         struct ixgbe_ring *ring;
> @@ -10033,6 +10033,9 @@ static int ixgbe_xdp_xmit(struct net_device *dev, int n,
>         if (unlikely(test_bit(__IXGBE_DOWN, &adapter->state)))
>                 return -ENETDOWN;
>
> +       if (unlikely(flags & ~XDP_XMIT_FLAGS_NONE))
> +               return -EINVAL;
> +
>         /* During program transitions its possible adapter->xdp_prog is assigned
>          * but ring has not been configured yet. In this case simply abort xmit.
>          */
> diff --git a/drivers/net/tun.c b/drivers/net/tun.c
> index 2265d2ccea47..b182b8cdd219 100644
> --- a/drivers/net/tun.c
> +++ b/drivers/net/tun.c
> @@ -1285,7 +1285,8 @@ static const struct net_device_ops tun_netdev_ops = {
>         .ndo_get_stats64        = tun_net_get_stats64,
>  };
>
> -static int tun_xdp_xmit(struct net_device *dev, int n, struct xdp_frame **frames)
> +static int tun_xdp_xmit(struct net_device *dev, int n,
> +                       struct xdp_frame **frames, u32 flags)
>  {
>         struct tun_struct *tun = netdev_priv(dev);
>         struct tun_file *tfile;
> @@ -1294,6 +1295,9 @@ static int tun_xdp_xmit(struct net_device *dev, int n, struct xdp_frame **frames
>         int cnt = n;
>         int i;
>
> +       if (unlikely(flags & ~XDP_XMIT_FLAGS_NONE))
> +               return -EINVAL;
> +
>         rcu_read_lock();
>
>         numqueues = READ_ONCE(tun->numqueues);
> @@ -1332,7 +1336,7 @@ static int tun_xdp_tx(struct net_device *dev, struct xdp_buff *xdp)
>         if (unlikely(!frame))
>                 return -EOVERFLOW;
>
> -       return tun_xdp_xmit(dev, 1, &frame);
> +       return tun_xdp_xmit(dev, 1, &frame, 0);
>  }
>
>  static void tun_xdp_flush(struct net_device *dev)
> diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio_net.c
> index b2647dd5d302..4ed823625953 100644
> --- a/drivers/net/virtio_net.c
> +++ b/drivers/net/virtio_net.c
> @@ -468,7 +468,7 @@ static int __virtnet_xdp_tx_xmit(struct virtnet_info *vi,
>  }
>
>  static int virtnet_xdp_xmit(struct net_device *dev,
> -                           int n, struct xdp_frame **frames)
> +                           int n, struct xdp_frame **frames, u32 flags)
>  {
>         struct virtnet_info *vi = netdev_priv(dev);
>         struct receive_queue *rq = vi->rq;
> @@ -481,6 +481,9 @@ static int virtnet_xdp_xmit(struct net_device *dev,
>         int err;
>         int i;
>
> +       if (unlikely(flags & ~XDP_XMIT_FLAGS_NONE))
> +               return -EINVAL;
> +
>         qp = vi->curr_queue_pairs - vi->xdp_queue_pairs + smp_processor_id();
>         sq = &vi->sq[qp];
>
> diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
> index 8452f72087ef..7f17785a59d7 100644
> --- a/include/linux/netdevice.h
> +++ b/include/linux/netdevice.h
> @@ -1185,13 +1185,13 @@ struct dev_ifalias {
>   *     This function is used to set or query state related to XDP on the
>   *     netdevice and manage BPF offload. See definition of
>   *     enum bpf_netdev_command for details.
> - * int (*ndo_xdp_xmit)(struct net_device *dev, int n, struct xdp_frame **xdp);
> + * int (*ndo_xdp_xmit)(struct net_device *dev, int n, struct xdp_frame **xdp,
> + *                     u32 flags);
>   *     This function is used to submit @n XDP packets for transmit on a
>   *     netdevice. Returns number of frames successfully transmitted, frames
>   *     that got dropped are freed/returned via xdp_return_frame().
>   *     Returns negative number, means general error invoking ndo, meaning
>   *     no frames were xmit'ed and core-caller will free all frames.
> - *     TODO: Consider add flag to allow sending flush operation.
>   * void (*ndo_xdp_flush)(struct net_device *dev);
>   *     This function is used to inform the driver to flush a particular
>   *     xdp tx queue. Must be called on same CPU as xdp_xmit.
> @@ -1380,7 +1380,8 @@ struct net_device_ops {
>         int                     (*ndo_bpf)(struct net_device *dev,
>                                            struct netdev_bpf *bpf);
>         int                     (*ndo_xdp_xmit)(struct net_device *dev, int n,
> -                                               struct xdp_frame **xdp);
> +                                               struct xdp_frame **xdp,
> +                                               u32 flags);
>         void                    (*ndo_xdp_flush)(struct net_device *dev);
>  };
>
> diff --git a/include/net/xdp.h b/include/net/xdp.h
> index 7ad779237ae8..308a4b30b484 100644
> --- a/include/net/xdp.h
> +++ b/include/net/xdp.h
> @@ -40,6 +40,11 @@ enum xdp_mem_type {
>         MEM_TYPE_MAX,
>  };
>
> +/* XDP flags for ndo_xdp_xmit */
> +#define XDP_XMIT_FLAGS_NONE    0U
> +#define XDP_XMIT_FLUSH         (1U << 0)
> +#define XDP_XMIT_FLAGS_MASK    XDP_XMIT_FLUSH
> +

I guess we need more documentation here on what XDP_XMIT_FLUSH does.

Other than this, it looks good to me.

Acked-by: Song Liu <songliubraving@fb.com>


>  struct xdp_mem_info {
>         u32 type; /* enum xdp_mem_type, but known size type */
>         u32 id;
> diff --git a/kernel/bpf/devmap.c b/kernel/bpf/devmap.c
> index ae16d0c373ef..04fbd75a5274 100644
> --- a/kernel/bpf/devmap.c
> +++ b/kernel/bpf/devmap.c
> @@ -232,7 +232,7 @@ static int bq_xmit_all(struct bpf_dtab_netdev *obj,
>                 prefetch(xdpf);
>         }
>
> -       sent = dev->netdev_ops->ndo_xdp_xmit(dev, bq->count, bq->q);
> +       sent = dev->netdev_ops->ndo_xdp_xmit(dev, bq->count, bq->q, 0);
>         if (sent < 0) {
>                 err = sent;
>                 sent = 0;
> diff --git a/net/core/filter.c b/net/core/filter.c
> index 81bd2e9fe8fc..6a21dbcad350 100644
> --- a/net/core/filter.c
> +++ b/net/core/filter.c
> @@ -3056,7 +3056,7 @@ static int __bpf_tx_xdp(struct net_device *dev,
>         if (unlikely(!xdpf))
>                 return -EOVERFLOW;
>
> -       sent = dev->netdev_ops->ndo_xdp_xmit(dev, 1, &xdpf);
> +       sent = dev->netdev_ops->ndo_xdp_xmit(dev, 1, &xdpf, 0);
>         if (sent <= 0)
>                 return sent;
>         dev->netdev_ops->ndo_xdp_flush(dev);
>

^ permalink raw reply

* Re: [bpf-next V1 PATCH 2/8] i40e: implement flush flag for ndo_xdp_xmit
From: Song Liu @ 2018-05-30 21:58 UTC (permalink / raw)
  To: Jesper Dangaard Brouer
  Cc: Networking, Daniel Borkmann, Alexei Starovoitov, John Fastabend,
	makita.toshiaki
In-Reply-To: <152770324304.20510.3553783983496502887.stgit@firesoul>

On Wed, May 30, 2018 at 11:00 AM, Jesper Dangaard Brouer
<brouer@redhat.com> wrote:
> Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>

I guess we still need to say something in the commit message?

> ---
>  drivers/net/ethernet/intel/i40e/i40e_txrx.c |    5 ++++-
>  1 file changed, 4 insertions(+), 1 deletion(-)
>
> diff --git a/drivers/net/ethernet/intel/i40e/i40e_txrx.c b/drivers/net/ethernet/intel/i40e/i40e_txrx.c
> index c0451d6e0790..03c1446f0465 100644
> --- a/drivers/net/ethernet/intel/i40e/i40e_txrx.c
> +++ b/drivers/net/ethernet/intel/i40e/i40e_txrx.c
> @@ -3685,7 +3685,7 @@ int i40e_xdp_xmit(struct net_device *dev, int n, struct xdp_frame **frames,
>         if (!i40e_enabled_xdp_vsi(vsi) || queue_index >= vsi->num_queue_pairs)
>                 return -ENXIO;
>
> -       if (unlikely(flags & ~XDP_XMIT_FLAGS_NONE))
> +       if (unlikely(flags & ~XDP_XMIT_FLAGS_MASK))
>                 return -EINVAL;
>
>         for (i = 0; i < n; i++) {
> @@ -3699,6 +3699,9 @@ int i40e_xdp_xmit(struct net_device *dev, int n, struct xdp_frame **frames,
>                 }
>         }
>
> +       if (unlikely(flags & XDP_XMIT_FLUSH))
> +               i40e_xdp_ring_update_tail(vsi->xdp_rings[queue_index]);
> +
>         return n - drops;

Do we still flush when drops > 0?

Thanks,
Song

>  }
>
>

^ permalink raw reply

* Re: [PATCH net-next] net: bridge: Lock before br_fdb_find()
From: Petr Machata @ 2018-05-30 22:03 UTC (permalink / raw)
  To: David Miller; +Cc: bridge, netdev, stephen
In-Reply-To: <20180530.124215.898586103229215718.davem@davemloft.net>

David Miller <davem@davemloft.net> writes:

> From: Petr Machata <petrm@mellanox.com>
> Date: Mon, 28 May 2018 17:44:16 +0200
>
>> Callers of br_fdb_find() need to hold the hash lock, which
>> br_fdb_find_port() doesn't do. Add the missing lock/unlock
>> pair.
>> 
>> Signed-off-by: Petr Machata <petrm@mellanox.com>
>
> If all of the these uses of br_fdb_find_port() are safe, then it
> should use the RCU fdb lookup variant.
>
> So I basically agree with Stephen that this locking doesn't make any
> sense.
>
> The lock is needed when you are going to add or delete an FDB entry.
>
> Here we are doing a lookup and returning a device pointer via the FDB
> entry found in the lookup.
>
> The RTNL assertion assures that the device returned won't disappear.
>
> If the device can disappear, the spinlock added by this patch doesn't
> change that at all.

OK, I'll take another look at this.

Thanks,
Petr

^ permalink raw reply

* Re: [bpf-next V1 PATCH 8/8] bpf/xdp: devmap can avoid calling ndo_xdp_flush
From: Song Liu @ 2018-05-30 22:06 UTC (permalink / raw)
  To: Jesper Dangaard Brouer
  Cc: Networking, Daniel Borkmann, Alexei Starovoitov, John Fastabend,
	makita.toshiaki
In-Reply-To: <152770327343.20510.17045843025201198801.stgit@firesoul>

On Wed, May 30, 2018 at 11:01 AM, Jesper Dangaard Brouer
<brouer@redhat.com> wrote:
> The XDP_REDIRECT map devmap can avoid using ndo_xdp_flush, by instead
> instructing ndo_xdp_xmit to flush via XDP_XMIT_FLUSH flag in
> appropriate places.
>
> Notice after this patch it is possible to remove ndo_xdp_flush
> completely, as this is the last user of ndo_xdp_flush. This is left
> for later patches, to keep driver changes separate.
>
> Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
> ---
>  kernel/bpf/devmap.c |   20 +++++++-------------
>  1 file changed, 7 insertions(+), 13 deletions(-)
>
> diff --git a/kernel/bpf/devmap.c b/kernel/bpf/devmap.c
> index 04fbd75a5274..9c846a7a8cff 100644
> --- a/kernel/bpf/devmap.c
> +++ b/kernel/bpf/devmap.c
> @@ -217,7 +217,7 @@ void __dev_map_insert_ctx(struct bpf_map *map, u32 bit)
>  }
>
>  static int bq_xmit_all(struct bpf_dtab_netdev *obj,
> -                        struct xdp_bulk_queue *bq)
> +                      struct xdp_bulk_queue *bq, bool flush)

How about we use "int flags" instead of "bool flush" for easier extension?

Thanks,
Song

>  {
>         struct net_device *dev = obj->dev;
>         int sent = 0, drops = 0, err = 0;
> @@ -232,7 +232,8 @@ static int bq_xmit_all(struct bpf_dtab_netdev *obj,
>                 prefetch(xdpf);
>         }
>
> -       sent = dev->netdev_ops->ndo_xdp_xmit(dev, bq->count, bq->q, 0);
> +       sent = dev->netdev_ops->ndo_xdp_xmit(dev, bq->count, bq->q,
> +                                            flush ? XDP_XMIT_FLUSH : 0);
>         if (sent < 0) {
>                 err = sent;
>                 sent = 0;
> @@ -276,7 +277,6 @@ void __dev_map_flush(struct bpf_map *map)
>         for_each_set_bit(bit, bitmap, map->max_entries) {
>                 struct bpf_dtab_netdev *dev = READ_ONCE(dtab->netdev_map[bit]);
>                 struct xdp_bulk_queue *bq;
> -               struct net_device *netdev;
>
>                 /* This is possible if the dev entry is removed by user space
>                  * between xdp redirect and flush op.
> @@ -287,10 +287,7 @@ void __dev_map_flush(struct bpf_map *map)
>                 __clear_bit(bit, bitmap);
>
>                 bq = this_cpu_ptr(dev->bulkq);
> -               bq_xmit_all(dev, bq);
> -               netdev = dev->dev;
> -               if (likely(netdev->netdev_ops->ndo_xdp_flush))
> -                       netdev->netdev_ops->ndo_xdp_flush(netdev);
> +               bq_xmit_all(dev, bq, true);
>         }
>  }
>
> @@ -320,7 +317,7 @@ static int bq_enqueue(struct bpf_dtab_netdev *obj, struct xdp_frame *xdpf,
>         struct xdp_bulk_queue *bq = this_cpu_ptr(obj->bulkq);
>
>         if (unlikely(bq->count == DEV_MAP_BULK_SIZE))
> -               bq_xmit_all(obj, bq);
> +               bq_xmit_all(obj, bq, false);
>
>         /* Ingress dev_rx will be the same for all xdp_frame's in
>          * bulk_queue, because bq stored per-CPU and must be flushed
> @@ -359,8 +356,7 @@ static void *dev_map_lookup_elem(struct bpf_map *map, void *key)
>
>  static void dev_map_flush_old(struct bpf_dtab_netdev *dev)
>  {
> -       if (dev->dev->netdev_ops->ndo_xdp_flush) {
> -               struct net_device *fl = dev->dev;
> +       if (dev->dev->netdev_ops->ndo_xdp_xmit) {
>                 struct xdp_bulk_queue *bq;
>                 unsigned long *bitmap;
>
> @@ -371,9 +367,7 @@ static void dev_map_flush_old(struct bpf_dtab_netdev *dev)
>                         __clear_bit(dev->bit, bitmap);
>
>                         bq = per_cpu_ptr(dev->bulkq, cpu);
> -                       bq_xmit_all(dev, bq);
> -
> -                       fl->netdev_ops->ndo_xdp_flush(dev->dev);
> +                       bq_xmit_all(dev, bq, true);
>                 }
>         }
>  }
>

^ permalink raw reply

* [PATCH net-next] net: dsa: mv88e6xxx: Be explicit about DT or pdata
From: Andrew Lunn @ 2018-05-30 22:15 UTC (permalink / raw)
  To: David Miller
  Cc: Vivien Didelot, Florian Fainelli, dan.carpenter, netdev,
	Andrew Lunn

Make it explicit that either device tree is used or platform data.  If
neither is available, abort the probe.

Reported-by: Dan Carpenter <dan.carpenter@oracle.com>
Fixes: 877b7cb0b6f2 ("net: dsa: mv88e6xxx: Add minimal platform_data support")
Signed-off-by: Andrew Lunn <andrew@lunn.ch>
---
 drivers/net/dsa/mv88e6xxx/chip.c | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/drivers/net/dsa/mv88e6xxx/chip.c b/drivers/net/dsa/mv88e6xxx/chip.c
index 12df00f593b7..437cd6eb4faa 100644
--- a/drivers/net/dsa/mv88e6xxx/chip.c
+++ b/drivers/net/dsa/mv88e6xxx/chip.c
@@ -4389,6 +4389,9 @@ static int mv88e6xxx_probe(struct mdio_device *mdiodev)
 	int port;
 	int err;
 
+	if (!np && !pdata)
+		return -EINVAL;
+
 	if (np)
 		compat_info = of_device_get_match_data(dev);
 
-- 
2.17.0

^ permalink raw reply related

* Re: [bpf-next V1 PATCH 0/8] bpf/xdp: add flags argument to ndo_xdp_xmit and flag flush operation
From: Song Liu @ 2018-05-30 22:18 UTC (permalink / raw)
  To: Jesper Dangaard Brouer
  Cc: Networking, Daniel Borkmann, Alexei Starovoitov, John Fastabend,
	makita.toshiaki
In-Reply-To: <152770312703.20510.5854417568847239931.stgit@firesoul>

Overall, this set looks good to me. The only suggestion I have is to add more
documentation on the expected behavior of XDP_XMIT_FLUSH in netdevice.h
(as part of 01/08).

Thanks,
Song


On Wed, May 30, 2018 at 11:00 AM, Jesper Dangaard Brouer
<brouer@redhat.com> wrote:
> As I mentioned in merge commit 10f678683e4 ("Merge branch 'xdp_xmit-bulking'")
> I plan to change the API for ndo_xdp_xmit once more, by adding a flags
> argument, which is done in this patchset.
>
> I know it is late in the cycle (currently at rc7), but it would be
> nice to avoid changing NDOs over several kernel releases, as it is
> annoying to vendors and distro backporters, but it is not strictly
> UAPI so it is allowed (according to Alexei).
>
> The end-goal is getting rid of the ndo_xdp_flush operation, as it will
> make it possible for drivers to implement a TXQ synchronization mechanism
> that is not necessarily derived from the CPU id (smp_processor_id).
>
> This patchset removes all callers of the ndo_xdp_flush operation, but
> it doesn't take the last step of removing it from all drivers.  This
> can be done later, or I can update the patchset on request.
>
> Micro-benchmarks only show a very small performance improvement, for
> map-redirect around ~2 ns, and for non-map redirect ~7 ns.  I've not
> benchmarked this with CONFIG_RETPOLINE, but the performance benefit
> should be more visible given we end-up removing an indirect call.
>
> ---
>
> Jesper Dangaard Brouer (8):
>       xdp: add flags argument to ndo_xdp_xmit API
>       i40e: implement flush flag for ndo_xdp_xmit
>       ixgbe: implement flush flag for ndo_xdp_xmit
>       tun: implement flush flag for ndo_xdp_xmit
>       virtio_net: implement flush flag for ndo_xdp_xmit
>       xdp: done implementing ndo_xdp_xmit flush flag for all drivers
>       bpf/xdp: non-map redirect can avoid calling ndo_xdp_flush
>       bpf/xdp: devmap can avoid calling ndo_xdp_flush
>
>
>  drivers/net/ethernet/intel/i40e/i40e_txrx.c   |    9 ++++++++-
>  drivers/net/ethernet/intel/i40e/i40e_txrx.h   |    3 ++-
>  drivers/net/ethernet/intel/ixgbe/ixgbe_main.c |   23 +++++++++++++++++------
>  drivers/net/tun.c                             |   25 ++++++++++++++++++-------
>  drivers/net/virtio_net.c                      |    9 ++++++++-
>  include/linux/netdevice.h                     |    7 ++++---
>  include/net/xdp.h                             |    4 ++++
>  kernel/bpf/devmap.c                           |   20 +++++++-------------
>  net/core/filter.c                             |    3 +--
>  9 files changed, 69 insertions(+), 34 deletions(-)
>
> --

^ permalink raw reply

* [PATCH] rtnetlink: Remove VLA usage
From: Kees Cook @ 2018-05-30 22:20 UTC (permalink / raw)
  To: David S. Miller; +Cc: Florian Westphal, David Ahern, netdev, linux-kernel

In the quest to remove all stack VLA usage from the kernel[1], this
allocates the maximum size expected for all possible types and adds
sanity-checks at both registration and usage to make sure nothing gets
out of sync. This matches the proposed VLA solution for nfnetlink[2]. The
values chosen here were based on finding assignments for .maxtype and
.slave_maxtype and manually counting the enums:

slave_maxtype (max 33):
	IFLA_BRPORT_MAX     33
	IFLA_BOND_SLAVE_MAX  9

maxtype (max 45):
	IFLA_BOND_MAX       28
	IFLA_BR_MAX         45
	__IFLA_CAIF_HSI_MAX  8
	IFLA_CAIF_MAX        4
	IFLA_CAN_MAX        16
	IFLA_GENEVE_MAX     12
	IFLA_GRE_MAX        25
	IFLA_GTP_MAX         5
	IFLA_HSR_MAX         7
	IFLA_IPOIB_MAX       4
	IFLA_IPTUN_MAX      21
	IFLA_IPVLAN_MAX      3
	IFLA_MACSEC_MAX     15
	IFLA_MACVLAN_MAX     7
	IFLA_PPP_MAX         2
	__IFLA_RMNET_MAX     4
	IFLA_VLAN_MAX        6
	IFLA_VRF_MAX         2
	IFLA_VTI_MAX         7
	IFLA_VXLAN_MAX      28
	VETH_INFO_MAX        2
	VXCAN_INFO_MAX       2

This additionally changes maxtype and slave_maxtype fields to unsigned,
since they're only ever using positive values.

[1] https://lkml.kernel.org/r/CA+55aFzCG-zNmZwX4A2FQpadafLfEzK6CC=qPXydAacU1RqZWA@mail.gmail.com
[2] https://patchwork.kernel.org/patch/10439647/

Signed-off-by: Kees Cook <keescook@chromium.org>
---
 include/net/rtnetlink.h |  4 ++--
 net/core/rtnetlink.c    | 18 ++++++++++++++++--
 2 files changed, 18 insertions(+), 4 deletions(-)

diff --git a/include/net/rtnetlink.h b/include/net/rtnetlink.h
index 14b6b3af8918..0bbaa5488423 100644
--- a/include/net/rtnetlink.h
+++ b/include/net/rtnetlink.h
@@ -64,7 +64,7 @@ struct rtnl_link_ops {
 	size_t			priv_size;
 	void			(*setup)(struct net_device *dev);
 
-	int			maxtype;
+	unsigned int		maxtype;
 	const struct nla_policy	*policy;
 	int			(*validate)(struct nlattr *tb[],
 					    struct nlattr *data[],
@@ -92,7 +92,7 @@ struct rtnl_link_ops {
 	unsigned int		(*get_num_tx_queues)(void);
 	unsigned int		(*get_num_rx_queues)(void);
 
-	int			slave_maxtype;
+	unsigned int		slave_maxtype;
 	const struct nla_policy	*slave_policy;
 	int			(*slave_changelink)(struct net_device *dev,
 						    struct net_device *slave_dev,
diff --git a/net/core/rtnetlink.c b/net/core/rtnetlink.c
index 45936922d7e2..4ede33719ca9 100644
--- a/net/core/rtnetlink.c
+++ b/net/core/rtnetlink.c
@@ -59,6 +59,9 @@
 #include <net/rtnetlink.h>
 #include <net/net_namespace.h>
 
+#define RTNL_MAX_TYPE		48
+#define RTNL_SLAVE_MAX_TYPE	36
+
 struct rtnl_link {
 	rtnl_doit_func		doit;
 	rtnl_dumpit_func	dumpit;
@@ -389,6 +392,11 @@ int rtnl_link_register(struct rtnl_link_ops *ops)
 {
 	int err;
 
+	/* Sanity-check max sizes to avoid stack buffer overflow. */
+	if (WARN_ON(ops->maxtype > RTNL_MAX_TYPE ||
+		    ops->slave_maxtype > RTNL_SLAVE_MAX_TYPE))
+		return -EINVAL;
+
 	rtnl_lock();
 	err = __rtnl_link_register(ops);
 	rtnl_unlock();
@@ -2900,13 +2908,16 @@ static int rtnl_newlink(struct sk_buff *skb, struct nlmsghdr *nlh,
 	}
 
 	if (1) {
-		struct nlattr *attr[ops ? ops->maxtype + 1 : 1];
-		struct nlattr *slave_attr[m_ops ? m_ops->slave_maxtype + 1 : 1];
+		struct nlattr *attr[RTNL_MAX_TYPE + 1];
+		struct nlattr *slave_attr[RTNL_SLAVE_MAX_TYPE + 1];
 		struct nlattr **data = NULL;
 		struct nlattr **slave_data = NULL;
 		struct net *dest_net, *link_net = NULL;
 
 		if (ops) {
+			if (ops->maxtype > RTNL_MAX_TYPE)
+				return -EINVAL;
+
 			if (ops->maxtype && linkinfo[IFLA_INFO_DATA]) {
 				err = nla_parse_nested(attr, ops->maxtype,
 						       linkinfo[IFLA_INFO_DATA],
@@ -2923,6 +2934,9 @@ static int rtnl_newlink(struct sk_buff *skb, struct nlmsghdr *nlh,
 		}
 
 		if (m_ops) {
+			if (ops->slave_maxtype > RTNL_SLAVE_MAX_TYPE)
+				return -EINVAL;
+
 			if (m_ops->slave_maxtype &&
 			    linkinfo[IFLA_INFO_SLAVE_DATA]) {
 				err = nla_parse_nested(slave_attr,
-- 
2.17.0


-- 
Kees Cook
Pixel Security

^ permalink raw reply related

* Re: [PATCH net-next 1/3] net: Add support to configure SR-IOV VF minimum and maximum queues.
From: Jakub Kicinski @ 2018-05-30 22:36 UTC (permalink / raw)
  To: Michael Chan; +Cc: Samudrala, Sridhar, David Miller, Netdev
In-Reply-To: <CACKFLi=-QVePYwAiG+boXoH0tdWTaM1ZRyV4qxsj0Au1w5PKBQ@mail.gmail.com>

On Wed, 30 May 2018 00:18:39 -0700, Michael Chan wrote:
> On Tue, May 29, 2018 at 11:33 PM, Jakub Kicinski wrote:
> > At some points you (Broadcom) were working whole bunch of devlink
> > configuration options for the PCIe side of the ASIC.  The number of
> > queues relates to things like number of allocated MSI-X vectors, which
> > if memory serves me was in your devlink patch set.  In an ideal world
> > we would try to keep all those in one place :)  
> 
> Yeah, another colleague is now working with Mellanox on something similar.
> 
> One difference between those devlink parameters and these queue
> parameters is that the former are more permanent and global settings.
> For example, number of VFs or number of MSIX per VF are persistent
> settings once they are set and after PCIe reset.  On the other hand,
> these queue settings are pure run-time settings and may be unique for
> each VF.  These are not stored as there is no room in NVRAM to store
> 128 sets or more of these parameters.

Indeed, I think the API must be flexible as to what is persistent and
what is not because HW will certainly differ in that regard.  And
agreed, queues may be a bit of a stretch here, but worth a try.

> Anyway, let me discuss this with my colleague to see if there is a
> natural fit for these queue parameters in the devlink infrastructure
> that they are working on.

Thank you!

^ permalink raw reply

* greetings
From: Miss Zeliha ömer Faruk @ 2018-05-30 22:34 UTC (permalink / raw)




-- 
Hello

I have been trying to contact you. Did you get my business proposal?

Best Regards,
Miss.Zeliha ömer faruk
Esentepe Mahallesi Büyükdere
Caddesi Kristal Kule Binasi
No:215 Sisli - Istanbul, Turke

^ permalink raw reply

* Re: [PATCH net-next 1/3] net: Add support to configure SR-IOV VF minimum and maximum queues.
From: Jakub Kicinski @ 2018-05-30 22:53 UTC (permalink / raw)
  To: Samudrala, Sridhar; +Cc: Michael Chan, David Miller, Netdev, Or Gerlitz
In-Reply-To: <6dd76ffc-4097-0e5e-6f66-78fd178f89c2@intel.com>

On Wed, 30 May 2018 14:23:06 -0700, Samudrala, Sridhar wrote:
> On 5/29/2018 11:33 PM, Jakub Kicinski wrote:
> > On Tue, 29 May 2018 23:08:11 -0700, Michael Chan wrote:  
> >> On Tue, May 29, 2018 at 10:56 PM, Jakub Kicinski wrote:  
> >>> On Tue, 29 May 2018 20:19:54 -0700, Michael Chan wrote:  
> >>>> On Tue, May 29, 2018 at 1:46 PM, Samudrala, Sridhar wrote:  
> >>>>> Isn't ndo_set_vf_xxx() considered a legacy interface and not planned to be
> >>>>> extended?  
> >>> +1 it's painful to see this feature being added to the legacy
> >>> API :(  Another duplicated configuration knob.
> >>>  
> >>>> I didn't know about that.
> >>>>  
> >>>>> Shouldn't we enable this via ethtool on the port representor netdev?  
> >>>> We discussed about this.  ethtool on the VF representor will only work
> >>>> in switchdev mode and also will not support min/max values.  
> >>> Ethtool channel API may be overdue a rewrite in devlink anyway, but I
> >>> feel like implementing switchdev mode and rewriting features in devlink
> >>> may be too much to ask.  
> >> Totally agreed.  And switchdev mode doesn't seem to be that widely
> >> used at the moment.  Do you have other suggestions besides NDO?  
> > At some points you (Broadcom) were working whole bunch of devlink
> > configuration options for the PCIe side of the ASIC.  The number of
> > queues relates to things like number of allocated MSI-X vectors, which
> > if memory serves me was in your devlink patch set.  In an ideal world
> > we would try to keep all those in one place :)
> >
> > For PCIe config there is always the question of what can be configured
> > at runtime, and what requires a HW reset.  Therefore that devlink API
> > which could configure current as well as persistent device settings was
> > quite nice.  I'm not sure if reallocating queues would ever require
> > PCIe block reset but maybe...  Certainly it seems the notion of min
> > queues would make more sense in PCIe configuration devlink API than
> > ethtool channel API to me as well.
> >
> > Queues are in the grey area between netdev and non-netdev constructs.
> > They make sense both from PCIe resource allocation perspective (i.e.
> > devlink PCIe settings) and netdev perspective (ethtool) because they
> > feed into things like qdisc offloads, maybe per-queue stats etc.
> >
> > So yes...  IMHO it would be nice to add this to a devlink SR-IOV config
> > API and/or switchdev representors.  But neither of those are really an
> > option for you today so IDK :)  
> 
> One reason why 'switchdev' mode is not yet widely used or enabled by default
> could be due to the requirement to program the flow rules only via slow path.

Do you mean the fallback traffic requirement?

> Would it make sense to relax this requirement and support a mode where port
> representors are created and let the PF driver implement a default policy that
> adds flow rules for all the VFs to enable connectivity and let the user
> add/modify the rules via port representors?

I definitely share your concerns, stopping a major HW vendor from using
this new and preferred mode is not helping us make progress.

The problem is that if we allow this diversion, i.e. driver to implement
some special policy, or pre-populate a bridge in a configuration that
suits the HW we may condition users to expect that as the standard Linux
behaviour.  And we will be stuck with it forever even tho your next gen
HW (ice?) may support correct behaviour.

We should perhaps separate switchdev mode from TC flower/OvS offloads.
Is your objective to implement OvS offload or just switchdev mode?  

For OvS without proper fallback behaviour you may struggle.

Switchdev mode could be within your reach even without changing the
default rules.  What if you spawned all port netdevs (I dislike the
term representor, sorry, it's confusing people) in down state and then
refuse to bring them up unless user instantiated a bridge that would
behave in a way that your HW can support?  If ports are down you won't
have fallback traffic so no problem to solve.

^ permalink raw reply

* Re: [PATCH net] mlx4_core: restore optimal ICM memory allocation
From: Qing Huang @ 2018-05-30 23:03 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: David Miller, netdev, Eric Dumazet, John Sperbeck, Tarick Bedeir,
	Daniel Jurgens, Zhu Yanjun, Tariq Toukan, linux-rdma,
	Santosh Shilimkar
In-Reply-To: <CANn89i+K3LS7+idPBOkcaBPBdBwz9tWCLqkzbWZA10W2mmR9hA@mail.gmail.com>



On 5/30/2018 2:30 PM, Eric Dumazet wrote:
> On Wed, May 30, 2018 at 5:08 PM Qing Huang<qing.huang@oracle.com>  wrote:
>>
>> On 5/30/2018 1:50 PM, Eric Dumazet wrote:
>>> On Wed, May 30, 2018 at 4:30 PM Qing Huang<qing.huang@oracle.com>  wrote:
>>>> On 5/29/2018 9:11 PM, Eric Dumazet wrote:
>>>>> Commit 1383cb8103bb ("mlx4_core: allocate ICM memory in page size chunks")
>>>>> brought a regression caught in our regression suite, thanks to KASAN.
>>>> If KASAN reported issue was really caused by smaller chunk sizes,
>>>> changing allocation
>>>> order dynamically will eventually hit the same issue.
>>> Sigh, you have little idea of what your patch really did...
>>>
>>> The KASAN part only shows the tip of the iceberg, but our main concern
>>> is an increase of memory overhead.
>> Well, the commit log only mentioned KASAN and but the change here didn't
>> seem to solve
>> the issue.
> Can you elaborate ?
>
> My patch solves our problems.
>
> Both the memory overhead and KASAN splats are gone.

If KASAN issue was triggered by using smaller chunks, when under memory 
pressure with lots of fragments,
low order memory allocation will do the similar things. So perhaps in 
your test env, memory allocation
and usage is relatively static, that's probably why using larger chunks 
didn't really utilize low order
allocation code path hence no KASAN issue was spotted.

Smaller chunk size in the mlx4 driver is not supposed to cause any 
memory corruption. We will probably
need to continue to investigate this. Can you provide the test command 
that trigger this issue when running
KASAN kernel so we can try to reproduce it in our lab? It could be that 
upstream code is missing some other
fixes.


>>> Alternative is to revert your patch, since we are now very late in 4.17 cycle.
>>>
>>> Memory usage has grown a lot with your patch, since each 4KB page needs a full
>>> struct mlx4_icm_chunk (256 bytes of overhead !)
>> Going to smaller chunks will have some overhead. It depends on the
>> application though.
>> What's the total increased memory consumption in your env?
> As I explained, your patch adds 256 bytes of overhead per 4KB.
>
> Your changelog did not mentioned that at all, and we discovered this
> the hard way.

If you have some concern regarding memory usage, you should bring this 
up during code review.

Repeated failure and retry for lower order allocations could be bad for 
latency too. This wasn't
mentioned in this commit either.

Like I said, how much overhead really depends on the application. 256 
bytes x chunks may not be
significant on a server with lots of memory.

> That is pretty intolerable, and is a blocker for us, memory is precious.

> --
> To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
> the body of a message tomajordomo@vger.kernel.org
> More majordomo info athttp://vger.kernel.org/majordomo-info.html

^ permalink raw reply

* [PATCH bpf-next] xsk: temporarily disable AF_XDP
From: Björn Töpel @ 2018-05-31  0:17 UTC (permalink / raw)
  To: ast, daniel, netdev
  Cc: Björn Töpel, magnus.karlsson, magnus.karlsson

From: Björn Töpel <bjorn.topel@intel.com>

Temporarily disable AF_XDP sockets, and hide uapi.

Signed-off-by: Björn Töpel <bjorn.topel@intel.com>
---
 include/{uapi => }/linux/if_xdp.h | 0
 net/xdp/Kconfig                   | 2 +-
 2 files changed, 1 insertion(+), 1 deletion(-)
 rename include/{uapi => }/linux/if_xdp.h (100%)

diff --git a/include/uapi/linux/if_xdp.h b/include/linux/if_xdp.h
similarity index 100%
rename from include/uapi/linux/if_xdp.h
rename to include/linux/if_xdp.h
diff --git a/net/xdp/Kconfig b/net/xdp/Kconfig
index 90e4a7152854..d845606dae7b 100644
--- a/net/xdp/Kconfig
+++ b/net/xdp/Kconfig
@@ -1,5 +1,5 @@
 config XDP_SOCKETS
-	bool "XDP sockets"
+	bool "XDP sockets" if n
 	depends on BPF_SYSCALL
 	default n
 	help
-- 
2.14.1

^ permalink raw reply related

* Re: [PATCH net-next] net: qcom/emac: fix unused variable
From: YueHaibing @ 2018-05-31  1:34 UTC (permalink / raw)
  To: Timur Tabi, davem; +Cc: netdev, linux-kernel
In-Reply-To: <0db547d4-27c1-a9f7-f443-86bebd831cb2@codeaurora.org>


On 2018/5/30 20:10, Timur Tabi wrote:
> On 5/29/18 5:43 AM, YueHaibing wrote:
>> When CONFIG_ACPI isn't set, variable qdf2400_ops/qdf2432_ops isn't used.
>> drivers/net/ethernet/qualcomm/emac/emac-sgmii.c:284:25: warning: ‘qdf2400_ops’ defined but not used [-Wunused-variable]
>>   static struct sgmii_ops qdf2400_ops = {
>>                           ^~~~~~~~~~~
>> drivers/net/ethernet/qualcomm/emac/emac-sgmii.c:276:25: warning: ‘qdf2432_ops’ defined but not used [-Wunused-variable]
>>   static struct sgmii_ops qdf2432_ops = {
>>                           ^~~~~~~~~~~
>>
>> Move the declaration and functions inside the CONFIG_ACPI ifdef
>> to fix the warning.
>> Signed-off-by: YueHaibing<yuehaibing@huawei.com>
> 
> I already fixed this with:
> 
> https://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next.git/commit/?id=d377df784178bf5b0a39e75dc8b1ee86e1abb3f6
>

Oh,I should notice this, thanks.

^ permalink raw reply

* Re: [PATCH] [net-next, wrong] make BPFILTER_UMH depend on X86
From: Masahiro Yamada @ 2018-05-31  1:42 UTC (permalink / raw)
  To: Alexei Starovoitov
  Cc: Arnd Bergmann, David S. Miller, Alexei Starovoitov,
	Linux Kbuild mailing list, netdev, Linux Kernel Mailing List
In-Reply-To: <20180530151736.nzpde2bgzn4koi7f@ast-mbp>

2018-05-31 0:17 GMT+09:00 Alexei Starovoitov <alexei.starovoitov@gmail.com>:
> On Mon, May 28, 2018 at 05:31:01PM +0200, Arnd Bergmann wrote:
>> When build testing across architectures, I run into a build error on
>> all targets other than X86:
>>
>> gcc-8.1.0-nolibc/arm-linux-gnueabi/bin/arm-linux-gnueabi-objdump: net/bpfilter/bpfilter_umh: File format not recognized
>> gcc-8.1.0-nolibc/arm-linux-gnueabi/bin/arm-linux-gnueabi-objcopy:net/bpfilter/bpfilter_umh.o: Invalid bfd target
>>
>> The problem is that 'hostprogs' get built with 'gcc' rather than
>> '$(CROSS_COMPILE)gcc', and my default gcc (as most people's) targets x86.
>>
>> To work around it, adding an X86 dependency gets randconfigs building
>> again on my box.
>>
>> Clearly, this is not a good solution, since it should actually work fine
>> when building native kernels on other architectures but that is now
>> disabled, while cross building an x86 kernel on another host is still
>> broken after my patch.
>>
>> What we probably want here is to try out if the compiler is able to build
>> executables for the target architecture and not build the helper otherwise,
>> at least when compile-testing. No idea how to do that though.
>>
>> Link: http://www.kernel.org/pub/tools/crosstool/
>> Cc: Masahiro Yamada <yamada.masahiro@socionext.com>
>> Cc: linux-kbuild@vger.kernel.org
>> Signed-off-by: Arnd Bergmann <arnd@arndb.de>
>> ---
>>  net/bpfilter/Kconfig | 1 +
>>  1 file changed, 1 insertion(+)
>>
>> diff --git a/net/bpfilter/Kconfig b/net/bpfilter/Kconfig
>> index 60725c5f79db..61cc4fcbb4d0 100644
>> --- a/net/bpfilter/Kconfig
>> +++ b/net/bpfilter/Kconfig
>> @@ -9,6 +9,7 @@ menuconfig BPFILTER
>>  if BPFILTER
>>  config BPFILTER_UMH
>>       tristate "bpfilter kernel module with user mode helper"
>> +     depends on X86 # actually depends on native builds
>
> depends on X86 will break it on arm.
> I think the better short term fix would be to test that HOSTCC == CC
> It doesn't have to be the same compiler. HOSTCC's arch == kernel ARCH
> Not sure how to hack makefile to do that.
> Long term we need to get rid of HOSTCC dependency.
> --
> To unsubscribe from this list: send the line "unsubscribe linux-kbuild" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html

Hmm.
For cross-compiling, we set 'ARCH' via the environment variable or the
command line.

ARCH is not explicitly set, the top-level Makefile sets it to $(SUBARCH)


ARCH ?= $(SUBARCH)


Maybe, we can assume the native build if $(ARCH) and $(SUBARCH) are the same?


-- 
Best Regards
Masahiro Yamada

^ permalink raw reply


This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox