Netdev List
 help / color / mirror / Atom feed
* Re: [PATCH] atm: zatm: Fix potential Spectre v1
From: Randy Dunlap @ 2018-05-03 19:09 UTC (permalink / raw)
  To: Gustavo A. R. Silva, Chas Williams, David S. Miller
  Cc: linux-atm-general, netdev, linux-kernel
In-Reply-To: <20180503181712.GA7443@embeddedor.com>

On 05/03/2018 11:17 AM, Gustavo A. R. Silva wrote:
> pool can be indirectly controlled by user-space, hence leading to
> a potential exploitation of the Spectre variant 1 vulnerability.
> 
> This issue was detected with the help of Smatch:
> 
> drivers/atm/zatm.c:1462 zatm_ioctl() warn: potential spectre issue
> 'zatm_dev->pool_info' (local cap)
> 
> Fix this by sanitizing pool before using it to index
> zatm_dev->pool_info
> 
> Notice that given that speculation windows are large, the policy is
> to kill the speculation on the first load and not worry if it can be
> completed with a dependent load/store [1].
> 
> [1] https://marc.info/?l=linux-kernel&m=152449131114778&w=2

Hi,

Just for (my) info:  all of these types of patches are to prevent
what is loaded in cache when the index is out of range, right?
Not some random pool_info[random], but pool_info[valid, i.e., 0].

Since the value of pool is already sanity checked and -EINVAL is
returned when it's out of range.

Thanks.

> Cc: stable@vger.kernel.org
> Signed-off-by: Gustavo A. R. Silva <gustavo@embeddedor.com>
> ---
>  drivers/atm/zatm.c | 3 +++
>  1 file changed, 3 insertions(+)
> 
> diff --git a/drivers/atm/zatm.c b/drivers/atm/zatm.c
> index 1ef67db..9c9a229 100644
> --- a/drivers/atm/zatm.c
> +++ b/drivers/atm/zatm.c
> @@ -28,6 +28,7 @@
>  #include <asm/io.h>
>  #include <linux/atomic.h>
>  #include <linux/uaccess.h>
> +#include <linux/nospec.h>
>  
>  #include "uPD98401.h"
>  #include "uPD98402.h"
> @@ -1458,6 +1459,8 @@ static int zatm_ioctl(struct atm_dev *dev,unsigned int cmd,void __user *arg)
>  					return -EFAULT;
>  				if (pool < 0 || pool > ZATM_LAST_POOL)
>  					return -EINVAL;
> +				pool = array_index_nospec(pool,
> +							  ZATM_LAST_POOL + 1);
>  				spin_lock_irqsave(&zatm_dev->lock, flags);
>  				info = zatm_dev->pool_info[pool];
>  				if (cmd == ZATM_GETPOOLZ) {
> 


-- 
~Randy

^ permalink raw reply

* Re: [PATCH net 0/3] net/smc: fixes 2018/05/03
From: David Miller @ 2018-05-03 18:47 UTC (permalink / raw)
  To: ubraun; +Cc: netdev, linux-s390, schwidefsky, heiko.carstens, raspl
In-Reply-To: <20180503155739.55616-1-ubraun@linux.ibm.com>

From: Ursula Braun <ubraun@linux.ibm.com>
Date: Thu,  3 May 2018 17:57:36 +0200

> From: Ursula Braun <ursula.braun@de.ibm.com>
> 
> Dave,
> 
> here are smc fixes for 2 problems:
>  * receive buffers in SMC must be registered. If registration fails
>    these buffers must not be kept within the link group for reuse.
>    Patch 1 is a preparational patch; patch 2 contains the fix.
>  * sendpage: do not hold the sock lock when calling kernel_sendpage()
>              or sock_no_sendpage()

Series applied, thanks.

^ permalink raw reply

* Re: [PATCH] qed: fix spelling mistake: "offloded" -> "offloaded"
From: David Miller @ 2018-05-03 18:46 UTC (permalink / raw)
  To: colin.king
  Cc: Ariel.Elior, everest-linux-l2, netdev, kernel-janitors,
	linux-kernel
In-Reply-To: <20180503151932.22458-1-colin.king@canonical.com>

From: Colin King <colin.king@canonical.com>
Date: Thu,  3 May 2018 16:19:32 +0100

> From: Colin Ian King <colin.king@canonical.com>
> 
> Trivial fix to spelling mistake in DP_NOTICE message
> 
> Signed-off-by: Colin Ian King <colin.king@canonical.com>

Applied.

^ permalink raw reply

* [PATCH] net: atm: Fix potential Spectre v1
From: Gustavo A. R. Silva @ 2018-05-03 18:45 UTC (permalink / raw)
  To: David S. Miller; +Cc: netdev, linux-kernel, Gustavo A. R. Silva

ioc_data.dev_num can be controlled by user-space, hence leading to
a potential exploitation of the Spectre variant 1 vulnerability.

This issue was detected with the help of Smatch:
net/atm/lec.c:702 lec_vcc_attach() warn: potential spectre issue
'dev_lec'

Fix this by sanitizing ioc_data.dev_num before using it to index
dev_lec. Also, notice that there is another instance in which array
dev_lec is being indexed using ioc_data.dev_num at line 705:
lec_vcc_added(netdev_priv(dev_lec[ioc_data.dev_num]),

Notice that given that speculation windows are large, the policy is
to kill the speculation on the first load and not worry if it can be
completed with a dependent load/store [1].

[1] https://marc.info/?l=linux-kernel&m=152449131114778&w=2

Cc: stable@vger.kernel.org
Signed-off-by: Gustavo A. R. Silva <gustavo@embeddedor.com>
---
 net/atm/lec.c | 9 +++++++--
 1 file changed, 7 insertions(+), 2 deletions(-)

diff --git a/net/atm/lec.c b/net/atm/lec.c
index 01d5d20..3138a86 100644
--- a/net/atm/lec.c
+++ b/net/atm/lec.c
@@ -41,6 +41,9 @@ static unsigned char bridge_ula_lec[] = { 0x01, 0x80, 0xc2, 0x00, 0x00 };
 #include <linux/module.h>
 #include <linux/init.h>
 
+/* Hardening for Spectre-v1 */
+#include <linux/nospec.h>
+
 #include "lec.h"
 #include "lec_arpc.h"
 #include "resources.h"
@@ -687,8 +690,10 @@ static int lec_vcc_attach(struct atm_vcc *vcc, void __user *arg)
 	bytes_left = copy_from_user(&ioc_data, arg, sizeof(struct atmlec_ioc));
 	if (bytes_left != 0)
 		pr_info("copy from user failed for %d bytes\n", bytes_left);
-	if (ioc_data.dev_num < 0 || ioc_data.dev_num >= MAX_LEC_ITF ||
-	    !dev_lec[ioc_data.dev_num])
+	if (ioc_data.dev_num < 0 || ioc_data.dev_num >= MAX_LEC_ITF)
+		return -EINVAL;
+	ioc_data.dev_num = array_index_nospec(ioc_data.dev_num, MAX_LEC_ITF);
+	if (!dev_lec[ioc_data.dev_num])
 		return -EINVAL;
 	vpriv = kmalloc(sizeof(struct lec_vcc_priv), GFP_KERNEL);
 	if (!vpriv)
-- 
2.7.4

^ permalink raw reply related

* Re: [PATCH] net/mlx5: fix spelling mistake: "modfiy" -> "modify"
From: David Miller @ 2018-05-03 18:44 UTC (permalink / raw)
  To: colin.king
  Cc: saeedm, matanb, leon, netdev, linux-rdma, kernel-janitors,
	linux-kernel
In-Reply-To: <20180503133503.14508-1-colin.king@canonical.com>

From: Colin King <colin.king@canonical.com>
Date: Thu,  3 May 2018 14:35:03 +0100

> From: Colin Ian King <colin.king@canonical.com>
> 
> Trivial fix to spelling mistake in netdev_warn warning message
> 
> Signed-off-by: Colin Ian King <colin.king@canonical.com>

Saeed, please send this to me in your next pull request.

Thanks.

^ permalink raw reply

* KMSAN: uninit-value in strcmp
From: syzbot @ 2018-05-03 18:44 UTC (permalink / raw)
  To: davem, jon.maloy, linux-kernel, netdev, syzkaller-bugs,
	tipc-discussion, ying.xue

Hello,

syzbot found the following crash on:

HEAD commit:    d2d741e5d189 kmsan: add initialization for shmem pages
git tree:       https://github.com/google/kmsan.git/master
console output: https://syzkaller.appspot.com/x/log.txt?x=10051497800000
kernel config:  https://syzkaller.appspot.com/x/.config?x=48f9de3384bcd0f
dashboard link: https://syzkaller.appspot.com/bug?extid=df0257c92ffd4fcc58cd
compiler:       clang version 7.0.0 (trunk 329391)
syzkaller repro:https://syzkaller.appspot.com/x/repro.syz?x=11275657800000
C reproducer:   https://syzkaller.appspot.com/x/repro.c?x=17c3d5e7800000

IMPORTANT: if you fix the bug, please add the following tag to the commit:
Reported-by: syzbot+df0257c92ffd4fcc58cd@syzkaller.appspotmail.com

==================================================================
BUG: KMSAN: uninit-value in strcmp+0xf7/0x160 lib/string.c:329
CPU: 1 PID: 4527 Comm: syz-executor655 Not tainted 4.16.0+ #87
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS  
Google 01/01/2011
Call Trace:
  __dump_stack lib/dump_stack.c:17 [inline]
  dump_stack+0x185/0x1d0 lib/dump_stack.c:53
  kmsan_report+0x142/0x240 mm/kmsan/kmsan.c:1067
  __msan_warning_32+0x6c/0xb0 mm/kmsan/kmsan_instr.c:683
  strcmp+0xf7/0x160 lib/string.c:329
  tipc_nl_node_get_link+0x220/0x6f0 net/tipc/node.c:1881
  genl_family_rcv_msg net/netlink/genetlink.c:599 [inline]
  genl_rcv_msg+0x1686/0x1810 net/netlink/genetlink.c:624
  netlink_rcv_skb+0x378/0x600 net/netlink/af_netlink.c:2447
  genl_rcv+0x63/0x80 net/netlink/genetlink.c:635
  netlink_unicast_kernel net/netlink/af_netlink.c:1311 [inline]
  netlink_unicast+0x166b/0x1740 net/netlink/af_netlink.c:1337
  netlink_sendmsg+0x1048/0x1310 net/netlink/af_netlink.c:1900
  sock_sendmsg_nosec net/socket.c:630 [inline]
  sock_sendmsg net/socket.c:640 [inline]
  ___sys_sendmsg+0xec0/0x1310 net/socket.c:2046
  __sys_sendmsg net/socket.c:2080 [inline]
  SYSC_sendmsg+0x2a3/0x3d0 net/socket.c:2091
  SyS_sendmsg+0x54/0x80 net/socket.c:2087
  do_syscall_64+0x309/0x430 arch/x86/entry/common.c:287
  entry_SYSCALL_64_after_hwframe+0x3d/0xa2
RIP: 0033:0x445589
RSP: 002b:00007fb7ee66cdb8 EFLAGS: 00000246 ORIG_RAX: 000000000000002e
RAX: ffffffffffffffda RBX: 00000000006dac24 RCX: 0000000000445589
RDX: 0000000000000000 RSI: 0000000020023000 RDI: 0000000000000003
RBP: 00000000006dac20 R08: 0000000000000000 R09: 0000000000000000
R10: 0000000000000000 R11: 0000000000000246 R12: 0000000000000000
R13: 00007fffa2bf3f3f R14: 00007fb7ee66d9c0 R15: 0000000000000001

Uninit was created at:
  kmsan_save_stack_with_flags mm/kmsan/kmsan.c:278 [inline]
  kmsan_internal_poison_shadow+0xb8/0x1b0 mm/kmsan/kmsan.c:188
  kmsan_kmalloc+0x94/0x100 mm/kmsan/kmsan.c:314
  kmsan_slab_alloc+0x11/0x20 mm/kmsan/kmsan.c:321
  slab_post_alloc_hook mm/slab.h:445 [inline]
  slab_alloc_node mm/slub.c:2737 [inline]
  __kmalloc_node_track_caller+0xaed/0x11c0 mm/slub.c:4369
  __kmalloc_reserve net/core/skbuff.c:138 [inline]
  __alloc_skb+0x2cf/0x9f0 net/core/skbuff.c:206
  alloc_skb include/linux/skbuff.h:984 [inline]
  netlink_alloc_large_skb net/netlink/af_netlink.c:1183 [inline]
  netlink_sendmsg+0x9a6/0x1310 net/netlink/af_netlink.c:1875
  sock_sendmsg_nosec net/socket.c:630 [inline]
  sock_sendmsg net/socket.c:640 [inline]
  ___sys_sendmsg+0xec0/0x1310 net/socket.c:2046
  __sys_sendmsg net/socket.c:2080 [inline]
  SYSC_sendmsg+0x2a3/0x3d0 net/socket.c:2091
  SyS_sendmsg+0x54/0x80 net/socket.c:2087
  do_syscall_64+0x309/0x430 arch/x86/entry/common.c:287
  entry_SYSCALL_64_after_hwframe+0x3d/0xa2
==================================================================


---
This bug is generated by a bot. It may contain errors.
See https://goo.gl/tpsmEJ for more information about syzbot.
syzbot engineers can be reached at syzkaller@googlegroups.com.

syzbot will keep track of this bug report.
If you forgot to add the Reported-by tag, once the fix for this bug is  
merged
into any tree, please reply to this email with:
#syz fix: exact-commit-title
If you want to test a patch for this bug, please reply with:
#syz test: git://repo/address.git branch
and provide the patch inline or as an attachment.
To mark this as a duplicate of another syzbot report, please reply with:
#syz dup: exact-subject-of-another-report
If it's a one-off invalid bug report, please reply with:
#syz invalid
Note: if the crash happens again, it will cause creation of a new bug  
report.
Note: all commands must start from beginning of the line in the email body.

^ permalink raw reply

* Re: [PATCH net-next v2 00/15] ARM: sun8i: r40: Add Ethernet support
From: David Miller @ 2018-05-03 18:40 UTC (permalink / raw)
  To: maxime.ripard
  Cc: wens, mturquette, sboyd, peppe.cavallaro, robh+dt, mark.rutland,
	broonie, linux-arm-kernel, linux-clk, devicetree, netdev,
	clabbe.montjoie, icenowy
In-Reply-To: <20180503131257.rlqxetafejikmnji@flea>

From: Maxime Ripard <maxime.ripard@bootlin.com>
Date: Thu, 3 May 2018 15:12:57 +0200

> Hi Dave,
> 
> On Wed, May 02, 2018 at 11:06:17AM -0400, David Miller wrote:
>> From: Chen-Yu Tsai <wens@csie.org>
>> Date: Wed, 2 May 2018 00:33:45 +0800
>> 
>> > I should've mentioned that patches 3 ~ 10, and only these, should go
>> > through net-next. sunxi will handle the remaining clk, device tree, and
>> > soc driver patches.
>> 
>> Ok, I just noticed this.
>> 
>> Why don't you just post those patches separately as a series on their
>> own then, in order to avoid confusion?
>> 
>> Then you can adjust the patch series header posting to explain the
>> non-net-next changes, where they got merged, and what they provide
>> in order to faciliate the net-next changes.
> 
> I now that we usually have some feedback from non-net maintainers that
> they actually prefer seeing the full picture (and I also tend to
> prefer that as well) and having all the patches relevant to enable a
> particular feature, even if it means getting multiple maintainers
> involved.
> 
> Just to make sure we understood you fully, do you want Chen-Yu to
> resend his serie following your comments, or was that just a general
> remark for next time?

Yeah, good questions.

I think it can be argued either way.  For review having the complete
context is important.

But from a maintainer's standpoint, when there is any ambiguity
whatsoever about what patches go into this tree or that, it is really
frowned upon and is quite error prone.

Also, that header posting is _SO_ important.  It explains the series.
But for these 'partial apply' situations the header posting refers
to patches not in the series.

This looks terrible in the logs, when, as I do, the header posting
text is added to a marge commit for the series.  People will read it
and say "where are all of these other changes mentioned in the text?
was this series misapplied?"

That's why, maybe after the review is successful, I want the actual
patch series standalone with appropriately updated header posting
text.

^ permalink raw reply

* [PATCH] atm: zatm: Fix potential Spectre v1
From: Gustavo A. R. Silva @ 2018-05-03 18:17 UTC (permalink / raw)
  To: Chas Williams, David S. Miller
  Cc: linux-atm-general, netdev, linux-kernel, Gustavo A. R. Silva

pool can be indirectly controlled by user-space, hence leading to
a potential exploitation of the Spectre variant 1 vulnerability.

This issue was detected with the help of Smatch:

drivers/atm/zatm.c:1462 zatm_ioctl() warn: potential spectre issue
'zatm_dev->pool_info' (local cap)

Fix this by sanitizing pool before using it to index
zatm_dev->pool_info

Notice that given that speculation windows are large, the policy is
to kill the speculation on the first load and not worry if it can be
completed with a dependent load/store [1].

[1] https://marc.info/?l=linux-kernel&m=152449131114778&w=2

Cc: stable@vger.kernel.org
Signed-off-by: Gustavo A. R. Silva <gustavo@embeddedor.com>
---
 drivers/atm/zatm.c | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/drivers/atm/zatm.c b/drivers/atm/zatm.c
index 1ef67db..9c9a229 100644
--- a/drivers/atm/zatm.c
+++ b/drivers/atm/zatm.c
@@ -28,6 +28,7 @@
 #include <asm/io.h>
 #include <linux/atomic.h>
 #include <linux/uaccess.h>
+#include <linux/nospec.h>
 
 #include "uPD98401.h"
 #include "uPD98402.h"
@@ -1458,6 +1459,8 @@ static int zatm_ioctl(struct atm_dev *dev,unsigned int cmd,void __user *arg)
 					return -EFAULT;
 				if (pool < 0 || pool > ZATM_LAST_POOL)
 					return -EINVAL;
+				pool = array_index_nospec(pool,
+							  ZATM_LAST_POOL + 1);
 				spin_lock_irqsave(&zatm_dev->lock, flags);
 				info = zatm_dev->pool_info[pool];
 				if (cmd == ZATM_GETPOOLZ) {
-- 
2.7.4

^ permalink raw reply related

* [PATCH rdma-next] MAINTAINERS: Remove bouncing @mellanox.com addresses
From: Leon Romanovsky @ 2018-05-03 18:37 UTC (permalink / raw)
  To: Doug Ledford, Jason Gunthorpe; +Cc: Leon Romanovsky, RDMA mailing list, netdev

From: Leon Romanovsky <leonro@mellanox.com>

Delete non-existent @mellanox.com addresses from MAINTAINERS file.

Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
---
 MAINTAINERS | 6 ------
 1 file changed, 6 deletions(-)

diff --git a/MAINTAINERS b/MAINTAINERS
index 0a1410d5a621..1b52c7e7fc7d 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -5391,7 +5391,6 @@ S:	Maintained
 F:	drivers/iommu/exynos-iommu.c

 EZchip NPS platform support
-M:	Elad Kanfi <eladkan@mellanox.com>
 M:	Vineet Gupta <vgupta@synopsys.com>
 S:	Supported
 F:	arch/arc/plat-eznps
@@ -9012,7 +9011,6 @@ Q:	http://patchwork.ozlabs.org/project/netdev/list/
 F:	drivers/net/ethernet/mellanox/mlx5/core/en_*

 MELLANOX ETHERNET INNOVA DRIVER
-M:	Ilan Tayari <ilant@mellanox.com>
 R:	Boris Pismenny <borisp@mellanox.com>
 L:	netdev@vger.kernel.org
 S:	Supported
@@ -9022,7 +9020,6 @@ F:	drivers/net/ethernet/mellanox/mlx5/core/fpga/*
 F:	include/linux/mlx5/mlx5_ifc_fpga.h

 MELLANOX ETHERNET INNOVA IPSEC DRIVER
-M:	Ilan Tayari <ilant@mellanox.com>
 R:	Boris Pismenny <borisp@mellanox.com>
 L:	netdev@vger.kernel.org
 S:	Supported
@@ -9078,7 +9075,6 @@ F:	include/uapi/rdma/mlx4-abi.h

 MELLANOX MLX5 core VPI driver
 M:	Saeed Mahameed <saeedm@mellanox.com>
-M:	Matan Barak <matanb@mellanox.com>
 M:	Leon Romanovsky <leonro@mellanox.com>
 L:	netdev@vger.kernel.org
 L:	linux-rdma@vger.kernel.org
@@ -9089,7 +9085,6 @@ F:	drivers/net/ethernet/mellanox/mlx5/core/
 F:	include/linux/mlx5/

 MELLANOX MLX5 IB driver
-M:	Matan Barak <matanb@mellanox.com>
 M:	Leon Romanovsky <leonro@mellanox.com>
 L:	linux-rdma@vger.kernel.org
 W:	http://www.mellanox.com
@@ -9821,7 +9816,6 @@ F:	net/netfilter/xt_CONNSECMARK.c
 F:	net/netfilter/xt_SECMARK.c

 NETWORKING [TLS]
-M:	Ilya Lesokhin <ilyal@mellanox.com>
 M:	Aviad Yehezkel <aviadye@mellanox.com>
 M:	Dave Watson <davejwatson@fb.com>
 L:	netdev@vger.kernel.org

^ permalink raw reply related

* Re: [PATCH net-next] net: stmmac: Add support for U32 TC filter using Flexible RX Parser
From: David Miller @ 2018-05-03 18:35 UTC (permalink / raw)
  To: Jose.Abreu
  Cc: netdev, Joao.Pinto, Vitor.Soares, peppe.cavallaro,
	alexandre.torgue
In-Reply-To: <9f57f98ef35360619001882fdcf0d7ef0558863f.1525351447.git.joabreu@synopsys.com>

From: Jose Abreu <Jose.Abreu@synopsys.com>
Date: Thu,  3 May 2018 13:45:30 +0100

> +static int dwmac5_rxp_update_single_entry(void __iomem *ioaddr,
> +		struct stmmac_tc_entry *entry, int pos)

Please follow the Linux networking coding style for function arguments
in function declarations and definitions.

Each second and subsequent line of the declaration/definition shall start
precisely at the column after the openning parenthesis of the first line.
Use the appropriate number of TAB then SPACE characters necessary to
achieve this.

Otherwise, this patch looks great.

Thanks.

^ permalink raw reply

* Re: [PATCH bpf-next v4 0/4] Hash support for sock
From: David Miller @ 2018-05-03 18:31 UTC (permalink / raw)
  To: john.fastabend; +Cc: borkmann, ast, netdev
In-Reply-To: <1525372108-8690-1-git-send-email-john.fastabend@gmail.com>

From: John Fastabend <john.fastabend@gmail.com>
Date: Thu,  3 May 2018 11:28:24 -0700

> In the original sockmap implementation we got away with using an
> array similar to devmap. However, unlike devmap where an ifindex
> has a nice 1:1 function into the map we have found some use cases
> with sockets that need to be referenced using longer keys.
> 
> This series adds support for a sockhash map reusing as much of
> the sockmap code as possible. I made the decision to add sockhash
> specific helpers vs trying to generalize the existing helpers
> because (a) they have sockmap in the name and (b) the keys are
> different types. I prefer to be explicit here rather than play
> type games or do something else tricky.
> 
> To test this we duplicate all the sockmap testing except swap out
> the sockmap with a sockhash.
> 
> v2: fix file stats and add v2 tag
> v3: move tool updates into test patch, move bpftool updates into
>     its own patch, and fixup the test patch stats to catch the
>     renamed file and provide only diffs +/- on that.
> v4: Add documentation to UAPI bpf.h

Acked-by: David S. Miller <davem@davemloft.net>

^ permalink raw reply

* [PATCH bpf-next v4 4/4] bpf: bpftool, support for sockhash
From: John Fastabend @ 2018-05-03 18:28 UTC (permalink / raw)
  To: borkmann, ast; +Cc: netdev, John Fastabend
In-Reply-To: <1525372108-8690-1-git-send-email-john.fastabend@gmail.com>

This adds the SOCKHASH map type to bpftools so that we get correct
pretty printing.

Signed-off-by: John Fastabend <john.fastabend@gmail.com>
---
 tools/bpf/bpftool/map.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/tools/bpf/bpftool/map.c b/tools/bpf/bpftool/map.c
index a6cdb64..4420b1a 100644
--- a/tools/bpf/bpftool/map.c
+++ b/tools/bpf/bpftool/map.c
@@ -67,6 +67,7 @@
 	[BPF_MAP_TYPE_DEVMAP]		= "devmap",
 	[BPF_MAP_TYPE_SOCKMAP]		= "sockmap",
 	[BPF_MAP_TYPE_CPUMAP]		= "cpumap",
+	[BPF_MAP_TYPE_SOCKHASH]		= "sockhash",
 };
 
 static unsigned int get_possible_cpus(void)
-- 
1.9.1

^ permalink raw reply related

* [PATCH bpf-next v4 3/4] bpf: selftest additions for SOCKHASH
From: John Fastabend @ 2018-05-03 18:28 UTC (permalink / raw)
  To: borkmann, ast; +Cc: netdev, John Fastabend
In-Reply-To: <1525372108-8690-1-git-send-email-john.fastabend@gmail.com>

This runs existing SOCKMAP tests with SOCKHASH map type. To do this
we push programs into include file and build two BPF programs. One
for SOCKHASH and one for SOCKMAP.

We then run the entire test suite with each type.

Signed-off-by: John Fastabend <john.fastabend@gmail.com>
---
 tools/include/uapi/linux/bpf.h                     |  6 ++++-
 tools/testing/selftests/bpf/Makefile               |  3 ++-
 tools/testing/selftests/bpf/test_sockhash_kern.c   |  4 ++++
 tools/testing/selftests/bpf/test_sockmap.c         | 27 ++++++++++++++++------
 .../{test_sockmap_kern.c => test_sockmap_kern.h}   | 10 ++++----
 5 files changed, 36 insertions(+), 14 deletions(-)
 create mode 100644 tools/testing/selftests/bpf/test_sockhash_kern.c
 rename tools/testing/selftests/bpf/{test_sockmap_kern.c => test_sockmap_kern.h} (97%)

diff --git a/tools/include/uapi/linux/bpf.h b/tools/include/uapi/linux/bpf.h
index da77a93..5cb983d 100644
--- a/tools/include/uapi/linux/bpf.h
+++ b/tools/include/uapi/linux/bpf.h
@@ -116,6 +116,7 @@ enum bpf_map_type {
 	BPF_MAP_TYPE_DEVMAP,
 	BPF_MAP_TYPE_SOCKMAP,
 	BPF_MAP_TYPE_CPUMAP,
+	BPF_MAP_TYPE_SOCKHASH,
 };
 
 enum bpf_prog_type {
@@ -1835,7 +1836,10 @@ struct bpf_stack_build_id {
 	FN(msg_pull_data),		\
 	FN(bind),			\
 	FN(xdp_adjust_tail),		\
-	FN(skb_get_xfrm_state),
+	FN(skb_get_xfrm_state),		\
+	FN(sock_hash_update),		\
+	FN(msg_redirect_hash),		\
+	FN(sk_redirect_hash),
 
 /* integer value in 'imm' field of BPF_CALL instruction selects which helper
  * function eBPF program intends to call
diff --git a/tools/testing/selftests/bpf/Makefile b/tools/testing/selftests/bpf/Makefile
index b64a7a3..03f9bf3 100644
--- a/tools/testing/selftests/bpf/Makefile
+++ b/tools/testing/selftests/bpf/Makefile
@@ -32,7 +32,8 @@ TEST_GEN_FILES = test_pkt_access.o test_xdp.o test_l4lb.o test_tcp_estats.o test
 	test_l4lb_noinline.o test_xdp_noinline.o test_stacktrace_map.o \
 	sample_map_ret0.o test_tcpbpf_kern.o test_stacktrace_build_id.o \
 	sockmap_tcp_msg_prog.o connect4_prog.o connect6_prog.o test_adjust_tail.o \
-	test_btf_haskv.o test_btf_nokv.o test_sockmap_kern.o test_tunnel_kern.o
+	test_btf_haskv.o test_btf_nokv.o test_sockmap_kern.o test_tunnel_kern.o \
+	test_sockmap_kern.o test_sockhash_kern.o
 
 # Order correspond to 'make run_tests' order
 TEST_PROGS := test_kmod.sh \
diff --git a/tools/testing/selftests/bpf/test_sockhash_kern.c b/tools/testing/selftests/bpf/test_sockhash_kern.c
new file mode 100644
index 0000000..3bf4ad4
--- /dev/null
+++ b/tools/testing/selftests/bpf/test_sockhash_kern.c
@@ -0,0 +1,4 @@
+// SPDX-License-Identifier: GPL-2.0
+// Copyright (c) 2018 Covalent IO, Inc. http://covalent.io
+#define TEST_MAP_TYPE BPF_MAP_TYPE_SOCKHASH
+#include "./test_sockmap_kern.h"
diff --git a/tools/testing/selftests/bpf/test_sockmap.c b/tools/testing/selftests/bpf/test_sockmap.c
index 29c022d..df7afc7 100644
--- a/tools/testing/selftests/bpf/test_sockmap.c
+++ b/tools/testing/selftests/bpf/test_sockmap.c
@@ -47,7 +47,8 @@
 #define S1_PORT 10000
 #define S2_PORT 10001
 
-#define BPF_FILENAME "test_sockmap_kern.o"
+#define BPF_SOCKMAP_FILENAME "test_sockmap_kern.o"
+#define BPF_SOCKHASH_FILENAME "test_sockmap_kern.o"
 #define CG_PATH "/sockmap"
 
 /* global sockets */
@@ -1260,9 +1261,8 @@ static int test_start_end(int cgrp)
 	BPF_PROG_TYPE_SK_MSG,
 };
 
-static int populate_progs(void)
+static int populate_progs(char *bpf_file)
 {
-	char *bpf_file = BPF_FILENAME;
 	struct bpf_program *prog;
 	struct bpf_object *obj;
 	int i = 0;
@@ -1306,11 +1306,11 @@ static int populate_progs(void)
 	return 0;
 }
 
-static int test_suite(void)
+static int __test_suite(char *bpf_file)
 {
 	int cg_fd, err;
 
-	err = populate_progs();
+	err = populate_progs(bpf_file);
 	if (err < 0) {
 		fprintf(stderr, "ERROR: (%i) load bpf failed\n", err);
 		return err;
@@ -1347,17 +1347,30 @@ static int test_suite(void)
 
 out:
 	printf("Summary: %i PASSED %i FAILED\n", passed, failed);
+	cleanup_cgroup_environment();
 	close(cg_fd);
 	return err;
 }
 
+static int test_suite(void)
+{
+	int err;
+
+	err = __test_suite(BPF_SOCKMAP_FILENAME);
+	if (err)
+		goto out;
+	err = __test_suite(BPF_SOCKHASH_FILENAME);
+out:
+	return err;
+}
+
 int main(int argc, char **argv)
 {
 	struct rlimit r = {10 * 1024 * 1024, RLIM_INFINITY};
 	int iov_count = 1, length = 1024, rate = 1;
 	struct sockmap_options options = {0};
 	int opt, longindex, err, cg_fd = 0;
-	char *bpf_file = BPF_FILENAME;
+	char *bpf_file = BPF_SOCKMAP_FILENAME;
 	int test = PING_PONG;
 
 	if (setrlimit(RLIMIT_MEMLOCK, &r)) {
@@ -1438,7 +1451,7 @@ int main(int argc, char **argv)
 		return -1;
 	}
 
-	err = populate_progs();
+	err = populate_progs(bpf_file);
 	if (err) {
 		fprintf(stderr, "populate program: (%s) %s\n",
 			bpf_file, strerror(errno));
diff --git a/tools/testing/selftests/bpf/test_sockmap_kern.c b/tools/testing/selftests/bpf/test_sockmap_kern.h
similarity index 97%
rename from tools/testing/selftests/bpf/test_sockmap_kern.c
rename to tools/testing/selftests/bpf/test_sockmap_kern.h
index 33de97e..2cc57a1 100644
--- a/tools/testing/selftests/bpf/test_sockmap_kern.c
+++ b/tools/testing/selftests/bpf/test_sockmap_kern.h
@@ -1,5 +1,5 @@
-// SPDX-License-Identifier: GPL-2.0
-// Copyright (c) 2017-2018 Covalent IO, Inc. http://covalent.io
+/* SPDX-License-Identifier: GPL-2.0 */
+/* Copyright (c) 2017-2018 Covalent IO, Inc. http://covalent.io */
 #include <stddef.h>
 #include <string.h>
 #include <linux/bpf.h>
@@ -36,21 +36,21 @@
 })
 
 struct bpf_map_def SEC("maps") sock_map = {
-	.type = BPF_MAP_TYPE_SOCKMAP,
+	.type = TEST_MAP_TYPE,
 	.key_size = sizeof(int),
 	.value_size = sizeof(int),
 	.max_entries = 20,
 };
 
 struct bpf_map_def SEC("maps") sock_map_txmsg = {
-	.type = BPF_MAP_TYPE_SOCKMAP,
+	.type = TEST_MAP_TYPE,
 	.key_size = sizeof(int),
 	.value_size = sizeof(int),
 	.max_entries = 20,
 };
 
 struct bpf_map_def SEC("maps") sock_map_redir = {
-	.type = BPF_MAP_TYPE_SOCKMAP,
+	.type = TEST_MAP_TYPE,
 	.key_size = sizeof(int),
 	.value_size = sizeof(int),
 	.max_entries = 20,
-- 
1.9.1

^ permalink raw reply related

* [PATCH bpf-next v4 2/4] bpf: sockmap, add hash map support
From: John Fastabend @ 2018-05-03 18:28 UTC (permalink / raw)
  To: borkmann, ast; +Cc: netdev, John Fastabend
In-Reply-To: <1525372108-8690-1-git-send-email-john.fastabend@gmail.com>

Sockmap is currently backed by an array and enforces keys to be
four bytes. This works well for many use cases and was originally
modeled after devmap which also uses four bytes keys. However,
this has become limiting in larger use cases where a hash would
be more appropriate. For example users may want to use the 5-tuple
of the socket as the lookup key.

To support this add hash support.

Signed-off-by: John Fastabend <john.fastabend@gmail.com>
---
 include/linux/bpf.h       |   8 +
 include/linux/bpf_types.h |   1 +
 include/uapi/linux/bpf.h  |  53 ++++-
 kernel/bpf/core.c         |   1 +
 kernel/bpf/sockmap.c      | 494 ++++++++++++++++++++++++++++++++++++++++++++--
 kernel/bpf/verifier.c     |  14 +-
 net/core/filter.c         |  58 ++++++
 7 files changed, 611 insertions(+), 18 deletions(-)

diff --git a/include/linux/bpf.h b/include/linux/bpf.h
index 38ebbc6..add768a 100644
--- a/include/linux/bpf.h
+++ b/include/linux/bpf.h
@@ -661,6 +661,7 @@ static inline void bpf_map_offload_map_free(struct bpf_map *map)
 
 #if defined(CONFIG_STREAM_PARSER) && defined(CONFIG_BPF_SYSCALL) && defined(CONFIG_INET)
 struct sock  *__sock_map_lookup_elem(struct bpf_map *map, u32 key);
+struct sock  *__sock_hash_lookup_elem(struct bpf_map *map, void *key);
 int sock_map_prog(struct bpf_map *map, struct bpf_prog *prog, u32 type);
 #else
 static inline struct sock  *__sock_map_lookup_elem(struct bpf_map *map, u32 key)
@@ -668,6 +669,12 @@ static inline struct sock  *__sock_map_lookup_elem(struct bpf_map *map, u32 key)
 	return NULL;
 }
 
+static inline struct sock  *__sock_hash_lookup_elem(struct bpf_map *map,
+						    void *key)
+{
+	return NULL;
+}
+
 static inline int sock_map_prog(struct bpf_map *map,
 				struct bpf_prog *prog,
 				u32 type)
@@ -693,6 +700,7 @@ static inline int sock_map_prog(struct bpf_map *map,
 extern const struct bpf_func_proto bpf_skb_vlan_pop_proto;
 extern const struct bpf_func_proto bpf_get_stackid_proto;
 extern const struct bpf_func_proto bpf_sock_map_update_proto;
+extern const struct bpf_func_proto bpf_sock_hash_update_proto;
 
 /* Shared helpers among cBPF and eBPF. */
 void bpf_user_rnd_init_once(void);
diff --git a/include/linux/bpf_types.h b/include/linux/bpf_types.h
index 2b28fcf..3101118 100644
--- a/include/linux/bpf_types.h
+++ b/include/linux/bpf_types.h
@@ -47,6 +47,7 @@
 BPF_MAP_TYPE(BPF_MAP_TYPE_DEVMAP, dev_map_ops)
 #if defined(CONFIG_STREAM_PARSER) && defined(CONFIG_INET)
 BPF_MAP_TYPE(BPF_MAP_TYPE_SOCKMAP, sock_map_ops)
+BPF_MAP_TYPE(BPF_MAP_TYPE_SOCKHASH, sock_hash_ops)
 #endif
 BPF_MAP_TYPE(BPF_MAP_TYPE_CPUMAP, cpu_map_ops)
 #endif
diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
index da77a93..c2613c5 100644
--- a/include/uapi/linux/bpf.h
+++ b/include/uapi/linux/bpf.h
@@ -116,6 +116,7 @@ enum bpf_map_type {
 	BPF_MAP_TYPE_DEVMAP,
 	BPF_MAP_TYPE_SOCKMAP,
 	BPF_MAP_TYPE_CPUMAP,
+	BPF_MAP_TYPE_SOCKHASH,
 };
 
 enum bpf_prog_type {
@@ -1767,6 +1768,53 @@ struct bpf_stack_build_id {
  * 		**CONFIG_XFRM** configuration option.
  * 	Return
  * 		0 on success, or a negative error in case of failure.
+ *
+ * int bpf_sock_hash_update(struct bpf_sock_ops_kern *skops, struct bpf_map *map, void *key, u64 flags)
+ *	Description
+ *		Add an entry to, or update a sockhash *map* referencing sockets.
+ *		The *skops* is used as a new value for the entry associated to
+ *		*key*. *flags* is one of:
+ *
+ *		**BPF_NOEXIST**
+ *			The entry for *key* must not exist in the map.
+ *		**BPF_EXIST**
+ *			The entry for *key* must already exist in the map.
+ *		**BPF_ANY**
+ *			No condition on the existence of the entry for *key*.
+ *
+ *		If the *map* has eBPF programs (parser and verdict), those will
+ *		be inherited by the socket being added. If the socket is
+ *		already attached to eBPF programs, this results in an error.
+ *	Return
+ *		0 on success, or a negative error in case of failure.
+ *
+ * int bpf_msg_redirect_hash(struct sk_msg_buff *msg, struct bpf_map *map, void *key, u64 flags)
+ *	Description
+ *		This helper is used in programs implementing policies at the
+ *		socket level. If the message *msg* is allowed to pass (i.e. if
+ *		the verdict eBPF program returns **SK_PASS**), redirect it to
+ *		the socket referenced by *map* (of type
+ *		**BPF_MAP_TYPE_SOCKHASH**) using hash *key*. Both ingress and
+ *		egress interfaces can be used for redirection. The
+ *		**BPF_F_INGRESS** value in *flags* is used to make the
+ *		distinction (ingress path is selected if the flag is present,
+ *		egress path otherwise). This is the only flag supported for now.
+ *	Return
+ *		**SK_PASS** on success, or **SK_DROP** on error.
+ *
+ * int bpf_sk_redirect_hash(struct sk_buff *skb, struct bpf_map *map, void *key, u64 flags)
+ *	Description
+ *		This helper is used in programs implementing policies at the
+ *		skb socket level. If the sk_buff *skb* is allowed to pass (i.e.
+ *		if the verdeict eBPF program returns **SK_PASS**), redirect it
+ *		to the socket referenced by *map* (of type
+ *		**BPF_MAP_TYPE_SOCKHASH**) using hash *key*. Both ingress and
+ *		egress interfaces can be used for redirection. The
+ *		**BPF_F_INGRESS** value in *flags* is used to make the
+ *		distinction (ingress path is selected if the flag is present,
+ *		egress otherwise). This is the only flag supported for now.
+ *	Return
+ *		**SK_PASS** on success, or **SK_DROP** on error.
  */
 #define __BPF_FUNC_MAPPER(FN)		\
 	FN(unspec),			\
@@ -1835,7 +1883,10 @@ struct bpf_stack_build_id {
 	FN(msg_pull_data),		\
 	FN(bind),			\
 	FN(xdp_adjust_tail),		\
-	FN(skb_get_xfrm_state),
+	FN(skb_get_xfrm_state),		\
+	FN(sock_hash_update),		\
+	FN(msg_redirect_hash),		\
+	FN(sk_redirect_hash),
 
 /* integer value in 'imm' field of BPF_CALL instruction selects which helper
  * function eBPF program intends to call
diff --git a/kernel/bpf/core.c b/kernel/bpf/core.c
index ba03ec3..5917cc1 100644
--- a/kernel/bpf/core.c
+++ b/kernel/bpf/core.c
@@ -1782,6 +1782,7 @@ void bpf_user_rnd_init_once(void)
 const struct bpf_func_proto bpf_get_current_uid_gid_proto __weak;
 const struct bpf_func_proto bpf_get_current_comm_proto __weak;
 const struct bpf_func_proto bpf_sock_map_update_proto __weak;
+const struct bpf_func_proto bpf_sock_hash_update_proto __weak;
 
 const struct bpf_func_proto * __weak bpf_get_trace_printk_proto(void)
 {
diff --git a/kernel/bpf/sockmap.c b/kernel/bpf/sockmap.c
index 4eef5b1..306eb5d 100644
--- a/kernel/bpf/sockmap.c
+++ b/kernel/bpf/sockmap.c
@@ -60,6 +60,28 @@ struct bpf_stab {
 	struct bpf_sock_progs progs;
 };
 
+struct bucket {
+	struct hlist_head head;
+	raw_spinlock_t lock;
+};
+
+struct bpf_htab {
+	struct bpf_map map;
+	struct bucket *buckets;
+	atomic_t count;
+	u32 n_buckets;
+	u32 elem_size;
+	struct bpf_sock_progs progs;
+};
+
+struct htab_elem {
+	struct rcu_head rcu;
+	struct hlist_node hash_node;
+	u32 hash;
+	struct sock *sk;
+	char key[0];
+};
+
 enum smap_psock_state {
 	SMAP_TX_RUNNING,
 };
@@ -67,6 +89,8 @@ enum smap_psock_state {
 struct smap_psock_map_entry {
 	struct list_head list;
 	struct sock **entry;
+	struct htab_elem *hash_link;
+	struct bpf_htab *htab;
 };
 
 struct smap_psock {
@@ -195,6 +219,12 @@ static void bpf_tcp_release(struct sock *sk)
 	rcu_read_unlock();
 }
 
+static void free_htab_elem(struct bpf_htab *htab, struct htab_elem *l)
+{
+	atomic_dec(&htab->count);
+	kfree_rcu(l, rcu);
+}
+
 static void bpf_tcp_close(struct sock *sk, long timeout)
 {
 	void (*close_fun)(struct sock *sk, long timeout);
@@ -231,10 +261,16 @@ static void bpf_tcp_close(struct sock *sk, long timeout)
 	}
 
 	list_for_each_entry_safe(e, tmp, &psock->maps, list) {
-		osk = cmpxchg(e->entry, sk, NULL);
-		if (osk == sk) {
-			list_del(&e->list);
-			smap_release_sock(psock, sk);
+		if (e->entry) {
+			osk = cmpxchg(e->entry, sk, NULL);
+			if (osk == sk) {
+				list_del(&e->list);
+				smap_release_sock(psock, sk);
+			}
+		} else {
+			hlist_del_rcu(&e->hash_link->hash_node);
+			smap_release_sock(psock, e->hash_link->sk);
+			free_htab_elem(e->htab, e->hash_link);
 		}
 	}
 	write_unlock_bh(&sk->sk_callback_lock);
@@ -1526,12 +1562,14 @@ static struct bpf_map *sock_map_alloc(union bpf_attr *attr)
 	return ERR_PTR(err);
 }
 
-static void smap_list_remove(struct smap_psock *psock, struct sock **entry)
+static void smap_list_remove(struct smap_psock *psock,
+			     struct sock **entry,
+			     struct htab_elem *hash_link)
 {
 	struct smap_psock_map_entry *e, *tmp;
 
 	list_for_each_entry_safe(e, tmp, &psock->maps, list) {
-		if (e->entry == entry) {
+		if (e->entry == entry || e->hash_link == hash_link) {
 			list_del(&e->list);
 			break;
 		}
@@ -1569,7 +1607,7 @@ static void sock_map_free(struct bpf_map *map)
 		 * to be null and queued for garbage collection.
 		 */
 		if (likely(psock)) {
-			smap_list_remove(psock, &stab->sock_map[i]);
+			smap_list_remove(psock, &stab->sock_map[i], NULL);
 			smap_release_sock(psock, sock);
 		}
 		write_unlock_bh(&sock->sk_callback_lock);
@@ -1628,7 +1666,7 @@ static int sock_map_delete_elem(struct bpf_map *map, void *key)
 
 	if (psock->bpf_parse)
 		smap_stop_sock(psock, sock);
-	smap_list_remove(psock, &stab->sock_map[k]);
+	smap_list_remove(psock, &stab->sock_map[k], NULL);
 	smap_release_sock(psock, sock);
 out:
 	write_unlock_bh(&sock->sk_callback_lock);
@@ -1745,10 +1783,12 @@ static int __sock_map_ctx_update_elem(struct bpf_map *map,
 		new = true;
 	}
 
-	e = kzalloc(sizeof(*e), GFP_ATOMIC | __GFP_NOWARN);
-	if (!e) {
-		err = -ENOMEM;
-		goto out_progs;
+	if (map_link) {
+		e = kzalloc(sizeof(*e), GFP_ATOMIC | __GFP_NOWARN);
+		if (!e) {
+			err = -ENOMEM;
+			goto out_progs;
+		}
 	}
 
 	/* 3. At this point we have a reference to a valid psock that is
@@ -1782,6 +1822,7 @@ static int __sock_map_ctx_update_elem(struct bpf_map *map,
 	write_unlock_bh(&sock->sk_callback_lock);
 	return err;
 out_free:
+	kfree(e);
 	smap_release_sock(psock, sock);
 out_progs:
 	if (verdict)
@@ -1828,7 +1869,7 @@ static int sock_map_ctx_update_elem(struct bpf_sock_ops_kern *skops,
 		struct smap_psock *opsock = smap_psock_sk(osock);
 
 		write_lock_bh(&osock->sk_callback_lock);
-		smap_list_remove(opsock, &stab->sock_map[i]);
+		smap_list_remove(opsock, &stab->sock_map[i], NULL);
 		smap_release_sock(opsock, osock);
 		write_unlock_bh(&osock->sk_callback_lock);
 	}
@@ -1845,6 +1886,10 @@ int sock_map_prog(struct bpf_map *map, struct bpf_prog *prog, u32 type)
 		struct bpf_stab *stab = container_of(map, struct bpf_stab, map);
 
 		progs = &stab->progs;
+	} else if (map->map_type == BPF_MAP_TYPE_SOCKHASH) {
+		struct bpf_htab *htab = container_of(map, struct bpf_htab, map);
+
+		progs = &htab->progs;
 	} else {
 		return -EINVAL;
 	}
@@ -1905,11 +1950,19 @@ static int sock_map_update_elem(struct bpf_map *map,
 
 static void sock_map_release(struct bpf_map *map)
 {
-	struct bpf_stab *stab = container_of(map, struct bpf_stab, map);
 	struct bpf_sock_progs *progs;
 	struct bpf_prog *orig;
 
-	progs = &stab->progs;
+	if (map->map_type == BPF_MAP_TYPE_SOCKMAP) {
+		struct bpf_stab *stab = container_of(map, struct bpf_stab, map);
+
+		progs = &stab->progs;
+	} else {
+		struct bpf_htab *htab = container_of(map, struct bpf_htab, map);
+
+		progs = &htab->progs;
+	}
+
 	orig = xchg(&progs->bpf_parse, NULL);
 	if (orig)
 		bpf_prog_put(orig);
@@ -1922,6 +1975,390 @@ static void sock_map_release(struct bpf_map *map)
 		bpf_prog_put(orig);
 }
 
+static struct bpf_map *sock_hash_alloc(union bpf_attr *attr)
+{
+	struct bpf_htab *htab;
+	int i, err;
+	u64 cost;
+
+	if (!capable(CAP_NET_ADMIN))
+		return ERR_PTR(-EPERM);
+
+	/* check sanity of attributes */
+	if (attr->max_entries == 0 || attr->value_size != 4 ||
+	    attr->map_flags & ~SOCK_CREATE_FLAG_MASK)
+		return ERR_PTR(-EINVAL);
+
+	err = bpf_tcp_ulp_register();
+	if (err && err != -EEXIST)
+		return ERR_PTR(err);
+
+	htab = kzalloc(sizeof(*htab), GFP_USER);
+	if (!htab)
+		return ERR_PTR(-ENOMEM);
+
+	bpf_map_init_from_attr(&htab->map, attr);
+
+	htab->n_buckets = roundup_pow_of_two(htab->map.max_entries);
+	htab->elem_size = sizeof(struct htab_elem) +
+			  round_up(htab->map.key_size, 8);
+
+	if (htab->n_buckets == 0 ||
+	    htab->n_buckets > U32_MAX / sizeof(struct bucket))
+		goto free_htab;
+
+	cost = (u64) htab->n_buckets * sizeof(struct bucket) +
+	       (u64) htab->elem_size * htab->map.max_entries;
+
+	if (cost >= U32_MAX - PAGE_SIZE)
+		goto free_htab;
+
+	htab->map.pages = round_up(cost, PAGE_SIZE) >> PAGE_SHIFT;
+	err = bpf_map_precharge_memlock(htab->map.pages);
+	if (err)
+		goto free_htab;
+
+	err = -ENOMEM;
+	htab->buckets = bpf_map_area_alloc(
+				htab->n_buckets * sizeof(struct bucket),
+				htab->map.numa_node);
+	if (!htab->buckets)
+		goto free_htab;
+
+	for (i = 0; i < htab->n_buckets; i++) {
+		INIT_HLIST_HEAD(&htab->buckets[i].head);
+		raw_spin_lock_init(&htab->buckets[i].lock);
+	}
+
+	return &htab->map;
+free_htab:
+	kfree(htab);
+	return ERR_PTR(err);
+}
+
+static inline struct bucket *__select_bucket(struct bpf_htab *htab, u32 hash)
+{
+	return &htab->buckets[hash & (htab->n_buckets - 1)];
+}
+
+static inline struct hlist_head *select_bucket(struct bpf_htab *htab, u32 hash)
+{
+	return &__select_bucket(htab, hash)->head;
+}
+
+static void sock_hash_free(struct bpf_map *map)
+{
+	struct bpf_htab *htab = container_of(map, struct bpf_htab, map);
+	int i;
+
+	synchronize_rcu();
+
+	/* At this point no update, lookup or delete operations can happen.
+	 * However, be aware we can still get a socket state event updates,
+	 * and data ready callabacks that reference the psock from sk_user_data
+	 * Also psock worker threads are still in-flight. So smap_release_sock
+	 * will only free the psock after cancel_sync on the worker threads
+	 * and a grace period expire to ensure psock is really safe to remove.
+	 */
+	rcu_read_lock();
+	for (i = 0; i < htab->n_buckets; i++) {
+		struct hlist_head *head = select_bucket(htab, i);
+		struct hlist_node *n;
+		struct htab_elem *l;
+
+		hlist_for_each_entry_safe(l, n, head, hash_node) {
+			struct sock *sock = l->sk;
+			struct smap_psock *psock;
+
+			hlist_del_rcu(&l->hash_node);
+			write_lock_bh(&sock->sk_callback_lock);
+			psock = smap_psock_sk(sock);
+			/* This check handles a racing sock event that can get
+			 * the sk_callback_lock before this case but after xchg
+			 * causing the refcnt to hit zero and sock user data
+			 * (psock) to be null and queued for garbage collection.
+			 */
+			if (likely(psock)) {
+				smap_list_remove(psock, NULL, l);
+				smap_release_sock(psock, sock);
+			}
+			write_unlock_bh(&sock->sk_callback_lock);
+			kfree(l);
+		}
+	}
+	rcu_read_unlock();
+	bpf_map_area_free(htab->buckets);
+	kfree(htab);
+}
+
+static struct htab_elem *alloc_sock_hash_elem(struct bpf_htab *htab,
+					      void *key, u32 key_size, u32 hash,
+					      struct sock *sk,
+					      struct htab_elem *old_elem)
+{
+	struct htab_elem *l_new;
+
+	if (atomic_inc_return(&htab->count) > htab->map.max_entries) {
+		if (!old_elem) {
+			atomic_dec(&htab->count);
+			return ERR_PTR(-E2BIG);
+		}
+	}
+	l_new = kmalloc_node(htab->elem_size, GFP_ATOMIC | __GFP_NOWARN,
+			     htab->map.numa_node);
+	if (!l_new)
+		return ERR_PTR(-ENOMEM);
+
+	memcpy(l_new->key, key, key_size);
+	l_new->sk = sk;
+	l_new->hash = hash;
+	return l_new;
+}
+
+static struct htab_elem *lookup_elem_raw(struct hlist_head *head,
+					 u32 hash, void *key, u32 key_size)
+{
+	struct htab_elem *l;
+
+	hlist_for_each_entry_rcu(l, head, hash_node) {
+		if (l->hash == hash && !memcmp(&l->key, key, key_size))
+			return l;
+	}
+
+	return NULL;
+}
+
+static inline u32 htab_map_hash(const void *key, u32 key_len)
+{
+	return jhash(key, key_len, 0);
+}
+
+static int sock_hash_get_next_key(struct bpf_map *map,
+				  void *key, void *next_key)
+{
+	struct bpf_htab *htab = container_of(map, struct bpf_htab, map);
+	struct htab_elem *l, *next_l;
+	struct hlist_head *h;
+	u32 hash, key_size;
+	int i = 0;
+
+	WARN_ON_ONCE(!rcu_read_lock_held());
+
+	key_size = map->key_size;
+	if (!key)
+		goto find_first_elem;
+	hash = htab_map_hash(key, key_size);
+	h = select_bucket(htab, hash);
+
+	l = lookup_elem_raw(h, hash, key, key_size);
+	if (!l)
+		goto find_first_elem;
+	next_l = hlist_entry_safe(
+		     rcu_dereference_raw(hlist_next_rcu(&l->hash_node)),
+		     struct htab_elem, hash_node);
+	if (next_l) {
+		memcpy(next_key, next_l->key, key_size);
+		return 0;
+	}
+
+	/* no more elements in this hash list, go to the next bucket */
+	i = hash & (htab->n_buckets - 1);
+	i++;
+
+find_first_elem:
+	/* iterate over buckets */
+	for (; i < htab->n_buckets; i++) {
+		h = select_bucket(htab, i);
+
+		/* pick first element in the bucket */
+		next_l = hlist_entry_safe(
+				rcu_dereference_raw(hlist_first_rcu(h)),
+				struct htab_elem, hash_node);
+		if (next_l) {
+			/* if it's not empty, just return it */
+			memcpy(next_key, next_l->key, key_size);
+			return 0;
+		}
+	}
+
+	/* iterated over all buckets and all elements */
+	return -ENOENT;
+}
+
+static int sock_hash_ctx_update_elem(struct bpf_sock_ops_kern *skops,
+				     struct bpf_map *map,
+				     void *key, u64 map_flags)
+{
+	struct bpf_htab *htab = container_of(map, struct bpf_htab, map);
+	struct bpf_sock_progs *progs = &htab->progs;
+	struct htab_elem *l_new = NULL, *l_old;
+	struct smap_psock_map_entry *e = NULL;
+	struct hlist_head *head;
+	struct smap_psock *psock;
+	u32 key_size, hash;
+	struct sock *sock;
+	struct bucket *b;
+	int err;
+
+	sock = skops->sk;
+
+	if (sock->sk_type != SOCK_STREAM ||
+	    sock->sk_protocol != IPPROTO_TCP)
+		return -EOPNOTSUPP;
+
+	if (unlikely(map_flags > BPF_EXIST))
+		return -EINVAL;
+
+	e = kzalloc(sizeof(*e), GFP_ATOMIC | __GFP_NOWARN);
+	if (!e)
+		return -ENOMEM;
+
+	WARN_ON_ONCE(!rcu_read_lock_held());
+	key_size = map->key_size;
+	hash = htab_map_hash(key, key_size);
+	b = __select_bucket(htab, hash);
+	head = &b->head;
+
+	err = __sock_map_ctx_update_elem(map, progs, sock, NULL, key);
+	if (err)
+		goto err;
+
+	/* bpf_map_update_elem() can be called in_irq() */
+	raw_spin_lock_bh(&b->lock);
+	l_old = lookup_elem_raw(head, hash, key, key_size);
+	if (l_old && map_flags == BPF_NOEXIST) {
+		err = -EEXIST;
+		goto bucket_err;
+	}
+	if (!l_old && map_flags == BPF_EXIST) {
+		err = -ENOENT;
+		goto bucket_err;
+	}
+
+	l_new = alloc_sock_hash_elem(htab, key, key_size, hash, sock, l_old);
+	if (IS_ERR(l_new)) {
+		err = PTR_ERR(l_new);
+		goto bucket_err;
+	}
+
+	psock = smap_psock_sk(sock);
+	if (unlikely(!psock)) {
+		err = -EINVAL;
+		goto bucket_err;
+	}
+
+	e->hash_link = l_new;
+	e->htab = container_of(map, struct bpf_htab, map);
+	list_add_tail(&e->list, &psock->maps);
+
+	/* add new element to the head of the list, so that
+	 * concurrent search will find it before old elem
+	 */
+	hlist_add_head_rcu(&l_new->hash_node, head);
+	if (l_old) {
+		psock = smap_psock_sk(l_old->sk);
+
+		hlist_del_rcu(&l_old->hash_node);
+		smap_list_remove(psock, NULL, l_old);
+		smap_release_sock(psock, l_old->sk);
+		free_htab_elem(htab, l_old);
+	}
+	raw_spin_unlock_bh(&b->lock);
+	return 0;
+bucket_err:
+	raw_spin_unlock_bh(&b->lock);
+err:
+	kfree(e);
+	psock = smap_psock_sk(sock);
+	if (psock)
+		smap_release_sock(psock, sock);
+	return err;
+}
+
+static int sock_hash_update_elem(struct bpf_map *map,
+				void *key, void *value, u64 flags)
+{
+	struct bpf_sock_ops_kern skops;
+	u32 fd = *(u32 *)value;
+	struct socket *socket;
+	int err;
+
+	socket = sockfd_lookup(fd, &err);
+	if (!socket)
+		return err;
+
+	skops.sk = socket->sk;
+	if (!skops.sk) {
+		fput(socket->file);
+		return -EINVAL;
+	}
+
+	err = sock_hash_ctx_update_elem(&skops, map, key, flags);
+	fput(socket->file);
+	return err;
+}
+
+static int sock_hash_delete_elem(struct bpf_map *map, void *key)
+{
+	struct bpf_htab *htab = container_of(map, struct bpf_htab, map);
+	struct hlist_head *head;
+	struct bucket *b;
+	struct htab_elem *l;
+	u32 hash, key_size;
+	int ret = -ENOENT;
+
+	key_size = map->key_size;
+	hash = htab_map_hash(key, key_size);
+	b = __select_bucket(htab, hash);
+	head = &b->head;
+
+	raw_spin_lock_bh(&b->lock);
+	l = lookup_elem_raw(head, hash, key, key_size);
+	if (l) {
+		struct sock *sock = l->sk;
+		struct smap_psock *psock;
+
+		hlist_del_rcu(&l->hash_node);
+		write_lock_bh(&sock->sk_callback_lock);
+		psock = smap_psock_sk(sock);
+		/* This check handles a racing sock event that can get the
+		 * sk_callback_lock before this case but after xchg happens
+		 * causing the refcnt to hit zero and sock user data (psock)
+		 * to be null and queued for garbage collection.
+		 */
+		if (likely(psock)) {
+			smap_list_remove(psock, NULL, l);
+			smap_release_sock(psock, sock);
+		}
+		write_unlock_bh(&sock->sk_callback_lock);
+		free_htab_elem(htab, l);
+		ret = 0;
+	}
+	raw_spin_unlock_bh(&b->lock);
+	return ret;
+}
+
+struct sock  *__sock_hash_lookup_elem(struct bpf_map *map, void *key)
+{
+	struct bpf_htab *htab = container_of(map, struct bpf_htab, map);
+	struct hlist_head *head;
+	struct htab_elem *l;
+	u32 key_size, hash;
+	struct bucket *b;
+	struct sock *sk;
+
+	key_size = map->key_size;
+	hash = htab_map_hash(key, key_size);
+	b = __select_bucket(htab, hash);
+	head = &b->head;
+
+	raw_spin_lock_bh(&b->lock);
+	l = lookup_elem_raw(head, hash, key, key_size);
+	sk = l ? l->sk : NULL;
+	raw_spin_unlock_bh(&b->lock);
+	return sk;
+}
+
 const struct bpf_map_ops sock_map_ops = {
 	.map_alloc = sock_map_alloc,
 	.map_free = sock_map_free,
@@ -1932,6 +2369,15 @@ static void sock_map_release(struct bpf_map *map)
 	.map_release_uref = sock_map_release,
 };
 
+const struct bpf_map_ops sock_hash_ops = {
+	.map_alloc = sock_hash_alloc,
+	.map_free = sock_hash_free,
+	.map_lookup_elem = sock_map_lookup,
+	.map_get_next_key = sock_hash_get_next_key,
+	.map_update_elem = sock_hash_update_elem,
+	.map_delete_elem = sock_hash_delete_elem,
+};
+
 BPF_CALL_4(bpf_sock_map_update, struct bpf_sock_ops_kern *, bpf_sock,
 	   struct bpf_map *, map, void *, key, u64, flags)
 {
@@ -1949,3 +2395,21 @@ static void sock_map_release(struct bpf_map *map)
 	.arg3_type	= ARG_PTR_TO_MAP_KEY,
 	.arg4_type	= ARG_ANYTHING,
 };
+
+BPF_CALL_4(bpf_sock_hash_update, struct bpf_sock_ops_kern *, bpf_sock,
+	   struct bpf_map *, map, void *, key, u64, flags)
+{
+	WARN_ON_ONCE(!rcu_read_lock_held());
+	return sock_hash_ctx_update_elem(bpf_sock, map, key, flags);
+}
+
+const struct bpf_func_proto bpf_sock_hash_update_proto = {
+	.func		= bpf_sock_hash_update,
+	.gpl_only	= false,
+	.pkt_access	= true,
+	.ret_type	= RET_INTEGER,
+	.arg1_type	= ARG_PTR_TO_CTX,
+	.arg2_type	= ARG_CONST_MAP_PTR,
+	.arg3_type	= ARG_PTR_TO_MAP_KEY,
+	.arg4_type	= ARG_ANYTHING,
+};
diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c
index eb1a596..cd3966d 100644
--- a/kernel/bpf/verifier.c
+++ b/kernel/bpf/verifier.c
@@ -2078,6 +2078,13 @@ static int check_map_func_compatibility(struct bpf_verifier_env *env,
 		    func_id != BPF_FUNC_msg_redirect_map)
 			goto error;
 		break;
+	case BPF_MAP_TYPE_SOCKHASH:
+		if (func_id != BPF_FUNC_sk_redirect_hash &&
+		    func_id != BPF_FUNC_sock_hash_update &&
+		    func_id != BPF_FUNC_map_delete_elem &&
+		    func_id != BPF_FUNC_msg_redirect_hash)
+			goto error;
+		break;
 	default:
 		break;
 	}
@@ -2114,11 +2121,14 @@ static int check_map_func_compatibility(struct bpf_verifier_env *env,
 		break;
 	case BPF_FUNC_sk_redirect_map:
 	case BPF_FUNC_msg_redirect_map:
+	case BPF_FUNC_sock_map_update:
 		if (map->map_type != BPF_MAP_TYPE_SOCKMAP)
 			goto error;
 		break;
-	case BPF_FUNC_sock_map_update:
-		if (map->map_type != BPF_MAP_TYPE_SOCKMAP)
+	case BPF_FUNC_sk_redirect_hash:
+	case BPF_FUNC_msg_redirect_hash:
+	case BPF_FUNC_sock_hash_update:
+		if (map->map_type != BPF_MAP_TYPE_SOCKHASH)
 			goto error;
 		break;
 	default:
diff --git a/net/core/filter.c b/net/core/filter.c
index 5623dc8..4cde871 100644
--- a/net/core/filter.c
+++ b/net/core/filter.c
@@ -1851,6 +1851,33 @@ int skb_do_redirect(struct sk_buff *skb)
 	.arg2_type      = ARG_ANYTHING,
 };
 
+BPF_CALL_4(bpf_sk_redirect_hash, struct sk_buff *, skb,
+	   struct bpf_map *, map, void *, key, u64, flags)
+{
+	struct tcp_skb_cb *tcb = TCP_SKB_CB(skb);
+
+	/* If user passes invalid input drop the packet. */
+	if (unlikely(flags & ~(BPF_F_INGRESS)))
+		return SK_DROP;
+
+	tcb->bpf.flags = flags;
+	tcb->bpf.sk_redir = __sock_hash_lookup_elem(map, key);
+	if (!tcb->bpf.sk_redir)
+		return SK_DROP;
+
+	return SK_PASS;
+}
+
+static const struct bpf_func_proto bpf_sk_redirect_hash_proto = {
+	.func           = bpf_sk_redirect_hash,
+	.gpl_only       = false,
+	.ret_type       = RET_INTEGER,
+	.arg1_type	= ARG_PTR_TO_CTX,
+	.arg2_type      = ARG_CONST_MAP_PTR,
+	.arg3_type      = ARG_PTR_TO_MAP_KEY,
+	.arg4_type      = ARG_ANYTHING,
+};
+
 BPF_CALL_4(bpf_sk_redirect_map, struct sk_buff *, skb,
 	   struct bpf_map *, map, u32, key, u64, flags)
 {
@@ -1885,6 +1912,31 @@ struct sock *do_sk_redirect_map(struct sk_buff *skb)
 	.arg4_type      = ARG_ANYTHING,
 };
 
+BPF_CALL_4(bpf_msg_redirect_hash, struct sk_msg_buff *, msg,
+	   struct bpf_map *, map, void *, key, u64, flags)
+{
+	/* If user passes invalid input drop the packet. */
+	if (unlikely(flags & ~(BPF_F_INGRESS)))
+		return SK_DROP;
+
+	msg->flags = flags;
+	msg->sk_redir = __sock_hash_lookup_elem(map, key);
+	if (!msg->sk_redir)
+		return SK_DROP;
+
+	return SK_PASS;
+}
+
+static const struct bpf_func_proto bpf_msg_redirect_hash_proto = {
+	.func           = bpf_msg_redirect_hash,
+	.gpl_only       = false,
+	.ret_type       = RET_INTEGER,
+	.arg1_type	= ARG_PTR_TO_CTX,
+	.arg2_type      = ARG_CONST_MAP_PTR,
+	.arg3_type      = ARG_PTR_TO_MAP_KEY,
+	.arg4_type      = ARG_ANYTHING,
+};
+
 BPF_CALL_4(bpf_msg_redirect_map, struct sk_msg_buff *, msg,
 	   struct bpf_map *, map, u32, key, u64, flags)
 {
@@ -3987,6 +4039,8 @@ static unsigned long bpf_xdp_copy(void *dst_buff, const void *src_buff,
 		return &bpf_sock_ops_cb_flags_set_proto;
 	case BPF_FUNC_sock_map_update:
 		return &bpf_sock_map_update_proto;
+	case BPF_FUNC_sock_hash_update:
+		return &bpf_sock_hash_update_proto;
 	default:
 		return bpf_base_func_proto(func_id);
 	}
@@ -3998,6 +4052,8 @@ static unsigned long bpf_xdp_copy(void *dst_buff, const void *src_buff,
 	switch (func_id) {
 	case BPF_FUNC_msg_redirect_map:
 		return &bpf_msg_redirect_map_proto;
+	case BPF_FUNC_msg_redirect_hash:
+		return &bpf_msg_redirect_hash_proto;
 	case BPF_FUNC_msg_apply_bytes:
 		return &bpf_msg_apply_bytes_proto;
 	case BPF_FUNC_msg_cork_bytes:
@@ -4029,6 +4085,8 @@ static unsigned long bpf_xdp_copy(void *dst_buff, const void *src_buff,
 		return &bpf_get_socket_uid_proto;
 	case BPF_FUNC_sk_redirect_map:
 		return &bpf_sk_redirect_map_proto;
+	case BPF_FUNC_sk_redirect_hash:
+		return &bpf_sk_redirect_hash_proto;
 	default:
 		return bpf_base_func_proto(func_id);
 	}
-- 
1.9.1

^ permalink raw reply related

* [PATCH bpf-next v4 1/4] bpf: sockmap, refactor sockmap routines to work with hashmap
From: John Fastabend @ 2018-05-03 18:28 UTC (permalink / raw)
  To: borkmann, ast; +Cc: netdev, John Fastabend
In-Reply-To: <1525372108-8690-1-git-send-email-john.fastabend@gmail.com>

This patch only refactors the existing sockmap code. This will allow
much of the psock initialization code path and bpf helper codes to
work for both sockmap bpf map types that are backed by an array, the
currently supported type, and the new hash backed bpf map type
sockhash.

Most the fallout comes from three changes,

  - Pushing bpf programs into an independent structure so we
    can use it from the htab struct in the next patch.
  - Generalizing helpers to use void *key instead of the hardcoded
    u32.
  - Instead of passing map/key through the metadata we now do
    the lookup inline. This avoids storing the key in the metadata
    which will be useful when keys can be longer than 4 bytes. We
    rename the sk pointers to sk_redir at this point as well to
    avoid any confusion between the current sk pointer and the
    redirect pointer sk_redir.

Signed-off-by: John Fastabend <john.fastabend@gmail.com>
---
 include/linux/filter.h |   3 +-
 include/net/tcp.h      |   3 +-
 kernel/bpf/sockmap.c   | 148 +++++++++++++++++++++++++++++--------------------
 net/core/filter.c      |  31 +++--------
 4 files changed, 98 insertions(+), 87 deletions(-)

diff --git a/include/linux/filter.h b/include/linux/filter.h
index 4da8b23..31cdfe8 100644
--- a/include/linux/filter.h
+++ b/include/linux/filter.h
@@ -512,9 +512,8 @@ struct sk_msg_buff {
 	int sg_end;
 	struct scatterlist sg_data[MAX_SKB_FRAGS];
 	bool sg_copy[MAX_SKB_FRAGS];
-	__u32 key;
 	__u32 flags;
-	struct bpf_map *map;
+	struct sock *sk_redir;
 	struct sk_buff *skb;
 	struct list_head list;
 };
diff --git a/include/net/tcp.h b/include/net/tcp.h
index 833154e..089185a 100644
--- a/include/net/tcp.h
+++ b/include/net/tcp.h
@@ -814,9 +814,8 @@ struct tcp_skb_cb {
 #endif
 		} header;	/* For incoming skbs */
 		struct {
-			__u32 key;
 			__u32 flags;
-			struct bpf_map *map;
+			struct sock *sk_redir;
 			void *data_end;
 		} bpf;
 	};
diff --git a/kernel/bpf/sockmap.c b/kernel/bpf/sockmap.c
index 801273b..4eef5b1 100644
--- a/kernel/bpf/sockmap.c
+++ b/kernel/bpf/sockmap.c
@@ -48,14 +48,18 @@
 #define SOCK_CREATE_FLAG_MASK \
 	(BPF_F_NUMA_NODE | BPF_F_RDONLY | BPF_F_WRONLY)
 
-struct bpf_stab {
-	struct bpf_map map;
-	struct sock **sock_map;
+struct bpf_sock_progs {
 	struct bpf_prog *bpf_tx_msg;
 	struct bpf_prog *bpf_parse;
 	struct bpf_prog *bpf_verdict;
 };
 
+struct bpf_stab {
+	struct bpf_map map;
+	struct sock **sock_map;
+	struct bpf_sock_progs progs;
+};
+
 enum smap_psock_state {
 	SMAP_TX_RUNNING,
 };
@@ -460,7 +464,7 @@ static int free_curr_sg(struct sock *sk, struct sk_msg_buff *md)
 static int bpf_map_msg_verdict(int _rc, struct sk_msg_buff *md)
 {
 	return ((_rc == SK_PASS) ?
-	       (md->map ? __SK_REDIRECT : __SK_PASS) :
+	       (md->sk_redir ? __SK_REDIRECT : __SK_PASS) :
 	       __SK_DROP);
 }
 
@@ -1091,7 +1095,7 @@ static int smap_verdict_func(struct smap_psock *psock, struct sk_buff *skb)
 	 * when we orphan the skb so that we don't have the possibility
 	 * to reference a stale map.
 	 */
-	TCP_SKB_CB(skb)->bpf.map = NULL;
+	TCP_SKB_CB(skb)->bpf.sk_redir = NULL;
 	skb->sk = psock->sock;
 	bpf_compute_data_pointers(skb);
 	preempt_disable();
@@ -1101,7 +1105,7 @@ static int smap_verdict_func(struct smap_psock *psock, struct sk_buff *skb)
 
 	/* Moving return codes from UAPI namespace into internal namespace */
 	return rc == SK_PASS ?
-		(TCP_SKB_CB(skb)->bpf.map ? __SK_REDIRECT : __SK_PASS) :
+		(TCP_SKB_CB(skb)->bpf.sk_redir ? __SK_REDIRECT : __SK_PASS) :
 		__SK_DROP;
 }
 
@@ -1371,7 +1375,6 @@ static int smap_init_sock(struct smap_psock *psock,
 }
 
 static void smap_init_progs(struct smap_psock *psock,
-			    struct bpf_stab *stab,
 			    struct bpf_prog *verdict,
 			    struct bpf_prog *parse)
 {
@@ -1449,14 +1452,13 @@ static void smap_gc_work(struct work_struct *w)
 	kfree(psock);
 }
 
-static struct smap_psock *smap_init_psock(struct sock *sock,
-					  struct bpf_stab *stab)
+static struct smap_psock *smap_init_psock(struct sock *sock, int node)
 {
 	struct smap_psock *psock;
 
 	psock = kzalloc_node(sizeof(struct smap_psock),
 			     GFP_ATOMIC | __GFP_NOWARN,
-			     stab->map.numa_node);
+			     node);
 	if (!psock)
 		return ERR_PTR(-ENOMEM);
 
@@ -1661,40 +1663,26 @@ static int sock_map_delete_elem(struct bpf_map *map, void *key)
  *  - sock_map must use READ_ONCE and (cmp)xchg operations
  *  - BPF verdict/parse programs must use READ_ONCE and xchg operations
  */
-static int sock_map_ctx_update_elem(struct bpf_sock_ops_kern *skops,
-				    struct bpf_map *map,
-				    void *key, u64 flags)
+
+static int __sock_map_ctx_update_elem(struct bpf_map *map,
+				      struct bpf_sock_progs *progs,
+				      struct sock *sock,
+				      struct sock **map_link,
+				      void *key)
 {
-	struct bpf_stab *stab = container_of(map, struct bpf_stab, map);
-	struct smap_psock_map_entry *e = NULL;
 	struct bpf_prog *verdict, *parse, *tx_msg;
-	struct sock *osock, *sock;
+	struct smap_psock_map_entry *e = NULL;
 	struct smap_psock *psock;
-	u32 i = *(u32 *)key;
 	bool new = false;
 	int err;
 
-	if (unlikely(flags > BPF_EXIST))
-		return -EINVAL;
-
-	if (unlikely(i >= stab->map.max_entries))
-		return -E2BIG;
-
-	sock = READ_ONCE(stab->sock_map[i]);
-	if (flags == BPF_EXIST && !sock)
-		return -ENOENT;
-	else if (flags == BPF_NOEXIST && sock)
-		return -EEXIST;
-
-	sock = skops->sk;
-
 	/* 1. If sock map has BPF programs those will be inherited by the
 	 * sock being added. If the sock is already attached to BPF programs
 	 * this results in an error.
 	 */
-	verdict = READ_ONCE(stab->bpf_verdict);
-	parse = READ_ONCE(stab->bpf_parse);
-	tx_msg = READ_ONCE(stab->bpf_tx_msg);
+	verdict = READ_ONCE(progs->bpf_verdict);
+	parse = READ_ONCE(progs->bpf_parse);
+	tx_msg = READ_ONCE(progs->bpf_tx_msg);
 
 	if (parse && verdict) {
 		/* bpf prog refcnt may be zero if a concurrent attach operation
@@ -1702,11 +1690,11 @@ static int sock_map_ctx_update_elem(struct bpf_sock_ops_kern *skops,
 		 * we increment the refcnt. If this is the case abort with an
 		 * error.
 		 */
-		verdict = bpf_prog_inc_not_zero(stab->bpf_verdict);
+		verdict = bpf_prog_inc_not_zero(progs->bpf_verdict);
 		if (IS_ERR(verdict))
 			return PTR_ERR(verdict);
 
-		parse = bpf_prog_inc_not_zero(stab->bpf_parse);
+		parse = bpf_prog_inc_not_zero(progs->bpf_parse);
 		if (IS_ERR(parse)) {
 			bpf_prog_put(verdict);
 			return PTR_ERR(parse);
@@ -1714,7 +1702,7 @@ static int sock_map_ctx_update_elem(struct bpf_sock_ops_kern *skops,
 	}
 
 	if (tx_msg) {
-		tx_msg = bpf_prog_inc_not_zero(stab->bpf_tx_msg);
+		tx_msg = bpf_prog_inc_not_zero(progs->bpf_tx_msg);
 		if (IS_ERR(tx_msg)) {
 			if (verdict)
 				bpf_prog_put(verdict);
@@ -1747,7 +1735,7 @@ static int sock_map_ctx_update_elem(struct bpf_sock_ops_kern *skops,
 			goto out_progs;
 		}
 	} else {
-		psock = smap_init_psock(sock, stab);
+		psock = smap_init_psock(sock, map->numa_node);
 		if (IS_ERR(psock)) {
 			err = PTR_ERR(psock);
 			goto out_progs;
@@ -1762,7 +1750,6 @@ static int sock_map_ctx_update_elem(struct bpf_sock_ops_kern *skops,
 		err = -ENOMEM;
 		goto out_progs;
 	}
-	e->entry = &stab->sock_map[i];
 
 	/* 3. At this point we have a reference to a valid psock that is
 	 * running. Attach any BPF programs needed.
@@ -1779,7 +1766,7 @@ static int sock_map_ctx_update_elem(struct bpf_sock_ops_kern *skops,
 		err = smap_init_sock(psock, sock);
 		if (err)
 			goto out_free;
-		smap_init_progs(psock, stab, verdict, parse);
+		smap_init_progs(psock, verdict, parse);
 		smap_start_sock(psock, sock);
 	}
 
@@ -1788,19 +1775,12 @@ static int sock_map_ctx_update_elem(struct bpf_sock_ops_kern *skops,
 	 * it with. Because we can only have a single set of programs if
 	 * old_sock has a strp we can stop it.
 	 */
-	list_add_tail(&e->list, &psock->maps);
-	write_unlock_bh(&sock->sk_callback_lock);
-
-	osock = xchg(&stab->sock_map[i], sock);
-	if (osock) {
-		struct smap_psock *opsock = smap_psock_sk(osock);
-
-		write_lock_bh(&osock->sk_callback_lock);
-		smap_list_remove(opsock, &stab->sock_map[i]);
-		smap_release_sock(opsock, osock);
-		write_unlock_bh(&osock->sk_callback_lock);
+	if (map_link) {
+		e->entry = map_link;
+		list_add_tail(&e->list, &psock->maps);
 	}
-	return 0;
+	write_unlock_bh(&sock->sk_callback_lock);
+	return err;
 out_free:
 	smap_release_sock(psock, sock);
 out_progs:
@@ -1815,23 +1795,69 @@ static int sock_map_ctx_update_elem(struct bpf_sock_ops_kern *skops,
 	return err;
 }
 
-int sock_map_prog(struct bpf_map *map, struct bpf_prog *prog, u32 type)
+static int sock_map_ctx_update_elem(struct bpf_sock_ops_kern *skops,
+				    struct bpf_map *map,
+				    void *key, u64 flags)
 {
 	struct bpf_stab *stab = container_of(map, struct bpf_stab, map);
+	struct bpf_sock_progs *progs = &stab->progs;
+	struct sock *osock, *sock;
+	u32 i = *(u32 *)key;
+	int err;
+
+	if (unlikely(flags > BPF_EXIST))
+		return -EINVAL;
+
+	if (unlikely(i >= stab->map.max_entries))
+		return -E2BIG;
+
+	sock = READ_ONCE(stab->sock_map[i]);
+	if (flags == BPF_EXIST && !sock)
+		return -ENOENT;
+	else if (flags == BPF_NOEXIST && sock)
+		return -EEXIST;
+
+	sock = skops->sk;
+	err = __sock_map_ctx_update_elem(map, progs, sock, &stab->sock_map[i],
+					 key);
+	if (err)
+		goto out;
+
+	osock = xchg(&stab->sock_map[i], sock);
+	if (osock) {
+		struct smap_psock *opsock = smap_psock_sk(osock);
+
+		write_lock_bh(&osock->sk_callback_lock);
+		smap_list_remove(opsock, &stab->sock_map[i]);
+		smap_release_sock(opsock, osock);
+		write_unlock_bh(&osock->sk_callback_lock);
+	}
+out:
+	return 0;
+}
+
+int sock_map_prog(struct bpf_map *map, struct bpf_prog *prog, u32 type)
+{
+	struct bpf_sock_progs *progs;
 	struct bpf_prog *orig;
 
-	if (unlikely(map->map_type != BPF_MAP_TYPE_SOCKMAP))
+	if (map->map_type == BPF_MAP_TYPE_SOCKMAP) {
+		struct bpf_stab *stab = container_of(map, struct bpf_stab, map);
+
+		progs = &stab->progs;
+	} else {
 		return -EINVAL;
+	}
 
 	switch (type) {
 	case BPF_SK_MSG_VERDICT:
-		orig = xchg(&stab->bpf_tx_msg, prog);
+		orig = xchg(&progs->bpf_tx_msg, prog);
 		break;
 	case BPF_SK_SKB_STREAM_PARSER:
-		orig = xchg(&stab->bpf_parse, prog);
+		orig = xchg(&progs->bpf_parse, prog);
 		break;
 	case BPF_SK_SKB_STREAM_VERDICT:
-		orig = xchg(&stab->bpf_verdict, prog);
+		orig = xchg(&progs->bpf_verdict, prog);
 		break;
 	default:
 		return -EOPNOTSUPP;
@@ -1880,16 +1906,18 @@ static int sock_map_update_elem(struct bpf_map *map,
 static void sock_map_release(struct bpf_map *map)
 {
 	struct bpf_stab *stab = container_of(map, struct bpf_stab, map);
+	struct bpf_sock_progs *progs;
 	struct bpf_prog *orig;
 
-	orig = xchg(&stab->bpf_parse, NULL);
+	progs = &stab->progs;
+	orig = xchg(&progs->bpf_parse, NULL);
 	if (orig)
 		bpf_prog_put(orig);
-	orig = xchg(&stab->bpf_verdict, NULL);
+	orig = xchg(&progs->bpf_verdict, NULL);
 	if (orig)
 		bpf_prog_put(orig);
 
-	orig = xchg(&stab->bpf_tx_msg, NULL);
+	orig = xchg(&progs->bpf_tx_msg, NULL);
 	if (orig)
 		bpf_prog_put(orig);
 }
diff --git a/net/core/filter.c b/net/core/filter.c
index d3781da..5623dc8 100644
--- a/net/core/filter.c
+++ b/net/core/filter.c
@@ -1860,9 +1860,10 @@ int skb_do_redirect(struct sk_buff *skb)
 	if (unlikely(flags & ~(BPF_F_INGRESS)))
 		return SK_DROP;
 
-	tcb->bpf.key = key;
 	tcb->bpf.flags = flags;
-	tcb->bpf.map = map;
+	tcb->bpf.sk_redir = __sock_map_lookup_elem(map, key);
+	if (!tcb->bpf.sk_redir)
+		return SK_DROP;
 
 	return SK_PASS;
 }
@@ -1870,16 +1871,8 @@ int skb_do_redirect(struct sk_buff *skb)
 struct sock *do_sk_redirect_map(struct sk_buff *skb)
 {
 	struct tcp_skb_cb *tcb = TCP_SKB_CB(skb);
-	struct sock *sk = NULL;
-
-	if (tcb->bpf.map) {
-		sk = __sock_map_lookup_elem(tcb->bpf.map, tcb->bpf.key);
 
-		tcb->bpf.key = 0;
-		tcb->bpf.map = NULL;
-	}
-
-	return sk;
+	return tcb->bpf.sk_redir;
 }
 
 static const struct bpf_func_proto bpf_sk_redirect_map_proto = {
@@ -1899,25 +1892,17 @@ struct sock *do_sk_redirect_map(struct sk_buff *skb)
 	if (unlikely(flags & ~(BPF_F_INGRESS)))
 		return SK_DROP;
 
-	msg->key = key;
 	msg->flags = flags;
-	msg->map = map;
+	msg->sk_redir = __sock_map_lookup_elem(map, key);
+	if (!msg->sk_redir)
+		return SK_DROP;
 
 	return SK_PASS;
 }
 
 struct sock *do_msg_redirect_map(struct sk_msg_buff *msg)
 {
-	struct sock *sk = NULL;
-
-	if (msg->map) {
-		sk = __sock_map_lookup_elem(msg->map, msg->key);
-
-		msg->key = 0;
-		msg->map = NULL;
-	}
-
-	return sk;
+	return msg->sk_redir;
 }
 
 static const struct bpf_func_proto bpf_msg_redirect_map_proto = {
-- 
1.9.1

^ permalink raw reply related

* [PATCH bpf-next v4 0/4] Hash support for sock
From: John Fastabend @ 2018-05-03 18:28 UTC (permalink / raw)
  To: borkmann, ast; +Cc: netdev, John Fastabend

In the original sockmap implementation we got away with using an
array similar to devmap. However, unlike devmap where an ifindex
has a nice 1:1 function into the map we have found some use cases
with sockets that need to be referenced using longer keys.

This series adds support for a sockhash map reusing as much of
the sockmap code as possible. I made the decision to add sockhash
specific helpers vs trying to generalize the existing helpers
because (a) they have sockmap in the name and (b) the keys are
different types. I prefer to be explicit here rather than play
type games or do something else tricky.

To test this we duplicate all the sockmap testing except swap out
the sockmap with a sockhash.

v2: fix file stats and add v2 tag
v3: move tool updates into test patch, move bpftool updates into
    its own patch, and fixup the test patch stats to catch the
    renamed file and provide only diffs +/- on that.
v4: Add documentation to UAPI bpf.h

John Fastabend (4):
  bpf: sockmap, refactor sockmap routines to work with hashmap
  bpf: sockmap, add hash map support
  bpf: bpftool, support for sockhash
  bpf: selftest additions for SOCKHASH

 include/linux/bpf.h                                |   8 +
 include/linux/bpf_types.h                          |   1 +
 include/linux/filter.h                             |   3 +-
 include/net/tcp.h                                  |   3 +-
 include/uapi/linux/bpf.h                           |   6 +-
 kernel/bpf/core.c                                  |   1 +
 kernel/bpf/sockmap.c                               | 638 ++++++++++++++++++---
 kernel/bpf/verifier.c                              |  14 +-
 net/core/filter.c                                  |  89 ++-
 tools/bpf/bpftool/map.c                            |   1 +
 tools/include/uapi/linux/bpf.h                     |   6 +-
 tools/testing/selftests/bpf/Makefile               |   3 +-
 tools/testing/selftests/bpf/test_sockhash_kern.c   |   4 +
 tools/testing/selftests/bpf/test_sockmap.c         |  27 +-
 .../{test_sockmap_kern.c => test_sockmap_kern.h}   |   6 +-
 15 files changed, 695 insertions(+), 115 deletions(-)
 create mode 100644 tools/testing/selftests/bpf/test_sockhash_kern.c
 rename tools/testing/selftests/bpf/{test_sockmap_kern.c => test_sockmap_kern.h} (98%)

-- 
1.9.1

^ permalink raw reply

* Re: [PATCH net-next] net: core: rework skb_probe_transport_header()
From: David Miller @ 2018-05-03 18:22 UTC (permalink / raw)
  To: pabeni; +Cc: netdev, edumazet, jasowang
In-Reply-To: <1525370356.3233.8.camel@redhat.com>

From: Paolo Abeni <pabeni@redhat.com>
Date: Thu, 03 May 2018 19:59:16 +0200

> On Thu, 2018-05-03 at 13:32 -0400, David Miller wrote:
>> From: Paolo Abeni <pabeni@redhat.com>
>> Date: Thu,  3 May 2018 11:35:35 +0200
>> 
>> > When the transport header is not available, skb_probe_transport_header()
>> > resorts to fully dissect the flow keys, even if it only needs the
>> > ransport offset. We can obtain the latter using a simpler flow dissector -
>> > flow_keys_buf_dissector - and a smaller struct for key storage.
>> > 
>> > The above gives ~50% performance improvement in micro benchmarking around
>> > skb_probe_transport_header(), mostly due to the smaller memset. Small, but
>> > measurable improvement is measured also in macro benchmarking - raw xmit
>> > tput from a VM.
>> > 
>> > Signed-off-by: Paolo Abeni <pabeni@redhat.com>
>> 
>> Please make this optimization generally, then every driver using
>> eth_get_headlen() (11 or so) will get the same improvement, for all
>> traffic!
> 
> Sure! I though about sending a follow-up patch for other
> flow_keys_buf_dissector users, but if you prefer a single patch, I'm
> fine with your version (just give me a little time tomorrow to test it)

Ok, please give is a spin and let me know how the testing goes.

^ permalink raw reply

* Re: [lkp-robot] 486ad79630 [ 15.532543] BUG: unable to handle kernel NULL pointer dereference at 0000000000000004
From: Cong Wang @ 2018-05-03 18:02 UTC (permalink / raw)
  To: Andrew Morton
  Cc: kernel test robot, kernel test robot,
	Linux Memory Management List, Johannes Weiner, LKP, David Miller,
	Linux Kernel Network Developers
In-Reply-To: <20180502224437.18fe3ebb9e6955f321638f82@linux-foundation.org>

On Wed, May 2, 2018 at 10:44 PM, Andrew Morton
<akpm@linux-foundation.org> wrote:
> On Wed, 2 May 2018 21:58:25 -0700 Cong Wang <xiyou.wangcong@gmail.com> wrote:
>
>> On Wed, May 2, 2018 at 9:27 PM, Andrew Morton <akpm@linux-foundation.org> wrote:
>> >
>> > So it's saying that something which got committed into Linus's tree
>> > after 4.17-rc3 has caused a NULL deref in
>> > sock_release->llc_ui_release+0x3a/0xd0
>>
>> Do you mean it contains commit 3a04ce7130a7
>> ("llc: fix NULL pointer deref for SOCK_ZAPPED")?
>
> That was in 4.17-rc3 so if this report's bisection is correct, that
> patch is innocent.
>
> origin.patch (http://ozlabs.org/~akpm/mmots/broken-out/origin.patch)
> contains no changes to net/llc/af_llc.c so perhaps this crash is also
> occurring in 4.17-rc3 base.

The commit I pointed out is supposed to fix this bug...

Please let me know if it doesn't.

^ permalink raw reply

* Re: [PATCH net-next] net: core: rework skb_probe_transport_header()
From: Paolo Abeni @ 2018-05-03 17:59 UTC (permalink / raw)
  To: David Miller; +Cc: netdev, edumazet, jasowang
In-Reply-To: <20180503.133254.1527765573520854366.davem@davemloft.net>

On Thu, 2018-05-03 at 13:32 -0400, David Miller wrote:
> From: Paolo Abeni <pabeni@redhat.com>
> Date: Thu,  3 May 2018 11:35:35 +0200
> 
> > When the transport header is not available, skb_probe_transport_header()
> > resorts to fully dissect the flow keys, even if it only needs the
> > ransport offset. We can obtain the latter using a simpler flow dissector -
> > flow_keys_buf_dissector - and a smaller struct for key storage.
> > 
> > The above gives ~50% performance improvement in micro benchmarking around
> > skb_probe_transport_header(), mostly due to the smaller memset. Small, but
> > measurable improvement is measured also in macro benchmarking - raw xmit
> > tput from a VM.
> > 
> > Signed-off-by: Paolo Abeni <pabeni@redhat.com>
> 
> Please make this optimization generally, then every driver using
> eth_get_headlen() (11 or so) will get the same improvement, for all
> traffic!

Sure! I though about sending a follow-up patch for other
flow_keys_buf_dissector users, but if you prefer a single patch, I'm
fine with your version (just give me a little time tomorrow to test it)

Please let me know,

Paolo

^ permalink raw reply

* Re: [PATCH net-next mlxsw v2 0/2] bridge: FDB: Notify about removal of non-user-added entries
From: David Miller @ 2018-05-03 17:47 UTC (permalink / raw)
  To: petrm
  Cc: netdev, bridge, jiri, idosch, ivecera, stephen, andrew,
	vivien.didelot, f.fainelli
In-Reply-To: <cover.1525350809.git.petrm@mellanox.com>

From: Petr Machata <petrm@mellanox.com>
Date: Thu, 03 May 2018 14:43:40 +0200

> Device drivers may generally need to keep in sync with bridge's FDB. In
> particular, for its offload of tc mirror action where the mirrored-to
> device is a gretap device, mlxsw needs to listen to a number of events,
> FDB events among the others. SWITCHDEV_FDB_{ADD,DEL}_TO_DEVICE would be
> a natural notification in that case.
> 
> However, for removal of FDB entries added due to device activity (as
> opposed to explicit addition through "bridge fdb add" or similar), there
> are no notifications.
> 
> Thus in patch #1, add the "added_by_user" field to switchdev
> notifications sent for FDB activity. Adapt drivers to ignore activity on
> non-user-added entries, to maintain the current behavior. Specifically
> in case of mlxsw, allow mlxsw_sp_span_respin() call for any and all FDB
> updates.
> 
> In patch #2, change the bridge driver to actually emit notifications for
> these FDB entries. Take care not to send notification for bridge
> updates that itself originate in SWITCHDEV_FDB_*_TO_BRIDGE events.
> 
> Changes from v1 to v2:
> - Instead of introducing a new variant of fdb_delete(), add a new
>   parameter to the existing function.
> - Name the parameter swdev_notify, not notify.

Series applied, thank you.

^ permalink raw reply

* Re: [PATCH net-next 0/4] mlxsw: Introduce support for CQEv1/2
From: David Miller @ 2018-05-03 17:44 UTC (permalink / raw)
  To: idosch; +Cc: netdev, jiri, mlxsw
In-Reply-To: <20180503115942.8279-1-idosch@mellanox.com>

From: Ido Schimmel <idosch@mellanox.com>
Date: Thu,  3 May 2018 14:59:38 +0300

> Jiri says:
> 
> Current SwitchX2 and Spectrum FWs support CQEv0 and that is what we
> implement in mlxsw. Spectrum FW also supports CQE v1 and v2.
> However, Spectrum-2 won't support CQEv0. Prepare for it and setup the
> CQE versions to use according to what is queried from FW.

Series applied, thanks.

^ permalink raw reply

* Re: [PATCH net-next] net: bridge: avoid duplicate notification on up/down/change netdev events
From: David Miller @ 2018-05-03 17:41 UTC (permalink / raw)
  To: nikolay; +Cc: netdev, roopa, stephen, bridge
In-Reply-To: <20180503104724.17799-1-nikolay@cumulusnetworks.com>

From: Nikolay Aleksandrov <nikolay@cumulusnetworks.com>
Date: Thu,  3 May 2018 13:47:24 +0300

> While handling netdevice events, br_device_event() sometimes uses
> br_stp_(disable|enable)_port which unconditionally send a notification,
> but then a second notification for the same event is sent at the end of
> the br_device_event() function. To avoid sending duplicate notifications
> in such cases, check if one has already been sent (i.e.
> br_stp_enable/disable_port have been called).
> The patch is based on a change by Satish Ashok.
> 
> Signed-off-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com>
> ---
> We've been running with a similar patch for over an year, it's been
> thoroughly tested. Sending for net-next since it's an improvement and
> not really a bug fix.

Looks good, applied, thanks!

^ permalink raw reply

* Re: [dm-devel] [PATCH v5] fault-injection: introduce kvmalloc fallback options
From: Mikulas Patocka @ 2018-05-03 17:40 UTC (permalink / raw)
  To: John Stoffel
  Cc: Andrew, dm-devel, eric.dumazet, mst, netdev, jasowang,
	Randy Dunlap, linux-kernel, Matthew Wilcox, Hocko,
	James Bottomley, Michal, edumazet, linux-mm, David Rientjes,
	Morton, virtualization, David Miller, Vlastimil Babka
In-Reply-To: <23273.48986.516559.317965@quad.stoffel.home>



On Wed, 2 May 2018, John Stoffel wrote:

> You miss my point, which is that there's no explanation of what the
> difference is between SLAB and SLUB and which I should choose.  The
> same goes here.  If the KConfig option doesn't give useful info, it's
> useless.

So what, we could write explamantion of that option.

> >> Now I also think that Linus has the right idea to not just sprinkle 
> >> BUG_ONs into the code, just dump and oops and keep going if you can.  
> >> If it's a filesystem or a device, turn it read only so that people 
> >> notice right away.
> 
> Mikulas> This vmalloc fallback is similar to
> Mikulas> CONFIG_DEBUG_KOBJECT_RELEASE.  CONFIG_DEBUG_KOBJECT_RELEASE
> Mikulas> changes the behavior of kobject_put in order to cause
> Mikulas> deliberate crashes (that wouldn't happen otherwise) in
> Mikulas> drivers that misuse kobject_put. In the same sense, we want
> Mikulas> to cause deliberate crashes (that wouldn't happen otherwise)
> Mikulas> in drivers that misuse kvmalloc.
> 
> Mikulas> The crashes will only happen in debugging kernels, not in
> Mikulas> production kernels.
> 
> Says you.  What about people or distros that enable it
> unconditionally?  They're going to get all kinds of reports and then
> turn it off again.  Crashing the system isn't the answer here.  

I've made that kvmalloc bug too (in the function 
dm_integrity_free_journal_scatterlist). I'd much rather like if the kernel 
crashed (because then - I would fix the bug). The kernel didn't crash and 
the bug sneaked into the official linux tree, where may be causing random 
crashes for other users.

Mikulas

^ permalink raw reply

* Re: [PATCH net-next mlxsw 0/3] selftests: forwarding: Updates to sysctl handling
From: David Miller @ 2018-05-03 17:37 UTC (permalink / raw)
  To: petrm; +Cc: netdev, linux-kselftest, shuah
In-Reply-To: <cover.1525343276.git.petrm@mellanox.com>

From: Petr Machata <petrm@mellanox.com>
Date: Thu, 03 May 2018 12:36:52 +0200

> Some selftests need to adjust sysctl settings. In order to be neutral to
> the system that the test is run on, it is a good practice to change back
> to the original setting after the test ends. That involves some
> boilerplate that can be abstracted away.
> 
> In patch #1, introduce two functions, sysctl_set() and sysctl_restore().
> The former stores the current value of a given setting, and sets a new
> value. The latter restores the setting to the previously-stored value.
> 
> In patch #2, use these wrappers in a number of tests.
> 
> Additionally in patch #3, fix a problem in mirror_gre_nh.sh, which
> neglected to set a sysctl that's crucial for the test to work.

Luckily that errant 'mlxsw' hides inside of [] so git will get rid of
it :-)

Series applied, thanks.

^ permalink raw reply

* Re: [PATCH net-next] net: core: rework skb_probe_transport_header()
From: David Miller @ 2018-05-03 17:32 UTC (permalink / raw)
  To: pabeni; +Cc: netdev, edumazet, jasowang
In-Reply-To: <7cbdf466f4a1bf44ddbb948428dc7bb0dad091a7.1525340013.git.pabeni@redhat.com>

From: Paolo Abeni <pabeni@redhat.com>
Date: Thu,  3 May 2018 11:35:35 +0200

> When the transport header is not available, skb_probe_transport_header()
> resorts to fully dissect the flow keys, even if it only needs the
> ransport offset. We can obtain the latter using a simpler flow dissector -
> flow_keys_buf_dissector - and a smaller struct for key storage.
> 
> The above gives ~50% performance improvement in micro benchmarking around
> skb_probe_transport_header(), mostly due to the smaller memset. Small, but
> measurable improvement is measured also in macro benchmarking - raw xmit
> tput from a VM.
> 
> Signed-off-by: Paolo Abeni <pabeni@redhat.com>

Please make this optimization generally, then every driver using
eth_get_headlen() (11 or so) will get the same improvement, for all
traffic!

Something like this:

diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h
index 908d66e55b14..e635e9ab818a 100644
--- a/include/linux/skbuff.h
+++ b/include/linux/skbuff.h
@@ -1171,7 +1171,7 @@ void __skb_get_hash(struct sk_buff *skb);
 u32 __skb_get_hash_symmetric(const struct sk_buff *skb);
 u32 skb_get_poff(const struct sk_buff *skb);
 u32 __skb_get_poff(const struct sk_buff *skb, void *data,
-		   const struct flow_keys *keys, int hlen);
+		   const struct flow_keys_basic *keys, int hlen);
 __be32 __skb_flow_get_ports(const struct sk_buff *skb, int thoff, u8 ip_proto,
 			    void *data, int hlen_proto);
 
@@ -1208,7 +1208,7 @@ static inline bool skb_flow_dissect_flow_keys(const struct sk_buff *skb,
 				  NULL, 0, 0, 0, flags);
 }
 
-static inline bool skb_flow_dissect_flow_keys_buf(struct flow_keys *flow,
+static inline bool skb_flow_dissect_flow_keys_buf(struct flow_keys_basic *flow,
 						  void *data, __be16 proto,
 						  int nhoff, int hlen,
 						  unsigned int flags)
@@ -2350,11 +2350,14 @@ static inline void skb_pop_mac_header(struct sk_buff *skb)
 static inline void skb_probe_transport_header(struct sk_buff *skb,
 					      const int offset_hint)
 {
-	struct flow_keys keys;
+	struct flow_keys_basic keys;
 
 	if (skb_transport_header_was_set(skb))
 		return;
-	else if (skb_flow_dissect_flow_keys(skb, &keys, 0))
+
+	memset(&keys, 0, sizeof(keys));
+	if (__skb_flow_dissect(skb, &flow_keys_buf_dissector, &keys,
+			       NULL, 0, 0, 0, 0))
 		skb_set_transport_header(skb, keys.control.thoff);
 	else
 		skb_set_transport_header(skb, offset_hint);
diff --git a/include/net/flow_dissector.h b/include/net/flow_dissector.h
index 9a074776f70b..e81dab6e9ac6 100644
--- a/include/net/flow_dissector.h
+++ b/include/net/flow_dissector.h
@@ -226,6 +226,11 @@ struct flow_dissector {
 	unsigned short int offset[FLOW_DISSECTOR_KEY_MAX];
 };
 
+struct flow_keys_basic {
+	struct flow_dissector_key_control control;
+	struct flow_dissector_key_basic basic;
+};
+
 struct flow_keys {
 	struct flow_dissector_key_control control;
 #define FLOW_KEYS_HASH_START_FIELD basic
diff --git a/net/core/flow_dissector.c b/net/core/flow_dissector.c
index d29f09bc5ff9..50d68b54f03a 100644
--- a/net/core/flow_dissector.c
+++ b/net/core/flow_dissector.c
@@ -1253,7 +1253,7 @@ __u32 skb_get_hash_perturb(const struct sk_buff *skb, u32 perturb)
 EXPORT_SYMBOL(skb_get_hash_perturb);
 
 u32 __skb_get_poff(const struct sk_buff *skb, void *data,
-		   const struct flow_keys *keys, int hlen)
+		   const struct flow_keys_basic *keys, int hlen)
 {
 	u32 poff = keys->control.thoff;
 
@@ -1314,9 +1314,11 @@ u32 __skb_get_poff(const struct sk_buff *skb, void *data,
  */
 u32 skb_get_poff(const struct sk_buff *skb)
 {
-	struct flow_keys keys;
+	struct flow_keys_basic keys;
 
-	if (!skb_flow_dissect_flow_keys(skb, &keys, 0))
+	memset(&keys, 0, sizeof(keys));
+	if (!__skb_flow_dissect(skb, &flow_keys_buf_dissector, &keys,
+				NULL, 0, 0, 0, 0))
 		return 0;
 
 	return __skb_get_poff(skb, skb->data, &keys, skb_headlen(skb));
@@ -1418,6 +1420,7 @@ struct flow_dissector flow_keys_dissector __read_mostly;
 EXPORT_SYMBOL(flow_keys_dissector);
 
 struct flow_dissector flow_keys_buf_dissector __read_mostly;
+EXPORT_SYMBOL(flow_keys_buf_dissector);
 
 static int __init init_default_flow_dissectors(void)
 {
diff --git a/net/ethernet/eth.c b/net/ethernet/eth.c
index eaeba9b99a73..873eca643f99 100644
--- a/net/ethernet/eth.c
+++ b/net/ethernet/eth.c
@@ -128,7 +128,7 @@ u32 eth_get_headlen(void *data, unsigned int len)
 {
 	const unsigned int flags = FLOW_DISSECTOR_F_PARSE_1ST_FRAG;
 	const struct ethhdr *eth = (const struct ethhdr *)data;
-	struct flow_keys keys;
+	struct flow_keys_basic keys;
 
 	/* this should never happen, but better safe than sorry */
 	if (unlikely(len < sizeof(*eth)))

^ permalink raw reply related


This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox