Netdev List

* Re: [PATCH 4/3] random: use siphash24 instead of md5 for get_random_int/long
From: Theodore Ts'o @ 2016-12-14 16:37 UTC
  To: Jason A. Donenfeld
  Cc: Netdev, David Miller, Linus Torvalds,
	kernel-hardening@lists.openwall.com, LKML, George Spelvin,
	Scott Bauer, Andi Kleen, Andy Lutomirski, Greg KH, Eric Biggers,
	linux-crypto, Jean-Philippe Aumasson
In-Reply-To: <20161214031037.25498-1-Jason@zx2c4.com>

On Wed, Dec 14, 2016 at 04:10:37AM +0100, Jason A. Donenfeld wrote:
> This duplicates the current algorithm for get_random_int/long, but uses
> siphash24 instead. This comes with several benefits. It's certainly
> faster and more cryptographically secure than MD5. This patch also
> hashes the pid, entropy, and timestamp as fixed width fields, in order
> to increase diffusion.
> 
> The previous md5 algorithm used a per-cpu md5 state, which caused
> successive calls to the function to chain upon each other. While it's
> not entirely clear that this kind of chaining is absolutely necessary
> when using a secure PRF like siphash24, it can't hurt, and the timing of
> the call chain does add a degree of natural entropy. So, in keeping with
> this design, instead of the massive per-cpu 64-byte md5 state, there is
> instead a per-cpu previously returned value for chaining.
> 
> Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
> Cc: Jean-Philippe Aumasson <jeanphilippe.aumasson@gmail.com>

get_random_int originally existed because the original urandom
algorithms were too slow.  When we moved to chacha20, which is
much faster, I didn't think to revisit get_random_int() and
get_random_long().

One somewhat undesirable aspect of the current algorithm is that we
never change random_int_secret.  So I've been toying with the
following, which is roughly 3.4 times faster than md5 in the
benchmark below.  (I haven't tried benchmarking against siphash yet.)

[    3.606139] random benchmark!!
[    3.606276] get_random_int # cycles: 326578
[    3.606317] get_random_int_new # cycles: 95438
[    3.607423] get_random_bytes # cycles: 2653388

     	       			  	  - Ted

P.S.  It's interesting to note that siphash24 and chacha20 are both
add-rotate-xor based algorithms.
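
The batching idea in the patch below can be tried out in userspace. The
following is a minimal sketch, not the kernel code: fake_extract() is a
hypothetical stand-in for extract_crng() and is not random at all, and
the "valid" flag and the refills counter are assumptions added so the
refill behaviour can be observed. Each request is served from the
current offset in the block, and a new block is generated only when the
old one cannot satisfy the request.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define BLOCK_SIZE 64	/* stands in for CHACHA20_BLOCK_SIZE */

struct random_buf {
	uint8_t buf[BLOCK_SIZE];
	size_t ptr;	/* offset of the next unused byte */
	int valid;	/* 0 until the first refill (assumption, not in the patch) */
};

static unsigned int refills;	/* counts calls to the stand-in generator */

/* Hypothetical stand-in for extract_crng(): NOT random, only distinct bytes. */
static void fake_extract(uint8_t out[BLOCK_SIZE])
{
	size_t i;

	for (i = 0; i < BLOCK_SIZE; i++)
		out[i] = (uint8_t)(refills * BLOCK_SIZE + i);
	refills++;
}

/* Hand out n bytes, refilling only when the current block is used up. */
static void get_batched(struct random_buf *p, void *out, size_t n)
{
	assert(n <= BLOCK_SIZE);
	if (!p->valid || p->ptr + n > BLOCK_SIZE) {
		fake_extract(p->buf);
		p->ptr = 0;
		p->valid = 1;
	}
	memcpy(out, p->buf + p->ptr, n);	/* copy from the current offset */
	p->ptr += n;
}

/* Draw `count` 32-bit values and report how many refills that took. */
static unsigned int refills_for(unsigned int count)
{
	struct random_buf rb = { .ptr = 0, .valid = 0 };
	uint32_t v;
	unsigned int i;

	refills = 0;
	for (i = 0; i < count; i++)
		get_batched(&rb, &v, sizeof(v));
	return refills;
}
```

With a 64-byte block, sixteen 4-byte requests share one refill, which is
where the saving over generating fresh output on every call comes from.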

diff --git a/drivers/char/random.c b/drivers/char/random.c
index d6876d506220..be172ea75799 100644
--- a/drivers/char/random.c
+++ b/drivers/char/random.c
@@ -1681,6 +1681,38 @@ static int rand_initialize(void)
 }
 early_initcall(rand_initialize);
 
+static unsigned int get_random_int_new(void);
+
+static int rand_benchmark(void)
+{
+	cycles_t start,finish;
+	int i, out;
+
+	pr_crit("random benchmark!!\n");
+	start = get_cycles();
+	for (i = 0; i < 1000; i++) {
+		get_random_int();
+	}
+	finish = get_cycles();
+	pr_err("get_random_int # cycles: %llu\n", finish - start);
+
+	start = get_cycles();
+	for (i = 0; i < 1000; i++) {
+		get_random_int_new();
+	}
+	finish = get_cycles();
+	pr_err("get_random_int_new # cycles: %llu\n", finish - start);
+
+	start = get_cycles();
+	for (i = 0; i < 1000; i++) {
+		get_random_bytes(&out, sizeof(out));
+	}
+	finish = get_cycles();
+	pr_err("get_random_bytes # cycles: %llu\n", finish - start);
+	return 0;
+}
+device_initcall(rand_benchmark);
+
 #ifdef CONFIG_BLOCK
 void rand_initialize_disk(struct gendisk *disk)
 {
@@ -2064,8 +2096,10 @@ unsigned int get_random_int(void)
 	__u32 *hash;
 	unsigned int ret;
 
+#if 0	// force slow path
 	if (arch_get_random_int(&ret))
 		return ret;
+#endif
 
 	hash = get_cpu_var(get_random_int_hash);
 
@@ -2100,6 +2134,38 @@ unsigned long get_random_long(void)
 }
 EXPORT_SYMBOL(get_random_long);
 
+struct random_buf {
+	__u8 buf[CHACHA20_BLOCK_SIZE];
+	int ptr;
+};
+
+static DEFINE_PER_CPU(struct random_buf, batched_entropy);
+
+static void get_batched_entropy(void *buf, int n)
+{
+	struct random_buf *p;
+
+	p = &get_cpu_var(batched_entropy);
+
+	if ((p->ptr == 0) ||
+	    (p->ptr + n >= CHACHA20_BLOCK_SIZE)) {
+		extract_crng(p->buf);
+		p->ptr = 0;
+	}
+	BUG_ON(n > CHACHA20_BLOCK_SIZE);
+	memcpy(buf, p->buf + p->ptr, n);
+	p->ptr += n;
+	put_cpu_var(batched_entropy);
+}
+
+static unsigned int get_random_int_new(void)
+{
+	int	ret;
+
+	get_batched_entropy(&ret, sizeof(ret));
+	return ret;
+}
+
 /**
  * randomize_page - Generate a random, page aligned address
  * @start:	The smallest acceptable address the caller will take.

* Re: Designing a safe RX-zero-copy Memory Model for Networking
From: John Fastabend @ 2016-12-14 16:32 UTC
  To: Jesper Dangaard Brouer
  Cc: David Miller, cl, rppt, netdev, linux-mm, willemdebruijn.kernel,
	bjorn.topel, magnus.karlsson, alexander.duyck, mgorman, tom,
	bblanco, tariqt, saeedm, jesse.brandeburg, METH, vyasevich
In-Reply-To: <20161214103914.3a9ebbbf@redhat.com>

On 16-12-14 01:39 AM, Jesper Dangaard Brouer wrote:
> On Tue, 13 Dec 2016 12:08:21 -0800
> John Fastabend <john.fastabend@gmail.com> wrote:
> 
>> On 16-12-13 11:53 AM, David Miller wrote:
>>> From: John Fastabend <john.fastabend@gmail.com>
>>> Date: Tue, 13 Dec 2016 09:43:59 -0800
>>>   
>>>> What does "zero-copy send packet-pages to the application/socket that
>>>> requested this" mean? At the moment on x86 page-flipping appears to be
>>>> more expensive than memcpy (I can post some data shortly) and shared
>>>> memory was proposed and rejected for security reasons when we were
>>>> working on bifurcated driver.  
>>>
>>> The whole idea is that we map all the active RX ring pages into
>>> userspace from the start.
>>>
>>> And just how Jesper's page pool work will avoid DMA map/unmap,
>>> it will also avoid changing the userspace mapping of the pages
>>> as well.
>>>
>>> Thus avoiding the TLB/VM overhead altogether.
>>>   
> 
> Exactly.  It is worth mentioning that pages entering the page pool need
> to be cleared (measured cost 143 cycles), in order to not leak any
> kernel info.  The primary focus of this design is to make sure not to
> leak kernel info to userspace, but with an "exclusive" mode also
> support isolation between applications.
> 
> 
>> I get this but it requires applications to be isolated. The pages from
>> a queue can not be shared between multiple applications in different
>> trust domains. And the application has to be cooperative meaning it
>> can't "look" at data that has not been marked by the stack as OK. In
>> these schemes we tend to end up with something like virtio/vhost or
>> af_packet.
> 
> I expect 3 modes when enabling RX-zero-copy on a page_pool. The first
> two would require CAP_NET_ADMIN privileges.  All modes have a trust
> domain id that needs to match, e.g. when a page reaches the socket.

Even mode 3 should require CAP_NET_ADMIN; we don't want userspace to
grab queues off the NIC without it, IMO.

> 
> Mode-1 "Shared": Application chooses the lowest isolation level, allowing
>  multiple applications to mmap the VMA area.

My only point here is that applications can read each other's data and
all applications need to cooperate. For example, one app could try to
write continuously to read-only pages, causing faults and whatnot. This
is all nonstandard and doesn't play well with cgroups and "normal"
applications. It requires a new orchestration model.

I'm a bit skeptical of the use case but I know of a handful of reasons
to use this model. Maybe take a look at the ivshmem implementation in
DPDK.

Also this still requires a hardware filter to push "application" traffic
onto reserved queues/pages as far as I can tell.

> 
> Mode-2 "Single-user": Application requests to be the only user
>  of the RX queue.  This blocks other applications from mmapping the VMA area.
> 

Assuming data is read-only, sharing with the stack is possibly OK :/. I
guess you would need two pools of memory, for data and skb, so you don't
leak skbs into user space.

The devil's in the details here. There are lots of hooks in the kernel
that can, for example, push the packet with a 'redirect' tc action.
And letting an app "read" data or impact performance of an
unrelated application is wrong IMO. Stacked devices also provide another
set of details that are a bit difficult to track down; see all the
hardware offload efforts.

I assume all these concerns are shared between mode-1 and mode-2.

> Mode-3 "Exclusive": Application requests to own the RX queue.  Packets are
>  no longer allowed for normal netstack delivery.
> 

I have patches for this mode already but haven't pushed them due to
an alternative solution using VFIO.

> Notice mode-2 still requires CAP_NET_ADMIN, because packets/pages are
> still allowed to travel netstack and thus can contain packet data from
> other normal applications.  This is part of the design, to share the
> NIC between netstack and an accelerated userspace application using RX
> zero-copy delivery.
> 

I don't think this is acceptable to be honest. Letting an application
potentially read/impact other arbitrary applications on the system
seems like a non-starter even with CAP_NET_ADMIN. At least this was
the conclusion from bifurcated driver work some time ago.

> 
>> Any ACLs/filtering/switching/headers need to be done in hardware or
>> the application trust boundaries are broken.
> 
> The software solution outlined allows the application to choose
> which trust boundary it wants.
> 
> The "exclusive" mode-3 makes most sense together with HW filters.
> Already today, we support creating a new RX queue based on an ethtool
> ntuple HW filter, and then you simply attach your application to that
> queue in mode-3 and have full isolation.
> 

I'm still pretty fuzzy on why mode-1 and mode-2 do not need HW filters.
Without hardware filters we have no way of knowing who/what data is
put in the page.

>  
>> If the above can not be met then a copy is needed. What I am trying
>> to tease out is the above comment along with other statements like
>> this "can be done with out HW filter features".
> 
> Does this address your concerns?
> 

I think we need to enforce strong isolation. An application should not
be able to read data or impact other applications. I gather this is
the case per comment about normal applications in mode-2. A slightly
weaker statement would be to say applications can only impact/read data
of other applications in their domain. This might be OK as well.

.John
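
To make the policy under discussion concrete, here is a minimal sketch
of the three modes as plain C. Every identifier in it is hypothetical;
none of this is taken from posted patches. It assumes the position
argued in this reply (all three modes require CAP_NET_ADMIN), the
trust-domain match described in the quoted proposal, and that a
mismatch falls back to copying.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* The three modes named in the thread; the identifiers are hypothetical. */
enum rx_zc_mode {
	RX_ZC_SHARED,		/* mode-1: several applications may mmap */
	RX_ZC_SINGLE_USER,	/* mode-2: one application, netstack still shares */
	RX_ZC_EXCLUSIVE,	/* mode-3: application owns the RX queue */
};

struct rx_zc_pool {
	enum rx_zc_mode mode;
	uint32_t trust_domain;	/* id stamped on pages from this pool */
	unsigned int mappers;	/* applications currently mapping the pool */
};

/* May one more application map this pool? Assumes every mode needs the capability. */
static bool rx_zc_may_mmap(const struct rx_zc_pool *pool, bool has_net_admin)
{
	if (!has_net_admin)
		return false;
	if (pool->mode == RX_ZC_SHARED)
		return true;
	return pool->mappers == 0;	/* single-user and exclusive: one mapper */
}

/* Zero-copy delivery only when page and socket trust domains match. */
static bool rx_zc_may_deliver(uint32_t page_domain, uint32_t sock_domain)
{
	return page_domain == sock_domain;
}

/* Does the normal network stack still see packets from this queue? */
static bool rx_zc_netstack_allowed(const struct rx_zc_pool *pool)
{
	return pool->mode != RX_ZC_EXCLUSIVE;
}
```

The sketch deliberately says nothing about how packets are steered to
the queue; that is the hardware-filter question raised above.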


* Re: [v1] net:ethernet:cavium:octeon:octeon_mgmt: Handle return NULL error from devm_ioremap
From: arvind Yadav @ 2016-12-14 16:30 UTC
  To: kbuild test robot; +Cc: kbuild-all, peter.chen, fw, netdev, linux-kernel
In-Reply-To: <201612142225.IaytQyYf%fengguang.wu@intel.com>

Sorry for the build failure. I have sent a new version that does not
have this failure.

Thanks
-Arvind

On Wednesday 14 December 2016 08:10 PM, kbuild test robot wrote:
> Hi Arvind,
>
> [auto build test ERROR on net-next/master]
> [also build test ERROR on v4.9 next-20161214]
> [if your patch is applied to the wrong git tree, please drop us a note to help improve the system]
>
> url:    https://github.com/0day-ci/linux/commits/Arvind-Yadav/net-ethernet-cavium-octeon-octeon_mgmt-Handle-return-NULL-error-from-devm_ioremap/20161213-224624
> config: mips-cavium_octeon_defconfig (attached as .config)
> compiler: mips64-linux-gnuabi64-gcc (Debian 6.1.1-9) 6.1.1 20160705
> reproduce:
>          wget https://git.kernel.org/cgit/linux/kernel/git/wfg/lkp-tests.git/plain/sbin/make.cross -O ~/bin/make.cross
>          chmod +x ~/bin/make.cross
>          # save the attached .config to linux build tree
>          make.cross ARCH=mips
>
> All errors (new ones prefixed by >>):
>
>     drivers/net/ethernet/cavium/octeon/octeon_mgmt.c: In function 'octeon_mgmt_probe':
>>> drivers/net/ethernet/cavium/octeon/octeon_mgmt.c:1473:11: error: 'dev' undeclared (first use in this function)
>        dev_err(dev, "failed to map I/O memory\n");
>                ^~~
>     drivers/net/ethernet/cavium/octeon/octeon_mgmt.c:1473:11: note: each undeclared identifier is reported only once for each function it appears in
>
> vim +/dev +1473 drivers/net/ethernet/cavium/octeon/octeon_mgmt.c
>
>    1467	
>    1468		p->mix = (u64)devm_ioremap(&pdev->dev, p->mix_phys, p->mix_size);
>    1469		p->agl = (u64)devm_ioremap(&pdev->dev, p->agl_phys, p->agl_size);
>    1470		p->agl_prt_ctl = (u64)devm_ioremap(&pdev->dev, p->agl_prt_ctl_phys,
>    1471						   p->agl_prt_ctl_size);
>    1472		if (!p->mix || !p->agl || !p->agl_prt_ctl) {
>> 1473			dev_err(dev, "failed to map I/O memory\n");
>    1474			result = -ENOMEM;
>    1475			goto err;
>    1476		}
>
> ---
> 0-DAY kernel test infrastructure                Open Source Technology Center
> https://lists.01.org/pipermail/kbuild-all                   Intel Corporation

* [v2] net:ethernet:cavium:octeon:octeon_mgmt: Handle return NULL error from devm_ioremap
From: Arvind Yadav @ 2016-12-14 16:25 UTC
  To: peter.chen, fw, david.daney; +Cc: netdev, linux-kernel

If devm_ioremap() fails, it returns NULL, and the kernel can then
run into a NULL pointer dereference.
Add an error check to avoid the NULL pointer dereference.

Signed-off-by: Arvind Yadav <arvind.yadav.cs@gmail.com>
---
 drivers/net/ethernet/cavium/octeon/octeon_mgmt.c | 6 ++++++
 1 file changed, 6 insertions(+)

diff --git a/drivers/net/ethernet/cavium/octeon/octeon_mgmt.c b/drivers/net/ethernet/cavium/octeon/octeon_mgmt.c
index 4ab404f..33c2fec 100644
--- a/drivers/net/ethernet/cavium/octeon/octeon_mgmt.c
+++ b/drivers/net/ethernet/cavium/octeon/octeon_mgmt.c
@@ -1479,6 +1479,12 @@ static int octeon_mgmt_probe(struct platform_device *pdev)
 	p->agl = (u64)devm_ioremap(&pdev->dev, p->agl_phys, p->agl_size);
 	p->agl_prt_ctl = (u64)devm_ioremap(&pdev->dev, p->agl_prt_ctl_phys,
 					   p->agl_prt_ctl_size);
+	if (!p->mix || !p->agl || !p->agl_prt_ctl) {
+		dev_err(&pdev->dev, "failed to map I/O memory\n");
+		result = -ENOMEM;
+		goto err;
+	}
+
 	spin_lock_init(&p->lock);
 
 	skb_queue_head_init(&p->tx_list);
-- 
2.7.4
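
The check added by this patch follows a common probe pattern: perform
all mappings, test them together, and bail out with -ENOMEM. The
following userspace sketch models that pattern only. mock_ioremap(),
fail_on_call and struct mgmt are hypothetical stand-ins (the real
driver uses devm_ioremap() and its own private struct).

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

struct mgmt {
	uint64_t mix, agl, agl_prt_ctl;	/* mapped addresses stored as u64 */
};

static int fail_on_call;	/* which mock_ioremap() call fails (0 = none) */
static int calls;
static char fake_regs[3][16];	/* pretend register windows */

/* Hypothetical stand-in for devm_ioremap(): returns NULL on failure. */
static void *mock_ioremap(void)
{
	calls++;
	if (calls == fail_on_call)
		return NULL;
	return fake_regs[(calls - 1) % 3];
}

/* Map three regions, then check all of them before touching any. */
static int mock_probe(struct mgmt *p, int fail_on)
{
	int result;

	calls = 0;
	fail_on_call = fail_on;

	p->mix = (uint64_t)(uintptr_t)mock_ioremap();
	p->agl = (uint64_t)(uintptr_t)mock_ioremap();
	p->agl_prt_ctl = (uint64_t)(uintptr_t)mock_ioremap();
	if (!p->mix || !p->agl || !p->agl_prt_ctl) {
		result = -ENOMEM;
		goto err;
	}

	/* normal setup would continue here */
	return 0;
err:
	return result;
}
```

Note that the error message in v2 uses &pdev->dev; the v1 build failure
quoted in the previous message came from referring to an undeclared
"dev" at that spot.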

* ravb/sh_eth/b44: BUG: sleeping function called from invalid context
From: Geert Uytterhoeven @ 2016-12-14 16:12 UTC
  To: Sergei Shtylyov, Michael Chan; +Cc: Linux-Renesas, netdev@vger.kernel.org

Hi,

When CONFIG_DEBUG_ATOMIC_SLEEP=y, running "ethtool -s eth0 speed 100"
on Salvator-X gives:

BUG: sleeping function called from invalid context at kernel/locking/mutex.c:97
in_atomic(): 1, irqs_disabled(): 128, pid: 1683, name: ethtool
CPU: 0 PID: 1683 Comm: ethtool Tainted: G        W
4.9.0-salvator-x-00426-g326519df42c65007-dirty #976
Hardware name: Renesas Salvator-X board based on r8a7796 (DT)
Call trace:
[<ffffff8008089400>] dump_backtrace+0x0/0x208
[<ffffff800808961c>] show_stack+0x14/0x1c
[<ffffff8008233424>] dump_stack+0x94/0xb4
[<ffffff80080c377c>] ___might_sleep+0x108/0x11c
[<ffffff80080c3814>] __might_sleep+0x84/0x94
[<ffffff800855db0c>] mutex_lock+0x24/0x40
[<ffffff800837610c>] phy_start_aneg+0x20/0x130
[<ffffff80083763b8>] phy_ethtool_ksettings_set+0xd0/0xe8
[<ffffff8008386724>] ravb_set_link_ksettings+0x4c/0xa4
[<ffffff80084a7b94>] ethtool_set_settings+0xec/0xfc
[<ffffff80084aa918>] dev_ethtool+0x188/0x17c4
[<ffffff80084bce3c>] dev_ioctl+0x53c/0x6b8
[<ffffff8008488acc>] sock_do_ioctl.constprop.45+0x3c/0x4c
[<ffffff80084897b4>] sock_ioctl+0x33c/0x370
[<ffffff8008171878>] vfs_ioctl+0x20/0x38
[<ffffff8008172198>] do_vfs_ioctl+0x844/0x954
[<ffffff80081722ec>] SyS_ioctl+0x44/0x68
[<ffffff80080830f0>] el0_svc_naked+0x24/0x28

ravb_set_link_ksettings() calls phy_ethtool_ksettings_set() with a spinlock
held and interrupts disabled, while phy_start_aneg() tries to obtain a mutex.

The same issue is present in sh_eth_set_link_ksettings() (verified) and
b44_set_link_ksettings() (code inspection).

Gr{oetje,eeting}s,

                        Geert
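
The locking problem described above can be modelled in a few lines of
userspace C. This is a sketch under stated assumptions: all model_*
functions are hypothetical stand-ins, the counter only imitates what
CONFIG_DEBUG_ATOMIC_SLEEP detects, and set_link_reordered() shows one
possible shape of a fix, not a change proposed in this report. Whether
the PHY call can really be moved outside the lock depends on what the
driver's spinlock protects.

```c
#include <assert.h>

static int atomic_depth;	/* > 0 while a modelled spinlock is held */
static int sleep_bugs;		/* sleeping calls made in atomic context */

static void model_spin_lock(void)
{
	atomic_depth++;
}

static void model_spin_unlock(void)
{
	atomic_depth--;
}

/* Models mutex_lock(): it may sleep, so atomic context is a bug. */
static void model_mutex_lock(void)
{
	if (atomic_depth > 0)
		sleep_bugs++;	/* the kernel would print the BUG splat here */
}

static void model_mutex_unlock(void)
{
}

/* Stand-in for phy_ethtool_ksettings_set() -> phy_start_aneg(). */
static void model_phy_set(void)
{
	model_mutex_lock();
	model_mutex_unlock();
}

/* Pattern from the report: sleeping call made under the spinlock. */
static int set_link_buggy(void)
{
	sleep_bugs = 0;
	model_spin_lock();
	model_phy_set();
	model_spin_unlock();
	return sleep_bugs;
}

/* One possible reordering: call the PHY code with no spinlock held. */
static int set_link_reordered(void)
{
	sleep_bugs = 0;
	model_phy_set();
	model_spin_lock();
	/* register updates that really need the lock would go here */
	model_spin_unlock();
	return sleep_bugs;
}
```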

--
Geert Uytterhoeven -- There's lots of Linux beyond ia32 -- geert@linux-m68k.org

In personal conversations with technical people, I call myself a hacker. But
when I'm talking to journalists I just say "programmer" or something like that.
                                -- Linus Torvalds

* ipv6: handle -EFAULT from skb_copy_bits
From: Dave Jones @ 2016-12-14 15:47 UTC
  To: netdev

It seems to be possible to craft a packet for sendmsg that triggers
the -EFAULT path in skb_copy_bits resulting in a BUG_ON that looks like:

RIP: 0010:[<ffffffff817c6390>] [<ffffffff817c6390>] rawv6_sendmsg+0xc30/0xc40
RSP: 0018:ffff881f6c4a7c18  EFLAGS: 00010282
RAX: 00000000fffffff2 RBX: ffff881f6c681680 RCX: 0000000000000002
RDX: ffff881f6c4a7cf8 RSI: 0000000000000030 RDI: ffff881fed0f6a00
RBP: ffff881f6c4a7da8 R08: 0000000000000000 R09: 0000000000000009
R10: ffff881fed0f6a00 R11: 0000000000000009 R12: 0000000000000030
R13: ffff881fed0f6a00 R14: ffff881fee39ba00 R15: ffff881fefa93a80

Call Trace:
 [<ffffffff8118ba23>] ? unmap_page_range+0x693/0x830
 [<ffffffff81772697>] inet_sendmsg+0x67/0xa0
 [<ffffffff816d93f8>] sock_sendmsg+0x38/0x50
 [<ffffffff816d982f>] SYSC_sendto+0xef/0x170
 [<ffffffff816da27e>] SyS_sendto+0xe/0x10
 [<ffffffff81002910>] do_syscall_64+0x50/0xa0
 [<ffffffff817f7cbc>] entry_SYSCALL64_slow_path+0x25/0x25

Handle this in rawv6_push_pending_frames and jump to the failure path.

Signed-off-by: Dave Jones <davej@codemonkey.org.uk>

diff --git a/net/ipv6/raw.c b/net/ipv6/raw.c
index 291ebc260e70..35aa82faa052 100644
--- a/net/ipv6/raw.c
+++ b/net/ipv6/raw.c
@@ -591,7 +591,9 @@ static int rawv6_push_pending_frames(struct sock *sk, struct flowi6 *fl6,
 	}
 
 	offset += skb_transport_offset(skb);
-	BUG_ON(skb_copy_bits(skb, offset, &csum, 2));
+	err = skb_copy_bits(skb, offset, &csum, 2);
+	if (err < 0)
+		goto out;
 
 	/* in case cksum was not initialized */
 	if (unlikely(csum))
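
The change above replaces an assertion with ordinary error propagation.
A minimal userspace sketch of the same pattern follows. mock_copy_bits()
is a hypothetical stand-in for skb_copy_bits() that works on a flat
buffer, and read_csum() is an illustrative name, not a kernel function.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-in for skb_copy_bits(): fails if the range is out of bounds. */
static int mock_copy_bits(const unsigned char *data, size_t data_len,
			  size_t offset, void *to, size_t len)
{
	if (offset > data_len || len > data_len - offset)
		return -EFAULT;
	memcpy(to, data + offset, len);
	return 0;
}

/* Read the 2-byte checksum field; return 0 or a negative errno. */
static int read_csum(const unsigned char *pkt, size_t pkt_len,
		     size_t offset, unsigned short *csum)
{
	int err;

	err = mock_copy_bits(pkt, pkt_len, offset, csum, 2);
	if (err < 0)
		goto out;	/* input can be crafted: fail the call, do not crash */

	/* checksum processing would continue here */
out:
	return err;
}
```

The point of the pattern is that a condition reachable from user input
becomes an error return to the caller instead of a BUG_ON().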

* Re: dsa: handling more than 1 cpu port
From: John Crispin @ 2016-12-14 15:45 UTC
  To: Andrew Lunn; +Cc: Florian Fainelli, netdev@vger.kernel.org
In-Reply-To: <20161214110058.GE27370@lunn.ch>



On 14/12/2016 12:00, Andrew Lunn wrote:
> On Wed, Dec 14, 2016 at 11:35:30AM +0100, John Crispin wrote:
>>
>>
>> On 14/12/2016 11:31, Andrew Lunn wrote:
>>> On Wed, Dec 14, 2016 at 11:01:54AM +0100, John Crispin wrote:
>>>> Hi Andrew,
>>>>
>>>> switches supported by qca8k have 2 gmacs that we can wire an external
>>>> mii interface to. Usually this is used to wire 2 of the SoCs MACs to the
>>>> switch. The switch itself is aware of one of the MACs being the cpu port
>>>> and expects this to be port/mac0. Using the other will break the
>>>> hardware offloading features.
>>>
>>> Just to be sure here. There is no way to use the second port connected
>>> to the CPU as a CPU port?
>>
>> both macs are considered cpu ports and both allow for the tag to be
>> injected. for HW NAT/routing/... offloading to work, the lan ports need
>> to trunk via port0 and not port6 however.
> 
> Maybe you can do a restricted version of the generic solution. LAN
> ports are mapped to cpu port0. WAN port to cpu port 6?
> 
>>> The Marvell chips do allow this. So i developed a proof of concept
>>> which had a mapping between cpu ports and slave ports. slave port X
>>> should use cpu port y for its traffic. This never got past proof of
>>> concept. 
>>>
>>> If this can be made to work for qca8k, i would prefer having this
>>> general concept, than specific hacks for pass through.
>>
>> oh cool, can you send those patches my way please ? how do you configure
>> this from userland ? does the cpu port get its own swdev which i just add
>> to my lan bridge and then add the 2nd cpu port to the wan bridge ?
> 
> https://github.com/lunn/linux/tree/v4.1-rc4-net-next-multiple-cpu
> 
> You don't configure anything from userland. Which was actually a
> criticism. It is in device tree. But my solution is generic. Having
> one WAN port and four bridged LAN ports is a pure marketing
> requirement. The hardware will happily do two WAN ports and 3 LAN
> ports, for example. And the switch is happy to take traffic for the
> WAN port and a LAN port over the same CPU port, and keep the traffic
> separate. So we can have some form of load balancing. We are not
> limited to 1->1, 1->4, we can do 1->2, 1->3 to increase the overall
> performance. And to the user it is all transparent.
> 
> This PoC is for the old DSA binding. The new binding makes it easier
> to express this. Which is one of the reasons for the new binding.
> 
> 	Andrew
> 

Hi Andrew,

spent some time looking at this and thinking about possible solutions.
my initial idea was to detect which cpu port to use based on the cpu port
being included inside the bridge. however that won't allow us to control
ports using the tag outside of a bridge. i think that your approach is
the only sane one. we could add a sysfs interface later, allowing us to
change the default cpu port <-> slave port mappings, but the device tree
needs to include some sane defaults. i'll use your patches as a base for
a series.

	John
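
The "sane defaults plus later override" idea can be sketched as a small
lookup table. This is an illustration under assumptions, not the DSA
code: struct sw_map and its helpers are hypothetical, and the port
layout (ports 0 and 6 facing the SoC, 1-4 as LAN, 5 as WAN) is assumed
for the example. The default split follows the suggestion quoted above,
with LAN ports on cpu port 0 and the WAN port on cpu port 6.

```c
#include <assert.h>

#define MAX_PORTS	7
#define NO_CPU_PORT	(-1)

/* Hypothetical per-switch table: which cpu port carries each user port. */
struct sw_map {
	int cpu_port_of[MAX_PORTS];
};

/* Defaults that a device tree description might provide. */
static void sw_map_defaults(struct sw_map *m)
{
	int port;

	for (port = 0; port < MAX_PORTS; port++)
		m->cpu_port_of[port] = NO_CPU_PORT;
	for (port = 1; port <= 4; port++)
		m->cpu_port_of[port] = 0;	/* LAN via cpu port 0 */
	m->cpu_port_of[5] = 6;			/* WAN via cpu port 6 */
}

/* Later override, e.g. from a sysfs interface; returns 0 or -1. */
static int sw_map_set(struct sw_map *m, int port, int cpu_port)
{
	if (port < 1 || port > 5)
		return -1;	/* only user ports can be remapped */
	if (cpu_port != 0 && cpu_port != 6)
		return -1;	/* only ports 0 and 6 face the SoC */
	m->cpu_port_of[port] = cpu_port;
	return 0;
}
```

A table like this also covers the load-balancing cases mentioned in the
quoted mail, since any user port can be pointed at either cpu port.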

* [PATCH RFC 2/2] bpf: Add tests for the lpm trie map
From: Daniel Mack @ 2016-12-14 15:43 UTC
  To: ast; +Cc: dh.herrmann, daniel, netdev, davem, Daniel Mack
In-Reply-To: <20161214154336.17639-1-daniel@zonque.org>

From: David Herrmann <dh.herrmann@gmail.com>

The first part of this program runs randomized tests against the
lpm-bpf-map. It implements a "Trivial Longest Prefix Match" (tlpm)
based on simple, linear, singly linked lists. The implementation
should be pretty straightforward.

Based on tlpm, this inserts randomized data into bpf-lpm-maps and
verifies the trie-based bpf-map implementation behaves the same way
as tlpm.

The second part uses 'real world' IPv4 and IPv6 addresses and tests
the trie with those.

Signed-off-by: David Herrmann <dh.herrmann@gmail.com>
Signed-off-by: Daniel Mack <daniel@zonque.org>
---
 tools/testing/selftests/bpf/.gitignore     |   1 +
 tools/testing/selftests/bpf/Makefile       |   4 +-
 tools/testing/selftests/bpf/test_lpm_map.c | 348 +++++++++++++++++++++++++++++
 3 files changed, 351 insertions(+), 2 deletions(-)
 create mode 100644 tools/testing/selftests/bpf/test_lpm_map.c

diff --git a/tools/testing/selftests/bpf/.gitignore b/tools/testing/selftests/bpf/.gitignore
index 071431b..d3b1c9b 100644
--- a/tools/testing/selftests/bpf/.gitignore
+++ b/tools/testing/selftests/bpf/.gitignore
@@ -1,3 +1,4 @@
 test_verifier
 test_maps
 test_lru_map
+test_lpm_map
diff --git a/tools/testing/selftests/bpf/Makefile b/tools/testing/selftests/bpf/Makefile
index 7a5f245..064a3e5 100644
--- a/tools/testing/selftests/bpf/Makefile
+++ b/tools/testing/selftests/bpf/Makefile
@@ -1,8 +1,8 @@
 CFLAGS += -Wall -O2 -I../../../../usr/include
 
-test_objs = test_verifier test_maps test_lru_map
+test_objs = test_verifier test_maps test_lru_map test_lpm_map
 
-TEST_PROGS := test_verifier test_maps test_lru_map test_kmod.sh
+TEST_PROGS := test_verifier test_maps test_lru_map test_lpm_map test_kmod.sh
 TEST_FILES := $(test_objs)
 
 all: $(test_objs)
diff --git a/tools/testing/selftests/bpf/test_lpm_map.c b/tools/testing/selftests/bpf/test_lpm_map.c
new file mode 100644
index 0000000..08db750
--- /dev/null
+++ b/tools/testing/selftests/bpf/test_lpm_map.c
@@ -0,0 +1,348 @@
+/*
+ * Randomized tests for eBPF longest-prefix-match maps
+ *
+ * This program runs randomized tests against the lpm-bpf-map. It implements a
+ * "Trivial Longest Prefix Match" (tlpm) based on simple, linear, singly linked
+ * lists. The implementation should be pretty straightforward.
+ *
+ * Based on tlpm, this inserts randomized data into bpf-lpm-maps and verifies
+ * the trie-based bpf-map implementation behaves the same way as tlpm.
+ */
+
+#include <assert.h>
+#include <errno.h>
+#include <inttypes.h>
+#include <linux/bpf.h>
+#include <stdio.h>
+#include <stdlib.h>
+#include <string.h>
+#include <time.h>
+#include <unistd.h>
+#include <arpa/inet.h>
+
+#include "bpf_sys.h"
+#include "bpf_util.h"
+
+struct tlpm_node {
+	struct tlpm_node *next;
+	size_t n_bits;
+	uint8_t key[];
+};
+
+static struct tlpm_node *tlpm_add(struct tlpm_node *list,
+				  const uint8_t *key,
+				  size_t n_bits)
+{
+	struct tlpm_node *node;
+	size_t n;
+
+	/* add new entry with @key/@n_bits to @list and return new head */
+
+	n = (n_bits + 7) / 8;
+	node = malloc(sizeof(*node) + n);
+	assert(node);
+
+	node->next = list;
+	node->n_bits = n_bits;
+	memcpy(node->key, key, n);
+
+	return node;
+}
+
+static void tlpm_clear(struct tlpm_node *list)
+{
+	struct tlpm_node *node;
+
+	/* free all entries in @list */
+
+	while ((node = list)) {
+		list = list->next;
+		free(node);
+	}
+}
+
+static struct tlpm_node *tlpm_match(struct tlpm_node *list,
+				    const uint8_t *key,
+				    size_t n_bits)
+{
+	struct tlpm_node *best = NULL;
+	size_t i;
+
+	/*
+	 * Perform longest prefix-match on @key/@n_bits. That is, iterate all
+	 * entries and match each prefix against @key. Remember the "best"
+	 * entry we find (i.e., the longest prefix that matches) and return it
+	 * to the caller when done.
+	 */
+
+	for ( ; list; list = list->next) {
+		for (i = 0; i < n_bits && i < list->n_bits; ++i) {
+			if ((key[i / 8] & (1 << (7 - i % 8))) !=
+			    (list->key[i / 8] & (1 << (7 - i % 8))))
+				break;
+		}
+
+		if (i >= list->n_bits) {
+			if (!best || i > best->n_bits)
+				best = list;
+		}
+	}
+
+	return best;
+}
+
+static void test_lpm_basic(void)
+{
+	struct tlpm_node *list = NULL, *t1, *t2;
+
+	/* very basic, static tests to verify tlpm works as expected */
+
+	assert(!tlpm_match(list, (uint8_t[]){ 0xff }, 8));
+
+	t1 = list = tlpm_add(list, (uint8_t[]){ 0xff }, 8);
+	assert(t1 == tlpm_match(list, (uint8_t[]){ 0xff }, 8));
+	assert(t1 == tlpm_match(list, (uint8_t[]){ 0xff, 0xff }, 16));
+	assert(t1 == tlpm_match(list, (uint8_t[]){ 0xff, 0x00 }, 16));
+	assert(!tlpm_match(list, (uint8_t[]){ 0x7f }, 8));
+	assert(!tlpm_match(list, (uint8_t[]){ 0xfe }, 8));
+	assert(!tlpm_match(list, (uint8_t[]){ 0xff }, 7));
+
+	t2 = list = tlpm_add(list, (uint8_t[]){ 0xff, 0xff }, 16);
+	assert(t1 == tlpm_match(list, (uint8_t[]){ 0xff }, 8));
+	assert(t2 == tlpm_match(list, (uint8_t[]){ 0xff, 0xff }, 16));
+	assert(t1 == tlpm_match(list, (uint8_t[]){ 0xff, 0xff }, 15));
+	assert(!tlpm_match(list, (uint8_t[]){ 0x7f, 0xff }, 16));
+
+	tlpm_clear(list);
+}
+
+static void test_lpm_order(void)
+{
+	struct tlpm_node *t1, *t2, *l1 = NULL, *l2 = NULL;
+	size_t i, j;
+
+	/*
+	 * Verify the tlpm implementation works correctly regardless of the
+	 * order of entries. Insert a random set of entries into @l1, and copy
+	 * the same data in reverse order into @l2. Then verify a lookup of
+	 * random keys will yield the same result in both sets.
+	 */
+
+	for (i = 0; i < (1 << 12); ++i)
+		l1 = tlpm_add(l1, (uint8_t[]){
+					rand() % 0xff,
+					rand() % 0xff,
+				}, rand() % 16 + 1);
+
+	for (t1 = l1; t1; t1 = t1->next)
+		l2 = tlpm_add(l2, t1->key, t1->n_bits);
+
+	for (i = 0; i < (1 << 8); ++i) {
+		uint8_t key[] = { rand() % 0xff, rand() % 0xff };
+
+		t1 = tlpm_match(l1, key, 16);
+		t2 = tlpm_match(l2, key, 16);
+
+		assert(!t1 == !t2);
+		if (t1) {
+			assert(t1->n_bits == t2->n_bits);
+			for (j = 0; j < t1->n_bits; ++j)
+				assert((t1->key[j / 8] & (1 << (7 - j % 8))) ==
+				       (t2->key[j / 8] & (1 << (7 - j % 8))));
+		}
+	}
+
+	tlpm_clear(l1);
+	tlpm_clear(l2);
+}
+
+static void test_lpm_map(void)
+{
+	size_t i, j, n_matches, n_nodes, n_lookups;
+	struct tlpm_node *t, *list = NULL;
+	struct bpf_lpm_trie_key *key;
+	uint8_t value[8] = {};
+	int r, map;
+
+	/*
+	 * Compare behavior of tlpm vs. bpf-lpm. Create a randomized set of
+	 * prefixes and insert it into both tlpm and bpf-lpm. Then run some
+	 * randomized lookups and verify both maps return the same result.
+	 */
+
+	n_matches = 0;
+	n_nodes = 1 << 8;
+	n_lookups = 1 << 16;
+
+	key = alloca(sizeof(*key) + 4);
+	memset(key, 0, sizeof(*key) + 4);
+
+	map = bpf_map_create(BPF_MAP_TYPE_LPM_TRIE,
+			     sizeof(*key) + 4,
+			     sizeof(value),
+			     4096,
+			     0);
+	assert(map >= 0);
+
+	for (i = 0; i < n_nodes; ++i) {
+		value[0] = rand() & 0xff;
+		value[1] = rand() & 0xff;
+		value[2] = rand() & 0xff;
+		value[3] = rand() & 0xff;
+		value[4] = rand() % 33;
+
+		list = tlpm_add(list, value, value[4]);
+
+		key->prefixlen = value[4];
+		memcpy(key->data, value, 4);
+		r = bpf_map_update(map, key, value, 0);
+		assert(!r);
+	}
+
+	for (i = 0; i < n_lookups; ++i) {
+		uint8_t data[] = {
+			rand() % 0xff,
+			rand() % 0xff,
+			rand() % 0xff,
+			rand() % 0xff
+		};
+
+		t = tlpm_match(list, data, 32);
+
+		key->prefixlen = 32;
+		memcpy(key->data, data, 4);
+		r = bpf_map_lookup(map, key, value);
+		assert(!r || errno == ENOENT);
+		assert(!t == !!r);
+
+		if (t) {
+			++n_matches;
+			assert(t->n_bits == value[4]);
+			for (j = 0; j < t->n_bits; ++j)
+				assert((t->key[j / 8] & (1 << (7 - j % 8))) ==
+				       (value[j / 8] & (1 << (7 - j % 8))));
+		}
+	}
+
+	close(map);
+	tlpm_clear(list);
+
+	/*
+	 * With 256 random nodes in the map, we are pretty likely to match
+	 * something on every lookup. For statistics, use this:
+	 *
+	 *     printf("  nodes: %zu\n"
+	 *            "lookups: %zu\n"
+	 *            "matches: %zu\n", n_nodes, n_lookups, n_matches);
+	 */
+}
+
+/* Test the implementation with some 'real world' examples */
+
+static void test_lpm_ipaddr(void)
+{
+	struct bpf_lpm_trie_key *key_ipv4;
+	struct bpf_lpm_trie_key *key_ipv6;
+	size_t key_size_ipv4;
+	size_t key_size_ipv6;
+	int map_fd_ipv4;
+	int map_fd_ipv6;
+	__u64 value;
+
+	key_size_ipv4 = sizeof(*key_ipv4) + sizeof(__u32);
+	key_size_ipv6 = sizeof(*key_ipv6) + sizeof(__u32) * 4;
+	key_ipv4 = alloca(key_size_ipv4);
+	key_ipv6 = alloca(key_size_ipv6);
+
+	map_fd_ipv4 = bpf_map_create(BPF_MAP_TYPE_LPM_TRIE,
+				     key_size_ipv4, sizeof(value),
+				     100, 0);
+	assert(map_fd_ipv4 >= 0);
+
+	map_fd_ipv6 = bpf_map_create(BPF_MAP_TYPE_LPM_TRIE,
+				     key_size_ipv6, sizeof(value),
+				     100, 0);
+	assert(map_fd_ipv6 >= 0);
+
+	/* Fill in some IPv4 and IPv6 address ranges */
+	value = 1;
+	key_ipv4->prefixlen = 16;
+	inet_pton(AF_INET, "192.168.0.0", key_ipv4->data);
+	assert(bpf_map_update(map_fd_ipv4, key_ipv4, &value, 0) == 0);
+
+	value = 2;
+	key_ipv4->prefixlen = 24;
+	inet_pton(AF_INET, "192.168.0.0", key_ipv4->data);
+	assert(bpf_map_update(map_fd_ipv4, key_ipv4, &value, 0) == 0);
+
+	value = 3;
+	key_ipv4->prefixlen = 24;
+	inet_pton(AF_INET, "192.168.128.0", key_ipv4->data);
+	assert(bpf_map_update(map_fd_ipv4, key_ipv4, &value, 0) == 0);
+
+	value = 5;
+	key_ipv4->prefixlen = 24;
+	inet_pton(AF_INET, "192.168.1.0", key_ipv4->data);
+	assert(bpf_map_update(map_fd_ipv4, key_ipv4, &value, 0) == 0);
+
+	value = 4;
+	key_ipv4->prefixlen = 23;
+	inet_pton(AF_INET, "192.168.0.0", key_ipv4->data);
+	assert(bpf_map_update(map_fd_ipv4, key_ipv4, &value, 0) == 0);
+
+	value = 0xdeadbeef;
+	key_ipv6->prefixlen = 64;
+	inet_pton(AF_INET6, "2a00:1450:4001:814::200e", key_ipv6->data);
+	assert(bpf_map_update(map_fd_ipv6, key_ipv6, &value, 0) == 0);
+
+	/* Set prefixlen to maximum for lookups */
+	key_ipv4->prefixlen = 32;
+	key_ipv6->prefixlen = 128;
+
+	/* Test some lookups that should come back with a value */
+	inet_pton(AF_INET, "192.168.128.23", key_ipv4->data);
+	assert(bpf_map_lookup(map_fd_ipv4, key_ipv4, &value) == 0);
+	assert(value == 3);
+
+	inet_pton(AF_INET, "192.168.0.1", key_ipv4->data);
+	assert(bpf_map_lookup(map_fd_ipv4, key_ipv4, &value) == 0);
+	assert(value == 2);
+
+	inet_pton(AF_INET6, "2a00:1450:4001:814::", key_ipv6->data);
+	assert(bpf_map_lookup(map_fd_ipv6, key_ipv6, &value) == 0);
+	assert(value == 0xdeadbeef);
+
+	inet_pton(AF_INET6, "2a00:1450:4001:814::1", key_ipv6->data);
+	assert(bpf_map_lookup(map_fd_ipv6, key_ipv6, &value) == 0);
+	assert(value == 0xdeadbeef);
+
+	/* Test some lookups that should not match any entry */
+	inet_pton(AF_INET, "10.0.0.1", key_ipv4->data);
+	assert(bpf_map_lookup(map_fd_ipv4, key_ipv4, &value) == -1 &&
+	       errno == ENOENT);
+
+	inet_pton(AF_INET, "11.11.11.11", key_ipv4->data);
+	assert(bpf_map_lookup(map_fd_ipv4, key_ipv4, &value) == -1 &&
+	       errno == ENOENT);
+
+	inet_pton(AF_INET6, "2a00:ffff::", key_ipv6->data);
+	assert(bpf_map_lookup(map_fd_ipv6, key_ipv6, &value) == -1 &&
+	       errno == ENOENT);
+
+	close(map_fd_ipv4);
+	close(map_fd_ipv6);
+}
+
+int main(void)
+{
+	/* we want predictable, pseudo random tests */
+	srand(0xf00ba1);
+
+	test_lpm_basic();
+	test_lpm_order();
+	test_lpm_map();
+	test_lpm_ipaddr();
+
+	printf("test_lpm: OK\n");
+	return 0;
+}
-- 
2.9.3
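
The lookups in the test above depend on how the key is laid out in memory: a prefix length followed by the address bytes in network order. Below is a minimal userspace sketch of that layout. It assumes a local mirror of struct bpf_lpm_trie_key from patch 1/2; the names lpm_key, lpm_key_size, lpm_key_new and lpm_key_set_ipv4 are hypothetical and not part of the series. It only fills the key and does not talk to the kernel.

```c
#include <arpa/inet.h>
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>
#include <sys/socket.h>

/* Hypothetical userspace mirror of struct bpf_lpm_trie_key (patch 1/2). */
struct lpm_key {
	uint32_t prefixlen;	/* number of significant leading bits */
	uint8_t  data[];	/* address bytes, most significant byte first */
};

/* Total key size for a given number of address bytes. */
static size_t lpm_key_size(size_t data_size)
{
	return sizeof(struct lpm_key) + data_size;
}

/* Allocate a zeroed key with room for data_size address bytes. */
static struct lpm_key *lpm_key_new(size_t data_size)
{
	return calloc(1, lpm_key_size(data_size));
}

/*
 * Fill an IPv4 key (the key must have room for 4 data bytes).
 * Returns 0 on success, -1 on a bad prefix length or address string.
 */
static int lpm_key_set_ipv4(struct lpm_key *key, const char *addr,
			    uint32_t prefixlen)
{
	assert(key != NULL);

	if (prefixlen > 32 || inet_pton(AF_INET, addr, key->data) != 1)
		return -1;

	key->prefixlen = prefixlen;
	return 0;
}
```

With this layout, lpm_key_size(4) gives 8 bytes for IPv4 and lpm_key_size(16) gives 20 bytes for IPv6, which is presumably what key_size_ipv4 and key_size_ipv6 hold in the test. After lpm_key_set_ipv4(key, "192.168.0.0", 16), key->data contains c0 a8 00 00.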

* [PATCH RFC 0/2] bpf: add longest prefix match map
From: Daniel Mack @ 2016-12-14 15:43 UTC (permalink / raw)
  To: ast; +Cc: dh.herrmann, daniel, netdev, davem, Daniel Mack

This patch set adds a longest prefix match algorithm that can be used to
match IP addresses to a stored set of ranges. It is exposed as a bpf
map type.

Internally, data is stored in an unbalanced tree of nodes that has a
maximum height of n, where n is the prefixlen the trie was created
with.

Note that this has nothing to do with fib or fib6 and is in no way meant
to replace or share code with it. It's rather a much simpler
implementation that is specifically written with bpf maps in mind.

Patch 1/2 adds the implementation, and 2/2 an extensive test suite.

Feedback is much appreciated.

Thanks,
Daniel

Daniel Mack (1):
  bpf: add a longest prefix match trie map implementation

David Herrmann (1):
  bpf: Add tests for the lpm trie map

 include/uapi/linux/bpf.h                   |   7 +
 kernel/bpf/Makefile                        |   2 +-
 kernel/bpf/lpm_trie.c                      | 491 +++++++++++++++++++++++++++++
 tools/testing/selftests/bpf/.gitignore     |   1 +
 tools/testing/selftests/bpf/Makefile       |   4 +-
 tools/testing/selftests/bpf/test_lpm_map.c | 348 ++++++++++++++++++++
 6 files changed, 850 insertions(+), 3 deletions(-)
 create mode 100644 kernel/bpf/lpm_trie.c
 create mode 100644 tools/testing/selftests/bpf/test_lpm_map.c

-- 
2.9.3
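
To make the expected semantics concrete, here is a small reference model of longest prefix matching. It is a sketch only: it uses a linear scan over a table rather than the trie from patch 1/2, and the names range, prefix_mask, lpm_lookup and test_ranges are hypothetical. The table holds the IPv4 ranges that the selftest in patch 2/2 inserts, so the results can be compared with the values that test asserts.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical table entry: an IPv4 range in host byte order plus a value. */
struct range {
	uint32_t	addr;
	unsigned int	prefixlen;	/* 0..32 */
	uint64_t	value;
};

#define IPV4(a, b, c, d) \
	(((uint32_t)(a) << 24) | ((uint32_t)(b) << 16) | \
	 ((uint32_t)(c) << 8) | (uint32_t)(d))

/*
 * Network mask for a prefix length. /0 is special-cased because shifting
 * a 32-bit value by 32 bits is undefined behaviour in C.
 */
static uint32_t prefix_mask(unsigned int prefixlen)
{
	assert(prefixlen <= 32);
	return prefixlen ? (uint32_t)(UINT32_MAX << (32 - prefixlen)) : 0;
}

/* Return the entry with the longest prefix that covers addr, or NULL. */
static const struct range *lpm_lookup(const struct range *tbl, size_t n,
				      uint32_t addr)
{
	const struct range *best = NULL;
	size_t i;

	for (i = 0; i < n; i++) {
		uint32_t mask = prefix_mask(tbl[i].prefixlen);

		if ((addr & mask) != (tbl[i].addr & mask))
			continue;
		if (!best || tbl[i].prefixlen > best->prefixlen)
			best = &tbl[i];
	}

	return best;
}

/* The IPv4 ranges and values inserted by the selftest in patch 2/2. */
static const struct range test_ranges[] = {
	{ IPV4(192, 168, 0, 0),   16, 1 },
	{ IPV4(192, 168, 0, 0),   24, 2 },
	{ IPV4(192, 168, 128, 0), 24, 3 },
	{ IPV4(192, 168, 1, 0),   24, 5 },
	{ IPV4(192, 168, 0, 0),   23, 4 },
};
```

Looking up 192.168.128.23 in test_ranges selects the entry with value 3, 192.168.0.1 selects value 2, and 10.0.0.1 matches nothing. The scan is O(n) per lookup; the trie bounds the walk by the prefix length instead.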

* [PATCH RFC 1/2] bpf: add a longest prefix match trie map implementation
From: Daniel Mack @ 2016-12-14 15:43 UTC (permalink / raw)
  To: ast; +Cc: dh.herrmann, daniel, netdev, davem, Daniel Mack
In-Reply-To: <20161214154336.17639-1-daniel@zonque.org>

This trie implements a longest prefix match algorithm that can be used
to match IP addresses to a stored set of ranges.

Internally, data is stored in an unbalanced trie of nodes that has a
maximum height of n, where n is the prefixlen the trie was created
with.

Tries may be created with prefix lengths that are multiples of 8, in
the range from 8 to 2048. The key used for lookup and update operations
is a struct bpf_lpm_trie_key, and the value is a uint64_t.

The code carries more information about the internal implementation.

Signed-off-by: Daniel Mack <daniel@zonque.org>
Reviewed-by: David Herrmann <dh.herrmann@gmail.com>
---
 include/uapi/linux/bpf.h |   7 +
 kernel/bpf/Makefile      |   2 +-
 kernel/bpf/lpm_trie.c    | 491 +++++++++++++++++++++++++++++++++++++++++++++++
 3 files changed, 499 insertions(+), 1 deletion(-)
 create mode 100644 kernel/bpf/lpm_trie.c

diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
index 0eb0e87..d564277 100644
--- a/include/uapi/linux/bpf.h
+++ b/include/uapi/linux/bpf.h
@@ -63,6 +63,12 @@ struct bpf_insn {
 	__s32	imm;		/* signed immediate constant */
 };
 
+/* Key of a BPF_MAP_TYPE_LPM_TRIE entry */
+struct bpf_lpm_trie_key {
+	__u32	prefixlen;	/* up to 32 for AF_INET, 128 for AF_INET6 */
+	__u8	data[0];	/* Arbitrary size */
+};
+
 /* BPF syscall commands, see bpf(2) man-page for details. */
 enum bpf_cmd {
 	BPF_MAP_CREATE,
@@ -89,6 +95,7 @@ enum bpf_map_type {
 	BPF_MAP_TYPE_CGROUP_ARRAY,
 	BPF_MAP_TYPE_LRU_HASH,
 	BPF_MAP_TYPE_LRU_PERCPU_HASH,
+	BPF_MAP_TYPE_LPM_TRIE,
 };
 
 enum bpf_prog_type {
diff --git a/kernel/bpf/Makefile b/kernel/bpf/Makefile
index 1276474..e1ce4f4 100644
--- a/kernel/bpf/Makefile
+++ b/kernel/bpf/Makefile
@@ -1,7 +1,7 @@
 obj-y := core.o
 
 obj-$(CONFIG_BPF_SYSCALL) += syscall.o verifier.o inode.o helpers.o
-obj-$(CONFIG_BPF_SYSCALL) += hashtab.o arraymap.o percpu_freelist.o bpf_lru_list.o
+obj-$(CONFIG_BPF_SYSCALL) += hashtab.o arraymap.o percpu_freelist.o bpf_lru_list.o lpm_trie.o
 ifeq ($(CONFIG_PERF_EVENTS),y)
 obj-$(CONFIG_BPF_SYSCALL) += stackmap.o
 endif
diff --git a/kernel/bpf/lpm_trie.c b/kernel/bpf/lpm_trie.c
new file mode 100644
index 0000000..cae759d
--- /dev/null
+++ b/kernel/bpf/lpm_trie.c
@@ -0,0 +1,491 @@
+/*
+ * Longest prefix match trie implementation
+ *
+ * Copyright (c) 2016 Daniel Mack
+ * Copyright (c) 2016 David Herrmann
+ *
+ * This file is subject to the terms and conditions of version 2 of the GNU
+ * General Public License.  See the file COPYING in the main directory of the
+ * Linux distribution for more details.
+ */
+
+#include <linux/bpf.h>
+#include <linux/err.h>
+#include <linux/slab.h>
+#include <linux/spinlock.h>
+#include <linux/vmalloc.h>
+#include <net/ipv6.h>
+
+/* Intermediate node */
+#define LPM_TREE_NODE_FLAG_IM BIT(0)
+
+struct lpm_trie_node;
+
+struct lpm_trie_node {
+	struct rcu_head rcu;
+	struct lpm_trie_node	*child[2];
+	u32			prefixlen;
+	u32			flags;
+	u64			value;
+	u8			data[0];
+};
+
+struct lpm_trie {
+	struct bpf_map		map;
+	struct lpm_trie_node	*root;
+	size_t			n_entries;
+	size_t			max_prefixlen;
+	size_t			data_size;
+	spinlock_t		lock;
+};
+
+/*
+ * This trie implements a longest prefix match algorithm that can be used to
+ * match IP addresses to a stored set of ranges.
+ *
+ * Data stored in @data of struct bpf_lpm_trie_key and struct lpm_trie_node is
+ * interpreted as big endian, so data[0] stores the most significant byte.
+ *
+ * Match ranges are internally stored in instances of struct lpm_trie_node
+ * which each contain their prefix length as well as two pointers that may
+ * lead to more nodes containing more specific matches. Each node also stores
+ * a value that is defined by and returned to userspace via the update_elem
+ * and lookup functions.
+ *
+ * For instance, let's start with a trie that was created with a prefix length
+ * of 32, so it can be used for IPv4 addresses, and one single element that
+ * matches 192.168.0.0/16. The data array would hence contain
+ * [0xc0, 0xa8, 0x00, 0x00] in big-endian notation. This documentation will
+ * stick to IP-address notation for readability though.
+ *
+ * As the trie is empty initially, the new node (1) will be placed as root
+ * node, denoted as (R) in the example below. As there are no other nodes, both
+ * child pointers are %NULL.
+ *
+ *              +----------------+
+ *              |       (1)  (R) |
+ *              | 192.168.0.0/16 |
+ *              |    value: 1    |
+ *              |   [0]    [1]   |
+ *              +----------------+
+ *
+ * Next, let's add a new node (2) matching 192.168.0.0/24. As there is already
+ * a node with the same data and a smaller prefix (ie, a less specific one),
+ * node (2) will become a child of (1). The child index depends on the next bit
+ * that is outside of what (1) matches, and that bit is 0, so (2) will be
+ * child[0] of (1):
+ *
+ *              +----------------+
+ *              |       (1)  (R) |
+ *              | 192.168.0.0/16 |
+ *              |    value: 1    |
+ *              |   [0]    [1]   |
+ *              +----------------+
+ *                   |
+ *    +----------------+
+ *    |       (2)      |
+ *    | 192.168.0.0/24 |
+ *    |    value: 2    |
+ *    |   [0]    [1]   |
+ *    +----------------+
+ *
+ * The child[1] slot of (1) could be filled with another node which has bit #17
+ * (the next bit after the ones that (1) matches on) set to 1. For instance,
+ * 192.168.128.0/24:
+ *
+ *              +----------------+
+ *              |       (1)  (R) |
+ *              | 192.168.0.0/16 |
+ *              |    value: 1    |
+ *              |   [0]    [1]   |
+ *              +----------------+
+ *                   |      |
+ *    +----------------+  +------------------+
+ *    |       (2)      |  |        (3)       |
+ *    | 192.168.0.0/24 |  | 192.168.128.0/24 |
+ *    |    value: 2    |  |     value: 3     |
+ *    |   [0]    [1]   |  |    [0]    [1]    |
+ *    +----------------+  +------------------+
+ *
+ * Let's add another node (5) to the game for 192.168.1.0/24. In order to place
+ * it, node (1) is looked at first, and because of the semantics laid out
+ * above (bit #17 is 0), it would normally be attached to (1) as child[0].
+ * However, that slot is already allocated, so a new node (4) is needed in
+ * between. That node does not have a value attached to it and it will never be
+ * returned to users as result of a lookup. It is only there to differentiate
+ * the traversal further. It will get a prefix as wide as necessary to
+ * distinguish its two children:
+ *
+ *                      +----------------+
+ *                      |       (1)  (R) |
+ *                      | 192.168.0.0/16 |
+ *                      |    value: 1    |
+ *                      |   [0]    [1]   |
+ *                      +----------------+
+ *                           |      |
+ *            +----------------+  +------------------+
+ *            |       (4)  (I) |  |        (3)       |
+ *            | 192.168.0.0/23 |  | 192.168.128.0/24 |
+ *            |    value: ---  |  |     value: 3     |
+ *            |   [0]    [1]   |  |    [0]    [1]    |
+ *            +----------------+  +------------------+
+ *                 |      |
+ *  +----------------+  +----------------+
+ *  |       (2)      |  |       (5)      |
+ *  | 192.168.0.0/24 |  | 192.168.1.0/24 |
+ *  |    value: 2    |  |     value: 5   |
+ *  |   [0]    [1]   |  |   [0]    [1]   |
+ *  +----------------+  +----------------+
+ *
+ * 192.168.1.1/32 would be a child of (5) etc.
+ *
+ * An intermediate node will be turned into a 'real' node on demand. In the
+ * example above, (4) would be re-used if 192.168.0.0/23 is added to the trie.
+ *
+ * A fully populated trie would have a height of 32 nodes, as the trie was
+ * created with a prefix length of 32.
+ *
+ * The lookup starts at the root node. If the current node matches and if there
+ * is a child that can be used to become more specific, the trie is traversed
+ * downwards. The last node in the traversal that is a non-intermediate one is
+ * returned.
+ */
+
+static inline int extract_bit(const u8 *data, size_t index)
+{
+	return !!(data[index / 8] & (1 << (7 - (index % 8))));
+}
+
+/**
+ * longest_prefix_match() - determine the longest prefix
+ * @trie:	The trie to get internal sizes from
+ * @node:	The node to operate on
+ * @key:	The key to compare to @node
+ *
+ * Determine the longest prefix of @node that matches the bits in @key.
+ */
+static size_t longest_prefix_match(const struct lpm_trie *trie,
+				   const struct lpm_trie_node *node,
+				   const struct bpf_lpm_trie_key *key)
+{
+	size_t prefixlen = 0;
+	int i;
+
+	for (i = 0; i < trie->data_size; i++) {
+		size_t b;
+
+		b = 8 - fls(node->data[i] ^ key->data[i]);
+		prefixlen += b;
+
+		if (prefixlen >= node->prefixlen || prefixlen >= key->prefixlen)
+			return min(node->prefixlen, key->prefixlen);
+
+		if (b < 8)
+			break;
+	}
+
+	return prefixlen;
+}
+
+/* Called from syscall or from eBPF program */
+static void *trie_lookup_elem(struct bpf_map *map, void *_key)
+{
+	struct lpm_trie_node *node, *found = NULL;
+	struct bpf_lpm_trie_key *key = _key;
+	struct lpm_trie *trie =
+		container_of(map, struct lpm_trie, map);
+
+	/* Start walking the trie from the root node ... */
+
+	for (node = rcu_dereference(trie->root); node;) {
+		unsigned int next_bit;
+		size_t matchlen;
+
+		/*
+		 * Determine the longest prefix of @node that matches @key.
+		 * If it's the maximum possible prefix for this trie, we have
+		 * an exact match and can return it directly.
+		 */
+		matchlen = longest_prefix_match(trie, node, key);
+		if (matchlen == trie->max_prefixlen)
+			return &node->value;
+
+		/*
+		 * If the number of bits that match is smaller than the prefix
+		 * length of @node, bail out and return the node we have seen
+		 * last in the traversal (ie, the parent).
+		 */
+		if (matchlen < node->prefixlen)
+			break;
+
+		/*
+		 * Consider this node as return candidate unless it is an
+		 * artificially added intermediate one
+		 */
+		if (!(node->flags & LPM_TREE_NODE_FLAG_IM))
+			found = node;
+
+		/*
+		 * If the node match is fully satisfied, let's see if we can
+		 * become more specific. Determine the next bit in the key and
+		 * traverse down.
+		 */
+		next_bit = extract_bit(key->data, node->prefixlen);
+		node = rcu_dereference(node->child[next_bit]);
+	}
+
+	return found ? &found->value : NULL;
+}
+
+static struct lpm_trie_node *lpm_trie_node_alloc(size_t data_size)
+{
+	return kmalloc(sizeof(struct lpm_trie_node) + data_size,
+		       GFP_ATOMIC | __GFP_NOWARN);
+}
+
+/**
+ * find_target_node() - locate a spot to put a new node
+ * @trie:	The trie to walk
+ * @key:	The key to find a slot for
+ * @node_ret:	Return variable for a node slot
+ *
+ * Find a slot to put a new node for @key, and return it in @node_ret.
+ *
+ * If the target location is an empty child of an existing node, or the
+ * root is unused, a pointer to that empty spot is returned in @node_ret
+ * and 0 is returned by the function.
+ *
+ * Otherwise, if a node is detected that conflicts with @key, that conflicting
+ * node is returned in @node_ret. The caller should then replace that node with
+ * an intermediate node. In this case, the longest prefix match between the
+ * existing node and @key is returned.
+ */
+static size_t find_target_node(struct lpm_trie *trie,
+			       struct bpf_lpm_trie_key *key,
+			       struct lpm_trie_node ***node_ret)
+{
+	struct lpm_trie_node **node = &trie->root;
+	size_t matchlen = 0;
+
+	while (*node) {
+		unsigned int next_bit;
+
+		matchlen = longest_prefix_match(trie, *node, key);
+
+		if ((*node)->prefixlen != matchlen ||
+		    (*node)->prefixlen == key->prefixlen ||
+		    (*node)->prefixlen == trie->max_prefixlen)
+			break;
+
+		next_bit = extract_bit(key->data, (*node)->prefixlen);
+		node = &(*node)->child[next_bit];
+	}
+
+	*node_ret = node;
+
+	return *node ? matchlen : 0;
+}
+
+/* Called from syscall or from eBPF program */
+static int trie_update_elem(struct bpf_map *map,
+			    void *_key, void *value, u64 flags)
+{
+	struct lpm_trie *trie = container_of(map, struct lpm_trie, map);
+	struct lpm_trie_node **node, *im_node = NULL, *new_node = NULL;
+	struct bpf_lpm_trie_key *key = _key;
+	size_t matchlen;
+	int ret = 0;
+
+	if (key->prefixlen > trie->max_prefixlen)
+		return -EINVAL;
+
+	spin_lock(&trie->lock);
+
+	/* Allocate and fill a new node */
+
+	if (trie->n_entries == trie->map.max_entries) {
+		ret = -ENOSPC;
+		goto out;
+	}
+
+	new_node = lpm_trie_node_alloc(trie->data_size);
+	if (!new_node) {
+		ret = -ENOMEM;
+		goto out;
+	}
+
+	trie->n_entries++;
+	new_node->value = *(u64 *) value;
+	new_node->prefixlen = key->prefixlen;
+	new_node->flags = 0;
+	new_node->child[0] = NULL;
+	new_node->child[1] = NULL;
+	memcpy(new_node->data, key->data, trie->data_size);
+
+	/*
+	 * Now find a place to attach the new node. find_target_node()
+	 * either returns an empty slot (the root or an empty leaf), or the
+	 * closest match, in which case an intermediate node has to be created
+	 * and installed.
+	 */
+	matchlen = find_target_node(trie, key, &node);
+	if (!*node) {
+		rcu_assign_pointer(*node, new_node);
+		goto out;
+	}
+
+	/*
+	 * If the node we got back as target already exists, replace it
+	 * with new_node, which already has the correct data array and value set.
+	 * If the node that is replaced is an intermediate one, turn it into a
+	 * 'real' node.
+	 */
+	if ((*node)->prefixlen == matchlen) {
+		struct lpm_trie_node *tmp;
+
+		new_node->child[0] = (*node)->child[0];
+		new_node->child[1] = (*node)->child[1];
+
+		tmp = rcu_dereference(*node);
+		if (!(tmp->flags & LPM_TREE_NODE_FLAG_IM))
+			trie->n_entries--;
+
+		rcu_assign_pointer(*node, new_node);
+		kfree_rcu(tmp, rcu);
+
+		goto out;
+	}
+
+	/*
+	 * If the new node matches the prefix completely, it must be
+	 * inserted as an ancestor. Simply insert it between @node and @*node.
+	 */
+	if (matchlen == key->prefixlen) {
+		new_node->child[extract_bit((*node)->data, matchlen)] = *node;
+		rcu_assign_pointer(*node, new_node);
+		goto out;
+	}
+
+	/* Create an intermediate node and place it in between */
+	im_node = lpm_trie_node_alloc(trie->data_size);
+	if (!im_node) {
+		ret = -ENOMEM;
+		goto out;
+	}
+
+	im_node->prefixlen = matchlen;
+	im_node->flags = LPM_TREE_NODE_FLAG_IM;
+	memcpy(im_node->data, (*node)->data, trie->data_size);
+
+	/* Now determine which child to install in which slot */
+	if (extract_bit(key->data, matchlen)) {
+		im_node->child[0] = *node;
+		im_node->child[1] = new_node;
+	} else {
+		im_node->child[0] = new_node;
+		im_node->child[1] = *node;
+	}
+
+	/* Finally, assign the intermediate node to the determined spot */
+	rcu_assign_pointer(*node, im_node);
+
+out:
+	if (ret) {
+		if (new_node)
+			trie->n_entries--;
+
+		kfree(new_node);
+		kfree(im_node);
+	}
+
+	spin_unlock(&trie->lock);
+
+	return ret;
+}
+
+static struct bpf_map *trie_alloc(union bpf_attr *attr)
+{
+	struct lpm_trie *trie;
+
+	/* check sanity of attributes */
+	if (attr->max_entries == 0 || attr->map_flags ||
+	    attr->key_size < sizeof(struct bpf_lpm_trie_key) + 1   ||
+	    attr->key_size > sizeof(struct bpf_lpm_trie_key) + 256 ||
+	    attr->value_size != sizeof(u64))
+		return ERR_PTR(-EINVAL);
+
+	trie = kzalloc(sizeof(*trie), GFP_USER | __GFP_NOWARN);
+	if (!trie)
+		return ERR_PTR(-ENOMEM);
+
+	/* copy mandatory map attributes */
+	trie->map.map_type = attr->map_type;
+	trie->map.key_size = attr->key_size;
+	trie->map.value_size = attr->value_size;
+	trie->map.max_entries = attr->max_entries;
+	trie->data_size = attr->key_size -
+				offsetof(struct bpf_lpm_trie_key, data);
+	trie->max_prefixlen = trie->data_size * 8;
+
+	spin_lock_init(&trie->lock);
+
+	return &trie->map;
+}
+
+static void trie_free(struct bpf_map *map)
+{
+	struct lpm_trie_node **node;
+	struct lpm_trie *trie =
+		container_of(map, struct lpm_trie, map);
+
+	spin_lock(&trie->lock);
+
+	/*
+	 * Always start at the root and walk down to a node that has no
+	 * children. Then free that node, nullify its parent pointer and
+	 * start over.
+	 */
+
+	for (;;) {
+		node = &trie->root;
+		if (!*node)
+			break;
+
+		for (;;) {
+			if ((*node)->child[0]) {
+				node = &(*node)->child[0];
+				continue;
+			}
+
+			if ((*node)->child[1]) {
+				node = &(*node)->child[1];
+				continue;
+			}
+
+			kfree(*node);
+			*node = NULL;
+			break;
+		}
+	}
+
+	spin_unlock(&trie->lock);
+}
+
+static const struct bpf_map_ops trie_ops = {
+	.map_alloc = trie_alloc,
+	.map_free = trie_free,
+	.map_lookup_elem = trie_lookup_elem,
+	.map_update_elem = trie_update_elem,
+};
+
+static struct bpf_map_type_list trie_type __read_mostly = {
+	.ops = &trie_ops,
+	.type = BPF_MAP_TYPE_LPM_TRIE,
+};
+
+static int __init register_trie_map(void)
+{
+	bpf_register_map_type(&trie_type);
+	return 0;
+}
+late_initcall(register_trie_map);
-- 
2.9.3
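
The two bit-level helpers in the patch above can be tried out in userspace. The following is a sketch under stated assumptions: extract_bit() is copied from the patch, fls8() is a hypothetical portable stand-in for the kernel's fls() restricted to one byte, and common_prefix_len() is a hypothetical rewrite of longest_prefix_match() that takes raw byte arrays instead of the trie, node and key structures.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Bit `index` of data, counted from the most significant bit of data[0]. */
static int extract_bit(const uint8_t *data, size_t index)
{
	return !!(data[index / 8] & (1 << (7 - (index % 8))));
}

/*
 * Stand-in for the kernel's fls(): 1-based position of the highest set
 * bit in x, or 0 if x is 0.
 */
static unsigned int fls8(uint8_t x)
{
	unsigned int pos = 0;

	while (x) {
		pos++;
		x >>= 1;
	}

	return pos;
}

/*
 * Number of leading bits shared by two prefixes, capped at the shorter of
 * the two prefix lengths (same result as longest_prefix_match()).
 */
static size_t common_prefix_len(const uint8_t *a, size_t a_prefixlen,
				const uint8_t *b, size_t b_prefixlen,
				size_t data_size)
{
	size_t limit = a_prefixlen < b_prefixlen ? a_prefixlen : b_prefixlen;
	size_t prefixlen = 0;
	size_t i;

	assert(limit <= data_size * 8);

	for (i = 0; i < data_size; i++) {
		/* Matching leading bits within this byte (8 if identical). */
		size_t matched = 8 - fls8((uint8_t)(a[i] ^ b[i]));

		prefixlen += matched;
		if (prefixlen >= limit)
			return limit;
		if (matched < 8)
			break;
	}

	return prefixlen;
}
```

Comparing 192.168.0.0/24 with 192.168.1.0/24 yields 23, which is the prefix length of the intermediate node (4) in the comment's last diagram. Comparing 192.168.0.0/16 with 192.168.1.0/24 yields 16, so the walk continues below node (1) using extract_bit(key, 16), which is 0 for 192.168.1.0.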

* netfilter -stable backport request
From: Eric Desrochers @ 2016-12-14 15:35 UTC (permalink / raw)
  To: stable, netdev

Hi,

I would like to request a -stable backport for the following patchset, which can currently be found in Pablo's nf-next:

# git://git.kernel.org/pub/scm/linux/kernel/git/pablo/nf-next.git

[PATCH 1/3]
commit 2394ae21e8b652aff0db1c02e946243c1e2f5edb
netfilter: x_tables: pass xt_counters struct instead of packet counter
   
[PATCH 2/3]
commit 18b61e8161cc308cbfd06d2e2c6c0758dfd925ef
netfilter: x_tables: pass xt_counters struct to counter allocator

[PATCH 3/3]
commit 722d6785e3b29a3b9f95c4d77542a1416094786a
netfilter: x_tables: pack percpu counter allocations

Please add this to stable branches : v4.4.x, v4.8.x

The above patchset is fixing a netfilter regression which introduced a performance slowdown in binary arp/ip/ip6tables starting at commit :

#v4.2-rc1
commit 71ae0dff02d756e4d2ca710b79f2ff5390029a5f
netfilter: xtables: use percpu rule counters

Regards,

Eric

* RE: [PATCH v2 net-next 1/2] phy: add phy fixup unregister functions
From: Woojung.Huh @ 2016-12-14 15:34 UTC (permalink / raw)
  To: lidongpo, davem, f.fainelli; +Cc: andrew, netdev, UNGLinuxDriver
In-Reply-To: <58510539.1030803@hisilicon.com>

> I just want to commit the unregister patch and found this patch. Good job!
> But I consider this patch may miss something.
> If one SoC has 2 MAC ports and each port uses the different network driver,
> the 2 drivers may register fixup for the same PHY chip with different
> "run" function because the PHY chip works in different mode.
> In such a case, this patch doesn't consider "run" function and may cause
> problem.
> When removing the driver which register fixup at last, it will remove another
> driver's fixup.
> Should this condition be considered and fixed?
Good point.
The current phy fixup list is independent of the phydev structure,
and fixups run in two places: phy_device_register() and phy_init_hw().
It's not clear that two separate fixup calls are needed, but it may be a
good idea to pass the phy fixup when calling phy_attach() or
phy_attach_direct() and put it under the phydev structure.
That way, the fixup can be called from phy_init_hw() per phy device and
removed when the phy is detached.
Welcome any comments.

- Woojung

* Re: [PATCH v2 1/4] siphash: add cryptographically secure hashtable function
From: Hannes Frederic Sowa @ 2016-12-14 15:09 UTC (permalink / raw)
  To: Jason A. Donenfeld
  Cc: Netdev, kernel-hardening, LKML, Linux Crypto Mailing List,
	Jean-Philippe Aumasson, Daniel J . Bernstein, Linus Torvalds,
	Eric Biggers
In-Reply-To: <CAHmME9qA6qKdp+qoih2Je4fxU+4E6=Gp7CVfhYU7VbOr6HJ=0Q@mail.gmail.com>

Hello,

On 14.12.2016 14:10, Jason A. Donenfeld wrote:
> On Wed, Dec 14, 2016 at 12:21 PM, Hannes Frederic Sowa
> <hannes@stressinduktion.org> wrote:
>> Can you show or cite benchmarks in comparison with jhash? Last time I
>> looked, especially for short inputs, siphash didn't beat jhash (also on
>> all the 32 bit devices etc.).
> 
> I assume that jhash is likely faster than siphash, but I wouldn't be
> surprised if with optimization we can make siphash at least pretty
> close on 64-bit platforms. (I'll do some tests though; maybe I'm wrong
> and jhash is already slower.)

Yes, numbers would be very useful here. I am mostly concerned about
small plastic router cases. E.g. assume you double packet processing
time with a change of the hashing function at what point is the actual
packet processing more of an attack vector than the hashtable?

> With that said, siphash is here to replace uses of jhash where
> hashtable poisoning vulnerabilities make it necessary. Where there's
> no significant security improvement, if there's no speed improvement
> either, then of course nothing's going to change.

It still changes currently well working source. ;-)

> I should have mentioned md5_transform in this first message too, as
> two other patches in this series actually replace md5_transform usage
> with siphash. I think in this case, siphash is a clear performance
> winner (and security winner) over md5_transform. So if the push back
> against replacing jhash usages is just too high, at the very least it
> remains useful already for the md5_transform usage.

MD5 is considered broken because its collision resistance is broken?
SipHash doesn't even claim to have collision resistance (which we don't
need here)?

But I agree, certainly it could be a nice speed-up!

>> This pretty much depends on the linearity of the hash function? I don't
>> think a crypto secure hash function is needed for a hash table. Albeit I
>> agree that siphash certainly looks good to be used here.
> 
> In order to prevent the aforementioned poisoning attacks, a PRF with
> perfect linearity is required, which is what's achieved when it's a
> cryptographically secure one. Check out section 7 of
> https://131002.net/siphash/siphash.pdf .

I think you mean non-linearity. Otherwise I agree that siphash is
certainly a better suited hashing algorithm as far as I know. But it
would be really interesting to compare some performance numbers. Hard to
say anything without them.

>> I am pretty sure that SipHash still needs a random key per hash table
>> also. So far it was only the choice of hash function you are questioning.
> 
> Siphash needs a random secret key, yes. The point is that the hash
> function remains secure so long as the secret key is kept secret.
> Other functions can't make the same guarantee, and so nervous periodic
> key rotation is necessary, but in most cases nothing is done, and so
> things just leak over time.
> 
> 
>> Hmm, I tried to follow up with all the HashDoS work and so far didn't
>> see any HashDoS attacks against the Jenkins/SpookyHash family.
>>
>> If this is an issue we might need to also put those changes into stable.
> 
> jhash just isn't secure; it's not a cryptographically secure PRF. If
> there hasn't already been an academic paper put out there about it
> this year, let's make this thread 1000 messages long to garner
> attention, and next year perhaps we'll see one. No doubt that
> motivated government organizations, defense contractors, criminals,
> and other netizens have already done research in private. Replacing
> insecure functions with secure functions is usually a good thing.

I think this is a weak argument.

In general I am in favor to switch to siphash, but it would be nice to
see some benchmarks with the specific kernel implementation also on some
smaller 32 bit CPUs and especially without using any SIMD instructions
(which might have been used in paper comparison).

Bye,
Hannes

* [PATCH net-next 1/1] driver: ipvlan: Define common functions to decrease duplicated codes used to add or del IP address
From: fgao @ 2016-12-14 14:52 UTC (permalink / raw)
  To: davem, maheshb, edumazet, netdev, gfree.wind

From: Gao Feng <gfree.wind@gmail.com>

There is some duplicated code in ipvlan_add_addr6/4 and
ipvlan_del_addr6/4. Define two common functions, ipvlan_add_addr
and ipvlan_del_addr, to reduce the duplication.
This should make the code easier to maintain.

Signed-off-by: Gao Feng <gfree.wind@gmail.com>
---
 drivers/net/ipvlan/ipvlan_main.c | 68 +++++++++++++++++-----------------------
 1 file changed, 29 insertions(+), 39 deletions(-)

diff --git a/drivers/net/ipvlan/ipvlan_main.c b/drivers/net/ipvlan/ipvlan_main.c
index 693ec5b..5874d30 100644
--- a/drivers/net/ipvlan/ipvlan_main.c
+++ b/drivers/net/ipvlan/ipvlan_main.c
@@ -669,23 +669,22 @@ static int ipvlan_device_event(struct notifier_block *unused,
 	return NOTIFY_DONE;
 }
 
-static int ipvlan_add_addr6(struct ipvl_dev *ipvlan, struct in6_addr *ip6_addr)
+static int ipvlan_add_addr(struct ipvl_dev *ipvlan, void *iaddr, bool is_v6)
 {
 	struct ipvl_addr *addr;
 
-	if (ipvlan_addr_busy(ipvlan->port, ip6_addr, true)) {
-		netif_err(ipvlan, ifup, ipvlan->dev,
-			  "Failed to add IPv6=%pI6c addr for %s intf\n",
-			  ip6_addr, ipvlan->dev->name);
-		return -EINVAL;
-	}
 	addr = kzalloc(sizeof(struct ipvl_addr), GFP_ATOMIC);
 	if (!addr)
 		return -ENOMEM;
 
 	addr->master = ipvlan;
-	memcpy(&addr->ip6addr, ip6_addr, sizeof(struct in6_addr));
-	addr->atype = IPVL_IPV6;
+	if (is_v6) {
+		memcpy(&addr->ip6addr, iaddr, sizeof(struct in6_addr));
+		addr->atype = IPVL_IPV6;
+	} else {
+		memcpy(&addr->ip4addr, iaddr, sizeof(struct in_addr));
+		addr->atype = IPVL_IPV4;
+	}
 	list_add_tail(&addr->anode, &ipvlan->addrs);
 
 	/* If the interface is not up, the address will be added to the hash
@@ -697,11 +696,11 @@ static int ipvlan_add_addr6(struct ipvl_dev *ipvlan, struct in6_addr *ip6_addr)
 	return 0;
 }
 
-static void ipvlan_del_addr6(struct ipvl_dev *ipvlan, struct in6_addr *ip6_addr)
+static void ipvlan_del_addr(struct ipvl_dev *ipvlan, void *iaddr, bool is_v6)
 {
 	struct ipvl_addr *addr;
 
-	addr = ipvlan_find_addr(ipvlan, ip6_addr, true);
+	addr = ipvlan_find_addr(ipvlan, iaddr, is_v6);
 	if (!addr)
 		return;
 
@@ -712,6 +711,23 @@ static void ipvlan_del_addr6(struct ipvl_dev *ipvlan, struct in6_addr *ip6_addr)
 	return;
 }
 
+static int ipvlan_add_addr6(struct ipvl_dev *ipvlan, struct in6_addr *ip6_addr)
+{
+	if (ipvlan_addr_busy(ipvlan->port, ip6_addr, true)) {
+		netif_err(ipvlan, ifup, ipvlan->dev,
+			  "Failed to add IPv6=%pI6c addr for %s intf\n",
+			  ip6_addr, ipvlan->dev->name);
+		return -EINVAL;
+	}
+
+	return ipvlan_add_addr(ipvlan, ip6_addr, true);
+}
+
+static void ipvlan_del_addr6(struct ipvl_dev *ipvlan, struct in6_addr *ip6_addr)
+{
+	return ipvlan_del_addr(ipvlan, ip6_addr, true);
+}
+
 static int ipvlan_addr6_event(struct notifier_block *unused,
 			      unsigned long event, void *ptr)
 {
@@ -745,45 +761,19 @@ static int ipvlan_addr6_event(struct notifier_block *unused,
 
 static int ipvlan_add_addr4(struct ipvl_dev *ipvlan, struct in_addr *ip4_addr)
 {
-	struct ipvl_addr *addr;
-
 	if (ipvlan_addr_busy(ipvlan->port, ip4_addr, false)) {
 		netif_err(ipvlan, ifup, ipvlan->dev,
 			  "Failed to add IPv4=%pI4 on %s intf.\n",
 			  ip4_addr, ipvlan->dev->name);
 		return -EINVAL;
 	}
-	addr = kzalloc(sizeof(struct ipvl_addr), GFP_KERNEL);
-	if (!addr)
-		return -ENOMEM;
-
-	addr->master = ipvlan;
-	memcpy(&addr->ip4addr, ip4_addr, sizeof(struct in_addr));
-	addr->atype = IPVL_IPV4;
-	list_add_tail(&addr->anode, &ipvlan->addrs);
-
-	/* If the interface is not up, the address will be added to the hash
-	 * list by ipvlan_open.
-	 */
-	if (netif_running(ipvlan->dev))
-		ipvlan_ht_addr_add(ipvlan, addr);
 
-	return 0;
+	return ipvlan_add_addr(ipvlan, ip4_addr, false);
 }
 
 static void ipvlan_del_addr4(struct ipvl_dev *ipvlan, struct in_addr *ip4_addr)
 {
-	struct ipvl_addr *addr;
-
-	addr = ipvlan_find_addr(ipvlan, ip4_addr, false);
-	if (!addr)
-		return;
-
-	ipvlan_ht_addr_del(addr);
-	list_del(&addr->anode);
-	kfree_rcu(addr, rcu);
-
-	return;
+	return ipvlan_del_addr(ipvlan, ip4_addr, false);
 }
 
 static int ipvlan_addr4_event(struct notifier_block *unused,
-- 
1.9.1
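
The refactoring pattern in the patch above (one shared helper taking a void pointer plus an is_v6 flag, with thin typed wrappers for each family) can be illustrated outside the kernel. This is a sketch only: addr_entry, add_addr, add_addr4, add_addr6 and free_addrs are hypothetical names, the list is a plain singly linked list that prepends rather than using list_add_tail(), and the busy check, hash table and RCU handling of the real driver are left out.

```c
#include <assert.h>
#include <netinet/in.h>
#include <stdbool.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical stand-ins for IPVL_IPV4 / IPVL_IPV6. */
enum addr_type { ADDR_IPV4, ADDR_IPV6 };

struct addr_entry {
	enum addr_type		atype;
	union {
		struct in_addr	ip4;
		struct in6_addr	ip6;
	} u;
	struct addr_entry	*next;
};

/*
 * Shared helper: the only family-specific step is which union member is
 * filled and which type tag is stored. Returns 0, or -1 on allocation
 * failure.
 */
static int add_addr(struct addr_entry **head, const void *iaddr, bool is_v6)
{
	struct addr_entry *e;

	assert(head != NULL && iaddr != NULL);

	e = calloc(1, sizeof(*e));
	if (!e)
		return -1;

	if (is_v6) {
		memcpy(&e->u.ip6, iaddr, sizeof(struct in6_addr));
		e->atype = ADDR_IPV6;
	} else {
		memcpy(&e->u.ip4, iaddr, sizeof(struct in_addr));
		e->atype = ADDR_IPV4;
	}

	e->next = *head;
	*head = e;
	return 0;
}

/* Typed wrappers keep call sites type-checked despite the void pointer. */
static int add_addr4(struct addr_entry **head, const struct in_addr *a)
{
	return add_addr(head, a, false);
}

static int add_addr6(struct addr_entry **head, const struct in6_addr *a)
{
	return add_addr(head, a, true);
}

/* Release every entry of the list. */
static void free_addrs(struct addr_entry *head)
{
	while (head) {
		struct addr_entry *next = head->next;

		free(head);
		head = next;
	}
}
```

Keeping the typed wrappers means a caller cannot accidentally pass an IPv4 address with is_v6 set to true, which is the main risk of the void pointer interface.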

* Re: [PATCHv3 perf/core 0/7] Reuse libbpf from samples/bpf
From: Arnaldo Carvalho de Melo @ 2016-12-14 14:55 UTC (permalink / raw)
  To: Daniel Borkmann, Joe Stringer; +Cc: linux-kernel, netdev, wangnan0, ast
In-Reply-To: <20161214132501.GP5482@kernel.org>

Em Wed, Dec 14, 2016 at 10:25:01AM -0300, Arnaldo Carvalho de Melo escreveu:
> Em Fri, Dec 09, 2016 at 04:30:54PM +0100, Daniel Borkmann escreveu:
> > On 12/09/2016 04:09 PM, Arnaldo Carvalho de Melo wrote:
> > > > v3: Add ack for first patch.
> > > >      Split out second patch from v2 into separate changes for remaining diff.
> > > >      Add patches to switch samples/bpf over to using tools/lib/.
> > > > v2: https://www.mail-archive.com/netdev@vger.kernel.org/msg135088.html
> > > >      Don't shift non-bpf code into libbpf.
> > > >      Drop the patch to synchronize ELF definitions with tc.
> > > > v1: https://www.mail-archive.com/netdev@vger.kernel.org/msg135088.html
> > > >      First post.

> > > Thanks, applied after addressing the -I$(objtree) issue raised by Wang,

> > [ Sorry for late reply. ]

> > First of all, glad to see us getting rid of the duplicate lib eventually! :)
> > 
> > Please note that this might result in hopefully just a minor merge issue
> > with net-next. Looks like patch 4/7 touches test_maps.c and test_verifier.c,
> > which moved to a new bpf selftest suite [1] this net-next cycle. Seems it's
> > just log buffer and some renames there, which can be discarded for both
> > files sitting in selftests.
> 
> Yeah, I've got to this point, and the merge has a little bit more than
> that, including BPF_PROG_ATTACH/BPF_PROG_DETACH, etc, working on it...

So, Joe, can you try refreshing this work, starting from what I have in
perf/core? It has the changes coming from net-next that Daniel warned us about
and some more.

[acme@jouet linux]$ git log --oneline -5
1f125a4aa4d8 tools lib bpf: Add flags to bpf_create_map()
5adf5614f72d tools lib bpf: use __u32 from linux/types.h
ff687c38d803 tools lib bpf: Sync {tools,}/include/uapi/linux/bpf.h
53452c69b4c3 perf annotate: Fix jump target outside of function address range
2f41ae602b57 perf annotate: Support jump instruction with target as second operand
[acme@jouet linux]$

I tried refreshing it, but it seems samples/bpf/ needs some love and
care first, as I can't get it to build before these patches, to make
sure nothing gets broken.

Trying to bisect it, I hit what seem to be multiple bisect breakages. The
last tag I got to build, with lots of warnings, was v4.8; after that I get
errors like the ones below.

I could try fixing it, but I may be missing something, and I want to push
the other stuff in this branch...

[acme@jouet linux]$ egrep SAMPLES\|BPF .config
CONFIG_BPF=y
CONFIG_BPF_SYSCALL=y
CONFIG_NETFILTER_XT_MATCH_BPF=m
CONFIG_NET_CLS_BPF=m
CONFIG_NET_ACT_BPF=m
CONFIG_BPF_JIT=y
CONFIG_HAVE_EBPF_JIT=y
CONFIG_BPF_EVENTS=y
# CONFIG_TEST_BPF is not set
CONFIG_SAMPLES=y
[acme@jouet linux]$ 

[acme@jouet linux]$ make -C samples/bpf
make: Entering directory '/home/acme/git/linux/samples/bpf'
make -C ../../ $PWD/
make[1]: Entering directory '/home/acme/git/linux'
  CHK     include/config/kernel.release
  CHK     include/generated/uapi/linux/version.h
  CHK     include/generated/utsrelease.h
  CHK     include/generated/timeconst.h
  CHK     include/generated/bounds.h
  CHK     include/generated/asm-offsets.h
  CALL    scripts/checksyscalls.sh
  HOSTCC  /home/acme/git/linux/samples/bpf/bpf_load.o
In file included from /home/acme/git/linux/samples/bpf/bpf_load.c:21:0:
/home/acme/git/linux/samples/bpf/bpf_helpers.h:76:11: error: ‘BPF_FUNC_skb_in_cgroup’ undeclared here (not in a function)
  (void *) BPF_FUNC_skb_in_cgroup;
           ^~~~~~~~~~~~~~~~~~~~~~
scripts/Makefile.host:124: recipe for target '/home/acme/git/linux/samples/bpf/bpf_load.o' failed
make[2]: *** [/home/acme/git/linux/samples/bpf/bpf_load.o] Error 1
Makefile:1646: recipe for target '/home/acme/git/linux/samples/bpf/' failed

[acme@jouet linux]$ make -C samples/bpf
make: Entering directory '/home/acme/git/linux/samples/bpf'
make -C ../../ $PWD/
make[1]: Entering directory '/home/acme/git/linux'
scripts/kconfig/conf  --silentoldconfig Kconfig
#
# configuration written to .config
#
  SYSTBL  arch/x86/entry/syscalls/../../include/generated/asm/syscalls_32.h
  SYSHDR  arch/x86/entry/syscalls/../../include/generated/asm/unistd_32_ia32.h
  SYSHDR  arch/x86/entry/syscalls/../../include/generated/uapi/asm/unistd_32.h
  CHK     include/config/kernel.release
  UPD     include/config/kernel.release
  CHK     include/generated/uapi/linux/version.h
  UPD     include/generated/uapi/linux/version.h
  CHK     include/generated/utsrelease.h
  UPD     include/generated/utsrelease.h
  CHK     include/generated/timeconst.h
  CC      kernel/bounds.s
  CHK     include/generated/bounds.h
  GEN     scripts/gdb/linux/constants.py
  CC      arch/x86/kernel/asm-offsets.s
  CHK     include/generated/asm-offsets.h
  CALL    scripts/checksyscalls.sh
  HOSTCC  /home/acme/git/linux/samples/bpf/bpf_load.o
In file included from /home/acme/git/linux/samples/bpf/bpf_load.c:21:0:
/home/acme/git/linux/samples/bpf/bpf_helpers.h:49:11: error: ‘BPF_FUNC_current_task_under_cgroup’ undeclared here (not in a function)
  (void *) BPF_FUNC_current_task_under_cgroup;
           ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
/home/acme/git/linux/samples/bpf/bpf_helpers.h:80:11: error: ‘BPF_FUNC_skb_in_cgroup’ undeclared here (not in a function)
  (void *) BPF_FUNC_skb_in_cgroup;
           ^~~~~~~~~~~~~~~~~~~~~~
scripts/Makefile.host:124: recipe for target '/home/acme/git/linux/samples/bpf/bpf_load.o' failed

^ permalink raw reply

* RE: [PATCH v2 3/4] secure_seq: use siphash24 instead of md5_transform
From: David Laight @ 2016-12-14 14:47 UTC (permalink / raw)
  To: 'Jason A. Donenfeld', Hannes Frederic Sowa
  Cc: Netdev, kernel-hardening@lists.openwall.com, Andi Kleen, LKML,
	Linux Crypto Mailing List
In-Reply-To: <CAHmME9o4NVi-MPeURio1Ga58rnW6JAGQdTg6scd+K3EZEf3RNA@mail.gmail.com>

From: Jason A. Donenfeld
> Sent: 14 December 2016 13:44
> To: Hannes Frederic Sowa
> > __packed not only removes all padding of the struct but also changes the
> > alignment assumptions for the whole struct itself. The rule that a struct
> > is aligned to the maximum alignment of its members no longer holds. That
> > said, the code accessing this struct will change (not on archs that can
> > deal efficiently with unaligned access, but on others).
> 
> That's interesting. There currently aren't any alignment requirements
> in siphash because we use the unaligned helper functions, but as David
> pointed out in another thread, maybe that too should change. In that
> case, we'd have an aligned-only version of the function that requires
> 8-byte aligned input. Perhaps the best way to go about that would be
> to just mark the struct as __packed __aligned(8). Or, I guess, since
> 64-bit accesses get split into two on 32-bit, that'd be best described
> as __packed __aligned(sizeof(long)). Would that be an acceptable
> solution?

Just remove the __packed and ensure that the structure is 'nice'.
This includes ensuring there is no 'tail padding'.
In some cases you'll need to put the port number into a 32bit field.

I'd also require that the key be aligned.
It probably ought to be a named structure type with two 64bit members
(or with an array member that has two elements).
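A minimal sketch of what such a layout could look like, as a userspace C11
illustration rather than actual kernel code. The type and field names
(example_siphash_key, example_seq_input) are made up for the example; the
point is only that widening the ports to 32 bits lets the members fill an
8-byte-aligned struct exactly, so __packed is not needed:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical key type: a named struct holding two 64-bit words. */
struct example_siphash_key {
	_Alignas(8) uint64_t k[2];
};

/*
 * Hypothetical hash input for an IPv4 connection. No __packed: the
 * 16-bit ports are stored in 32-bit fields so the members add up to
 * a multiple of 8 bytes, leaving no interior or tail padding.
 */
struct example_seq_input {
	_Alignas(8) uint32_t saddr;
	uint32_t daddr;
	uint32_t sport;	/* 16-bit port number in a 32-bit field */
	uint32_t dport;	/* 16-bit port number in a 32-bit field */
};

/* Compile-time checks that the layout really is "nice". */
static_assert(sizeof(struct example_seq_input) == 4 * sizeof(uint32_t),
	      "members must fill the struct with no padding");
static_assert(offsetof(struct example_seq_input, dport) + sizeof(uint32_t) ==
	      sizeof(struct example_seq_input), "no tail padding");
static_assert(_Alignof(struct example_seq_input) == 8,
	      "input must be 8-byte aligned");
static_assert(_Alignof(struct example_siphash_key) == 8,
	      "key must be 8-byte aligned");

/* Number of whole 64-bit words an aligned-only hash would consume. */
static inline size_t example_seq_input_words(void)
{
	return sizeof(struct example_seq_input) / sizeof(uint64_t);
}
```

With 16-bit port fields the same struct would be 12 bytes of data rounded
up to 16, i.e. 4 bytes of tail padding that would end up in the hash input.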

	David


^ permalink raw reply

* RE: [PATCH 3/3] netns: fix net_generic() "id - 1" bloat
From: David Laight @ 2016-12-14 14:41 UTC (permalink / raw)
  To: 'Alexey Dobriyan'
  Cc: davem@davemloft.net, netdev@vger.kernel.org, xemul@openvz.org
In-Reply-To: <CACVxJT9HAMex1Pa+VbTVJwDPuY_dnfgfQgAZFSjMuUmgCiZD1w@mail.gmail.com>

From: Alexey Dobriyan 
> Sent: 14 December 2016 13:20
...
> > If you have foo->bar[id - const] then the compiler has to add the
> > offset of 'bar' and subtract 'const'.
> > If the numbers match no add or subtract is needed.
> >
> > It is much cleaner to do this by explicitly removing the offset on the
> > accesses than using a union.
> 
> Surprisingly, the trick only works if array index is cast to "unsigned long"
> before subtracting.
> 
> Code becomes
> 
>     ...
>     ptr = ng->ptr[(unsigned long)id - 3];
>     ...

The compiler may also be able to optimise it away if 'id' is 'int'
rather than 'unsigned int'.

Oh, if you need casts like that use an accessor function.
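A rough sketch of such an accessor, in plain userspace C. The struct, the
EXAMPLE_FIRST_ID constant and the function name are invented for
illustration and are not the real net_generic definitions; the only part
taken from the thread is the (unsigned long) cast before the subtraction:

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical: the lowest valid id; it maps to slot 0 of ptr[]. */
#define EXAMPLE_FIRST_ID 3
#define EXAMPLE_NR_SLOTS 4

/* Hypothetical stand-in for the structure holding the pointer array. */
struct example_net_generic {
	unsigned int len;
	void *ptr[EXAMPLE_NR_SLOTS];
};

/*
 * Keep the cast and the offset in one place so callers never have to
 * repeat (or forget) them.
 */
static inline void *example_net_generic_get(const struct example_net_generic *ng,
					    unsigned int id)
{
	assert(id >= EXAMPLE_FIRST_ID);
	assert(id - EXAMPLE_FIRST_ID < EXAMPLE_NR_SLOTS);
	return ng->ptr[(unsigned long)id - EXAMPLE_FIRST_ID];
}
```

Callers then write example_net_generic_get(ng, id) and the index arithmetic
can change later without touching them.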

	David



^ permalink raw reply

* Re: [v1] net:ethernet:cavium:octeon:octeon_mgmt: Handle return NULL error from devm_ioremap
From: kbuild test robot @ 2016-12-14 14:40 UTC (permalink / raw)
  To: Arvind Yadav; +Cc: kbuild-all, peter.chen, fw, netdev, linux-kernel
In-Reply-To: <1481639670-17888-1-git-send-email-arvind.yadav.cs@gmail.com>

[-- Attachment #1: Type: text/plain, Size: 1852 bytes --]

Hi Arvind,

[auto build test ERROR on net-next/master]
[also build test ERROR on v4.9 next-20161214]
[if your patch is applied to the wrong git tree, please drop us a note to help improve the system]

url:    https://github.com/0day-ci/linux/commits/Arvind-Yadav/net-ethernet-cavium-octeon-octeon_mgmt-Handle-return-NULL-error-from-devm_ioremap/20161213-224624
config: mips-cavium_octeon_defconfig (attached as .config)
compiler: mips64-linux-gnuabi64-gcc (Debian 6.1.1-9) 6.1.1 20160705
reproduce:
        wget https://git.kernel.org/cgit/linux/kernel/git/wfg/lkp-tests.git/plain/sbin/make.cross -O ~/bin/make.cross
        chmod +x ~/bin/make.cross
        # save the attached .config to linux build tree
        make.cross ARCH=mips 

All errors (new ones prefixed by >>):

   drivers/net/ethernet/cavium/octeon/octeon_mgmt.c: In function 'octeon_mgmt_probe':
>> drivers/net/ethernet/cavium/octeon/octeon_mgmt.c:1473:11: error: 'dev' undeclared (first use in this function)
      dev_err(dev, "failed to map I/O memory\n");
              ^~~
   drivers/net/ethernet/cavium/octeon/octeon_mgmt.c:1473:11: note: each undeclared identifier is reported only once for each function it appears in

vim +/dev +1473 drivers/net/ethernet/cavium/octeon/octeon_mgmt.c

  1467	
  1468		p->mix = (u64)devm_ioremap(&pdev->dev, p->mix_phys, p->mix_size);
  1469		p->agl = (u64)devm_ioremap(&pdev->dev, p->agl_phys, p->agl_size);
  1470		p->agl_prt_ctl = (u64)devm_ioremap(&pdev->dev, p->agl_prt_ctl_phys,
  1471						   p->agl_prt_ctl_size);
  1472		if (!p->mix || !p->agl || !p->agl_prt_ctl) {
> 1473			dev_err(dev, "failed to map I/O memory\n");
  1474			result = -ENOMEM;
  1475			goto err;
  1476		}
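The surrounding lines already pass &pdev->dev to devm_ioremap(), so the
likely fix is to pass the same pointer to dev_err(). The following is a
userspace sketch only: struct device, struct platform_device and
example_dev_err() are simplified stand-ins, not the kernel definitions, and
example_check_mappings() is a hypothetical helper that mirrors the check at
lines 1472-1476:

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>
#include <stdio.h>

/* Simplified stand-ins for the kernel types (assumption, not real). */
struct device { const char *name; };
struct platform_device { struct device dev; };

/* Stand-in for dev_err(): like the real one, it needs a device pointer. */
static void example_dev_err(const struct device *dev, const char *msg)
{
	fprintf(stderr, "%s: %s", dev->name, msg);
}

/* Returns 0 if all three mappings succeeded, -ENOMEM otherwise. */
static int example_check_mappings(struct platform_device *pdev,
				  uint64_t mix, uint64_t agl,
				  uint64_t agl_prt_ctl)
{
	assert(pdev != NULL);

	if (!mix || !agl || !agl_prt_ctl) {
		/* There is no local 'dev' here; use the embedded device. */
		example_dev_err(&pdev->dev, "failed to map I/O memory\n");
		return -ENOMEM;
	}
	return 0;
}
```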

---
0-DAY kernel test infrastructure                Open Source Technology Center
https://lists.01.org/pipermail/kbuild-all                   Intel Corporation

[-- Attachment #2: .config.gz --]
[-- Type: application/gzip, Size: 15718 bytes --]

^ permalink raw reply

* Re: [PATCH net-next] net: remove abuse of VLAN DEI/CFI bit
From: Michał Mirosław @ 2016-12-14 14:28 UTC (permalink / raw)
  To: Stephen Hemminger
  Cc: open list:OPENVSWITCH, netdev-u79uwXL29TY76Z2rM5mHXA,
	moderated list:ETHERNET BRIDGE
In-Reply-To: <20161213172118.2f55c503@xeon-e3>

On Tue, Dec 13, 2016 at 05:21:18PM -0800, Stephen Hemminger wrote:
> On Sat,  3 Dec 2016 10:22:28 +0100 (CET)
> Michał Mirosław <mirq-linux-CoA6ZxLDdyEEUmgCuDUIdw@public.gmane.org> wrote:
> 
> > This All-in-one patch removes abuse of VLAN CFI bit, so it can be passed
> > intact through linux networking stack.
> > 
> > Signed-off-by: Michał Mirosław <michal.miroslaw-sjE0K2xrq/hHxbwTTUZ4aWZHpeb/A1Y/@public.gmane.org>
> > ---
> > 
> > Dear NetDevs
> > 
> > I guess this needs to be split to the prep..convert[]..finish sequence,
> > but if you like it as is, then it's ready.
> > 
> > The biggest question is if the modified interface and vlan_present
> > is the way to go. This can be changed to use vlan_proto != 0 instead
> > of an extra flag bit.
> > 
> > As I can't test most of the driver changes, please look at them carefully.
> > OVS and bridge eyes are especially welcome.
> > 
> > Best Regards,
> > Michał Mirosław
> 
> Is the motivation to support 802.1ad Drop Eligibility Indicator (DEI)?
> 
> If so then you need to be more verbose in the commit log, and lots more
> work is needed. You need to rename fields and validate every place a
> driver is using DEI bit to make sure it really does the right thing
> on that hardware. It is not just a mechanical change.

There are not many mentions of the CFI bit in the Linux tree. Places that
used it as VLAN_TAG_PRESENT are fixed with this patchset. Other uses are:

 - VLAN code: ignored
 - ebt_vlan: ignored
 - OVS: cleared because of netlink API assumptions
 - DSA: transferred to/from (E)DSA tag
 - drivers: gianfar: uses it properly in filtering rules
 - drivers: cnic: false-positive (uses only VLAN ID, CFI bit marks the field 'valid')
 - drivers: qedr: false-positive (like cnic)

So unless there is something hidden in the hardware, no driver does anything
special with the CFI bit.

After this patchset only OVS will need further modifications to be able to
support handling of the DEI bit.
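To illustrate the idea, here is a small userspace sketch of the 802.1Q TCI
layout with a separate "present" flag, so that bit 12 (CFI/DEI) can be
carried through unchanged. The EXAMPLE_ names and struct example_vlan_meta
are hypothetical and are not the skb fields touched by the patch:

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* 802.1Q TCI: PCP in bits 15-13, DEI (formerly CFI) in bit 12, VID in 11-0. */
#define EXAMPLE_VLAN_PRIO_SHIFT	13
#define EXAMPLE_VLAN_DEI_MASK	0x1000u
#define EXAMPLE_VLAN_VID_MASK	0x0fffu

static_assert(EXAMPLE_VLAN_DEI_MASK == 1u << 12, "DEI is bit 12 of the TCI");

/* Hypothetical per-packet VLAN state: tag presence is tracked separately. */
struct example_vlan_meta {
	uint16_t tci;
	bool present;
};

static inline void example_vlan_set(struct example_vlan_meta *m, uint16_t tci)
{
	m->tci = tci;		/* all 16 bits kept, including DEI */
	m->present = true;
}

static inline bool example_vlan_dei(const struct example_vlan_meta *m)
{
	return m->present && (m->tci & EXAMPLE_VLAN_DEI_MASK) != 0;
}

static inline uint16_t example_vlan_vid(const struct example_vlan_meta *m)
{
	return m->tci & EXAMPLE_VLAN_VID_MASK;
}

static inline uint8_t example_vlan_pcp(const struct example_vlan_meta *m)
{
	return (uint8_t)(m->tci >> EXAMPLE_VLAN_PRIO_SHIFT);
}
```

When bit 12 doubles as the "tag present" marker, a tagged frame with DEI
clear and one with DEI set cannot be told apart afterwards; the separate
flag avoids that.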

Best Regards,
Michał Mirosław

^ permalink raw reply

* Re: [PATCH 2/3] selftests: do not require bash to run bpf tests
From: Shuah Khan @ 2016-12-14 14:22 UTC (permalink / raw)
  To: Daniel Borkmann, Rolf Eike Beer, linux-kselftest
  Cc: David S. Miller, netdev, Alexei Starovoitov, linux-kernel,
	Shuah Khan, Shuah Khan
In-Reply-To: <5851270F.4090709@iogearbox.net>

On 12/14/2016 04:03 AM, Daniel Borkmann wrote:
> On 12/14/2016 11:58 AM, Rolf Eike Beer wrote:
>>  From b9d6c1b7427d708ef2d4d57aac17b700b3694d71 Mon Sep 17 00:00:00 2001
>> From: Rolf Eike Beer <eike-kernel@sf-tec.de>
>> Date: Wed, 14 Dec 2016 09:58:12 +0100
>> Subject: [PATCH 2/3] selftests: do not require bash to run bpf tests
>>
>> Nothing in this minimal script seems to require bash. We often run these tests
>> on embedded devices where the only shell available is the busybox ash.
>>
>> Signed-off-by: Rolf Eike Beer <eb@emlix.com>
> 
> Acked-by: Daniel Borkmann <daniel@iogearbox.net>

Thanks. I will get these into 4.10-rc1 or rc2

-- Shuah

^ permalink raw reply

* Re: [PATCH v2 3/4] secure_seq: use siphash24 instead of md5_transform
From: Jason A. Donenfeld @ 2016-12-14 13:44 UTC (permalink / raw)
  To: Hannes Frederic Sowa
  Cc: David Laight, Netdev, kernel-hardening, Andi Kleen, LKML,
	Linux Crypto Mailing List
In-Reply-To: <1e502c6b-cda3-c46d-2535-fcfb58f443a9@stressinduktion.org>

Hi Hannes,

Thanks for the feedback.

> __packed not only removes all padding of the struct but also changes the
> alignment assumptions for the whole struct itself. The rule that a struct
> is aligned to the maximum alignment of its members no longer holds. That
> said, the code accessing this struct will change (not on archs that can
> deal efficiently with unaligned access, but on others).

That's interesting. There currently aren't any alignment requirements
in siphash because we use the unaligned helper functions, but as David
pointed out in another thread, maybe that too should change. In that
case, we'd have an aligned-only version of the function that requires
8-byte aligned input. Perhaps the best way to go about that would be
to just mark the struct as __packed __aligned(8). Or, I guess, since
64-bit accesses get split into two on 32-bit, that'd be best described
as __packed __aligned(sizeof(long)). Would that be an acceptable
solution?
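For reference, a quick userspace sketch of what the two attribute
combinations do to size and alignment with GCC or Clang. The example_*
macros and structs are made up for this illustration; in the kernel the
spellings would be __packed and __aligned(8):

```c
#include <assert.h>
#include <stdint.h>

/* Userspace spellings of the kernel shorthands (GCC/Clang attributes). */
#define example_packed		__attribute__((packed))
#define example_aligned(n)	__attribute__((aligned(n)))

/* Packed only: no padding at all, and alignment drops to 1. */
struct example_in_packed {
	uint32_t a;
	uint64_t b;
	uint16_t c;
} example_packed;

/* Packed plus aligned(8): no interior padding, but 8-byte alignment. */
struct example_in_packed_aligned {
	uint32_t a;
	uint64_t b;
	uint16_t c;
} example_packed example_aligned(8);

static_assert(sizeof(struct example_in_packed) == 14, "4 + 8 + 2 bytes");
static_assert(_Alignof(struct example_in_packed) == 1, "byte aligned");
static_assert(sizeof(struct example_in_packed_aligned) == 16,
	      "size is rounded up to the alignment");
static_assert(_Alignof(struct example_in_packed_aligned) == 8,
	      "8-byte aligned");
```

Note that the aligned variant picks up two bytes of tail padding, so any
code hashing sizeof(struct) bytes would need those bytes initialised.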

Jason

^ permalink raw reply

* Re: [PATCHv2 2/5] sh_eth: enable wake-on-lan for Gen2 devices
From: Sergei Shtylyov @ 2016-12-14 13:37 UTC (permalink / raw)
  To: Niklas Söderlund, Simon Horman, netdev, linux-renesas-soc
  Cc: Geert Uytterhoeven
In-Reply-To: <20161212160931.6478-3-niklas.soderlund+renesas@ragnatech.se>

Hello!

    You forgot "R-Car" before "Gen2" in the subject.

On 12/12/2016 07:09 PM, Niklas Söderlund wrote:

> Tested on Gen2 r8a7791/Koelsch.
>
> Signed-off-by: Niklas Söderlund <niklas.soderlund+renesas@ragnatech.se>
> ---
>  drivers/net/ethernet/renesas/sh_eth.c | 6 ++++--
>  1 file changed, 4 insertions(+), 2 deletions(-)
>
> diff --git a/drivers/net/ethernet/renesas/sh_eth.c b/drivers/net/ethernet/renesas/sh_eth.c
> index 87640b9..348ed22 100644
> --- a/drivers/net/ethernet/renesas/sh_eth.c
> +++ b/drivers/net/ethernet/renesas/sh_eth.c
> @@ -624,8 +624,9 @@ static struct sh_eth_cpu_data r8a779x_data = {
>
>  	.register_type	= SH_ETH_REG_FAST_RCAR,
>
> -	.ecsr_value	= ECSR_PSRTO | ECSR_LCHNG | ECSR_ICD,
> -	.ecsipr_value	= ECSIPR_PSRTOIP | ECSIPR_LCHNGIP | ECSIPR_ICDIP,
> +	.ecsr_value	= ECSR_PSRTO | ECSR_LCHNG | ECSR_ICD | ECSR_MPD,
> +	.ecsipr_value	= ECSIPR_PSRTOIP | ECSIPR_LCHNGIP | ECSIPR_ICDIP |
> +			  ECSIPR_MPDIP,

   These expressions seem to have been sorted by the bit # before your patch, 
now they aren't... care to fix? :-)

[...]

MBR, Sergei

^ permalink raw reply

* Re: [PATCH] arp: do neigh confirm based on sk arg
From: kbuild test robot @ 2016-12-14 13:37 UTC (permalink / raw)
  To: YueHaibing
  Cc: kbuild-all, Julian Anastasov, Hannes Frederic Sowa, Eric Dumazet,
	David S. Miller, netdev
In-Reply-To: <74d41c47-7091-3c52-096d-5b9af2e0e9cf@huawei.com>

[-- Attachment #1: Type: text/plain, Size: 2920 bytes --]

Hi YueHaibing,

[auto build test WARNING on v4.9-rc8]
[cannot apply to net/master net-next/master sparc-next/master next-20161214]
[if your patch is applied to the wrong git tree, please drop us a note to help improve the system]

url:    https://github.com/0day-ci/linux/commits/YueHaibing/arp-do-neigh-confirm-based-on-sk-arg/20161214-191755
reproduce: make htmldocs

All warnings (new ones prefixed by >>):

>> include/net/sock.h:452: warning: No description found for parameter 'sk_dst_pending_confirm'

vim +/sk_dst_pending_confirm +452 include/net/sock.h

^1da177e4 Linus Torvalds  2005-04-16  436  	int			sk_write_pending;
4ea59a6cc YueHaibing      2016-12-14  437  	unsigned short          sk_dst_pending_confirm;
d5f642384 Alexey Dobriyan 2008-11-04  438  #ifdef CONFIG_SECURITY
^1da177e4 Linus Torvalds  2005-04-16  439  	void			*sk_security;
d5f642384 Alexey Dobriyan 2008-11-04  440  #endif
2a56a1fec Tejun Heo       2015-12-07  441  	struct sock_cgroup_data	sk_cgrp_data;
baac50bbc Johannes Weiner 2016-01-14  442  	struct mem_cgroup	*sk_memcg;
^1da177e4 Linus Torvalds  2005-04-16  443  	void			(*sk_state_change)(struct sock *sk);
676d23690 David S. Miller 2014-04-11  444  	void			(*sk_data_ready)(struct sock *sk);
^1da177e4 Linus Torvalds  2005-04-16  445  	void			(*sk_write_space)(struct sock *sk);
^1da177e4 Linus Torvalds  2005-04-16  446  	void			(*sk_error_report)(struct sock *sk);
^1da177e4 Linus Torvalds  2005-04-16  447  	int			(*sk_backlog_rcv)(struct sock *sk,
^1da177e4 Linus Torvalds  2005-04-16  448  						  struct sk_buff *skb);
^1da177e4 Linus Torvalds  2005-04-16  449  	void                    (*sk_destruct)(struct sock *sk);
ef456144d Craig Gallek    2016-01-04  450  	struct sock_reuseport __rcu	*sk_reuseport_cb;
a4298e452 Eric Dumazet    2016-04-01  451  	struct rcu_head		sk_rcu;
^1da177e4 Linus Torvalds  2005-04-16 @452  };
^1da177e4 Linus Torvalds  2005-04-16  453  
559835ea7 Pravin B Shelar 2013-09-24  454  #define __sk_user_data(sk) ((*((void __rcu **)&(sk)->sk_user_data)))
559835ea7 Pravin B Shelar 2013-09-24  455  
559835ea7 Pravin B Shelar 2013-09-24  456  #define rcu_dereference_sk_user_data(sk)	rcu_dereference(__sk_user_data((sk)))
559835ea7 Pravin B Shelar 2013-09-24  457  #define rcu_assign_sk_user_data(sk, ptr)	rcu_assign_pointer(__sk_user_data((sk)), ptr)
559835ea7 Pravin B Shelar 2013-09-24  458  
4a17fd522 Pavel Emelyanov 2012-04-19  459  /*
4a17fd522 Pavel Emelyanov 2012-04-19  460   * SK_CAN_REUSE and SK_NO_REUSE on a socket mean that the socket is OK

:::::: The code at line 452 was first introduced by commit
:::::: 1da177e4c3f41524e886b7f1b8a0c1fc7321cac2 Linux-2.6.12-rc2

:::::: TO: Linus Torvalds <torvalds@ppc970.osdl.org>
:::::: CC: Linus Torvalds <torvalds@ppc970.osdl.org>

---
0-DAY kernel test infrastructure                Open Source Technology Center
https://lists.01.org/pipermail/kbuild-all                   Intel Corporation

[-- Attachment #2: .config.gz --]
[-- Type: application/gzip, Size: 6425 bytes --]

^ permalink raw reply

* Re: [net-next PATCH v5 1/6] net: virtio dynamically disable/enable LRO
From: Michael S. Tsirkin @ 2016-12-14 13:31 UTC (permalink / raw)
  To: John Fastabend
  Cc: daniel, shm, davem, tgraf, alexei.starovoitov, john.r.fastabend,
	netdev, brouer
In-Reply-To: <5849F52A.7050105@gmail.com>

On Thu, Dec 08, 2016 at 04:04:58PM -0800, John Fastabend wrote:
> On 16-12-08 01:36 PM, Michael S. Tsirkin wrote:
> > On Wed, Dec 07, 2016 at 12:11:11PM -0800, John Fastabend wrote:
> >> This adds support for dynamically setting the LRO feature flag. The
> >> message to control guest features in the backend uses the
> >> CTRL_GUEST_OFFLOADS msg type.
> >>
> >> Signed-off-by: John Fastabend <john.r.fastabend@intel.com>
> >> ---
> >>  drivers/net/virtio_net.c |   40 +++++++++++++++++++++++++++++++++++++++-
> >>  1 file changed, 39 insertions(+), 1 deletion(-)
> >>
> >> diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio_net.c
> >> index a21d93a..a5c47b1 100644
> >> --- a/drivers/net/virtio_net.c
> >> +++ b/drivers/net/virtio_net.c
> >> @@ -1419,6 +1419,36 @@ static void virtnet_init_settings(struct net_device *dev)
> >>  	.set_settings = virtnet_set_settings,
> >>  };
> >>  
> >> +static int virtnet_set_features(struct net_device *netdev,
> >> +				netdev_features_t features)
> >> +{
> >> +	struct virtnet_info *vi = netdev_priv(netdev);
> >> +	struct virtio_device *vdev = vi->vdev;
> >> +	struct scatterlist sg;
> >> +	u64 offloads = 0;
> >> +
> >> +	if (features & NETIF_F_LRO)
> >> +		offloads |= (1 << VIRTIO_NET_F_GUEST_TSO4) |
> >> +			    (1 << VIRTIO_NET_F_GUEST_TSO6);
> >> +
> >> +	if (features & NETIF_F_RXCSUM)
> >> +		offloads |= (1 << VIRTIO_NET_F_GUEST_CSUM);
> >> +
> >> +	if (virtio_has_feature(vdev, VIRTIO_NET_F_CTRL_GUEST_OFFLOADS)) {
> >> +		sg_init_one(&sg, &offloads, sizeof(uint64_t));
> >> +		if (!virtnet_send_command(vi,
> >> +					  VIRTIO_NET_CTRL_GUEST_OFFLOADS,
> >> +					  VIRTIO_NET_CTRL_GUEST_OFFLOADS_SET,
> >> +					  &sg)) {
> > 
> > Hmm I just realised that this will slow down setups that bridge
> > virtio net interfaces since bridge calls this if provided.
> > See below.
> 
> 
> Really? What code is trying to turn off GRO via the GUEST_OFFLOADS LRO
> command? My qemu/Linux setup has a set of tap/vhost devices attached to
> a bridge and all of them have LRO enabled even with this patch series.
> 
> I must be missing a setup handler somewhere?
> 
> > 
> >> +			dev_warn(&netdev->dev,
> >> +				 "Failed to set guest offloads by virtnet command.\n");
> >> +			return -EINVAL;
> >> +		}
> >> +	}
> > 
> > Hmm if VIRTIO_NET_F_CTRL_GUEST_OFFLOADS is off, this fails
> > silently. It might actually be a good idea to avoid
> > breaking setups.
> > 
> >> +
> >> +	return 0;
> >> +}
> >> +
> >>  static const struct net_device_ops virtnet_netdev = {
> >>  	.ndo_open            = virtnet_open,
> >>  	.ndo_stop   	     = virtnet_close,
> >> @@ -1435,6 +1465,7 @@ static void virtnet_init_settings(struct net_device *dev)
> >>  #ifdef CONFIG_NET_RX_BUSY_POLL
> >>  	.ndo_busy_poll		= virtnet_busy_poll,
> >>  #endif
> >> +	.ndo_set_features	= virtnet_set_features,
> >>  };
> >>  
> >>  static void virtnet_config_changed_work(struct work_struct *work)
> >> @@ -1815,6 +1846,12 @@ static int virtnet_probe(struct virtio_device *vdev)
> >>  	if (virtio_has_feature(vdev, VIRTIO_NET_F_GUEST_CSUM))
> >>  		dev->features |= NETIF_F_RXCSUM;
> >>  
> >> +	if (virtio_has_feature(vdev, VIRTIO_NET_F_GUEST_TSO4) &&
> >> +	    virtio_has_feature(vdev, VIRTIO_NET_F_GUEST_TSO6)) {
> >> +		dev->features |= NETIF_F_LRO;
> >> +		dev->hw_features |= NETIF_F_LRO;
> > 
> > So the issue is I think that the virtio "LRO" isn't really
> > LRO, it's typically just GRO forwarded to guests.
> > So these are easily re-split along MTU boundaries,
> > which makes it ok to forward these across bridges.
> > 
> > It's not nice that we don't document this in the spec,
> > but it's the reality and people rely on this.
> > 
> > For now, how about doing a custom thing and just disable/enable
> > it as XDP is attached/detached?
> 
> The annoying part about doing this is ethtool will say that it is fixed
> yet it will be changed by seemingly unrelated operation. I'm not sure I
> like the idea to start automatically configuring the link via xdp_set.

I really don't like the idea of dropping performance
by a factor of 3 for people bridging two virtio net
interfaces.

So how about a simple approach for now, just disable
XDP if GUEST_TSO is enabled?

We can discuss better approaches in next version.
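Roughly what I have in mind, as a standalone userspace sketch and not a
patch against virtio_net.c. The feature bit numbers are the ones from the
virtio spec; example_has_feature() and example_xdp_attach_allowed() are
hypothetical helpers, and returning -EOPNOTSUPP is just one possible choice
of error code:

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stdint.h>

/* Feature bit numbers as assigned by the virtio-net specification. */
#define EXAMPLE_VIRTIO_NET_F_GUEST_TSO4	7
#define EXAMPLE_VIRTIO_NET_F_GUEST_TSO6	8

/* Test a single bit in the negotiated feature mask. */
static inline bool example_has_feature(uint64_t features, unsigned int bit)
{
	assert(bit < 64);
	return (features & (UINT64_C(1) << bit)) != 0;
}

/*
 * Hypothetical check to run before attaching an XDP program: refuse if
 * the device may hand us large (GUEST_TSO) packets, instead of trying
 * to turn the offload off behind the user's back.
 */
static int example_xdp_attach_allowed(uint64_t negotiated_features)
{
	if (example_has_feature(negotiated_features,
				EXAMPLE_VIRTIO_NET_F_GUEST_TSO4) ||
	    example_has_feature(negotiated_features,
				EXAMPLE_VIRTIO_NET_F_GUEST_TSO6))
		return -EOPNOTSUPP;

	return 0;
}
```

This leaves bridged setups with GUEST_TSO untouched and only affects users
who actually try to load an XDP program.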


> > 
> >> +	}
> >> +
> >>  	dev->vlan_features = dev->features;
> >>  
> >>  	/* MTU range: 68 - 65535 */
> >> @@ -2057,7 +2094,8 @@ static int virtnet_restore(struct virtio_device *vdev)
> >>  	VIRTIO_NET_F_CTRL_RX, VIRTIO_NET_F_CTRL_VLAN, \
> >>  	VIRTIO_NET_F_GUEST_ANNOUNCE, VIRTIO_NET_F_MQ, \
> >>  	VIRTIO_NET_F_CTRL_MAC_ADDR, \
> >> -	VIRTIO_NET_F_MTU
> >> +	VIRTIO_NET_F_MTU, \
> >> +	VIRTIO_NET_F_CTRL_GUEST_OFFLOADS
> >>  
> >>  static unsigned int features[] = {
> >>  	VIRTNET_FEATURES,

^ permalink raw reply

* Re: [PATCHv3 perf/core 0/7] Reuse libbpf from samples/bpf
From: Arnaldo Carvalho de Melo @ 2016-12-14 13:25 UTC (permalink / raw)
  To: Daniel Borkmann; +Cc: Joe Stringer, linux-kernel, netdev, wangnan0, ast
In-Reply-To: <584ACE2E.2090108@iogearbox.net>

On Fri, Dec 09, 2016 at 04:30:54PM +0100, Daniel Borkmann wrote:
> Hi Arnaldo,
> 
> On 12/09/2016 04:09 PM, Arnaldo Carvalho de Melo wrote:
> > On Thu, Dec 08, 2016 at 06:46:13PM -0800, Joe Stringer wrote:
> > > (Was "libbpf: Synchronize implementations")
> > > 
> > > Update tools/lib/bpf to provide the remaining bpf wrapper pieces needed by the
> > > samples/bpf/ code, then get rid of all of the duplicate BPF libraries in
> > > samples/bpf/libbpf.[ch].
> > > 
> > > ---
> > > v3: Add ack for first patch.
> > >      Split out second patch from v2 into separate changes for remaining diff.
> > >      Add patches to switch samples/bpf over to using tools/lib/.
> > > v2: https://www.mail-archive.com/netdev@vger.kernel.org/msg135088.html
> > >      Don't shift non-bpf code into libbpf.
> > >      Drop the patch to synchronize ELF definitions with tc.
> > > v1: https://www.mail-archive.com/netdev@vger.kernel.org/msg135088.html
> > >      First post.
> > 
> > Thanks, applied after addressing the -I$(objtree) issue raised by Wang,
> 
> [ Sorry for late reply. ]
> 
> First of all, glad to see us getting rid of the duplicate lib eventually! :)
> 
> Please note that this might result in hopefully just a minor merge issue
> with net-next. Looks like patch 4/7 touches test_maps.c and test_verifier.c,
> which moved to a new bpf selftest suite [1] this net-next cycle. Seems it's
> just log buffer and some renames there, which can be discarded for both
> files sitting in selftests.

Yeah, I've got to this point, and the merge has a little bit more than
that, including BPF_PROG_ATTACH/BPF_PROG_DETACH, etc, working on it...

- Arnaldo

^ permalink raw reply
