Netdev List

Netdev List
 help / color / mirror / Atom feed

* Re: [PATCH] r8169: don't use MSI-X on RTL8106e
From: Heiner Kallweit @ 2018-08-22 19:49 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: David Miller, helgaas, jian-hong, nic_swsd, netdev, linux-kernel,
	linux, linux-pci, marc.zyngier, hch
In-Reply-To: <alpine.DEB.2.21.1808221222260.2514@nanos.tec.linutronix.de>

On 22.08.2018 13:44, Thomas Gleixner wrote:
> On Tue, 21 Aug 2018, Heiner Kallweit wrote:
>> On 21.08.2018 21:31, David Miller wrote:
>>> From: Heiner Kallweit <hkallweit1@gmail.com>
>>> Date: Mon, 20 Aug 2018 22:46:48 +0200
>>>
>>>> I'm in contact with Realtek and according to them few chip versions
>>>> seem to clear MSI-X table entries on resume from suspend. Checking
>>>> with them how this could be fixed / worked around.
>>>> Worst case we may have to disable MSI-X in general.
>>>
>>> I worry that if the chip does this, and somehow MSI-X is enabled and
>>> an interrupt is generated, the chip will write to the cleared out
>>> MSI-X address.  This will either write garbage into memory or cause
>>> a bus error and require PCI error recovery.
>>>
>>> It also looks like your test patch doesn't fix things for people who
>>> have tested it.
>>>
>> The test patch was based on the first info from Realtek which made me
>> think that the base address of the MSI-X table is cleared, what
>> obviously is not the case.
>>
>> After some further tests it seems that the solution isn't as simple
>> as storing the MSI-X table entries on suspend and restore them on
>> resume. On my system (where MSI-X works fine) MSI-X table entries
>> on resume are partially different from the ones on suspend.
> 
> Which is not a surprise. Please don't try to fiddle with that at the driver
> level. The irq and PCI core code are the ones in charge and if you'd
> restore at the wrong point then hell breaks lose.
> 
Instead of spending a lot of effort on a workaround which may not be
acceptable, it may be better to fall back to MSI on all affected chip
versions. For two chip versions which were reported to have this issues
we're doing this already. I asked Realtek whether they have an overview
which chip versions are affected, let's see ..

The Realtek chips provide an alternative, register-based way to access
the MSI-X table, and their Windows driver seems to use it. See here:
https://patchwork.kernel.org/patch/4149171/

But as we handle all MSI-X basics in the PCI core, this isn't an option.


> Can you please do the following:
> 
>  1) Store the PCI config space at suspend time
>  2) Compare the PCI config space at resume time and print the difference
> 
> Do that on a working and a non-working version of Realtek NICs.
> 
> Thanks,
> 
> 	tglx
> 
> 
> 

^ permalink raw reply

* [PATCH bpf] bpf, sockmap: fix sock hash count in alloc_sock_hash_elem
From: Daniel Borkmann @ 2018-08-22 16:24 UTC (permalink / raw)
  To: alexei.starovoitov; +Cc: john.fastabend, netdev, Daniel Borkmann

When we try to allocate a new sock hash entry and the allocation
fails, then sock hash map fails to reduce the map element counter,
meaning we keep accounting this element although it was never used.
Fix it by dropping the element counter on error.

Fixes: 81110384441a ("bpf: sockmap, add hash map support")
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: John Fastabend <john.fastabend@gmail.com>
---
 kernel/bpf/sockmap.c | 4 +++-
 1 file changed, 3 insertions(+), 1 deletion(-)

diff --git a/kernel/bpf/sockmap.c b/kernel/bpf/sockmap.c
index 60ceb0e..40c6ef9 100644
--- a/kernel/bpf/sockmap.c
+++ b/kernel/bpf/sockmap.c
@@ -2269,8 +2269,10 @@ static struct htab_elem *alloc_sock_hash_elem(struct bpf_htab *htab,
 	}
 	l_new = kmalloc_node(htab->elem_size, GFP_ATOMIC | __GFP_NOWARN,
 			     htab->map.numa_node);
-	if (!l_new)
+	if (!l_new) {
+		atomic_dec(&htab->count);
 		return ERR_PTR(-ENOMEM);
+	}
 
 	memcpy(l_new->key, key, key_size);
 	l_new->sk = sk;
-- 
2.9.5

^ permalink raw reply related

* [PATCH] hv_netvsc: Fix a deadlock by getting rtnl_lock earlier in netvsc_probe()
From: Dexuan Cui @ 2018-08-22 21:20 UTC (permalink / raw)
  To: KY Srinivasan, Haiyang Zhang, Stephen Hemminger,
	'David S. Miller', 'netdev@vger.kernel.org'
  Cc: 'devel@linuxdriverproject.org',
	'linux-kernel@vger.kernel.org', 'olaf@aepfle.de',
	'apw@canonical.com', 'jasowang@redhat.com',
	vkuznets, 'marcelo.cerri@canonical.com', Josh Poulson


This patch fixes the race between netvsc_probe() and
rndis_set_subchannel(), which can cause a deadlock.

Fixes: 8195b1396ec8 ("hv_netvsc: fix deadlock on hotplug")
Signed-off-by: Dexuan Cui <decui@microsoft.com>
Cc: Stephen Hemminger <sthemmin@microsoft.com>
Cc: K. Y. Srinivasan <kys@microsoft.com>
Cc: Haiyang Zhang <haiyangz@microsoft.com>
---
 drivers/net/hyperv/netvsc_drv.c | 11 ++++++++++-
 1 file changed, 10 insertions(+), 1 deletion(-)


FYI: these are the related 3 paths which show the deadlock:

#1:
    Workqueue: hv_vmbus_con vmbus_onmessage_work [hv_vmbus]
    Call Trace:
     schedule
     schedule_preempt_disabled
     __mutex_lock
     __device_attach
     bus_probe_device
     device_add
     vmbus_device_register
     vmbus_onoffer
     vmbus_onmessage_work
     process_one_work
     worker_thread
     kthread
     ret_from_fork

#2:
    schedule
     schedule_preempt_disabled
     __mutex_lock
     netvsc_probe
     vmbus_probe
     really_probe
     __driver_attach
     bus_for_each_dev
     driver_attach_async
     async_run_entry_fn
     process_one_work
     worker_thread
     kthread
     ret_from_fork

#3:
    Workqueue: events netvsc_subchan_work [hv_netvsc]
    Call Trace:
     schedule
     rndis_set_subchannel
     netvsc_subchan_work
     process_one_work
     worker_thread
     kthread
     ret_from_fork

Before path #1 finishes, path #2 can start to run, because just before
the "bus_probe_device(dev);" in device_add() in path #1, there is a line
"object_uevent(&dev->kobj, KOBJ_ADD);", so systemd-udevd can
immediately try to load hv_netvsc and hence path #2 can start to run.

diff --git a/drivers/net/hyperv/netvsc_drv.c b/drivers/net/hyperv/netvsc_drv.c
index 1121a1ec..4fd14a0 100644
--- a/drivers/net/hyperv/netvsc_drv.c
+++ b/drivers/net/hyperv/netvsc_drv.c
@@ -2206,6 +2206,16 @@ static int netvsc_probe(struct hv_device *dev,
 
 	memcpy(net->dev_addr, device_info.mac_adr, ETH_ALEN);
 
+	/* We must get rtnl_lock before scheduling nvdev->subchan_work,
+	 * otherwise netvsc_subchan_work() can get rtnl_lock first and wait
+	 * all subchannels to show up, but that may not happen because
+	 * netvsc_probe() can't get rtnl_lock and as a result vmbus_onoffer()
+	 * -> ... -> device_add() -> ... -> __device_attach() can't get
+	 * the device lock, so all the subchannels can't be processed --
+	 * finally netvsc_subchan_work() hangs for ever.
+	 */
+	rtnl_lock();
+
 	if (nvdev->num_chn > 1)
 		schedule_work(&nvdev->subchan_work);
 
@@ -2224,7 +2234,6 @@ static int netvsc_probe(struct hv_device *dev,
 	else
 		net->max_mtu = ETH_DATA_LEN;
 
-	rtnl_lock();
 	ret = register_netdevice(net);
 	if (ret != 0) {
 		pr_err("Unable to register netdev.\n");
-- 
2.7.4

^ permalink raw reply related

* [PATCH iproute2 1/3] testsuite: remove all temp files and implement make clean
From: Luca Boccassi @ 2018-08-22 18:09 UTC (permalink / raw)
  To: netdev; +Cc: stephen, stefan.bader

Some generated test files were not removed, including one executable in
the testsuite/tools directory.
Ensure make clean from the top level directory works for the testsuite
subdirs too, and that all the files are removed.

Signed-off-by: Luca Boccassi <bluca@debian.org>
---
 Makefile                 | 2 +-
 testsuite/Makefile       | 3 +++
 testsuite/tools/Makefile | 3 +++
 3 files changed, 7 insertions(+), 1 deletion(-)

diff --git a/Makefile b/Makefile
index 651d2a50..ea2f797c 100644
--- a/Makefile
+++ b/Makefile
@@ -96,7 +96,7 @@ snapshot:
 		> include/SNAPSHOT.h
 
 clean:
-	@for i in $(SUBDIRS); \
+	@for i in $(SUBDIRS) testsuite; \
 	do $(MAKE) $(MFLAGS) -C $$i clean; done
 
 clobber:
diff --git a/testsuite/Makefile b/testsuite/Makefile
index 8fcbc557..2acd0427 100644
--- a/testsuite/Makefile
+++ b/testsuite/Makefile
@@ -43,6 +43,9 @@ alltests: $(TESTS)
 clean:
 	@echo "Removing $(RESULTS_DIR) dir ..."
 	@rm -rf $(RESULTS_DIR)
+	@rm -f iproute2/iproute2-this
+	@rm -f tests/ip/link/dev_wo_vf_rate.nl
+	$(MAKE) -C tools clean
 
 distclean: clean
 	echo "Entering iproute2" && cd iproute2 && $(MAKE) distclean && cd ..;
diff --git a/testsuite/tools/Makefile b/testsuite/tools/Makefile
index f2cdc980..f0ce4ee2 100644
--- a/testsuite/tools/Makefile
+++ b/testsuite/tools/Makefile
@@ -1,3 +1,6 @@
 # SPDX-License-Identifier: GPL-2.0
 generate_nlmsg: generate_nlmsg.c ../../lib/libnetlink.c
 	$(CC) -o $@ $^
+
+clean:
+	rm -f generate_nlmsg
-- 
2.18.0

^ permalink raw reply related

* [PATCH iproute2 3/3] testsuite: run dmesg with sudo
From: Luca Boccassi @ 2018-08-22 18:09 UTC (permalink / raw)
  To: netdev; +Cc: stephen, stefan.bader
In-Reply-To: <20180822180903.26443-1-bluca@debian.org>

Some distributions like Debian nowadays restrict the dmesg command to
root-only. Run it with sudo in the testsuite.

Signed-off-by: Luca Boccassi <bluca@debian.org>
---
 testsuite/Makefile | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/testsuite/Makefile b/testsuite/Makefile
index 5e269877..ef45d5a7 100644
--- a/testsuite/Makefile
+++ b/testsuite/Makefile
@@ -79,5 +79,5 @@ endif
 			echo "PASS"; \
 		fi; \
 		rm "$$TMP_ERR" "$$TMP_OUT"; \
-		dmesg > $(RESULTS_DIR)/$@.$$o.dmesg; \
+		sudo dmesg > $(RESULTS_DIR)/$@.$$o.dmesg; \
 	done
-- 
2.18.0

^ permalink raw reply related

* [PATCH iproute2 2/3] testsuite: let make compile build the netlink helper
From: Luca Boccassi @ 2018-08-22 18:09 UTC (permalink / raw)
  To: netdev; +Cc: stephen, stefan.bader
In-Reply-To: <20180822180903.26443-1-bluca@debian.org>

The generate_nlmsg binary is required but make -C testsuite compile
does not build it. Add the necessary includes and C*FLAGS to the tools
Makefile and have the compile target build it.

Signed-off-by: Luca Boccassi <bluca@debian.org>
---
 testsuite/Makefile       | 1 +
 testsuite/tools/Makefile | 4 +++-
 2 files changed, 4 insertions(+), 1 deletion(-)

diff --git a/testsuite/Makefile b/testsuite/Makefile
index 2acd0427..5e269877 100644
--- a/testsuite/Makefile
+++ b/testsuite/Makefile
@@ -32,6 +32,7 @@ configure:
 
 compile: configure
 	echo "Entering iproute2" && cd iproute2 && $(MAKE) && cd ..;
+	$(MAKE) -C tools
 
 listtests:
 	@for t in $(TESTS); do \
diff --git a/testsuite/tools/Makefile b/testsuite/tools/Makefile
index f0ce4ee2..c936af71 100644
--- a/testsuite/tools/Makefile
+++ b/testsuite/tools/Makefile
@@ -1,6 +1,8 @@
 # SPDX-License-Identifier: GPL-2.0
+include ../../config.mk
+
 generate_nlmsg: generate_nlmsg.c ../../lib/libnetlink.c
-	$(CC) -o $@ $^
+	$(CC) $(CPPFLAGS) $(CFLAGS) $(LDLIBS) $(EXTRA_CFLAGS) -I../../include -include../../include/uapi/linux/netlink.h -o $@ $^
 
 clean:
 	rm -f generate_nlmsg
-- 
2.18.0

^ permalink raw reply related

* Re: [PATCH 05/11] can: rcar: use SPDX identifier for Renesas drivers
From: Philippe Ombredanne @ 2018-08-22 18:16 UTC (permalink / raw)
  To: Fabio Estevam
  Cc: Wolfram Sang, Greg Kroah-Hartman, Wolfram Sang, Linux-Renesas,
	Kuninori Morimoto, Wolfgang Grandegger, Marc Kleine-Budde,
	David S. Miller, linux-can, netdev, linux-kernel, Thomas Gleixner
In-Reply-To: <CAOMZO5BiAyDMSR=2Y-4-1ftNuVtRS59sSJd_cdFmzX=0kgF0Bw@mail.gmail.com>

Hi Fabio,
On Wed, Aug 22, 2018 at 2:55 PM, Fabio Estevam <festevam@gmail.com> wrote:
> On Wed, Aug 22, 2018 at 3:30 AM, Wolfram Sang <wsa@the-dreams.de> wrote:
>>
>>> > +// SPDX-License-Identifier: GPL-2.0-or-later
>>>
>>> According to Documentation/process/license-rules.rst the format should
>>> be like this instead:
>>>
>>> // SPDX-License-Identifier: GPL-2.0+
>>
>> According to https://spdx.org/licenses/ it should be what I did above.
>
> Previous advice I saw was to follow the format described in
> Documentation/process/license-rules.rst
>
> Greg/Philippe,
>
> Any inputs on this matter?
>
> Thanks

IMHO we should always treat and use the
Documentation/process/license-rules.rst as the reference and not SPDX
proper who moves at its own pace and evolves its specs and license ids
independently of where we stand in the kernel.
If this is not right Doc patches are welcomed!
In this is very specific case this has been discussed on list a few
times. If I recall correctly Thomas also had an opinion on this...
So you are correct and this should be for now:
// SPDX-License-Identifier: GPL-2.0+

-- 
Cordially
Philippe Ombredanne

^ permalink raw reply

* Re: [bpf PATCH 1/2] tls: possible hang when do_tcp_sendpages hits sndbuf is full case
From: Dave Watson @ 2018-08-22 18:22 UTC (permalink / raw)
  To: John Fastabend; +Cc: ast, daniel, netdev, davem
In-Reply-To: <20180822153732.27968.41693.stgit@john-Precision-Tower-5810>

On 08/22/18 08:37 AM, John Fastabend wrote:
> Currently, the lower protocols sk_write_space handler is not called if
> TLS is sending a scatterlist via  tls_push_sg. However, normally
> tls_push_sg calls do_tcp_sendpage, which may be under memory pressure,
> that in turn may trigger a wait via sk_wait_event. Typically, this
> happens when the in-flight bytes exceed the sdnbuf size. In the normal
> case when enough ACKs are received sk_write_space() will be called and
> the sk_wait_event will be woken up allowing it to send more data
> and/or return to the user.
> 
> But, in the TLS case because the sk_write_space() handler does not
> wake up the events the above send will wait until the sndtimeo is
> exceeded. By default this is MAX_SCHEDULE_TIMEOUT so it look like a
> hang to the user (especially this impatient user). To fix this pass
> the sk_write_space event to the lower layers sk_write_space event
> which in the TCP case will wake any pending events.
> 
> I observed the above while integrating sockmap and ktls. It
> initially appeared as test_sockmap (modified to use ktls) occasionally
> hanging. To reliably reproduce this reduce the sndbuf size and stress
> the tls layer by sending many 1B sends. This results in every byte
> needing a header and each byte individually being sent to the crypto
> layer.
> 
> Signed-off-by: John Fastabend <john.fastabend@gmail.com>

Super, thanks!

Acked-by: Dave Watson <davejwatson@fb.com>

^ permalink raw reply

* Re: [bpf PATCH 0/2] tls, sockmap, fixes for sk_wait_event
From: Alexei Starovoitov @ 2018-08-22 18:40 UTC (permalink / raw)
  To: John Fastabend; +Cc: ast, daniel, davejwatson, netdev, davem
In-Reply-To: <20180822153314.27968.72499.stgit@john-Precision-Tower-5810>

On Wed, Aug 22, 2018 at 08:37:27AM -0700, John Fastabend wrote:
> I have been testing ktls and sockmap lately and noticed that neither
> was handling sk_write_space events correctly. We need to ensure
> these events are pushed down to the lower layer in all cases to
> handle the case where the lower layer sendpage call has called
> sk_wait_event and needs to be woken up. Without this I see
> occosional stalls of sndtimeo length while we wait for the
> timeout value even though space is available.
> 
> Two fixes below. Thanks.

for the set
Acked-by: Alexei Starovoitov <ast@kernel.org>

^ permalink raw reply

* [RFC PATCH v2 bpf-next 0/2] verifier liveness simplification
From: Edward Cree @ 2018-08-22 19:00 UTC (permalink / raw)
  To: ast, daniel; +Cc: netdev

The first patch is a simplification of register liveness tracking by using
 a separate parentage chain for each register and stack slot, thus avoiding
 the need for logic to handle callee-saved registers when applying read
 marks.  In the future this idea may be extended to form use-def chains.
The second patch adds information about misc/zero data on the stack to the
 state dumps emitted to the log at various points; this information was
 found essential in debugging the first patch, and may be useful elsewhere.

Edward Cree (2):
  bpf/verifier: per-register parent pointers
  bpf/verifier: display non-spill stack slot types in
    print_verifier_state

 include/linux/bpf_verifier.h |   8 +-
 kernel/bpf/verifier.c        | 216 ++++++++++++++-----------------------------
 2 files changed, 72 insertions(+), 152 deletions(-)

^ permalink raw reply

* [RFC PATCH v2 bpf-next 1/2] bpf/verifier: per-register parent pointers
From: Edward Cree @ 2018-08-22 19:02 UTC (permalink / raw)
  To: ast, daniel; +Cc: netdev
In-Reply-To: <d16ea072-61a0-8f8a-aca1-13cac09d3d14@solarflare.com>

By giving each register its own liveness chain, we elide the skip_callee()
 logic.  Instead, each register's parent is the state it inherits from;
 both check_func_call() and prepare_func_exit() automatically connect
 reg states to the correct chain since when they copy the reg state across
 (r1-r5 into the callee as args, and r0 out as the return value) they also
 copy the parent pointer.

Signed-off-by: Edward Cree <ecree@solarflare.com>
---
 include/linux/bpf_verifier.h |   8 +-
 kernel/bpf/verifier.c        | 184 +++++++++++--------------------------------
 2 files changed, 47 insertions(+), 145 deletions(-)

diff --git a/include/linux/bpf_verifier.h b/include/linux/bpf_verifier.h
index 38b04f559ad3..b42b60a83e19 100644
--- a/include/linux/bpf_verifier.h
+++ b/include/linux/bpf_verifier.h
@@ -41,6 +41,7 @@ enum bpf_reg_liveness {
 };
 
 struct bpf_reg_state {
+	/* Ordering of fields matters.  See states_equal() */
 	enum bpf_reg_type type;
 	union {
 		/* valid when type == PTR_TO_PACKET */
@@ -59,7 +60,6 @@ struct bpf_reg_state {
 	 * came from, when one is tested for != NULL.
 	 */
 	u32 id;
-	/* Ordering of fields matters.  See states_equal() */
 	/* For scalar types (SCALAR_VALUE), this represents our knowledge of
 	 * the actual value.
 	 * For pointer types, this represents the variable part of the offset
@@ -76,15 +76,15 @@ struct bpf_reg_state {
 	s64 smax_value; /* maximum possible (s64)value */
 	u64 umin_value; /* minimum possible (u64)value */
 	u64 umax_value; /* maximum possible (u64)value */
+	/* parentage chain for liveness checking */
+	struct bpf_reg_state *parent;
 	/* Inside the callee two registers can be both PTR_TO_STACK like
 	 * R1=fp-8 and R2=fp-8, but one of them points to this function stack
 	 * while another to the caller's stack. To differentiate them 'frameno'
 	 * is used which is an index in bpf_verifier_state->frame[] array
 	 * pointing to bpf_func_state.
-	 * This field must be second to last, for states_equal() reasons.
 	 */
 	u32 frameno;
-	/* This field must be last, for states_equal() reasons. */
 	enum bpf_reg_liveness live;
 };
 
@@ -107,7 +107,6 @@ struct bpf_stack_state {
  */
 struct bpf_func_state {
 	struct bpf_reg_state regs[MAX_BPF_REG];
-	struct bpf_verifier_state *parent;
 	/* index of call instruction that called into this func */
 	int callsite;
 	/* stack frame number of this function state from pov of
@@ -129,7 +128,6 @@ struct bpf_func_state {
 struct bpf_verifier_state {
 	/* call stack tracking */
 	struct bpf_func_state *frame[MAX_CALL_FRAMES];
-	struct bpf_verifier_state *parent;
 	u32 curframe;
 };
 
diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c
index ca90679a7fe5..b11d45916fff 100644
--- a/kernel/bpf/verifier.c
+++ b/kernel/bpf/verifier.c
@@ -380,9 +380,9 @@ static int copy_stack_state(struct bpf_func_state *dst,
 /* do_check() starts with zero-sized stack in struct bpf_verifier_state to
  * make it consume minimal amount of memory. check_stack_write() access from
  * the program calls into realloc_func_state() to grow the stack size.
- * Note there is a non-zero 'parent' pointer inside bpf_verifier_state
- * which this function copies over. It points to previous bpf_verifier_state
- * which is never reallocated
+ * Note there is a non-zero parent pointer inside each reg of bpf_verifier_state
+ * which this function copies over. It points to corresponding reg in previous
+ * bpf_verifier_state which is never reallocated
  */
 static int realloc_func_state(struct bpf_func_state *state, int size,
 			      bool copy_old)
@@ -466,7 +466,6 @@ static int copy_verifier_state(struct bpf_verifier_state *dst_state,
 		dst_state->frame[i] = NULL;
 	}
 	dst_state->curframe = src->curframe;
-	dst_state->parent = src->parent;
 	for (i = 0; i <= src->curframe; i++) {
 		dst = dst_state->frame[i];
 		if (!dst) {
@@ -732,6 +731,7 @@ static void init_reg_state(struct bpf_verifier_env *env,
 	for (i = 0; i < MAX_BPF_REG; i++) {
 		mark_reg_not_init(env, regs, i);
 		regs[i].live = REG_LIVE_NONE;
+		regs[i].parent = NULL;
 	}
 
 	/* frame pointer */
@@ -876,74 +876,21 @@ static int check_subprogs(struct bpf_verifier_env *env)
 	return 0;
 }
 
-static
-struct bpf_verifier_state *skip_callee(struct bpf_verifier_env *env,
-				       const struct bpf_verifier_state *state,
-				       struct bpf_verifier_state *parent,
-				       u32 regno)
-{
-	struct bpf_verifier_state *tmp = NULL;
-
-	/* 'parent' could be a state of caller and
-	 * 'state' could be a state of callee. In such case
-	 * parent->curframe < state->curframe
-	 * and it's ok for r1 - r5 registers
-	 *
-	 * 'parent' could be a callee's state after it bpf_exit-ed.
-	 * In such case parent->curframe > state->curframe
-	 * and it's ok for r0 only
-	 */
-	if (parent->curframe == state->curframe ||
-	    (parent->curframe < state->curframe &&
-	     regno >= BPF_REG_1 && regno <= BPF_REG_5) ||
-	    (parent->curframe > state->curframe &&
-	       regno == BPF_REG_0))
-		return parent;
-
-	if (parent->curframe > state->curframe &&
-	    regno >= BPF_REG_6) {
-		/* for callee saved regs we have to skip the whole chain
-		 * of states that belong to callee and mark as LIVE_READ
-		 * the registers before the call
-		 */
-		tmp = parent;
-		while (tmp && tmp->curframe != state->curframe) {
-			tmp = tmp->parent;
-		}
-		if (!tmp)
-			goto bug;
-		parent = tmp;
-	} else {
-		goto bug;
-	}
-	return parent;
-bug:
-	verbose(env, "verifier bug regno %d tmp %p\n", regno, tmp);
-	verbose(env, "regno %d parent frame %d current frame %d\n",
-		regno, parent->curframe, state->curframe);
-	return NULL;
-}
-
+/* Parentage chain of this register (or stack slot) should take care of all
+ * issues like callee-saved registers, stack slot allocation time, etc.
+ */
 static int mark_reg_read(struct bpf_verifier_env *env,
-			 const struct bpf_verifier_state *state,
-			 struct bpf_verifier_state *parent,
-			 u32 regno)
+			 const struct bpf_reg_state *state,
+			 struct bpf_reg_state *parent)
 {
 	bool writes = parent == state->parent; /* Observe write marks */
 
-	if (regno == BPF_REG_FP)
-		/* We don't need to worry about FP liveness because it's read-only */
-		return 0;
-
 	while (parent) {
 		/* if read wasn't screened by an earlier write ... */
-		if (writes && state->frame[state->curframe]->regs[regno].live & REG_LIVE_WRITTEN)
+		if (writes && state->live & REG_LIVE_WRITTEN)
 			break;
-		parent = skip_callee(env, state, parent, regno);
-		if (!parent)
-			return -EFAULT;
 		/* ... then we depend on parent's value */
-		parent->frame[parent->curframe]->regs[regno].live |= REG_LIVE_READ;
+		parent->live |= REG_LIVE_READ;
 		state = parent;
 		parent = state->parent;
 		writes = true;
@@ -969,7 +916,10 @@ static int check_reg_arg(struct bpf_verifier_env *env, u32 regno,
 			verbose(env, "R%d !read_ok\n", regno);
 			return -EACCES;
 		}
-		return mark_reg_read(env, vstate, vstate->parent, regno);
+		/* We don't need to worry about FP liveness because it's read-only */
+		if (regno != BPF_REG_FP)
+			return mark_reg_read(env, &regs[regno],
+					     regs[regno].parent);
 	} else {
 		/* check whether register used as dest operand can be written to */
 		if (regno == BPF_REG_FP) {
@@ -1080,8 +1030,8 @@ static int check_stack_write(struct bpf_verifier_env *env,
 	} else {
 		u8 type = STACK_MISC;
 
-		/* regular write of data into stack */
-		state->stack[spi].spilled_ptr = (struct bpf_reg_state) {};
+		/* regular write of data into stack destroys any spilled ptr */
+		state->stack[spi].spilled_ptr.type = NOT_INIT;
 
 		/* only mark the slot as written if all 8 bytes were written
 		 * otherwise read propagation may incorrectly stop too soon
@@ -1106,61 +1056,6 @@ static int check_stack_write(struct bpf_verifier_env *env,
 	return 0;
 }
 
-/* registers of every function are unique and mark_reg_read() propagates
- * the liveness in the following cases:
- * - from callee into caller for R1 - R5 that were used as arguments
- * - from caller into callee for R0 that used as result of the call
- * - from caller to the same caller skipping states of the callee for R6 - R9,
- *   since R6 - R9 are callee saved by implicit function prologue and
- *   caller's R6 != callee's R6, so when we propagate liveness up to
- *   parent states we need to skip callee states for R6 - R9.
- *
- * stack slot marking is different, since stacks of caller and callee are
- * accessible in both (since caller can pass a pointer to caller's stack to
- * callee which can pass it to another function), hence mark_stack_slot_read()
- * has to propagate the stack liveness to all parent states at given frame number.
- * Consider code:
- * f1() {
- *   ptr = fp - 8;
- *   *ptr = ctx;
- *   call f2 {
- *      .. = *ptr;
- *   }
- *   .. = *ptr;
- * }
- * First *ptr is reading from f1's stack and mark_stack_slot_read() has
- * to mark liveness at the f1's frame and not f2's frame.
- * Second *ptr is also reading from f1's stack and mark_stack_slot_read() has
- * to propagate liveness to f2 states at f1's frame level and further into
- * f1 states at f1's frame level until write into that stack slot
- */
-static void mark_stack_slot_read(struct bpf_verifier_env *env,
-				 const struct bpf_verifier_state *state,
-				 struct bpf_verifier_state *parent,
-				 int slot, int frameno)
-{
-	bool writes = parent == state->parent; /* Observe write marks */
-
-	while (parent) {
-		if (parent->frame[frameno]->allocated_stack <= slot * BPF_REG_SIZE)
-			/* since LIVE_WRITTEN mark is only done for full 8-byte
-			 * write the read marks are conservative and parent
-			 * state may not even have the stack allocated. In such case
-			 * end the propagation, since the loop reached beginning
-			 * of the function
-			 */
-			break;
-		/* if read wasn't screened by an earlier write ... */
-		if (writes && state->frame[frameno]->stack[slot].spilled_ptr.live & REG_LIVE_WRITTEN)
-			break;
-		/* ... then we depend on parent's value */
-		parent->frame[frameno]->stack[slot].spilled_ptr.live |= REG_LIVE_READ;
-		state = parent;
-		parent = state->parent;
-		writes = true;
-	}
-}
-
 static int check_stack_read(struct bpf_verifier_env *env,
 			    struct bpf_func_state *reg_state /* func where register points to */,
 			    int off, int size, int value_regno)
@@ -1198,8 +1093,8 @@ static int check_stack_read(struct bpf_verifier_env *env,
 			 */
 			state->regs[value_regno].live |= REG_LIVE_WRITTEN;
 		}
-		mark_stack_slot_read(env, vstate, vstate->parent, spi,
-				     reg_state->frameno);
+		mark_reg_read(env, &reg_state->stack[spi].spilled_ptr,
+			      reg_state->stack[spi].spilled_ptr.parent);
 		return 0;
 	} else {
 		int zeros = 0;
@@ -1215,8 +1110,8 @@ static int check_stack_read(struct bpf_verifier_env *env,
 				off, i, size);
 			return -EACCES;
 		}
-		mark_stack_slot_read(env, vstate, vstate->parent, spi,
-				     reg_state->frameno);
+		mark_reg_read(env, &reg_state->stack[spi].spilled_ptr,
+			      reg_state->stack[spi].spilled_ptr.parent);
 		if (value_regno >= 0) {
 			if (zeros == size) {
 				/* any size read into register is zero extended,
@@ -1908,8 +1803,8 @@ static int check_stack_boundary(struct bpf_verifier_env *env, int regno,
 		/* reading any byte out of 8-byte 'spill_slot' will cause
 		 * the whole slot to be marked as 'read'
 		 */
-		mark_stack_slot_read(env, env->cur_state, env->cur_state->parent,
-				     spi, state->frameno);
+		mark_reg_read(env, &state->stack[spi].spilled_ptr,
+			      state->stack[spi].spilled_ptr.parent);
 	}
 	return update_stack_depth(env, state, off);
 }
@@ -2366,11 +2261,13 @@ static int check_func_call(struct bpf_verifier_env *env, struct bpf_insn *insn,
 			state->curframe + 1 /* frameno within this callchain */,
 			subprog /* subprog number within this prog */);
 
-	/* copy r1 - r5 args that callee can access */
+	/* copy r1 - r5 args that callee can access.  The copy includes parent
+	 * pointers, which connects us up to the liveness chain
+	 */
 	for (i = BPF_REG_1; i <= BPF_REG_5; i++)
 		callee->regs[i] = caller->regs[i];
 
-	/* after the call regsiters r0 - r5 were scratched */
+	/* after the call registers r0 - r5 were scratched */
 	for (i = 0; i < CALLER_SAVED_REGS; i++) {
 		mark_reg_not_init(env, caller->regs, caller_saved[i]);
 		check_reg_arg(env, caller_saved[i], DST_OP_NO_MARK);
@@ -4370,7 +4267,7 @@ static bool regsafe(struct bpf_reg_state *rold, struct bpf_reg_state *rcur,
 		/* explored state didn't use this */
 		return true;
 
-	equal = memcmp(rold, rcur, offsetof(struct bpf_reg_state, frameno)) == 0;
+	equal = memcmp(rold, rcur, offsetof(struct bpf_reg_state, parent)) == 0;
 
 	if (rold->type == PTR_TO_STACK)
 		/* two stack pointers are equal only if they're pointing to
@@ -4603,7 +4500,7 @@ static bool states_equal(struct bpf_verifier_env *env,
  * equivalent state (jump target or such) we didn't arrive by the straight-line
  * code, so read marks in the state must propagate to the parent regardless
  * of the state's write marks. That's what 'parent == state->parent' comparison
- * in mark_reg_read() and mark_stack_slot_read() is for.
+ * in mark_reg_read() is for.
  */
 static int propagate_liveness(struct bpf_verifier_env *env,
 			      const struct bpf_verifier_state *vstate,
@@ -4624,7 +4521,8 @@ static int propagate_liveness(struct bpf_verifier_env *env,
 		if (vparent->frame[vparent->curframe]->regs[i].live & REG_LIVE_READ)
 			continue;
 		if (vstate->frame[vstate->curframe]->regs[i].live & REG_LIVE_READ) {
-			err = mark_reg_read(env, vstate, vparent, i);
+			err = mark_reg_read(env, &vstate->frame[vstate->curframe]->regs[i],
+					    &vparent->frame[vstate->curframe]->regs[i]);
 			if (err)
 				return err;
 		}
@@ -4639,7 +4537,8 @@ static int propagate_liveness(struct bpf_verifier_env *env,
 			if (parent->stack[i].spilled_ptr.live & REG_LIVE_READ)
 				continue;
 			if (state->stack[i].spilled_ptr.live & REG_LIVE_READ)
-				mark_stack_slot_read(env, vstate, vparent, i, frame);
+				mark_reg_read(env, &state->stack[i].spilled_ptr,
+					      &parent->stack[i].spilled_ptr);
 		}
 	}
 	return err;
@@ -4649,7 +4548,7 @@ static int is_state_visited(struct bpf_verifier_env *env, int insn_idx)
 {
 	struct bpf_verifier_state_list *new_sl;
 	struct bpf_verifier_state_list *sl;
-	struct bpf_verifier_state *cur = env->cur_state;
+	struct bpf_verifier_state *cur = env->cur_state, *new;
 	int i, j, err;
 
 	sl = env->explored_states[insn_idx];
@@ -4691,16 +4590,18 @@ static int is_state_visited(struct bpf_verifier_env *env, int insn_idx)
 		return -ENOMEM;
 
 	/* add new state to the head of linked list */
-	err = copy_verifier_state(&new_sl->state, cur);
+	new = &new_sl->state;
+	err = copy_verifier_state(new, cur);
 	if (err) {
-		free_verifier_state(&new_sl->state, false);
+		free_verifier_state(new, false);
 		kfree(new_sl);
 		return err;
 	}
 	new_sl->next = env->explored_states[insn_idx];
 	env->explored_states[insn_idx] = new_sl;
 	/* connect new state to parentage chain */
-	cur->parent = &new_sl->state;
+	for (i = 0; i < BPF_REG_FP; i++)
+		cur_regs(env)[i].parent = &new->frame[new->curframe]->regs[i];
 	/* clear write marks in current state: the writes we did are not writes
 	 * our child did, so they don't screen off its reads from us.
 	 * (There are no read marks in current state, because reads always mark
@@ -4713,9 +4614,13 @@ static int is_state_visited(struct bpf_verifier_env *env, int insn_idx)
 	/* all stack frames are accessible from callee, clear them all */
 	for (j = 0; j <= cur->curframe; j++) {
 		struct bpf_func_state *frame = cur->frame[j];
+		struct bpf_func_state *newframe = new->frame[j];
 
-		for (i = 0; i < frame->allocated_stack / BPF_REG_SIZE; i++)
+		for (i = 0; i < frame->allocated_stack / BPF_REG_SIZE; i++) {
 			frame->stack[i].spilled_ptr.live = REG_LIVE_NONE;
+			frame->stack[i].spilled_ptr.parent =
+						&newframe->stack[i].spilled_ptr;
+		}
 	}
 	return 0;
 }
@@ -4734,7 +4639,6 @@ static int do_check(struct bpf_verifier_env *env)
 	if (!state)
 		return -ENOMEM;
 	state->curframe = 0;
-	state->parent = NULL;
 	state->frame[0] = kzalloc(sizeof(struct bpf_func_state), GFP_KERNEL);
 	if (!state->frame[0]) {
 		kfree(state);

^ permalink raw reply related

* [RFC PATCH v2 bpf-next 2/2] bpf/verifier: display non-spill stack slot types in print_verifier_state
From: Edward Cree @ 2018-08-22 19:02 UTC (permalink / raw)
  To: ast, daniel; +Cc: netdev
In-Reply-To: <d16ea072-61a0-8f8a-aca1-13cac09d3d14@solarflare.com>

If a stack slot does not hold a spilled register (STACK_SPILL), then each
 of its eight bytes could potentially have a different slot_type.  This
 information can be important for debugging, and previously we either did
 not print anything for the stack slot, or just printed fp-X=0 in the case
 where its first byte was STACK_ZERO.
Instead, print eight characters with either 0 (STACK_ZERO), m (STACK_MISC)
 or ? (STACK_INVALID) for any stack slot which is neither STACK_SPILL nor
 entirely STACK_INVALID.

Signed-off-by: Edward Cree <ecree@solarflare.com>
---
 kernel/bpf/verifier.c | 32 +++++++++++++++++++++++++-------
 1 file changed, 25 insertions(+), 7 deletions(-)

diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c
index b11d45916fff..2f4b52cf864c 100644
--- a/kernel/bpf/verifier.c
+++ b/kernel/bpf/verifier.c
@@ -263,6 +263,13 @@ static const char * const reg_type_str[] = {
 	[PTR_TO_PACKET_END]	= "pkt_end",
 };
 
+static char slot_type_char[] = {
+	[STACK_INVALID]	= '?',
+	[STACK_SPILL]	= 'r',
+	[STACK_MISC]	= 'm',
+	[STACK_ZERO]	= '0',
+};
+
 static void print_liveness(struct bpf_verifier_env *env,
 			   enum bpf_reg_liveness live)
 {
@@ -349,15 +356,26 @@ static void print_verifier_state(struct bpf_verifier_env *env,
 		}
 	}
 	for (i = 0; i < state->allocated_stack / BPF_REG_SIZE; i++) {
-		if (state->stack[i].slot_type[0] == STACK_SPILL) {
-			verbose(env, " fp%d",
-				(-i - 1) * BPF_REG_SIZE);
-			print_liveness(env, state->stack[i].spilled_ptr.live);
+		char types_buf[BPF_REG_SIZE + 1];
+		bool valid = false;
+		int j;
+
+		for (j = 0; j < BPF_REG_SIZE; j++) {
+			if (state->stack[i].slot_type[j] != STACK_INVALID)
+				valid = true;
+			types_buf[j] = slot_type_char[
+					state->stack[i].slot_type[j]];
+		}
+		types_buf[BPF_REG_SIZE] = 0;
+		if (!valid)
+			continue;
+		verbose(env, " fp%d", (-i - 1) * BPF_REG_SIZE);
+		print_liveness(env, state->stack[i].spilled_ptr.live);
+		if (state->stack[i].slot_type[0] == STACK_SPILL)
 			verbose(env, "=%s",
 				reg_type_str[state->stack[i].spilled_ptr.type]);
-		}
-		if (state->stack[i].slot_type[0] == STACK_ZERO)
-			verbose(env, " fp%d=0", (-i - 1) * BPF_REG_SIZE);
+		else
+			verbose(env, "=%s", types_buf);
 	}
 	verbose(env, "\n");
 }

^ permalink raw reply related

* Re: linux-next: build warning after merge of the net tree
From: Cong Wang @ 2018-08-22 19:26 UTC (permalink / raw)
  To: Stephen Rothwell
  Cc: David Miller, Linux Kernel Network Developers,
	Linux-Next Mailing List, LKML
In-Reply-To: <20180822080449.5887e244@canb.auug.org.au>

On Tue, Aug 21, 2018 at 3:04 PM Stephen Rothwell <sfr@canb.auug.org.au> wrote:
>
> Hi all,
>
> After merging the net tree, today's linux-next build (KCONFIG_NAME)
> produced this warning:
>
> drivers/net/ethernet/stmicro/stmmac/stmmac_tc.c: In function 'tc_fill_actions':
> drivers/net/ethernet/stmicro/stmmac/stmmac_tc.c:64:6: warning: unused variable 'i' [-Wunused-variable]
>   int i;
>       ^
>
> Introduced by commit
>
>   244cd96adb5f ("net_sched: remove list_head from tc_action")

I bet you have CONFIG_NET_CLS_ACT=n?

Here is a quick fix:

diff --git a/include/net/pkt_cls.h b/include/net/pkt_cls.h
index c17d51865469..9ec471ffaa5d 100644
--- a/include/net/pkt_cls.h
+++ b/include/net/pkt_cls.h
@@ -303,7 +303,7 @@ static inline void tcf_exts_put_net(struct tcf_exts *exts)
        for (i = 0; i < TCA_ACT_MAX_PRIO && ((a) = (exts)->actions[i]); i++)
 #else
 #define tcf_exts_for_each_action(i, a, exts) \
-       for (; 0; )
+       for ((void)i, (void)a; 0; )
 #endif

 static inline void

Interestingly, neither my compiler nor kbuild-bot's compiler doesn't
catch this.

Thanks.

^ permalink raw reply related

* [Patch net] net_sched: fix a compiler warning for tcf_exts_for_each_action()
From: Cong Wang @ 2018-08-22 19:40 UTC (permalink / raw)
  To: netdev; +Cc: sfr, Cong Wang

When CONFIG_NET_CLS_ACT=n, tcf_exts_for_each_action()
is a nop, which leaves its parameters unused. Shut up
the compiler warning by casting them to void.

Fixes: 244cd96adb5f ("net_sched: remove list_head from tc_action")
Reported-by: Stephen Rothwell <sfr@canb.auug.org.au>
Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
---
 include/net/pkt_cls.h | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/include/net/pkt_cls.h b/include/net/pkt_cls.h
index c17d51865469..9ec471ffaa5d 100644
--- a/include/net/pkt_cls.h
+++ b/include/net/pkt_cls.h
@@ -303,7 +303,7 @@ static inline void tcf_exts_put_net(struct tcf_exts *exts)
 	for (i = 0; i < TCA_ACT_MAX_PRIO && ((a) = (exts)->actions[i]); i++)
 #else
 #define tcf_exts_for_each_action(i, a, exts) \
-	for (; 0; )
+	for ((void)i, (void)a; 0; )
 #endif
 
 static inline void
-- 
2.14.4

^ permalink raw reply related

* [Patch net] addrconf: reduce unnecessary atomic allocations
From: Cong Wang @ 2018-08-22 19:58 UTC (permalink / raw)
  To: netdev; +Cc: Cong Wang, David Ahern

All the 3 callers of addrconf_add_mroute() assert RTNL
lock, they don't take any additional lock either, so
it is safe to convert it to GFP_KERNEL.

Same for sit_add_v4_addrs().

Cc: David Ahern <dsahern@gmail.com>
Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
---
 net/ipv6/addrconf.c | 6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/net/ipv6/addrconf.c b/net/ipv6/addrconf.c
index 2fac4ad74867..d51a8c0b3372 100644
--- a/net/ipv6/addrconf.c
+++ b/net/ipv6/addrconf.c
@@ -2398,7 +2398,7 @@ static void addrconf_add_mroute(struct net_device *dev)
 
 	ipv6_addr_set(&cfg.fc_dst, htonl(0xFF000000), 0, 0, 0);
 
-	ip6_route_add(&cfg, GFP_ATOMIC, NULL);
+	ip6_route_add(&cfg, GFP_KERNEL, NULL);
 }
 
 static struct inet6_dev *addrconf_add_dev(struct net_device *dev)
@@ -3062,7 +3062,7 @@ static void sit_add_v4_addrs(struct inet6_dev *idev)
 	if (addr.s6_addr32[3]) {
 		add_addr(idev, &addr, plen, scope);
 		addrconf_prefix_route(&addr, plen, 0, idev->dev, 0, pflags,
-				      GFP_ATOMIC);
+				      GFP_KERNEL);
 		return;
 	}
 
@@ -3087,7 +3087,7 @@ static void sit_add_v4_addrs(struct inet6_dev *idev)
 
 				add_addr(idev, &addr, plen, flag);
 				addrconf_prefix_route(&addr, plen, 0, idev->dev,
-						      0, pflags, GFP_ATOMIC);
+						      0, pflags, GFP_KERNEL);
 			}
 		}
 	}
-- 
2.14.4

^ permalink raw reply related

* Re: [Patch net] addrconf: reduce unnecessary atomic allocations
From: David Ahern @ 2018-08-22 20:09 UTC (permalink / raw)
  To: Cong Wang, netdev; +Cc: David Ahern
In-Reply-To: <20180822195834.7217-1-xiyou.wangcong@gmail.com>

On 8/22/18 1:58 PM, Cong Wang wrote:
> All the 3 callers of addrconf_add_mroute() assert RTNL
> lock, they don't take any additional lock either, so
> it is safe to convert it to GFP_KERNEL.
> 
> Same for sit_add_v4_addrs().
> 
> Cc: David Ahern <dsahern@gmail.com>
> Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
> ---
>  net/ipv6/addrconf.c | 6 +++---
>  1 file changed, 3 insertions(+), 3 deletions(-)
> 

Not sure how I missed the double ASSERT_RTNL() check for
sit_add_v4_addrs. Thanks for following up.

Reviewed-by: David Ahern <dsahern@gmail.com>

^ permalink raw reply

* [PATCH net] ipv4: tcp: send zero IPID for RST and ACK sent in SYN-RECV and TIME-WAIT state
From: Eric Dumazet @ 2018-08-22 20:30 UTC (permalink / raw)
  To: David S . Miller; +Cc: netdev, Geoff Alexander, Eric Dumazet, Eric Dumazet

tcp uses per-cpu (and per namespace) sockets (net->ipv4.tcp_sk) internally
to send some control packets.

1) RST packets, through tcp_v4_send_reset()
2) ACK packets in SYN-RECV and TIME-WAIT state, through tcp_v4_send_ack()

These packets assert IP_DF, and also use the hashed IP ident generator
to provide an IPv4 ID number.

Geoff Alexander reported this could be used to build off-path attacks.

These packets should not be fragmented, since their size is smaller than
IPV4_MIN_MTU. Only some tunneled paths could eventually have to fragment,
regardless of inner IPID.

We really can use zero IPID, to address the flaw, and as a bonus,
avoid a couple of atomic operations in ip_idents_reserve()

Signed-off-by: Eric Dumazet <edumazet@google.com>
Reported-by: Geoff Alexander <alexandg@cs.unm.edu>
Tested-by: Geoff Alexander <alexandg@cs.unm.edu>
---
 net/ipv4/tcp_ipv4.c | 6 ++++++
 1 file changed, 6 insertions(+)

diff --git a/net/ipv4/tcp_ipv4.c b/net/ipv4/tcp_ipv4.c
index 9e041fa5c545367961f03fa8a9124aebbc1b6c69..44c09eddbb781c03da2417aaa925e360de01a6e9 100644
--- a/net/ipv4/tcp_ipv4.c
+++ b/net/ipv4/tcp_ipv4.c
@@ -2517,6 +2517,12 @@ static int __net_init tcp_sk_init(struct net *net)
 		if (res)
 			goto fail;
 		sock_set_flag(sk, SOCK_USE_WRITE_QUEUE);
+
+		/* Please enforce IP_DF and IPID==0 for RST and
+		 * ACK sent in SYN-RECV and TIME-WAIT state.
+		 */
+		inet_sk(sk)->pmtudisc = IP_PMTUDISC_DO;
+
 		*per_cpu_ptr(net->ipv4.tcp_sk, cpu) = sk;
 	}

-- 
2.18.0.1017.ga543ac7ca45-goog

^ permalink raw reply related

* [PATCH] sh_eth: Add R7S9210 support
From: Chris Brandt @ 2018-08-22 20:57 UTC (permalink / raw)
  To: Sergei Shtylyov, David S . Miller, Rob Herring, Mark Rutland
  Cc: netdev, devicetree, linux-renesas-soc, Simon Horman, Chris Brandt

Add support for the R7S9210 which is part of the RZ/A2 series.

Signed-off-by: Chris Brandt <chris.brandt@renesas.com>
---
 Documentation/devicetree/bindings/net/sh_eth.txt |   1 +
 drivers/net/ethernet/renesas/sh_eth.c            | 107 +++++++++++++++++++++++
 drivers/net/ethernet/renesas/sh_eth.h            |   1 +
 3 files changed, 109 insertions(+)

diff --git a/Documentation/devicetree/bindings/net/sh_eth.txt b/Documentation/devicetree/bindings/net/sh_eth.txt
index 76db9f13ad96..abc36274227c 100644
--- a/Documentation/devicetree/bindings/net/sh_eth.txt
+++ b/Documentation/devicetree/bindings/net/sh_eth.txt
@@ -16,6 +16,7 @@ Required properties:
 	      "renesas,ether-r8a7794"  if the device is a part of R8A7794 SoC.
 	      "renesas,gether-r8a77980" if the device is a part of R8A77980 SoC.
 	      "renesas,ether-r7s72100" if the device is a part of R7S72100 SoC.
+	      "renesas,ether-r7s9210" if the device is a part of R7S9210 SoC.
 	      "renesas,rcar-gen1-ether" for a generic R-Car Gen1 device.
 	      "renesas,rcar-gen2-ether" for a generic R-Car Gen2 or RZ/G1
 	                                device.
diff --git a/drivers/net/ethernet/renesas/sh_eth.c b/drivers/net/ethernet/renesas/sh_eth.c
index 5573199c4536..f3a575e56ae4 100644
--- a/drivers/net/ethernet/renesas/sh_eth.c
+++ b/drivers/net/ethernet/renesas/sh_eth.c
@@ -216,6 +216,59 @@ static const u16 sh_eth_offset_fast_rz[SH_ETH_MAX_REGISTER_OFFSET] = {
 	[RXALCR0]	= 0x008C,
 };
 
+static const u16 sh_eth_offset_fast_rza2[SH_ETH_MAX_REGISTER_OFFSET] = {
+	SH_ETH_OFFSET_DEFAULTS,
+
+	[EDMR]		= 0x0000,
+	[EDTRR]		= 0x0008,
+	[EDRRR]		= 0x0010,
+	[EESR]		= 0x0028,
+	[EESIPR]	= 0x0030,
+	[TDLAR]		= 0x0018,
+	[TBRAR]		= 0x00d4,
+	[TDFAR]		= 0x00d8,
+	[RDLAR]		= 0x0020,
+	[RBWAR]		= 0x00c8,
+	[RDFAR]		= 0x00cc,
+	[TRSCER]	= 0x0038,
+	[RMFCR]		= 0x0040,
+	[TFTR]		= 0x0048,
+	[FDR]		= 0x0050,
+	[RMCR]		= 0x0058,
+	[TFUCR]		= 0x0064,
+	[RFOCR]		= 0x0068,
+	[RPADIR]	= 0x0078,
+	[TRIMD]		= 0x007c,
+	[FCFTR]		= 0x0070,
+
+	[ECMR]		= 0x0100,
+	[RFLR]		= 0x0108,
+	[ECSR]		= 0x0110,
+	[ECSIPR]	= 0x0118,
+	[PIR]		= 0x0120,
+	[PSR]		= 0x0128,
+	[RDMLR]		= 0x0140,
+	[IPGR]		= 0x0150,
+	[APR]		= 0x0154,
+	[MPR]		= 0x0158,
+	[RFCF]		= 0x0160,
+	[TPAUSER]	= 0x0164,
+	[TPAUSECR]	= 0x0168,
+	[BCFRR]		= 0x016c,
+	[MAHR]		= 0x01c0,
+	[MALR]		= 0x01c8,
+	[TROCR]		= 0x01d0,
+	[CDCR]		= 0x01d4,
+	[LCCR]		= 0x01d8,
+	[CNDCR]		= 0x01dc,
+	[CEFCR]		= 0x01e4,
+	[FRECR]		= 0x01e8,
+	[TSFRCR]	= 0x01ec,
+	[TLFRCR]	= 0x01f0,
+	[RFCR]		= 0x01f4,
+	[MAFCR]		= 0x01f8,
+};
+
 static const u16 sh_eth_offset_fast_rcar[SH_ETH_MAX_REGISTER_OFFSET] = {
 	SH_ETH_OFFSET_DEFAULTS,
 
@@ -635,6 +688,56 @@ static struct sh_eth_cpu_data r7s72100_data = {
 	.no_tx_cntrs	= 1,
 };
 
+/* R7S9210 */
+static void sh_eth_set_rate_r7s9210(struct net_device *ndev)
+{
+	struct sh_eth_private *mdp = netdev_priv(ndev);
+	unsigned long RTM = 0x00000004;
+
+	switch (mdp->speed) {
+	case 10: /* 10BASE */
+		sh_eth_modify(ndev, ECMR, RTM, 0);
+		break;
+	case 100:/* 100BASE */
+		sh_eth_modify(ndev, ECMR, RTM, RTM);
+		break;
+	}
+}
+
+static struct sh_eth_cpu_data r7s9210_data = {
+	.soft_reset	= sh_eth_soft_reset,
+
+	.set_duplex	= sh_eth_set_duplex,
+	.set_rate	= sh_eth_set_rate_r7s9210,
+
+	.register_type	= SH_ETH_REG_FAST_RZA2,
+
+	.edtrr_trns	= EDTRR_TRNS_ETHER,
+	.ecsr_value	= ECSR_ICD,
+	.ecsipr_value	= ECSIPR_ICDIP,
+	.eesipr_value	= EESIPR_TWBIP | EESIPR_TABTIP | EESIPR_RABTIP |
+			  EESIPR_RFCOFIP | EESIPR_ECIIP | EESIPR_FTCIP |
+			  EESIPR_TDEIP | EESIPR_TFUFIP | EESIPR_FRIP |
+			  EESIPR_RDEIP | EESIPR_RFOFIP | EESIPR_CNDIP |
+			  EESIPR_DLCIP | EESIPR_CDIP | EESIPR_TROIP |
+			  EESIPR_RMAFIP | EESIPR_RRFIP | EESIPR_RTLFIP |
+			  EESIPR_RTSFIP | EESIPR_PREIP | EESIPR_CERFIP,
+
+	.tx_check	= EESR_FTC | EESR_CND | EESR_DLC | EESR_CD | EESR_TRO,
+	.eesr_err_check	= EESR_TWB | EESR_TABT | EESR_RABT | EESR_RFE |
+			  EESR_RDE | EESR_RFRMER | EESR_TFE | EESR_TDE,
+
+	.fdr_value	= 0x0000070f,
+
+	.apr		= 1,
+	.mpr		= 1,
+	.tpauser	= 1,
+	.hw_swap	= 1,
+	.rpadir		= 1,
+	.no_ade		= 1,
+	.xdfar_rw	= 1,
+};
+
 static void sh_eth_chip_reset_r8a7740(struct net_device *ndev)
 {
 	sh_eth_chip_reset(ndev);
@@ -3053,6 +3156,9 @@ static const u16 *sh_eth_get_register_offset(int register_type)
 	case SH_ETH_REG_FAST_RZ:
 		reg_offset = sh_eth_offset_fast_rz;
 		break;
+	case SH_ETH_REG_FAST_RZA2:
+		reg_offset = sh_eth_offset_fast_rza2;
+		break;
 	case SH_ETH_REG_FAST_RCAR:
 		reg_offset = sh_eth_offset_fast_rcar;
 		break;
@@ -3132,6 +3238,7 @@ static const struct of_device_id sh_eth_match_table[] = {
 	{ .compatible = "renesas,ether-r8a7794", .data = &rcar_gen2_data },
 	{ .compatible = "renesas,gether-r8a77980", .data = &r8a77980_data },
 	{ .compatible = "renesas,ether-r7s72100", .data = &r7s72100_data },
+	{ .compatible = "renesas,ether-r7s9210", .data = &r7s9210_data },
 	{ .compatible = "renesas,rcar-gen1-ether", .data = &rcar_gen1_data },
 	{ .compatible = "renesas,rcar-gen2-ether", .data = &rcar_gen2_data },
 	{ }
diff --git a/drivers/net/ethernet/renesas/sh_eth.h b/drivers/net/ethernet/renesas/sh_eth.h
index f94be99cf400..e501f1958cec 100644
--- a/drivers/net/ethernet/renesas/sh_eth.h
+++ b/drivers/net/ethernet/renesas/sh_eth.h
@@ -157,6 +157,7 @@ enum {
 enum {
 	SH_ETH_REG_GIGABIT,
 	SH_ETH_REG_FAST_RZ,
+	SH_ETH_REG_FAST_RZA2,
 	SH_ETH_REG_FAST_RCAR,
 	SH_ETH_REG_FAST_SH4,
 	SH_ETH_REG_FAST_SH3_SH2
-- 
2.16.1

^ permalink raw reply related

* Re: [PATCH] strparser: remove any offset before parsing messages
From: Dominique Martinet @ 2018-08-23  1:04 UTC (permalink / raw)
  To: Dave Watson
  Cc: Doron Roberts-Kedes, Tom Herbert, David S. Miller, netdev,
	linux-kernel
In-Reply-To: <20180822183852.jnwlxnz54gbbf6po@davejwatson-mba.dhcp.thefacebook.com>

Dave Watson wrote on Wed, Aug 22, 2018:
> > I've tried measuring that overhead as well by writing a more complex bpf
> > program that would fetch the offset in the skb but for some reason I'm
> > reading a 0 offset when it's not zero... well, not like there's much
> > choice for this at this point anyway; I don't think we'll do this
> > without pull, I'll put that on background.
> 
> For what it is worth we checked the offset in bpf, something
> along the lines of

Oh, thanks!

> 	  > struct kcm_rx_msg {   int full_len;  int offset;};
> static inline struct kcm_rx_msg *kcm_rx_msg(struct __sk_buff *skb)
>       { return (struct kcm_rx_msg *)skb->cb;}
> 
> int decode_framing(struct __sk_buff *skb)
> { return load_word(skb, kcm_rx_msg(skb)->offset);}

So you're taking directly the address at skb->cb but the linux code has
this function:
static inline struct strp_msg *strp_msg(struct sk_buff *skb)
{
        return (struct strp_msg *)((void *)skb->cb +
                offsetof(struct qdisc_skb_cb, data));
}
and qdisc_skb_cb.data is another 8 bytes in, that would explain I had
different results (and now I'm trying your snippet it does work), but
I'll have to admit I fail to understand this....

Ok, so 'cb' in __sk_buff is 48 bytes in but 'cb' in sk_buff is 40 bytes
in -- I might just start getting annoyed over this, is there a reason
for the different offset?!

> Although it did puzzle me for a while figuring that out when I ran in
> to it.

Well, at least it means some people were aware of the problem and worked
around it in their own way -- what do you think of pulling instead?
I mean, we could just document that "really well" and provide the
get-offset function in some header that would be made include-able from
bpf.. But right now this isn't really the case.

FWIW now I have this version I also don't notice any performance change
with the pull on my example, it actually looks like the bpf load_word is
slightly slower than pull to access data that is not in the head, but
the noise level is pretty bad.

Thanks,
-- 
Dominique

^ permalink raw reply

* [PATCH ipsec-next] xfrm: allow driver to quietly refuse offload
From: Shannon Nelson @ 2018-08-22 21:38 UTC (permalink / raw)
  To: steffen.klassert; +Cc: netdev

If the "offload" attribute is used to create an IPsec SA
and the .xdo_dev_state_add() fails, the SA creation fails.
However, if the "offload" attribute is used on a device that
doesn't offer it, the attribute is quietly ignored and the SA
is created without an offload.

Along the same line of that second case, it would be good to
have a way for the device to refuse to offload an SA without
failing the whole SA creation.  This patch adds that feature
by allowing the driver to return -EOPNOTSUPP as a signal that
the SA may be fine, it just can't be offloaded.

This allows the user a little more flexibility in requesting
offloads and not needing to know every detail at all times about
each specific NIC when trying to create SAs.

Signed-off-by: Shannon Nelson <shannon.nelson@oracle.com>
---

More specifically, this will help one user experience issue
with the coming ixgbevf IPsec offload.

 Documentation/networking/xfrm_device.txt | 4 ++++
 net/xfrm/xfrm_device.c                   | 6 +++++-
 2 files changed, 9 insertions(+), 1 deletion(-)

diff --git a/Documentation/networking/xfrm_device.txt b/Documentation/networking/xfrm_device.txt
index 50c34ca..267f55b 100644
--- a/Documentation/networking/xfrm_device.txt
+++ b/Documentation/networking/xfrm_device.txt
@@ -68,6 +68,10 @@ and an indication of whether it is for Rx or Tx.  The driver should
 	- verify the algorithm is supported for offloads
 	- store the SA information (key, salt, target-ip, protocol, etc)
 	- enable the HW offload of the SA
+	- return status value:
+		0             success
+		-EOPNETSUPP   offload not supported, try SW IPsec
+		other         fail the request

 The driver can also set an offload_handle in the SA, an opaque void pointer
 that can be used to convey context into the fast-path offload requests.
diff --git a/net/xfrm/xfrm_device.c b/net/xfrm/xfrm_device.c
index 5611b75..3a1d9d6 100644
--- a/net/xfrm/xfrm_device.c
+++ b/net/xfrm/xfrm_device.c
@@ -192,9 +192,13 @@ int xfrm_dev_state_add(struct net *net, struct xfrm_state *x,

 	err = dev->xfrmdev_ops->xdo_dev_state_add(x);
 	if (err) {
+		xso->num_exthdrs = 0;
+		xso->flags = 0;
 		xso->dev = NULL;
 		dev_put(dev);
-		return err;
+
+		if (err != -EOPNOTSUPP)
+			return err;
 	}

 	return 0;
-- 
2.7.4

^ permalink raw reply related

* [PATCH net 0/3] tcp_bbr: PROBE_RTT minor bug fixes
From: Kevin Yang @ 2018-08-22 21:43 UTC (permalink / raw)
  To: David Miller; +Cc: netdev, Kevin(Yudong) Yang

From: "Kevin(Yudong) Yang" <yyd@google.com>

This series includes two minor bug fixes for the TCP BBR PROBE_RTT
mechanism, and one preparatory patch:

(1) A preparatory patch to reorganize the PROBE_RTT logic by refactoring
    (into its own function) the code to exit PROBE_RTT, since the next
    patch will be using that code in a new context.

(2) Fix: When BBR restarts from idle and if BBR is in PROBE_RTT mode,
    BBR should check if it's time to exit PROBE_RTT. If yes, then BBR
    should exit PROBE_RTT mode and restore the cwnd to its full value.

(3) Fix: Apply the PROBE_RTT cwnd cap even if the count of fully-ACKed
    packets is 0.

Kevin Yang (3):
  tcp_bbr: add bbr_check_probe_rtt_done() helper
  tcp_bbr: in restart from idle, see if we should exit PROBE_RTT
  tcp_bbr: apply PROBE_RTT cwnd cap even if acked==0

 net/ipv4/tcp_bbr.c | 42 ++++++++++++++++++++++++------------------
 1 file changed, 24 insertions(+), 18 deletions(-)

-- 
2.18.0.1017.ga543ac7ca45-goog

^ permalink raw reply

* [PATCH net 1/3] tcp_bbr: add bbr_check_probe_rtt_done() helper
From: Kevin Yang @ 2018-08-22 21:43 UTC (permalink / raw)
  To: David Miller; +Cc: netdev, Kevin Yang, Neal Cardwell, Yuchung Cheng
In-Reply-To: <20180822214316.174161-1-yyd@google.com>

This patch add a helper function bbr_check_probe_rtt_done() to
  1. check the condition to see if bbr should exit probe_rtt mode;
  2. process the logic of exiting probe_rtt mode.

Fixes: 0f8782ea1497 ("tcp_bbr: add BBR congestion control")
Signed-off-by: Kevin Yang <yyd@google.com>
Signed-off-by: Neal Cardwell <ncardwell@google.com>
Signed-off-by: Yuchung Cheng <ycheng@google.com>
Reviewed-by: Soheil Hassas Yeganeh <soheil@google.com>
---
 net/ipv4/tcp_bbr.c | 34 ++++++++++++++++++----------------
 1 file changed, 18 insertions(+), 16 deletions(-)

diff --git a/net/ipv4/tcp_bbr.c b/net/ipv4/tcp_bbr.c
index 13d34427ca3dd..fd7bccf36a263 100644
--- a/net/ipv4/tcp_bbr.c
+++ b/net/ipv4/tcp_bbr.c
@@ -95,11 +95,10 @@ struct bbr {
 	u32     mode:3,		     /* current bbr_mode in state machine */
 		prev_ca_state:3,     /* CA state on previous ACK */
 		packet_conservation:1,  /* use packet conservation? */
-		restore_cwnd:1,	     /* decided to revert cwnd to old value */
 		round_start:1,	     /* start of packet-timed tx->ack round? */
 		idle_restart:1,	     /* restarting after idle? */
 		probe_rtt_round_done:1,  /* a BBR_PROBE_RTT round at 4 pkts? */
-		unused:12,
+		unused:13,
 		lt_is_sampling:1,    /* taking long-term ("LT") samples now? */
 		lt_rtt_cnt:7,	     /* round trips in long-term interval */
 		lt_use_bw:1;	     /* use lt_bw as our bw estimate? */
@@ -396,17 +395,11 @@ static bool bbr_set_cwnd_to_recover_or_restore(
 		cwnd = tcp_packets_in_flight(tp) + acked;
 	} else if (prev_state >= TCP_CA_Recovery && state < TCP_CA_Recovery) {
 		/* Exiting loss recovery; restore cwnd saved before recovery. */
-		bbr->restore_cwnd = 1;
+		cwnd = max(cwnd, bbr->prior_cwnd);
 		bbr->packet_conservation = 0;
 	}
 	bbr->prev_ca_state = state;
 
-	if (bbr->restore_cwnd) {
-		/* Restore cwnd after exiting loss recovery or PROBE_RTT. */
-		cwnd = max(cwnd, bbr->prior_cwnd);
-		bbr->restore_cwnd = 0;
-	}
-
 	if (bbr->packet_conservation) {
 		*new_cwnd = max(cwnd, tcp_packets_in_flight(tp) + acked);
 		return true;	/* yes, using packet conservation */
@@ -748,6 +741,20 @@ static void bbr_check_drain(struct sock *sk, const struct rate_sample *rs)
 		bbr_reset_probe_bw_mode(sk);  /* we estimate queue is drained */
 }
 
+static void bbr_check_probe_rtt_done(struct sock *sk)
+{
+	struct tcp_sock *tp = tcp_sk(sk);
+	struct bbr *bbr = inet_csk_ca(sk);
+
+	if (!(bbr->probe_rtt_done_stamp &&
+	      after(tcp_jiffies32, bbr->probe_rtt_done_stamp)))
+		return;
+
+	bbr->min_rtt_stamp = tcp_jiffies32;  /* wait a while until PROBE_RTT */
+	tp->snd_cwnd = max(tp->snd_cwnd, bbr->prior_cwnd);
+	bbr_reset_mode(sk);
+}
+
 /* The goal of PROBE_RTT mode is to have BBR flows cooperatively and
  * periodically drain the bottleneck queue, to converge to measure the true
  * min_rtt (unloaded propagation delay). This allows the flows to keep queues
@@ -806,12 +813,8 @@ static void bbr_update_min_rtt(struct sock *sk, const struct rate_sample *rs)
 		} else if (bbr->probe_rtt_done_stamp) {
 			if (bbr->round_start)
 				bbr->probe_rtt_round_done = 1;
-			if (bbr->probe_rtt_round_done &&
-			    after(tcp_jiffies32, bbr->probe_rtt_done_stamp)) {
-				bbr->min_rtt_stamp = tcp_jiffies32;
-				bbr->restore_cwnd = 1;  /* snap to prior_cwnd */
-				bbr_reset_mode(sk);
-			}
+			if (bbr->probe_rtt_round_done)
+				bbr_check_probe_rtt_done(sk);
 		}
 	}
 	/* Restart after idle ends only once we process a new S/ACK for data */
@@ -862,7 +865,6 @@ static void bbr_init(struct sock *sk)
 	bbr->has_seen_rtt = 0;
 	bbr_init_pacing_rate_from_rtt(sk);
 
-	bbr->restore_cwnd = 0;
 	bbr->round_start = 0;
 	bbr->idle_restart = 0;
 	bbr->full_bw_reached = 0;
-- 
2.18.0.1017.ga543ac7ca45-goog

^ permalink raw reply related

* [PATCH net 3/3] tcp_bbr: apply PROBE_RTT cwnd cap even if acked==0
From: Kevin Yang @ 2018-08-22 21:43 UTC (permalink / raw)
  To: David Miller; +Cc: netdev, Kevin Yang, Neal Cardwell
In-Reply-To: <20180822214316.174161-1-yyd@google.com>

This commit fixes a corner case where TCP BBR would enter PROBE_RTT
mode but not reduce its cwnd. If a TCP receiver ACKed less than one
full segment, the number of delivered/acked packets was 0, so that
bbr_set_cwnd() would short-circuit and exit early, without cutting
cwnd to the value we want for PROBE_RTT.

The fix is to instead make sure that even when 0 full packets are
ACKed, we do apply all the appropriate caps, including the cap that
applies in PROBE_RTT mode.

Fixes: 0f8782ea1497 ("tcp_bbr: add BBR congestion control")
Signed-off-by: Kevin Yang <yyd@google.com>
Signed-off-by: Neal Cardwell <ncardwell@google.com>
Reviewed-by: Yuchung Cheng <ycheng@google.com>
Reviewed-by: Soheil Hassas Yeganeh <soheil@google.com>
---
 net/ipv4/tcp_bbr.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/net/ipv4/tcp_bbr.c b/net/ipv4/tcp_bbr.c
index 1d4bdd3b5e4d0..02ff2dde96094 100644
--- a/net/ipv4/tcp_bbr.c
+++ b/net/ipv4/tcp_bbr.c
@@ -420,10 +420,10 @@ static void bbr_set_cwnd(struct sock *sk, const struct rate_sample *rs,
 {
 	struct tcp_sock *tp = tcp_sk(sk);
 	struct bbr *bbr = inet_csk_ca(sk);
-	u32 cwnd = 0, target_cwnd = 0;
+	u32 cwnd = tp->snd_cwnd, target_cwnd = 0;
 
 	if (!acked)
-		return;
+		goto done;  /* no packet fully ACKed; just apply caps */
 
 	if (bbr_set_cwnd_to_recover_or_restore(sk, rs, acked, &cwnd))
 		goto done;
-- 
2.18.0.1017.ga543ac7ca45-goog

^ permalink raw reply related

* [PATCH net 2/3] tcp_bbr: in restart from idle, see if we should exit PROBE_RTT
From: Kevin Yang @ 2018-08-22 21:43 UTC (permalink / raw)
  To: David Miller; +Cc: netdev, Kevin Yang, Neal Cardwell
In-Reply-To: <20180822214316.174161-1-yyd@google.com>

This patch fix the case where BBR does not exit PROBE_RTT mode when
it restarts from idle. When BBR restarts from idle and if BBR is in
PROBE_RTT mode, BBR should check if it's time to exit PROBE_RTT. If
yes, then BBR should exit PROBE_RTT mode and restore the cwnd to its
full value.

Fixes: 0f8782ea1497 ("tcp_bbr: add BBR congestion control")
Signed-off-by: Kevin Yang <yyd@google.com>
Signed-off-by: Neal Cardwell <ncardwell@google.com>
Reviewed-by: Yuchung Cheng <ycheng@google.com>
Reviewed-by: Soheil Hassas Yeganeh <soheil@google.com>
---
 net/ipv4/tcp_bbr.c | 4 ++++
 1 file changed, 4 insertions(+)

diff --git a/net/ipv4/tcp_bbr.c b/net/ipv4/tcp_bbr.c
index fd7bccf36a263..1d4bdd3b5e4d0 100644
--- a/net/ipv4/tcp_bbr.c
+++ b/net/ipv4/tcp_bbr.c
@@ -174,6 +174,8 @@ static const u32 bbr_lt_bw_diff = 4000 / 8;
 /* If we estimate we're policed, use lt_bw for this many round trips: */
 static const u32 bbr_lt_bw_max_rtts = 48;
 
+static void bbr_check_probe_rtt_done(struct sock *sk);
+
 /* Do we estimate that STARTUP filled the pipe? */
 static bool bbr_full_bw_reached(const struct sock *sk)
 {
@@ -308,6 +310,8 @@ static void bbr_cwnd_event(struct sock *sk, enum tcp_ca_event event)
 		 */
 		if (bbr->mode == BBR_PROBE_BW)
 			bbr_set_pacing_rate(sk, bbr_bw(sk), BBR_UNIT);
+		else if (bbr->mode == BBR_PROBE_RTT)
+			bbr_check_probe_rtt_done(sk);
 	}
 }
 
-- 
2.18.0.1017.ga543ac7ca45-goog

^ permalink raw reply related

* [PATCH bpf] bpf: use per htab salt for bucket hash
From: Daniel Borkmann @ 2018-08-22 21:49 UTC (permalink / raw)
  To: alexei.starovoitov; +Cc: netdev, Daniel Borkmann

All BPF hash and LRU maps currently have a known and global seed
we feed into jhash() which is 0. This is suboptimal, thus fix it
by generating a random seed upon hashtab setup time which we can
later on feed into jhash() on lookup, update and deletions.

Fixes: 0f8e4bd8a1fc8 ("bpf: add hashtable type of eBPF maps")
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Alexei Starovoitov <ast@kernel.org>
---
 kernel/bpf/hashtab.c | 23 +++++++++++++----------
 1 file changed, 13 insertions(+), 10 deletions(-)

diff --git a/kernel/bpf/hashtab.c b/kernel/bpf/hashtab.c
index 04b8eda..03cc59e 100644
--- a/kernel/bpf/hashtab.c
+++ b/kernel/bpf/hashtab.c
@@ -15,6 +15,7 @@
 #include <linux/jhash.h>
 #include <linux/filter.h>
 #include <linux/rculist_nulls.h>
+#include <linux/random.h>
 #include <uapi/linux/btf.h>
 #include "percpu_freelist.h"
 #include "bpf_lru_list.h"
@@ -41,6 +42,7 @@ struct bpf_htab {
 	atomic_t count;	/* number of elements in this hashtable */
 	u32 n_buckets;	/* number of hash buckets */
 	u32 elem_size;	/* size of each element in bytes */
+	u32 hashrnd;
 };
 
 /* each htab element is struct htab_elem + key + value */
@@ -371,6 +373,7 @@ static struct bpf_map *htab_map_alloc(union bpf_attr *attr)
 	if (!htab->buckets)
 		goto free_htab;
 
+	htab->hashrnd = get_random_int();
 	for (i = 0; i < htab->n_buckets; i++) {
 		INIT_HLIST_NULLS_HEAD(&htab->buckets[i].head, i);
 		raw_spin_lock_init(&htab->buckets[i].lock);
@@ -402,9 +405,9 @@ static struct bpf_map *htab_map_alloc(union bpf_attr *attr)
 	return ERR_PTR(err);
 }
 
-static inline u32 htab_map_hash(const void *key, u32 key_len)
+static inline u32 htab_map_hash(const void *key, u32 key_len, u32 hashrnd)
 {
-	return jhash(key, key_len, 0);
+	return jhash(key, key_len, hashrnd);
 }
 
 static inline struct bucket *__select_bucket(struct bpf_htab *htab, u32 hash)
@@ -470,7 +473,7 @@ static void *__htab_map_lookup_elem(struct bpf_map *map, void *key)
 
 	key_size = map->key_size;
 
-	hash = htab_map_hash(key, key_size);
+	hash = htab_map_hash(key, key_size, htab->hashrnd);
 
 	head = select_bucket(htab, hash);
 
@@ -597,7 +600,7 @@ static int htab_map_get_next_key(struct bpf_map *map, void *key, void *next_key)
 	if (!key)
 		goto find_first_elem;
 
-	hash = htab_map_hash(key, key_size);
+	hash = htab_map_hash(key, key_size, htab->hashrnd);
 
 	head = select_bucket(htab, hash);
 
@@ -824,7 +827,7 @@ static int htab_map_update_elem(struct bpf_map *map, void *key, void *value,
 
 	key_size = map->key_size;
 
-	hash = htab_map_hash(key, key_size);
+	hash = htab_map_hash(key, key_size, htab->hashrnd);
 
 	b = __select_bucket(htab, hash);
 	head = &b->head;
@@ -880,7 +883,7 @@ static int htab_lru_map_update_elem(struct bpf_map *map, void *key, void *value,
 
 	key_size = map->key_size;
 
-	hash = htab_map_hash(key, key_size);
+	hash = htab_map_hash(key, key_size, htab->hashrnd);
 
 	b = __select_bucket(htab, hash);
 	head = &b->head;
@@ -945,7 +948,7 @@ static int __htab_percpu_map_update_elem(struct bpf_map *map, void *key,
 
 	key_size = map->key_size;
 
-	hash = htab_map_hash(key, key_size);
+	hash = htab_map_hash(key, key_size, htab->hashrnd);
 
 	b = __select_bucket(htab, hash);
 	head = &b->head;
@@ -998,7 +1001,7 @@ static int __htab_lru_percpu_map_update_elem(struct bpf_map *map, void *key,
 
 	key_size = map->key_size;
 
-	hash = htab_map_hash(key, key_size);
+	hash = htab_map_hash(key, key_size, htab->hashrnd);
 
 	b = __select_bucket(htab, hash);
 	head = &b->head;
@@ -1071,7 +1074,7 @@ static int htab_map_delete_elem(struct bpf_map *map, void *key)
 
 	key_size = map->key_size;
 
-	hash = htab_map_hash(key, key_size);
+	hash = htab_map_hash(key, key_size, htab->hashrnd);
 	b = __select_bucket(htab, hash);
 	head = &b->head;
 
@@ -1103,7 +1106,7 @@ static int htab_lru_map_delete_elem(struct bpf_map *map, void *key)
 
 	key_size = map->key_size;
 
-	hash = htab_map_hash(key, key_size);
+	hash = htab_map_hash(key, key_size, htab->hashrnd);
 	b = __select_bucket(htab, hash);
 	head = &b->head;
 
-- 
2.9.5

^ permalink raw reply related

page: next (older) | prev (newer) | latest
- recent:[subjects (threaded)|topics (new)|topics (active)]

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox