Netdev List
 help / color / mirror / Atom feed
* Re: [PATCH] 8139too: Use disable_irq_nosync() in rtl8139_poll_controller()
From: David Miller @ 2018-05-02 17:22 UTC (permalink / raw)
  To: bigeasy; +Cc: netdev, mingo, tglx
In-Reply-To: <20180502113057.18209-1-bigeasy@linutronix.de>

From: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Date: Wed,  2 May 2018 13:30:57 +0200

> From: Ingo Molnar <mingo@elte.hu>
> 
> Use disable_irq_nosync() instead of disable_irq() as this might be
> called in atomic context with netpoll.
> 
> Signed-off-by: Ingo Molnar <mingo@elte.hu>
> Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
> Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>

Applied and queued up for -stable, thanks.

^ permalink raw reply

* Re: [PATCH bpf-next 0/3] bpf: cleanups on managing subprog information
From: John Fastabend @ 2018-05-02 17:24 UTC (permalink / raw)
  To: Jiong Wang, Alexei Starovoitov; +Cc: borkmann, ecree, netdev, oss-drivers
In-Reply-To: <fa61c0ca-db3d-bc7b-f6ac-521335f539ed@netronome.com>

On 05/02/2018 09:59 AM, Jiong Wang wrote:
> On 01/05/2018 23:22, Alexei Starovoitov wrote:
> ...
>> [   27.784931]  ? bpf_int_jit_compile+0x7ac/0xab0
>> [   27.785475]  bpf_int_jit_compile+0x2b6/0xab0
>> [   27.786001]  ? do_jit+0x6020/0x6020
>> [   27.786428]  ? kasan_kmalloc+0xa0/0xd0
>> [   27.786885]  bpf_check+0x2c05/0x4c40
>> [   27.787346]  ? fixup_bpf_calls+0x1140/0x1140
>> [   27.787865]  ? kasan_unpoison_shadow+0x30/0x40
>> [   27.788406]  ? kasan_kmalloc+0xa0/0xd0
>> [   27.788865]  ? memset+0x1f/0x40
>> [   27.789255]  ? bpf_obj_name_cpy+0x2d/0x200
>> [   27.789750]  bpf_prog_load+0xb07/0xeb0
>>
>> simply running test_verifier with JIT and kasan on.
> 
> Ah, sorry, I should add "sysctl net/core/bpf_jit_enable=1" to my test
> script, error reproduced.
> 
> convert_ctx_accesses and fixup_bpf_calls might insert ebpf insns that
> prog->len would change.
> 
> The new fake "exit" subprog whose .start offset is prog->len should be
> updated as well.
> 
> The "for" condition in adjust_subprog_starts:
> 
>   for (i = 0; i < env->subprog_cnt; i++) {
> 
> need to be changed into:
> 
>   for (i = 0; i <= env->subprog_cnt; i++) {
> 
> Will respin the patch set.
> 
> Thanks.
> 
> Regards,
> Jiong
> 

Also a bit of a nit, but if you are doing a respin. How about
consider renaming BPF_MAX_SUBPROGS -> BPF_MAX_PROGS. It will
make the naming more accurate and also avoid some diffs below
where changing '>=' to '>' is required.

@@ -191,7 +191,7 @@ struct bpf_verifier_env {
 	bool seen_direct_write;
 	struct bpf_insn_aux_data *insn_aux_data; /* array of per-insn state */
 	struct bpf_verifier_log log;
-	u32 subprog_starts[BPF_MAX_SUBPROGS];
+	u32 subprog_starts[BPF_MAX_SUBPROGS + 1];
 	/* computes the stack depth of each bpf function */
 	u16 subprog_stack_depth[BPF_MAX_SUBPROGS + 1];
 	u32 subprog_cnt;

^ permalink raw reply

* [RFC] inet: add bound ports statistic
From: Stephen Hemminger @ 2018-05-02 17:25 UTC (permalink / raw)
  To: netdev; +Cc: Stephen Hemminger, Stephen Hemminger

This adds a number of bound ports which fixes socket summary
command.  The ss -s has been broken since changes to slab info
and this is one way to recover the missing value by adding a
field onto /proc/net/sockstat.

Since this is an informational value only, there is no need
for locking.


Signed-off-by: Stephen Hemminger <stephen@networkplumber.org>
---
 include/net/inet_hashtables.h    |  3 +++
 include/net/inet_timewait_sock.h |  2 ++
 net/dccp/proto.c                 |  1 +
 net/ipv4/inet_connection_sock.c  |  1 +
 net/ipv4/inet_hashtables.c       | 22 +++++++++++++++++++---
 net/ipv4/inet_timewait_sock.c    |  8 +++++---
 net/ipv4/proc.c                  |  5 +++--
 net/ipv4/tcp.c                   |  1 +
 8 files changed, 35 insertions(+), 8 deletions(-)

diff --git a/include/net/inet_hashtables.h b/include/net/inet_hashtables.h
index 9141e95529e7..dc74f7af4446 100644
--- a/include/net/inet_hashtables.h
+++ b/include/net/inet_hashtables.h
@@ -103,6 +103,7 @@ static inline struct net *ib_net(struct inet_bind_bucket *ib)
 
 struct inet_bind_hashbucket {
 	spinlock_t		lock;
+	int			count;
 	struct hlist_head	chain;
 };
 
@@ -193,7 +194,9 @@ inet_bind_bucket_create(struct kmem_cache *cachep, struct net *net,
 			struct inet_bind_hashbucket *head,
 			const unsigned short snum);
 void inet_bind_bucket_destroy(struct kmem_cache *cachep,
+			      struct inet_bind_hashbucket *head,
 			      struct inet_bind_bucket *tb);
+int inet_bind_bucket_count(struct proto *prot);
 
 static inline u32 inet_bhashfn(const struct net *net, const __u16 lport,
 			       const u32 bhash_size)
diff --git a/include/net/inet_timewait_sock.h b/include/net/inet_timewait_sock.h
index c7be1ca8e562..4cdb8034ad80 100644
--- a/include/net/inet_timewait_sock.h
+++ b/include/net/inet_timewait_sock.h
@@ -87,7 +87,9 @@ static inline struct inet_timewait_sock *inet_twsk(const struct sock *sk)
 void inet_twsk_free(struct inet_timewait_sock *tw);
 void inet_twsk_put(struct inet_timewait_sock *tw);
 
+struct inet_bind_hashbucket;
 void inet_twsk_bind_unhash(struct inet_timewait_sock *tw,
+			   struct inet_bind_hashbucket *head,
 			   struct inet_hashinfo *hashinfo);
 
 struct inet_timewait_sock *inet_twsk_alloc(const struct sock *sk,
diff --git a/net/dccp/proto.c b/net/dccp/proto.c
index 84cd4e3fd01b..25f03e62cfea 100644
--- a/net/dccp/proto.c
+++ b/net/dccp/proto.c
@@ -1208,6 +1208,7 @@ static int __init dccp_init(void)
 	for (i = 0; i < dccp_hashinfo.bhash_size; i++) {
 		spin_lock_init(&dccp_hashinfo.bhash[i].lock);
 		INIT_HLIST_HEAD(&dccp_hashinfo.bhash[i].chain);
+		dccp_hashinfo.bhash[i].count = 0;
 	}
 
 	rc = dccp_mib_init();
diff --git a/net/ipv4/inet_connection_sock.c b/net/ipv4/inet_connection_sock.c
index 881ac6d046f2..8a70465c240c 100644
--- a/net/ipv4/inet_connection_sock.c
+++ b/net/ipv4/inet_connection_sock.c
@@ -309,6 +309,7 @@ int inet_csk_get_port(struct sock *sk, unsigned short snum)
 				     net, head, port);
 	if (!tb)
 		goto fail_unlock;
+	++head->count;
 tb_found:
 	if (!hlist_empty(&tb->owners)) {
 		if (sk->sk_reuse == SK_FORCE_REUSE)
diff --git a/net/ipv4/inet_hashtables.c b/net/ipv4/inet_hashtables.c
index 31ff46daae97..f7c7a589bfb3 100644
--- a/net/ipv4/inet_hashtables.c
+++ b/net/ipv4/inet_hashtables.c
@@ -58,6 +58,18 @@ static u32 sk_ehashfn(const struct sock *sk)
 			    sk->sk_daddr, sk->sk_dport);
 }
 
+/* Count how many any entries are in the bind hash table */
+int inet_bind_bucket_count(struct proto *prot)
+{
+	struct inet_hashinfo *hinfo = prot->h.hashinfo;
+	int i, ports = 0;
+
+	for (i = 0; i < hinfo->bhash_size; i++)
+		ports += hinfo->bhash[i].count;
+
+	return ports;
+}
+
 /*
  * Allocate and initialize a new local port bind bucket.
  * The bindhash mutex for snum's hash chain must be held here.
@@ -76,6 +88,7 @@ struct inet_bind_bucket *inet_bind_bucket_create(struct kmem_cache *cachep,
 		tb->fastreuseport = 0;
 		INIT_HLIST_HEAD(&tb->owners);
 		hlist_add_head(&tb->node, &head->chain);
+		++head->count;
 	}
 	return tb;
 }
@@ -83,10 +96,13 @@ struct inet_bind_bucket *inet_bind_bucket_create(struct kmem_cache *cachep,
 /*
  * Caller must hold hashbucket lock for this tb with local BH disabled
  */
-void inet_bind_bucket_destroy(struct kmem_cache *cachep, struct inet_bind_bucket *tb)
+void inet_bind_bucket_destroy(struct kmem_cache *cachep,
+			      struct inet_bind_hashbucket *head,
+			      struct inet_bind_bucket *tb)
 {
 	if (hlist_empty(&tb->owners)) {
 		__hlist_del(&tb->node);
+		--head->count;
 		kmem_cache_free(cachep, tb);
 	}
 }
@@ -115,7 +131,7 @@ static void __inet_put_port(struct sock *sk)
 	__sk_del_bind_node(sk);
 	inet_csk(sk)->icsk_bind_hash = NULL;
 	inet_sk(sk)->inet_num = 0;
-	inet_bind_bucket_destroy(hashinfo->bind_bucket_cachep, tb);
+	inet_bind_bucket_destroy(hashinfo->bind_bucket_cachep, head, tb);
 	spin_unlock(&head->lock);
 }
 
@@ -756,7 +772,7 @@ int __inet_hash_connect(struct inet_timewait_death_row *death_row,
 		inet_ehash_nolisten(sk, (struct sock *)tw);
 	}
 	if (tw)
-		inet_twsk_bind_unhash(tw, hinfo);
+		inet_twsk_bind_unhash(tw, head, hinfo);
 	spin_unlock(&head->lock);
 	if (tw)
 		inet_twsk_deschedule_put(tw);
diff --git a/net/ipv4/inet_timewait_sock.c b/net/ipv4/inet_timewait_sock.c
index 88c5069b5d20..dd888c52f958 100644
--- a/net/ipv4/inet_timewait_sock.c
+++ b/net/ipv4/inet_timewait_sock.c
@@ -26,7 +26,8 @@
  *	Returns 1 if caller should call inet_twsk_put() after lock release.
  */
 void inet_twsk_bind_unhash(struct inet_timewait_sock *tw,
-			  struct inet_hashinfo *hashinfo)
+			   struct inet_bind_hashbucket *head,
+			   struct inet_hashinfo *hashinfo)
 {
 	struct inet_bind_bucket *tb = tw->tw_tb;
 
@@ -35,7 +36,8 @@ void inet_twsk_bind_unhash(struct inet_timewait_sock *tw,
 
 	__hlist_del(&tw->tw_bind_node);
 	tw->tw_tb = NULL;
-	inet_bind_bucket_destroy(hashinfo->bind_bucket_cachep, tb);
+	inet_bind_bucket_destroy(hashinfo->bind_bucket_cachep,
+				 head, tb);
 	__sock_put((struct sock *)tw);
 }
 
@@ -55,7 +57,7 @@ static void inet_twsk_kill(struct inet_timewait_sock *tw)
 			hashinfo->bhash_size)];
 
 	spin_lock(&bhead->lock);
-	inet_twsk_bind_unhash(tw, hashinfo);
+	inet_twsk_bind_unhash(tw, bhead, hashinfo);
 	spin_unlock(&bhead->lock);
 
 	atomic_dec(&tw->tw_dr->tw_count);
diff --git a/net/ipv4/proc.c b/net/ipv4/proc.c
index 261b71d0ccc5..12621f8fb4d5 100644
--- a/net/ipv4/proc.c
+++ b/net/ipv4/proc.c
@@ -60,10 +60,11 @@ static int sockstat_seq_show(struct seq_file *seq, void *v)
 	sockets = proto_sockets_allocated_sum_positive(&tcp_prot);
 
 	socket_seq_show(seq);
-	seq_printf(seq, "TCP: inuse %d orphan %d tw %d alloc %d mem %ld\n",
+	seq_printf(seq, "TCP: inuse %d orphan %d tw %d alloc %d mem %ld ports %d\n",
 		   sock_prot_inuse_get(net, &tcp_prot), orphans,
 		   atomic_read(&net->ipv4.tcp_death_row.tw_count), sockets,
-		   proto_memory_allocated(&tcp_prot));
+		   proto_memory_allocated(&tcp_prot),
+		   inet_bind_bucket_count(&tcp_prot));
 	seq_printf(seq, "UDP: inuse %d mem %ld\n",
 		   sock_prot_inuse_get(net, &udp_prot),
 		   proto_memory_allocated(&udp_prot));
diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c
index 868ed74a76a8..f62e2fb02fdf 100644
--- a/net/ipv4/tcp.c
+++ b/net/ipv4/tcp.c
@@ -3836,6 +3836,7 @@ void __init tcp_init(void)
 	for (i = 0; i < tcp_hashinfo.bhash_size; i++) {
 		spin_lock_init(&tcp_hashinfo.bhash[i].lock);
 		INIT_HLIST_HEAD(&tcp_hashinfo.bhash[i].chain);
+		tcp_hashinfo.bhash[i].count = 0;
 	}
 
 
-- 
2.17.0

^ permalink raw reply related

* Re: [PATCH net 1/1] net/smc: restrict non-blocking connect finish
From: David Miller @ 2018-05-02 17:28 UTC (permalink / raw)
  To: ubraun; +Cc: netdev, linux-s390, schwidefsky, heiko.carstens, raspl
In-Reply-To: <20180502145356.32913-1-ubraun@linux.ibm.com>

From: Ursula Braun <ubraun@linux.ibm.com>
Date: Wed,  2 May 2018 16:53:56 +0200

> The smc_poll code tries to finish connect() if the socket is in
> state SMC_INIT and polling of the internal CLC-socket returns with
> EPOLLOUT. This makes sense for a select/poll call following a connect
> call, but not without preceding connect().
> With this patch smc_poll starts connect logic only, if the CLC-socket
> is no longer in its initial state TCP_CLOSE.
> 
> In addition, a poll error on the internal CLC-socket is always
> propagated to the SMC socket.
> 
> With this patch the code path mentioned by syzbot
> https://syzkaller.appspot.com/bug?extid=03faa2dc16b8b64be396
> is no longer possible.
> 
> Signed-off-by: Ursula Braun <ubraun@linux.ibm.com>
> Reported-by: syzbot+03faa2dc16b8b64be396@syzkaller.appspotmail.com

Applied and queued up for -stable, thanks.

^ permalink raw reply

* Re: [PATCH bpf-next 1/2] bpf: enable stackmap with build_id in nmi context
From: Peter Zijlstra @ 2018-05-02 17:30 UTC (permalink / raw)
  To: Song Liu; +Cc: Networking, Kernel Team, Alexei Starovoitov, Daniel Borkmann
In-Reply-To: <30F32ACF-5155-459B-BD47-5060CCA52788@fb.com>

On Wed, May 02, 2018 at 04:48:32PM +0000, Song Liu wrote:
> > It's broken though, I've bet you've never actually ran this with lockdep
> > enabled for example.
> 
> I am not following here. I just run the new selftest with CONFIG_LOCKDEP on, 
> and got no warning for this. 

Weird, I would be expecting complaints about releasing an unheld lock.

nmi_enter(),nmi_exit() have lockdep_off(),lockdep_on() resp. Which means
that the down_trylock() will not be recorded. The up, which is done from
IRQ context, will not be so supressed and should hit
print_unlock_imbalance_bug().

> > Also, you set work->sem before you do trylock, if the trylock fails you
> > return early and keep work->sem set, which will thereafter always result
> > in irq_work_busy.
> 
> work->sem was set after down_read_trylock(). I guess you misread the patch?

Argh, yes indeed. Sorry.

^ permalink raw reply

* Re: [PATCH net-next v2 0/4] net/smc: small features 2018/04/30
From: David Miller @ 2018-05-02 17:30 UTC (permalink / raw)
  To: ubraun; +Cc: netdev, linux-s390, schwidefsky, heiko.carstens, raspl
In-Reply-To: <20180502145647.36147-1-ubraun@linux.ibm.com>

From: Ursula Braun <ubraun@linux.ibm.com>
Date: Wed,  2 May 2018 16:56:43 +0200

> here are 4 smc patches for net-next covering small new features
> in different areas:
>    * link health check
>    * diagnostics for IPv6 smc sockets
>    * ioctl
>    * improvement for vlan determination
> 
> v2 changes:
>    * better title
>    * patch 2 - remove compile problem for disabled CONFIG_IPV6

Series applied, thank you.

^ permalink raw reply

* Re: [v2 PATCH 1/1] tg3: fix meaningless hw_stats reading after tg3_halt memset 0 hw_stats
From: Michael Chan @ 2018-05-02 17:32 UTC (permalink / raw)
  To: Zumeng Chen
  Cc: Netdev, open list, Siva Reddy Kallam,
	prashant.sreedharan@broadcom.com, David Miller, Zumeng Chen
In-Reply-To: <5af186f8-9718-c295-4e34-e84dd78ea157@gmail.com>

On Wed, May 2, 2018 at 3:27 AM, Zumeng Chen <zumeng.chen@gmail.com> wrote:
> On 2018年05月02日 13:12, Michael Chan wrote:
>>
>> On Tue, May 1, 2018 at 5:42 PM, Zumeng Chen <zumeng.chen@gmail.com> wrote:
>>
>>> diff --git a/drivers/net/ethernet/broadcom/tg3.h
>>> b/drivers/net/ethernet/broadcom/tg3.h
>>> index 3b5e98e..c61d83c 100644
>>> --- a/drivers/net/ethernet/broadcom/tg3.h
>>> +++ b/drivers/net/ethernet/broadcom/tg3.h
>>> @@ -3102,6 +3102,7 @@ enum TG3_FLAGS {
>>>          TG3_FLAG_ROBOSWITCH,
>>>          TG3_FLAG_ONE_DMA_AT_ONCE,
>>>          TG3_FLAG_RGMII_MODE,
>>> +       TG3_FLAG_HALT,
>>
>> I think you should be able to use the existing INIT_COMPLETE flag
>
>
> No,  it will bring the uncertain factors into the existed complicate logic
> of INIT_COMPLETE.
> And I think it's very simple logic here to fix the meaningless hw_stats
> reading and the problem
> of commit f5992b72. I even suspect if you have read INIT_COMPLETE related
> codes carefully.
>

We should use an existing flag whenever appropriate, instead of adding
yet another flag to do similar things. I've looked at the code briefly
and believe that INIT_COMPLETE will work.  If you think it won't work,
please be specific and point out why it won't work.  Thanks.

^ permalink raw reply

* Re: [RFC] inet: add bound ports statistic
From: David Miller @ 2018-05-02 17:35 UTC (permalink / raw)
  To: stephen; +Cc: netdev, sthemmin
In-Reply-To: <20180502172531.3458-1-sthemmin@microsoft.com>

From: Stephen Hemminger <stephen@networkplumber.org>
Date: Wed,  2 May 2018 10:25:31 -0700

> diff --git a/net/ipv4/inet_connection_sock.c b/net/ipv4/inet_connection_sock.c
> index 881ac6d046f2..8a70465c240c 100644
> --- a/net/ipv4/inet_connection_sock.c
> +++ b/net/ipv4/inet_connection_sock.c
> @@ -309,6 +309,7 @@ int inet_csk_get_port(struct sock *sk, unsigned short snum)
>  				     net, head, port);
>  	if (!tb)
>  		goto fail_unlock;
> +	++head->count;
>  tb_found:
>  	if (!hlist_empty(&tb->owners)) {
>  		if (sk->sk_reuse == SK_FORCE_REUSE)

Are you really able to commit to the counter increment here?  We can still
fail after this point.

^ permalink raw reply

* [bpf PATCH 0/3] sockmap error path fixes
From: John Fastabend @ 2018-05-02 17:47 UTC (permalink / raw)
  To: borkmann, ast; +Cc: netdev

When I added the test_sockmap to selftests I mistakenly changed the
test logic a bit. The result of this was on redirect cases we ended up
choosing the wrong sock from the BPF program and ended up sending to a
socket that had no receive handler. The result was the actual receive
handler, running on a different socket, is timing out and closing the
socket. This results in errors (-EPIPE to be specific) on the sending
side. Typically happening if the sender does not complete the send
before the receive side times out. So depending on timing and the size
of the send we may get errors. This exposed some bugs in the sockmap
error path handling.

This series fixes the errors. The primary issue is we did not do proper
memory accounting in these cases which resulted in missing a
sk_mem_uncharge(). This happened in the redirect path and in one case
on the normal send path. See the three patches for the details.

The other take-away from this is we need to fix the test_sockmap and
also add more negative test cases. That will happen in bpf-next.

Finally, I tested this using the existing test_sockmap program, the
older sockmap sample test script, and a few real use cases with
Cilium. All of these seem to be in working correctly.

---

John Fastabend (3):
      bpf: sockmap, fix scatterlist update on error path in send with apply
      bpf: sockmap, zero sg_size on error when buffer is released
      bpf: sockmap, fix error handling in redirect failures


 kernel/bpf/sockmap.c |   46 +++++++++++++++++++++++++---------------------
 1 file changed, 25 insertions(+), 21 deletions(-)

^ permalink raw reply

* [bpf PATCH 1/3] bpf: sockmap, fix scatterlist update on error path in send with apply
From: John Fastabend @ 2018-05-02 17:47 UTC (permalink / raw)
  To: borkmann, ast; +Cc: netdev
In-Reply-To: <20180502173954.17875.19624.stgit@john-Precision-Tower-5810>

When the call to do_tcp_sendpage() fails to send the complete block
requested we either retry if only a partial send was completed or
abort if we receive a error less than or equal to zero. Before
returning though we must update the scatterlist length/offset to
account for any partial send completed.

Before this patch we did this at the end of the retry loop, but
this was buggy when used while applying a verdict to fewer bytes
than in the scatterlist. When the scatterlist length was being set
we forgot to account for the apply logic reducing the size variable.
So the result was we chopped off some bytes in the scatterlist without
doing proper cleanup on them. This results in a WARNING when the
sock is tore down because the bytes have previously been charged to
the socket but are never uncharged.

The simple fix is to simply do the accounting inside the retry loop
subtracting from the absolute scatterlist values rather than trying
to accumulate the totals and subtract at the end.

Reported-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: John Fastabend <john.fastabend@gmail.com>
---
 kernel/bpf/sockmap.c |    5 +++--
 1 file changed, 3 insertions(+), 2 deletions(-)

diff --git a/kernel/bpf/sockmap.c b/kernel/bpf/sockmap.c
index 634415c..943929a 100644
--- a/kernel/bpf/sockmap.c
+++ b/kernel/bpf/sockmap.c
@@ -326,6 +326,9 @@ static int bpf_tcp_push(struct sock *sk, int apply_bytes,
 			if (ret > 0) {
 				if (apply)
 					apply_bytes -= ret;
+
+				sg->offset += ret;
+				sg->length -= ret;
 				size -= ret;
 				offset += ret;
 				if (uncharge)
@@ -333,8 +336,6 @@ static int bpf_tcp_push(struct sock *sk, int apply_bytes,
 				goto retry;
 			}
 
-			sg->length = size;
-			sg->offset = offset;
 			return ret;
 		}
 

^ permalink raw reply related

* [bpf PATCH 2/3] bpf: sockmap, zero sg_size on error when buffer is released
From: John Fastabend @ 2018-05-02 17:47 UTC (permalink / raw)
  To: borkmann, ast; +Cc: netdev
In-Reply-To: <20180502173954.17875.19624.stgit@john-Precision-Tower-5810>

When an error occurs during a redirect we have two cases that need
to be handled (i) we have a cork'ed buffer (ii) we have a normal
sendmsg buffer.

In the cork'ed buffer case we don't currently support recovering from
errors in a redirect action. So the buffer is released and the error
should _not_ be pushed back to the caller of sendmsg/sendpage. The
rationale here is the user will get an error that relates to old
data that may have been sent by some arbitrary thread on that sock.
Instead we simple consume the data and tell the user that the data
has been consumed. We may add proper error recovery in the future.
However, this patch fixes a bug where the bytes outstanding counter
sg_size was not zeroed. This could result in a case where if the user
has both a cork'ed action and apply action in progress we may
incorrectly call into the BPF program when the user expected an
old verdict to be applied via the apply action. I don't have a use
case where using apply and cork at the same time is valid but we
never explicitly reject it because it should work fine. This patch
ensures the sg_size is zeroed so we don't have this case.

In the normal sendmsg buffer case (no cork data) we also do not
zero sg_size. Again this can confuse the apply logic when the logic
calls into the BPF program when the BPF programmer expected the old
verdict to remain. So ensure we set sg_size to zero here as well. And
additionally to keep the psock state in-sync with the sk_msg_buff
release all the memory as well. Previously we did this before
returning to the user but this left a gap where psock and sk_msg_buff
states were out of sync which seems fragile. No additional overhead
is taken here except for a call to check the length and realize its
already been freed. This is in the error path as well so in my
opinion lets have robust code over optimized error paths.

Signed-off-by: John Fastabend <john.fastabend@gmail.com>
---
 kernel/bpf/sockmap.c |   15 +++++++++++----
 1 file changed, 11 insertions(+), 4 deletions(-)

diff --git a/kernel/bpf/sockmap.c b/kernel/bpf/sockmap.c
index 943929a..052c313 100644
--- a/kernel/bpf/sockmap.c
+++ b/kernel/bpf/sockmap.c
@@ -701,15 +701,22 @@ static int bpf_exec_tx_verdict(struct smap_psock *psock,
 		err = bpf_tcp_sendmsg_do_redirect(redir, send, m, flags);
 		lock_sock(sk);
 
+		if (unlikely(err < 0)) {
+			free_start_sg(sk, m);
+			psock->sg_size = 0;
+			if (!cork)
+				*copied -= send;
+		} else {
+			psock->sg_size -= send;
+		}
+
 		if (cork) {
 			free_start_sg(sk, m);
+			psock->sg_size = 0;
 			kfree(m);
 			m = NULL;
+			err = 0;
 		}
-		if (unlikely(err))
-			*copied -= err;
-		else
-			psock->sg_size -= send;
 		break;
 	case __SK_DROP:
 	default:

^ permalink raw reply related

* [bpf PATCH 3/3] bpf: sockmap, fix error handling in redirect failures
From: John Fastabend @ 2018-05-02 17:47 UTC (permalink / raw)
  To: borkmann, ast; +Cc: netdev
In-Reply-To: <20180502173954.17875.19624.stgit@john-Precision-Tower-5810>

When a redirect failure happens we release the buffers in-flight
without calling a sk_mem_uncharge(), the uncharge is called before
dropping the sock lock for the redirecte, however we missed updating
the ring start index. When no apply actions are in progress this
is OK because we uncharge the entire buffer before the redirect.
But, when we have apply logic running its possible that only a
portion of the buffer is being redirected. In this case we only
do memory accounting for the buffer slice being redirected and
expect to be able to loop over the BPF program again and/or if
a sock is closed uncharge the memory at sock destruct time.

With an invalid start index however the program logic looks at
the start pointer index, checks the length, and when seeing the
length is zero (from the initial release and failure to update
the pointer) aborts without uncharging/releasing the remaining
memory.

The fix for this is simply to update the start index. To avoid
fixing this error in two locations we do a small refactor and
remove one case where it is open-coded. Then fix it in the
single function.

Signed-off-by: John Fastabend <john.fastabend@gmail.com>
---
 kernel/bpf/sockmap.c |   26 +++++++++++---------------
 1 file changed, 11 insertions(+), 15 deletions(-)

diff --git a/kernel/bpf/sockmap.c b/kernel/bpf/sockmap.c
index 052c313..7e3c4cd 100644
--- a/kernel/bpf/sockmap.c
+++ b/kernel/bpf/sockmap.c
@@ -393,7 +393,8 @@ static void return_mem_sg(struct sock *sk, int bytes, struct sk_msg_buff *md)
 	} while (i != md->sg_end);
 }
 
-static void free_bytes_sg(struct sock *sk, int bytes, struct sk_msg_buff *md)
+static void free_bytes_sg(struct sock *sk, int bytes,
+			  struct sk_msg_buff *md, bool charge)
 {
 	struct scatterlist *sg = md->sg_data;
 	int i = md->sg_start, free;
@@ -403,11 +404,13 @@ static void free_bytes_sg(struct sock *sk, int bytes, struct sk_msg_buff *md)
 		if (bytes < free) {
 			sg[i].length -= bytes;
 			sg[i].offset += bytes;
-			sk_mem_uncharge(sk, bytes);
+			if (charge)
+				sk_mem_uncharge(sk, bytes);
 			break;
 		}
 
-		sk_mem_uncharge(sk, sg[i].length);
+		if (charge)
+			sk_mem_uncharge(sk, sg[i].length);
 		put_page(sg_page(&sg[i]));
 		bytes -= sg[i].length;
 		sg[i].length = 0;
@@ -418,6 +421,7 @@ static void free_bytes_sg(struct sock *sk, int bytes, struct sk_msg_buff *md)
 		if (i == MAX_SKB_FRAGS)
 			i = 0;
 	}
+	md->sg_start = i;
 }
 
 static int free_sg(struct sock *sk, int start, struct sk_msg_buff *md)
@@ -578,7 +582,7 @@ static int bpf_tcp_sendmsg_do_redirect(struct sock *sk, int send,
 {
 	struct smap_psock *psock;
 	struct scatterlist *sg;
-	int i, err, free = 0;
+	int i, err = 0;
 	bool ingress = !!(md->flags & BPF_F_INGRESS);
 
 	sg = md->sg_data;
@@ -607,16 +611,8 @@ static int bpf_tcp_sendmsg_do_redirect(struct sock *sk, int send,
 out_rcu:
 	rcu_read_unlock();
 out:
-	i = md->sg_start;
-	while (sg[i].length) {
-		free += sg[i].length;
-		put_page(sg_page(&sg[i]));
-		sg[i].length = 0;
-		i++;
-		if (i == MAX_SKB_FRAGS)
-			i = 0;
-	}
-	return free;
+	free_bytes_sg(NULL, send, md, false);
+	return err;
 }
 
 static inline void bpf_md_init(struct smap_psock *psock)
@@ -720,7 +716,7 @@ static int bpf_exec_tx_verdict(struct smap_psock *psock,
 		break;
 	case __SK_DROP:
 	default:
-		free_bytes_sg(sk, send, m);
+		free_bytes_sg(sk, send, m, true);
 		apply_bytes_dec(psock, send);
 		*copied -= send;
 		psock->sg_size -= send;

^ permalink raw reply related

* Re: [PATCH net-next v9 2/4] net: Introduce generic failover module
From: Samudrala, Sridhar @ 2018-05-02 17:51 UTC (permalink / raw)
  To: Jiri Pirko
  Cc: mst, stephen, davem, netdev, virtualization, virtio-dev,
	jesse.brandeburg, alexander.h.duyck, kubakici, jasowang,
	loseweigh, aaron.f.brown
In-Reply-To: <20180502161542.GI19250@nanopsycho>



On 5/2/2018 9:15 AM, Jiri Pirko wrote:
> Sat, Apr 28, 2018 at 11:06:01AM CEST, jiri@resnulli.us wrote:
>> Fri, Apr 27, 2018 at 07:06:58PM CEST, sridhar.samudrala@intel.com wrote:
> [...]
>
>
>>> +
>>> +	err = netdev_rx_handler_register(slave_dev, net_failover_handle_frame,
>>> +					 failover_dev);
>>> +	if (err) {
>>> +		netdev_err(slave_dev, "can not register failover rx handler (err = %d)\n",
>>> +			   err);
>>> +		goto err_handler_register;
>>> +	}
>>> +
>>> +	err = netdev_upper_dev_link(slave_dev, failover_dev, NULL);
>> Please use netdev_master_upper_dev_link().
> Don't forget to fillup struct netdev_lag_upper_info - NETDEV_LAG_TX_TYPE_ACTIVEBACKUP
>
>
> Also, please call netdev_lower_state_changed() when the active slave
> device changes from primary->backup of backup->primary and whenever link
> state of a slave changes
>
Sure. will look into it.  Do you think this will help with the issue
you saw with having to change mac on standy twice to get the init scripts
working? We are now going to block changing the mac on both standby and
failover.

Also, i was wondering if we should set dev->flags to IFF_MASTER on failover
and IFF_SLAVE on primary and standby. netvsc does this.
Does this help with the init scripts and network manager to skip slave
devices for dhcp requests?

^ permalink raw reply

* Re: [PATCH net-next 1/4] ipv6: Calculate hash thresholds for IPv6 nexthops
From: Ido Schimmel @ 2018-05-02 17:52 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: David Ahern, Ido Schimmel, netdev, davem, roopa, nikolay, pch,
	jkbs, yoshfuji, mlxsw
In-Reply-To: <20180502172106.GA12986@splinter>

On Wed, May 02, 2018 at 08:21:06PM +0300, Ido Schimmel wrote:
> On Wed, May 02, 2018 at 09:43:50AM -0700, Eric Dumazet wrote:
> > 
> > 
> > On 01/09/2018 07:43 PM, David Ahern wrote:
> > > On 1/9/18 7:40 AM, Ido Schimmel wrote:
> > >> Before we convert IPv6 to use hash-threshold instead of modulo-N, we
> > >> first need each nexthop to store its region boundary in the hash
> > >> function's output space.
> > >>
> > >> The boundary is calculated by dividing the output space equally between
> > >> the different active nexthops. That is, nexthops that are not dead or
> > >> linkdown.
> > >>
> > >> The boundaries are rebalanced whenever a nexthop is added or removed to
> > >> a multipath route and whenever a nexthop becomes active or inactive.
> > >>
> > >> Signed-off-by: Ido Schimmel <idosch@mellanox.com>
> > >> ---
> > >>  include/net/ip6_fib.h   |  1 +
> > >>  include/net/ip6_route.h |  7 ++++
> > >>  net/ipv6/ip6_fib.c      |  8 ++---
> > >>  net/ipv6/route.c        | 96 +++++++++++++++++++++++++++++++++++++++++++++++++
> > >>  4 files changed, 106 insertions(+), 6 deletions(-)
> > >>
> > > 
> > > LGTM.
> > > Acked-by: David Ahern <dsahern@gmail.com>
> > > 
> > 
> > For some reason I have a divide by zero error booting my hosts with latest net tree.
> > 
> > What guarantee do we have that total is not zero when rt6_upper_bound_set() is called ?
> 
> Thanks for the report, Eric. I believe I didn't cover all the cases and
> 'rt6i_nh_weight' might be 0 is some cases. I'll try to reproduce and
> work on a fix.

Hmmm, I think it's due to commit edd7ceb78296 ("ipv6: Allow non-gateway
ECMP for IPv6") which allows routes without a gateway (such as those
configured using slaac) to have siblings.

Can you please check if reverting the patch / applying the below fixes
the issue?

diff --git a/net/ipv6/route.c b/net/ipv6/route.c
index f4d61736c41a..129dd4f4b264 100644
--- a/net/ipv6/route.c
+++ b/net/ipv6/route.c
@@ -3606,6 +3606,7 @@ struct rt6_info *addrconf_dst_alloc(struct inet6_dev *idev,
 	rt->dst.input = ip6_input;
 	rt->dst.output = ip6_output;
 	rt->rt6i_idev = idev;
+	rt->rt6i_nh_weight = 1;
 
 	rt->rt6i_protocol = RTPROT_KERNEL;
 	rt->rt6i_flags = RTF_UP | RTF_NONEXTHOP;

^ permalink raw reply related

* Re: [PATCH bpf-next] bpf/verifier: enable ctx + const + 0.
From: William Tu @ 2018-05-02 17:54 UTC (permalink / raw)
  To: Daniel Borkmann
  Cc: Alexei Starovoitov, Linux Kernel Network Developers,
	Yonghong Song, Yifeng Sun
In-Reply-To: <e4f93f62-d7f4-0851-d1c2-34a13c2af4c6@iogearbox.net>

On Wed, May 2, 2018 at 1:29 AM, Daniel Borkmann <daniel@iogearbox.net> wrote:
> On 05/02/2018 06:52 AM, Alexei Starovoitov wrote:
>> On Tue, May 01, 2018 at 09:35:29PM -0700, William Tu wrote:
>>>
>>>> How did you test this patch?
>>>>
>>> Without the patch, the test case will fail.
>>> With the patch, the test case passes.
>>
>> Please test it with real program and you'll see crashes and garbage returned.
>
> +1, *convert_ctx_access() use bpf_insn's off to determine what to rewrite,
> so this is definitely buggy, and wasn't properly tested as it should have
> been. The test case is also way too simple, just the LDX and then doing a
> return 0 will get you past verifier, but won't give you anything in terms
> of runtime testing that test_verifier is doing. A single test case for a
> non trivial verifier change like this is also _completely insufficient_,
> this really needs to test all sort of weird corner cases (involving out of
> bounds accesses, overflows, etc).

Thanks, now I understand.
It's much more complicated than I thought.

William

^ permalink raw reply

* INFO: rcu detected stall in __schedule
From: syzbot @ 2018-05-02 17:56 UTC (permalink / raw)
  To: linux-kernel, linux-ppp, netdev, paulus, syzkaller-bugs

Hello,

syzbot found the following crash on:

HEAD commit:    f2125992e7cb Merge tag 'xfs-4.17-fixes-1' of  
git://git.kern...
git tree:       upstream
console output: https://syzkaller.appspot.com/x/log.txt?id=4755940087693312
kernel config:   
https://syzkaller.appspot.com/x/.config?id=6493557782959164711
dashboard link: https://syzkaller.appspot.com/bug?extid=f16b3e3512a1e3c1d1f6
compiler:       gcc (GCC) 8.0.1 20180413 (experimental)

Unfortunately, I don't have any reproducer for this crash yet.

IMPORTANT: if you fix the bug, please add the following tag to the commit:
Reported-by: syzbot+f16b3e3512a1e3c1d1f6@syzkaller.appspotmail.com

do_dccp_getsockopt: sockopt(PACKET_SIZE) is deprecated: fix your app
do_dccp_getsockopt: sockopt(PACKET_SIZE) is deprecated: fix your app
ntfs: (device loop6): parse_options(): Unrecognized mount option  
error�n��uldip.
INFO: rcu_sched self-detected stall on CPU
	0-...!: (125000 ticks this GP) idle=f3e/1/4611686018427387906  
softirq=112858/112858 fqs=0
	 (t=125000 jiffies g=61626 c=61625 q=1534)
rcu_sched kthread starved for 125000 jiffies! g61626 c61625 f0x0  
RCU_GP_WAIT_FQS(3) ->state=0x402 ->cpu=0
RCU grace-period kthread stack dump:
rcu_sched       I23592     9      2 0x80000000
Call Trace:
  context_switch kernel/sched/core.c:2848 [inline]
  __schedule+0x801/0x1e30 kernel/sched/core.c:3490
  schedule+0xef/0x430 kernel/sched/core.c:3549
  schedule_timeout+0x138/0x240 kernel/time/timer.c:1801
  rcu_gp_kthread+0x6b5/0x1940 kernel/rcu/tree.c:2231
  kthread+0x345/0x410 kernel/kthread.c:238
  ret_from_fork+0x3a/0x50 arch/x86/entry/entry_64.S:412
NMI backtrace for cpu 0
CPU: 0 PID: 26694 Comm: syz-executor1 Not tainted 4.17.0-rc3+ #28
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS  
Google 01/01/2011
Call Trace:
  <IRQ>
  __dump_stack lib/dump_stack.c:77 [inline]
  dump_stack+0x1b9/0x294 lib/dump_stack.c:113
  nmi_cpu_backtrace.cold.4+0x19/0xce lib/nmi_backtrace.c:103
  nmi_trigger_cpumask_backtrace+0x151/0x192 lib/nmi_backtrace.c:62
  arch_trigger_cpumask_backtrace+0x14/0x20 arch/x86/kernel/apic/hw_nmi.c:38
  trigger_single_cpu_backtrace include/linux/nmi.h:156 [inline]
  rcu_dump_cpu_stacks+0x175/0x1c2 kernel/rcu/tree.c:1376
  print_cpu_stall kernel/rcu/tree.c:1525 [inline]
  check_cpu_stall.isra.61.cold.80+0x36c/0x59a kernel/rcu/tree.c:1593
  __rcu_pending kernel/rcu/tree.c:3356 [inline]
  rcu_pending kernel/rcu/tree.c:3401 [inline]
  rcu_check_callbacks+0x21b/0xad0 kernel/rcu/tree.c:2763
  update_process_times+0x2d/0x70 kernel/time/timer.c:1636
  tick_sched_handle+0x9f/0x180 kernel/time/tick-sched.c:164
  tick_sched_timer+0x45/0x130 kernel/time/tick-sched.c:1274
  __run_hrtimer kernel/time/hrtimer.c:1398 [inline]
  __hrtimer_run_queues+0x3e3/0x10a0 kernel/time/hrtimer.c:1460
  hrtimer_interrupt+0x2f3/0x750 kernel/time/hrtimer.c:1518
  local_apic_timer_interrupt arch/x86/kernel/apic/apic.c:1025 [inline]
  smp_apic_timer_interrupt+0x15d/0x710 arch/x86/kernel/apic/apic.c:1050
  apic_timer_interrupt+0xf/0x20 arch/x86/entry/entry_64.S:863
  </IRQ>
RIP: 0010:arch_local_irq_enable arch/x86/include/asm/paravirt.h:793 [inline]
RIP: 0010:__raw_spin_unlock_irq include/linux/spinlock_api_smp.h:168  
[inline]
RIP: 0010:_raw_spin_unlock_irq+0x56/0x70 kernel/locking/spinlock.c:192
RSP: 0018:ffff8801b3dbf438 EFLAGS: 00000282 ORIG_RAX: ffffffffffffff13
RAX: dffffc0000000000 RBX: ffff8801dae2c680 RCX: 1ffff100361ebd4d
RDX: 1ffffffff11a316f RSI: ffff8801b0f5ea48 RDI: ffffffff88d18b78
RBP: ffff8801b3dbf440 R08: ffff8801b0f5e9f8 R09: 0000000000000006
R10: ffff8801b0f5e1c0 R11: 0000000000000000 R12: ffff8801b0f5e1c0
R13: ffff8801b0f5e7a0 R14: dffffc0000000000 R15: ffff8801b0f5e1c0
  rq_unlock_irq kernel/sched/sched.h:1824 [inline]
  __schedule+0x144f/0x1e30 kernel/sched/core.c:3493
  schedule+0xef/0x430 kernel/sched/core.c:3549
  do_sched_yield+0x187/0x240 kernel/sched/core.c:4965
  yield+0xa5/0xe0 kernel/sched/core.c:5054
  tasklet_kill+0x4e/0xd0 kernel/softirq.c:559
  ppp_asynctty_close+0x9e/0x150 drivers/net/ppp/ppp_async.c:239
  ppp_asynctty_hangup+0x15/0x20 drivers/net/ppp/ppp_async.c:256
  tty_ldisc_hangup+0x138/0x640 drivers/tty/tty_ldisc.c:730
  __tty_hangup.part.21+0x2da/0x6e0 drivers/tty/tty_io.c:621
  __tty_hangup drivers/tty/tty_io.c:571 [inline]
  tty_vhangup+0x21/0x30 drivers/tty/tty_io.c:694
  pty_close+0x3bd/0x510 drivers/tty/pty.c:78
  tty_release+0x494/0x12e0 drivers/tty/tty_io.c:1656
  __fput+0x34d/0x890 fs/file_table.c:209
  ____fput+0x15/0x20 fs/file_table.c:243
  task_work_run+0x1e4/0x290 kernel/task_work.c:113
  tracehook_notify_resume include/linux/tracehook.h:191 [inline]
  exit_to_usermode_loop+0x2bd/0x310 arch/x86/entry/common.c:166
  prepare_exit_to_usermode arch/x86/entry/common.c:196 [inline]
  syscall_return_slowpath arch/x86/entry/common.c:265 [inline]
  do_syscall_64+0x6ac/0x800 arch/x86/entry/common.c:290
  entry_SYSCALL_64_after_hwframe+0x49/0xbe
RIP: 0033:0x455979
RSP: 002b:00007f92c3751c68 EFLAGS: 00000246 ORIG_RAX: 0000000000000003
RAX: 0000000000000000 RBX: 00007f92c37526d4 RCX: 0000000000455979
RDX: 0000000000000000 RSI: 0000000000000000 RDI: 0000000000000013
RBP: 000000000072bf50 R08: 0000000000000000 R09: 0000000000000000
R10: 0000000000000000 R11: 0000000000000246 R12: 00000000ffffffff
R13: 0000000000000053 R14: 00000000006f4868 R15: 0000000000000001


---
This bug is generated by a bot. It may contain errors.
See https://goo.gl/tpsmEJ for more information about syzbot.
syzbot engineers can be reached at syzkaller@googlegroups.com.

syzbot will keep track of this bug report.
If you forgot to add the Reported-by tag, once the fix for this bug is  
merged
into any tree, please reply to this email with:
#syz fix: exact-commit-title
To mark this as a duplicate of another syzbot report, please reply with:
#syz dup: exact-subject-of-another-report
If it's a one-off invalid bug report, please reply with:
#syz invalid
Note: if the crash happens again, it will cause creation of a new bug  
report.
Note: all commands must start from beginning of the line in the email body.

^ permalink raw reply

* Re: [PATCH bpf-next 1/2] bpf: enable stackmap with build_id in nmi context
From: Song Liu @ 2018-05-02 17:57 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Networking, Kernel Team, Alexei Starovoitov, Daniel Borkmann
In-Reply-To: <20180502173027.GM12180@hirez.programming.kicks-ass.net>



> On May 2, 2018, at 10:30 AM, Peter Zijlstra <peterz@infradead.org> wrote:
> 
> On Wed, May 02, 2018 at 04:48:32PM +0000, Song Liu wrote:
>>> It's broken though, I've bet you've never actually ran this with lockdep
>>> enabled for example.
>> 
>> I am not following here. I just run the new selftest with CONFIG_LOCKDEP on, 
>> and got no warning for this. 
> 
> Weird, I would be expecting complaints about releasing an unheld lock.
> 
> nmi_enter(),nmi_exit() have lockdep_off(),lockdep_on() resp. Which means
> that the down_trylock() will not be recorded. The up, which is done from
> IRQ context, will not be so supressed and should hit
> print_unlock_imbalance_bug().
> 

I am still not sure whether I am following. I guess your concern apply to 
spinlock only? lock_acquire() has the following in the beginning:

	if (unlikely(current->lockdep_recursion))
		return;

So it will not run in nmi context?

On the other hand, semaphore and rw_semaphore should be ok in such cases?

Thanks,
Song

^ permalink raw reply

* [PATCH bpf 0/2] Two x86 BPF JIT fixes
From: Daniel Borkmann @ 2018-05-02 18:12 UTC (permalink / raw)
  To: ast; +Cc: netdev, Daniel Borkmann

Fix two memory leaks in x86 JIT. For details, please see
individual patches in this series. Thanks!

Daniel Borkmann (2):
  bpf, x64: fix memleak when not converging after image
  bpf, x64: fix memleak when not converging on calls

 arch/x86/net/bpf_jit_comp.c | 6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

-- 
2.9.5

^ permalink raw reply

* [PATCH bpf 1/2] bpf, x64: fix memleak when not converging after image
From: Daniel Borkmann @ 2018-05-02 18:12 UTC (permalink / raw)
  To: ast; +Cc: netdev, Daniel Borkmann
In-Reply-To: <20180502181223.30613-1-daniel@iogearbox.net>

While reviewing x64 JIT code, I noticed that we leak the prior allocated
JIT image in the case where proglen != oldproglen during the JIT passes.
Prior to the commit e0ee9c12157d ("x86: bpf_jit: fix two bugs in eBPF JIT
compiler") we would just break out of the loop, and using the image as the
JITed prog since it could only shrink in size anyway. After e0ee9c12157d,
we would bail out to out_addrs label where we free addrs and jit_data but
not the image coming from bpf_jit_binary_alloc().

Fixes: e0ee9c12157d ("x86: bpf_jit: fix two bugs in eBPF JIT compiler")
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Alexei Starovoitov <ast@kernel.org>
---
 arch/x86/net/bpf_jit_comp.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/arch/x86/net/bpf_jit_comp.c b/arch/x86/net/bpf_jit_comp.c
index abce27c..9ae7b93 100644
--- a/arch/x86/net/bpf_jit_comp.c
+++ b/arch/x86/net/bpf_jit_comp.c
@@ -1236,6 +1236,7 @@ struct bpf_prog *bpf_int_jit_compile(struct bpf_prog *prog)
 	for (pass = 0; pass < 20 || image; pass++) {
 		proglen = do_jit(prog, addrs, image, oldproglen, &ctx);
 		if (proglen <= 0) {
+out_image:
 			image = NULL;
 			if (header)
 				bpf_jit_binary_free(header);
@@ -1246,8 +1247,7 @@ struct bpf_prog *bpf_int_jit_compile(struct bpf_prog *prog)
 			if (proglen != oldproglen) {
 				pr_err("bpf_jit: proglen=%d != oldproglen=%d\n",
 				       proglen, oldproglen);
-				prog = orig_prog;
-				goto out_addrs;
+				goto out_image;
 			}
 			break;
 		}
-- 
2.9.5

^ permalink raw reply related

* [PATCH bpf 2/2] bpf, x64: fix memleak when not converging on calls
From: Daniel Borkmann @ 2018-05-02 18:12 UTC (permalink / raw)
  To: ast; +Cc: netdev, Daniel Borkmann
In-Reply-To: <20180502181223.30613-1-daniel@iogearbox.net>

The JIT logic in jit_subprogs() is as follows: for all subprogs we
allocate a bpf_prog_alloc(), populate it (prog->is_func = 1 here),
and pass it to bpf_int_jit_compile(). If a failure occurred during
JIT and prog->jited is not set, then we bail out from attempting to
JIT the whole program, and punt to the interpreter instead. In case
JITing went successful, we fixup BPF call offsets and do another
pass to bpf_int_jit_compile() (extra_pass is true at that point) to
complete JITing calls. Given that requires to pass JIT context around
addrs and jit_data from x86 JIT are freed in the extra_pass in
bpf_int_jit_compile() when calls are involved (if not, they can
be freed immediately). However, if in the original pass, the JIT
image didn't converge then we leak addrs and jit_data since image
itself is NULL, the prog->is_func is set and extra_pass is false
in that case, meaning both will become unreachable and are never
cleaned up, therefore we need to free as well on !image. Only x64
JIT is affected.

Fixes: 1c2a088a6626 ("bpf: x64: add JIT support for multi-function programs")
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Alexei Starovoitov <ast@kernel.org>
---
 arch/x86/net/bpf_jit_comp.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/arch/x86/net/bpf_jit_comp.c b/arch/x86/net/bpf_jit_comp.c
index 9ae7b93..263c845 100644
--- a/arch/x86/net/bpf_jit_comp.c
+++ b/arch/x86/net/bpf_jit_comp.c
@@ -1283,7 +1283,7 @@ struct bpf_prog *bpf_int_jit_compile(struct bpf_prog *prog)
 		prog = orig_prog;
 	}
 
-	if (!prog->is_func || extra_pass) {
+	if (!image || !prog->is_func || extra_pass) {
 out_addrs:
 		kfree(addrs);
 		kfree(jit_data);
-- 
2.9.5

^ permalink raw reply related

* Re: [PATCH net-next v9 0/4] Prerequisites for Cavium OCTEON-III network driver.
From: Steven J. Hill @ 2018-05-02 18:19 UTC (permalink / raw)
  To: David Miller; +Cc: netdev
In-Reply-To: <20180429.203347.679522174405836744.davem@davemloft.net>

On 04/29/2018 07:33 PM, David Miller wrote:
> 
> I don't know if we really want all of these MIPS specific changes to
> go via the net-next tree.
> 
> The right way to do this is probably getting this series into the MIPS
> architecture tree.
> 
David,

Correct, and I should have been clearer about that. The MIPS-specific
changes are targeted for 4.18 and I will be working with James/Ralf
to get those in. The actual changes for adding the driver, however,
should be fine to go into net-next. The netdev list should have been
a CC: for those patches and the Linux/MIPS mailing list as the main
recipient. Mea culpa.

Steve

^ permalink raw reply

* Re: [PATCH net-next] net: phy: broadcom: add support for BCM89610 PHY
From: Bhadram Varka @ 2018-05-02 18:47 UTC (permalink / raw)
  To: David Miller; +Cc: andrew, f.fainelli, netdev
In-Reply-To: <20180502.132136.1844584450359331860.davem@davemloft.net>

HI David,

On 5/2/2018 10:51 PM, David Miller wrote:
> 
> Please remove the email footer from your postings here that talks about
> confidential information and whatnot.
> 
> That is expressly inappropriate for this mailing list, and any such
> postings shall be ignored in their entirety.
> 

I Understand this. Fixed my e-mail client to address this.
Sorry for the noise.

Thanks,
Bhadram.

^ permalink raw reply

* Re: [bpf PATCH 1/3] bpf: sockmap, fix scatterlist update on error path in send with apply
From: David Miller @ 2018-05-02 18:51 UTC (permalink / raw)
  To: john.fastabend; +Cc: borkmann, ast, netdev
In-Reply-To: <20180502174727.17875.95306.stgit@john-Precision-Tower-5810>

From: John Fastabend <john.fastabend@gmail.com>
Date: Wed, 02 May 2018 10:47:27 -0700

> When the call to do_tcp_sendpage() fails to send the complete block
> requested we either retry if only a partial send was completed or
> abort if we receive a error less than or equal to zero. Before
> returning though we must update the scatterlist length/offset to
> account for any partial send completed.
> 
> Before this patch we did this at the end of the retry loop, but
> this was buggy when used while applying a verdict to fewer bytes
> than in the scatterlist. When the scatterlist length was being set
> we forgot to account for the apply logic reducing the size variable.
> So the result was we chopped off some bytes in the scatterlist without
> doing proper cleanup on them. This results in a WARNING when the
> sock is tore down because the bytes have previously been charged to
> the socket but are never uncharged.
> 
> The simple fix is to simply do the accounting inside the retry loop
> subtracting from the absolute scatterlist values rather than trying
> to accumulate the totals and subtract at the end.
> 
> Reported-by: Alexei Starovoitov <ast@kernel.org>
> Signed-off-by: John Fastabend <john.fastabend@gmail.com>

Acked-by: David S. Miller <davem@davemloft.net>

^ permalink raw reply

* Re: [bpf PATCH 2/3] bpf: sockmap, zero sg_size on error when buffer is released
From: David Miller @ 2018-05-02 18:51 UTC (permalink / raw)
  To: john.fastabend; +Cc: borkmann, ast, netdev
In-Reply-To: <20180502174732.17875.75344.stgit@john-Precision-Tower-5810>

From: John Fastabend <john.fastabend@gmail.com>
Date: Wed, 02 May 2018 10:47:32 -0700

> When an error occurs during a redirect we have two cases that need
> to be handled (i) we have a cork'ed buffer (ii) we have a normal
> sendmsg buffer.
> 
> In the cork'ed buffer case we don't currently support recovering from
> errors in a redirect action. So the buffer is released and the error
> should _not_ be pushed back to the caller of sendmsg/sendpage. The
> rationale here is the user will get an error that relates to old
> data that may have been sent by some arbitrary thread on that sock.
> Instead we simple consume the data and tell the user that the data
> has been consumed. We may add proper error recovery in the future.
> However, this patch fixes a bug where the bytes outstanding counter
> sg_size was not zeroed. This could result in a case where if the user
> has both a cork'ed action and apply action in progress we may
> incorrectly call into the BPF program when the user expected an
> old verdict to be applied via the apply action. I don't have a use
> case where using apply and cork at the same time is valid but we
> never explicitly reject it because it should work fine. This patch
> ensures the sg_size is zeroed so we don't have this case.
> 
> In the normal sendmsg buffer case (no cork data) we also do not
> zero sg_size. Again this can confuse the apply logic when the logic
> calls into the BPF program when the BPF programmer expected the old
> verdict to remain. So ensure we set sg_size to zero here as well. And
> additionally to keep the psock state in-sync with the sk_msg_buff
> release all the memory as well. Previously we did this before
> returning to the user but this left a gap where psock and sk_msg_buff
> states were out of sync which seems fragile. No additional overhead
> is taken here except for a call to check the length and realize its
> already been freed. This is in the error path as well so in my
> opinion lets have robust code over optimized error paths.
> 
> Signed-off-by: John Fastabend <john.fastabend@gmail.com>

Acked-by: David S. Miller <davem@davemloft.net>

^ permalink raw reply

* Re: [bpf PATCH 3/3] bpf: sockmap, fix error handling in redirect failures
From: David Miller @ 2018-05-02 18:51 UTC (permalink / raw)
  To: john.fastabend; +Cc: borkmann, ast, netdev
In-Reply-To: <20180502174737.17875.74031.stgit@john-Precision-Tower-5810>

From: John Fastabend <john.fastabend@gmail.com>
Date: Wed, 02 May 2018 10:47:37 -0700

> When a redirect failure happens we release the buffers in-flight
> without calling a sk_mem_uncharge(), the uncharge is called before
> dropping the sock lock for the redirecte, however we missed updating
> the ring start index. When no apply actions are in progress this
> is OK because we uncharge the entire buffer before the redirect.
> But, when we have apply logic running its possible that only a
> portion of the buffer is being redirected. In this case we only
> do memory accounting for the buffer slice being redirected and
> expect to be able to loop over the BPF program again and/or if
> a sock is closed uncharge the memory at sock destruct time.
> 
> With an invalid start index however the program logic looks at
> the start pointer index, checks the length, and when seeing the
> length is zero (from the initial release and failure to update
> the pointer) aborts without uncharging/releasing the remaining
> memory.
> 
> The fix for this is simply to update the start index. To avoid
> fixing this error in two locations we do a small refactor and
> remove one case where it is open-coded. Then fix it in the
> single function.
> 
> Signed-off-by: John Fastabend <john.fastabend@gmail.com>

Acked-by: David S. Miller <davem@davemloft.net>

^ permalink raw reply


This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox