Netdev List

Netdev List
 help / color / mirror / Atom feed

* kernel BUG at drivers/vhost/vhost.c:LINE! (2)
From: syzbot @ 2018-04-06 19:02 UTC (permalink / raw)
  To: jasowang, kvm, linux-kernel, mst, netdev, syzkaller-bugs,
	virtualization

Hello,

syzbot hit the following crash on upstream commit
38c23685b273cfb4ccf31a199feccce3bdcb5d83 (Fri Apr 6 04:29:35 2018 +0000)
Merge tag 'armsoc-drivers' of  
git://git.kernel.org/pub/scm/linux/kernel/git/arm/arm-soc
syzbot dashboard link:  
https://syzkaller.appspot.com/bug?extid=65a84dde0214b0387ccd

So far this crash happened 4 times on upstream.
C reproducer: https://syzkaller.appspot.com/x/repro.c?id=6586748079439872
syzkaller reproducer:  
https://syzkaller.appspot.com/x/repro.syz?id=5974272052822016
Raw console output:  
https://syzkaller.appspot.com/x/log.txt?id=6224632407392256
Kernel config:  
https://syzkaller.appspot.com/x/.config?id=-5813481738265533882
compiler: gcc (GCC) 8.0.1 20180301 (experimental)

IMPORTANT: if you fix the bug, please add the following tag to the commit:
Reported-by: syzbot+65a84dde0214b0387ccd@syzkaller.appspotmail.com
It will help syzbot understand when the bug is fixed. See footer for  
details.
If you forward the report, please keep this part and the footer.

------------[ cut here ]------------
kernel BUG at drivers/vhost/vhost.c:1652!
invalid opcode: 0000 [#1] SMP KASAN
------------[ cut here ]------------
Dumping ftrace buffer:
kernel BUG at drivers/vhost/vhost.c:1652!
    (ftrace buffer empty)
Modules linked in:
CPU: 1 PID: 4461 Comm: syzkaller684218 Not tainted 4.16.0+ #3
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS  
Google 01/01/2011
RIP: 0010:set_bit_to_user drivers/vhost/vhost.c:1652 [inline]
RIP: 0010:log_write+0x42a/0x4d0 drivers/vhost/vhost.c:1676
RSP: 0018:ffff8801b256f920 EFLAGS: 00010293
RAX: ffff8801adc9e2c0 RBX: dffffc0000000000 RCX: ffffffff85924a0f
RDX: 0000000000000000 RSI: ffffffff85924cea RDI: 0000000000000005
RBP: ffff8801b256fa58 R08: ffff8801adc9e2c0 R09: ffffed003962412d
R10: ffff8801b256fad8 R11: ffff8801cb12096f R12: 0001ffffffffffff
R13: ffffed00364adf36 R14: 0000000000000000 R15: ffff8801b256fa30
FS:  00007fdf24b19700(0000) GS:ffff8801db100000(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 0000000020bf6000 CR3: 00000001ae6a7000 CR4: 00000000001406e0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
Call Trace:
  vhost_update_used_flags+0x3af/0x4a0 drivers/vhost/vhost.c:1723
  vhost_vq_init_access+0x117/0x590 drivers/vhost/vhost.c:1763
  vhost_vsock_start drivers/vhost/vsock.c:446 [inline]
  vhost_vsock_dev_ioctl+0x751/0x920 drivers/vhost/vsock.c:678
  vfs_ioctl fs/ioctl.c:46 [inline]
  file_ioctl fs/ioctl.c:500 [inline]
  do_vfs_ioctl+0x1cf/0x1650 fs/ioctl.c:684
  ksys_ioctl+0xa9/0xd0 fs/ioctl.c:701
  SYSC_ioctl fs/ioctl.c:708 [inline]
  SyS_ioctl+0x24/0x30 fs/ioctl.c:706
  do_syscall_64+0x29e/0x9d0 arch/x86/entry/common.c:287
  entry_SYSCALL_64_after_hwframe+0x42/0xb7
RIP: 0033:0x4456c9
RSP: 002b:00007fdf24b18da8 EFLAGS: 00000297 ORIG_RAX: 0000000000000010
RAX: ffffffffffffffda RBX: 00000000006dac24 RCX: 00000000004456c9
RDX: 0000000020f82ffc RSI: 000000004004af61 RDI: 000000000000001b
RBP: 00000000006dac20 R08: 0000000000000000 R09: 0000000000000000
R10: 0000000000000000 R11: 0000000000000297 R12: 6b636f73762d7473
R13: 6f68762f7665642f R14: fffffffffffffffc R15: 0000000000000007
Code: e8 7c 5e e4 fb 4c 89 ef e8 e4 16 06 fc 48 8d 85 58 ff ff ff 48 c1 e8  
03 c6 04 18 f8 e9 46 ff ff ff 45 31 f6 eb 91 e8 56 5e e4 fb <0f> 0b e8 4f  
5e e4 fb 48 c7 c6 a0 a3 24 88 4c 89 ef e8 60 b6 10
RIP: set_bit_to_user drivers/vhost/vhost.c:1652 [inline] RSP:  
ffff8801b256f920
RIP: log_write+0x42a/0x4d0 drivers/vhost/vhost.c:1676 RSP: ffff8801b256f920
invalid opcode: 0000 [#2] SMP KASAN
---[ end trace 0d0ff45aa44d8a23 ]---
Dumping ftrace buffer:
    (ftrace buffer empty)
Modules linked in:


---
This bug is generated by a dumb bot. It may contain errors.
See https://goo.gl/tpsmEJ for details.
Direct all questions to syzkaller@googlegroups.com.

syzbot will keep track of this bug report.
If you forgot to add the Reported-by tag, once the fix for this bug is  
merged
into any tree, please reply to this email with:
#syz fix: exact-commit-title
If you want to test a patch for this bug, please reply with:
#syz test: git://repo/address.git branch
and provide the patch inline or as an attachment.
To mark this as a duplicate of another syzbot report, please reply with:
#syz dup: exact-subject-of-another-report
If it's a one-off invalid bug report, please reply with:
#syz invalid
Note: if the crash happens again, it will cause creation of a new bug  
report.
Note: all commands must start from beginning of the line in the email body.

^ permalink raw reply

* Re: [PATCH iproute2-next v1] rdma: Print net device name and index for RDMA device
From: David Ahern @ 2018-04-06 18:54 UTC (permalink / raw)
  To: Leon Romanovsky
  Cc: Leon Romanovsky, netdev, RDMA mailing list, Stephen Hemminger,
	Steve Wise
In-Reply-To: <20180403042914.23360-1-leon@kernel.org>

On 4/2/18 10:29 PM, Leon Romanovsky wrote:
> From: Leon Romanovsky <leonro@mellanox.com>
> 
> The RDMA devices are operated in RoCE and iWARP modes have net device
> underneath. Present their names in regular output and their net index
> in detailed mode.
> 
> [root@nps ~]# rdma link show mlx5_3/1
> 4/1: mlx5_3/1: state ACTIVE physical_state LINK_UP netdev ens7
> [root@nps ~]# rdma link show mlx5_3/1 -d
> 4/1: mlx5_3/1: state ACTIVE physical_state LINK_UP netdev ens7 netdev_index 7
>     caps: <CM, IP_BASED_GIDS>
> 
> Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
> Reviewed-by: Steve Wise <swise@opengridcomputing.com>
> ---
>  Changes v0->v1:
>   * Resend after commit 29122c1aae35 ("rdma: update rdma_netlink.h")
>     which updated relevant netlink attributes.
>   * Added Steve's ROB
> ---
>  rdma/include/uapi/rdma/rdma_netlink.h |  4 ++++
>  rdma/link.c                           | 21 +++++++++++++++++++++
>  rdma/utils.c                          |  2 ++
>  3 files changed, 27 insertions(+)
> 

applied to iproute2-next

^ permalink raw reply

* Re: [PATCH v2] net: phy: marvell: Enable interrupt function on LED2 pin
From: David Miller @ 2018-04-06 17:37 UTC (permalink / raw)
  To: andrew
  Cc: esben.haabendal, netdev, eha, rasmus.villemoes, f.fainelli,
	linux-kernel
In-Reply-To: <20180406172928.GS17495@lunn.ch>

From: Andrew Lunn <andrew@lunn.ch>
Date: Fri, 6 Apr 2018 19:29:28 +0200

> On Thu, Apr 05, 2018 at 10:40:29PM +0200, Esben Haabendal wrote:
>> From: Esben Haabendal <eha@deif.com>
>> 
>> The LED2[2]/INTn pin on Marvell 88E1318S as well as 88E1510/12/14/18 needs
>> to be configured to be usable as interrupt not only when WOL is enabled,
>> but whenever we rely on interrupts from the PHY.
>> 
>> Signed-off-by: Esben Haabendal <eha@deif.com>
>> Cc: Rasmus Villemoes <rasmus.villemoes@prevas.dk>
> 
> Reviewed-by: Andrew Lunn <andrew@lunn.ch>

Applied, thanks everyone.

^ permalink raw reply

* Re: [PATCH v2] net: phy: marvell: Enable interrupt function on LED2 pin
From: Andrew Lunn @ 2018-04-06 17:29 UTC (permalink / raw)
  To: Esben Haabendal
  Cc: netdev, Esben Haabendal, Rasmus Villemoes, Florian Fainelli,
	open list
In-Reply-To: <20180405204029.665-1-esben.haabendal@gmail.com>

On Thu, Apr 05, 2018 at 10:40:29PM +0200, Esben Haabendal wrote:
> From: Esben Haabendal <eha@deif.com>
> 
> The LED2[2]/INTn pin on Marvell 88E1318S as well as 88E1510/12/14/18 needs
> to be configured to be usable as interrupt not only when WOL is enabled,
> but whenever we rely on interrupts from the PHY.
> 
> Signed-off-by: Esben Haabendal <eha@deif.com>
> Cc: Rasmus Villemoes <rasmus.villemoes@prevas.dk>

Reviewed-by: Andrew Lunn <andrew@lunn.ch>

    Andrew

^ permalink raw reply

* Fw: [Bug 199307] New: kernel panic,call trace,sched_clock,tcp_write_timer_handler
From: Stephen Hemminger @ 2018-04-06 17:23 UTC (permalink / raw)
  To: netdev

Forwarding to netdev mailing list, it might be a real problem

Ugh. Glaring backtrace screenshots. 

Begin forwarded message:

Date: Fri, 06 Apr 2018 16:43:46 +0000
From: bugzilla-daemon@bugzilla.kernel.org
To: stephen@networkplumber.org
Subject: [Bug 199307] New: kernel panic,call trace,sched_clock,tcp_write_timer_handler


https://bugzilla.kernel.org/show_bug.cgi?id=199307

            Bug ID: 199307
           Summary: kernel panic,call
                    trace,sched_clock,tcp_write_timer_handler
           Product: Networking
           Version: 2.5
    Kernel Version: 4.16
          Hardware: All
                OS: Linux
              Tree: Mainline
            Status: NEW
          Severity: normal
          Priority: P1
         Component: Other
          Assignee: stephen@networkplumber.org
          Reporter: youling257@gmail.com
        Regression: No

Created attachment 275131
  --> https://bugzilla.kernel.org/attachment.cgi?id=275131&action=edit  
picture

see the picture

-- 
You are receiving this mail because:
You are the assignee for the bug.

^ permalink raw reply

* Re: [PATCH v2] dp83640: Ensure against premature access to PHY registers after reset
From: Andrew Lunn @ 2018-04-06 17:23 UTC (permalink / raw)
  To: Esben Haabendal
  Cc: netdev, richardcochran, f.fainelli, linux-kernel, Esben Haabendal
In-Reply-To: <20180406170844.4248-1-esben.haabendal@gmail.com>

On Fri, Apr 06, 2018 at 07:08:44PM +0200, Esben Haabendal wrote:
> From: Esben Haabendal <eha@deif.com>

Hi Esben

It it good to have some form of commit message. Something like:

The datasheet specifies a 3uS pause after performing a software
reset. The default implementation of genphy_soft_reset() does not
provide this, so implement soft_reset with the needed pause.

With that, or something similar added:

Reviewed-by: Andrew Lunn <andrew@lunn.ch>

    Andrew

^ permalink raw reply

* [RFC PATCH v3 bpf-next 4/5] bpf/verifier: use call graph to efficiently check max stack depth
From: Edward Cree @ 2018-04-06 17:14 UTC (permalink / raw)
  To: Alexei Starovoitov, Daniel Borkmann; +Cc: netdev
In-Reply-To: <b445b3c5-04f8-7425-2a3c-9bc54a04add2@solarflare.com>

Instead of recursively walking every possible call chain,
 check_max_stack_depth() now uses an explicit call graph (recorded in
 subprog_info.callees) which it topologically sorts, allowing it to find
 for each subprog the maximum stack depth of all call chains, and then
 use that depth in calculating the stack depth it implies for the
 subprog's callees.  This is O(number of subprogs).
The call graph is populated as part of the check_subprogs() pass.

Signed-off-by: Edward Cree <ecree@solarflare.com>
---
 include/linux/bpf_verifier.h |   3 +
 kernel/bpf/verifier.c        | 168 +++++++++++++++++++++++++------------------
 2 files changed, 101 insertions(+), 70 deletions(-)

diff --git a/include/linux/bpf_verifier.h b/include/linux/bpf_verifier.h
index 17990dd56e65..32f27cbe721b 100644
--- a/include/linux/bpf_verifier.h
+++ b/include/linux/bpf_verifier.h
@@ -175,8 +175,11 @@ static inline bool bpf_verifier_log_needed(const struct bpf_verifier_log *log)
 #define BPF_MAX_SUBPROGS 256
 
 struct bpf_subprog_info {
+	/* which other subprogs does this one directly call? */
+	DECLARE_BITMAP(callees, BPF_MAX_SUBPROGS);
 	u32 start; /* insn idx of function entry point */
 	u16 stack_depth; /* max. stack depth used by this function */
+	u16 total_stack_depth; /* max. stack depth used by entire call chain */
 };
 
 /* single container for all structs
diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c
index 587eb445bfa2..45f530e4a65e 100644
--- a/kernel/bpf/verifier.c
+++ b/kernel/bpf/verifier.c
@@ -784,6 +784,7 @@ static int add_subprog(struct bpf_verifier_env *env, int off)
 		info = &env->subprog_info[ret];
 		info->start = off;
 		info->stack_depth = 0;
+		memset(info->callees, 0, sizeof(info->callees));
 		env->insn_aux_data[off].subprogno = ret;
 	}
 	return ret;
@@ -793,13 +794,13 @@ static int check_subprogs(struct bpf_verifier_env *env)
 {
 	struct bpf_insn_aux_data *aux = env->insn_aux_data;
 	struct bpf_insn *insns = env->prog->insnsi;
-	int insn_cnt = env->prog->len, i, err;
+	int insn_cnt = env->prog->len, i, subprog;
 	int cur_subprog;
 
 	/* Find subprog starts */
-	err = add_subprog(env, 0); /* subprog 0, main(), starts at insn 0 */
-	if (err < 0)
-		return err;
+	subprog = add_subprog(env, 0); /* subprog 0, main(), starts at insn 0 */
+	if (subprog < 0)
+		return subprog;
 	for (i = 0; i < insn_cnt; i++)
 		if (insns[i].code == (BPF_JMP | BPF_CALL) &&
 		    insns[i].src_reg == BPF_PSEUDO_CALL) {
@@ -814,9 +815,9 @@ static int check_subprogs(struct bpf_verifier_env *env)
 			/* add_subprog marks the callee entry point with the
 			 * new subprogno.
 			 */
-			err = add_subprog(env, i + insns[i].imm + 1);
-			if (err < 0)
-				return err;
+			subprog = add_subprog(env, i + insns[i].imm + 1);
+			if (subprog < 0)
+				return subprog;
 		}
 
 	if (env->log.level > 1)
@@ -842,6 +843,7 @@ static int check_subprogs(struct bpf_verifier_env *env)
 		u8 opcode = BPF_OP(insns[i].code);
 		int target;
 
+		cur_subprog = aux[i].subprogno;
 		/* Determine where control flow from this insn goes */
 		if (BPF_CLASS(insns[i].code) != BPF_JMP) {
 			fallthrough = true;
@@ -855,9 +857,18 @@ static int check_subprogs(struct bpf_verifier_env *env)
 				 */
 				continue;
 			case BPF_CALL:
-				/* If pseudo-call, already handled marking the
-				 * callee.  Both kinds of call will eventually
-				 * return and pass to the following insn.
+				if (insns[i].src_reg == BPF_PSEUDO_CALL) {
+					/* record edge in call graph */
+					subprog = find_subprog(env, i + insns[i].imm + 1);
+					if (subprog < 0)
+						return subprog;
+					__set_bit(subprog, env->subprog_info[cur_subprog].callees);
+					/* already handled marking the callee
+					 * back in add_subprog, so jump is false
+					 */
+				}
+				/* Call will eventually return and pass control
+				 * to the following insn.
 				 */
 				fallthrough = true;
 				break;
@@ -876,7 +887,6 @@ static int check_subprogs(struct bpf_verifier_env *env)
 		}
 
 		/* Check that that control flow doesn't leave the subprog */
-		cur_subprog = aux[i].subprogno;
 		if (fallthrough) {
 			if (i + 1 >= insn_cnt) {
 				verbose(env, "no exit/jump at end of program (insn %d)\n",
@@ -1533,78 +1543,96 @@ static int update_stack_depth(struct bpf_verifier_env *env,
 			      const struct bpf_func_state *func,
 			      int off)
 {
-	u16 stack = env->subprog_info[func->subprogno].stack_depth;
+	struct bpf_subprog_info *info;
 
-	if (stack >= -off)
-		return 0;
+	if (!func)
+		return -EFAULT;
+	if (func->subprogno >= BPF_MAX_SUBPROGS)
+		return -E2BIG;
+	info = &env->subprog_info[func->subprogno];
 
 	/* update known max for given subprogram */
-	env->subprog_info[func->subprogno].stack_depth = -off;
+	info->stack_depth = max_t(u16, info->stack_depth, -off);
 	return 0;
 }
 
-/* starting from main bpf function walk all instructions of the function
- * and recursively walk all callees that given function can call.
- * Ignore jump and exit insns.
- * Since recursion is prevented by check_cfg() this algorithm
- * only needs a local stack of MAX_CALL_FRAMES to remember callsites
+/* Topologically sort the call graph, and thereby determine the maximum stack
+ * depth of each subprog's worst-case call chain.  Store in total_stack_depth.
+ * The tsort is performed using Kahn's algorithm.  Since that algorithm selects
+ * nodes in the order of the sorted output, we can do our processing in the loop
+ * that does the tsort, rather than storing the sorted list and then having a
+ * second loop to iterate over it and compute the total_stack_depth values.
  */
 static int check_max_stack_depth(struct bpf_verifier_env *env)
 {
-	int depth = 0, frame = 0, subprog = 0, i = 0;
-	struct bpf_insn_aux_data *aux = env->insn_aux_data;
-	struct bpf_insn *insn = env->prog->insnsi;
-	int insn_cnt = env->prog->len;
-	int ret_insn[MAX_CALL_FRAMES];
-	int ret_prog[MAX_CALL_FRAMES];
+	DECLARE_BITMAP(frontier, BPF_MAX_SUBPROGS) = {0};
+	DECLARE_BITMAP(visited, BPF_MAX_SUBPROGS) = {0};
+	int subprog, i, j;
+
+	/* subprog 0 has no incoming edges, and should be the only such */
+	__set_bit(0, frontier);
+	env->subprog_info[0].total_stack_depth = env->subprog_info[0].stack_depth;
+
+	while (true) {
+		/* select a frontier node */
+		subprog = find_first_bit(frontier, BPF_MAX_SUBPROGS);
+		if (subprog >= BPF_MAX_SUBPROGS) /* frontier is empty, done */
+			break;
+		/* remove it from the frontier */
+		__clear_bit(subprog, frontier);
+		/* validate its total stack depth.  Since all callers of this
+		 * function have already been processed, the value now in
+		 * info->total_stack_depth reflects the deepest call chain of
+		 * this function.
+		 */
+		if (env->subprog_info[subprog].total_stack_depth > MAX_BPF_STACK) {
+			verbose(env, "combined stack size of calls to %d (at insn %d) is %d.  Too large\n",
+				subprog, env->subprog_info[subprog].start,
+				env->subprog_info[subprog].total_stack_depth);
+			return -EACCES;
+		}
+		if (env->log.level > 1) {
+			verbose(env, "combined stack size of calls to %d (at insn %d) is %d\n",
+				subprog, env->subprog_info[subprog].start,
+				env->subprog_info[subprog].total_stack_depth);
+		}
+		__set_bit(subprog, visited);
+		/* for each callee */
+		for_each_set_bit(i, env->subprog_info[subprog].callees,
+				 BPF_MAX_SUBPROGS) {
+			/* round up to 32-bytes, since this is granularity of
+			 * interpreter stack size
+			 */
+			u16 stack_depth = round_up(max_t(u16, env->subprog_info[i].stack_depth, 1), 32);
 
-process_func:
-	/* round up to 32-bytes, since this is granularity
-	 * of interpreter stack size
-	 */
-	depth += round_up(max_t(u32, env->subprog_info[subprog].stack_depth, 1),
-			  32);
-	if (depth > MAX_BPF_STACK) {
-		verbose(env, "combined stack size of %d calls is %d. Too large\n",
-			frame + 1, depth);
-		return -EACCES;
+			/* Update callee total stack depth.  Since our own
+			 * max total_stack_depth is known, the callee tsd is
+			 * at least that plus the callee's stack frame size.
+			 */
+			env->subprog_info[i].total_stack_depth = max_t(u16,
+				env->subprog_info[i].total_stack_depth,
+				env->subprog_info[subprog].total_stack_depth +
+				stack_depth);
+			/* does it have unvisited callers? */
+			for_each_clear_bit(j, visited, env->subprog_cnt) {
+				if (test_bit(i, env->subprog_info[j].callees))
+					break;
+			}
+			/* if not, add it to the frontier */
+			if (j >= env->subprog_cnt)
+				__set_bit(i, frontier);
+		}
 	}
-continue_func:
-	for (; i < insn_cnt && aux[i].subprogno == subprog; i++) {
-		if (insn[i].code != (BPF_JMP | BPF_CALL))
-			continue;
-		if (insn[i].src_reg != BPF_PSEUDO_CALL)
-			continue;
-		/* remember insn and function to return to */
-		ret_insn[frame] = i + 1;
-		ret_prog[frame] = subprog;
 
-		/* find the callee */
-		i = i + insn[i].imm + 1;
-		subprog = find_subprog(env, i);
-		if (subprog < 0) {
-			WARN_ONCE(1, "verifier bug. No program starts at insn %d\n",
-				  i);
-			return -EFAULT;
-		}
-		frame++;
-		if (frame >= MAX_CALL_FRAMES) {
-			WARN_ONCE(1, "verifier bug. Call stack is too deep\n");
-			return -EFAULT;
-		}
-		goto process_func;
+	/* are any nodes left unvisited? */
+	subprog = find_first_zero_bit(visited, env->subprog_cnt);
+	if (subprog < env->subprog_cnt) {
+		/* then call graph is not acyclic, which shouldn't happen */
+		verbose(env, "verifier bug.  Call graph has a cycle including subprog %d (at insn %d)\n",
+			subprog, env->subprog_info[subprog].start);
+		return -EFAULT;
 	}
-	/* end of for() loop means the last insn of the 'subprog'
-	 * was reached. Doesn't matter whether it was JA or EXIT
-	 */
-	if (frame == 0)
-		return 0;
-	depth -= round_up(max_t(u32, env->subprog_info[subprog].stack_depth, 1),
-			  32);
-	frame--;
-	i = ret_insn[frame];
-	subprog = ret_prog[frame];
-	goto continue_func;
+	return 0;
 }
 
 #ifndef CONFIG_BPF_JIT_ALWAYS_ON

^ permalink raw reply related

* [RFC PATCH v3 bpf-next 5/5] bpf/verifier: per-register parent pointers
From: Edward Cree @ 2018-04-06 17:15 UTC (permalink / raw)
  To: Alexei Starovoitov, Daniel Borkmann; +Cc: netdev
In-Reply-To: <b445b3c5-04f8-7425-2a3c-9bc54a04add2@solarflare.com>

By giving each register its own liveness chain, we elide the skip_callee()
 logic.  Instead, each register's parent is the state it inherits from;
 both check_func_call() and prepare_func_exit() automatically connect
 reg states to the correct chain since when they copy the reg state across
 (r1-r5 into the callee as args, and r0 out as the return value) they also
 copy the parent pointer.

Signed-off-by: Edward Cree <ecree@solarflare.com>
---
 include/linux/bpf_verifier.h |   8 +-
 kernel/bpf/verifier.c        | 180 ++++++++++---------------------------------
 2 files changed, 45 insertions(+), 143 deletions(-)

diff --git a/include/linux/bpf_verifier.h b/include/linux/bpf_verifier.h
index 32f27cbe721b..81b627dbda32 100644
--- a/include/linux/bpf_verifier.h
+++ b/include/linux/bpf_verifier.h
@@ -41,6 +41,7 @@ enum bpf_reg_liveness {
 };
 
 struct bpf_reg_state {
+	/* Ordering of fields matters.  See states_equal() */
 	enum bpf_reg_type type;
 	union {
 		/* valid when type == PTR_TO_PACKET */
@@ -59,7 +60,6 @@ struct bpf_reg_state {
 	 * came from, when one is tested for != NULL.
 	 */
 	u32 id;
-	/* Ordering of fields matters.  See states_equal() */
 	/* For scalar types (SCALAR_VALUE), this represents our knowledge of
 	 * the actual value.
 	 * For pointer types, this represents the variable part of the offset
@@ -76,15 +76,15 @@ struct bpf_reg_state {
 	s64 smax_value; /* maximum possible (s64)value */
 	u64 umin_value; /* minimum possible (u64)value */
 	u64 umax_value; /* maximum possible (u64)value */
+	/* parentage chain for liveness checking */
+	struct bpf_reg_state *parent;
 	/* Inside the callee two registers can be both PTR_TO_STACK like
 	 * R1=fp-8 and R2=fp-8, but one of them points to this function stack
 	 * while another to the caller's stack. To differentiate them 'frameno'
 	 * is used which is an index in bpf_verifier_state->frame[] array
 	 * pointing to bpf_func_state.
-	 * This field must be second to last, for states_equal() reasons.
 	 */
 	u32 frameno;
-	/* This field must be last, for states_equal() reasons. */
 	enum bpf_reg_liveness live;
 };
 
@@ -107,7 +107,6 @@ struct bpf_stack_state {
  */
 struct bpf_func_state {
 	struct bpf_reg_state regs[MAX_BPF_REG];
-	struct bpf_verifier_state *parent;
 	/* index of call instruction that called into this func */
 	int callsite;
 	/* stack frame number of this function state from pov of
@@ -129,7 +128,6 @@ struct bpf_func_state {
 struct bpf_verifier_state {
 	/* call stack tracking */
 	struct bpf_func_state *frame[MAX_CALL_FRAMES];
-	struct bpf_verifier_state *parent;
 	u32 curframe;
 };
 
diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c
index 45f530e4a65e..4b37cd6fa181 100644
--- a/kernel/bpf/verifier.c
+++ b/kernel/bpf/verifier.c
@@ -355,9 +355,9 @@ static int copy_stack_state(struct bpf_func_state *dst,
 /* do_check() starts with zero-sized stack in struct bpf_verifier_state to
  * make it consume minimal amount of memory. check_stack_write() access from
  * the program calls into realloc_func_state() to grow the stack size.
- * Note there is a non-zero 'parent' pointer inside bpf_verifier_state
- * which this function copies over. It points to previous bpf_verifier_state
- * which is never reallocated
+ * Note there is a non-zero parent pointer inside each reg of bpf_verifier_state
+ * which this function copies over. It points to corresponding reg in previous
+ * bpf_verifier_state which is never reallocated
  */
 static int realloc_func_state(struct bpf_func_state *state, int size,
 			      bool copy_old)
@@ -441,7 +441,6 @@ static int copy_verifier_state(struct bpf_verifier_state *dst_state,
 		dst_state->frame[i] = NULL;
 	}
 	dst_state->curframe = src->curframe;
-	dst_state->parent = src->parent;
 	for (i = 0; i <= src->curframe; i++) {
 		dst = dst_state->frame[i];
 		if (!dst) {
@@ -707,6 +706,7 @@ static void init_reg_state(struct bpf_verifier_env *env,
 	for (i = 0; i < MAX_BPF_REG; i++) {
 		mark_reg_not_init(env, regs, i);
 		regs[i].live = REG_LIVE_NONE;
+		regs[i].parent = NULL;
 	}
 
 	/* frame pointer */
@@ -916,74 +916,21 @@ static int check_subprogs(struct bpf_verifier_env *env)
 	return 0;
 }
 
-static
-struct bpf_verifier_state *skip_callee(struct bpf_verifier_env *env,
-				       const struct bpf_verifier_state *state,
-				       struct bpf_verifier_state *parent,
-				       u32 regno)
-{
-	struct bpf_verifier_state *tmp = NULL;
-
-	/* 'parent' could be a state of caller and
-	 * 'state' could be a state of callee. In such case
-	 * parent->curframe < state->curframe
-	 * and it's ok for r1 - r5 registers
-	 *
-	 * 'parent' could be a callee's state after it bpf_exit-ed.
-	 * In such case parent->curframe > state->curframe
-	 * and it's ok for r0 only
-	 */
-	if (parent->curframe == state->curframe ||
-	    (parent->curframe < state->curframe &&
-	     regno >= BPF_REG_1 && regno <= BPF_REG_5) ||
-	    (parent->curframe > state->curframe &&
-	       regno == BPF_REG_0))
-		return parent;
-
-	if (parent->curframe > state->curframe &&
-	    regno >= BPF_REG_6) {
-		/* for callee saved regs we have to skip the whole chain
-		 * of states that belong to callee and mark as LIVE_READ
-		 * the registers before the call
-		 */
-		tmp = parent;
-		while (tmp && tmp->curframe != state->curframe) {
-			tmp = tmp->parent;
-		}
-		if (!tmp)
-			goto bug;
-		parent = tmp;
-	} else {
-		goto bug;
-	}
-	return parent;
-bug:
-	verbose(env, "verifier bug regno %d tmp %p\n", regno, tmp);
-	verbose(env, "regno %d parent frame %d current frame %d\n",
-		regno, parent->curframe, state->curframe);
-	return NULL;
-}
-
+/* Parentage chain of this register (or stack slot) should take care of all
+ * issues like callee-saved registers, stack slot allocation time, etc.
+ */
 static int mark_reg_read(struct bpf_verifier_env *env,
-			 const struct bpf_verifier_state *state,
-			 struct bpf_verifier_state *parent,
-			 u32 regno)
+			 const struct bpf_reg_state *state,
+			 struct bpf_reg_state *parent)
 {
 	bool writes = parent == state->parent; /* Observe write marks */
 
-	if (regno == BPF_REG_FP)
-		/* We don't need to worry about FP liveness because it's read-only */
-		return 0;
-
 	while (parent) {
 		/* if read wasn't screened by an earlier write ... */
-		if (writes && state->frame[state->curframe]->regs[regno].live & REG_LIVE_WRITTEN)
+		if (writes && state->live & REG_LIVE_WRITTEN)
 			break;
-		parent = skip_callee(env, state, parent, regno);
-		if (!parent)
-			return -EFAULT;
 		/* ... then we depend on parent's value */
-		parent->frame[parent->curframe]->regs[regno].live |= REG_LIVE_READ;
+		parent->live |= REG_LIVE_READ;
 		state = parent;
 		parent = state->parent;
 		writes = true;
@@ -1009,7 +956,10 @@ static int check_reg_arg(struct bpf_verifier_env *env, u32 regno,
 			verbose(env, "R%d !read_ok\n", regno);
 			return -EACCES;
 		}
-		return mark_reg_read(env, vstate, vstate->parent, regno);
+		/* We don't need to worry about FP liveness because it's read-only */
+		if (regno != BPF_REG_FP)
+			return mark_reg_read(env, &regs[regno],
+					     regs[regno].parent);
 	} else {
 		/* check whether register used as dest operand can be written to */
 		if (regno == BPF_REG_FP) {
@@ -1121,61 +1071,6 @@ static int check_stack_write(struct bpf_verifier_env *env,
 	return 0;
 }
 
-/* registers of every function are unique and mark_reg_read() propagates
- * the liveness in the following cases:
- * - from callee into caller for R1 - R5 that were used as arguments
- * - from caller into callee for R0 that used as result of the call
- * - from caller to the same caller skipping states of the callee for R6 - R9,
- *   since R6 - R9 are callee saved by implicit function prologue and
- *   caller's R6 != callee's R6, so when we propagate liveness up to
- *   parent states we need to skip callee states for R6 - R9.
- *
- * stack slot marking is different, since stacks of caller and callee are
- * accessible in both (since caller can pass a pointer to caller's stack to
- * callee which can pass it to another function), hence mark_stack_slot_read()
- * has to propagate the stack liveness to all parent states at given frame number.
- * Consider code:
- * f1() {
- *   ptr = fp - 8;
- *   *ptr = ctx;
- *   call f2 {
- *      .. = *ptr;
- *   }
- *   .. = *ptr;
- * }
- * First *ptr is reading from f1's stack and mark_stack_slot_read() has
- * to mark liveness at the f1's frame and not f2's frame.
- * Second *ptr is also reading from f1's stack and mark_stack_slot_read() has
- * to propagate liveness to f2 states at f1's frame level and further into
- * f1 states at f1's frame level until write into that stack slot
- */
-static void mark_stack_slot_read(struct bpf_verifier_env *env,
-				 const struct bpf_verifier_state *state,
-				 struct bpf_verifier_state *parent,
-				 int slot, int frameno)
-{
-	bool writes = parent == state->parent; /* Observe write marks */
-
-	while (parent) {
-		if (parent->frame[frameno]->allocated_stack <= slot * BPF_REG_SIZE)
-			/* since LIVE_WRITTEN mark is only done for full 8-byte
-			 * write the read marks are conservative and parent
-			 * state may not even have the stack allocated. In such case
-			 * end the propagation, since the loop reached beginning
-			 * of the function
-			 */
-			break;
-		/* if read wasn't screened by an earlier write ... */
-		if (writes && state->frame[frameno]->stack[slot].spilled_ptr.live & REG_LIVE_WRITTEN)
-			break;
-		/* ... then we depend on parent's value */
-		parent->frame[frameno]->stack[slot].spilled_ptr.live |= REG_LIVE_READ;
-		state = parent;
-		parent = state->parent;
-		writes = true;
-	}
-}
-
 static int check_stack_read(struct bpf_verifier_env *env,
 			    struct bpf_func_state *reg_state /* func where register points to */,
 			    int off, int size, int value_regno)
@@ -1213,8 +1108,8 @@ static int check_stack_read(struct bpf_verifier_env *env,
 			 */
 			state->regs[value_regno].live |= REG_LIVE_WRITTEN;
 		}
-		mark_stack_slot_read(env, vstate, vstate->parent, spi,
-				     reg_state->frameno);
+		mark_reg_read(env, &reg_state->stack[spi].spilled_ptr,
+			      reg_state->stack[spi].spilled_ptr.parent);
 		return 0;
 	} else {
 		int zeros = 0;
@@ -1230,8 +1125,8 @@ static int check_stack_read(struct bpf_verifier_env *env,
 				off, i, size);
 			return -EACCES;
 		}
-		mark_stack_slot_read(env, vstate, vstate->parent, spi,
-				     reg_state->frameno);
+		mark_reg_read(env, &reg_state->stack[spi].spilled_ptr,
+			      reg_state->stack[spi].spilled_ptr.parent);
 		if (value_regno >= 0) {
 			if (zeros == size) {
 				/* any size read into register is zero extended,
@@ -1930,8 +1825,8 @@ static int check_stack_boundary(struct bpf_verifier_env *env, int regno,
 		/* reading any byte out of 8-byte 'spill_slot' will cause
 		 * the whole slot to be marked as 'read'
 		 */
-		mark_stack_slot_read(env, env->cur_state, env->cur_state->parent,
-				     spi, state->frameno);
+		mark_reg_read(env, &state->stack[spi].spilled_ptr,
+			      state->stack[spi].spilled_ptr.parent);
 	}
 	return update_stack_depth(env, state, off);
 }
@@ -2362,11 +2257,13 @@ static int check_func_call(struct bpf_verifier_env *env, struct bpf_insn *insn,
 			state->curframe + 1 /* frameno within this callchain */,
 			subprog /* subprog number within this prog */);
 
-	/* copy r1 - r5 args that callee can access */
+	/* copy r1 - r5 args that callee can access.  The copy includes parent
+	 * pointers, which connects us up to the liveness chain
+	 */
 	for (i = BPF_REG_1; i <= BPF_REG_5; i++)
 		callee->regs[i] = caller->regs[i];
 
-	/* after the call regsiters r0 - r5 were scratched */
+	/* after the call registers r0 - r5 were scratched */
 	for (i = 0; i < CALLER_SAVED_REGS; i++) {
 		mark_reg_not_init(env, caller->regs, caller_saved[i]);
 		check_reg_arg(env, caller_saved[i], DST_OP_NO_MARK);
@@ -4272,7 +4169,7 @@ static bool regsafe(struct bpf_reg_state *rold, struct bpf_reg_state *rcur,
 		/* explored state didn't use this */
 		return true;
 
-	equal = memcmp(rold, rcur, offsetof(struct bpf_reg_state, frameno)) == 0;
+	equal = memcmp(rold, rcur, offsetof(struct bpf_reg_state, parent)) == 0;
 
 	if (rold->type == PTR_TO_STACK)
 		/* two stack pointers are equal only if they're pointing to
@@ -4505,7 +4402,7 @@ static bool states_equal(struct bpf_verifier_env *env,
  * equivalent state (jump target or such) we didn't arrive by the straight-line
  * code, so read marks in the state must propagate to the parent regardless
  * of the state's write marks. That's what 'parent == state->parent' comparison
- * in mark_reg_read() and mark_stack_slot_read() is for.
+ * in mark_reg_read() is for.
  */
 static int propagate_liveness(struct bpf_verifier_env *env,
 			      const struct bpf_verifier_state *vstate,
@@ -4526,7 +4423,8 @@ static int propagate_liveness(struct bpf_verifier_env *env,
 		if (vparent->frame[vparent->curframe]->regs[i].live & REG_LIVE_READ)
 			continue;
 		if (vstate->frame[vstate->curframe]->regs[i].live & REG_LIVE_READ) {
-			err = mark_reg_read(env, vstate, vparent, i);
+			err = mark_reg_read(env, &vstate->frame[vstate->curframe]->regs[i],
+					    &vparent->frame[vstate->curframe]->regs[i]);
 			if (err)
 				return err;
 		}
@@ -4541,7 +4439,8 @@ static int propagate_liveness(struct bpf_verifier_env *env,
 			if (parent->stack[i].spilled_ptr.live & REG_LIVE_READ)
 				continue;
 			if (state->stack[i].spilled_ptr.live & REG_LIVE_READ)
-				mark_stack_slot_read(env, vstate, vparent, i, frame);
+				mark_reg_read(env, &state->stack[i].spilled_ptr,
+					      &parent->stack[i].spilled_ptr);
 		}
 	}
 	return err;
@@ -4551,7 +4450,7 @@ static int is_state_visited(struct bpf_verifier_env *env, int insn_idx)
 {
 	struct bpf_verifier_state_list *new_sl;
 	struct bpf_verifier_state_list *sl;
-	struct bpf_verifier_state *cur = env->cur_state;
+	struct bpf_verifier_state *cur = env->cur_state, *new;
 	int i, j, err;
 
 	sl = env->explored_states[insn_idx];
@@ -4593,16 +4492,18 @@ static int is_state_visited(struct bpf_verifier_env *env, int insn_idx)
 		return -ENOMEM;
 
 	/* add new state to the head of linked list */
-	err = copy_verifier_state(&new_sl->state, cur);
+	new = &new_sl->state;
+	err = copy_verifier_state(new, cur);
 	if (err) {
-		free_verifier_state(&new_sl->state, false);
+		free_verifier_state(new, false);
 		kfree(new_sl);
 		return err;
 	}
 	new_sl->next = env->explored_states[insn_idx];
 	env->explored_states[insn_idx] = new_sl;
 	/* connect new state to parentage chain */
-	cur->parent = &new_sl->state;
+	for (i = 0; i < BPF_REG_FP; i++)
+		cur_regs(env)[i].parent = &new->frame[new->curframe]->regs[i];
 	/* clear write marks in current state: the writes we did are not writes
 	 * our child did, so they don't screen off its reads from us.
 	 * (There are no read marks in current state, because reads always mark
@@ -4615,9 +4516,13 @@ static int is_state_visited(struct bpf_verifier_env *env, int insn_idx)
 	/* all stack frames are accessible from callee, clear them all */
 	for (j = 0; j <= cur->curframe; j++) {
 		struct bpf_func_state *frame = cur->frame[j];
+		struct bpf_func_state *newframe = new->frame[j];
 
-		for (i = 0; i < frame->allocated_stack / BPF_REG_SIZE; i++)
+		for (i = 0; i < frame->allocated_stack / BPF_REG_SIZE; i++) {
 			frame->stack[i].spilled_ptr.live = REG_LIVE_NONE;
+			frame->stack[i].spilled_ptr.parent =
+						&newframe->stack[i].spilled_ptr;
+		}
 	}
 	return 0;
 }
@@ -4636,7 +4541,6 @@ static int do_check(struct bpf_verifier_env *env)
 	if (!state)
 		return -ENOMEM;
 	state->curframe = 0;
-	state->parent = NULL;
 	state->frame[0] = kzalloc(sizeof(struct bpf_func_state), GFP_KERNEL);
 	if (!state->frame[0]) {
 		kfree(state);

^ permalink raw reply related

* [RFC PATCH v3 bpf-next 2/5] bpf/verifier: rewrite subprog boundary detection
From: Edward Cree @ 2018-04-06 17:13 UTC (permalink / raw)
  To: Alexei Starovoitov, Daniel Borkmann; +Cc: netdev
In-Reply-To: <b445b3c5-04f8-7425-2a3c-9bc54a04add2@solarflare.com>

By storing a subprogno in each insn's aux data, we avoid the need to keep
 the list of subprog starts sorted or bsearch() it in find_subprog().
Also, get rid of the weird one-based indexing of subprog numbers.

Signed-off-by: Edward Cree <ecree@solarflare.com>
---
 include/linux/bpf_verifier.h |   3 +-
 kernel/bpf/verifier.c        | 284 ++++++++++++++++++++++++++-----------------
 2 files changed, 177 insertions(+), 110 deletions(-)

diff --git a/include/linux/bpf_verifier.h b/include/linux/bpf_verifier.h
index 8f70dc181e23..17990dd56e65 100644
--- a/include/linux/bpf_verifier.h
+++ b/include/linux/bpf_verifier.h
@@ -146,6 +146,7 @@ struct bpf_insn_aux_data {
 		s32 call_imm;			/* saved imm field of call insn */
 	};
 	int ctx_field_size; /* the ctx field size for load insn, maybe 0 */
+	u16 subprogno; /* subprog in which this insn resides */
 	bool seen; /* this insn was processed by the verifier */
 };
 
@@ -196,7 +197,7 @@ struct bpf_verifier_env {
 	bool seen_direct_write;
 	struct bpf_insn_aux_data *insn_aux_data; /* array of per-insn state */
 	struct bpf_verifier_log log;
-	struct bpf_subprog_info subprog_info[BPF_MAX_SUBPROGS + 1];
+	struct bpf_subprog_info subprog_info[BPF_MAX_SUBPROGS];
 	u32 subprog_cnt;
 };
 
diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c
index df4d742360d9..587eb445bfa2 100644
--- a/kernel/bpf/verifier.c
+++ b/kernel/bpf/verifier.c
@@ -736,109 +736,171 @@ enum reg_arg_type {
 	DST_OP_NO_MARK	/* same as above, check only, don't mark */
 };
 
-static int cmp_subprogs(const void *a, const void *b)
+static int find_subprog(struct bpf_verifier_env *env, int insn_idx)
 {
-	return ((struct bpf_subprog_info *)a)->start -
-	       ((struct bpf_subprog_info *)b)->start;
-}
-
-static int find_subprog(struct bpf_verifier_env *env, int off)
-{
-	struct bpf_subprog_info *p;
-
-	p = bsearch(&off, env->subprog_info, env->subprog_cnt,
-		    sizeof(env->subprog_info[0]), cmp_subprogs);
-	if (!p)
-		return -ENOENT;
-	return p - env->subprog_info;
+	int insn_cnt = env->prog->len;
+	u32 subprogno;
 
+	if (insn_idx >= insn_cnt || insn_idx < 0) {
+		verbose(env, "find_subprog of invalid insn_idx %d\n", insn_idx);
+		return -EINVAL;
+	}
+	subprogno = env->insn_aux_data[insn_idx].subprogno;
+	/* validate that we are at start of subprog */
+	if (env->subprog_info[subprogno].start != insn_idx) {
+		verbose(env, "insn_idx %d is in subprog %u but that starts at %d\n",
+			insn_idx, subprogno, env->subprog_info[subprogno].start);
+		return -EINVAL;
+	}
+	return subprogno;
 }
 
 static int add_subprog(struct bpf_verifier_env *env, int off)
 {
 	int insn_cnt = env->prog->len;
+	struct bpf_subprog_info *info;
 	int ret;
 
 	if (off >= insn_cnt || off < 0) {
 		verbose(env, "call to invalid destination\n");
 		return -EINVAL;
 	}
-	ret = find_subprog(env, off);
-	if (ret >= 0)
-		return 0;
-	if (env->subprog_cnt >= BPF_MAX_SUBPROGS) {
-		verbose(env, "too many subprograms\n");
-		return -E2BIG;
+	ret = env->insn_aux_data[off].subprogno;
+	/* At the time add_subprog is called, only (some) entry points are
+	 * marked with an aux->subprogno; 'body' insns aren't.  Thus, a
+	 * subprogno of 0 means either "not yet marked as an entry point" or
+	 * "entry point of main(), i.e. insn 0".
+	 * However, a call to insn 0 is never legal (it would necessarily imply
+	 * recursion, which check_cfg will catch), therefore we can assume that
+	 * we will only be called with off == 0 once, and in that case we
+	 * should go ahead and add the subprog entry.
+	 */
+	if (!ret) {
+		if (env->subprog_cnt >= BPF_MAX_SUBPROGS) {
+			verbose(env, "too many subprograms\n");
+			return -E2BIG;
+		}
+		ret = env->subprog_cnt++;
+		info = &env->subprog_info[ret];
+		info->start = off;
+		info->stack_depth = 0;
+		env->insn_aux_data[off].subprogno = ret;
 	}
-	env->subprog_info[env->subprog_cnt++].start = off;
-	sort(env->subprog_info, env->subprog_cnt,
-	     sizeof(env->subprog_info[0]), cmp_subprogs, NULL);
-	return 0;
+	return ret;
 }
 
 static int check_subprogs(struct bpf_verifier_env *env)
 {
-	int i, ret, subprog_start, subprog_end, off, cur_subprog = 0;
-	struct bpf_insn *insn = env->prog->insnsi;
-	int insn_cnt = env->prog->len;
+	struct bpf_insn_aux_data *aux = env->insn_aux_data;
+	struct bpf_insn *insns = env->prog->insnsi;
+	int insn_cnt = env->prog->len, i, err;
+	int cur_subprog;
 
-	/* determine subprog starts. The end is one before the next starts */
-	for (i = 0; i < insn_cnt; i++) {
-		if (insn[i].code != (BPF_JMP | BPF_CALL))
-			continue;
-		if (insn[i].src_reg != BPF_PSEUDO_CALL)
-			continue;
-		if (!env->allow_ptr_leaks) {
-			verbose(env, "function calls to other bpf functions are allowed for root only\n");
-			return -EPERM;
-		}
-		if (bpf_prog_is_dev_bound(env->prog->aux)) {
-			verbose(env, "function calls in offloaded programs are not supported yet\n");
-			return -EINVAL;
+	/* Find subprog starts */
+	err = add_subprog(env, 0); /* subprog 0, main(), starts at insn 0 */
+	if (err < 0)
+		return err;
+	for (i = 0; i < insn_cnt; i++)
+		if (insns[i].code == (BPF_JMP | BPF_CALL) &&
+		    insns[i].src_reg == BPF_PSEUDO_CALL) {
+			if (!env->allow_ptr_leaks) {
+				verbose(env, "function calls to other bpf functions are allowed for root only\n");
+				return -EPERM;
+			}
+			if (bpf_prog_is_dev_bound(env->prog->aux)) {
+				verbose(env, "function calls in offloaded programs are not supported yet\n");
+				return -EINVAL;
+			}
+			/* add_subprog marks the callee entry point with the
+			 * new subprogno.
+			 */
+			err = add_subprog(env, i + insns[i].imm + 1);
+			if (err < 0)
+				return err;
 		}
-		ret = add_subprog(env, i + insn[i].imm + 1);
-		if (ret < 0)
-			return ret;
-	}
 
 	if (env->log.level > 1)
 		for (i = 0; i < env->subprog_cnt; i++)
 			verbose(env, "func#%d @%d\n", i, env->subprog_info[i].start);
 
-	/* now check that all jumps are within the same subprog */
-	subprog_start = 0;
-	if (env->subprog_cnt == cur_subprog)
-		subprog_end = insn_cnt;
-	else
-		subprog_end = env->subprog_info[cur_subprog++].start;
+	/* Fill in subprogno on each insn of function body, on the assumption
+	 * that each subprog is a contiguous block of insns.  This will be
+	 * validated by the next step
+	 */
+	cur_subprog = 0;
+	for (i = 0; i < insn_cnt; i++)
+		if (aux[i].subprogno)
+			cur_subprog = aux[i].subprogno;
+		else
+			aux[i].subprogno = cur_subprog;
+
+	/* Check that control never flows from one subprog to another except by
+	 * a pseudo-call insn.
+	 */
 	for (i = 0; i < insn_cnt; i++) {
-		u8 code = insn[i].code;
-
-		if (BPF_CLASS(code) != BPF_JMP)
-			goto next;
-		if (BPF_OP(code) == BPF_EXIT || BPF_OP(code) == BPF_CALL)
-			goto next;
-		off = i + insn[i].off + 1;
-		if (off < subprog_start || off >= subprog_end) {
-			verbose(env, "jump out of range from insn %d to %d\n", i, off);
-			return -EINVAL;
+		bool fallthrough = false, jump = false;
+		u8 opcode = BPF_OP(insns[i].code);
+		int target;
+
+		/* Determine where control flow from this insn goes */
+		if (BPF_CLASS(insns[i].code) != BPF_JMP) {
+			fallthrough = true;
+		} else {
+			switch (opcode) {
+			case BPF_EXIT:
+				/* This insn doesn't go anywhere.  If we're in
+				 * a subprog, then the return target is handled
+				 * as a fallthrough on the call insn, so no need
+				 * to worry about it here.
+				 */
+				continue;
+			case BPF_CALL:
+				/* If pseudo-call, already handled marking the
+				 * callee.  Both kinds of call will eventually
+				 * return and pass to the following insn.
+				 */
+				fallthrough = true;
+				break;
+			case BPF_JA:
+				/* unconditional jump; mark target */
+				jump = true;
+				target = i + insns[i].off + 1;
+				break;
+			default:
+				/* conditional jump; branch both ways */
+				fallthrough = true;
+				jump = true;
+				target = i + insns[i].off + 1;
+				break;
+			}
 		}
-next:
-		if (i == subprog_end - 1) {
-			/* to avoid fall-through from one subprog into another
-			 * the last insn of the subprog should be either exit
-			 * or unconditional jump back
-			 */
-			if (code != (BPF_JMP | BPF_EXIT) &&
-			    code != (BPF_JMP | BPF_JA)) {
-				verbose(env, "last insn is not an exit or jmp\n");
+
+		/* Check that that control flow doesn't leave the subprog */
+		cur_subprog = aux[i].subprogno;
+		if (fallthrough) {
+			if (i + 1 >= insn_cnt) {
+				verbose(env, "no exit/jump at end of program (insn %d)\n",
+					i);
+				return -EINVAL;
+			}
+			if (aux[i + 1].subprogno != cur_subprog) {
+				verbose(env, "no exit/jump at end of subprog %d (insn %d)\n",
+					cur_subprog, i);
+				return -EINVAL;
+			}
+		}
+		if (jump) {
+			if (target < 0 || target >= insn_cnt) {
+				verbose(env, "jump out of range from insn %d to %d\n",
+					i, target);
+				return -EINVAL;
+			}
+			if (aux[target].subprogno != cur_subprog) {
+				verbose(env, "jump from insn %d to %d crosses from subprog %d to %d\n",
+					i, target, cur_subprog,
+					aux[target].subprogno);
 				return -EINVAL;
 			}
-			subprog_start = subprog_end;
-			if (env->subprog_cnt == cur_subprog)
-				subprog_end = insn_cnt;
-			else
-				subprog_end = env->subprog_info[cur_subprog++].start;
 		}
 	}
 	return 0;
@@ -1489,7 +1551,8 @@ static int update_stack_depth(struct bpf_verifier_env *env,
  */
 static int check_max_stack_depth(struct bpf_verifier_env *env)
 {
-	int depth = 0, frame = 0, subprog = 0, i = 0, subprog_end;
+	int depth = 0, frame = 0, subprog = 0, i = 0;
+	struct bpf_insn_aux_data *aux = env->insn_aux_data;
 	struct bpf_insn *insn = env->prog->insnsi;
 	int insn_cnt = env->prog->len;
 	int ret_insn[MAX_CALL_FRAMES];
@@ -1507,11 +1570,7 @@ static int check_max_stack_depth(struct bpf_verifier_env *env)
 		return -EACCES;
 	}
 continue_func:
-	if (env->subprog_cnt == subprog)
-		subprog_end = insn_cnt;
-	else
-		subprog_end = env->subprog_info[subprog].start;
-	for (; i < subprog_end; i++) {
+	for (; i < insn_cnt && aux[i].subprogno == subprog; i++) {
 		if (insn[i].code != (BPF_JMP | BPF_CALL))
 			continue;
 		if (insn[i].src_reg != BPF_PSEUDO_CALL)
@@ -1528,7 +1587,6 @@ static int check_max_stack_depth(struct bpf_verifier_env *env)
 				  i);
 			return -EFAULT;
 		}
-		subprog++;
 		frame++;
 		if (frame >= MAX_CALL_FRAMES) {
 			WARN_ONCE(1, "verifier bug. Call stack is too deep\n");
@@ -1561,7 +1619,6 @@ static int get_callee_stack_depth(struct bpf_verifier_env *env,
 			  start);
 		return -EFAULT;
 	}
-	subprog++;
 	return env->subprog_info[subprog].stack_depth;
 }
 #endif
@@ -2100,7 +2157,7 @@ static int check_map_func_compatibility(struct bpf_verifier_env *env,
 	case BPF_FUNC_tail_call:
 		if (map->map_type != BPF_MAP_TYPE_PROG_ARRAY)
 			goto error;
-		if (env->subprog_cnt) {
+		if (env->subprog_cnt > 1) {
 			verbose(env, "tail_calls are not allowed in programs with bpf-to-bpf calls\n");
 			return -EINVAL;
 		}
@@ -2245,6 +2302,9 @@ static int check_func_call(struct bpf_verifier_env *env, struct bpf_insn *insn,
 	}
 
 	target_insn = *insn_idx + insn->imm;
+	/* We will increment insn_idx (PC) in do_check() after handling this
+	 * call insn, so the actual start of the function is target + 1.
+	 */
 	subprog = find_subprog(env, target_insn + 1);
 	if (subprog < 0) {
 		verbose(env, "verifier bug. No program starts at insn %d\n",
@@ -2272,7 +2332,7 @@ static int check_func_call(struct bpf_verifier_env *env, struct bpf_insn *insn,
 			/* remember the callsite, it will be used by bpf_exit */
 			*insn_idx /* callsite */,
 			state->curframe + 1 /* frameno within this callchain */,
-			subprog + 1 /* subprog number within this prog */);
+			subprog /* subprog number within this prog */);
 
 	/* copy r1 - r5 args that callee can access */
 	for (i = BPF_REG_1; i <= BPF_REG_5; i++)
@@ -3831,7 +3891,7 @@ static int check_ld_abs(struct bpf_verifier_env *env, struct bpf_insn *insn)
 		return -EINVAL;
 	}
 
-	if (env->subprog_cnt) {
+	if (env->subprog_cnt > 1) {
 		/* when program has LD_ABS insn JITs and interpreter assume
 		 * that r1 == ctx == skb which is not the case for callees
 		 * that can have arbitrary arguments. It's problematic
@@ -4020,10 +4080,6 @@ static int check_cfg(struct bpf_verifier_env *env)
 	int ret = 0;
 	int i, t;
 
-	ret = check_subprogs(env);
-	if (ret < 0)
-		return ret;
-
 	insn_state = kcalloc(insn_cnt, sizeof(int), GFP_KERNEL);
 	if (!insn_state)
 		return -ENOMEM;
@@ -4862,11 +4918,11 @@ static int do_check(struct bpf_verifier_env *env)
 
 	verbose(env, "processed %d insns (limit %d), stack depth ",
 		insn_processed, BPF_COMPLEXITY_LIMIT_INSNS);
-	for (i = 0; i < env->subprog_cnt + 1; i++) {
+	for (i = 0; i < env->subprog_cnt; i++) {
 		u32 depth = env->subprog_info[i].stack_depth;
 
 		verbose(env, "%d", depth);
-		if (i + 1 < env->subprog_cnt + 1)
+		if (i + 1 < env->subprog_cnt)
 			verbose(env, "+");
 	}
 	verbose(env, "\n");
@@ -5238,12 +5294,12 @@ static int convert_ctx_accesses(struct bpf_verifier_env *env)
 static int jit_subprogs(struct bpf_verifier_env *env)
 {
 	struct bpf_prog *prog = env->prog, **func, *tmp;
-	int i, j, subprog_start, subprog_end = 0, len, subprog;
 	struct bpf_insn *insn;
 	void *old_bpf_func;
+	int i, j, subprog;
 	int err = -ENOMEM;
 
-	if (env->subprog_cnt == 0)
+	if (env->subprog_cnt <= 1)
 		return 0;
 
 	for (i = 0, insn = prog->insnsi; i < prog->len; i++, insn++) {
@@ -5259,7 +5315,7 @@ static int jit_subprogs(struct bpf_verifier_env *env)
 		/* temporarily remember subprog id inside insn instead of
 		 * aux_data, since next loop will split up all insns into funcs
 		 */
-		insn->off = subprog + 1;
+		insn->off = subprog;
 		/* remember original imm in case JIT fails and fallback
 		 * to interpreter will be needed
 		 */
@@ -5268,23 +5324,24 @@ static int jit_subprogs(struct bpf_verifier_env *env)
 		insn->imm = 1;
 	}
 
-	func = kzalloc(sizeof(prog) * (env->subprog_cnt + 1), GFP_KERNEL);
+	func = kzalloc(sizeof(prog) * env->subprog_cnt, GFP_KERNEL);
 	if (!func)
 		return -ENOMEM;
 
-	for (i = 0; i <= env->subprog_cnt; i++) {
-		subprog_start = subprog_end;
-		if (env->subprog_cnt == i)
-			subprog_end = prog->len;
-		else
-			subprog_end = env->subprog_info[i].start;
+	for (i = 0; i < env->subprog_cnt; i++) {
+		struct bpf_subprog_info *info = &env->subprog_info[i];
+		int len, end = prog->len;
+
+		if (i + 1 < env->subprog_cnt)
+			end = info[1].start;
+		len = end - info->start;
 
-		len = subprog_end - subprog_start;
 		func[i] = bpf_prog_alloc(bpf_prog_size(len), GFP_USER);
 		if (!func[i])
 			goto out_free;
-		memcpy(func[i]->insnsi, &prog->insnsi[subprog_start],
+		memcpy(func[i]->insnsi, prog->insnsi + info->start,
 		       len * sizeof(struct bpf_insn));
+
 		func[i]->type = prog->type;
 		func[i]->len = len;
 		if (bpf_prog_calc_tag(func[i]))
@@ -5307,7 +5364,7 @@ static int jit_subprogs(struct bpf_verifier_env *env)
 	 * now populate all bpf_calls with correct addresses and
 	 * run last pass of JIT
 	 */
-	for (i = 0; i <= env->subprog_cnt; i++) {
+	for (i = 0; i < env->subprog_cnt; i++) {
 		insn = func[i]->insnsi;
 		for (j = 0; j < func[i]->len; j++, insn++) {
 			if (insn->code != (BPF_JMP | BPF_CALL) ||
@@ -5315,12 +5372,16 @@ static int jit_subprogs(struct bpf_verifier_env *env)
 				continue;
 			subprog = insn->off;
 			insn->off = 0;
+			if (subprog < 0 || subprog >= env->subprog_cnt) {
+				err = -EFAULT;
+				goto out_free;
+			}
 			insn->imm = (u64 (*)(u64, u64, u64, u64, u64))
 				func[subprog]->bpf_func -
 				__bpf_call_base;
 		}
 	}
-	for (i = 0; i <= env->subprog_cnt; i++) {
+	for (i = 0; i < env->subprog_cnt; i++) {
 		old_bpf_func = func[i]->bpf_func;
 		tmp = bpf_int_jit_compile(func[i]);
 		if (tmp != func[i] || func[i]->bpf_func != old_bpf_func) {
@@ -5334,7 +5395,7 @@ static int jit_subprogs(struct bpf_verifier_env *env)
 	/* finally lock prog and jit images for all functions and
 	 * populate kallsysm
 	 */
-	for (i = 0; i <= env->subprog_cnt; i++) {
+	for (i = 0; i < env->subprog_cnt; i++) {
 		bpf_prog_lock_ro(func[i]);
 		bpf_prog_kallsyms_add(func[i]);
 	}
@@ -5351,7 +5412,7 @@ static int jit_subprogs(struct bpf_verifier_env *env)
 			continue;
 		insn->off = env->insn_aux_data[i].call_imm;
 		subprog = find_subprog(env, i + insn->off + 1);
-		addr  = (unsigned long)func[subprog + 1]->bpf_func;
+		addr  = (unsigned long)func[subprog]->bpf_func;
 		addr &= PAGE_MASK;
 		insn->imm = (u64 (*)(u64, u64, u64, u64, u64))
 			    addr - __bpf_call_base;
@@ -5360,10 +5421,10 @@ static int jit_subprogs(struct bpf_verifier_env *env)
 	prog->jited = 1;
 	prog->bpf_func = func[0]->bpf_func;
 	prog->aux->func = func;
-	prog->aux->func_cnt = env->subprog_cnt + 1;
+	prog->aux->func_cnt = env->subprog_cnt;
 	return 0;
 out_free:
-	for (i = 0; i <= env->subprog_cnt; i++)
+	for (i = 0; i < env->subprog_cnt; i++)
 		if (func[i])
 			bpf_jit_free(func[i]);
 	kfree(func);
@@ -5393,6 +5454,7 @@ static int fixup_call_args(struct bpf_verifier_env *env)
 		err = jit_subprogs(env);
 		if (err == 0)
 			return 0;
+		verbose(env, "failed to jit_subprogs %d\n", err);
 	}
 #ifndef CONFIG_BPF_JIT_ALWAYS_ON
 	for (i = 0; i < prog->len; i++, insn++) {
@@ -5682,6 +5744,10 @@ int bpf_check(struct bpf_prog **prog, union bpf_attr *attr)
 
 	env->allow_ptr_leaks = capable(CAP_SYS_ADMIN);
 
+	ret = check_subprogs(env);
+	if (ret < 0)
+		goto skip_full_check;
+
 	ret = check_cfg(env);
 	if (ret < 0)
 		goto skip_full_check;

^ permalink raw reply related

* [RFC PATCH v3 bpf-next 3/5] bpf/verifier: update selftests
From: Edward Cree @ 2018-04-06 17:14 UTC (permalink / raw)
  To: Alexei Starovoitov, Daniel Borkmann; +Cc: netdev
In-Reply-To: <b445b3c5-04f8-7425-2a3c-9bc54a04add2@solarflare.com>

Error messages for some bad programs have changed.
Also added a test ("calls: interleaved functions") to ensure that subprogs
 are required to be contiguous.
It wasn't entirely clear to me what "calls: wrong recursive calls" was
 meant to test for, since all of the JMP|CALL insns are unreachable.  I've
 changed it so that they are now reachable, which causes static back-edges
 to be detected (since that, like insn reachability, is now tested before
 subprog boundaries are determined).

Signed-off-by: Edward Cree <ecree@solarflare.com>
---
 tools/testing/selftests/bpf/test_verifier.c | 51 ++++++++++++++++++-----------
 1 file changed, 32 insertions(+), 19 deletions(-)

diff --git a/tools/testing/selftests/bpf/test_verifier.c b/tools/testing/selftests/bpf/test_verifier.c
index 3e7718b1a9ae..d53522d20072 100644
--- a/tools/testing/selftests/bpf/test_verifier.c
+++ b/tools/testing/selftests/bpf/test_verifier.c
@@ -646,7 +646,7 @@ static struct bpf_test tests[] = {
 		.insns = {
 			BPF_ALU64_REG(BPF_MOV, BPF_REG_0, BPF_REG_2),
 		},
-		.errstr = "not an exit",
+		.errstr = "no exit/jump at end of program",
 		.result = REJECT,
 	},
 	{
@@ -9442,13 +9442,13 @@ static struct bpf_test tests[] = {
 			BPF_EXIT_INSN(),
 		},
 		.prog_type = BPF_PROG_TYPE_TRACEPOINT,
-		.errstr = "last insn is not an exit or jmp",
+		.errstr = "no exit/jump at end of subprog 0 (insn 0)",
 		.result = REJECT,
 	},
 	{
 		"calls: wrong recursive calls",
 		.insns = {
-			BPF_JMP_IMM(BPF_JA, 0, 0, 4),
+			BPF_JMP_IMM(BPF_JA, 0, 0, 3),
 			BPF_JMP_IMM(BPF_JA, 0, 0, 4),
 			BPF_RAW_INSN(BPF_JMP | BPF_CALL, 0, 1, 0, -2),
 			BPF_RAW_INSN(BPF_JMP | BPF_CALL, 0, 1, 0, -2),
@@ -9457,7 +9457,7 @@ static struct bpf_test tests[] = {
 			BPF_EXIT_INSN(),
 		},
 		.prog_type = BPF_PROG_TYPE_TRACEPOINT,
-		.errstr = "jump out of range",
+		.errstr = "jump from insn 0 to 4 crosses",
 		.result = REJECT,
 	},
 	{
@@ -9508,7 +9508,7 @@ static struct bpf_test tests[] = {
 			BPF_EXIT_INSN(),
 		},
 		.prog_type = BPF_PROG_TYPE_TRACEPOINT,
-		.errstr = "jump out of range",
+		.errstr = "jump from insn 1 to 5 crosses",
 		.result = REJECT,
 	},
 	{
@@ -9787,7 +9787,7 @@ static struct bpf_test tests[] = {
 			BPF_EXIT_INSN(),
 		},
 		.prog_type = BPF_PROG_TYPE_SCHED_CLS,
-		.errstr = "jump out of range from insn 1 to 4",
+		.errstr = "jump from insn 1 to 4 crosses",
 		.result = REJECT,
 	},
 	{
@@ -9803,13 +9803,12 @@ static struct bpf_test tests[] = {
 			BPF_ALU64_REG(BPF_ADD, BPF_REG_7, BPF_REG_0),
 			BPF_MOV64_REG(BPF_REG_0, BPF_REG_7),
 			BPF_EXIT_INSN(),
-			BPF_LDX_MEM(BPF_W, BPF_REG_0, BPF_REG_1,
-				    offsetof(struct __sk_buff, len)),
+			BPF_LDX_MEM(BPF_DW, BPF_REG_0, BPF_REG_1, 8),
 			BPF_JMP_IMM(BPF_JEQ, BPF_REG_0, 0, -3),
 			BPF_EXIT_INSN(),
 		},
 		.prog_type = BPF_PROG_TYPE_TRACEPOINT,
-		.errstr = "jump out of range from insn 11 to 9",
+		.errstr = "jump from insn 11 to 9 crosses",
 		.result = REJECT,
 	},
 	{
@@ -9861,7 +9860,7 @@ static struct bpf_test tests[] = {
 			BPF_EXIT_INSN(),
 		},
 		.prog_type = BPF_PROG_TYPE_TRACEPOINT,
-		.errstr = "invalid destination",
+		.errstr = "call to invalid destination",
 		.result = REJECT,
 	},
 	{
@@ -9873,7 +9872,7 @@ static struct bpf_test tests[] = {
 			BPF_EXIT_INSN(),
 		},
 		.prog_type = BPF_PROG_TYPE_TRACEPOINT,
-		.errstr = "invalid destination",
+		.errstr = "call to invalid destination",
 		.result = REJECT,
 	},
 	{
@@ -9886,7 +9885,7 @@ static struct bpf_test tests[] = {
 			BPF_EXIT_INSN(),
 		},
 		.prog_type = BPF_PROG_TYPE_TRACEPOINT,
-		.errstr = "jump out of range",
+		.errstr = "jump from insn 3 to 1 crosses",
 		.result = REJECT,
 	},
 	{
@@ -9899,7 +9898,7 @@ static struct bpf_test tests[] = {
 			BPF_EXIT_INSN(),
 		},
 		.prog_type = BPF_PROG_TYPE_TRACEPOINT,
-		.errstr = "jump out of range",
+		.errstr = "jump from insn 0 to 4 crosses",
 		.result = REJECT,
 	},
 	{
@@ -9913,7 +9912,7 @@ static struct bpf_test tests[] = {
 			BPF_JMP_IMM(BPF_JEQ, BPF_REG_1, 0, -2),
 		},
 		.prog_type = BPF_PROG_TYPE_TRACEPOINT,
-		.errstr = "not an exit",
+		.errstr = "no exit/jump at end of program",
 		.result = REJECT,
 	},
 	{
@@ -9927,7 +9926,7 @@ static struct bpf_test tests[] = {
 			BPF_EXIT_INSN(),
 		},
 		.prog_type = BPF_PROG_TYPE_TRACEPOINT,
-		.errstr = "last insn",
+		.errstr = "no exit/jump at end of subprog",
 		.result = REJECT,
 	},
 	{
@@ -9942,7 +9941,7 @@ static struct bpf_test tests[] = {
 			BPF_EXIT_INSN(),
 		},
 		.prog_type = BPF_PROG_TYPE_TRACEPOINT,
-		.errstr = "last insn",
+		.errstr = "no exit/jump at end of subprog",
 		.result = REJECT,
 	},
 	{
@@ -9982,12 +9981,11 @@ static struct bpf_test tests[] = {
 			BPF_ALU64_REG(BPF_ADD, BPF_REG_7, BPF_REG_0),
 			BPF_MOV64_REG(BPF_REG_0, BPF_REG_7),
 			BPF_MOV64_REG(BPF_REG_0, BPF_REG_0),
-			BPF_LDX_MEM(BPF_W, BPF_REG_0, BPF_REG_1,
-				    offsetof(struct __sk_buff, len)),
+			BPF_LDX_MEM(BPF_DW, BPF_REG_0, BPF_REG_1, 8),
 			BPF_EXIT_INSN(),
 		},
 		.prog_type = BPF_PROG_TYPE_TRACEPOINT,
-		.errstr = "not an exit",
+		.errstr = "no exit/jump at end of subprog",
 		.result = REJECT,
 	},
 	{
@@ -11423,6 +11421,21 @@ static struct bpf_test tests[] = {
 		.errstr = "BPF_XADD stores into R2 packet",
 		.prog_type = BPF_PROG_TYPE_XDP,
 	},
+	{
+		"calls: interleaved functions",
+		.insns = {
+			BPF_RAW_INSN(BPF_JMP | BPF_CALL, 0, 1, 0, 2),
+			BPF_MOV64_IMM(BPF_REG_0, 1),
+			BPF_JMP_IMM(BPF_JA, 0, 0, 2),
+			BPF_MOV64_IMM(BPF_REG_0, 2),
+			BPF_JMP_IMM(BPF_JA, 0, 0, 1),
+			BPF_EXIT_INSN(),
+			BPF_EXIT_INSN(),
+		},
+		.prog_type = BPF_PROG_TYPE_TRACEPOINT,
+		.errstr = "jump from insn 2 to 5 crosses",
+		.result = REJECT,
+	},
 };
 
 static int probe_filter_length(const struct bpf_insn *fp)

^ permalink raw reply related

* [RFC PATCH v3 bpf-next 1/5] bpf/verifier: store subprog info in struct bpf_subprog_info
From: Edward Cree @ 2018-04-06 17:13 UTC (permalink / raw)
  To: Alexei Starovoitov, Daniel Borkmann; +Cc: netdev
In-Reply-To: <b445b3c5-04f8-7425-2a3c-9bc54a04add2@solarflare.com>

Per-subprog info is stored in env->subprog_info, an array of structs,
 rather than multiple arrays with a common index.

Signed-off-by: Edward Cree <ecree@solarflare.com>
---
 include/linux/bpf_verifier.h |  9 +++++---
 kernel/bpf/verifier.c        | 49 +++++++++++++++++++++++---------------------
 2 files changed, 32 insertions(+), 26 deletions(-)

diff --git a/include/linux/bpf_verifier.h b/include/linux/bpf_verifier.h
index 7e61c395fddf..8f70dc181e23 100644
--- a/include/linux/bpf_verifier.h
+++ b/include/linux/bpf_verifier.h
@@ -173,6 +173,11 @@ static inline bool bpf_verifier_log_needed(const struct bpf_verifier_log *log)
 
 #define BPF_MAX_SUBPROGS 256
 
+struct bpf_subprog_info {
+	u32 start; /* insn idx of function entry point */
+	u16 stack_depth; /* max. stack depth used by this function */
+};
+
 /* single container for all structs
  * one verifier_env per bpf_check() call
  */
@@ -191,9 +196,7 @@ struct bpf_verifier_env {
 	bool seen_direct_write;
 	struct bpf_insn_aux_data *insn_aux_data; /* array of per-insn state */
 	struct bpf_verifier_log log;
-	u32 subprog_starts[BPF_MAX_SUBPROGS];
-	/* computes the stack depth of each bpf function */
-	u16 subprog_stack_depth[BPF_MAX_SUBPROGS + 1];
+	struct bpf_subprog_info subprog_info[BPF_MAX_SUBPROGS + 1];
 	u32 subprog_cnt;
 };
 
diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c
index 5dd1dcb902bf..df4d742360d9 100644
--- a/kernel/bpf/verifier.c
+++ b/kernel/bpf/verifier.c
@@ -738,18 +738,19 @@ enum reg_arg_type {
 
 static int cmp_subprogs(const void *a, const void *b)
 {
-	return *(int *)a - *(int *)b;
+	return ((struct bpf_subprog_info *)a)->start -
+	       ((struct bpf_subprog_info *)b)->start;
 }
 
 static int find_subprog(struct bpf_verifier_env *env, int off)
 {
-	u32 *p;
+	struct bpf_subprog_info *p;
 
-	p = bsearch(&off, env->subprog_starts, env->subprog_cnt,
-		    sizeof(env->subprog_starts[0]), cmp_subprogs);
+	p = bsearch(&off, env->subprog_info, env->subprog_cnt,
+		    sizeof(env->subprog_info[0]), cmp_subprogs);
 	if (!p)
 		return -ENOENT;
-	return p - env->subprog_starts;
+	return p - env->subprog_info;
 
 }
 
@@ -769,9 +770,9 @@ static int add_subprog(struct bpf_verifier_env *env, int off)
 		verbose(env, "too many subprograms\n");
 		return -E2BIG;
 	}
-	env->subprog_starts[env->subprog_cnt++] = off;
-	sort(env->subprog_starts, env->subprog_cnt,
-	     sizeof(env->subprog_starts[0]), cmp_subprogs, NULL);
+	env->subprog_info[env->subprog_cnt++].start = off;
+	sort(env->subprog_info, env->subprog_cnt,
+	     sizeof(env->subprog_info[0]), cmp_subprogs, NULL);
 	return 0;
 }
 
@@ -802,14 +803,14 @@ static int check_subprogs(struct bpf_verifier_env *env)
 
 	if (env->log.level > 1)
 		for (i = 0; i < env->subprog_cnt; i++)
-			verbose(env, "func#%d @%d\n", i, env->subprog_starts[i]);
+			verbose(env, "func#%d @%d\n", i, env->subprog_info[i].start);
 
 	/* now check that all jumps are within the same subprog */
 	subprog_start = 0;
 	if (env->subprog_cnt == cur_subprog)
 		subprog_end = insn_cnt;
 	else
-		subprog_end = env->subprog_starts[cur_subprog++];
+		subprog_end = env->subprog_info[cur_subprog++].start;
 	for (i = 0; i < insn_cnt; i++) {
 		u8 code = insn[i].code;
 
@@ -837,7 +838,7 @@ static int check_subprogs(struct bpf_verifier_env *env)
 			if (env->subprog_cnt == cur_subprog)
 				subprog_end = insn_cnt;
 			else
-				subprog_end = env->subprog_starts[cur_subprog++];
+				subprog_end = env->subprog_info[cur_subprog++].start;
 		}
 	}
 	return 0;
@@ -1470,13 +1471,13 @@ static int update_stack_depth(struct bpf_verifier_env *env,
 			      const struct bpf_func_state *func,
 			      int off)
 {
-	u16 stack = env->subprog_stack_depth[func->subprogno];
+	u16 stack = env->subprog_info[func->subprogno].stack_depth;
 
 	if (stack >= -off)
 		return 0;
 
 	/* update known max for given subprogram */
-	env->subprog_stack_depth[func->subprogno] = -off;
+	env->subprog_info[func->subprogno].stack_depth = -off;
 	return 0;
 }
 
@@ -1498,7 +1499,8 @@ static int check_max_stack_depth(struct bpf_verifier_env *env)
 	/* round up to 32-bytes, since this is granularity
 	 * of interpreter stack size
 	 */
-	depth += round_up(max_t(u32, env->subprog_stack_depth[subprog], 1), 32);
+	depth += round_up(max_t(u32, env->subprog_info[subprog].stack_depth, 1),
+			  32);
 	if (depth > MAX_BPF_STACK) {
 		verbose(env, "combined stack size of %d calls is %d. Too large\n",
 			frame + 1, depth);
@@ -1508,7 +1510,7 @@ static int check_max_stack_depth(struct bpf_verifier_env *env)
 	if (env->subprog_cnt == subprog)
 		subprog_end = insn_cnt;
 	else
-		subprog_end = env->subprog_starts[subprog];
+		subprog_end = env->subprog_info[subprog].start;
 	for (; i < subprog_end; i++) {
 		if (insn[i].code != (BPF_JMP | BPF_CALL))
 			continue;
@@ -1539,7 +1541,8 @@ static int check_max_stack_depth(struct bpf_verifier_env *env)
 	 */
 	if (frame == 0)
 		return 0;
-	depth -= round_up(max_t(u32, env->subprog_stack_depth[subprog], 1), 32);
+	depth -= round_up(max_t(u32, env->subprog_info[subprog].stack_depth, 1),
+			  32);
 	frame--;
 	i = ret_insn[frame];
 	subprog = ret_prog[frame];
@@ -1559,7 +1562,7 @@ static int get_callee_stack_depth(struct bpf_verifier_env *env,
 		return -EFAULT;
 	}
 	subprog++;
-	return env->subprog_stack_depth[subprog];
+	return env->subprog_info[subprog].stack_depth;
 }
 #endif
 
@@ -4860,14 +4863,14 @@ static int do_check(struct bpf_verifier_env *env)
 	verbose(env, "processed %d insns (limit %d), stack depth ",
 		insn_processed, BPF_COMPLEXITY_LIMIT_INSNS);
 	for (i = 0; i < env->subprog_cnt + 1; i++) {
-		u32 depth = env->subprog_stack_depth[i];
+		u32 depth = env->subprog_info[i].stack_depth;
 
 		verbose(env, "%d", depth);
 		if (i + 1 < env->subprog_cnt + 1)
 			verbose(env, "+");
 	}
 	verbose(env, "\n");
-	env->prog->aux->stack_depth = env->subprog_stack_depth[0];
+	env->prog->aux->stack_depth = env->subprog_info[0].stack_depth;
 	return 0;
 }
 
@@ -5074,9 +5077,9 @@ static void adjust_subprog_starts(struct bpf_verifier_env *env, u32 off, u32 len
 	if (len == 1)
 		return;
 	for (i = 0; i < env->subprog_cnt; i++) {
-		if (env->subprog_starts[i] < off)
+		if (env->subprog_info[i].start < off)
 			continue;
-		env->subprog_starts[i] += len - 1;
+		env->subprog_info[i].start += len - 1;
 	}
 }
 
@@ -5274,7 +5277,7 @@ static int jit_subprogs(struct bpf_verifier_env *env)
 		if (env->subprog_cnt == i)
 			subprog_end = prog->len;
 		else
-			subprog_end = env->subprog_starts[i];
+			subprog_end = env->subprog_info[i].start;
 
 		len = subprog_end - subprog_start;
 		func[i] = bpf_prog_alloc(bpf_prog_size(len), GFP_USER);
@@ -5291,7 +5294,7 @@ static int jit_subprogs(struct bpf_verifier_env *env)
 		 * Long term would need debug info to populate names
 		 */
 		func[i]->aux->name[0] = 'F';
-		func[i]->aux->stack_depth = env->subprog_stack_depth[i];
+		func[i]->aux->stack_depth = env->subprog_info[i].stack_depth;
 		func[i]->jit_requested = 1;
 		func[i] = bpf_int_jit_compile(func[i]);
 		if (!func[i]->jited) {

^ permalink raw reply related

* [RFC PATCH v3 bpf-next 0/5] bpf/verifier: subprog/func_call simplifications
From: Edward Cree @ 2018-04-06 17:11 UTC (permalink / raw)
  To: Alexei Starovoitov, Daniel Borkmann; +Cc: netdev

By storing subprog boundaries as a subprogno mark on each insn, rather than
 a start (and implicit end) for each subprog, the code handling subprogs can
 in various places be made simpler, more explicit, and more efficient (i.e.
 better asymptotic complexity).  This also lays the ground for possible
 future work in which an extension of the check_subprogs() code to cover
 basic blocks could replace check_cfg(), which currently largely duplicates
 the insn-walking part.

Some other changes were also included to support this:
* Per-subprog info is stored in env->subprog_info, an array of structs,
  rather than several arrays with a common index.
* Subprog numbers and corresponding subprog_info index are now 0-based;
  subprog 0 is the main()-function of the program, starting at insn 0.
* Call graph is now stored in the new bpf_subprog_info struct; used here
  for check_max_stack_depth() but may have other uses too.

Along with this, patch #5 puts parent pointers (used by liveness analysis)
 in the registers instead of the func_state or verifier_state, so that we
 don't need skip_callee() machinery.  This also does the right thing for
 stack slots, so they don't need their own special handling for liveness
 marking either.

Testing done on this version:
* test_verifier
* test_progs (mostly)
* test_kmod.sh
* tested whether processed-insn numbers (hence pruning) change.
  (Couldn't test with programs containing pseudo-calls (e.g.
  test_xdp_noinline.o), because iproute2 rejects them in its bpf lib.)
  I tested with some Cilium object files, and found that in some cases
  the numbers *do* change as compared to bpf-next.
  - bpf_lxc_opt_-DDROP_ALL.o   26713 -> 17753
  - bpf_lxc_opt_-DUNKNOWN.o    36604 -> 23324
  - bpf_netdev.o               10003 -> 9984
  I'm not yet sure why this is, especially as these object files do not
  contain pseudo-calls.
  One possibility I see: in BPF_MOV64_REG handling, we copy the register
  state, which includes the parent pointer.  But that shouldn't matter,
  because we just read-marked the src reg (via check_reg_arg), and we
  write-mark the dst reg immediately after.  Same goes for other places
  we copy reg states around (e.g. stack reads and writes).
  So I don't quite see how this can affect pruning; but _something_ did.

Changes from v2->v3:
* Added RFC tags back since bpf-next is closed right now.
* Split up first patch into three parts to avoid doing too many things at
  once.
* Restore check_subprogs() as a separate pass rather than doing the work
  on-the-fly in do_check().
* Removed changes in jit_subprogs() handling which were there to support
  non-contiguous subprogs, since that's no longer needed.

Changes from v1->v2:
* No longer allows non-contiguous subprogs.
* No longer allows LD_ABS|IND and pseudo-calls in the same prog.

Edward Cree (5):
  bpf/verifier: store subprog info in struct bpf_subprog_info
  bpf/verifier: rewrite subprog boundary detection
  bpf/verifier: update selftests
  bpf/verifier: use call graph to efficiently check max stack depth
  bpf/verifier: per-register parent pointers

 include/linux/bpf_verifier.h                |  21 +-
 kernel/bpf/verifier.c                       | 617 ++++++++++++++--------------
 tools/testing/selftests/bpf/test_verifier.c |  51 ++-
 3 files changed, 354 insertions(+), 335 deletions(-)

^ permalink raw reply

* Re: [PATCH] net: phy: marvell: Enable interrupt function on LED2 pin
From: Esben Haabendal @ 2018-04-06 17:10 UTC (permalink / raw)
  To: Andrew Lunn
  Cc: Bhadram Varka, Rasmus Villemoes, Florian Fainelli, open list,
	netdev@vger.kernel.org
In-Reply-To: <20180405144355.GB32663@lunn.ch>

Andrew Lunn <andrew@lunn.ch> writes:

>> diff --git a/drivers/net/phy/marvell.c b/drivers/net/phy/marvell.c index
>> 0e0978d8a0eb..f03a510f1247 100644
>> --- a/drivers/net/phy/marvell.c
>> +++ b/drivers/net/phy/marvell.c
>> @@ -457,6 +457,21 @@ static int marvell_of_reg_init(struct phy_device
>> *phydev) } #endif /* CONFIG_OF_MDIO */
>>  
>> +static int m88e1318_config_intr(struct phy_device *phydev) {
>> +	int err;
>> +
>> +	err = marvell_config_intr(phydev);
>> +	if (err)
>> +		return err;
>> +
>> +	/* Setup LED[2] as interrupt pin (active low) */
>> +	return phy_modify(phydev, MII_88E1318S_PHY_LED_TCR,
>> +			  MII_88E1318S_PHY_LED_TCR_FORCE_INT,
>> +			  MII_88E1318S_PHY_LED_TCR_INTn_ENABLE |
>> +			  MII_88E1318S_PHY_LED_TCR_INT_ACTIVE_LOW);
>> 
>> Can we move this part of the code to m88e1121_config_init() ?
>> 
>> Every time whether we disable or enable the interrupts this part of code
>> will execute.
>
> Yes, doing this once would be better. But please allow the LED pin to
> be used as an LED when not using interrupts. phy_interrupt_is_valid()
> should be involved somehow.

This should be addressed in v2 of the patch which I have already posted.

/Esben

^ permalink raw reply

* [PATCH v2] dp83640: Ensure against premature access to PHY registers after reset
From: Esben Haabendal @ 2018-04-06 17:08 UTC (permalink / raw)
  To: netdev; +Cc: andrew, richardcochran, f.fainelli, linux-kernel, Esben Haabendal
In-Reply-To: <20180406140540.13511-1-esben.haabendal@gmail.com>

From: Esben Haabendal <eha@deif.com>

Signed-off-by: Esben Haabendal <eha@deif.com>
---
 drivers/net/phy/dp83640.c | 18 ++++++++++++++++++
 1 file changed, 18 insertions(+)

diff --git a/drivers/net/phy/dp83640.c b/drivers/net/phy/dp83640.c
index 654f42d00092..a6c87793d899 100644
--- a/drivers/net/phy/dp83640.c
+++ b/drivers/net/phy/dp83640.c
@@ -1207,6 +1207,23 @@ static void dp83640_remove(struct phy_device *phydev)
 	kfree(dp83640);
 }
 
+static int dp83640_soft_reset(struct phy_device *phydev)
+{
+	int ret;
+
+	ret = genphy_soft_reset(phydev);
+	if (ret < 0)
+		return ret;
+
+	/* From DP83640 datasheet: "Software driver code must wait 3 us
+	 * following a software reset before allowing further serial MII
+	 * operations with the DP83640."
+	 */
+	udelay(10);		/* Taking udelay inaccuracy into account */
+
+	return 0;
+}
+
 static int dp83640_config_init(struct phy_device *phydev)
 {
 	struct dp83640_private *dp83640 = phydev->priv;
@@ -1501,6 +1518,7 @@ static struct phy_driver dp83640_driver = {
 	.flags		= PHY_HAS_INTERRUPT,
 	.probe		= dp83640_probe,
 	.remove		= dp83640_remove,
+	.soft_reset	= dp83640_soft_reset,
 	.config_init	= dp83640_config_init,
 	.ack_interrupt  = dp83640_ack_interrupt,
 	.config_intr    = dp83640_config_intr,
-- 
2.16.3

^ permalink raw reply related

* Re: [PATCH] dp83640: Ensure against premature access to PHY registers after reset
From: Esben Haabendal @ 2018-04-06 17:06 UTC (permalink / raw)
  To: David Miller; +Cc: andrew, netdev, richardcochran, f.fainelli, linux-kernel
In-Reply-To: <20180406.111337.1908168293065420432.davem@davemloft.net>

David Miller <davem@davemloft.net> writes:

> From: Andrew Lunn <andrew@lunn.ch>
> Date: Fri, 6 Apr 2018 16:14:10 +0200
>
>> On Fri, Apr 06, 2018 at 04:05:40PM +0200, Esben Haabendal wrote:
>>> From: Esben Haabendal <eha@deif.com>
>>> 
>>> Signed-off-by: Esben Haabendal <eha@deif.com>
>>> ---
>>>  drivers/net/phy/dp83640.c | 17 +++++++++++++++++
>>>  1 file changed, 17 insertions(+)
>>> 
>>> diff --git a/drivers/net/phy/dp83640.c b/drivers/net/phy/dp83640.c
>>> index 654f42d00092..48403170096a 100644
>>> --- a/drivers/net/phy/dp83640.c
>>> +++ b/drivers/net/phy/dp83640.c
>>> @@ -1207,6 +1207,22 @@ static void dp83640_remove(struct phy_device *phydev)
>>>  	kfree(dp83640);
>>>  }
>>>  
>>> +static int dp83640_soft_reset(struct phy_device *phydev)
>>> +{
>>> +	int ret;
>>> +
>>> +	ret = genphy_soft_reset(phydev);
>>> +	if (ret < 0)
>>> +		return ret;
>>> +
>>> +	/* From DP83640 datasheet: "Software driver code must wait 3 us
>>> +	 * following a software reset before allowing further serial MII
>>> +	 * operations with the DP83640." */
>>> +	udelay(3);
>> 
>> Hi Esben
>> 
>> The accuracy of udelay() is not guaranteed. So you probably want to be
>> a bit pessimistic, and use 10.

Ok, will do.

/Esben

^ permalink raw reply

* Re: [PATCH 3/9] kbuild: Do not pass arguments to link-vmlinux.sh
From: Jiri Olsa @ 2018-04-06 16:54 UTC (permalink / raw)
  To: Masahiro Yamada
  Cc: Jiri Olsa, Alexei Starovoitov, Daniel Borkmann, lkml, netdev,
	Linux Kbuild mailing list, Quentin Monnet, Eugene Syromiatnikov,
	Jiri Benc, Stanislav Kozina, Jerome Marchand,
	Arnaldo Carvalho de Melo, Michal Marek, Jiri Kosina
In-Reply-To: <CAK7LNATU9_zBH9DpayNOCGsFVL0K+Y8Np0z7wxFuwO3Cf1K_tg@mail.gmail.com>

On Fri, Apr 06, 2018 at 09:59:59AM +0900, Masahiro Yamada wrote:
> 2018-04-06 3:59 GMT+09:00 Jiri Olsa <jolsa@redhat.com>:
> > On Fri, Apr 06, 2018 at 12:50:00AM +0900, Masahiro Yamada wrote:
> >> 2018-04-06 0:16 GMT+09:00 Jiri Olsa <jolsa@kernel.org>:
> >> > There's no need to pass LD* arguments to link-vmlinux.sh,
> >> > because they are passed as variables. The only argument
> >> > the link-vmlinux.sh supports is the 'clean' argument.
> >> >
> >> > Signed-off-by: Jiri Olsa <jolsa@kernel.org>
> >> > ---
> >>
> >> Wrong.
> >>
> >> $(LD) $(LDFLAGS) $(LDFLAGS_vmlinux)
> >> exist here so that any change in them
> >> invokes scripts/linkk-vmlinux.sh
> >
> > sry, I can't see that.. but it's just a side fix,
> > which is actually not needed for the rest
> >
> > I'll check on more and address this separately
> 
> 
> The link command is recorded in .vmlinux.cmd

i see.. just for the sake command line,

thanks for explanation
jirka

^ permalink raw reply

* Re: TCP one-by-one acking - RFC interpretation question
From: Eric Dumazet @ 2018-04-06 16:49 UTC (permalink / raw)
  To: Michal Kubecek, Eric Dumazet; +Cc: netdev, Yuchung Cheng, Neal Cardwell
In-Reply-To: <20180406150345.i3t7cu5g6dmoc3io@unicorn.suse.cz>

Cc Neal and Yuchung if they missed this thread.

On 04/06/2018 08:03 AM, Michal Kubecek wrote:
> On Fri, Apr 06, 2018 at 05:01:29AM -0700, Eric Dumazet wrote:
>>
>>
>> On 04/06/2018 03:05 AM, Michal Kubecek wrote:
>>> Hello
>>>
>>> I encountered a strange behaviour of some (non-linux) TCP stack which
>>> I believe is incorrect but support engineers from the company producing
>>> it claim is OK.
>>>
>>> Assume a client (sender, Linux 4.4 kernel) sends a stream of MSS sized
>>> segments but segments 2, 4 and 6 do not reach the server (receiver):
>>>
>>>          ACK             SAK             SAK             SAK
>>>       +-------+-------+-------+-------+-------+-------+-------+
>>>       |   1   |   2   |   3   |   4   |   5   |   6   |   7   |
>>>       +-------+-------+-------+-------+-------+-------+-------+
>>>     34273   35701   37129   38557   39985   41413   42841   44269
>>>
>>> When segment 2 is retransmitted after RTO timeout, normal response would
>>> be ACK-ing segment 3 (38557) with SACK for 5 and 7 (39985-41413 and
>>> 42841-44269).
>>>
>>> However, this server stack responds with two separate ACKs:
>>>
>>>   - ACK 37129, SACK 37129-38557 39985-41413 42841-44269
>>>   - ACK 38557, SACK 39985-41413 42841-44269
>>
>> Hmmm... Yes this seems very very wrong and lazy.
>>
>> Have you verified behavior of more recent linux kernel to such threats ?
> 
> No, unfortunately the problem was only encountered by our customer in
> production environment (they tried to reproduce in a test lab but no
> luck). They are running backups to NFS server and it happens from time
> to time (in the order of hours, IIUC). So it would be probably hard to
> let them try with more recent kernel.
> 
> On the other hand, they reported that SLE11 clients (kernel 3.0) do not
> run into this kind of problem. It was originally reported as a
> a regression on migration from SLE11-SP4 (3.0 kernel) to SLE12-SP2 (4.4
> kernel) and the problem was reported as "SLE12-SP2 is ignoring dupacks"
> (which seems to be mostly caused by the switch to RACK).
> 
> It also seems that part of the problem is specific packet loss pattern
> where at some point, many packets are lost in "every second" pattern.
> The customer finally started to investigate this problem and it seems it
> has something to do with their bonding setup (they provided no details,
> my guess is packets are divided over two paths and one of them fails).
> 
>> packetdrill test would be relatively easy to write.
> 
> I'll try but I have very little experience with writing packetdrill
> scripts so it will probably take some time.
> 
>> Regardless of this broken alien stack, we might be able to work around
>> this faster than the vendor is able to fix and deploy a new stack.
>>
>> ( https://en.wikipedia.org/wiki/Robustness_principle )
>> Be conservative in what you do, be liberal in what you accept from
>> others...
> 
> I was thinking about this a bit. "Fixing" the acknowledgment number
> could do the trick but it doesn't feel correct. We might use the fact
> that TSecr of both ACKs above matches TSval of the retransmission which
> triggered them so that RTT calculated from timestamp would be the right
> one. So perhaps something like "prefer timestamp RTT if measured RTT
> seems way too off". But I'm not sure if it couldn't break other use
> cases where (high) measured RTT is actually correct, rather than (low)
> timestamp RTT.
> 
> Michal Kubecek
> 

^ permalink raw reply

* Re: [PATCH net-next] netns: filter uevents correctly
From: Eric W. Biederman @ 2018-04-06 16:48 UTC (permalink / raw)
  To: Christian Brauner
  Cc: Kirill Tkhai, davem, gregkh, netdev, linux-kernel, avagin, serge
In-Reply-To: <20180406160757.GA16281@gmail.com>

Christian Brauner <christian.brauner@canonical.com> writes:

>> At a practical level there should be no receivers.  Plus performance
>> issues.  At least my memory is that any unprivileged user on the system
>> is allowed to listen to those events.
>
> Any unprivileged user is allowed to listen to uevents if they have
> net_broadcast in the user namespace the uevent socket was opened in;
> unless I'm misreading.

I believe you are.

This code in do_one_broadcast.

	if (!net_eq(sock_net(sk), p->net)) {
		if (!(nlk->flags & NETLINK_F_LISTEN_ALL_NSID))
			return;

		if (!peernet_has_id(sock_net(sk), p->net))
			return;

		if (!file_ns_capable(sk->sk_socket->file, p->net->user_ns,
				     CAP_NET_BROADCAST))
			return;
	}

Used to just be:
	if (!net_eq(sock_net(sk), p->net))
        	return;

Which makes sense when you have a shared hash table and a shared mc_list
between network namespaces.

There is a non-container use of network namespaces where you just need
different contexts were ip addresses can overlap etc.  In that
configuration where a single program is mananging multiple network
namespaces being able to listen to rtnetlink events in all of them is an
advantage.

For that case a special socket option NETLINK_F_LISTEN_ALL_NSID was
added that allowed one socket to listen for events from multiple network
namespaces.

If we rework the code in af_netlink.c that matters.  However for just
understanding uevents you can assume there are no sockets with
NETLINK_F_LISTEN_ALL_NSID set.

Eric

^ permalink raw reply

* Re: [PATCH net-next] netns: filter uevents correctly
From: Christian Brauner @ 2018-04-06 16:07 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: Kirill Tkhai, davem, gregkh, netdev, linux-kernel, avagin, serge
In-Reply-To: <874lko2y22.fsf@xmission.com>

On Fri, Apr 06, 2018 at 09:45:41AM -0500, Eric W. Biederman wrote:
> Christian Brauner <christian.brauner@canonical.com> writes:
> 
> > On Thu, Apr 05, 2018 at 10:59:49PM -0500, Eric W. Biederman wrote:
> >> Christian Brauner <christian.brauner@canonical.com> writes:
> >> 
> >> > On Thu, Apr 05, 2018 at 05:26:59PM +0300, Kirill Tkhai wrote:
> >> >> On 05.04.2018 17:07, Christian Brauner wrote:
> >> >> > On Thu, Apr 05, 2018 at 04:01:03PM +0300, Kirill Tkhai wrote:
> >> >> >> On 04.04.2018 22:48, Christian Brauner wrote:
> >> >> >>> commit 07e98962fa77 ("kobject: Send hotplug events in all network namespaces")
> >> >> >>>
> >> >> >>> enabled sending hotplug events into all network namespaces back in 2010.
> >> >> >>> Over time the set of uevents that get sent into all network namespaces has
> >> >> >>> shrunk. We have now reached the point where hotplug events for all devices
> >> >> >>> that carry a namespace tag are filtered according to that namespace.
> >> >> >>>
> >> >> >>> Specifically, they are filtered whenever the namespace tag of the kobject
> >> >> >>> does not match the namespace tag of the netlink socket. One example are
> >> >> >>> network devices. Uevents for network devices only show up in the network
> >> >> >>> namespaces these devices are moved to or created in.
> >> >> >>>
> >> >> >>> However, any uevent for a kobject that does not have a namespace tag
> >> >> >>> associated with it will not be filtered and we will *try* to broadcast it
> >> >> >>> into all network namespaces.
> >> >> >>>
> >> >> >>> The original patchset was written in 2010 before user namespaces were a
> >> >> >>> thing. With the introduction of user namespaces sending out uevents became
> >> >> >>> partially isolated as they were filtered by user namespaces:
> >> >> >>>
> >> >> >>> net/netlink/af_netlink.c:do_one_broadcast()
> >> >> >>>
> >> >> >>> if (!net_eq(sock_net(sk), p->net)) {
> >> >> >>>         if (!(nlk->flags & NETLINK_F_LISTEN_ALL_NSID))
> >> >> >>>                 return;
> >> >> >>>
> >> >> >>>         if (!peernet_has_id(sock_net(sk), p->net))
> >> >> >>>                 return;
> >> >> >>>
> >> >> >>>         if (!file_ns_capable(sk->sk_socket->file, p->net->user_ns,
> >> >> >>>                              CAP_NET_BROADCAST))
> >> >> >>>         j       return;
> >> >> >>> }
> >> >> >>>
> >> >> >>> The file_ns_capable() check will check whether the caller had
> >> >> >>> CAP_NET_BROADCAST at the time of opening the netlink socket in the user
> >> >> >>> namespace of interest. This check is fine in general but seems insufficient
> >> >> >>> to me when paired with uevents. The reason is that devices always belong to
> >> >> >>> the initial user namespace so uevents for kobjects that do not carry a
> >> >> >>> namespace tag should never be sent into another user namespace. This has
> >> >> >>> been the intention all along. But there's one case where this breaks,
> >> >> >>> namely if a new user namespace is created by root on the host and an
> >> >> >>> identity mapping is established between root on the host and root in the
> >> >> >>> new user namespace. Here's a reproducer:
> >> >> >>>
> >> >> >>>  sudo unshare -U --map-root
> >> >> >>>  udevadm monitor -k
> >> >> >>>  # Now change to initial user namespace and e.g. do
> >> >> >>>  modprobe kvm
> >> >> >>>  # or
> >> >> >>>  rmmod kvm
> >> >> >>>
> >> >> >>> will allow the non-initial user namespace to retrieve all uevents from the
> >> >> >>> host. This seems very anecdotal given that in the general case user
> >> >> >>> namespaces do not see any uevents and also can't really do anything useful
> >> >> >>> with them.
> >> >> >>>
> >> >> >>> Additionally, it is now possible to send uevents from userspace. As such we
> >> >> >>> can let a sufficiently privileged (CAP_SYS_ADMIN in the owning user
> >> >> >>> namespace of the network namespace of the netlink socket) userspace process
> >> >> >>> make a decision what uevents should be sent.
> >> >> >>>
> >> >> >>> This makes me think that we should simply ensure that uevents for kobjects
> >> >> >>> that do not carry a namespace tag are *always* filtered by user namespace
> >> >> >>> in kobj_bcast_filter(). Specifically:
> >> >> >>> - If the owning user namespace of the uevent socket is not init_user_ns the
> >> >> >>>   event will always be filtered.
> >> >> >>> - If the network namespace the uevent socket belongs to was created in the
> >> >> >>>   initial user namespace but was opened from a non-initial user namespace
> >> >> >>>   the event will be filtered as well.
> >> >> >>> Put another way, uevents for kobjects not carrying a namespace tag are now
> >> >> >>> always only sent to the initial user namespace. The regression potential
> >> >> >>> for this is near to non-existent since user namespaces can't really do
> >> >> >>> anything with interesting devices.
> >> >> >>>
> >> >> >>> Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>
> >> >> >>> ---
> >> >> >>>  lib/kobject_uevent.c | 10 +++++++++-
> >> >> >>>  1 file changed, 9 insertions(+), 1 deletion(-)
> >> >> >>>
> >> >> >>> diff --git a/lib/kobject_uevent.c b/lib/kobject_uevent.c
> >> >> >>> index 15ea216a67ce..cb98cddb6e3b 100644
> >> >> >>> --- a/lib/kobject_uevent.c
> >> >> >>> +++ b/lib/kobject_uevent.c
> >> >> >>> @@ -251,7 +251,15 @@ static int kobj_bcast_filter(struct sock *dsk, struct sk_buff *skb, void *data)
> >> >> >>>  		return sock_ns != ns;
> >> >> >>>  	}
> >> >> >>>  
> >> >> >>> -	return 0;
> >> >> >>> +	/*
> >> >> >>> +	 * The kobject does not carry a namespace tag so filter by user
> >> >> >>> +	 * namespace below.
> >> >> >>> +	 */
> >> >> >>> +	if (sock_net(dsk)->user_ns != &init_user_ns)
> >> >> >>> +		return 1;
> >> >> >>> +
> >> >> >>> +	/* Check if socket was opened from non-initial user namespace. */
> >> >> >>> +	return sk_user_ns(dsk) != &init_user_ns;
> >> >> >>>  }
> >> >> >>>  #endif
> >> >> >>
> >> >> >> So, this prohibits to listen events of all devices except network-related
> >> >> >> in containers? If it's so, I don't think it's a good solution. Uevents is not
> >> >> > 
> >> >> > No, this is not correct: As it is right now *without my patch* no
> >> >> > non-initial user namespace is receiving *any uevents* but those
> >> >> > specifically namespaced such as those for network devices. This patch
> >> >> > doesn't change that at all. The commit message outlines this in detail
> >> >> > how this comes about.
> >> >> > There is only one case where this currently breaks and this is as I
> >> >> > outlined explicitly in my commit message when you create a new user
> >> >> > namespace and map container(0) -> host(0). This patch fixes this.
> >> >> 
> >> >> Could you please point the place, where non-initial user namespaces are filtered?
> >> >> I only see the kobj_bcast_filter() logic, and it used to return 0, which means "accepted".
> >> >> Now it will return 1 sometimes.
> >> >
> >> > Oh sure, it's in the commit message though. The callchain is
> >> > lib/kobject_uevent.c:kobject_uevent_net_broadcast() ->
> >> > nnet/netlink/af_netlink.c:netlink_broadcast_filtered() ->
> >> > net/netlink/af_netlink.c:do_one_broadcast():
> >> >
> >> > This codepiece will check whether the openened socket holds
> >> > CAP_NET_BROADCAST in the user namespace of the target network namespace
> >> > which it won't because we don't have device namespaces and all devices
> >> > belong to the initial set of namespaces.
> >> >
> >> >         if (!file_ns_capable(sk->sk_socket->file, p->net->user_ns,
> >> >                              CAP_NET_BROADCAST))
> >> >         j       return;
> >> >
> >> 
> >> The above that only applies if someone has set NETLINK_F_LISTEN_ALL_NSID
> >> on their socket and has had someone with the appropriate privileges
> >> assign a peerrnetid.
> >> 
> >> All of which is to say that file_ns_capable is not nearly as applicable
> >> as it might be, and if you can pass the other two checks I think it is
> >> pointless (because the peernet attributes are not generated for
> >> kobj_uevents) but valid to receive events from outside your network
> >> namespace.
> >> 
> >> 
> >> I might be missing something but I don't see anything excluding network
> >> namespaces owned by !init_user_ns excluded from the kobject_uevent
> >> logic.
> >> 
> >> The uevent_sock_list has one entry per network namespace. And
> >> kobject_uevent_net_broacast appears to walk each one.
> >
> > Yeah, it definitely does.
> >
> >> 
> >> I had a memory of filtering uevent messages and I had a memory
> >> that the netlink_has_listeners had a per network namespace effect.
> >> Neither seems true from my inspection of the code tonight.
> >
> > Yeah, I drew the same conclusion.
> >
> >> 
> >> If we are not filtering ordinary uevents at least at the user namespace
> >> level that does seem to be at least a nuisance.
> >
> > This patch would filter based on user namespace and bump it from "we're
> > accidently doing the right thing" to "we're doing the right on purpose"
> > and without accidently leaking uevents in some corner cases.
> > Using kobj_bcast_filter() for this seems like the right thing to do.
> > This sounds like you don't disagree with the approach I'm taking here?
> 
> How are we accidentally filtering?  If we can not describe how the code
> currently works to achieve something we don't understand it well enough
> to change it.

I absolutely agree and without doing a lot of semantics here I just
meant that so far this all worked on accident that doesn't presuppose
that we understand why it worked.

> 
> 
> 
> > On a sidenote, it also very much feels like we're leaking information if
> > we're not filtering based on user namespaces on purpose. Non-initial
> > user namespaces should not be able to receive information about device
> > attribues simply via netlink uevent messages. At least not *trivially*
> > [1] by opening a netlink uevent socket.
> 
> This is not at all about isolation.  Devices don't belong to user

Sure, but there's still a difference between an unprivileged host user
listening to uevents and user namespaces that they might have delegated.

> namespaces.  This is about it being useless/silly to send events
> to those sockets as the receiver could not do anything with the uevent.

Yes, indeed unless a sufficiently privileged process explicitly decides
that a given uevent should be sent.

> At a practical level there should be no receivers.  Plus performance
> issues.  At least my memory is that any unprivileged user on the system
> is allowed to listen to those events.

Any unprivileged user is allowed to listen to uevents if they have
net_broadcast in the user namespace the uevent socket was opened in;
unless I'm misreading.

> 
> At least at one point udev listening and reacting to events in every
> network namespace in containers was a significant slow down on the
> system.

Sure, because we're holding a global lock on the list of all uevent
sockets and then go on to walk all of mc_list for each of those sockets
when broadcasting to guarantee that uevents are correctly ordered:

	mutex_lock(&uevent_sock_mutex);
	/* we will send an event, so request a new sequence number */
	retval = add_uevent_var(env, "SEQNUM=%llu", (unsigned long long)++uevent_seqnum);
	if (retval) {
		mutex_unlock(&uevent_sock_mutex);
		goto exit;
	}
	retval = kobject_uevent_net_broadcast(kobj, env, action_string,
					      devpath);
	mutex_unlock(&uevent_sock_mutex);

which means that the time it takes for uevents to be sent out increases
drastically the more listeners you have in more network namespaces. Even
for just one network namespace you have quite a lot:

ss -e -f netlink -a | grep uevent

then just look at those that you can clearly associate with a single
program I typically get ~20 listeners.

> 
> The report was that adding netlink_has_listeners greatly sped up uevent
> sending.   That testing appears faulty.  I think the real gain was
> userspace not listening to uevents in every container.

Banal point but the truth is probably not userspace changes in general
but that most massive container deployments are app containers. So
there's no udevd.

> 
> Which means uevent injection for containers has some real technical
> challenges to overcome before it can be wide spread.
> 
> >> Christian can you dig a little deeper into this.  I have a feeling that
> >> there are some real efficiency improvements that we could make to
> >> kobject_uevent_net_broadcast if nothing else.
> >
> > Yeah, for sure! I already started doing this. This patch here is
> > basically a preparatory patch for that work which is on my todo.
> 
> Limiting the network namespaces on uevent_sock_list to just network
> namespaces owned by the initial user namespaces would be a lot better
> than your patch.

I had this idea as well but I decided against it because of namespace
tag carrying kobjects. The only kobjects that fall into this category
are network devices so I need to think through this again and more
thoroughly.

> 
> 
> At a practical level I am concerned what will happen when we stand up a
> uevent listener aka udev in say 100 or maybe 1000 unprivileged
> containers on a system.  History suggests the current data structures
> won't scale to the problem, and the fixes that were put in place kernel
> side only did not change anything.  It was only the userspace fixes
> that made a difference.
> 
> Which suggests either improving the isolation between network namespaces
> for netlink multicast sockets or putting:
> 
> 
> 	if (!net_eq(sock_net(sk), p->net)) {
> 		if (!(nlk->flags & NETLINK_F_LISTEN_ALL_NSID))
> 			return;
> 
> 		if (!peernet_has_id(sock_net(sk), p->net))
> 			return;
> 
> 		if (!file_ns_capable(sk->sk_socket->file, p->net->user_ns,
> 				     CAP_NET_BROADCAST))
> 			return;
> 	}
> 
> In a filter and having an appropriate filter so that
> netlink_broadcast_filtered only needs to be called once from
> kobject_uevent_net_broadcast.

Maybe, but I'd like to try and find a solution that let's us wipe the
uevent socket list.

> 
> It is more difficult but in practice I expect we have far more to gain
> by fixing the mc_list and the netlink_has_listeners function to be per
> network namespace basis.  Handling the NETLINK_F_LISTEN_ALL_NSID will
> be trickier in that case but the current semantics appear correct
> for that case.

Yeah, I'll dig into this.

> 
> But before we do anything we have to test and see if the assertion
> that we don't make it to netlink listeners in network namespaces
> that are outside of the init_user_ns is true.   If it is true we need to
> find the code that makes it true.

Yeah, I'm on this as well.

Thanks!
Christian

^ permalink raw reply

* Re: [net 0/2][pull request] Intel Wired LAN Driver Updates 2018-04-06
From: David Miller @ 2018-04-06 15:39 UTC (permalink / raw)
  To: jeffrey.t.kirsher; +Cc: netdev, nhorman, sassmann, jogreene
In-Reply-To: <20180406153630.9321-1-jeffrey.t.kirsher@intel.com>

From: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
Date: Fri,  6 Apr 2018 08:36:28 -0700

> This series contains a couple of fixes for the new ice driver.
> 
> Wei Yongjun fixes the return error code for error case during init.
> 
> Anirudh fixes the incorrect use of ARRAY_SIZE() in the ice ethtool code
> and fixed "for" loop calculations.

Pulled, thanks Jeff.

^ permalink raw reply

* [net 1/2] ice: Fix error return code in ice_init_hw()
From: Jeff Kirsher @ 2018-04-06 15:36 UTC (permalink / raw)
  To: davem; +Cc: Wei Yongjun, netdev, nhorman, sassmann, jogreene, Jeff Kirsher
In-Reply-To: <20180406153630.9321-1-jeffrey.t.kirsher@intel.com>

From: Wei Yongjun <weiyongjun1@huawei.com>

Fix to return error code ICE_ERR_NO_MEMORY from the alloc error
handling case instead of 0, as done elsewhere in this function.

Fixes: dc49c7723676 ("ice: Get MAC/PHY/link info and scheduler topology")
Signed-off-by: Wei Yongjun <weiyongjun1@huawei.com>
Acked-by: Anirudh Venkataramanan <anirudh.venkataramanan@intel.com>
Tested-by: Tony Brelinski <tonyx.brelinski@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
---
 drivers/net/ethernet/intel/ice/ice_common.c | 4 +++-
 1 file changed, 3 insertions(+), 1 deletion(-)

diff --git a/drivers/net/ethernet/intel/ice/ice_common.c b/drivers/net/ethernet/intel/ice/ice_common.c
index 385f5d425d19..21977ec984c4 100644
--- a/drivers/net/ethernet/intel/ice/ice_common.c
+++ b/drivers/net/ethernet/intel/ice/ice_common.c
@@ -468,8 +468,10 @@ enum ice_status ice_init_hw(struct ice_hw *hw)
 	mac_buf_len = sizeof(struct ice_aqc_manage_mac_read_resp);
 	mac_buf = devm_kzalloc(ice_hw_to_dev(hw), mac_buf_len, GFP_KERNEL);
 
-	if (!mac_buf)
+	if (!mac_buf) {
+		status = ICE_ERR_NO_MEMORY;
 		goto err_unroll_fltr_mgmt_struct;
+	}
 
 	status = ice_aq_manage_mac_read(hw, mac_buf, mac_buf_len, NULL);
 	devm_kfree(ice_hw_to_dev(hw), mac_buf);
-- 
2.14.3

^ permalink raw reply related

* [net 0/2][pull request] Intel Wired LAN Driver Updates 2018-04-06
From: Jeff Kirsher @ 2018-04-06 15:36 UTC (permalink / raw)
  To: davem; +Cc: Jeff Kirsher, netdev, nhorman, sassmann, jogreene

This series contains a couple of fixes for the new ice driver.

Wei Yongjun fixes the return error code for error case during init.

Anirudh fixes the incorrect use of ARRAY_SIZE() in the ice ethtool code
and fixed "for" loop calculations.

The following are changes since commit 3239534a79ee6f20cffd974173a1e62e0730e8ac:
  net/sched: fix NULL dereference in the error path of tcf_bpf_init()
and are available in the git repository at:
  git://git.kernel.org/pub/scm/linux/kernel/git/jkirsher/net-queue 100GbE

Anirudh Venkataramanan (1):
  ice: Bug fixes in ethtool code

Wei Yongjun (1):
  ice: Fix error return code in ice_init_hw()

 drivers/net/ethernet/intel/ice/ice_common.c  | 4 +++-
 drivers/net/ethernet/intel/ice/ice_ethtool.c | 4 ++--
 2 files changed, 5 insertions(+), 3 deletions(-)

-- 
2.14.3

^ permalink raw reply

* [net 2/2] ice: Bug fixes in ethtool code
From: Jeff Kirsher @ 2018-04-06 15:36 UTC (permalink / raw)
  To: davem
  Cc: Anirudh Venkataramanan, netdev, nhorman, sassmann, jogreene,
	Jeff Kirsher
In-Reply-To: <20180406153630.9321-1-jeffrey.t.kirsher@intel.com>

From: Anirudh Venkataramanan <anirudh.venkataramanan@intel.com>

1) Return correct size from ice_get_regs_len.
2) Fix incorrect use of ARRAY_SIZE in ice_get_regs.

Fixes: fcea6f3da546 (ice: Add stats and ethtool support)
Signed-off-by: Anirudh Venkataramanan <anirudh.venkataramanan@intel.com>
Tested-by: Tony Brelinski <tonyx.brelinski@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
---
 drivers/net/ethernet/intel/ice/ice_ethtool.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/drivers/net/ethernet/intel/ice/ice_ethtool.c b/drivers/net/ethernet/intel/ice/ice_ethtool.c
index 186764a5c263..1db304c01d10 100644
--- a/drivers/net/ethernet/intel/ice/ice_ethtool.c
+++ b/drivers/net/ethernet/intel/ice/ice_ethtool.c
@@ -156,7 +156,7 @@ ice_get_drvinfo(struct net_device *netdev, struct ethtool_drvinfo *drvinfo)
 
 static int ice_get_regs_len(struct net_device __always_unused *netdev)
 {
-	return ARRAY_SIZE(ice_regs_dump_list);
+	return sizeof(ice_regs_dump_list);
 }
 
 static void
@@ -170,7 +170,7 @@ ice_get_regs(struct net_device *netdev, struct ethtool_regs *regs, void *p)
 
 	regs->version = 1;
 
-	for (i = 0; i < ARRAY_SIZE(ice_regs_dump_list) / sizeof(u32); ++i)
+	for (i = 0; i < ARRAY_SIZE(ice_regs_dump_list); ++i)
 		regs_buf[i] = rd32(hw, ice_regs_dump_list[i]);
 }
 
-- 
2.14.3

^ permalink raw reply related

* Re: [PATCH net 1/3] lan78xx: PHY DSP registers initialization to address EEE link drop issues with long cables
From: David Miller @ 2018-04-06 15:21 UTC (permalink / raw)
  To: andrew; +Cc: raghuramchary.jallipalli, netdev, unglinuxdriver, woojung.huh
In-Reply-To: <20180406144342.GN17495@lunn.ch>

From: Andrew Lunn <andrew@lunn.ch>
Date: Fri, 6 Apr 2018 16:43:42 +0200

> On Fri, Apr 06, 2018 at 11:42:02AM +0530, Raghuram Chary J wrote:
>> The patch is to configure DSP registers of PHY device
>> to handle Gbe-EEE failures with >40m cable length.
>> 
>> Fixes: 55d7de9de6c3 ("Microchip's LAN7800 family USB 2/3 to 10/100/1000 Ethernet device driver")
>> Signed-off-by: Raghuram Chary J <raghuramchary.jallipalli@microchip.com>
>> ---
>>  drivers/net/phy/microchip.c  | 123 ++++++++++++++++++++++++++++++++++++++++++-
>>  include/linux/microchipphy.h |   8 +++
>>  2 files changed, 130 insertions(+), 1 deletion(-)
>> 
>> diff --git a/drivers/net/phy/microchip.c b/drivers/net/phy/microchip.c
>> index 0f293ef28935..174ae9808722 100644
>> --- a/drivers/net/phy/microchip.c
>> +++ b/drivers/net/phy/microchip.c
>> @@ -20,6 +20,7 @@
>>  #include <linux/ethtool.h>
>>  #include <linux/phy.h>
>>  #include <linux/microchipphy.h>
>> +#include <linux/delay.h>
>>  
>>  #define DRIVER_AUTHOR	"WOOJUNG HUH <woojung.huh@microchip.com>"
>>  #define DRIVER_DESC	"Microchip LAN88XX PHY driver"
>> @@ -66,6 +67,107 @@ static int lan88xx_suspend(struct phy_device *phydev)
>>  	return 0;
>>  }
>>  
>> +static void lan88xx_TR_reg_set(struct phy_device *phydev, u16 regaddr,
>> +			       u32 data)
>> +{
>> +	int val;
>> +	u16 buf;
>> +
>> +	/* Get access to token ring page */
>> +	phy_write(phydev, LAN88XX_EXT_PAGE_ACCESS,
>> +		  LAN88XX_EXT_PAGE_ACCESS_TR);
> 
> Hi Raghuram
> 
> You might want to look at phy_read_paged(), phy_write_paged(), etc.
> 
> There can be race conditions with paged access.

Yep, so something like:

static void lan88xx_TR_reg_set(struct phy_device *phydev, u16 regaddr,
			       u32 data)
{
	int save_page, val;
	u16 buf;

	save_page = phy_save_page(phydev);
	phy_write_paged(phydev, LAN88XX_EXT_PAGE_ACCESS_TR,
			LAN88XX_EXT_PAGE_TR_LOW_DATA, (data & 0xFFFF));
	phy_write_paged(phydev, LAN88XX_EXT_PAGE_ACCESS_TR,
			LAN88XX_EXT_PAGE_TR_HIGH_DATA,
			(data & 0x00FF0000) >> 16);

	/* Config control bits [15:13] of register */
	buf = (regaddr & ~(0x3 << 13));/* Clr [14:13] to write data in reg */
	buf |= 0x8000; /* Set [15] to Packet transmit */

	phy_write_paged(phydev, LAN88XX_EXT_PAGE_ACCESS_TR,
			LAN88XX_EXT_PAGE_TR_CR, buf);
	usleep_range(1000, 2000);/* Wait for Data to be written */

	val = phy_read_paged(phydev, LAN88XX_EXT_PAGE_ACCESS_TR,
			     LAN88XX_EXT_PAGE_TR_CR);
	if (!(val & 0x8000))
		pr_warn("TR Register[0x%X] configuration failed\n", regaddr);

	phy_restore_page(phydev, save_page, 0);
}

Since PHY accesses and thus things like phy_save_page() can fail, the
return type of this function should be changed to 'int' and some error
checking should be added.

^ permalink raw reply

* Re: [PATCH v3] net: thunderx: rework mac addresses list to u64 array
From: Vadim Lomovtsev @ 2018-04-06 15:16 UTC (permalink / raw)
  To: sgoutham, sunil.kovvuri, rric, linux-arm-kernel, netdev,
	linux-kernel
  Cc: dnelson, Vadim Lomovtsev
In-Reply-To: <20180406140443.15181-1-Vadim.Lomovtsev@caviumnetworks.com>

Self-NACK here, because of https://lkml.org/lkml/2018/4/6/724

Sorry for noise.

Vadim

On Fri, Apr 06, 2018 at 07:04:43AM -0700, Vadim Lomovtsev wrote:
> From: Vadim Lomovtsev <Vadim.Lomovtsev@cavium.com>
> 
> It is too expensive to pass u64 values via linked list, instead
> allocate array for them by overall number of mac addresses from netdev.
> 
> This eventually removes multiple kmalloc() calls, aviod memory
> fragmentation and allow to put single null check on kmalloc
> return value in order to prevent a potential null pointer dereference.
> 
> Addresses-Coverity-ID: 1467429 ("Dereference null return value")
> Fixes: 37c3347eb247 ("net: thunderx: add ndo_set_rx_mode callback implementation for VF")
> Reported-by: Dan Carpenter <dan.carpenter@oracle.com>
> Signed-off-by: Vadim Lomovtsev <Vadim.Lomovtsev@cavium.com>
> ---
> Changes from v1 to v2:
>  - C99 syntax: update xcast_addr_list struct field mc[0] -> mc[];
> Changes from v2 to v3:
>  - update commit description with 'Reported-by: Dan Carpenter';
>  - update size calculations for mc list to offsetof() call
>    instead of explicit arithmetic;
> ---
>  drivers/net/ethernet/cavium/thunder/nic.h        |  7 +-----
>  drivers/net/ethernet/cavium/thunder/nicvf_main.c | 28 +++++++++---------------
>  2 files changed, 11 insertions(+), 24 deletions(-)
> 
> diff --git a/drivers/net/ethernet/cavium/thunder/nic.h b/drivers/net/ethernet/cavium/thunder/nic.h
> index 5fc46c5a4f36..448d1fafc827 100644
> --- a/drivers/net/ethernet/cavium/thunder/nic.h
> +++ b/drivers/net/ethernet/cavium/thunder/nic.h
> @@ -265,14 +265,9 @@ struct nicvf_drv_stats {
>  
>  struct cavium_ptp;
>  
> -struct xcast_addr {
> -	struct list_head list;
> -	u64              addr;
> -};
> -
>  struct xcast_addr_list {
> -	struct list_head list;
>  	int              count;
> +	u64              mc[];
>  };
>  
>  struct nicvf_work {
> diff --git a/drivers/net/ethernet/cavium/thunder/nicvf_main.c b/drivers/net/ethernet/cavium/thunder/nicvf_main.c
> index 1e9a31fef729..7d9e58533a83 100644
> --- a/drivers/net/ethernet/cavium/thunder/nicvf_main.c
> +++ b/drivers/net/ethernet/cavium/thunder/nicvf_main.c
> @@ -1929,7 +1929,7 @@ static void nicvf_set_rx_mode_task(struct work_struct *work_arg)
>  						  work.work);
>  	struct nicvf *nic = container_of(vf_work, struct nicvf, rx_mode_work);
>  	union nic_mbx mbx = {};
> -	struct xcast_addr *xaddr, *next;
> +	u8 idx = 0;
>  
>  	if (!vf_work)
>  		return;
> @@ -1956,16 +1956,10 @@ static void nicvf_set_rx_mode_task(struct work_struct *work_arg)
>  	/* check if we have any specific MACs to be added to PF DMAC filter */
>  	if (vf_work->mc) {
>  		/* now go through kernel list of MACs and add them one by one */
> -		list_for_each_entry_safe(xaddr, next,
> -					 &vf_work->mc->list, list) {
> +		for (idx = 0; idx < vf_work->mc->count; idx++) {
>  			mbx.xcast.msg = NIC_MBOX_MSG_ADD_MCAST;
> -			mbx.xcast.data.mac = xaddr->addr;
> +			mbx.xcast.data.mac = vf_work->mc->mc[idx];
>  			nicvf_send_msg_to_pf(nic, &mbx);
> -
> -			/* after receiving ACK from PF release memory */
> -			list_del(&xaddr->list);
> -			kfree(xaddr);
> -			vf_work->mc->count--;
>  		}
>  		kfree(vf_work->mc);
>  	}
> @@ -1996,17 +1990,15 @@ static void nicvf_set_rx_mode(struct net_device *netdev)
>  			mode |= BGX_XCAST_MCAST_FILTER;
>  			/* here we need to copy mc addrs */
>  			if (netdev_mc_count(netdev)) {
> -				struct xcast_addr *xaddr;
> -
> -				mc_list = kmalloc(sizeof(*mc_list), GFP_ATOMIC);
> -				INIT_LIST_HEAD(&mc_list->list);
> +				mc_list = kmalloc(offsetof(typeof(*mc_list),
> +							   mc[netdev_mc_count(netdev)]),
> +						  GFP_ATOMIC);
> +				if (unlikely(!mc_list))
> +					return;
> +				mc_list->count = 0;
>  				netdev_hw_addr_list_for_each(ha, &netdev->mc) {
> -					xaddr = kmalloc(sizeof(*xaddr),
> -							GFP_ATOMIC);
> -					xaddr->addr =
> +					mc_list->mc[mc_list->count] =
>  						ether_addr_to_u64(ha->addr);
> -					list_add_tail(&xaddr->list,
> -						      &mc_list->list);
>  					mc_list->count++;
>  				}
>  			}
> -- 
> 2.14.3
> 

^ permalink raw reply

page: next (older) | prev (newer) | latest
- recent:[subjects (threaded)|topics (new)|topics (active)]

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox