Netdev List

Netdev List
 help / color / mirror / Atom feed

* Re: [PATCH 0/2] xdp: adjust xdp redirect tracepoint
From: David Miller @ 2017-08-18 23:19 UTC (permalink / raw)
  To: brouer; +Cc: netdev, john.fastabend
In-Reply-To: <150298692691.6608.11908184719252996949.stgit@firesoul>

From: Jesper Dangaard Brouer <brouer@redhat.com>
Date: Thu, 17 Aug 2017 18:22:27 +0200

> Working on streamlining the tracepoints for XDP.  The eBPF programs
> and XDP have no flow-control or queueing.  Investigating using
> tracepoint to provide a feedback on XDP_REDIRECT xmit overflow events.

Series applied, thanks.

^ permalink raw reply

* Re: [PATCH] irda: do not leak initialized list.dev to userspace
From: David Miller @ 2017-08-18 23:22 UTC (permalink / raw)
  To: colin.king; +Cc: netdev, samuel, linux-kernel
In-Reply-To: <20170817221458.27040-1-colin.king@canonical.com>

From: Colin King <colin.king@canonical.com>
Date: Thu, 17 Aug 2017 23:14:58 +0100

> From: Colin Ian King <colin.king@canonical.com>
> 
> list.dev has not been initialized and so the copy_to_user is copying
> data from the stack back to user space which is a potential
> information leak. Fix this ensuring all of list is initialized to
> zero.
> 
> Detected by CoverityScan, CID#1357894 ("Uninitialized scalar variable")
> 
> Signed-off-by: Colin Ian King <colin.king@canonical.com>

Applied and queued up for -stable, thanks.

^ permalink raw reply

* Re: [PATCH net] rxrpc: Fix oops when discarding a preallocated service call
From: David Miller @ 2017-08-18 23:24 UTC (permalink / raw)
  To: dhowells; +Cc: netdev, linux-afs, stable, linux-kernel
In-Reply-To: <150301198243.6876.2468782049364321310.stgit@warthog.procyon.org.uk>

From: David Howells <dhowells@redhat.com>
Date: Fri, 18 Aug 2017 00:19:42 +0100

> rxrpc_service_prealloc_one() doesn't set the socket pointer on any new call
> it preallocates, but does add it to the rxrpc net namespace call list.
> This, however, causes rxrpc_put_call() to oops when the call is discarded
> when the socket is closed.  rxrpc_put_call() needs the socket to be able to
> reach the namespace so that it can use a lock held therein.
> 
> Fix this by setting a call's socket pointer immediately before discarding
> it.
> 
> This can be triggered by unloading the kafs module, resulting in an oops
> like the following:
> 
> BUG: unable to handle kernel NULL pointer dereference at 0000000000000030
> IP: rxrpc_put_call+0x1e2/0x32d
> PGD 0
> P4D 0
> Oops: 0000 [#1] SMP
> Modules linked in: kafs(E-)
> CPU: 3 PID: 3037 Comm: rmmod Tainted: G            E   4.12.0-fscache+ #213
> Hardware name: ASUS All Series/H97-PLUS, BIOS 2306 10/09/2014
> task: ffff8803fc92e2c0 task.stack: ffff8803fef74000
> RIP: 0010:rxrpc_put_call+0x1e2/0x32d
> RSP: 0018:ffff8803fef77e08 EFLAGS: 00010282
> RAX: 0000000000000000 RBX: ffff8803fab99ac0 RCX: 000000000000000f
> RDX: ffffffff81c50a40 RSI: 000000000000000c RDI: ffff8803fc92ea88
> RBP: ffff8803fef77e30 R08: ffff8803fc87b941 R09: ffffffff82946d20
> R10: ffff8803fef77d10 R11: 00000000000076fc R12: 0000000000000005
> R13: ffff8803fab99c20 R14: 0000000000000001 R15: ffffffff816c6aee
> FS:  00007f915a059700(0000) GS:ffff88041fb80000(0000) knlGS:0000000000000000
> CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> CR2: 0000000000000030 CR3: 00000003fef39000 CR4: 00000000001406e0
> Call Trace:
>  rxrpc_discard_prealloc+0x325/0x341
>  rxrpc_listen+0xf9/0x146
>  kernel_listen+0xb/0xd
>  afs_close_socket+0x3e/0x173 [kafs]
>  afs_exit+0x1f/0x57 [kafs]
>  SyS_delete_module+0x10f/0x19a
>  do_syscall_64+0x8a/0x149
>  entry_SYSCALL64_slow_path+0x25/0x25
> 
> Fixes: 2baec2c3f854 ("rxrpc: Support network namespacing")
> Signed-off-by: David Howells <dhowells@redhat.com>

Applied, but...

> cc: stable@vger.kernel.org

This is inappropriate for two reasons.

The code in question is new in 4.13-rcX and therefore does not need a
stable backport.

Secondly, I handle all networking stable submissions explicitly, so
please just tell me that a change should go to -stable and I will
queue it up for you, thanks.

^ permalink raw reply

* Re: [PATCH net-next] bpf: Fix map-in-map checking in the verifier
From: David Miller @ 2017-08-18 23:26 UTC (permalink / raw)
  To: kafai; +Cc: netdev, ast, daniel, kernel-team, john.fastabend
In-Reply-To: <20170818011443.562793-1-kafai@fb.com>

From: Martin KaFai Lau <kafai@fb.com>
Date: Thu, 17 Aug 2017 18:14:43 -0700

> In check_map_func_compatibility(), a 'break' has been accidentally
> removed for the BPF_MAP_TYPE_ARRAY_OF_MAPS and BPF_MAP_TYPE_HASH_OF_MAPS
> cases.  This patch adds it back.
> 
> Fixes: 174a79ff9515 ("bpf: sockmap with sk redirect support")
> Cc: John Fastabend <john.fastabend@gmail.com>
> Signed-off-by: Martin KaFai Lau <kafai@fb.com>

Applied.

^ permalink raw reply

* Re: [PATCHv2 net] net: sched: fix NULL pointer dereference when action calls some targets
From: David Miller @ 2017-08-18 23:26 UTC (permalink / raw)
  To: lucien.xin; +Cc: netdev, xiyou.wangcong, pablo
In-Reply-To: <f9e3e9338eeb819c40efb13fed33514a3df23c14.1503025296.git.lucien.xin@gmail.com>

From: Xin Long <lucien.xin@gmail.com>
Date: Fri, 18 Aug 2017 11:01:36 +0800

> As we know in some target's checkentry it may dereference par.entryinfo
> to check entry stuff inside. But when sched action calls xt_check_target,
> par.entryinfo is set with NULL. It would cause kernel panic when calling
> some targets.
> 
> It can be reproduce with:
>   # tc qd add dev eth1 ingress handle ffff:
>   # tc filter add dev eth1 parent ffff: u32 match u32 0 0 action xt \
>     -j ECN --ecn-tcp-remove
> 
> It could also crash kernel when using target CLUSTERIP or TPROXY.
> 
> By now there's no proper value for par.entryinfo in ipt_init_target,
> but it can not be set with NULL. This patch is to void all these
> panics by setting it with an ipt_entry obj with all members = 0.
> 
> Note that this issue has been there since the very beginning.
> 
> Signed-off-by: Xin Long <lucien.xin@gmail.com>

Applied and queued up for -stable, thanks.

^ permalink raw reply

* Re: [PATCH v7 net-next] net: systemport: Support 64bit statistics
From: Florian Fainelli @ 2017-08-18 23:27 UTC (permalink / raw)
  To: David Miller, jqiaoulk; +Cc: eric.dumazet, netdev
In-Reply-To: <20170806.212129.568625131913902231.davem@davemloft.net>

On 08/06/2017 09:21 PM, David Miller wrote:
> From: "Jianming.qiao" <jqiaoulk@gmail.com>
> Date: Fri,  4 Aug 2017 00:07:45 +0100
> 
>> When using Broadcom Systemport device in 32bit Platform, ifconfig can
>> only report up to 4G tx,rx status, which will be wrapped to 0 when the
>> number of incoming or outgoing packets exceeds 4G, only taking
>> around 2 hours in busy network environment (such as streaming).
>> Therefore, it makes hard for network diagnostic tool to get reliable
>> statistical result, so the patch is used to add 64bit support for
>> Broadcom Systemport device in 32bit Platform.
>>
>> This patch provides 64bit statistics capability on both ethtool and ifconfig.
>>
>> Signed-off-by: Jianming.qiao <kiki-good@hotmail.com>
> 
> Applied, thanks.
> 

Kiki, I don't know how you tested that, but I keep hitting deadlocks
with that patch, I don't have time to debug this at the moment and I
still need to add the debugging that Eric suggested in another, but it
does not take long to reproduce:

# ethtool -K eth0 rx on
Cannot get device udp-fragmentation-offload settings: Operation not
supported


^C

[  283.457657] INFO: rcu_sched self-detected stall on CPU
[  283.458658] INFO: rcu_sched detected stalls on CPUs/tasks:
[  283.458669]  0-...: (21000 ticks this GP) idle=eba/140000000000001/0
softirq=3159/3159 fqs=5246
[  283.458671]  (detected by 3, t=21002 jiffies, g=736, c=735, q=52)
[  283.458682] Sending NMI from CPU 3 to CPUs 0:
[  283.487596]  0-...: (21000 ticks this GP) idle=eba/140000000000001/0
softirq=3159/3159 fqs=5247
[  283.496399]   (t=21031 jiffies g=736 c=735 q=52)
[  293.459258] rcu_sched kthread starved for 9244 jiffies! g736 c735
f0x0 RCU_GP_DOING_FQS(4) ->state=0x0
[  293.459259] NMI backtrace for cpu 0
[  293.459266] CPU: 0 PID: 1489 Comm: ethtool Not tainted
4.13.0-rc5-01160-g6e42b0004c6c-dirty #13
[  293.459269] Hardware name: Broadcom STB (Flattened Device Tree)
[  293.459301] [<c0211720>] (unwind_backtrace) from [<c020c50c>]
(show_stack+0x10/0x14)
[  293.459313] [<c020c50c>] (show_stack) from [<c0979460>]
(dump_stack+0x84/0x98)
[  293.459327] [<c0979460>] (dump_stack) from [<c097f678>]
(nmi_cpu_backtrace+0x11c/0x120)
[  293.459336] [<c097f678>] (nmi_cpu_backtrace) from [<c097f754>]
(nmi_trigger_cpumask_backtrace+0xd8/0x124)
[  293.459347] [<c097f754>] (nmi_trigger_cpumask_backtrace) from
[<c0283c88>] (rcu_dump_cpu_stacks+0xa0/0xd0)
[  293.459355] [<c0283c88>] (rcu_dump_cpu_stacks) from [<c028337c>]
(rcu_check_callbacks+0x7b4/0x930)
[  293.459364] [<c028337c>] (rcu_check_callbacks) from [<c02888cc>]
(update_process_times+0x34/0x5c)
[  293.459374] [<c02888cc>] (update_process_times) from [<c029a744>]
(tick_sched_timer+0x40/0x9c)
[  293.459383] [<c029a744>] (tick_sched_timer) from [<c02899cc>]
(__hrtimer_run_queues+0x168/0x328)
[  293.459389] [<c02899cc>] (__hrtimer_run_queues) from [<c0289dc4>]
(hrtimer_interrupt+0xa4/0x1f8)
[  293.459400] [<c0289dc4>] (hrtimer_interrupt) from [<c07edcb4>]
(arch_timer_handler_virt+0x28/0x30)
[  293.459409] [<c07edcb4>] (arch_timer_handler_virt) from [<c02728fc>]
(handle_percpu_devid_irq+0x88/0x23c)
[  293.459418] [<c02728fc>] (handle_percpu_devid_irq) from [<c026d64c>]
(generic_handle_irq+0x24/0x34)
[  293.459426] [<c026d64c>] (generic_handle_irq) from [<c026db8c>]
(__handle_domain_irq+0x5c/0xb4)
[  293.459433] [<c026db8c>] (__handle_domain_irq) from [<c020148c>]
(gic_handle_irq+0x48/0x8c)
[  293.459438] [<c020148c>] (gic_handle_irq) from [<c020d1b8>]
(__irq_svc+0x58/0x74)
[  293.459442] Exception stack(0xee3d1be0 to 0xee3d1c28)
[  293.459447] 1be0: edbe11a4 00000008 00681daf edbd6010 ee392800
edbe10e4 edbe10e4 ee392800
[  293.459451] 1c00: 00000000 c1803c48 00000000 00000002 00000000
ee3d1c30 00000000 c06e00a8
[  293.459454] 1c20: 20070013 ffffffff
[  293.459463] [<c020d1b8>] (__irq_svc) from [<c06e00a8>]
(bcm_sysport_get_stats64+0x94/0x104)
[  293.459474] [<c06e00a8>] (bcm_sysport_get_stats64) from [<c084a408>]
(dev_get_stats+0x38/0xac)
[  293.459485] [<c084a408>] (dev_get_stats) from [<c0863fc8>]
(rtnl_fill_stats+0x30/0x118)
[  293.459493] [<c0863fc8>] (rtnl_fill_stats) from [<c08645e8>]
(rtnl_fill_ifinfo+0x538/0xdc8)
[  293.459502] [<c08645e8>] (rtnl_fill_ifinfo) from [<c0868e30>]
(rtmsg_ifinfo_build_skb+0x6c/0xd8)
[  293.459507] [<c0868e30>] (rtmsg_ifinfo_build_skb) from [<c0868eb0>]
(rtmsg_ifinfo_event.part.6+0x14/0x44)
[  293.459512] [<c0868eb0>] (rtmsg_ifinfo_event.part.6) from
[<c0868f78>] (rtnetlink_event+0x70/0x7c)
[  293.459520] [<c0868f78>] (rtnetlink_event) from [<c0241778>]
(notifier_call_chain+0x44/0x84)
[  293.459527] [<c0241778>] (notifier_call_chain) from [<c0241918>]
(raw_notifier_call_chain+0x18/0x20)
[  293.459535] [<c0241918>] (raw_notifier_call_chain) from [<c084aef4>]
(netdev_features_change+0x28/0x44)
[  293.459542] [<c084aef4>] (netdev_features_change) from [<c08593e4>]
(dev_ethtool+0x404/0x28e0)
[  293.459551] [<c08593e4>] (dev_ethtool) from [<c0871910>]
(dev_ioctl+0x5c0/0x8a4)
[  293.459561] [<c0871910>] (dev_ioctl) from [<c0355a5c>]
(do_vfs_ioctl+0xac/0x7e4)
[  293.459569] [<c0355a5c>] (do_vfs_ioctl) from [<c03561c8>]
(SyS_ioctl+0x34/0x5c)
[  293.459576] [<c03561c8>] (SyS_ioctl) from [<c0208540>]
(ret_fast_syscall+0x0/0x34)
[  293.776281] rcu_sched       S    0     8      2 0x00000000
[  293.781794] [<c098eda4>] (__schedule) from [<c098f2f0>]
(schedule+0x44/0x9c)
[  293.788863] [<c098f2f0>] (schedule) from [<c0992984>]
(schedule_timeout+0x1c8/0x36c)
[  293.796627] [<c0992984>] (schedule_timeout) from [<c0281d8c>]
(rcu_gp_kthread+0x6bc/0xf7c)
[  293.804919] [<c0281d8c>] (rcu_gp_kthread) from [<c0240060>]
(kthread+0x128/0x158)
[  293.812423] [<c0240060>] (kthread) from [<c02085e8>]
(ret_from_fork+0x14/0x2c)

[  346.462654] INFO: rcu_sched self-detected stall on CPU
[  346.463655] INFO: rcu_sched detected stalls on CPUs/tasks:
[  346.463664]  0-...: (74005 ticks this GP) idle=eba/140000000000001/0
softirq=3159/3159 fqs=18231
[  346.463666]  (detected by 1, t=84007 jiffies, g=736, c=735, q=64)
[  346.463673] Sending NMI from CPU 1 to CPUs 0:
[  346.492658]  0-...: (74005 ticks this GP) idle=eba/140000000000001/0
softirq=3159/3159 fqs=18231
[  346.501547]   (t=84044 jiffies g=736 c=735 q=64)

-- 
Florian

^ permalink raw reply

* Re: [PATCH] mlx5: ensure 0 is returned when vport is zero
From: David Miller @ 2017-08-18 23:29 UTC (permalink / raw)
  To: colin.king-Z7WLFzj8eWMS+FvcfC7Uqw
  Cc: saeedm-VPRAkNaXOzVWk0Htik3J/w, matanb-VPRAkNaXOzVWk0Htik3J/w,
	leonro-VPRAkNaXOzVWk0Htik3J/w, netdev-u79uwXL29TY76Z2rM5mHXA,
	linux-rdma-u79uwXL29TY76Z2rM5mHXA,
	kernel-janitors-u79uwXL29TY76Z2rM5mHXA,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA
In-Reply-To: <20170818134925.16604-1-colin.king-Z7WLFzj8eWMS+FvcfC7Uqw@public.gmane.org>

From: Colin King <colin.king-Z7WLFzj8eWMS+FvcfC7Uqw@public.gmane.org>
Date: Fri, 18 Aug 2017 14:49:25 +0100

> From: Colin Ian King <colin.king-Z7WLFzj8eWMS+FvcfC7Uqw@public.gmane.org>
> 
> Currently, if vport is zero then then an uninialized return status
> in err is returned.  Since the only return status at the end of the
> function esw_add_uc_addr is zero for the current set of return paths
> we may as well just return 0 rather than err to fix this issue.
> 
> Detected by CoverityScan, CID#1452698 ("Uninitialized scalar variable")
> 
> Fixes: eeb66cdb6826 ("net/mlx5: Separate between E-Switch and MPFS")
> Signed-off-by: Colin Ian King <colin.king-Z7WLFzj8eWMS+FvcfC7Uqw@public.gmane.org>

Applied to net-next, thanks.
--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply

* Re: [PATCH net-next v2 00/13] amd-xgbe: AMD XGBE driver updates 2017-08-17
From: David Miller @ 2017-08-18 23:31 UTC (permalink / raw)
  To: thomas.lendacky; +Cc: netdev
In-Reply-To: <20170818140209.14804.94997.stgit@tlendack-t1.amdoffice.net>

From: Tom Lendacky <thomas.lendacky@amd.com>
Date: Fri, 18 Aug 2017 09:02:09 -0500

> The following updates are included in this driver update series:
> 
> - Set the MDIO mode to clause 45 for the 10GBase-T configuration
> - Set the MII control width to 8-bits for speeds less than 1Gbps
> - Fix an issue to related to module removal when the devices are up
> - Fix ethtool statistics related to packet counting of TSO packets
> - Add support for device renaming
> - Add additional dynamic debug output for the PCS window calculation
> - Optimize reading of DMA channel interrupt enablement register
> - Add additional dynamic debug output about the hardware features
> - Add per queue Tx and Rx ethtool statistics
> - Add a macro to clear ethtool_link_ksettings modes
> - Convert the driver to use the ethtool_link_ksettings
> - Add support for VXLAN offload capabilities
> - Add additional ethtool statistics related to VXLAN
> 
> This patch series is based on net-next.

Series applied, thanks.

^ permalink raw reply

* Re: [PATCH v3 net-next] bpf/verifier: track liveness for pruning
From: Alexei Starovoitov via iovisor-dev @ 2017-08-18 23:37 UTC (permalink / raw)
  To: Edward Cree, davem-fT/PcQaiUtIeIZ0/mPfg9Q, Alexei Starovoitov,
	Daniel Borkmann
  Cc: netdev-u79uwXL29TY76Z2rM5mHXA, iovisor-dev,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA
In-Reply-To: <b8a46388-74ea-81f9-df08-3b6b88042229-s/n/eUQHGBpZroRs9YW3xA@public.gmane.org>

On 8/18/17 7:16 AM, Edward Cree wrote:
> On 18/08/17 04:21, Alexei Starovoitov wrote:
>> On 8/15/17 12:34 PM, Edward Cree wrote:
>>> State of a register doesn't matter if it wasn't read in reaching an exit;
>>>  a write screens off all reads downstream of it from all explored_states
>>>  upstream of it.
>>> This allows us to prune many more branches; here are some processed insn
>>>  counts for some Cilium programs:
>>> Program                  before  after
>>> bpf_lb_opt_-DLB_L3.o       6515   3361
>>> bpf_lb_opt_-DLB_L4.o       8976   5176
>>> bpf_lb_opt_-DUNKNOWN.o     2960   1137
>>> bpf_lxc_opt_-DDROP_ALL.o  95412  48537
>>> bpf_lxc_opt_-DUNKNOWN.o  141706  78718
>>> bpf_netdev.o              24251  17995
>>> bpf_overlay.o             10999   9385
>>>
>>> The runtime is also improved; here are 'time' results in ms:
>>> Program                  before  after
>>> bpf_lb_opt_-DLB_L3.o         24      6
>>> bpf_lb_opt_-DLB_L4.o         26     11
>>> bpf_lb_opt_-DUNKNOWN.o       11      2
>>> bpf_lxc_opt_-DDROP_ALL.o   1288    139
>>> bpf_lxc_opt_-DUNKNOWN.o    1768    234
>>> bpf_netdev.o                 62     31
>>> bpf_overlay.o                15     13
>>>
>>> Signed-off-by: Edward Cree <ecree-s/n/eUQHGBpZroRs9YW3xA@public.gmane.org>
>>
>> this is one ingenious hack. Love it!
>> I took me whole day to understand most of it, but I still have
>> few questions:
>>
>>> +
>>> +static void propagate_liveness(const struct bpf_verifier_state *state,
>>> +                   struct bpf_verifier_state *parent)
>>
>> here the name 'parent' is very confusing, since for the first
>> iteration of the loop below it transfers lives from 'neighbor'
>> state to the current state and only then traverses the link
>> of parents in the current.
>> Would be good to document it, since I was struggling the most
>> with this name until I realized that the way you build parent link list
>> in is_state_visited() is actual sequence of roughly basic blocks and
>> the name 'parent' applies there, but not for the first iteration
>> of this function.
> In the way I think about it, it really is a parent, by which I mean "a
>  block whose output registers are our inputs".  When we call
>  propagate_liveness(), we're saying "here's another block whose outputs
>  could be our inputs".  So while the 'state->parent' datastructure is a
>  linked list, the 'parentage' relationship is really a DAG.
> While 'state->parent' always has to point at an explored_state (which are
>  roughly immutable), the 'parent' we pass into propagate_liveness is just
>  env->cur_state which is mutable.  The weird "it's not actually
>  state->parent" comes from (a) there's only room for one state->parent and
>  (b) we didn't create a new_sl for it because we're pruning.
> I agree this is not at all explained in the code or comments, except for
>  the glib "Registers read by the continuation are read by us".  I will try
>  to write some comments and/or documentation explaining how and why the
>  liveness tracking works, because it _is_ subtle and a week from now _I_
>  probably won't understand it either.

a bit twisted, but yeah agree with this angle of thinking about it :)

>
>>> @@ -3407,6 +3501,14 @@ static int is_state_visited(struct bpf_verifier_env *env, int insn_idx)
>>>      memcpy(&new_sl->state, &env->cur_state, sizeof(env->cur_state));
>>>      new_sl->next = env->explored_states[insn_idx];
>>>      env->explored_states[insn_idx] = new_sl;
>>> +    /* connect new state to parentage chain */
>>> +    env->cur_state.parent = &new_sl->state;
>>> +    /* clear liveness marks in current state */
>>> +    for (i = 0; i < BPF_REG_FP; i++)
>>> +        env->cur_state.regs[i].live = REG_LIVE_NONE;
>>> +    for (i = 0; i < MAX_BPF_STACK / BPF_REG_SIZE; i++)
>>> +        if (env->cur_state.stack_slot_type[i * BPF_REG_SIZE] == STACK_SPILL)
>>> +            env->cur_state.spilled_regs[i].live = REG_LIVE_NONE;
>>
>> and this part I don't get at all.
> The idea behind the liveness marks is that 'writes go down and reads go up
>  and up'.  That is, each state 'tells' its parent state 'I read from you'
>  (which can then end up recursing), but it remembers for itself that it
>  wrote to a register and therefore should swallow subsequent reads rather
>  than forwarding them to its parent.
> While a block is being walked, its liveness marks _only_ record writes, and
>  then only writes _made in that block_.  The read marks go to the parent,
>  which is "some block that has been walked", but whose continuations haven't
>  all been walked yet so (by looplessness) won't show up as a pruning
>  opportunity.  An explored_state's liveness marks record the writes _done in
>  reaching that state_ from its parent, but the reads _done by the state's
>  children_.  A cur_state's liveness marks do the same, but it doesn't have
>  any children yet so it never gets read-marks (until it gets turned into an
>  explored_state, of course).
> We clear our liveness marks because the writes our parent block did are not
>  writes we did, so they don't screen off our reads from our parent;

got it. makes sense. would be good to document it.

>> It seems you're trying to sort-of do per-fake-basic block liveness
>> analysis, but our state_list_marks are not correct if we go with
>> canonical basic block definition, since we mark the jump insn and
>> not insn after the branch and not every basic block boundary is
>> properly detected.
> I think the reason this works is that jump insns can't do writes.
> Whenever we pop a branch and restore its register state, we _also_ restore
>  its parentage state.  Then if we decide to prune, we're saying that
>  whatever it takes to get from sl->state to an exit, we must also do.
> A read mark on sl->state says that in the process of getting to an exit, the
>  register was written before it was read.  A write mark on sl->state says
>  that the straight-line code resulting in sl->state wrote to the register,
>  so reads shouldn't propagate to its ->parent.  But by the time we're using
>  sl->state in is_state_visited(), its continuations have all been probed, so
>  it can never gain more reads, so its write marks are irrelevant; they
>  mustn't stop the state's reads propagating to new 'pruning parents' because
>  those didn't arrive at the state through the straight-line code.
> So maybe the first iteration through propagate_liveness() really _is_
>  special, and that if it were done differently (ignoring write marks) then
>  it really wouldn't matter at all where our state_list_marks were in
>  relation to the basic blocks.  But I _think_ that, because we mark jump
>  insns as well as their destinations (and a popped branch is always a
>  destination, rather than the 'insn after the branch'), the sl->state will
>  never have any write marks and it'll all just work.
> But I should really test that!

also makes sense.

>> So if algorithm should only work for basic blocks (for sequences of
>> instructions without control flow changes) then it's broken.
>> If it should work with control flow insns then it should also work
>> for the whole chain of insns from the first one till bpf_exit...
>> So I tried removing two above clearing loops and results are much
>> better:
>>                         before  after
>> bpf_lb-DLB_L3.o         2604    1120
>> bpf_lb-DLB_L4.o         11159   1371
>> bpf_lb-DUNKNOWN.o       1116    485
>> bpf_lxc-DDROP_ALL.o     34566   12758
>> bpf_lxc-DUNKNOWN.o      53267   18337
>> bpf_netdev.o            17843   10564
>> bpf_overlay.o           8672    5513
>>
>> but it feels too good to be true and probably not correct.
>> So either way we need to fix something it seems.
> Without that clearing, write marks will never be cleared, meaning that once
>  a register has been written to it will _never again_ forward a read.  Since
>  every register (except ctx and fp) must be written before it can be read,
>  no register (except ctx) will ever be marked as read, and all the branches
>  will be pruned away.
>
> Reads are a property of a state - roughly, "what do we read in getting from
>  this state to a BPF_EXIT?" - whereas writes are a property of a block,
>  "what do we write in getting from the parent to here?"
> Thus, reads may (indeed must) propagate (upwards), but writes must _never_
>  spread beyond the 'end-state' of their block.
>
> I hope this answers your questions, because I'm not entirely sure _I_
>  understand it either.  Or rather, when I think about it for an hour, it
>  makes sense for about fifteen seconds, and then it's gone again and all I
>  can remember is "write down, read up".

yes. thank a lot for explanation.
I've been playing with algo for the whole day and yes indeed it's
magically doesn't depend on proper basic block boundaries.
So having cfg jump into or out of a block of instructions actually
doesn't break the algo. My understanding this is correct, since
we only create such blocks at state_list_marks and we also check
for state equality at the same state_list_marks, so if control flow
jumps into the middle of the block, there is no equal state to find
and it will continue as normal.

I've tried moving state_list_mark all over the place. Made them
control flow accurate and it improved cilium prog load stats,
but not a lot, so I will hold on such patch for now.
Then I tried to add state_list_marks for every odd insn and
unfortunately it uncovered a bug in this liveness logic.
The following test can reproduce it with existing code (no extra hacks)
         {
                 "grr",
                 .insns = {
                         BPF_ST_MEM(BPF_DW, BPF_REG_10, -8, 0),
                         BPF_MOV64_REG(BPF_REG_2, BPF_REG_10),
                         BPF_ALU64_IMM(BPF_ADD, BPF_REG_2, -8),
                         BPF_LD_MAP_FD(BPF_REG_1, 0),
                         BPF_RAW_INSN(BPF_JMP | BPF_CALL, 0, 0, 0,
                                      BPF_FUNC_map_lookup_elem),
                         BPF_JMP_IMM(BPF_JEQ, BPF_REG_0, 0, 8),
                         BPF_LDX_MEM(BPF_DW, BPF_REG_1, BPF_REG_0, 0),
                         BPF_MOV32_IMM(BPF_REG_2, MAX_ENTRIES),
                         BPF_JMP_REG(BPF_JSGT, BPF_REG_2, BPF_REG_1, 1),
                         BPF_MOV32_IMM(BPF_REG_1, 0),
                         BPF_ALU32_IMM(BPF_LSH, BPF_REG_1, 2),
                         BPF_ALU64_REG(BPF_ADD, BPF_REG_0, BPF_REG_1),
                         BPF_JMP_IMM(BPF_JA, 0, 0, 0),
                         BPF_ST_MEM(BPF_DW, BPF_REG_0, 0,
                                    offsetof(struct test_val, foo)),
                         BPF_EXIT_INSN(),
                 },
                 .fixup_map2 = { 3 },
                 .errstr_unpriv = "R0 leaks addr",
                 .errstr = "R0 unbounded memory access",
                 .result_unpriv = REJECT,
                 .result = REJECT,
                 .flags = F_NEEDS_EFFICIENT_UNALIGNED_ACCESS,
         },

the trick here is extra BPF_JMP_IMM(BPF_JA, 0, 0, 0)
which is a nop, but it forces our state_list_mark logic to
mark subsequent ST_MEM(DW) insn and liveness logic finds old
state at this insn and declares it equivalent.
The output looks like:
0: (7a) *(u64 *)(r10 -8) = 0
1: (bf) r2 = r10
2: (07) r2 += -8
3: (18) r1 = 0xffff880139d45200
5: (85) call bpf_map_lookup_elem#1
6: (15) if r0 == 0x0 goto pc+8
  R0=map_value(id=0,off=0,ks=8,vs=48,imm=0) R10=fp0
7: (79) r1 = *(u64 *)(r0 +0)
  R0=map_value(id=0,off=0,ks=8,vs=48,imm=0) R10=fp0
8: (b4) (u32) r2 = (u32) 11
9: (6d) if r2 s> r1 goto pc+1
  R0=map_value(id=0,off=0,ks=8,vs=48,imm=0) 
R1=inv(id=0,umin_value=11,umax_value=9223372036854775807,var_off=(0x0; 
0x7fffffffffffffff)) R2=inv11 R10=fp0
10: (b4) (u32) r1 = (u32) 0
11: (64) (u32) r1 <<= (u32) 2
12: (0f) r0 += r1
13: (05) goto pc+0
14: (7a) *(u64 *)(r0 +0) = 4
  R0=map_value(id=0,off=0,ks=8,vs=48,imm=0) R1=inv0 R2=inv11 R10=fp0
15: (95) exit

from 9 to 11: R0=map_value(id=0,off=0,ks=8,vs=48,imm=0) 
R1=inv(id=0,smax_value=10) R2=inv11 R10=fp0
11: (64) (u32) r1 <<= (u32) 2
12: (0f) r0 += r1
13: (05) goto pc+0
14: safe

from 6 to 15: R0=inv0 R10=fp0
15: (95) exit


that '14: safe' above is not correct.

Disabling liveness as:
@@ -3282,7 +3288,7 @@ static bool regsafe(struct bpf_reg_state *rold,
                     struct bpf_reg_state *rcur,
                     bool varlen_map_access, struct idpair *idmap)
  {
-       if (!(rold->live & REG_LIVE_READ))
+       if (0 && !(rold->live & REG_LIVE_READ))

makes the test work properly and proper verifier output is:
from 9 to 11: R0=map_value(id=0,off=0,ks=8,vs=48,imm=0) 
R1=inv(id=0,smax_value=10) R2=inv11 R10=fp0
11: (64) (u32) r1 <<= (u32) 2
12: (0f) r0 += r1
13: (05) goto pc+0
14: (7a) *(u64 *)(r0 +0) = 4
 
R0=map_value(id=0,off=0,ks=8,vs=48,umax_value=17179869180,var_off=(0x0; 
0x3fffffffc)) R1=inv(id=0,umax_value=17179869180,var_off=(0x0; 
0x3fffffffc)) R2=inv11 R10=fp0
R0 unbounded memory access, make sure to bounds check any array access 
into a map

I don't yet understand the underlying reason. R0 should have been
marked as LIVE_READ by ST_MEM...
Please help debug it further.

^ permalink raw reply

* [PATCH 0/3] MIPS,bpf: Improvements for MIPS eBPF JIT
From: David Daney @ 2017-08-18 23:40 UTC (permalink / raw)
  To: Alexei Starovoitov, Daniel Borkmann, David S. Miller, netdev,
	linux-kernel
  Cc: linux-mips, ralf, David Daney

Here are several improvements and bug fixes for the MIPS eBPF JIT.

The main change is the addition of support for JLT, JLE, JSLT and JSLE
ops, that were recently added.

Also fix WARN output when used with preemptable kernel, and a small
cleanup/optimization in the use of BPF_OP(insn->code).

I suggest that the whole thing go via the BPF/net-next path as there
are dependencies on code that is not yet merged to Linus' tree.

Still pending are changes to reduce stack usage when the verifier can
determine the maximum stack size.

David Daney (3):
  MIPS,bpf: Fix using smp_processor_id() in preemptible splat.
  MIPS,bpf: Implement JLT, JLE, JSLT and JSLE ops in the eBPF JIT.
  MIPS,bpf: Cache value of BPF_OP(insn->code) in eBPF JIT.

 arch/mips/net/ebpf_jit.c | 162 ++++++++++++++++++++++++++++++-----------------
 1 file changed, 103 insertions(+), 59 deletions(-)

-- 
2.9.5

^ permalink raw reply

* [PATCH 1/3] MIPS,bpf: Fix using smp_processor_id() in preemptible splat.
From: David Daney @ 2017-08-18 23:40 UTC (permalink / raw)
  To: Alexei Starovoitov, Daniel Borkmann, David S. Miller, netdev,
	linux-kernel
  Cc: linux-mips, ralf, David Daney
In-Reply-To: <20170818234033.5990-1-david.daney@cavium.com>

If the kernel is configured with preemption enabled we were getting
warning stack traces for use of current_cpu_type().

Fix by moving the test between preempt_disable()/preempt_enable() and
caching the results of the CPU type tests for use during code
generation.

Signed-off-by: David Daney <david.daney@cavium.com>
---
 arch/mips/net/ebpf_jit.c | 28 ++++++++++++++--------------
 1 file changed, 14 insertions(+), 14 deletions(-)

diff --git a/arch/mips/net/ebpf_jit.c b/arch/mips/net/ebpf_jit.c
index 3f87b96..721216b 100644
--- a/arch/mips/net/ebpf_jit.c
+++ b/arch/mips/net/ebpf_jit.c
@@ -113,6 +113,7 @@ struct jit_ctx {
 	u64 *reg_val_types;
 	unsigned int long_b_conversion:1;
 	unsigned int gen_b_offsets:1;
+	unsigned int use_bbit_insns:1;
 };
 
 static void set_reg_val_type(u64 *rvt, int reg, enum reg_val_type type)
@@ -655,19 +656,6 @@ static int emit_bpf_tail_call(struct jit_ctx *ctx, int this_idx)
 	return build_int_epilogue(ctx, MIPS_R_T9);
 }
 
-static bool use_bbit_insns(void)
-{
-	switch (current_cpu_type()) {
-	case CPU_CAVIUM_OCTEON:
-	case CPU_CAVIUM_OCTEON_PLUS:
-	case CPU_CAVIUM_OCTEON2:
-	case CPU_CAVIUM_OCTEON3:
-		return true;
-	default:
-		return false;
-	}
-}
-
 static bool is_bad_offset(int b_off)
 {
 	return b_off > 0x1ffff || b_off < -0x20000;
@@ -1198,7 +1186,7 @@ static int build_one_insn(const struct bpf_insn *insn, struct jit_ctx *ctx,
 		if (dst < 0)
 			return dst;
 
-		if (use_bbit_insns() && hweight32((u32)insn->imm) == 1) {
+		if (ctx->use_bbit_insns && hweight32((u32)insn->imm) == 1) {
 			if ((insn + 1)->code == (BPF_JMP | BPF_EXIT) && insn->off == 1) {
 				b_off = b_imm(exit_idx, ctx);
 				if (is_bad_offset(b_off))
@@ -1853,6 +1841,18 @@ struct bpf_prog *bpf_int_jit_compile(struct bpf_prog *prog)
 
 	memset(&ctx, 0, sizeof(ctx));
 
+	preempt_disable();
+	switch (current_cpu_type()) {
+	case CPU_CAVIUM_OCTEON:
+	case CPU_CAVIUM_OCTEON_PLUS:
+	case CPU_CAVIUM_OCTEON2:
+	case CPU_CAVIUM_OCTEON3:
+		ctx.use_bbit_insns = 1;
+	default:
+		ctx.use_bbit_insns = 0;
+	}
+	preempt_enable();
+
 	ctx.offsets = kcalloc(prog->len + 1, sizeof(*ctx.offsets), GFP_KERNEL);
 	if (ctx.offsets == NULL)
 		goto out_err;
-- 
2.9.5

^ permalink raw reply related

* [PATCH 2/3] MIPS,bpf: Implement JLT, JLE, JSLT and JSLE ops in the eBPF JIT.
From: David Daney @ 2017-08-18 23:40 UTC (permalink / raw)
  To: Alexei Starovoitov, Daniel Borkmann, David S. Miller, netdev,
	linux-kernel
  Cc: linux-mips, ralf, David Daney
In-Reply-To: <20170818234033.5990-1-david.daney@cavium.com>

Signed-off-by: David Daney <david.daney@cavium.com>
---
 arch/mips/net/ebpf_jit.c | 101 +++++++++++++++++++++++++++++++++--------------
 1 file changed, 72 insertions(+), 29 deletions(-)

diff --git a/arch/mips/net/ebpf_jit.c b/arch/mips/net/ebpf_jit.c
index 721216b..c1e21cb 100644
--- a/arch/mips/net/ebpf_jit.c
+++ b/arch/mips/net/ebpf_jit.c
@@ -990,8 +990,12 @@ static int build_one_insn(const struct bpf_insn *insn, struct jit_ctx *ctx,
 		goto jeq_common;
 	case BPF_JMP | BPF_JEQ | BPF_X: /* JMP_REG */
 	case BPF_JMP | BPF_JNE | BPF_X:
+	case BPF_JMP | BPF_JSLT | BPF_X:
+	case BPF_JMP | BPF_JSLE | BPF_X:
 	case BPF_JMP | BPF_JSGT | BPF_X:
 	case BPF_JMP | BPF_JSGE | BPF_X:
+	case BPF_JMP | BPF_JLT | BPF_X:
+	case BPF_JMP | BPF_JLE | BPF_X:
 	case BPF_JMP | BPF_JGT | BPF_X:
 	case BPF_JMP | BPF_JGE | BPF_X:
 	case BPF_JMP | BPF_JSET | BPF_X:
@@ -1013,28 +1017,34 @@ static int build_one_insn(const struct bpf_insn *insn, struct jit_ctx *ctx,
 			cmp_eq = false;
 			dst = MIPS_R_AT;
 			src = MIPS_R_ZERO;
-		} else if (BPF_OP(insn->code) == BPF_JSGT) {
+		} else if (BPF_OP(insn->code) == BPF_JSGT || BPF_OP(insn->code) == BPF_JSLE) {
 			emit_instr(ctx, dsubu, MIPS_R_AT, dst, src);
 			if ((insn + 1)->code == (BPF_JMP | BPF_EXIT) && insn->off == 1) {
 				b_off = b_imm(exit_idx, ctx);
 				if (is_bad_offset(b_off))
 					return -E2BIG;
-				emit_instr(ctx, blez, MIPS_R_AT, b_off);
+				if (BPF_OP(insn->code) == BPF_JSGT)
+					emit_instr(ctx, blez, MIPS_R_AT, b_off);
+				else
+					emit_instr(ctx, bgtz, MIPS_R_AT, b_off);
 				emit_instr(ctx, nop);
 				return 2; /* We consumed the exit. */
 			}
 			b_off = b_imm(this_idx + insn->off + 1, ctx);
 			if (is_bad_offset(b_off))
 				return -E2BIG;
-			emit_instr(ctx, bgtz, MIPS_R_AT, b_off);
+			if (BPF_OP(insn->code) == BPF_JSGT)
+				emit_instr(ctx, bgtz, MIPS_R_AT, b_off);
+			else
+				emit_instr(ctx, blez, MIPS_R_AT, b_off);
 			emit_instr(ctx, nop);
 			break;
-		} else if (BPF_OP(insn->code) == BPF_JSGE) {
+		} else if (BPF_OP(insn->code) == BPF_JSGE || BPF_OP(insn->code) == BPF_JSLT) {
 			emit_instr(ctx, slt, MIPS_R_AT, dst, src);
-			cmp_eq = true;
+			cmp_eq = BPF_OP(insn->code) == BPF_JSGE;
 			dst = MIPS_R_AT;
 			src = MIPS_R_ZERO;
-		} else if (BPF_OP(insn->code) == BPF_JGT) {
+		} else if (BPF_OP(insn->code) == BPF_JGT || BPF_OP(insn->code) == BPF_JLE) {
 			/* dst or src could be AT */
 			emit_instr(ctx, dsubu, MIPS_R_T8, dst, src);
 			emit_instr(ctx, sltu, MIPS_R_AT, dst, src);
@@ -1042,12 +1052,12 @@ static int build_one_insn(const struct bpf_insn *insn, struct jit_ctx *ctx,
 			emit_instr(ctx, movz, MIPS_R_T9, MIPS_R_SP, MIPS_R_T8);
 			emit_instr(ctx, movn, MIPS_R_T9, MIPS_R_ZERO, MIPS_R_T8);
 			emit_instr(ctx, or, MIPS_R_AT, MIPS_R_T9, MIPS_R_AT);
-			cmp_eq = true;
+			cmp_eq = BPF_OP(insn->code) == BPF_JGT;
 			dst = MIPS_R_AT;
 			src = MIPS_R_ZERO;
-		} else if (BPF_OP(insn->code) == BPF_JGE) {
+		} else if (BPF_OP(insn->code) == BPF_JGE || BPF_OP(insn->code) == BPF_JLT) {
 			emit_instr(ctx, sltu, MIPS_R_AT, dst, src);
-			cmp_eq = true;
+			cmp_eq = BPF_OP(insn->code) == BPF_JGE;
 			dst = MIPS_R_AT;
 			src = MIPS_R_ZERO;
 		} else { /* JNE/JEQ case */
@@ -1110,6 +1120,8 @@ static int build_one_insn(const struct bpf_insn *insn, struct jit_ctx *ctx,
 		break;
 	case BPF_JMP | BPF_JSGT | BPF_K: /* JMP_IMM */
 	case BPF_JMP | BPF_JSGE | BPF_K: /* JMP_IMM */
+	case BPF_JMP | BPF_JSLT | BPF_K: /* JMP_IMM */
+	case BPF_JMP | BPF_JSLE | BPF_K: /* JMP_IMM */
 		cmp_eq = (BPF_OP(insn->code) == BPF_JSGE);
 		dst = ebpf_to_mips_reg(ctx, insn, dst_reg_fp_ok);
 		if (dst < 0)
@@ -1120,65 +1132,92 @@ static int build_one_insn(const struct bpf_insn *insn, struct jit_ctx *ctx,
 				b_off = b_imm(exit_idx, ctx);
 				if (is_bad_offset(b_off))
 					return -E2BIG;
-				if (cmp_eq)
-					emit_instr(ctx, bltz, dst, b_off);
-				else
+				switch (BPF_OP(insn->code)) {
+				case BPF_JSGT:
 					emit_instr(ctx, blez, dst, b_off);
+					break;
+				case BPF_JSGE:
+					emit_instr(ctx, bltz, dst, b_off);
+					break;
+				case BPF_JSLT:
+					emit_instr(ctx, bgez, dst, b_off);
+					break;
+				case BPF_JSLE:
+					emit_instr(ctx, bgtz, dst, b_off);
+					break;
+				}
 				emit_instr(ctx, nop);
 				return 2; /* We consumed the exit. */
 			}
 			b_off = b_imm(this_idx + insn->off + 1, ctx);
 			if (is_bad_offset(b_off))
 				return -E2BIG;
-			if (cmp_eq)
-				emit_instr(ctx, bgez, dst, b_off);
-			else
+			switch (BPF_OP(insn->code)) {
+			case BPF_JSGT:
 				emit_instr(ctx, bgtz, dst, b_off);
+				break;
+			case BPF_JSGE:
+				emit_instr(ctx, bgez, dst, b_off);
+				break;
+			case BPF_JSLT:
+				emit_instr(ctx, bltz, dst, b_off);
+				break;
+			case BPF_JSLE:
+				emit_instr(ctx, blez, dst, b_off);
+				break;
+			}
 			emit_instr(ctx, nop);
 			break;
 		}
 		/*
 		 * only "LT" compare available, so we must use imm + 1
-		 * to generate "GT"
+		 * to generate "GT" and imm -1 to generate LE
 		 */
-		t64s = insn->imm + (cmp_eq ? 0 : 1);
+		if (BPF_OP(insn->code) == BPF_JSGT)
+			t64s = insn->imm + 1;
+		else if (BPF_OP(insn->code) == BPF_JSLE)
+			t64s = insn->imm + 1;
+		else
+			t64s = insn->imm;
+
+		cmp_eq = BPF_OP(insn->code) == BPF_JSGT || BPF_OP(insn->code) == BPF_JSGE;
 		if (t64s >= S16_MIN && t64s <= S16_MAX) {
 			emit_instr(ctx, slti, MIPS_R_AT, dst, (int)t64s);
 			src = MIPS_R_AT;
 			dst = MIPS_R_ZERO;
-			cmp_eq = true;
 			goto jeq_common;
 		}
 		emit_const_to_reg(ctx, MIPS_R_AT, (u64)t64s);
 		emit_instr(ctx, slt, MIPS_R_AT, dst, MIPS_R_AT);
 		src = MIPS_R_AT;
 		dst = MIPS_R_ZERO;
-		cmp_eq = true;
 		goto jeq_common;
 
 	case BPF_JMP | BPF_JGT | BPF_K:
 	case BPF_JMP | BPF_JGE | BPF_K:
+	case BPF_JMP | BPF_JLT | BPF_K:
+	case BPF_JMP | BPF_JLE | BPF_K:
 		cmp_eq = (BPF_OP(insn->code) == BPF_JGE);
 		dst = ebpf_to_mips_reg(ctx, insn, dst_reg_fp_ok);
 		if (dst < 0)
 			return dst;
 		/*
 		 * only "LT" compare available, so we must use imm + 1
-		 * to generate "GT"
+		 * to generate "GT" and imm -1 to generate LE
 		 */
-		t64s = (u64)(u32)(insn->imm) + (cmp_eq ? 0 : 1);
-		if (t64s >= 0 && t64s <= S16_MAX) {
-			emit_instr(ctx, sltiu, MIPS_R_AT, dst, (int)t64s);
-			src = MIPS_R_AT;
-			dst = MIPS_R_ZERO;
-			cmp_eq = true;
-			goto jeq_common;
-		}
+		if (BPF_OP(insn->code) == BPF_JGT)
+			t64s = (u64)(u32)(insn->imm) + 1;
+		else if (BPF_OP(insn->code) == BPF_JLE)
+			t64s = (u64)(u32)(insn->imm) + 1;
+		else
+			t64s = (u64)(u32)(insn->imm);
+
+		cmp_eq = BPF_OP(insn->code) == BPF_JGT || BPF_OP(insn->code) == BPF_JGE;
+
 		emit_const_to_reg(ctx, MIPS_R_AT, (u64)t64s);
 		emit_instr(ctx, sltu, MIPS_R_AT, dst, MIPS_R_AT);
 		src = MIPS_R_AT;
 		dst = MIPS_R_ZERO;
-		cmp_eq = true;
 		goto jeq_common;
 
 	case BPF_JMP | BPF_JSET | BPF_K: /* JMP_IMM */
@@ -1712,10 +1751,14 @@ static int reg_val_propagate_range(struct jit_ctx *ctx, u64 initial_rvt,
 			case BPF_JEQ:
 			case BPF_JGT:
 			case BPF_JGE:
+			case BPF_JLT:
+			case BPF_JLE:
 			case BPF_JSET:
 			case BPF_JNE:
 			case BPF_JSGT:
 			case BPF_JSGE:
+			case BPF_JSLT:
+			case BPF_JSLE:
 				if (follow_taken) {
 					rvt[idx] |= RVT_BRANCH_TAKEN;
 					idx += insn->off;
-- 
2.9.5

^ permalink raw reply related

* [PATCH 3/3] MIPS,bpf: Cache value of BPF_OP(insn->code) in eBPF JIT.
From: David Daney @ 2017-08-18 23:40 UTC (permalink / raw)
  To: Alexei Starovoitov, Daniel Borkmann, David S. Miller, netdev,
	linux-kernel
  Cc: linux-mips, ralf, David Daney
In-Reply-To: <20170818234033.5990-1-david.daney@cavium.com>

The code looks a little cleaner if we replace BPF_OP(insn->code) with
the local variable bpf_op.  Caching the value this way also saves 300
bytes (about 1%) in the code size of the JIT.

Signed-off-by: David Daney <david.daney@cavium.com>
---
 arch/mips/net/ebpf_jit.c | 67 ++++++++++++++++++++++++------------------------
 1 file changed, 34 insertions(+), 33 deletions(-)

diff --git a/arch/mips/net/ebpf_jit.c b/arch/mips/net/ebpf_jit.c
index c1e21cb..44ddc12 100644
--- a/arch/mips/net/ebpf_jit.c
+++ b/arch/mips/net/ebpf_jit.c
@@ -670,6 +670,7 @@ static int build_one_insn(const struct bpf_insn *insn, struct jit_ctx *ctx,
 	unsigned int target;
 	u64 t64;
 	s64 t64s;
+	int bpf_op = BPF_OP(insn->code);
 
 	switch (insn->code) {
 	case BPF_ALU64 | BPF_ADD | BPF_K: /* ALU64_IMM */
@@ -758,13 +759,13 @@ static int build_one_insn(const struct bpf_insn *insn, struct jit_ctx *ctx,
 			emit_instr(ctx, sll, dst, dst, 0);
 		if (insn->imm == 1) {
 			/* div by 1 is a nop, mod by 1 is zero */
-			if (BPF_OP(insn->code) == BPF_MOD)
+			if (bpf_op == BPF_MOD)
 				emit_instr(ctx, addu, dst, MIPS_R_ZERO, MIPS_R_ZERO);
 			break;
 		}
 		gen_imm_to_reg(insn, MIPS_R_AT, ctx);
 		emit_instr(ctx, divu, dst, MIPS_R_AT);
-		if (BPF_OP(insn->code) == BPF_DIV)
+		if (bpf_op == BPF_DIV)
 			emit_instr(ctx, mflo, dst);
 		else
 			emit_instr(ctx, mfhi, dst);
@@ -786,13 +787,13 @@ static int build_one_insn(const struct bpf_insn *insn, struct jit_ctx *ctx,
 
 		if (insn->imm == 1) {
 			/* div by 1 is a nop, mod by 1 is zero */
-			if (BPF_OP(insn->code) == BPF_MOD)
+			if (bpf_op == BPF_MOD)
 				emit_instr(ctx, addu, dst, MIPS_R_ZERO, MIPS_R_ZERO);
 			break;
 		}
 		gen_imm_to_reg(insn, MIPS_R_AT, ctx);
 		emit_instr(ctx, ddivu, dst, MIPS_R_AT);
-		if (BPF_OP(insn->code) == BPF_DIV)
+		if (bpf_op == BPF_DIV)
 			emit_instr(ctx, mflo, dst);
 		else
 			emit_instr(ctx, mfhi, dst);
@@ -817,7 +818,7 @@ static int build_one_insn(const struct bpf_insn *insn, struct jit_ctx *ctx,
 			emit_instr(ctx, dinsu, dst, MIPS_R_ZERO, 32, 32);
 		did_move = false;
 		if (insn->src_reg == BPF_REG_10) {
-			if (BPF_OP(insn->code) == BPF_MOV) {
+			if (bpf_op == BPF_MOV) {
 				emit_instr(ctx, daddiu, dst, MIPS_R_SP, MAX_BPF_STACK);
 				did_move = true;
 			} else {
@@ -827,7 +828,7 @@ static int build_one_insn(const struct bpf_insn *insn, struct jit_ctx *ctx,
 		} else if (get_reg_val_type(ctx, this_idx, insn->src_reg) == REG_32BIT) {
 			int tmp_reg = MIPS_R_AT;
 
-			if (BPF_OP(insn->code) == BPF_MOV) {
+			if (bpf_op == BPF_MOV) {
 				tmp_reg = dst;
 				did_move = true;
 			}
@@ -835,7 +836,7 @@ static int build_one_insn(const struct bpf_insn *insn, struct jit_ctx *ctx,
 			emit_instr(ctx, dinsu, tmp_reg, MIPS_R_ZERO, 32, 32);
 			src = MIPS_R_AT;
 		}
-		switch (BPF_OP(insn->code)) {
+		switch (bpf_op) {
 		case BPF_MOV:
 			if (!did_move)
 				emit_instr(ctx, daddu, dst, src, MIPS_R_ZERO);
@@ -867,7 +868,7 @@ static int build_one_insn(const struct bpf_insn *insn, struct jit_ctx *ctx,
 			emit_instr(ctx, beq, src, MIPS_R_ZERO, b_off);
 			emit_instr(ctx, movz, MIPS_R_V0, MIPS_R_ZERO, src);
 			emit_instr(ctx, ddivu, dst, src);
-			if (BPF_OP(insn->code) == BPF_DIV)
+			if (bpf_op == BPF_DIV)
 				emit_instr(ctx, mflo, dst);
 			else
 				emit_instr(ctx, mfhi, dst);
@@ -911,7 +912,7 @@ static int build_one_insn(const struct bpf_insn *insn, struct jit_ctx *ctx,
 		if (ts == REG_64BIT || ts == REG_32BIT_ZERO_EX) {
 			int tmp_reg = MIPS_R_AT;
 
-			if (BPF_OP(insn->code) == BPF_MOV) {
+			if (bpf_op == BPF_MOV) {
 				tmp_reg = dst;
 				did_move = true;
 			}
@@ -919,7 +920,7 @@ static int build_one_insn(const struct bpf_insn *insn, struct jit_ctx *ctx,
 			emit_instr(ctx, sll, tmp_reg, src, 0);
 			src = MIPS_R_AT;
 		}
-		switch (BPF_OP(insn->code)) {
+		switch (bpf_op) {
 		case BPF_MOV:
 			if (!did_move)
 				emit_instr(ctx, addu, dst, src, MIPS_R_ZERO);
@@ -950,7 +951,7 @@ static int build_one_insn(const struct bpf_insn *insn, struct jit_ctx *ctx,
 			emit_instr(ctx, beq, src, MIPS_R_ZERO, b_off);
 			emit_instr(ctx, movz, MIPS_R_V0, MIPS_R_ZERO, src);
 			emit_instr(ctx, divu, dst, src);
-			if (BPF_OP(insn->code) == BPF_DIV)
+			if (bpf_op == BPF_DIV)
 				emit_instr(ctx, mflo, dst);
 			else
 				emit_instr(ctx, mfhi, dst);
@@ -977,7 +978,7 @@ static int build_one_insn(const struct bpf_insn *insn, struct jit_ctx *ctx,
 		break;
 	case BPF_JMP | BPF_JEQ | BPF_K: /* JMP_IMM */
 	case BPF_JMP | BPF_JNE | BPF_K: /* JMP_IMM */
-		cmp_eq = (BPF_OP(insn->code) == BPF_JEQ);
+		cmp_eq = (bpf_op == BPF_JEQ);
 		dst = ebpf_to_mips_reg(ctx, insn, dst_reg_fp_ok);
 		if (dst < 0)
 			return dst;
@@ -1012,18 +1013,18 @@ static int build_one_insn(const struct bpf_insn *insn, struct jit_ctx *ctx,
 			emit_instr(ctx, sll, MIPS_R_AT, dst, 0);
 			dst = MIPS_R_AT;
 		}
-		if (BPF_OP(insn->code) == BPF_JSET) {
+		if (bpf_op == BPF_JSET) {
 			emit_instr(ctx, and, MIPS_R_AT, dst, src);
 			cmp_eq = false;
 			dst = MIPS_R_AT;
 			src = MIPS_R_ZERO;
-		} else if (BPF_OP(insn->code) == BPF_JSGT || BPF_OP(insn->code) == BPF_JSLE) {
+		} else if (bpf_op == BPF_JSGT || bpf_op == BPF_JSLE) {
 			emit_instr(ctx, dsubu, MIPS_R_AT, dst, src);
 			if ((insn + 1)->code == (BPF_JMP | BPF_EXIT) && insn->off == 1) {
 				b_off = b_imm(exit_idx, ctx);
 				if (is_bad_offset(b_off))
 					return -E2BIG;
-				if (BPF_OP(insn->code) == BPF_JSGT)
+				if (bpf_op == BPF_JSGT)
 					emit_instr(ctx, blez, MIPS_R_AT, b_off);
 				else
 					emit_instr(ctx, bgtz, MIPS_R_AT, b_off);
@@ -1033,18 +1034,18 @@ static int build_one_insn(const struct bpf_insn *insn, struct jit_ctx *ctx,
 			b_off = b_imm(this_idx + insn->off + 1, ctx);
 			if (is_bad_offset(b_off))
 				return -E2BIG;
-			if (BPF_OP(insn->code) == BPF_JSGT)
+			if (bpf_op == BPF_JSGT)
 				emit_instr(ctx, bgtz, MIPS_R_AT, b_off);
 			else
 				emit_instr(ctx, blez, MIPS_R_AT, b_off);
 			emit_instr(ctx, nop);
 			break;
-		} else if (BPF_OP(insn->code) == BPF_JSGE || BPF_OP(insn->code) == BPF_JSLT) {
+		} else if (bpf_op == BPF_JSGE || bpf_op == BPF_JSLT) {
 			emit_instr(ctx, slt, MIPS_R_AT, dst, src);
-			cmp_eq = BPF_OP(insn->code) == BPF_JSGE;
+			cmp_eq = bpf_op == BPF_JSGE;
 			dst = MIPS_R_AT;
 			src = MIPS_R_ZERO;
-		} else if (BPF_OP(insn->code) == BPF_JGT || BPF_OP(insn->code) == BPF_JLE) {
+		} else if (bpf_op == BPF_JGT || bpf_op == BPF_JLE) {
 			/* dst or src could be AT */
 			emit_instr(ctx, dsubu, MIPS_R_T8, dst, src);
 			emit_instr(ctx, sltu, MIPS_R_AT, dst, src);
@@ -1052,16 +1053,16 @@ static int build_one_insn(const struct bpf_insn *insn, struct jit_ctx *ctx,
 			emit_instr(ctx, movz, MIPS_R_T9, MIPS_R_SP, MIPS_R_T8);
 			emit_instr(ctx, movn, MIPS_R_T9, MIPS_R_ZERO, MIPS_R_T8);
 			emit_instr(ctx, or, MIPS_R_AT, MIPS_R_T9, MIPS_R_AT);
-			cmp_eq = BPF_OP(insn->code) == BPF_JGT;
+			cmp_eq = bpf_op == BPF_JGT;
 			dst = MIPS_R_AT;
 			src = MIPS_R_ZERO;
-		} else if (BPF_OP(insn->code) == BPF_JGE || BPF_OP(insn->code) == BPF_JLT) {
+		} else if (bpf_op == BPF_JGE || bpf_op == BPF_JLT) {
 			emit_instr(ctx, sltu, MIPS_R_AT, dst, src);
-			cmp_eq = BPF_OP(insn->code) == BPF_JGE;
+			cmp_eq = bpf_op == BPF_JGE;
 			dst = MIPS_R_AT;
 			src = MIPS_R_ZERO;
 		} else { /* JNE/JEQ case */
-			cmp_eq = (BPF_OP(insn->code) == BPF_JEQ);
+			cmp_eq = (bpf_op == BPF_JEQ);
 		}
 jeq_common:
 		/*
@@ -1122,7 +1123,7 @@ static int build_one_insn(const struct bpf_insn *insn, struct jit_ctx *ctx,
 	case BPF_JMP | BPF_JSGE | BPF_K: /* JMP_IMM */
 	case BPF_JMP | BPF_JSLT | BPF_K: /* JMP_IMM */
 	case BPF_JMP | BPF_JSLE | BPF_K: /* JMP_IMM */
-		cmp_eq = (BPF_OP(insn->code) == BPF_JSGE);
+		cmp_eq = (bpf_op == BPF_JSGE);
 		dst = ebpf_to_mips_reg(ctx, insn, dst_reg_fp_ok);
 		if (dst < 0)
 			return dst;
@@ -1132,7 +1133,7 @@ static int build_one_insn(const struct bpf_insn *insn, struct jit_ctx *ctx,
 				b_off = b_imm(exit_idx, ctx);
 				if (is_bad_offset(b_off))
 					return -E2BIG;
-				switch (BPF_OP(insn->code)) {
+				switch (bpf_op) {
 				case BPF_JSGT:
 					emit_instr(ctx, blez, dst, b_off);
 					break;
@@ -1152,7 +1153,7 @@ static int build_one_insn(const struct bpf_insn *insn, struct jit_ctx *ctx,
 			b_off = b_imm(this_idx + insn->off + 1, ctx);
 			if (is_bad_offset(b_off))
 				return -E2BIG;
-			switch (BPF_OP(insn->code)) {
+			switch (bpf_op) {
 			case BPF_JSGT:
 				emit_instr(ctx, bgtz, dst, b_off);
 				break;
@@ -1173,14 +1174,14 @@ static int build_one_insn(const struct bpf_insn *insn, struct jit_ctx *ctx,
 		 * only "LT" compare available, so we must use imm + 1
 		 * to generate "GT" and imm -1 to generate LE
 		 */
-		if (BPF_OP(insn->code) == BPF_JSGT)
+		if (bpf_op == BPF_JSGT)
 			t64s = insn->imm + 1;
-		else if (BPF_OP(insn->code) == BPF_JSLE)
+		else if (bpf_op == BPF_JSLE)
 			t64s = insn->imm + 1;
 		else
 			t64s = insn->imm;
 
-		cmp_eq = BPF_OP(insn->code) == BPF_JSGT || BPF_OP(insn->code) == BPF_JSGE;
+		cmp_eq = bpf_op == BPF_JSGT || bpf_op == BPF_JSGE;
 		if (t64s >= S16_MIN && t64s <= S16_MAX) {
 			emit_instr(ctx, slti, MIPS_R_AT, dst, (int)t64s);
 			src = MIPS_R_AT;
@@ -1197,7 +1198,7 @@ static int build_one_insn(const struct bpf_insn *insn, struct jit_ctx *ctx,
 	case BPF_JMP | BPF_JGE | BPF_K:
 	case BPF_JMP | BPF_JLT | BPF_K:
 	case BPF_JMP | BPF_JLE | BPF_K:
-		cmp_eq = (BPF_OP(insn->code) == BPF_JGE);
+		cmp_eq = (bpf_op == BPF_JGE);
 		dst = ebpf_to_mips_reg(ctx, insn, dst_reg_fp_ok);
 		if (dst < 0)
 			return dst;
@@ -1205,14 +1206,14 @@ static int build_one_insn(const struct bpf_insn *insn, struct jit_ctx *ctx,
 		 * only "LT" compare available, so we must use imm + 1
 		 * to generate "GT" and imm -1 to generate LE
 		 */
-		if (BPF_OP(insn->code) == BPF_JGT)
+		if (bpf_op == BPF_JGT)
 			t64s = (u64)(u32)(insn->imm) + 1;
-		else if (BPF_OP(insn->code) == BPF_JLE)
+		else if (bpf_op == BPF_JLE)
 			t64s = (u64)(u32)(insn->imm) + 1;
 		else
 			t64s = (u64)(u32)(insn->imm);
 
-		cmp_eq = BPF_OP(insn->code) == BPF_JGT || BPF_OP(insn->code) == BPF_JGE;
+		cmp_eq = bpf_op == BPF_JGT || bpf_op == BPF_JGE;
 
 		emit_const_to_reg(ctx, MIPS_R_AT, (u64)t64s);
 		emit_instr(ctx, sltu, MIPS_R_AT, dst, MIPS_R_AT);
-- 
2.9.5

^ permalink raw reply related

* [PATCH net-next 1/2] bpf: improve htab inlining for future 32 bit jits
From: Daniel Borkmann @ 2017-08-18 23:51 UTC (permalink / raw)
  To: davem; +Cc: ast, netdev, Daniel Borkmann
In-Reply-To: <cover.1503100047.git.daniel@iogearbox.net>

Lets future proof htab lookup inlining, commit 9015d2f59535 ("bpf:
inline htab_map_lookup_elem()") was making the assumption that a
direct call emission to __htab_map_lookup_elem() will always work
out for JITs. This is currently true since all JITs we have are
for 64 bit archs, but in case of 32 bit JITs like upcoming arm32,
we get a NULL pointer dereference when executing the call to
__htab_map_lookup_elem() since passed arguments are of a different
size (unsigned long vs. u64 for pointers) than what we do out of
BPF. Thus, lets do a proper BPF_CALL_2() declaration such that we
don't need to make any such assumptions.

Reported-by: Shubham Bansal <illusionist.neo@gmail.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
---
 kernel/bpf/hashtab.c | 10 ++++++++--
 1 file changed, 8 insertions(+), 2 deletions(-)

diff --git a/kernel/bpf/hashtab.c b/kernel/bpf/hashtab.c
index 4fb4631..cabf37b 100644
--- a/kernel/bpf/hashtab.c
+++ b/kernel/bpf/hashtab.c
@@ -437,7 +437,8 @@ static struct htab_elem *lookup_nulls_elem_raw(struct hlist_nulls_head *head,
  * The return value is adjusted by BPF instructions
  * in htab_map_gen_lookup().
  */
-static void *__htab_map_lookup_elem(struct bpf_map *map, void *key)
+static __always_inline void *__htab_map_lookup_elem(struct bpf_map *map,
+						    void *key)
 {
 	struct bpf_htab *htab = container_of(map, struct bpf_htab, map);
 	struct hlist_nulls_head *head;
@@ -479,12 +480,17 @@ static void *htab_map_lookup_elem(struct bpf_map *map, void *key)
  * bpf_prog
  *   __htab_map_lookup_elem
  */
+BPF_CALL_2(bpf_htab_lookup_helper, struct bpf_map *, map, void *, key)
+{
+	return (unsigned long) __htab_map_lookup_elem(map, key);
+}
+
 static u32 htab_map_gen_lookup(struct bpf_map *map, struct bpf_insn *insn_buf)
 {
 	struct bpf_insn *insn = insn_buf;
 	const int ret = BPF_REG_0;
 
-	*insn++ = BPF_EMIT_CALL((u64 (*)(u64, u64, u64, u64, u64))__htab_map_lookup_elem);
+	*insn++ = BPF_EMIT_CALL(bpf_htab_lookup_helper);
 	*insn++ = BPF_JMP_IMM(BPF_JEQ, ret, 0, 1);
 	*insn++ = BPF_ALU64_IMM(BPF_ADD, ret,
 				offsetof(struct htab_elem, key) +
-- 
1.9.3

^ permalink raw reply related

* [PATCH net-next 0/2] BPF inline improvements
From: Daniel Borkmann @ 2017-08-18 23:51 UTC (permalink / raw)
  To: davem; +Cc: ast, netdev, Daniel Borkmann

First one makes htab inlining more robust wrt future jits and
second one inlines map in map lookups through map_gen_lookup()
callback.

Thanks!

Daniel Borkmann (2):
  bpf: improve htab inlining for future 32 bit jits
  bpf: inline map in map lookup functions for array and htab

 kernel/bpf/arraymap.c | 26 ++++++++++++++++++++++++++
 kernel/bpf/hashtab.c  | 27 +++++++++++++++++++++++++--
 2 files changed, 51 insertions(+), 2 deletions(-)

-- 
1.9.3

^ permalink raw reply

* [PATCH net-next 2/2] bpf: inline map in map lookup functions for array and htab
From: Daniel Borkmann @ 2017-08-18 23:51 UTC (permalink / raw)
  To: davem; +Cc: ast, netdev, Daniel Borkmann
In-Reply-To: <cover.1503100047.git.daniel@iogearbox.net>

Avoid two successive functions calls for the map in map lookup, first
is the bpf_map_lookup_elem() helper call, and second the callback via
map->ops->map_lookup_elem() to get to the map in map implementation.
Implementation inlines array and htab flavor for map in map lookups.

Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
---
 kernel/bpf/arraymap.c | 26 ++++++++++++++++++++++++++
 kernel/bpf/hashtab.c  | 17 +++++++++++++++++
 2 files changed, 43 insertions(+)

diff --git a/kernel/bpf/arraymap.c b/kernel/bpf/arraymap.c
index d771a38..b25d6ce 100644
--- a/kernel/bpf/arraymap.c
+++ b/kernel/bpf/arraymap.c
@@ -603,6 +603,31 @@ static void *array_of_map_lookup_elem(struct bpf_map *map, void *key)
 	return READ_ONCE(*inner_map);
 }
 
+static u32 array_of_map_gen_lookup(struct bpf_map *map,
+				   struct bpf_insn *insn_buf)
+{
+	u32 elem_size = round_up(map->value_size, 8);
+	struct bpf_insn *insn = insn_buf;
+	const int ret = BPF_REG_0;
+	const int map_ptr = BPF_REG_1;
+	const int index = BPF_REG_2;
+
+	*insn++ = BPF_ALU64_IMM(BPF_ADD, map_ptr, offsetof(struct bpf_array, value));
+	*insn++ = BPF_LDX_MEM(BPF_W, ret, index, 0);
+	*insn++ = BPF_JMP_IMM(BPF_JGE, ret, map->max_entries, 5);
+	if (is_power_of_2(elem_size))
+		*insn++ = BPF_ALU64_IMM(BPF_LSH, ret, ilog2(elem_size));
+	else
+		*insn++ = BPF_ALU64_IMM(BPF_MUL, ret, elem_size);
+	*insn++ = BPF_ALU64_REG(BPF_ADD, ret, map_ptr);
+	*insn++ = BPF_LDX_MEM(BPF_DW, ret, ret, 0);
+	*insn++ = BPF_JMP_IMM(BPF_JEQ, ret, 0, 1);
+	*insn++ = BPF_JMP_IMM(BPF_JA, 0, 0, 1);
+	*insn++ = BPF_MOV64_IMM(ret, 0);
+
+	return insn - insn_buf;
+}
+
 const struct bpf_map_ops array_of_maps_map_ops = {
 	.map_alloc = array_of_map_alloc,
 	.map_free = array_of_map_free,
@@ -612,4 +637,5 @@ static void *array_of_map_lookup_elem(struct bpf_map *map, void *key)
 	.map_fd_get_ptr = bpf_map_fd_get_ptr,
 	.map_fd_put_ptr = bpf_map_fd_put_ptr,
 	.map_fd_sys_lookup_elem = bpf_map_fd_sys_lookup_elem,
+	.map_gen_lookup = array_of_map_gen_lookup,
 };
diff --git a/kernel/bpf/hashtab.c b/kernel/bpf/hashtab.c
index cabf37b..750261d 100644
--- a/kernel/bpf/hashtab.c
+++ b/kernel/bpf/hashtab.c
@@ -1317,6 +1317,22 @@ static void *htab_of_map_lookup_elem(struct bpf_map *map, void *key)
 	return READ_ONCE(*inner_map);
 }
 
+static u32 htab_of_map_gen_lookup(struct bpf_map *map,
+				  struct bpf_insn *insn_buf)
+{
+	struct bpf_insn *insn = insn_buf;
+	const int ret = BPF_REG_0;
+
+	*insn++ = BPF_EMIT_CALL(bpf_htab_lookup_helper);
+	*insn++ = BPF_JMP_IMM(BPF_JEQ, ret, 0, 2);
+	*insn++ = BPF_ALU64_IMM(BPF_ADD, ret,
+				offsetof(struct htab_elem, key) +
+				round_up(map->key_size, 8));
+	*insn++ = BPF_LDX_MEM(BPF_DW, ret, ret, 0);
+
+	return insn - insn_buf;
+}
+
 static void htab_of_map_free(struct bpf_map *map)
 {
 	bpf_map_meta_free(map->inner_map_meta);
@@ -1332,4 +1348,5 @@ static void htab_of_map_free(struct bpf_map *map)
 	.map_fd_get_ptr = bpf_map_fd_get_ptr,
 	.map_fd_put_ptr = bpf_map_fd_put_ptr,
 	.map_fd_sys_lookup_elem = bpf_map_fd_sys_lookup_elem,
+	.map_gen_lookup = htab_of_map_gen_lookup,
 };
-- 
1.9.3

^ permalink raw reply related

* Re: [PATCH net-next 1/2] bpf: improve htab inlining for future 32 bit jits
From: Alexei Starovoitov @ 2017-08-19  0:00 UTC (permalink / raw)
  To: Daniel Borkmann, davem; +Cc: netdev
In-Reply-To: <a2a1360af1df20e5a11269441656b5daffabdd77.1503100047.git.daniel@iogearbox.net>

On 8/18/17 4:51 PM, Daniel Borkmann wrote:
> Lets future proof htab lookup inlining, commit 9015d2f59535 ("bpf:
> inline htab_map_lookup_elem()") was making the assumption that a
> direct call emission to __htab_map_lookup_elem() will always work
> out for JITs. This is currently true since all JITs we have are
> for 64 bit archs, but in case of 32 bit JITs like upcoming arm32,
> we get a NULL pointer dereference when executing the call to
> __htab_map_lookup_elem() since passed arguments are of a different
> size (unsigned long vs. u64 for pointers) than what we do out of
> BPF. Thus, lets do a proper BPF_CALL_2() declaration such that we
> don't need to make any such assumptions.
>
> Reported-by: Shubham Bansal <illusionist.neo@gmail.com>
> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>

assuming on 64-bit archs the should be no perf difference
and only increase in .text, since __htab_map_lookup_elem
is now force inlined into a bunch of places?
I guess that's ok, but kinda sux for 64-bit archs to pay
such penalty because of 32-bit archs.
May be drop always_inline and do such thing conditionally
on 32-bit archs only?
what's the increase in .text?
any difference seen in map_perf_test?

^ permalink raw reply

* Re: [PATCH net-next 2/2] bpf: inline map in map lookup functions for array and htab
From: Alexei Starovoitov @ 2017-08-19  0:05 UTC (permalink / raw)
  To: Daniel Borkmann, davem; +Cc: netdev
In-Reply-To: <501683f73390c81a7abaa0683f3675860a79adc7.1503100047.git.daniel@iogearbox.net>

On 8/18/17 4:51 PM, Daniel Borkmann wrote:
> Avoid two successive functions calls for the map in map lookup, first
> is the bpf_map_lookup_elem() helper call, and second the callback via
> map->ops->map_lookup_elem() to get to the map in map implementation.
> Implementation inlines array and htab flavor for map in map lookups.
>
> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>

lgtm
thanks for adding this optimization.
Acked-by: Alexei Starovoitov <ast@kernel.org>

^ permalink raw reply

* Re: [PATCH v2 net-next] net: ipv6: put host and anycast routes on device with address
From: David Ahern @ 2017-08-19  0:05 UTC (permalink / raw)
  To: Hannes Frederic Sowa; +Cc: netdev, yoshfuji
In-Reply-To: <87wp60lb70.fsf@stressinduktion.org>

On 8/18/17 5:15 PM, Hannes Frederic Sowa wrote:
> Hello David,
> 
> David Ahern <dsahern@gmail.com> writes:
> 
>> @@ -2688,15 +2716,9 @@ struct rt6_info *addrconf_dst_alloc(struct inet6_dev *idev,
>>  {
>>  	u32 tb_id;
>>  	struct net *net = dev_net(idev->dev);
>> -	struct net_device *dev = net->loopback_dev;
>> +	struct net_device *dev = idev->dev;
>>  	struct rt6_info *rt;
>>  
>> -	/* use L3 Master device as loopback for host routes if device
>> -	 * is enslaved and address is not link local or multicast
>> -	 */
>> -	if (!rt6_need_strict(addr))
>> -		dev = l3mdev_master_dev_rcu(idev->dev) ? : dev;
>> -
>>  	rt = ip6_dst_alloc(net, dev, DST_NOCOUNT);
>>  	if (!rt)
>>  		return ERR_PTR(-ENOMEM);
> 
> I am afraid this change might break Java:
> 
> <http://hg.openjdk.java.net/jdk9/jdk9/jdk/file/65464a307408/src/java.base/unix/native/libnet/net_util_md.c#l574>
> 
> I am all in for this change, but maybe it might be necessary to mask
> RTF_LOCAL routes with "lo" somehow.

That's asinine. The if_inet6 processing is just getting the 'lo'
interface index. Why scan the file looking for that? The ipv6_route
processing is assembling routes against the loopback device regardless
of what the route is. Do you know why - what the route list is used for?

If it matters, we could keep 'lo' as the device for RTF_LOCAL routes in
the proc files to keep backwards compatibility.

^ permalink raw reply

* [PATCH RFC net] fsl/man: Inherit parent device and of_node
From: Florian Fainelli @ 2017-08-19  0:12 UTC (permalink / raw)
  To: netdev
  Cc: davem, andrew, vivien.didelot, madalin.bucur, junote,
	igal.liberman, Florian Fainelli

Junote Cai reported that he was not able to get a DSA setup involving the
Freescale DPAA/FMAN driver to work and narrowed it down to
of_find_net_device_by_node(). This function requires the network device's
device reference to be correctly set which is the case here, though we have
lost any device_node association there.

The problem is that dpaa_eth_add_device() allocates a "dpaa-ethernet" platform
device, and later on dpaa_eth_probe() is called but SET_NETDEV_DEV() won't be
propagating &pdev->dev.of_node properly. Fix this by inherenting both the parent
device and the of_node when dpaa_eth_add_device() creates the platform device.

Fixes: 3933961682a3 ("fsl/fman: Add FMan MAC driver")
Signed-off-by: Florian Fainelli <f.fainelli@gmail.com>
---
Madalin,

After this patch, Junote was able to get the DSA switch probed correctly however
he was also observing a resource conflict, but I don't think this patch is responsible
for that per-se, please review:

iommu: Adding device ffe488000.port to group 21
fsl_mac ffe4e0000.ethernet: FMan dTSEC version: 0x08240101
fsl_mac ffe4e0000.ethernet: FMan MAC address: 00:04:9f:ef:05:01
fsl_mac dpaa-ethernet.0: __devm_request_mem_region(mac) failed
fsl_mac: probe of dpaa-ethernet.0 failed with error -16
fsl_mac dpaa-ethernet.0: __devm_request_mem_region(mac) failed
fsl_mac: probe of dpaa-ethernet.0 failed with error -16
fsl_dpa dpaa-ethernet.0 eth0: Probed interface eth0

 drivers/net/ethernet/freescale/fman/mac.c | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/drivers/net/ethernet/freescale/fman/mac.c b/drivers/net/ethernet/freescale/fman/mac.c
index 6e67d22fd0d5..1c7da16ad0ff 100644
--- a/drivers/net/ethernet/freescale/fman/mac.c
+++ b/drivers/net/ethernet/freescale/fman/mac.c
@@ -623,6 +623,8 @@ static struct platform_device *dpaa_eth_add_device(int fman_id,
 		goto no_mem;
 	}
 
+	pdev->dev.of_node = node;
+	pdev->dev.parent = priv->dev;
 	set_dma_ops(&pdev->dev, get_dma_ops(priv->dev));
 
 	ret = platform_device_add_data(pdev, &data, sizeof(data));
-- 
2.11.0

^ permalink raw reply related

* Re: [PATCH v2 net-next] net: ipv6: put host and anycast routes on device with address
From: David Ahern @ 2017-08-19  0:14 UTC (permalink / raw)
  To: David Ahern, Hannes Frederic Sowa; +Cc: netdev, yoshfuji
In-Reply-To: <75e79efe-4a41-7b15-b7bb-8ed0624b72b5@gmail.com>

On 8/18/17 6:05 PM, David Ahern wrote:
> On 8/18/17 5:15 PM, Hannes Frederic Sowa wrote:
>> Hello David,
>>
>> David Ahern <dsahern@gmail.com> writes:
>>
>>> @@ -2688,15 +2716,9 @@ struct rt6_info *addrconf_dst_alloc(struct inet6_dev *idev,
>>>  {
>>>  	u32 tb_id;
>>>  	struct net *net = dev_net(idev->dev);
>>> -	struct net_device *dev = net->loopback_dev;
>>> +	struct net_device *dev = idev->dev;
>>>  	struct rt6_info *rt;
>>>  
>>> -	/* use L3 Master device as loopback for host routes if device
>>> -	 * is enslaved and address is not link local or multicast
>>> -	 */
>>> -	if (!rt6_need_strict(addr))
>>> -		dev = l3mdev_master_dev_rcu(idev->dev) ? : dev;
>>> -
>>>  	rt = ip6_dst_alloc(net, dev, DST_NOCOUNT);
>>>  	if (!rt)
>>>  		return ERR_PTR(-ENOMEM);
>>
>> I am afraid this change might break Java:
>>
>> <http://hg.openjdk.java.net/jdk9/jdk9/jdk/file/65464a307408/src/java.base/unix/native/libnet/net_util_md.c#l574>
>>
>> I am all in for this change, but maybe it might be necessary to mask
>> RTF_LOCAL routes with "lo" somehow.
> 
> That's asinine. The if_inet6 processing is just getting the 'lo'
> interface index. Why scan the file looking for that? The ipv6_route
> processing is assembling routes against the loopback device regardless
> of what the route is. Do you know why - what the route list is used for?


If I read it correctly, seems to be a 2.4 workaround:
- only user of the route list is needsLoopbackRoute()
- only caller of needsLoopbackRoute is here:

http://hg.openjdk.java.net/jdk9/jdk9/jdk/file/65464a307408/src/java.base/unix/native/libnet/net_util_md.c#l828

^ permalink raw reply

* [PATCH net] ipv6: repair fib6 tree in failure case
From: Wei Wang @ 2017-08-19  0:14 UTC (permalink / raw)
  To: David Miller, netdev; +Cc: Eric Dumazet, Wei Wang

From: Wei Wang <weiwan@google.com>

In fib6_add(), it is possible that fib6_add_1() picks an intermediate
node and sets the node's fn->leaf to NULL in order to add this new
route. However, if fib6_add_rt2node() fails to add the new
route for some reason, fn->leaf will be left as NULL and could
potentially cause crash when fn->leaf is accessed in fib6_locate().
This patch makes sure fib6_repair_tree() is called to properly repair
fn->leaf in the above failure case.

Here is the syzkaller reported general protection fault in fib6_locate:
kasan: CONFIG_KASAN_INLINE enabled
kasan: GPF could be caused by NULL-ptr deref or user memory access
general protection fault: 0000 [#1] SMP KASAN
Modules linked in:
CPU: 0 PID: 40937 Comm: syz-executor3 Not tainted
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
task: ffff8801d7d64100 ti: ffff8801d01a0000 task.ti: ffff8801d01a0000
RIP: 0010:[<ffffffff82a3e0e1>]  [<ffffffff82a3e0e1>] __ipv6_prefix_equal64_half include/net/ipv6.h:475 [inline]
RIP: 0010:[<ffffffff82a3e0e1>]  [<ffffffff82a3e0e1>] ipv6_prefix_equal include/net/ipv6.h:492 [inline]
RIP: 0010:[<ffffffff82a3e0e1>]  [<ffffffff82a3e0e1>] fib6_locate_1 net/ipv6/ip6_fib.c:1210 [inline]
RIP: 0010:[<ffffffff82a3e0e1>]  [<ffffffff82a3e0e1>] fib6_locate+0x281/0x3c0 net/ipv6/ip6_fib.c:1233
RSP: 0018:ffff8801d01a36a8  EFLAGS: 00010202
RAX: 0000000000000020 RBX: ffff8801bc790e00 RCX: ffffc90002983000
RDX: 0000000000001219 RSI: ffff8801d01a37a0 RDI: 0000000000000100
RBP: ffff8801d01a36f0 R08: 00000000000000ff R09: 0000000000000000
R10: 0000000000000003 R11: 0000000000000000 R12: 0000000000000001
R13: dffffc0000000000 R14: ffff8801d01a37a0 R15: 0000000000000000
FS:  00007f6afd68c700(0000) GS:ffff8801db400000(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 00000000004c6340 CR3: 00000000ba41f000 CR4: 00000000001426f0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
Stack:
 ffff8801d01a37a8 ffff8801d01a3780 ffffed003a0346f5 0000000c82a23ea0
 ffff8800b7bd7700 ffff8801d01a3780 ffff8800b6a1c940 ffffffff82a23ea0
 ffff8801d01a3920 ffff8801d01a3748 ffffffff82a223d6 ffff8801d7d64988
Call Trace:
 [<ffffffff82a223d6>] ip6_route_del+0x106/0x570 net/ipv6/route.c:2109
 [<ffffffff82a23f9d>] inet6_rtm_delroute+0xfd/0x100 net/ipv6/route.c:3075
 [<ffffffff82621359>] rtnetlink_rcv_msg+0x549/0x7a0 net/core/rtnetlink.c:3450
 [<ffffffff8274c1d1>] netlink_rcv_skb+0x141/0x370 net/netlink/af_netlink.c:2281
 [<ffffffff82613ddf>] rtnetlink_rcv+0x2f/0x40 net/core/rtnetlink.c:3456
 [<ffffffff8274ad38>] netlink_unicast_kernel net/netlink/af_netlink.c:1206 [inline]
 [<ffffffff8274ad38>] netlink_unicast+0x518/0x750 net/netlink/af_netlink.c:1232
 [<ffffffff8274b83e>] netlink_sendmsg+0x8ce/0xc30 net/netlink/af_netlink.c:1778
 [<ffffffff82564aff>] sock_sendmsg_nosec net/socket.c:609 [inline]
 [<ffffffff82564aff>] sock_sendmsg+0xcf/0x110 net/socket.c:619
 [<ffffffff82564d62>] sock_write_iter+0x222/0x3a0 net/socket.c:834
 [<ffffffff8178523d>] new_sync_write+0x1dd/0x2b0 fs/read_write.c:478
 [<ffffffff817853f4>] __vfs_write+0xe4/0x110 fs/read_write.c:491
 [<ffffffff81786c38>] vfs_write+0x178/0x4b0 fs/read_write.c:538
 [<ffffffff817892a9>] SYSC_write fs/read_write.c:585 [inline]
 [<ffffffff817892a9>] SyS_write+0xd9/0x1b0 fs/read_write.c:577
 [<ffffffff82c71e32>] entry_SYSCALL_64_fastpath+0x12/0x17

Note: there is no "Fixes" tag as this seems to be a bug introduced
very early.

Signed-off-by: Wei Wang <weiwan@google.com>
Acked-by: Eric Dumazet <edumazet@google.com>
---
 net/ipv6/ip6_fib.c | 24 +++++++++++-------------
 1 file changed, 11 insertions(+), 13 deletions(-)

diff --git a/net/ipv6/ip6_fib.c b/net/ipv6/ip6_fib.c
index 78b9359b06fd..549aacc3cb2c 100644
--- a/net/ipv6/ip6_fib.c
+++ b/net/ipv6/ip6_fib.c
@@ -1118,7 +1118,7 @@ int fib6_add(struct fib6_node *root, struct rt6_info *rt,
 			/* Create subtree root node */
 			sfn = node_alloc();
 			if (!sfn)
-				goto st_failure;
+				goto failure;
 
 			sfn->leaf = info->nl_net->ipv6.ip6_null_entry;
 			atomic_inc(&info->nl_net->ipv6.ip6_null_entry->rt6i_ref);
@@ -1135,12 +1135,12 @@ int fib6_add(struct fib6_node *root, struct rt6_info *rt,
 
 			if (IS_ERR(sn)) {
 				/* If it is failed, discard just allocated
-				   root, and then (in st_failure) stale node
+				   root, and then (in failure) stale node
 				   in main tree.
 				 */
 				node_free(sfn);
 				err = PTR_ERR(sn);
-				goto st_failure;
+				goto failure;
 			}
 
 			/* Now link new subtree to main tree */
@@ -1155,7 +1155,7 @@ int fib6_add(struct fib6_node *root, struct rt6_info *rt,
 
 			if (IS_ERR(sn)) {
 				err = PTR_ERR(sn);
-				goto st_failure;
+				goto failure;
 			}
 		}
 
@@ -1196,18 +1196,17 @@ int fib6_add(struct fib6_node *root, struct rt6_info *rt,
 			atomic_inc(&pn->leaf->rt6i_ref);
 		}
 #endif
-		/* Always release dst as dst->__refcnt is guaranteed
-		 * to be taken before entering this function
-		 */
-		dst_release_immediate(&rt->dst);
+		goto failure;
 	}
 	return err;
 
-#ifdef CONFIG_IPV6_SUBTREES
-	/* Subtree creation failed, probably main tree node
-	   is orphan. If it is, shoot it.
+failure:
+	/* fn->leaf could be NULL if fn is an intermediate node and we
+	 * failed to add the new route to it in both subtree creation
+	 * failure and fib6_add_rt2node() failure case.
+	 * In both cases, fib6_repair_tree() should be called to fix
+	 * fn->leaf.
 	 */
-st_failure:
 	if (fn && !(fn->fn_flags & (RTN_RTINFO|RTN_ROOT)))
 		fib6_repair_tree(info->nl_net, fn);
 	/* Always release dst as dst->__refcnt is guaranteed
@@ -1215,7 +1214,6 @@ int fib6_add(struct fib6_node *root, struct rt6_info *rt,
 	 */
 	dst_release_immediate(&rt->dst);
 	return err;
-#endif
 }
 
 /*
-- 
2.14.1.480.gb18f417b89-goog

^ permalink raw reply related

* Re: [PATCH net-next 1/2] bpf: improve htab inlining for future 32 bit jits
From: Daniel Borkmann @ 2017-08-19  0:21 UTC (permalink / raw)
  To: Alexei Starovoitov, davem; +Cc: netdev
In-Reply-To: <43a2b95f-db1d-8c33-a7d6-42f8299ddcb3@fb.com>

On 08/19/2017 02:00 AM, Alexei Starovoitov wrote:
> On 8/18/17 4:51 PM, Daniel Borkmann wrote:
>> Lets future proof htab lookup inlining, commit 9015d2f59535 ("bpf:
>> inline htab_map_lookup_elem()") was making the assumption that a
>> direct call emission to __htab_map_lookup_elem() will always work
>> out for JITs. This is currently true since all JITs we have are
>> for 64 bit archs, but in case of 32 bit JITs like upcoming arm32,
>> we get a NULL pointer dereference when executing the call to
>> __htab_map_lookup_elem() since passed arguments are of a different
>> size (unsigned long vs. u64 for pointers) than what we do out of
>> BPF. Thus, lets do a proper BPF_CALL_2() declaration such that we
>> don't need to make any such assumptions.
>>
>> Reported-by: Shubham Bansal <illusionist.neo@gmail.com>
>> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
>
> assuming on 64-bit archs the should be no perf difference
> and only increase in .text, since __htab_map_lookup_elem
> is now force inlined into a bunch of places?
> I guess that's ok, but kinda sux for 64-bit archs to pay
> such penalty because of 32-bit archs.

Yeah true, text bumps from 11k to 13k, doesn't pay off.

> May be drop always_inline and do such thing conditionally
> on 32-bit archs only?

I will guard with this instead:

diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c
index 4f6e7eb..e42c096 100644
--- a/kernel/bpf/verifier.c
+++ b/kernel/bpf/verifier.c
@@ -4160,7 +4160,11 @@ static int fixup_bpf_calls(struct bpf_verifier_env *env)
                         continue;
                 }

-               if (ebpf_jit_enabled() && insn->imm == BPF_FUNC_map_lookup_elem) {
+               /* BPF_EMIT_CALL() assumptions in some of the map_gen_lookup
+                * handlers are currently limited to 64 bit only.
+                */
+               if (ebpf_jit_enabled() && BITS_PER_LONG == 64 &&
+                   insn->imm == BPF_FUNC_map_lookup_elem) {
                         map_ptr = env->insn_aux_data[i + delta].map_ptr;
                         if (map_ptr == BPF_MAP_PTR_POISON ||
                             !map_ptr->ops->map_gen_lookup)

^ permalink raw reply related

* Re: [PATCH net-next 1/2] bpf: improve htab inlining for future 32 bit jits
From: Alexei Starovoitov @ 2017-08-19  0:24 UTC (permalink / raw)
  To: Daniel Borkmann, davem; +Cc: netdev
In-Reply-To: <59978488.9010802@iogearbox.net>

On 8/18/17 5:21 PM, Daniel Borkmann wrote:
> On 08/19/2017 02:00 AM, Alexei Starovoitov wrote:
>> On 8/18/17 4:51 PM, Daniel Borkmann wrote:
>>> Lets future proof htab lookup inlining, commit 9015d2f59535 ("bpf:
>>> inline htab_map_lookup_elem()") was making the assumption that a
>>> direct call emission to __htab_map_lookup_elem() will always work
>>> out for JITs. This is currently true since all JITs we have are
>>> for 64 bit archs, but in case of 32 bit JITs like upcoming arm32,
>>> we get a NULL pointer dereference when executing the call to
>>> __htab_map_lookup_elem() since passed arguments are of a different
>>> size (unsigned long vs. u64 for pointers) than what we do out of
>>> BPF. Thus, lets do a proper BPF_CALL_2() declaration such that we
>>> don't need to make any such assumptions.
>>>
>>> Reported-by: Shubham Bansal <illusionist.neo@gmail.com>
>>> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
>>
>> assuming on 64-bit archs the should be no perf difference
>> and only increase in .text, since __htab_map_lookup_elem
>> is now force inlined into a bunch of places?
>> I guess that's ok, but kinda sux for 64-bit archs to pay
>> such penalty because of 32-bit archs.
>
> Yeah true, text bumps from 11k to 13k, doesn't pay off.
>
>> May be drop always_inline and do such thing conditionally
>> on 32-bit archs only?
>
> I will guard with this instead:
>
> diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c
> index 4f6e7eb..e42c096 100644
> --- a/kernel/bpf/verifier.c
> +++ b/kernel/bpf/verifier.c
> @@ -4160,7 +4160,11 @@ static int fixup_bpf_calls(struct
> bpf_verifier_env *env)
>                         continue;
>                 }
>
> -               if (ebpf_jit_enabled() && insn->imm ==
> BPF_FUNC_map_lookup_elem) {
> +               /* BPF_EMIT_CALL() assumptions in some of the
> map_gen_lookup
> +                * handlers are currently limited to 64 bit only.
> +                */
> +               if (ebpf_jit_enabled() && BITS_PER_LONG == 64 &&
> +                   insn->imm == BPF_FUNC_map_lookup_elem) {
>                         map_ptr = env->insn_aux_data[i + delta].map_ptr;
>                         if (map_ptr == BPF_MAP_PTR_POISON ||
>                             !map_ptr->ops->map_gen_lookup)

sure. looks good to me for now. We probably need to first measure
the perf gains out of inlining on 32-bit arm to go next step.
Thanks!

^ permalink raw reply

* Re: [PATCH v2 net-next] net: ipv6: put host and anycast routes on device with address
From: Hannes Frederic Sowa @ 2017-08-19  0:28 UTC (permalink / raw)
  To: David Ahern; +Cc: netdev, yoshfuji
In-Reply-To: <0bcba874-7fc3-508c-bf78-ae5832312845@gmail.com>

David Ahern <dsahern@gmail.com> writes:

> On 8/18/17 6:05 PM, David Ahern wrote:
>> On 8/18/17 5:15 PM, Hannes Frederic Sowa wrote:
>>> Hello David,
>>>
>>> David Ahern <dsahern@gmail.com> writes:
>>>
>>>> @@ -2688,15 +2716,9 @@ struct rt6_info *addrconf_dst_alloc(struct inet6_dev *idev,
>>>>  {
>>>>  	u32 tb_id;
>>>>  	struct net *net = dev_net(idev->dev);
>>>> -	struct net_device *dev = net->loopback_dev;
>>>> +	struct net_device *dev = idev->dev;
>>>>  	struct rt6_info *rt;
>>>>  
>>>> -	/* use L3 Master device as loopback for host routes if device
>>>> -	 * is enslaved and address is not link local or multicast
>>>> -	 */
>>>> -	if (!rt6_need_strict(addr))
>>>> -		dev = l3mdev_master_dev_rcu(idev->dev) ? : dev;
>>>> -
>>>>  	rt = ip6_dst_alloc(net, dev, DST_NOCOUNT);
>>>>  	if (!rt)
>>>>  		return ERR_PTR(-ENOMEM);
>>>
>>> I am afraid this change might break Java:
>>>
>>> <http://hg.openjdk.java.net/jdk9/jdk9/jdk/file/65464a307408/src/java.base/unix/native/libnet/net_util_md.c#l574>
>>>
>>> I am all in for this change, but maybe it might be necessary to mask
>>> RTF_LOCAL routes with "lo" somehow.
>> 
>> That's asinine. The if_inet6 processing is just getting the 'lo'
>> interface index. Why scan the file looking for that? The ipv6_route
>> processing is assembling routes against the loopback device regardless
>> of what the route is. Do you know why - what the route list is used for?
>
>
> If I read it correctly, seems to be a 2.4 workaround:
> - only user of the route list is needsLoopbackRoute()
> - only caller of needsLoopbackRoute is here:
>
> http://hg.openjdk.java.net/jdk9/jdk9/jdk/file/65464a307408/src/java.base/unix/native/libnet/net_util_md.c#l828

I agree that it looks like dead code now. But I know for sure that this
code has been excercised at least at some point in time and caused
problems for JVMs on Linux with IPv6.

On the top of this file I found this comment:

-- >8 --
/* following code creates a list of addresses from the kernel
 * routing table that are routed via the loopback address.
 * We check all destination addresses against this table
 * and override the scope_id field to use the relevant value for "lo"
 * in order to work-around the Linux bug that prevents packets destined
 * for certain local addresses from being sent via a physical interface.
 */
-- 8< --

I don't know if it makes sense to dive down into java history (and I
also found e.g. net-snmp scanning /proc/net/ipv6_route). The same
problem might be visible via RTM_GETROUTE dumps if applications
implement their own source address selection maybe. :/

Bye, Hannes

^ permalink raw reply

page: next (older) | prev (newer) | latest
- recent:[subjects (threaded)|topics (new)|topics (active)]

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox