* Re: [PATCH net 0/2] nfp: fix ethtool stats and page allocation
From: David Miller @ 2017-10-10 20:18 UTC (permalink / raw)
To: jakub.kicinski; +Cc: netdev, oss-drivers
In-Reply-To: <20171010161623.23838-1-jakub.kicinski@netronome.com>
From: Jakub Kicinski <jakub.kicinski@netronome.com>
Date: Tue, 10 Oct 2017 09:16:21 -0700
> Two fixes for net. First one makes sure we handle gather of stats on
> 32bit machines correctly (ouch). The second fix solves a potential
> NULL-deref if we fail to allocate a page with XDP running.
>
> I used Fixes: tags pointing to where the bug was introduced, but for
> patch 1 it has been in the driver "for ever" and fix won't backport
> cleanly beyond commit 325945ede6d4 ("nfp: split software and hardware
> vNIC statistics") which is in net.
Series applied, thanks Jakub.
^ permalink raw reply
* Re: [net-next 0/9][pull request] 1GbE Intel Wired LAN Driver Updates 2017-10-10
From: David Miller @ 2017-10-10 20:21 UTC (permalink / raw)
To: jeffrey.t.kirsher; +Cc: netdev, nhorman, sassmann, jogreene
In-Reply-To: <20171010172139.77914-1-jeffrey.t.kirsher@intel.com>
From: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
Date: Tue, 10 Oct 2017 10:21:30 -0700
> This series contains updates to e1000e and igb.
Pulled, thanks Jeff.
^ permalink raw reply
* Re: [ovs-dev] [PATCH net-next] openvswitch: add ct_clear action
From: Joe Stringer @ 2017-10-10 20:24 UTC (permalink / raw)
To: Eric Garver, Joe Stringer, Pravin Shelar,
Linux Kernel Network Developers, ovs dev
In-Reply-To: <20171010191329.GH26353@dev-rhel7>
On 10 October 2017 at 12:13, Eric Garver <e@erig.me> wrote:
> On Tue, Oct 10, 2017 at 10:24:20AM -0700, Joe Stringer wrote:
>> On 10 October 2017 at 08:09, Eric Garver <e@erig.me> wrote:
>> > On Tue, Oct 10, 2017 at 05:33:48AM -0700, Joe Stringer wrote:
>> >> On 9 October 2017 at 21:41, Pravin Shelar <pshelar@ovn.org> wrote:
>> >> > On Fri, Oct 6, 2017 at 9:44 AM, Eric Garver <e@erig.me> wrote:
>> >> >> This adds a ct_clear action for clearing conntrack state. ct_clear is
>> >> >> currently implemented in OVS userspace, but is not backed by an action
>> >> >> in the kernel datapath. This is useful for flows that may modify a
>> >> >> packet tuple after a ct lookup has already occurred.
>> >> >>
>> >> >> Signed-off-by: Eric Garver <e@erig.me>
>> >> > Patch mostly looks good. I have following comments.
>> >> >
>> >> >> ---
>> >> >> include/uapi/linux/openvswitch.h | 2 ++
>> >> >> net/openvswitch/actions.c | 5 +++++
>> >> >> net/openvswitch/conntrack.c | 12 ++++++++++++
>> >> >> net/openvswitch/conntrack.h | 7 +++++++
>> >> >> net/openvswitch/flow_netlink.c | 5 +++++
>> >> >> 5 files changed, 31 insertions(+)
>> >> >>
>> >> >> diff --git a/include/uapi/linux/openvswitch.h b/include/uapi/linux/openvswitch.h
>> >> >> index 156ee4cab82e..1b6e510e2cc6 100644
>> >> >> --- a/include/uapi/linux/openvswitch.h
>> >> >> +++ b/include/uapi/linux/openvswitch.h
>> >> >> @@ -806,6 +806,7 @@ struct ovs_action_push_eth {
>> >> >> * packet.
>> >> >> * @OVS_ACTION_ATTR_POP_ETH: Pop the outermost Ethernet header off the
>> >> >> * packet.
>> >> >> + * @OVS_ACTION_ATTR_CT_CLEAR: Clear conntrack state from the packet.
>> >> >> *
>> >> >> * Only a single header can be set with a single %OVS_ACTION_ATTR_SET. Not all
>> >> >> * fields within a header are modifiable, e.g. the IPv4 protocol and fragment
>> >> >> @@ -835,6 +836,7 @@ enum ovs_action_attr {
>> >> >> OVS_ACTION_ATTR_TRUNC, /* u32 struct ovs_action_trunc. */
>> >> >> OVS_ACTION_ATTR_PUSH_ETH, /* struct ovs_action_push_eth. */
>> >> >> OVS_ACTION_ATTR_POP_ETH, /* No argument. */
>> >> >> + OVS_ACTION_ATTR_CT_CLEAR, /* No argument. */
>> >> >>
>> >> >> __OVS_ACTION_ATTR_MAX, /* Nothing past this will be accepted
>> >> >> * from userspace. */
>> >> >> diff --git a/net/openvswitch/actions.c b/net/openvswitch/actions.c
>> >> >> index a54a556fcdb5..db9c7f2e662b 100644
>> >> >> --- a/net/openvswitch/actions.c
>> >> >> +++ b/net/openvswitch/actions.c
>> >> >> @@ -1203,6 +1203,10 @@ static int do_execute_actions(struct datapath *dp, struct sk_buff *skb,
>> >> >> return err == -EINPROGRESS ? 0 : err;
>> >> >> break;
>> >> >>
>> >> >> + case OVS_ACTION_ATTR_CT_CLEAR:
>> >> >> + err = ovs_ct_clear(skb, key);
>> >> >> + break;
>> >> >> +
>> >> >> case OVS_ACTION_ATTR_PUSH_ETH:
>> >> >> err = push_eth(skb, key, nla_data(a));
>> >> >> break;
>> >> >> @@ -1210,6 +1214,7 @@ static int do_execute_actions(struct datapath *dp, struct sk_buff *skb,
>> >> >> case OVS_ACTION_ATTR_POP_ETH:
>> >> >> err = pop_eth(skb, key);
>> >> >> break;
>> >> >> +
>> >> >> }
>> >> > Unrelated change.
>> >> >
>> >> >>
>> >> >> if (unlikely(err)) {
>> >> >> diff --git a/net/openvswitch/conntrack.c b/net/openvswitch/conntrack.c
>> >> >> index d558e882ca0c..f9b73c726ad7 100644
>> >> >> --- a/net/openvswitch/conntrack.c
>> >> >> +++ b/net/openvswitch/conntrack.c
>> >> >> @@ -1129,6 +1129,18 @@ int ovs_ct_execute(struct net *net, struct sk_buff *skb,
>> >> >> return err;
>> >> >> }
>> >> >>
>> >> >> +int ovs_ct_clear(struct sk_buff *skb, struct sw_flow_key *key)
>> >> >> +{
>> >> >> + if (skb_nfct(skb)) {
>> >> >> + nf_conntrack_put(skb_nfct(skb));
>> >> >> + nf_ct_set(skb, NULL, 0);
>> >> > Can the new conntract state be appropriate? may be IP_CT_UNTRACKED?
>> >> >
>> >> >> + }
>> >> >> +
>> >> >> + ovs_ct_fill_key(skb, key);
>> >> >> +
>> >> > I do not see need to refill the key if there is no skb-nf-ct.
>> >>
>> >> Really this is trying to just zero the CT key fields, but reuses
>> >> existing functions, right? This means that subsequent upcalls, for
>> >
>> > Right.
>> >
>> >> instance, won't have the outdated view of the CT state from the
>> >> previous lookup (that was prior to the ct_clear). I'd expect these key
>> >> fields to be cleared.
>> >
>> > I assumed Pravin was saying that we don't need to clear them if there is
>> > no conntrack state. They should already be zero.
>>
>> The conntrack calls aren't going to clear it, so I don't see what else
>> would clear it?
>>
>> If you execute ct(),ct_clear(), then the first ct will set the
>> values.. what will zero them?
>
> I meant move ovs_ct_fill_key() to inside the if statement.
> i.e.
>
> if (skb_nfct(skb)) {
> nf_conntrack_put(skb_nfct(skb));
> nf_ct_set(skb, NULL, IP_CT_UNTRACKED);
> ovs_ct_fill_key(skb, key);
> }
>
> Should be nothing to fill/zero if we have not yet done conntrack.
> Is there a case where we may lose skb->_nfct, but the key still has
> conntrack data?
Ah, misreading on my part. Right, if there is no nfct then it should
already have the right value. I don't think there's a way to lose it
and leave the key with conntrack data.
^ permalink raw reply
* [PATCH net-next v2] openvswitch: add ct_clear action
From: Eric Garver @ 2017-10-10 20:54 UTC (permalink / raw)
To: netdev-u79uwXL29TY76Z2rM5mHXA; +Cc: dev-yBygre7rU0TnMu66kgdUjQ
This adds a ct_clear action for clearing conntrack state. ct_clear is
currently implemented in OVS userspace, but is not backed by an action
in the kernel datapath. This is useful for flows that may modify a
packet tuple after a ct lookup has already occurred.
Signed-off-by: Eric Garver <e@erig.me>
---
v2:
- Use IP_CT_UNTRACKED for nf_ct_set()
- Only fill key if previously conntracked
include/uapi/linux/openvswitch.h | 2 ++
net/openvswitch/actions.c | 4 ++++
net/openvswitch/conntrack.c | 11 +++++++++++
net/openvswitch/conntrack.h | 7 +++++++
net/openvswitch/flow_netlink.c | 5 +++++
5 files changed, 29 insertions(+)
diff --git a/include/uapi/linux/openvswitch.h b/include/uapi/linux/openvswitch.h
index efdbfbfd3ee2..0cd6f8833147 100644
--- a/include/uapi/linux/openvswitch.h
+++ b/include/uapi/linux/openvswitch.h
@@ -807,6 +807,7 @@ struct ovs_action_push_eth {
* packet.
* @OVS_ACTION_ATTR_POP_ETH: Pop the outermost Ethernet header off the
* packet.
+ * @OVS_ACTION_ATTR_CT_CLEAR: Clear conntrack state from the packet.
*
* Only a single header can be set with a single %OVS_ACTION_ATTR_SET. Not all
* fields within a header are modifiable, e.g. the IPv4 protocol and fragment
@@ -836,6 +837,7 @@ enum ovs_action_attr {
OVS_ACTION_ATTR_TRUNC, /* u32 struct ovs_action_trunc. */
OVS_ACTION_ATTR_PUSH_ETH, /* struct ovs_action_push_eth. */
OVS_ACTION_ATTR_POP_ETH, /* No argument. */
+ OVS_ACTION_ATTR_CT_CLEAR, /* No argument. */
__OVS_ACTION_ATTR_MAX, /* Nothing past this will be accepted
* from userspace. */
diff --git a/net/openvswitch/actions.c b/net/openvswitch/actions.c
index a54a556fcdb5..a551232daf61 100644
--- a/net/openvswitch/actions.c
+++ b/net/openvswitch/actions.c
@@ -1203,6 +1203,10 @@ static int do_execute_actions(struct datapath *dp, struct sk_buff *skb,
return err == -EINPROGRESS ? 0 : err;
break;
+ case OVS_ACTION_ATTR_CT_CLEAR:
+ err = ovs_ct_clear(skb, key);
+ break;
+
case OVS_ACTION_ATTR_PUSH_ETH:
err = push_eth(skb, key, nla_data(a));
break;
diff --git a/net/openvswitch/conntrack.c b/net/openvswitch/conntrack.c
index d558e882ca0c..fe861e2f0deb 100644
--- a/net/openvswitch/conntrack.c
+++ b/net/openvswitch/conntrack.c
@@ -1129,6 +1129,17 @@ int ovs_ct_execute(struct net *net, struct sk_buff *skb,
return err;
}
+int ovs_ct_clear(struct sk_buff *skb, struct sw_flow_key *key)
+{
+ if (skb_nfct(skb)) {
+ nf_conntrack_put(skb_nfct(skb));
+ nf_ct_set(skb, NULL, IP_CT_UNTRACKED);
+ ovs_ct_fill_key(skb, key);
+ }
+
+ return 0;
+}
+
static int ovs_ct_add_helper(struct ovs_conntrack_info *info, const char *name,
const struct sw_flow_key *key, bool log)
{
diff --git a/net/openvswitch/conntrack.h b/net/openvswitch/conntrack.h
index bc7efd1867ab..399dfdd2c4f9 100644
--- a/net/openvswitch/conntrack.h
+++ b/net/openvswitch/conntrack.h
@@ -30,6 +30,7 @@ int ovs_ct_action_to_attr(const struct ovs_conntrack_info *, struct sk_buff *);
int ovs_ct_execute(struct net *, struct sk_buff *, struct sw_flow_key *,
const struct ovs_conntrack_info *);
+int ovs_ct_clear(struct sk_buff *skb, struct sw_flow_key *key);
void ovs_ct_fill_key(const struct sk_buff *skb, struct sw_flow_key *key);
int ovs_ct_put_key(const struct sw_flow_key *swkey,
@@ -73,6 +74,12 @@ static inline int ovs_ct_execute(struct net *net, struct sk_buff *skb,
return -ENOTSUPP;
}
+static inline int ovs_ct_clear(struct sk_buff *skb,
+ struct sw_flow_key *key)
+{
+ return -ENOTSUPP;
+}
+
static inline void ovs_ct_fill_key(const struct sk_buff *skb,
struct sw_flow_key *key)
{
diff --git a/net/openvswitch/flow_netlink.c b/net/openvswitch/flow_netlink.c
index fc0ca9a89b8e..dc0d79092e74 100644
--- a/net/openvswitch/flow_netlink.c
+++ b/net/openvswitch/flow_netlink.c
@@ -76,6 +76,7 @@ static bool actions_may_change_flow(const struct nlattr *actions)
break;
case OVS_ACTION_ATTR_CT:
+ case OVS_ACTION_ATTR_CT_CLEAR:
case OVS_ACTION_ATTR_HASH:
case OVS_ACTION_ATTR_POP_ETH:
case OVS_ACTION_ATTR_POP_MPLS:
@@ -2528,6 +2529,7 @@ static int __ovs_nla_copy_actions(struct net *net, const struct nlattr *attr,
[OVS_ACTION_ATTR_SAMPLE] = (u32)-1,
[OVS_ACTION_ATTR_HASH] = sizeof(struct ovs_action_hash),
[OVS_ACTION_ATTR_CT] = (u32)-1,
+ [OVS_ACTION_ATTR_CT_CLEAR] = 0,
[OVS_ACTION_ATTR_TRUNC] = sizeof(struct ovs_action_trunc),
[OVS_ACTION_ATTR_PUSH_ETH] = sizeof(struct ovs_action_push_eth),
[OVS_ACTION_ATTR_POP_ETH] = 0,
@@ -2669,6 +2671,9 @@ static int __ovs_nla_copy_actions(struct net *net, const struct nlattr *attr,
skip_copy = true;
break;
+ case OVS_ACTION_ATTR_CT_CLEAR:
+ break;
+
case OVS_ACTION_ATTR_PUSH_ETH:
/* Disallow pushing an Ethernet header if one
* is already present */
--
2.12.0
^ permalink raw reply related
* Re: [patch net-next 0/4] net: sched: get rid of cls_flower->egress_dev
From: Jiri Pirko @ 2017-10-10 21:13 UTC (permalink / raw)
To: Or Gerlitz
Cc: Saeed Mahameed, Linux Netdev List, David Miller, Jamal Hadi Salim,
Cong Wang, mlxsw
In-Reply-To: <CAJ3xEMgPdcVrogHZmEnH+vP3E53zvEcFN8+wyuEqSs6utHQVRg@mail.gmail.com>
Tue, Oct 10, 2017 at 07:24:21PM CEST, gerlitz.or@gmail.com wrote:
>Jiri,
>
>FWIW, as I reported to you earlier, I was playing with tc encap/decap rules
>on 4.14-rc+ (net) before
>applying any patch of this series, and something is messy w.r.t to decap
>rules. I don't see
>them removed at all when user space attempts to do so. It might (probably)
>mlx5 bug, which
>we will have to fix for net and later rebase net-next over net. We have
>short WW here so
>I will not be able to do RCA this week.
Or, as I replied to you earlier, the issue you describe is totally
unrelated to this patchset as you see the issue with the current net-next.
Not sure why you add this comment here.
^ permalink raw reply
* Re: [patch net-next 3/4] net: sched: convert cls_flower->egress_dev users to tc_setup_cb_egdev infra
From: Jiri Pirko @ 2017-10-10 21:16 UTC (permalink / raw)
To: Or Gerlitz
Cc: Simon Horman, Linux Netdev List, David Miller, Jamal Hadi Salim,
Cong Wang, Saeed Mahameed, mlxsw
In-Reply-To: <CAJ3xEMh0cM3FszEjKNNe3aWTjHoDxof9GURgUSMy5k3iSTLg2g@mail.gmail.com>
Tue, Oct 10, 2017 at 10:08:23PM CEST, gerlitz.or@gmail.com wrote:
>On Tue, Oct 10, 2017 at 10:30 AM, Jiri Pirko <jiri@resnulli.us> wrote:
>> The only user of cls_flower->egress_dev is mlx5.
>
>but nfp supports decap action offload too and from the flower code
>stand point, I guess they are both the same, right? how does it work
>there?
Apparently they don't use cls_flower->egress_dev.
^ permalink raw reply
* Re: [patch net-next 1/4] net: sched: make tc_action_ops->get_dev return dev and avoid passing net
From: Jiri Pirko @ 2017-10-10 21:19 UTC (permalink / raw)
To: Cong Wang
Cc: Linux Kernel Network Developers, David Miller, Jamal Hadi Salim,
Saeed Mahameed, matanb, leonro, mlxsw
In-Reply-To: <CAM_iQpXjq1QVVAC2L3TdTEJzdda1=H1V=c4n5v+qeYJqCSxoWw@mail.gmail.com>
Tue, Oct 10, 2017 at 07:44:53PM CEST, xiyou.wangcong@gmail.com wrote:
>On Tue, Oct 10, 2017 at 12:30 AM, Jiri Pirko <jiri@resnulli.us> wrote:
>> -static int tcf_mirred_device(const struct tc_action *a, struct net *net,
>> - struct net_device **mirred_dev)
>> +static struct net_device *tcf_mirred_get_dev(const struct tc_action *a)
>> {
>> - int ifindex = tcf_mirred_ifindex(a);
>> + struct tcf_mirred *m = to_mirred(a);
>>
>> - *mirred_dev = __dev_get_by_index(net, ifindex);
>> - if (!*mirred_dev)
>> - return -EINVAL;
>> - return 0;
>> + return __dev_get_by_index(m->net, m->tcfm_ifindex);
>
>Hmm, why not just return m->tcfm_dev?
I just follow the existing code. The change you suggest should be a
separate follow-up patch.
^ permalink raw reply
* Re: [PATCH net-next v2 4/5] selinux: bpf: Add selinux check for eBPF syscall operations
From: kbuild test robot @ 2017-10-10 21:30 UTC (permalink / raw)
To: Chenbo Feng
Cc: kbuild-all, linux-security-module, netdev, SELinux,
Jeffrey Vander Stoep, Alexei Starovoitov, lorenzo,
Daniel Borkmann, Stephen Smalley, Chenbo Feng
In-Reply-To: <20171009222028.13096-5-chenbofeng.kernel@gmail.com>
[-- Attachment #1: Type: text/plain, Size: 1945 bytes --]
Hi Chenbo,
[auto build test WARNING on net-next/master]
url: https://github.com/0day-ci/linux/commits/Chenbo-Feng/bpf-security-New-file-mode-and-LSM-hooks-for-eBPF-object-permission-control/20171011-010349
config: x86_64-randconfig-u0-10110310 (attached as .config)
compiler: gcc-6 (Debian 6.2.0-3) 6.2.0 20160901
reproduce:
# save the attached .config to linux build tree
make ARCH=x86_64
All warnings (new ones prefixed by >>):
In file included from include/linux/init.h:4:0,
from security/selinux/hooks.c:27:
security/selinux/hooks.c: In function 'bpf_map_fmode_to_av':
security/selinux/hooks.c:6284:6: error: 'f_mode' undeclared (first use in this function)
if (f_mode & FMODE_READ)
^
include/linux/compiler.h:156:30: note: in definition of macro '__trace_if'
if (__builtin_constant_p(!!(cond)) ? !!(cond) : \
^~~~
>> security/selinux/hooks.c:6284:2: note: in expansion of macro 'if'
if (f_mode & FMODE_READ)
^~
security/selinux/hooks.c:6284:6: note: each undeclared identifier is reported only once for each function it appears in
if (f_mode & FMODE_READ)
^
include/linux/compiler.h:156:30: note: in definition of macro '__trace_if'
if (__builtin_constant_p(!!(cond)) ? !!(cond) : \
^~~~
>> security/selinux/hooks.c:6284:2: note: in expansion of macro 'if'
if (f_mode & FMODE_READ)
^~
vim +/if +6284 security/selinux/hooks.c
6279
6280 static u32 bpf_map_fmode_to_av(fmode_t fmode)
6281 {
6282 u32 av = 0;
6283
> 6284 if (f_mode & FMODE_READ)
6285 av |= BPF_MAP__READ;
6286 if (f_mode & FMODE_WRITE)
6287 av |= BPF_MAP__WRITE;
6288 return av;
6289 }
6290
---
0-DAY kernel test infrastructure Open Source Technology Center
https://lists.01.org/pipermail/kbuild-all Intel Corporation
[-- Attachment #2: .config.gz --]
[-- Type: application/gzip, Size: 31549 bytes --]
^ permalink raw reply
* Re: [PATCH net-next v2 6/7] bpf: don't rely on the verifier lock for metadata_dst allocation
From: kbuild test robot @ 2017-10-10 21:33 UTC (permalink / raw)
To: Jakub Kicinski
Cc: kbuild-all, netdev, oss-drivers, alexei.starovoitov, daniel,
Jakub Kicinski
In-Reply-To: <20171009173015.23520-7-jakub.kicinski@netronome.com>
[-- Attachment #1: Type: text/plain, Size: 866 bytes --]
Hi Jakub,
[auto build test WARNING on net-next/master]
url: https://github.com/0day-ci/linux/commits/Jakub-Kicinski/bpf-get-rid-of-global-verifier-state-and-reuse-instruction-printer/20171011-021905
config: x86_64-randconfig-a0-10110234 (attached as .config)
compiler: gcc-4.4 (Debian 4.4.7-8) 4.4.7
reproduce:
# save the attached .config to linux build tree
make ARCH=x86_64
All warnings (new ones prefixed by >>):
net//core/dst.c: In function 'metadata_dst_free_percpu':
>> net//core/dst.c:328: warning: unused variable 'cpu'
vim +/cpu +328 net//core/dst.c
325
326 void metadata_dst_free_percpu(struct metadata_dst __percpu *md_dst)
327 {
> 328 int cpu;
329
---
0-DAY kernel test infrastructure Open Source Technology Center
https://lists.01.org/pipermail/kbuild-all Intel Corporation
[-- Attachment #2: .config.gz --]
[-- Type: application/gzip, Size: 30274 bytes --]
^ permalink raw reply
* Re: [Patch net-next] tcp: add a tracepoint for tcp_retransmit_skb()
From: Cong Wang @ 2017-10-10 21:37 UTC (permalink / raw)
To: Alexei Starovoitov
Cc: Linux Kernel Network Developers, Eric Dumazet, Yuchung Cheng,
Neal Cardwell, Martin KaFai Lau, Brendan Gregg,
Hannes Frederic Sowa, Song Liu
In-Reply-To: <20171010173821.6djxyvrggvaivqec@ast-mbp.dhcp.thefacebook.com>
On Tue, Oct 10, 2017 at 10:38 AM, Alexei Starovoitov
<alexei.starovoitov@gmail.com> wrote:
>
> I'm happy to see new tracepoints being added to tcp stack, but I'm concerned
> with practical usability of them.
> Like the above tracepoint definition makes it not very useful from bpf point of view,
> since 'sk' pointer is not recored by as part of the tracepoint.
> In bpf/tracing world we prefer tracepoints to have raw pointers recorded
> in TP_STRUCT__entry() and _not_ printed in TP_printk()
> (since pointers are useless for userspace).
> Like trace_kfree_skb() tracepoint records raw 'skb' pointer and we can
> walk whatever sk_buff fields we need inside the program.
> Such approach allows tracepoint to be usable in many more scenarios, since
> bpf program can examine kernel datastructures.
Sure, I am happy to add them for BPF. The current version is merely
for our own use case, other use cases like this are always welcome!
> Over the last few years we've been running tcp statistics framework (similar to web10g)
> using 8 kprobes in tcp stack with bpf programs extracting the data and now we're
> ready to share this experience with the community. Right now we're working on a set
> of tracepoints for tcp stack to make the interface more accurate, faster and more stable.
> We're planning to send an RFC patch with these new tracepoints in the comming weeks.
Great! Looking forward to it!
>
> More concrete, if you can make this trace_tcp_retransmit_skb() to record
> sk, skb pointers and err code at the end of __tcp_retransmit_skb() it will solve
> our need as well.
Note, currently I only call trace_tcp_retransmit_skb() for successful
retransmissions, since you mentioned err code, I guess you want it
for failures too? I am not sure if tracing unsuccessful TCP retransmissions
is meaningful here, I guess it's needed for BPF to track TCP states?
It doesn't harm to add it, at least we can filter out err!=0 since we
only care about successful ones.
>
> So far our list of kprobes is:
> int kprobe__tcp_validate_incoming
> int kprobe__tcp_send_active_reset
> int kprobe__tcp_v4_send_reset
> int kprobe__tcp_v6_send_reset
> int kprobe__tcp_v4_destroy_sock
> int kprobe__tcp_set_state
> int kprobe__tcp_retransmit_skb
> int kprobe__tcp_rtx_synack
>
> with tracepoints we can consolidate two of them into one and drop
> another one for sure. Notice that tcp_retransmit_skb is on our list too
> and currently we're doing extra work inside the program to make it more
> accurate which will be unnecessary if this tracepoint is at the end
> of __tcp_retransmit_skb().
Yeah, with these tracepoints we would be able to trace more TCP
state changes.
Thanks!
^ permalink raw reply
* Re: [PATCH v2] xdp: Sample xdp program implementing ip forward
From: David Daney @ 2017-10-10 21:37 UTC (permalink / raw)
To: Stephen Hemminger, Christina Jacob
Cc: Christina Jacob, daniel, Sunil.Goutham, netdev, dsahern,
linux-kernel, brouer, linux-arm-kernel
In-Reply-To: <20171010101921.30136d48@shemminger-XPS-13-9360>
On 10/10/2017 10:19 AM, Stephen Hemminger wrote:
> On Tue, 10 Oct 2017 12:58:52 +0530
> Christina Jacob <christina.jacob.koikara@gmail.com> wrote:
>
>> +/* Get the mac address of the interface given interface name */
>> +static long *getmac(char *iface)
>> +{
>> + int fd;
>> + struct ifreq ifr;
>> + long *mac = NULL;
>> +
>> + fd = socket(AF_INET, SOCK_DGRAM, 0);
>> + ifr.ifr_addr.sa_family = AF_INET;
>> + strncpy(ifr.ifr_name, iface, IFNAMSIZ - 1);
>> + ioctl(fd, SIOCGIFHWADDR, &ifr);
>> + mac = (long *)ifr.ifr_hwaddr.sa_data;
>> + close(fd);
>> + return mac;
>
> Always check return value of ioctl.
> You are assuming sizeof(long) > 6 bytes.
> Also the byte order.
Also:
Returning the address of a local variable (ifr.ifr_hwaddr.sa_data), and
then dereferencing it outside of the function is not correct.
The casting of the char sa_data[] to a long * may cause alignment faults
on some architectures. The may also be endinaness issues depending on
how the data are manipulated if you pack all those chars into a long.
If we think that a MAC address is char[6], then it may be best to define
the data structures as such and manipulate it as an array instead of
trying to pack it into a long.
Keep working on this though, this program will surely be useful.
David Daney
^ permalink raw reply
* [PATCH v2 nf-next] netfilter: x_tables: speed up iptables-restore
From: Florian Westphal @ 2017-10-10 21:39 UTC (permalink / raw)
To: netdev, edumazet
iptables-restore can take quite a long time when sytem is busy,
in order of half a minute or more.
The main reason for this is the way ip(6)tables performs table
swap, or, more precisely, expensive sequence lock synchronizations
when reading counters.
When xt_replace_table assigns the new ruleset pointer, it does
not wait for other processors to finish with old ruleset.
Instead it relies on the counter sequence lock in get_counters()
to do this.
This works but this is very costly if system is busy as each counter
read operation can possibly be restarted indefinitely.
Instead, make xt_replace_table wait until all processors are
known to not use the old ruleset anymore.
This allows to read the old counters without any locking, no cpu is
using the ruleset anymore so counters can't change either.
ipv4/netfilter/arp_tables.c | 22 ++++++++++++++++++++--
ipv4/netfilter/ip_tables.c | 23 +++++++++++++++++++++--
ipv6/netfilter/ip6_tables.c | 22 ++++++++++++++++++++--
netfilter/x_tables.c | 15 ++++++++++++---
4 files changed, 73 insertions(+), 9 deletions(-)
^ permalink raw reply
* [PATCH v2 nf-next 1/2] netfilter: x_tables: make xt_replace_table wait until old rules are not used anymore
From: Florian Westphal @ 2017-10-10 21:39 UTC (permalink / raw)
To: netdev, edumazet; +Cc: Florian Westphal, Dan Williams
In-Reply-To: <20171010213945.19074-1-fw@strlen.de>
xt_replace_table relies on table replacement counter retrieval (which
uses xt_recseq to synchronize pcpu counters).
This is fine, however with large rule set get_counters() can take
a very long time -- it needs to synchronize all counters because
it has to assume concurrent modifications can occur.
Make xt_replace_table synchronize by itself by waiting until all cpus
had an even seqcount.
This allows a followup patch to copy the counters of the old ruleset
without any synchonization after xt_replace_table has completed.
Cc: Dan Williams <dcbw@redhat.com>
Cc: Eric Dumazet <edumazet@google.com>
Signed-off-by: Florian Westphal <fw@strlen.de>
---
v2: fix Erics email address
net/netfilter/x_tables.c | 15 ++++++++++++---
1 file changed, 12 insertions(+), 3 deletions(-)
diff --git a/net/netfilter/x_tables.c b/net/netfilter/x_tables.c
index c83a3b5e1c6c..f2d4a365768f 100644
--- a/net/netfilter/x_tables.c
+++ b/net/netfilter/x_tables.c
@@ -1153,6 +1153,7 @@ xt_replace_table(struct xt_table *table,
int *error)
{
struct xt_table_info *private;
+ unsigned int cpu;
int ret;
ret = xt_jumpstack_alloc(newinfo);
@@ -1184,12 +1185,20 @@ xt_replace_table(struct xt_table *table,
/*
* Even though table entries have now been swapped, other CPU's
- * may still be using the old entries. This is okay, because
- * resynchronization happens because of the locking done
- * during the get_counters() routine.
+ * may still be using the old entries...
*/
local_bh_enable();
+ /* ... so wait for even xt_recseq on all cpus */
+ for_each_possible_cpu(cpu) {
+ seqcount_t *s = &per_cpu(xt_recseq, cpu);
+
+ while (raw_read_seqcount(s) & 1)
+ cpu_relax();
+
+ cond_resched();
+ }
+
#ifdef CONFIG_AUDIT
if (audit_enabled) {
audit_log(current->audit_context, GFP_KERNEL,
--
2.13.6
^ permalink raw reply related
* [PATCH v2 nf-next 2/2] netfilter: x_tables: don't use seqlock when fetching old counters
From: Florian Westphal @ 2017-10-10 21:39 UTC (permalink / raw)
To: netdev, edumazet; +Cc: Florian Westphal, Dan Williams
In-Reply-To: <20171010213945.19074-1-fw@strlen.de>
after previous commit xt_replace_table will wait until all cpus
had even seqcount (i.e., no cpu is accessing old ruleset).
Add a 'old' counter retrival version that doesn't synchronize counters.
Its not needed, the old counters are not in use anymore at this point.
This speeds up table replacement on busy systems with large tables
(and many cores).
Cc: Dan Williams <dcbw@redhat.com>
Cc: Eric Dumazet <edumazet@google.com>
Signed-off-by: Florian Westphal <fw@strlen.de>
---
v2: fix Erics email address
net/ipv4/netfilter/arp_tables.c | 22 ++++++++++++++++++++--
net/ipv4/netfilter/ip_tables.c | 23 +++++++++++++++++++++--
net/ipv6/netfilter/ip6_tables.c | 22 ++++++++++++++++++++--
3 files changed, 61 insertions(+), 6 deletions(-)
diff --git a/net/ipv4/netfilter/arp_tables.c b/net/ipv4/netfilter/arp_tables.c
index 9e2770fd00be..f88221aebc9d 100644
--- a/net/ipv4/netfilter/arp_tables.c
+++ b/net/ipv4/netfilter/arp_tables.c
@@ -634,6 +634,25 @@ static void get_counters(const struct xt_table_info *t,
}
}
+static void get_old_counters(const struct xt_table_info *t,
+ struct xt_counters counters[])
+{
+ struct arpt_entry *iter;
+ unsigned int cpu, i;
+
+ for_each_possible_cpu(cpu) {
+ i = 0;
+ xt_entry_foreach(iter, t->entries, t->size) {
+ struct xt_counters *tmp;
+
+ tmp = xt_get_per_cpu_counter(&iter->counters, cpu);
+ ADD_COUNTER(counters[i], tmp->bcnt, tmp->pcnt);
+ ++i;
+ }
+ cond_resched();
+ }
+}
+
static struct xt_counters *alloc_counters(const struct xt_table *table)
{
unsigned int countersize;
@@ -910,8 +929,7 @@ static int __do_replace(struct net *net, const char *name,
(newinfo->number <= oldinfo->initial_entries))
module_put(t->me);
- /* Get the old counters, and synchronize with replace */
- get_counters(oldinfo, counters);
+ get_old_counters(oldinfo, counters);
/* Decrease module usage counts and free resource */
loc_cpu_old_entry = oldinfo->entries;
diff --git a/net/ipv4/netfilter/ip_tables.c b/net/ipv4/netfilter/ip_tables.c
index 39286e543ee6..4cbe5e80f3bf 100644
--- a/net/ipv4/netfilter/ip_tables.c
+++ b/net/ipv4/netfilter/ip_tables.c
@@ -781,6 +781,26 @@ get_counters(const struct xt_table_info *t,
}
}
+static void get_old_counters(const struct xt_table_info *t,
+ struct xt_counters counters[])
+{
+ struct ipt_entry *iter;
+ unsigned int cpu, i;
+
+ for_each_possible_cpu(cpu) {
+ i = 0;
+ xt_entry_foreach(iter, t->entries, t->size) {
+ const struct xt_counters *tmp;
+
+ tmp = xt_get_per_cpu_counter(&iter->counters, cpu);
+ ADD_COUNTER(counters[i], tmp->bcnt, tmp->pcnt);
+ ++i; /* macro does multi eval of i */
+ }
+
+ cond_resched();
+ }
+}
+
static struct xt_counters *alloc_counters(const struct xt_table *table)
{
unsigned int countersize;
@@ -1070,8 +1090,7 @@ __do_replace(struct net *net, const char *name, unsigned int valid_hooks,
(newinfo->number <= oldinfo->initial_entries))
module_put(t->me);
- /* Get the old counters, and synchronize with replace */
- get_counters(oldinfo, counters);
+ get_old_counters(oldinfo, counters);
/* Decrease module usage counts and free resource */
xt_entry_foreach(iter, oldinfo->entries, oldinfo->size)
diff --git a/net/ipv6/netfilter/ip6_tables.c b/net/ipv6/netfilter/ip6_tables.c
index 01bd3ee5ebc6..f06e25065a34 100644
--- a/net/ipv6/netfilter/ip6_tables.c
+++ b/net/ipv6/netfilter/ip6_tables.c
@@ -800,6 +800,25 @@ get_counters(const struct xt_table_info *t,
}
}
+static void get_old_counters(const struct xt_table_info *t,
+ struct xt_counters counters[])
+{
+ struct ip6t_entry *iter;
+ unsigned int cpu, i;
+
+ for_each_possible_cpu(cpu) {
+ i = 0;
+ xt_entry_foreach(iter, t->entries, t->size) {
+ const struct xt_counters *tmp;
+
+ tmp = xt_get_per_cpu_counter(&iter->counters, cpu);
+ ADD_COUNTER(counters[i], tmp->bcnt, tmp->pcnt);
+ ++i;
+ }
+ cond_resched();
+ }
+}
+
static struct xt_counters *alloc_counters(const struct xt_table *table)
{
unsigned int countersize;
@@ -1090,8 +1109,7 @@ __do_replace(struct net *net, const char *name, unsigned int valid_hooks,
(newinfo->number <= oldinfo->initial_entries))
module_put(t->me);
- /* Get the old counters, and synchronize with replace */
- get_counters(oldinfo, counters);
+ get_old_counters(oldinfo, counters);
/* Decrease module usage counts and free resource */
xt_entry_foreach(iter, oldinfo->entries, oldinfo->size)
--
2.13.6
^ permalink raw reply related
* Re: [patch net-next 0/4] net: sched: get rid of cls_flower->egress_dev
From: Or Gerlitz @ 2017-10-10 21:46 UTC (permalink / raw)
To: Jiri Pirko
Cc: Saeed Mahameed, Linux Netdev List, David Miller, Jamal Hadi Salim,
Cong Wang, mlxsw
In-Reply-To: <20171010211357.GJ2033@nanopsycho>
On Wed, Oct 11, 2017 at 12:13 AM, Jiri Pirko <jiri@resnulli.us> wrote:
> Tue, Oct 10, 2017 at 07:24:21PM CEST, gerlitz.or@gmail.com wrote:
> Or, as I replied to you earlier, the issue you describe is totally
> unrelated to this patchset as you see the issue with the current net-next.
Jiri, the point I wanted to make that if indeed there's a bug in mlx5
or flower, we will have to fix it for 4.14 and then these bits would
have to be rebased when net-next is re-planted over net, I put "FWIW"
before that, so maybe it doesn't W so much, we'll see.
Or.
^ permalink raw reply
* Re: [patch net-next 3/4] net: sched: convert cls_flower->egress_dev users to tc_setup_cb_egdev infra
From: Or Gerlitz @ 2017-10-10 21:47 UTC (permalink / raw)
To: Jiri Pirko, John Hurley
Cc: Simon Horman, Linux Netdev List, David Miller, Jamal Hadi Salim,
Cong Wang, Saeed Mahameed, mlxsw
In-Reply-To: <20171010211609.GK2033@nanopsycho>
On Wed, Oct 11, 2017 at 12:16 AM, Jiri Pirko <jiri@resnulli.us> wrote:
> Tue, Oct 10, 2017 at 10:08:23PM CEST, gerlitz.or@gmail.com wrote:
>>On Tue, Oct 10, 2017 at 10:30 AM, Jiri Pirko <jiri@resnulli.us> wrote:
>>> The only user of cls_flower->egress_dev is mlx5.
>>
>>but nfp supports decap action offload too and from the flower code
>>stand point, I guess they are both the same, right? how does it work
>>there?
>
> Apparently they don't use cls_flower->egress_dev.
John, can you elaborate on that, how do you manage to get away from
that practice?
^ permalink raw reply
* Re: [Patch net-next] tcp: add a tracepoint for tcp_retransmit_skb()
From: Hannes Frederic Sowa @ 2017-10-10 21:58 UTC (permalink / raw)
To: Alexei Starovoitov
Cc: Cong Wang, netdev, Eric Dumazet, Yuchung Cheng, Neal Cardwell,
Martin KaFai Lau, Brendan Gregg, Song Liu
In-Reply-To: <20171010173821.6djxyvrggvaivqec@ast-mbp.dhcp.thefacebook.com>
Alexei Starovoitov <alexei.starovoitov@gmail.com> writes:
> On Mon, Oct 09, 2017 at 10:35:47PM -0700, Cong Wang wrote:
[...]
>> + trace_tcp_retransmit_skb(sk, skb, segs);
>
> I'm happy to see new tracepoints being added to tcp stack, but I'm concerned
> with practical usability of them.
> Like the above tracepoint definition makes it not very useful from bpf point of view,
> since 'sk' pointer is not recored by as part of the tracepoint.
> In bpf/tracing world we prefer tracepoints to have raw pointers recorded
> in TP_STRUCT__entry() and _not_ printed in TP_printk()
> (since pointers are useless for userspace).
Ack.
Also could the TP_printk also use the socket cookies so they can get
associated with netlink dumps and as such also be associated to user
space processes? It could help against races while trying to associate
the socket with a process. ss already supports dumping those cookies
with -e.
The corresponding commit would be:
commit 33cf7c90fe2f97afb1cadaa0cfb782cb9d1b9ee2
Author: Eric Dumazet <edumazet@google.com>
Date: Wed Mar 11 18:53:14 2015 -0700
net: add real socket cookies
Right now they only get set when needed but as Eric already mentioned in
his commit log, this could be refined.
[...]
^ permalink raw reply
* [PATCH] i40e: only redistribute MSI-X vectors when needed
From: Shannon Nelson @ 2017-10-10 21:56 UTC (permalink / raw)
To: intel-wired-lan, jeffrey.t.kirsher; +Cc: netdev
Whether or not there are vectors_left, we only need to redistribute
our vectors if we didn't get as many as we requested. With the current
check, the code will try to redistribute even if we did in fact get all
the vectors we requested - this can happen when we have more CPUs than
we do vectors. This restores an earlier check to be sure we only
redistribute if we didn't get the full count we requested.
Fixes: 4ce20abc645f (i40e: fix MSI-X vector redistribution if hw limit is reached)
Signed-off-by: Shannon Nelson <shannon.nelson@oracle.com>
---
drivers/net/ethernet/intel/i40e/i40e_main.c | 5 +++--
1 files changed, 3 insertions(+), 2 deletions(-)
diff --git a/drivers/net/ethernet/intel/i40e/i40e_main.c b/drivers/net/ethernet/intel/i40e/i40e_main.c
index bf91958..535e6e7 100644
--- a/drivers/net/ethernet/intel/i40e/i40e_main.c
+++ b/drivers/net/ethernet/intel/i40e/i40e_main.c
@@ -8176,7 +8176,7 @@ static int i40e_init_msix(struct i40e_pf *pf)
pf->num_lan_qps = 1;
pf->num_lan_msix = 1;
- } else if (!vectors_left) {
+ } else if (v_actual != v_budget) {
/* If we have limited resources, we will start with no vectors
* for the special features and then allocate vectors to some
* of these features based on the policy and at the end disable
@@ -8185,7 +8185,8 @@ static int i40e_init_msix(struct i40e_pf *pf)
int vec;
dev_info(&pf->pdev->dev,
- "MSI-X vector limit reached, attempting to redistribute vectors\n");
+ "MSI-X vector limit reached with %d, wanted %d, attempting to redistribute vectors\n",
+ v_actual, v_budget);
/* reserve the misc vector */
vec = v_actual - 1;
--
1.7.1
^ permalink raw reply related
* [PATCH net-next] net: dst: move cpu inside ifdef to avoid compilation warning
From: Jakub Kicinski @ 2017-10-10 22:05 UTC (permalink / raw)
To: netdev; +Cc: Jakub Kicinski
If CONFIG_DST_CACHE is not selected cpu variable
will be unused and we will see a compilation warning.
Move it under the ifdef.
Reported-by: kbuild test robot <fengguang.wu@intel.com>
Fixes: d66f2b91f95b ("bpf: don't rely on the verifier lock for metadata_dst allocation")
Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
---
net/core/dst.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/net/core/dst.c b/net/core/dst.c
index 8b2eafac984d..662a2d4a3d19 100644
--- a/net/core/dst.c
+++ b/net/core/dst.c
@@ -325,9 +325,9 @@ EXPORT_SYMBOL_GPL(metadata_dst_alloc_percpu);
void metadata_dst_free_percpu(struct metadata_dst __percpu *md_dst)
{
+#ifdef CONFIG_DST_CACHE
int cpu;
-#ifdef CONFIG_DST_CACHE
for_each_possible_cpu(cpu) {
struct metadata_dst *one_md_dst = per_cpu_ptr(md_dst, cpu);
--
2.14.1
^ permalink raw reply related
* [PATCH] hdlc: Convert timers to use timer_setup()
From: Kees Cook @ 2017-10-10 22:08 UTC (permalink / raw)
To: David S. Miller; +Cc: Krzysztof Halasa, netdev, Thomas Gleixner, linux-kernel
In preparation for unconditionally passing the struct timer_list pointer to
all timer callbacks, switch to using the new timer_setup() and from_timer()
to pass the timer pointer explicitly. This adds a pointer back to the
net_device, and drops needless open-coded resetting of the .function and
.data fields.
Cc: David S. Miller <davem@davemloft.net>
Cc: Krzysztof Halasa <khc@pm.waw.pl>
Cc: netdev@vger.kernel.org
Signed-off-by: Kees Cook <keescook@chromium.org>
---
This requires commit 686fef928bba ("timer: Prepare to change timer
callback argument type") in v4.14-rc3, but should be otherwise
stand-alone.
---
drivers/net/wan/hdlc_cisco.c | 15 ++++++---------
drivers/net/wan/hdlc_fr.c | 13 ++++++-------
2 files changed, 12 insertions(+), 16 deletions(-)
diff --git a/drivers/net/wan/hdlc_cisco.c b/drivers/net/wan/hdlc_cisco.c
index a408abc25512..320039d329c7 100644
--- a/drivers/net/wan/hdlc_cisco.c
+++ b/drivers/net/wan/hdlc_cisco.c
@@ -54,6 +54,7 @@ struct cisco_state {
cisco_proto settings;
struct timer_list timer;
+ struct net_device *dev;
spinlock_t lock;
unsigned long last_poll;
int up;
@@ -257,11 +258,10 @@ static int cisco_rx(struct sk_buff *skb)
-static void cisco_timer(unsigned long arg)
+static void cisco_timer(struct timer_list *t)
{
- struct net_device *dev = (struct net_device *)arg;
- hdlc_device *hdlc = dev_to_hdlc(dev);
- struct cisco_state *st = state(hdlc);
+ struct cisco_state *st = from_timer(st, t, timer);
+ struct net_device *dev = st->dev;
spin_lock(&st->lock);
if (st->up &&
@@ -276,8 +276,6 @@ static void cisco_timer(unsigned long arg)
spin_unlock(&st->lock);
st->timer.expires = jiffies + st->settings.interval * HZ;
- st->timer.function = cisco_timer;
- st->timer.data = arg;
add_timer(&st->timer);
}
@@ -293,10 +291,9 @@ static void cisco_start(struct net_device *dev)
st->up = st->txseq = st->rxseq = 0;
spin_unlock_irqrestore(&st->lock, flags);
- init_timer(&st->timer);
+ st->dev = dev;
+ timer_setup(&st->timer, cisco_timer, 0);
st->timer.expires = jiffies + HZ; /* First poll after 1 s */
- st->timer.function = cisco_timer;
- st->timer.data = (unsigned long)dev;
add_timer(&st->timer);
}
diff --git a/drivers/net/wan/hdlc_fr.c b/drivers/net/wan/hdlc_fr.c
index 78596e42a3f3..038236a9c60e 100644
--- a/drivers/net/wan/hdlc_fr.c
+++ b/drivers/net/wan/hdlc_fr.c
@@ -140,6 +140,7 @@ struct frad_state {
int dce_pvc_count;
struct timer_list timer;
+ struct net_device *dev;
unsigned long last_poll;
int reliable;
int dce_changed;
@@ -597,9 +598,10 @@ static void fr_set_link_state(int reliable, struct net_device *dev)
}
-static void fr_timer(unsigned long arg)
+static void fr_timer(struct timer_list *t)
{
- struct net_device *dev = (struct net_device *)arg;
+ struct frad_state *st = from_timer(st, t, timer);
+ struct net_device *dev = st->dev;
hdlc_device *hdlc = dev_to_hdlc(dev);
int i, cnt = 0, reliable;
u32 list;
@@ -644,8 +646,6 @@ static void fr_timer(unsigned long arg)
state(hdlc)->settings.t391 * HZ;
}
- state(hdlc)->timer.function = fr_timer;
- state(hdlc)->timer.data = arg;
add_timer(&state(hdlc)->timer);
}
@@ -1003,11 +1003,10 @@ static void fr_start(struct net_device *dev)
state(hdlc)->n391cnt = 0;
state(hdlc)->txseq = state(hdlc)->rxseq = 0;
- init_timer(&state(hdlc)->timer);
+ state(hdlc)->dev = dev;
+ timer_setup(&state(hdlc)->timer, fr_timer, 0);
/* First poll after 1 s */
state(hdlc)->timer.expires = jiffies + HZ;
- state(hdlc)->timer.function = fr_timer;
- state(hdlc)->timer.data = (unsigned long)dev;
add_timer(&state(hdlc)->timer);
} else
fr_set_link_state(1, dev);
--
2.7.4
--
Kees Cook
Pixel Security
^ permalink raw reply related
* Re: [net-next V6 PATCH 1/5] bpf: introduce new bpf cpu map type BPF_MAP_TYPE_CPUMAP
From: Daniel Borkmann @ 2017-10-10 22:48 UTC (permalink / raw)
To: Jesper Dangaard Brouer, netdev
Cc: jakub.kicinski, Michael S. Tsirkin, pavel.odintsov, Jason Wang,
mchan, John Fastabend, peter.waskiewicz.jr, Daniel Borkmann,
Alexei Starovoitov, Andy Gospodarek
In-Reply-To: <150763965869.14394.6619644617101345170.stgit@firesoul>
On 10/10/2017 02:47 PM, Jesper Dangaard Brouer wrote:
[...]
> +static struct bpf_map *cpu_map_alloc(union bpf_attr *attr)
> +{
> + struct bpf_cpu_map *cmap;
> + int err = -ENOMEM;
> + u64 cost;
> + int ret;
> +
> + if (!capable(CAP_SYS_ADMIN))
> + return ERR_PTR(-EPERM);
> +
> + /* check sanity of attributes */
> + if (attr->max_entries == 0 || attr->key_size != 4 ||
> + attr->value_size != 4 || attr->map_flags & ~BPF_F_NUMA_NODE)
> + return ERR_PTR(-EINVAL);
> +
> + cmap = kzalloc(sizeof(*cmap), GFP_USER);
> + if (!cmap)
> + return ERR_PTR(-ENOMEM);
> +
> + /* mandatory map attributes */
> + cmap->map.map_type = attr->map_type;
> + cmap->map.key_size = attr->key_size;
> + cmap->map.value_size = attr->value_size;
> + cmap->map.max_entries = attr->max_entries;
> + cmap->map.map_flags = attr->map_flags;
> + cmap->map.numa_node = bpf_map_attr_numa_node(attr);
> +
> + /* Pre-limit array size based on NR_CPUS, not final CPU check */
> + if (cmap->map.max_entries > NR_CPUS)
> + return ERR_PTR(-E2BIG);
We still have a leak here, meaning kfree(cmap) is missing on above error.
> +
> + /* make sure page count doesn't overflow */
> + cost = (u64) cmap->map.max_entries * sizeof(struct bpf_cpu_map_entry *);
> + cost += cpu_map_bitmap_size(attr) * num_possible_cpus();
> + if (cost >= U32_MAX - PAGE_SIZE)
> + goto free_cmap;
> + cmap->map.pages = round_up(cost, PAGE_SIZE) >> PAGE_SHIFT;
> +
> + /* Notice returns -EPERM on if map size is larger than memlock limit */
> + ret = bpf_map_precharge_memlock(cmap->map.pages);
> + if (ret) {
> + err = ret;
> + goto free_cmap;
> + }
> +
> + /* A per cpu bitfield with a bit per possible CPU in map */
> + cmap->flush_needed = __alloc_percpu(cpu_map_bitmap_size(attr),
> + __alignof__(unsigned long));
> + if (!cmap->flush_needed)
> + goto free_cmap;
> +
> + /* Alloc array for possible remote "destination" CPUs */
> + cmap->cpu_map = bpf_map_area_alloc(cmap->map.max_entries *
> + sizeof(struct bpf_cpu_map_entry *),
> + cmap->map.numa_node);
> + if (!cmap->cpu_map)
> + goto free_cmap;
> +
> + return &cmap->map;
> +free_cmap:
> + free_percpu(cmap->flush_needed);
> + kfree(cmap);
> + return ERR_PTR(err);
> +}
^ permalink raw reply
* Re: [PATCH net-next] net: dst: move cpu inside ifdef to avoid compilation warning
From: David Miller @ 2017-10-10 22:56 UTC (permalink / raw)
To: jakub.kicinski; +Cc: netdev
In-Reply-To: <20171010220539.21200-1-jakub.kicinski@netronome.com>
From: Jakub Kicinski <jakub.kicinski@netronome.com>
Date: Tue, 10 Oct 2017 15:05:39 -0700
> If CONFIG_DST_CACHE is not selected cpu variable
> will be unused and we will see a compilation warning.
> Move it under the ifdef.
>
> Reported-by: kbuild test robot <fengguang.wu@intel.com>
> Fixes: d66f2b91f95b ("bpf: don't rely on the verifier lock for metadata_dst allocation")
> Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Applied.
^ permalink raw reply
* Re: [PATCH net-next v2] openvswitch: add ct_clear action
From: Pravin Shelar @ 2017-10-10 23:34 UTC (permalink / raw)
To: Eric Garver; +Cc: Linux Kernel Network Developers, ovs dev, Joe Stringer
In-Reply-To: <20171010205444.20628-1-e@erig.me>
On Tue, Oct 10, 2017 at 1:54 PM, Eric Garver <e@erig.me> wrote:
> This adds a ct_clear action for clearing conntrack state. ct_clear is
> currently implemented in OVS userspace, but is not backed by an action
> in the kernel datapath. This is useful for flows that may modify a
> packet tuple after a ct lookup has already occurred.
>
> Signed-off-by: Eric Garver <e@erig.me>
> ---
> v2:
> - Use IP_CT_UNTRACKED for nf_ct_set()
> - Only fill key if previously conntracked
>
Looks good.
Acked-by: Pravin B Shelar <pshelar@ovn.org>
^ permalink raw reply
* Re: [PATCH net-next v2] openvswitch: add ct_clear action
From: David Miller @ 2017-10-10 23:38 UTC (permalink / raw)
To: pshelar; +Cc: e, netdev, dev, joe
In-Reply-To: <CAOrHB_AMzr-1uTRfE_xWLdTPwUPoz1Dwoobp3P1BkqQeL3qGzA@mail.gmail.com>
From: Pravin Shelar <pshelar@ovn.org>
Date: Tue, 10 Oct 2017 16:34:29 -0700
> On Tue, Oct 10, 2017 at 1:54 PM, Eric Garver <e@erig.me> wrote:
>> This adds a ct_clear action for clearing conntrack state. ct_clear is
>> currently implemented in OVS userspace, but is not backed by an action
>> in the kernel datapath. This is useful for flows that may modify a
>> packet tuple after a ct lookup has already occurred.
>>
>> Signed-off-by: Eric Garver <e@erig.me>
>> ---
>> v2:
>> - Use IP_CT_UNTRACKED for nf_ct_set()
>> - Only fill key if previously conntracked
>>
> Looks good.
> Acked-by: Pravin B Shelar <pshelar@ovn.org>
Applied.
^ permalink raw reply
* [PATCH net-next v3 0/5] bpf: security: New file mode and LSM hooks for eBPF object permission control
From: Chenbo Feng @ 2017-10-11 0:09 UTC (permalink / raw)
To: linux-security-module, netdev, SELinux
Cc: Jeffrey Vander Stoep, Alexei Starovoitov, lorenzo,
Daniel Borkmann, Stephen Smalley, Chenbo Feng
From: Chenbo Feng <fengc@google.com>
Much like files and sockets, eBPF objects are accessed, controlled, and
shared via a file descriptor (FD). Unlike files and sockets, the
existing mechanism for eBPF object access control is very limited.
Currently there are two options for granting accessing to eBPF
operations: grant access to all processes, or only CAP_SYS_ADMIN
processes. The CAP_SYS_ADMIN-only mode is not ideal because most users
do not have this capability and granting a user CAP_SYS_ADMIN grants too
many other security-sensitive permissions. It also unnecessarily allows
all CAP_SYS_ADMIN processes access to eBPF functionality. Allowing all
processes to access to eBPF objects is also undesirable since it has
potential to allow unprivileged processes to consume kernel memory, and
opens up attack surface to the kernel.
Adding LSM hooks maintains the status quo for systems which do not use
an LSM, preserving compatibility with userspace, while allowing security
modules to choose how best to handle permissions on eBPF objects. Here
is a possible use case for the lsm hooks with selinux module:
The network-control daemon (netd) creates and loads an eBPF object for
network packet filtering and analysis. It passes the object FD to an
unprivileged network monitor app (netmonitor), which is not allowed to
create, modify or load eBPF objects, but is allowed to read the traffic
stats from the map.
Selinux could use these hooks to grant the following permissions:
allow netd self:bpf_map { create read write};
allow netmonitor netd:fd use;
allow netmonitor netd:bpf_map read;
In this patch series, A file mode is added to bpf map to store the
accessing mode. With this file mode flags, the map can be obtained read
only, write only or read and write. With the help of this file mode,
several security hooks can be added to the eBPF syscall implementations
to do permissions checks. These LSM hooks are mainly focused on checking
the process privileges before it obtains the fd for a specific bpf
object. No matter from a file location or from a eBPF id. Besides that,
a general check hook is also implemented at the start of bpf syscalls so
that each security module can have their own implementation on the reset
of bpf object related functionalities.
In order to store the ownership and security information about eBPF
maps, a security field pointer is added to the struct bpf_map. And the
last two patch set are implementation of selinux check on these hooks
introduced, plus an additional check when eBPF object is passed between
processes using unix socket as well as binder IPC.
Change since V1:
- Whitelist the new bpf flags in the map allocate check.
- Added bpf selftest for the new flags.
- Added two new security hooks for copying the security information from
the bpf object security struct to file security struct
- Simplified the checking action when bpf fd is passed between processes.
Change since V2:
- Fixed the line break problem for map flags check
- Fixed the typo in selinux check of file mode.
- Merge bpf_map and bpf_prog into one selinux class
- Added bpf_type and bpf_sid into file security struct to store the
security information when generate fd.
- Add the hook to bpf_map_new_fd and bpf_prog_new_fd.
Chenbo Feng (5):
bpf: Add file mode configuration into bpf maps
bpf: Add tests for eBPF file mode
security: bpf: Add LSM hooks for bpf object related syscall
selinux: bpf: Add selinux check for eBPF syscall operations
selinux: bpf: Add addtional check for bpf object file receive
include/linux/bpf.h | 12 ++-
include/linux/lsm_hooks.h | 71 +++++++++++++
include/linux/security.h | 54 ++++++++++
include/uapi/linux/bpf.h | 6 ++
kernel/bpf/arraymap.c | 6 +-
kernel/bpf/devmap.c | 5 +-
kernel/bpf/hashtab.c | 5 +-
kernel/bpf/inode.c | 15 ++-
kernel/bpf/lpm_trie.c | 3 +-
kernel/bpf/sockmap.c | 5 +-
kernel/bpf/stackmap.c | 5 +-
kernel/bpf/syscall.c | 108 ++++++++++++++++++--
security/security.c | 40 ++++++++
security/selinux/hooks.c | 174 ++++++++++++++++++++++++++++++++
security/selinux/include/classmap.h | 2 +
security/selinux/include/objsec.h | 13 +++
tools/testing/selftests/bpf/test_maps.c | 48 +++++++++
17 files changed, 548 insertions(+), 24 deletions(-)
--
2.14.2.920.gcf0c67979c-goog
^ permalink raw reply
page: next (older) | prev (newer) | latest
- recent:[subjects (threaded)|topics (new)|topics (active)]
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox