* [PATCH 2/2] userns: control capabilities of some user namespaces
From: Mahesh Bandewar @ 2017-09-29 23:10 UTC (permalink / raw)
To: LKML
Cc: Netdev, Kernel-hardening, Linux API, Kees Cook, Serge Hallyn,
Eric W . Biederman, Eric Dumazet, David Miller, Mahesh Bandewar,
Mahesh Bandewar
From: Mahesh Bandewar <maheshb@google.com>
With this new notion of "controlled" user-namespaces, the controlled
user-namespaces are marked at the time of their creation while the
capabilities of processes that belong to them are controlled using the
global mask.
Init-user-ns is always uncontrolled and a process that has SYS_ADMIN
that belongs to uncontrolled user-ns can create another (child) user-
namespace that is uncontrolled. Any other process (that either does
not have SYS_ADMIN or belongs to a controlled user-ns) can only
create a user-ns that is controlled.
global-capability-whitelist (controlled_userns_caps_whitelist) is used
at the capability check-time and keeps the semantics for the processes
that belong to uncontrolled user-ns as it is. Processes that belong to
controlled user-ns however are subjected to different checks-
(a) if the capability in question is controlled and process belongs
to controlled user-ns, then it's always denied.
(b) if the capability in question is NOT controlled then fall back
to the traditional check.
Signed-off-by: Mahesh Bandewar <maheshb@google.com>
---
include/linux/capability.h | 1 +
include/linux/user_namespace.h | 20 ++++++++++++++++++++
kernel/capability.c | 5 +++++
kernel/user_namespace.c | 3 +++
security/commoncap.c | 8 ++++++++
5 files changed, 37 insertions(+)
diff --git a/include/linux/capability.h b/include/linux/capability.h
index 6c0b9677c03f..b8c6cac18658 100644
--- a/include/linux/capability.h
+++ b/include/linux/capability.h
@@ -250,6 +250,7 @@ extern bool ptracer_capable(struct task_struct *tsk, struct user_namespace *ns);
extern int get_vfs_caps_from_disk(const struct dentry *dentry, struct cpu_vfs_cap_data *cpu_caps);
int proc_douserns_caps_whitelist(struct ctl_table *table, int write,
void __user *buff, size_t *lenp, loff_t *ppos);
+bool is_capability_controlled(int cap);
extern int cap_convert_nscap(struct dentry *dentry, void **ivalue, size_t size);
diff --git a/include/linux/user_namespace.h b/include/linux/user_namespace.h
index c18e01252346..e890fe81b47e 100644
--- a/include/linux/user_namespace.h
+++ b/include/linux/user_namespace.h
@@ -22,6 +22,7 @@ struct uid_gid_map { /* 64 bytes -- 1 cache line */
};
#define USERNS_SETGROUPS_ALLOWED 1UL
+#define USERNS_CONTROLLED 2UL
#define USERNS_INIT_FLAGS USERNS_SETGROUPS_ALLOWED
@@ -102,6 +103,16 @@ static inline void put_user_ns(struct user_namespace *ns)
__put_user_ns(ns);
}
+static inline bool is_user_ns_controlled(const struct user_namespace *ns)
+{
+ return ns->flags & USERNS_CONTROLLED;
+}
+
+static inline void mark_user_ns_controlled(struct user_namespace *ns)
+{
+ ns->flags |= USERNS_CONTROLLED;
+}
+
struct seq_operations;
extern const struct seq_operations proc_uid_seq_operations;
extern const struct seq_operations proc_gid_seq_operations;
@@ -160,6 +171,15 @@ static inline struct ns_common *ns_get_owner(struct ns_common *ns)
{
return ERR_PTR(-EPERM);
}
+
+static inline bool is_user_ns_controlled(const struct user_namespace *ns)
+{
+ return false;
+}
+
+static inline void mark_user_ns_controlled(struct user_namespace *ns)
+{
+}
#endif
#endif /* _LINUX_USER_H */
diff --git a/kernel/capability.c b/kernel/capability.c
index 62dbe3350c1b..40a38cc4ff43 100644
--- a/kernel/capability.c
+++ b/kernel/capability.c
@@ -510,6 +510,11 @@ bool ptracer_capable(struct task_struct *tsk, struct user_namespace *ns)
}
/* Controlled-userns capabilities routines */
+bool is_capability_controlled(int cap)
+{
+ return !cap_raised(controlled_userns_caps_whitelist, cap);
+}
+
#ifdef CONFIG_SYSCTL
int proc_douserns_caps_whitelist(struct ctl_table *table, int write,
void __user *buff, size_t *lenp, loff_t *ppos)
diff --git a/kernel/user_namespace.c b/kernel/user_namespace.c
index c490f1e4313b..f393ea5108f0 100644
--- a/kernel/user_namespace.c
+++ b/kernel/user_namespace.c
@@ -53,6 +53,9 @@ static void set_cred_user_ns(struct cred *cred, struct user_namespace *user_ns)
cred->cap_effective = CAP_FULL_SET;
cred->cap_ambient = CAP_EMPTY_SET;
cred->cap_bset = CAP_FULL_SET;
+ if (!ns_capable(user_ns->parent, CAP_SYS_ADMIN) ||
+ is_user_ns_controlled(user_ns->parent))
+ mark_user_ns_controlled(user_ns);
#ifdef CONFIG_KEYS
key_put(cred->request_key_auth);
cred->request_key_auth = NULL;
diff --git a/security/commoncap.c b/security/commoncap.c
index 6bf72b175b49..26f41602da10 100644
--- a/security/commoncap.c
+++ b/security/commoncap.c
@@ -73,6 +73,14 @@ int cap_capable(const struct cred *cred, struct user_namespace *targ_ns,
{
struct user_namespace *ns = targ_ns;
+ /* If the capability is controlled and user-ns that process
+ * belongs-to is 'controlled' then return EPERM and no need
+ * to check the user-ns hierarchy.
+ */
+ if (is_user_ns_controlled(cred->user_ns) &&
+ is_capability_controlled(cap))
+ return -EPERM;
+
/* See if cred has the capability in the target user namespace
* by examining the target user namespace and all of the target
* user namespace's parents.
--
2.14.2.822.g60be5d43e6-goog
^ permalink raw reply related
* [PATCH 1/2] capability: introduce sysctl for controlled user-ns capability whitelist
From: Mahesh Bandewar @ 2017-09-29 23:10 UTC (permalink / raw)
To: LKML
Cc: Netdev, Kernel-hardening, Linux API, Kees Cook, Serge Hallyn,
Eric W . Biederman, Eric Dumazet, David Miller, Mahesh Bandewar,
Mahesh Bandewar
From: Mahesh Bandewar <maheshb@google.com>
Add a sysctl variable kernel.controlled_userns_caps_whitelist. This
takes input as capability mask expressed as two comma separated hex
u32 words. The mask, however, is stored in kernel as kernel_cap_t type.
Any capabilities that are not part of this mask will be controlled and
will not be allowed to processes in controlled user-ns.
Signed-off-by: Mahesh Bandewar <maheshb@google.com>
---
Documentation/sysctl/kernel.txt | 21 ++++++++++++++++++
include/linux/capability.h | 3 +++
kernel/capability.c | 47 +++++++++++++++++++++++++++++++++++++++++
kernel/sysctl.c | 5 +++++
4 files changed, 76 insertions(+)
diff --git a/Documentation/sysctl/kernel.txt b/Documentation/sysctl/kernel.txt
index 694968c7523c..a1d39dbae847 100644
--- a/Documentation/sysctl/kernel.txt
+++ b/Documentation/sysctl/kernel.txt
@@ -25,6 +25,7 @@ show up in /proc/sys/kernel:
- bootloader_version [ X86 only ]
- callhome [ S390 only ]
- cap_last_cap
+- controlled_userns_caps_whitelist
- core_pattern
- core_pipe_limit
- core_uses_pid
@@ -187,6 +188,26 @@ CAP_LAST_CAP from the kernel.
==============================================================
+controlled_userns_caps_whitelist
+
+Capability mask that is whitelisted for "controlled" user namespaces.
+Any capability that is missing from this mask will not be allowed to
+any process that is attached to a controlled-userns. e.g. if CAP_NET_RAW
+is not part of this mask, then processes running inside any controlled
+userns's will not be allowed to perform action that needs CAP_NET_RAW
+capability. However, processes that are attached to a parent user-ns
+hierarchy that is *not* controlled and has CAP_NET_RAW can continue
+performing those actions. User-namespaces are marked "controlled" at
+the time of their creation based on the capabilities of the creator.
+A process that does not have CAP_SYS_ADMIN will create user-namespaces
+that are controlled.
+
+The value is expressed as two comma separated hex words (u32). This
+sysctl is avaialble in init-ns and users with CAP_SYS_ADMIN in init-ns
+are allowed to make changes.
+
+==============================================================
+
core_pattern:
core_pattern is used to specify a core dumpfile pattern name.
diff --git a/include/linux/capability.h b/include/linux/capability.h
index b52e278e4744..6c0b9677c03f 100644
--- a/include/linux/capability.h
+++ b/include/linux/capability.h
@@ -13,6 +13,7 @@
#define _LINUX_CAPABILITY_H
#include <uapi/linux/capability.h>
+#include <linux/sysctl.h>
#define _KERNEL_CAPABILITY_VERSION _LINUX_CAPABILITY_VERSION_3
@@ -247,6 +248,8 @@ extern bool ptracer_capable(struct task_struct *tsk, struct user_namespace *ns);
/* audit system wants to get cap info from files as well */
extern int get_vfs_caps_from_disk(const struct dentry *dentry, struct cpu_vfs_cap_data *cpu_caps);
+int proc_douserns_caps_whitelist(struct ctl_table *table, int write,
+ void __user *buff, size_t *lenp, loff_t *ppos);
extern int cap_convert_nscap(struct dentry *dentry, void **ivalue, size_t size);
diff --git a/kernel/capability.c b/kernel/capability.c
index f97fe77ceb88..62dbe3350c1b 100644
--- a/kernel/capability.c
+++ b/kernel/capability.c
@@ -28,6 +28,8 @@ EXPORT_SYMBOL(__cap_empty_set);
int file_caps_enabled = 1;
+kernel_cap_t controlled_userns_caps_whitelist = CAP_FULL_SET;
+
static int __init file_caps_disable(char *str)
{
file_caps_enabled = 0;
@@ -506,3 +508,48 @@ bool ptracer_capable(struct task_struct *tsk, struct user_namespace *ns)
rcu_read_unlock();
return (ret == 0);
}
+
+/* Controlled-userns capabilities routines */
+#ifdef CONFIG_SYSCTL
+int proc_douserns_caps_whitelist(struct ctl_table *table, int write,
+ void __user *buff, size_t *lenp, loff_t *ppos)
+{
+ DECLARE_BITMAP(caps_bitmap, CAP_LAST_CAP);
+ struct ctl_table caps_table;
+ char tbuf[NAME_MAX];
+ int ret;
+
+ ret = bitmap_from_u32array(caps_bitmap, CAP_LAST_CAP,
+ controlled_userns_caps_whitelist.cap,
+ _KERNEL_CAPABILITY_U32S);
+ if (ret != CAP_LAST_CAP)
+ return -1;
+
+ scnprintf(tbuf, NAME_MAX, "%*pb", CAP_LAST_CAP, caps_bitmap);
+
+ caps_table.data = tbuf;
+ caps_table.maxlen = NAME_MAX;
+ caps_table.mode = table->mode;
+ ret = proc_dostring(&caps_table, write, buff, lenp, ppos);
+ if (ret)
+ return ret;
+ if (write) {
+ kernel_cap_t tmp;
+
+ if (!capable(CAP_SYS_ADMIN))
+ return -EPERM;
+
+ ret = bitmap_parse_user(buff, *lenp, caps_bitmap, CAP_LAST_CAP);
+ if (ret)
+ return ret;
+
+ ret = bitmap_to_u32array(tmp.cap, _KERNEL_CAPABILITY_U32S,
+ caps_bitmap, CAP_LAST_CAP);
+ if (ret != CAP_LAST_CAP)
+ return -1;
+
+ controlled_userns_caps_whitelist = tmp;
+ }
+ return 0;
+}
+#endif /* CONFIG_SYSCTL */
diff --git a/kernel/sysctl.c b/kernel/sysctl.c
index 6648fbbb8157..9903cf0de287 100644
--- a/kernel/sysctl.c
+++ b/kernel/sysctl.c
@@ -1229,6 +1229,11 @@ static struct ctl_table kern_table[] = {
.extra2 = &one,
},
#endif
+ {
+ .procname = "controlled_userns_caps_whitelist",
+ .mode = 0644,
+ .proc_handler = proc_douserns_caps_whitelist,
+ },
{ }
};
--
2.14.2.822.g60be5d43e6-goog
^ permalink raw reply related
* [PATCH 0/2] capability controlled user-namespaces
From: Mahesh Bandewar @ 2017-09-29 23:09 UTC (permalink / raw)
To: LKML
Cc: Netdev, Kernel-hardening, Linux API, Kees Cook, Serge Hallyn,
Eric W . Biederman, Eric Dumazet, David Miller, Mahesh Bandewar,
Mahesh Bandewar
From: Mahesh Bandewar <maheshb-hpIqsD4AKlfQT0dZR+AlfA@public.gmane.org>
[Same as the previous RFC series sent on 9/21]
TL;DR version
-------------
Creating a sandbox environment with namespaces is challenging
considering what these sandboxed processes can engage into. e.g.
CVE-2017-6074, CVE-2017-7184, CVE-2017-7308 etc. just to name few.
Current form of user-namespaces, however, if changed a bit can allow
us to create a sandbox environment without locking down user-
namespaces.
Detailed version
----------------
Problem
-------
User-namespaces in the current form have increased the attack surface as
any process can acquire capabilities which are not available to them (by
default) by performing combination of clone()/unshare()/setns() syscalls.
#define _GNU_SOURCE
#include <stdio.h>
#include <sched.h>
#include <netinet/in.h>
int main(int ac, char **av)
{
int sock = -1;
printf("Attempting to open RAW socket before unshare()...\n");
sock = socket(AF_INET6, SOCK_RAW, IPPROTO_RAW);
if (sock < 0) {
perror("socket() SOCK_RAW failed: ");
} else {
printf("Successfully opened RAW-Sock before unshare().\n");
close(sock);
sock = -1;
}
if (unshare(CLONE_NEWUSER | CLONE_NEWNET) < 0) {
perror("unshare() failed: ");
return 1;
}
printf("Attempting to open RAW socket after unshare()...\n");
sock = socket(AF_INET6, SOCK_RAW, IPPROTO_RAW);
if (sock < 0) {
perror("socket() SOCK_RAW failed: ");
} else {
printf("Successfully opened RAW-Sock after unshare().\n");
close(sock);
sock = -1;
}
return 0;
}
The above example shows how easy it is to acquire NET_RAW capabilities
and once acquired, these processes could take benefit of above mentioned
or similar issues discovered/undiscovered with malicious intent. Note
that this is just an example and the problem/solution is not limited
to NET_RAW capability *only*.
The easiest fix one can apply here is to lock-down user-namespaces which
many of the distros do (i.e. don't allow users to create user namespaces),
but unfortunately that prevents everyone from using them.
Approach
--------
Introduce a notion of 'controlled' user-namespaces. Every process on
the host is allowed to create user-namespaces (governed by the limit
imposed by per-ns sysctl) however, mark user-namespaces created by
sandboxed processes as 'controlled'. Use this 'mark' at the time of
capability check in conjunction with a global capability whitelist.
If the capability is not whitelisted, processes that belong to
controlled user-namespaces will not be allowed.
Once a user-ns is marked as 'controlled'; all its child user-
namespaces are marked as 'controlled' too.
A global whitelist is list of capabilities governed by the
sysctl which is available to (privileged) user in init-ns to modify
while it's applicable to all controlled user-namespaces on the host.
Marking user-namespaces controlled without modifying the whitelist is
equivalent of the current behavior. The default value of whitelist includes
all capabilities so that the compatibility is maintained. However it gives
admins fine-grained ability to control various capabilities system wide
without locking down user-namespaces.
Please see individual patches in this series.
Mahesh Bandewar (2):
capability: introduce sysctl for controlled user-ns capability
whitelist
userns: control capabilities of some user namespaces
Documentation/sysctl/kernel.txt | 21 +++++++++++++++++
include/linux/capability.h | 4 ++++
include/linux/user_namespace.h | 20 ++++++++++++++++
kernel/capability.c | 52 +++++++++++++++++++++++++++++++++++++++++
kernel/sysctl.c | 5 ++++
kernel/user_namespace.c | 3 +++
security/commoncap.c | 8 +++++++
7 files changed, 113 insertions(+)
--
2.14.2.822.g60be5d43e6-goog
^ permalink raw reply
* Re: [PATCH net-next] net: bridge: add per-port group_fwd_mask with less restrictions
From: Roopa Prabhu @ 2017-09-29 22:21 UTC (permalink / raw)
To: Nikolay Aleksandrov; +Cc: Stephen Hemminger, netdev@vger.kernel.org, bridge
In-Reply-To: <5dd9e4e8-466d-47dc-1257-09823ba783b9@cumulusnetworks.com>
On Fri, Sep 29, 2017 at 3:11 PM, Nikolay Aleksandrov
<nikolay@cumulusnetworks.com> wrote:
> On 30/09/17 00:51, Stephen Hemminger wrote:
>> On Sat, 30 Sep 2017 00:01:24 +0300
>> Nikolay Aleksandrov <nikolay@cumulusnetworks.com> wrote:
>>
>>> On 29/09/17 18:14, Stephen Hemminger wrote:
>>>> On Wed, 27 Sep 2017 16:12:44 +0300
>>>> Nikolay Aleksandrov <nikolay@cumulusnetworks.com> wrote:
>>>>
>>>>> We need to be able to transparently forward most link-local frames via
>>>>> tunnels (e.g. vxlan, qinq). Currently the bridge's group_fwd_mask has a
>>>>> mask which restricts the forwarding of STP and LACP, but we need to be able
>>>>> to forward these over tunnels and control that forwarding on a per-port
>>>>> basis thus add a new per-port group_fwd_mask option which only disallows
>>>>> mac pause frames to be forwarded (they're always dropped anyway).
>>>>> The patch does not change the current default situation - all of the others
>>>>> are still restricted unless configured for forwarding.
>>>>> We have successfully tested this patch with LACP and STP forwarding over
>>>>> VxLAN and qinq tunnels.
>>>>>
>>>>> Signed-off-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com>
>>>>
>>>> LACP is fine, but STP must not be forwarded if STP in user or kernel
>>>> mode is enabled.
>>>>
>>>> Please update this patch or revert it.
>>>>
>>>
>>> The default has not changed, STP is still _not_ forwarded. It can be only if explicitly
>>> requested by the user and that means the port might be participating in STP but not
>>> the bridge's STP, that is explicitly forward all STP frames from that port.
>>> I don't think we have to change anything.
>>>
>>
>> You need to fail if user does something stupid. And DNR.
>>
>
> I get the motivation, but it does not have to be stupid. It may be 802.1q in q, not 802.1ad
> where STP is already forwarded by default, or it may be that the port is participating
> in a different STP. Wouldn't be enough to warn the user that STP forwarding was enabled
> for that port, instead of forcing it off and having it only for 802.1ad ?
> Later when we upstream 802.1Q in q patches we'll have to work around that anyway.
>
>> From 16c42db4852234c13288922c1bffbf7650e08434 Mon Sep 17 00:00:00 2001
>> From: Stephen Hemminger <sthemmin@microsoft.com>
>> Date: Fri, 29 Sep 2017 14:48:17 -0700
>> Subject: [PATCH] bridge: check for attempt to forward STP PDU's with STP
>> enabled
>>
>> Move validation into set function and do not allow
>> configuring forwarding of STP packets if STP is enabled.
>>
>> Fixes: 5af48b59f35c ("net: bridge: add per-port group_fwd_mask with less restrictions")
>
> What if the user requested explicitly to forward these frames from that port ?
>
exactly and this could be on a provider bridge which is just tunneling
stp frames from a customer.
^ permalink raw reply
* Re: [PATCH net-next] net: bridge: add per-port group_fwd_mask with less restrictions
From: Nikolay Aleksandrov @ 2017-09-29 22:11 UTC (permalink / raw)
To: Stephen Hemminger; +Cc: netdev, roopa, bridge
In-Reply-To: <20170929145103.1e30f73c@xeon-e3>
On 30/09/17 00:51, Stephen Hemminger wrote:
> On Sat, 30 Sep 2017 00:01:24 +0300
> Nikolay Aleksandrov <nikolay@cumulusnetworks.com> wrote:
>
>> On 29/09/17 18:14, Stephen Hemminger wrote:
>>> On Wed, 27 Sep 2017 16:12:44 +0300
>>> Nikolay Aleksandrov <nikolay@cumulusnetworks.com> wrote:
>>>
>>>> We need to be able to transparently forward most link-local frames via
>>>> tunnels (e.g. vxlan, qinq). Currently the bridge's group_fwd_mask has a
>>>> mask which restricts the forwarding of STP and LACP, but we need to be able
>>>> to forward these over tunnels and control that forwarding on a per-port
>>>> basis thus add a new per-port group_fwd_mask option which only disallows
>>>> mac pause frames to be forwarded (they're always dropped anyway).
>>>> The patch does not change the current default situation - all of the others
>>>> are still restricted unless configured for forwarding.
>>>> We have successfully tested this patch with LACP and STP forwarding over
>>>> VxLAN and qinq tunnels.
>>>>
>>>> Signed-off-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com>
>>>
>>> LACP is fine, but STP must not be forwarded if STP in user or kernel
>>> mode is enabled.
>>>
>>> Please update this patch or revert it.
>>>
>>
>> The default has not changed, STP is still _not_ forwarded. It can be only if explicitly
>> requested by the user and that means the port might be participating in STP but not
>> the bridge's STP, that is explicitly forward all STP frames from that port.
>> I don't think we have to change anything.
>>
>
> You need to fail if user does something stupid. And DNR.
>
I get the motivation, but it does not have to be stupid. It may be 802.1q in q, not 802.1ad
where STP is already forwarded by default, or it may be that the port is participating
in a different STP. Wouldn't be enough to warn the user that STP forwarding was enabled
for that port, instead of forcing it off and having it only for 802.1ad ?
Later when we upstream 802.1Q in q patches we'll have to work around that anyway.
> From 16c42db4852234c13288922c1bffbf7650e08434 Mon Sep 17 00:00:00 2001
> From: Stephen Hemminger <sthemmin@microsoft.com>
> Date: Fri, 29 Sep 2017 14:48:17 -0700
> Subject: [PATCH] bridge: check for attempt to forward STP PDU's with STP
> enabled
>
> Move validation into set function and do not allow
> configuring forwarding of STP packets if STP is enabled.
>
> Fixes: 5af48b59f35c ("net: bridge: add per-port group_fwd_mask with less restrictions")
What if the user requested explicitly to forward these frames from that port ?
> Signed-off-by: Stephen Hemminger <stephen@networkplumber.org>
> ---
> net/bridge/br_if.c | 13 +++++++++++++
> net/bridge/br_netlink.c | 6 +++---
> net/bridge/br_private.h | 1 +
> net/bridge/br_sysfs_if.c | 6 +-----
> 4 files changed, 18 insertions(+), 8 deletions(-)
>
> diff --git a/net/bridge/br_if.c b/net/bridge/br_if.c
> index f3aef22931ab..a30e12f76266 100644
> --- a/net/bridge/br_if.c
> +++ b/net/bridge/br_if.c
> @@ -654,3 +654,16 @@ void br_port_flags_change(struct net_bridge_port *p, unsigned long mask)
> if (mask & BR_AUTO_MASK)
> nbp_update_port_count(br);
> }
> +
> +int br_set_group_fwd_mask(struct net_bridge_port *p, u16 fwd_mask)
> +{
> + if (fwd_mask & BR_GROUPFWD_MACPAUSE)
> + return -EINVAL;
> +
> + if ((fwd_mask & BR_GROUPFWD_STP) &&
> + (p->br->stp_enabled != BR_NO_STP))
> + return -EINVAL;
> +
> + p->group_fwd_mask = fwd_mask;
> + return 0;
> +}
> diff --git a/net/bridge/br_netlink.c b/net/bridge/br_netlink.c
> index dea88a255d26..7b16819ecb59 100644
> --- a/net/bridge/br_netlink.c
> +++ b/net/bridge/br_netlink.c
> @@ -780,9 +780,9 @@ static int br_setport(struct net_bridge_port *p, struct nlattr *tb[])
> if (tb[IFLA_BRPORT_GROUP_FWD_MASK]) {
> u16 fwd_mask = nla_get_u16(tb[IFLA_BRPORT_GROUP_FWD_MASK]);
>
> - if (fwd_mask & BR_GROUPFWD_MACPAUSE)
> - return -EINVAL;
> - p->group_fwd_mask = fwd_mask;
> + err = br_set_group_fwd_mask(p, fwd_mask);
> + if (err)
> + return err;
> }
>
> br_port_flags_change(p, old_flags ^ p->flags);
> diff --git a/net/bridge/br_private.h b/net/bridge/br_private.h
> index 020c709a017f..734933bebb08 100644
> --- a/net/bridge/br_private.h
> +++ b/net/bridge/br_private.h
> @@ -573,6 +573,7 @@ netdev_features_t br_features_recompute(struct net_bridge *br,
> netdev_features_t features);
> void br_port_flags_change(struct net_bridge_port *port, unsigned long mask);
> void br_manage_promisc(struct net_bridge *br);
> +int br_set_group_fwd_mask(struct net_bridge_port *p, u16 mask);
>
> /* br_input.c */
> int br_handle_frame_finish(struct net *net, struct sock *sk, struct sk_buff *skb);
> diff --git a/net/bridge/br_sysfs_if.c b/net/bridge/br_sysfs_if.c
> index 9110d5e56085..f5871be10b24 100644
> --- a/net/bridge/br_sysfs_if.c
> +++ b/net/bridge/br_sysfs_if.c
> @@ -173,11 +173,7 @@ static ssize_t show_group_fwd_mask(struct net_bridge_port *p, char *buf)
> static int store_group_fwd_mask(struct net_bridge_port *p,
> unsigned long v)
> {
> - if (v & BR_GROUPFWD_MACPAUSE)
> - return -EINVAL;
> - p->group_fwd_mask = v;
> -
> - return 0;
> + return br_set_group_fwd_mask(p, v);
> }
> static BRPORT_ATTR(group_fwd_mask, S_IRUGO | S_IWUSR, show_group_fwd_mask,
> store_group_fwd_mask);
>
^ permalink raw reply
* Re: [PATCH next] bonding: speed/duplex update at NETDEV_UP event
From: Mahesh Bandewar (महेश बंडेवार) @ 2017-09-29 22:02 UTC (permalink / raw)
To: Stephen Hemminger
Cc: Mahesh Bandewar, Jay Vosburgh, Andy Gospodarek, Veaceslav Falico,
David Miller, Netdev
In-Reply-To: <20170929130819.74879e99@xeon-e3>
On Fri, Sep 29, 2017 at 1:08 PM, Stephen Hemminger
<stephen@networkplumber.org> wrote:
> On Wed, 27 Sep 2017 18:03:49 -0700
> Mahesh Bandewar <mahesh@bandewar.net> wrote:
>
>> From: Mahesh Bandewar <maheshb@google.com>
>>
>> Some NIC drivers don't have correct speed/duplex settings at the
>> time they send NETDEV_UP notification and that messes up the
>> bonding state. Especially 802.3ad mode which is very sensitive
>> to these settings. In the current implementation we invoke
>> bond_update_speed_duplex() when we receive NETDEV_UP, however,
>> ignore the return value. If the values we get are invalid
>> (UNKNOWN), then slave gets removed from the aggregator with
>> speed and duplex set to UNKNOWN while link is still marked as UP.
>>
>> This patch fixes this scenario. Also 802.3ad mode is sensitive to
>> these conditions while other modes are not, so making sure that it
>> doesn't change the behavior for other modes.
>>
>> Signed-off-by: Mahesh Bandewar <maheshb@google.com>
>> ---
>> drivers/net/bonding/bond_main.c | 11 ++++++++++-
>> 1 file changed, 10 insertions(+), 1 deletion(-)
>>
>> diff --git a/drivers/net/bonding/bond_main.c b/drivers/net/bonding/bond_main.c
>> index b7313c1d9dcd..177be373966b 100644
>> --- a/drivers/net/bonding/bond_main.c
>> +++ b/drivers/net/bonding/bond_main.c
>> @@ -3076,7 +3076,16 @@ static int bond_slave_netdev_event(unsigned long event,
>> break;
>> case NETDEV_UP:
>> case NETDEV_CHANGE:
>> - bond_update_speed_duplex(slave);
>> + /* For 802.3ad mode only:
>> + * Getting invalid Speed/Duplex values here will put slave
>> + * in weird state. So mark it as link-down for the time
>> + * being and let link-monitoring (miimon) set it right when
>> + * correct speeds/duplex are available.
>> + */
>> + if (bond_update_speed_duplex(slave) &&
>> + BOND_MODE(bond) == BOND_MODE_8023AD)
>> + slave->link = BOND_LINK_DOWN;
>> +
>> if (BOND_MODE(bond) == BOND_MODE_8023AD)
>> bond_3ad_adapter_speed_duplex_changed(slave);
>> /* Fallthrough */
>
> Then fix the drivers. Trying to workaround it here isn't helping.
>
This is not a workaround. It avoids bonding state being weird.
> The problem is that miimon is not required. Bonding can be purely
> event driven.
>
really? Here is a code-snippet from bonding itself -
/* reset values for 802.3ad/TLB/ALB */
if (!bond_mode_uses_arp(bond_mode)) {
if (!miimon) {
pr_warn("Warning: miimon must be specified,
otherwise bonding will not detect link failure, speed and duplex which
are essential for 802.3ad operation\n");
pr_warn("Forcing miimon to 100msec\n");
miimon = BOND_DEFAULT_MIIMON;
}
}
^ permalink raw reply
* Re: [PATCH net-next] net: bridge: add per-port group_fwd_mask with less restrictions
From: Stephen Hemminger @ 2017-09-29 21:51 UTC (permalink / raw)
To: Nikolay Aleksandrov; +Cc: netdev, roopa, bridge
In-Reply-To: <e9a70870-9b6b-0ec4-de1e-d41da80fd10f@cumulusnetworks.com>
On Sat, 30 Sep 2017 00:01:24 +0300
Nikolay Aleksandrov <nikolay@cumulusnetworks.com> wrote:
> On 29/09/17 18:14, Stephen Hemminger wrote:
> > On Wed, 27 Sep 2017 16:12:44 +0300
> > Nikolay Aleksandrov <nikolay@cumulusnetworks.com> wrote:
> >
> >> We need to be able to transparently forward most link-local frames via
> >> tunnels (e.g. vxlan, qinq). Currently the bridge's group_fwd_mask has a
> >> mask which restricts the forwarding of STP and LACP, but we need to be able
> >> to forward these over tunnels and control that forwarding on a per-port
> >> basis thus add a new per-port group_fwd_mask option which only disallows
> >> mac pause frames to be forwarded (they're always dropped anyway).
> >> The patch does not change the current default situation - all of the others
> >> are still restricted unless configured for forwarding.
> >> We have successfully tested this patch with LACP and STP forwarding over
> >> VxLAN and qinq tunnels.
> >>
> >> Signed-off-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com>
> >
> > LACP is fine, but STP must not be forwarded if STP in user or kernel
> > mode is enabled.
> >
> > Please update this patch or revert it.
> >
>
> The default has not changed, STP is still _not_ forwarded. It can be only if explicitly
> requested by the user and that means the port might be participating in STP but not
> the bridge's STP, that is explicitly forward all STP frames from that port.
> I don't think we have to change anything.
>
You need to fail if user does something stupid. And DNR.
From 16c42db4852234c13288922c1bffbf7650e08434 Mon Sep 17 00:00:00 2001
From: Stephen Hemminger <sthemmin@microsoft.com>
Date: Fri, 29 Sep 2017 14:48:17 -0700
Subject: [PATCH] bridge: check for attempt to forward STP PDU's with STP
enabled
Move validation into set function and do not allow
configuring forwarding of STP packets if STP is enabled.
Fixes: 5af48b59f35c ("net: bridge: add per-port group_fwd_mask with less restrictions")
Signed-off-by: Stephen Hemminger <stephen@networkplumber.org>
---
net/bridge/br_if.c | 13 +++++++++++++
net/bridge/br_netlink.c | 6 +++---
net/bridge/br_private.h | 1 +
net/bridge/br_sysfs_if.c | 6 +-----
4 files changed, 18 insertions(+), 8 deletions(-)
diff --git a/net/bridge/br_if.c b/net/bridge/br_if.c
index f3aef22931ab..a30e12f76266 100644
--- a/net/bridge/br_if.c
+++ b/net/bridge/br_if.c
@@ -654,3 +654,16 @@ void br_port_flags_change(struct net_bridge_port *p, unsigned long mask)
if (mask & BR_AUTO_MASK)
nbp_update_port_count(br);
}
+
+int br_set_group_fwd_mask(struct net_bridge_port *p, u16 fwd_mask)
+{
+ if (fwd_mask & BR_GROUPFWD_MACPAUSE)
+ return -EINVAL;
+
+ if ((fwd_mask & BR_GROUPFWD_STP) &&
+ (p->br->stp_enabled != BR_NO_STP))
+ return -EINVAL;
+
+ p->group_fwd_mask = fwd_mask;
+ return 0;
+}
diff --git a/net/bridge/br_netlink.c b/net/bridge/br_netlink.c
index dea88a255d26..7b16819ecb59 100644
--- a/net/bridge/br_netlink.c
+++ b/net/bridge/br_netlink.c
@@ -780,9 +780,9 @@ static int br_setport(struct net_bridge_port *p, struct nlattr *tb[])
if (tb[IFLA_BRPORT_GROUP_FWD_MASK]) {
u16 fwd_mask = nla_get_u16(tb[IFLA_BRPORT_GROUP_FWD_MASK]);
- if (fwd_mask & BR_GROUPFWD_MACPAUSE)
- return -EINVAL;
- p->group_fwd_mask = fwd_mask;
+ err = br_set_group_fwd_mask(p, fwd_mask);
+ if (err)
+ return err;
}
br_port_flags_change(p, old_flags ^ p->flags);
diff --git a/net/bridge/br_private.h b/net/bridge/br_private.h
index 020c709a017f..734933bebb08 100644
--- a/net/bridge/br_private.h
+++ b/net/bridge/br_private.h
@@ -573,6 +573,7 @@ netdev_features_t br_features_recompute(struct net_bridge *br,
netdev_features_t features);
void br_port_flags_change(struct net_bridge_port *port, unsigned long mask);
void br_manage_promisc(struct net_bridge *br);
+int br_set_group_fwd_mask(struct net_bridge_port *p, u16 mask);
/* br_input.c */
int br_handle_frame_finish(struct net *net, struct sock *sk, struct sk_buff *skb);
diff --git a/net/bridge/br_sysfs_if.c b/net/bridge/br_sysfs_if.c
index 9110d5e56085..f5871be10b24 100644
--- a/net/bridge/br_sysfs_if.c
+++ b/net/bridge/br_sysfs_if.c
@@ -173,11 +173,7 @@ static ssize_t show_group_fwd_mask(struct net_bridge_port *p, char *buf)
static int store_group_fwd_mask(struct net_bridge_port *p,
unsigned long v)
{
- if (v & BR_GROUPFWD_MACPAUSE)
- return -EINVAL;
- p->group_fwd_mask = v;
-
- return 0;
+ return br_set_group_fwd_mask(p, v);
}
static BRPORT_ATTR(group_fwd_mask, S_IRUGO | S_IWUSR, show_group_fwd_mask,
store_group_fwd_mask);
--
2.11.0
^ permalink raw reply related
* [PATCH net-next v2 3/7] net: dsa: use temporary dsa_device_ops variable
From: Vivien Didelot @ 2017-09-29 21:19 UTC (permalink / raw)
To: netdev
Cc: linux-kernel, kernel, David S. Miller, Florian Fainelli,
Andrew Lunn, Vivien Didelot
In-Reply-To: <20170929211921.5571-1-vivien.didelot@savoirfairelinux.com>
When resolving the DSA tagging protocol used by a CPU switch, use a
temporary "tag_ops" variable to store the dsa_device_ops instead of
using directly dst->tag_ops. This will make the future patches moving
this pointer around easier to read.
There is no functional changes.
Signed-off-by: Vivien Didelot <vivien.didelot@savoirfairelinux.com>
Reviewed-by: Florian Fainelli <f.fainelli@gmail.com>
---
net/dsa/dsa2.c | 8 +++++---
net/dsa/legacy.c | 8 +++++---
2 files changed, 10 insertions(+), 6 deletions(-)
diff --git a/net/dsa/dsa2.c b/net/dsa/dsa2.c
index dcccaebde708..6a10c5c1639f 100644
--- a/net/dsa/dsa2.c
+++ b/net/dsa/dsa2.c
@@ -485,6 +485,7 @@ static int dsa_cpu_parse(struct dsa_port *port, u32 index,
struct dsa_switch_tree *dst,
struct dsa_switch *ds)
{
+ const struct dsa_device_ops *tag_ops;
enum dsa_tag_protocol tag_protocol;
struct net_device *ethernet_dev;
struct device_node *ethernet;
@@ -514,13 +515,14 @@ static int dsa_cpu_parse(struct dsa_port *port, u32 index,
ds->cpu_port_mask |= BIT(index);
tag_protocol = ds->ops->get_tag_protocol(ds);
- dst->tag_ops = dsa_resolve_tag_protocol(tag_protocol);
- if (IS_ERR(dst->tag_ops)) {
+ tag_ops = dsa_resolve_tag_protocol(tag_protocol);
+ if (IS_ERR(tag_ops)) {
dev_warn(ds->dev, "No tagger for this switch\n");
ds->cpu_port_mask &= ~BIT(index);
- return PTR_ERR(dst->tag_ops);
+ return PTR_ERR(tag_ops);
}
+ dst->tag_ops = tag_ops;
dst->rcv = dst->tag_ops->rcv;
return 0;
diff --git a/net/dsa/legacy.c b/net/dsa/legacy.c
index ae505d8e4417..8e849013f69d 100644
--- a/net/dsa/legacy.c
+++ b/net/dsa/legacy.c
@@ -144,13 +144,15 @@ static int dsa_switch_setup_one(struct dsa_switch *ds,
* switch.
*/
if (dst->cpu_dp->ds == ds) {
+ const struct dsa_device_ops *tag_ops;
enum dsa_tag_protocol tag_protocol;
tag_protocol = ops->get_tag_protocol(ds);
- dst->tag_ops = dsa_resolve_tag_protocol(tag_protocol);
- if (IS_ERR(dst->tag_ops))
- return PTR_ERR(dst->tag_ops);
+ tag_ops = dsa_resolve_tag_protocol(tag_protocol);
+ if (IS_ERR(tag_ops))
+ return PTR_ERR(tag_ops);
+ dst->tag_ops = tag_ops;
dst->rcv = dst->tag_ops->rcv;
}
--
2.14.1
^ permalink raw reply related
* [PATCH net-next v2 7/7] net: dsa: remove tag ops from the switch tree
From: Vivien Didelot @ 2017-09-29 21:19 UTC (permalink / raw)
To: netdev
Cc: linux-kernel, kernel, David S. Miller, Florian Fainelli,
Andrew Lunn, Vivien Didelot
In-Reply-To: <20170929211921.5571-1-vivien.didelot@savoirfairelinux.com>
Now that the dsa_ptr is a dsa_port instance, there is no need to keep
the tag operations in the dsa_switch_tree structure. Remove it.
Signed-off-by: Vivien Didelot <vivien.didelot@savoirfairelinux.com>
Reviewed-by: Florian Fainelli <f.fainelli@gmail.com>
---
include/net/dsa.h | 11 -----------
net/dsa/dsa2.c | 2 --
net/dsa/legacy.c | 2 --
3 files changed, 15 deletions(-)
diff --git a/include/net/dsa.h b/include/net/dsa.h
index 6bda01fa5747..10dceccd9ce8 100644
--- a/include/net/dsa.h
+++ b/include/net/dsa.h
@@ -130,11 +130,6 @@ struct dsa_switch_tree {
*/
struct dsa_platform_data *pd;
- /* Copy of tag_ops->rcv for faster access in hot path */
- struct sk_buff * (*rcv)(struct sk_buff *skb,
- struct net_device *dev,
- struct packet_type *pt);
-
/*
* The switch port to which the CPU is attached.
*/
@@ -144,12 +139,6 @@ struct dsa_switch_tree {
* Data for the individual switch chips.
*/
struct dsa_switch *ds[DSA_MAX_SWITCHES];
-
- /*
- * Tagging protocol operations for adding and removing an
- * encapsulation tag.
- */
- const struct dsa_device_ops *tag_ops;
};
/* TC matchall action types, only mirroring for now */
diff --git a/net/dsa/dsa2.c b/net/dsa/dsa2.c
index 62302558f38c..54ed054777bd 100644
--- a/net/dsa/dsa2.c
+++ b/net/dsa/dsa2.c
@@ -523,11 +523,9 @@ static int dsa_cpu_parse(struct dsa_port *port, u32 index,
}
dst->cpu_dp->tag_ops = tag_ops;
- dst->tag_ops = tag_ops;
/* Make a few copies for faster access in master receive hot path */
dst->cpu_dp->rcv = dst->cpu_dp->tag_ops->rcv;
- dst->rcv = dst->tag_ops->rcv;
dst->cpu_dp->dst = dst;
return 0;
diff --git a/net/dsa/legacy.c b/net/dsa/legacy.c
index 71917505a5cc..19ff6e0a21dc 100644
--- a/net/dsa/legacy.c
+++ b/net/dsa/legacy.c
@@ -153,11 +153,9 @@ static int dsa_switch_setup_one(struct dsa_switch *ds,
return PTR_ERR(tag_ops);
dst->cpu_dp->tag_ops = tag_ops;
- dst->tag_ops = tag_ops;
/* Few copies for faster access in master receive hot path */
dst->cpu_dp->rcv = dst->cpu_dp->tag_ops->rcv;
- dst->rcv = dst->tag_ops->rcv;
dst->cpu_dp->dst = dst;
}
--
2.14.1
^ permalink raw reply related
* [PATCH net-next v2 6/7] net: dsa: change dsa_ptr for a dsa_port
From: Vivien Didelot @ 2017-09-29 21:19 UTC (permalink / raw)
To: netdev
Cc: linux-kernel, kernel, David S. Miller, Florian Fainelli,
Andrew Lunn, Vivien Didelot
In-Reply-To: <20170929211921.5571-1-vivien.didelot@savoirfairelinux.com>
With DSA, a master net device (CPU facing interface) has a dsa_ptr
pointer to which hangs a dsa_switch_tree. This is not correct because a
master interface is wired to a dedicated switch port, and because we can
theoretically have several master interfaces pointing to several CPU
ports of the same switch fabric.
Change the master interface's dsa_ptr for the CPU dsa_port pointer.
This is a step towards supporting multiple CPU ports.
Signed-off-by: Vivien Didelot <vivien.didelot@savoirfairelinux.com>
---
include/linux/netdevice.h | 4 ++--
net/dsa/dsa.c | 6 +++---
net/dsa/dsa2.c | 2 +-
net/dsa/dsa_priv.h | 3 ++-
net/dsa/legacy.c | 2 +-
net/dsa/master.c | 15 +++++----------
6 files changed, 14 insertions(+), 18 deletions(-)
diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
index f535779d9dc1..e1d6ef130611 100644
--- a/include/linux/netdevice.h
+++ b/include/linux/netdevice.h
@@ -55,7 +55,7 @@
struct netpoll_info;
struct device;
struct phy_device;
-struct dsa_switch_tree;
+struct dsa_port;
/* 802.11 specific */
struct wireless_dev;
@@ -1752,7 +1752,7 @@ struct net_device {
struct vlan_info __rcu *vlan_info;
#endif
#if IS_ENABLED(CONFIG_NET_DSA)
- struct dsa_switch_tree *dsa_ptr;
+ struct dsa_port *dsa_ptr;
#endif
#if IS_ENABLED(CONFIG_TIPC)
struct tipc_bearer __rcu *tipc_ptr;
diff --git a/net/dsa/dsa.c b/net/dsa/dsa.c
index 81c852e32821..51ca2a524a27 100644
--- a/net/dsa/dsa.c
+++ b/net/dsa/dsa.c
@@ -160,12 +160,12 @@ EXPORT_SYMBOL_GPL(dsa_dev_to_net_device);
static int dsa_switch_rcv(struct sk_buff *skb, struct net_device *dev,
struct packet_type *pt, struct net_device *unused)
{
- struct dsa_switch_tree *dst = dev->dsa_ptr;
+ struct dsa_port *cpu_dp = dev->dsa_ptr;
struct sk_buff *nskb = NULL;
struct pcpu_sw_netstats *s;
struct dsa_slave_priv *p;
- if (unlikely(dst == NULL)) {
+ if (unlikely(!cpu_dp)) {
kfree_skb(skb);
return 0;
}
@@ -174,7 +174,7 @@ static int dsa_switch_rcv(struct sk_buff *skb, struct net_device *dev,
if (!skb)
return 0;
- nskb = dst->rcv(skb, dev, pt);
+ nskb = cpu_dp->rcv(skb, dev, pt);
if (!nskb) {
kfree_skb(skb);
return 0;
diff --git a/net/dsa/dsa2.c b/net/dsa/dsa2.c
index b71e3bb478e4..62302558f38c 100644
--- a/net/dsa/dsa2.c
+++ b/net/dsa/dsa2.c
@@ -438,7 +438,7 @@ static int dsa_dst_apply(struct dsa_switch_tree *dst)
* sent to the tag format's receive function.
*/
wmb();
- dst->cpu_dp->netdev->dsa_ptr = dst;
+ dst->cpu_dp->netdev->dsa_ptr = dst->cpu_dp;
err = dsa_master_ethtool_setup(dst->cpu_dp->netdev);
if (err)
diff --git a/net/dsa/dsa_priv.h b/net/dsa/dsa_priv.h
index 9397291bb3aa..2850077cc9cc 100644
--- a/net/dsa/dsa_priv.h
+++ b/net/dsa/dsa_priv.h
@@ -116,7 +116,8 @@ void dsa_master_ethtool_restore(struct net_device *dev);
static inline struct net_device *dsa_master_get_slave(struct net_device *dev,
int device, int port)
{
- struct dsa_switch_tree *dst = dev->dsa_ptr;
+ struct dsa_port *cpu_dp = dev->dsa_ptr;
+ struct dsa_switch_tree *dst = cpu_dp->dst;
struct dsa_switch *ds;
if (device < 0 || device >= DSA_MAX_SWITCHES)
diff --git a/net/dsa/legacy.c b/net/dsa/legacy.c
index 96c7e3f8b8bb..71917505a5cc 100644
--- a/net/dsa/legacy.c
+++ b/net/dsa/legacy.c
@@ -607,7 +607,7 @@ static int dsa_setup_dst(struct dsa_switch_tree *dst, struct net_device *dev,
* sent to the tag format's receive function.
*/
wmb();
- dev->dsa_ptr = dst;
+ dev->dsa_ptr = dst->cpu_dp;
return dsa_master_ethtool_setup(dst->cpu_dp->netdev);
}
diff --git a/net/dsa/master.c b/net/dsa/master.c
index ef15d35f1574..5f3f57e372e0 100644
--- a/net/dsa/master.c
+++ b/net/dsa/master.c
@@ -16,8 +16,7 @@ static void dsa_master_get_ethtool_stats(struct net_device *dev,
struct ethtool_stats *stats,
uint64_t *data)
{
- struct dsa_switch_tree *dst = dev->dsa_ptr;
- struct dsa_port *cpu_dp = dst->cpu_dp;
+ struct dsa_port *cpu_dp = dev->dsa_ptr;
const struct ethtool_ops *ops = cpu_dp->orig_ethtool_ops;
struct dsa_switch *ds = cpu_dp->ds;
int port = cpu_dp->index;
@@ -34,8 +33,7 @@ static void dsa_master_get_ethtool_stats(struct net_device *dev,
static int dsa_master_get_sset_count(struct net_device *dev, int sset)
{
- struct dsa_switch_tree *dst = dev->dsa_ptr;
- struct dsa_port *cpu_dp = dst->cpu_dp;
+ struct dsa_port *cpu_dp = dev->dsa_ptr;
const struct ethtool_ops *ops = cpu_dp->orig_ethtool_ops;
struct dsa_switch *ds = cpu_dp->ds;
int count = 0;
@@ -52,8 +50,7 @@ static int dsa_master_get_sset_count(struct net_device *dev, int sset)
static void dsa_master_get_strings(struct net_device *dev, uint32_t stringset,
uint8_t *data)
{
- struct dsa_switch_tree *dst = dev->dsa_ptr;
- struct dsa_port *cpu_dp = dst->cpu_dp;
+ struct dsa_port *cpu_dp = dev->dsa_ptr;
const struct ethtool_ops *ops = cpu_dp->orig_ethtool_ops;
struct dsa_switch *ds = cpu_dp->ds;
int port = cpu_dp->index;
@@ -90,8 +87,7 @@ static void dsa_master_get_strings(struct net_device *dev, uint32_t stringset,
int dsa_master_ethtool_setup(struct net_device *dev)
{
- struct dsa_switch_tree *dst = dev->dsa_ptr;
- struct dsa_port *cpu_dp = dst->cpu_dp;
+ struct dsa_port *cpu_dp = dev->dsa_ptr;
struct dsa_switch *ds = cpu_dp->ds;
struct ethtool_ops *ops;
@@ -114,8 +110,7 @@ int dsa_master_ethtool_setup(struct net_device *dev)
void dsa_master_ethtool_restore(struct net_device *dev)
{
- struct dsa_switch_tree *dst = dev->dsa_ptr;
- struct dsa_port *cpu_dp = dst->cpu_dp;
+ struct dsa_port *cpu_dp = dev->dsa_ptr;
dev->ethtool_ops = cpu_dp->orig_ethtool_ops;
cpu_dp->orig_ethtool_ops = NULL;
--
2.14.1
^ permalink raw reply related
* [PATCH net-next v2 5/7] net: dsa: prepare master receive hot path
From: Vivien Didelot @ 2017-09-29 21:19 UTC (permalink / raw)
To: netdev
Cc: linux-kernel, kernel, David S. Miller, Florian Fainelli,
Andrew Lunn, Vivien Didelot
In-Reply-To: <20170929211921.5571-1-vivien.didelot@savoirfairelinux.com>
In preparation to make DSA master devices point to their corresponding
CPU port instead of the whole tree, add copies of dst and rcv in the
dsa_port structure so that we keep fast access in the receive hot path.
Also keep the copies at the beginning of the dsa_port structure in order
to ensure they are available in cacheline 1.
Signed-off-by: Vivien Didelot <vivien.didelot@savoirfairelinux.com>
---
include/net/dsa.h | 5 +++++
net/dsa/dsa2.c | 4 ++++
net/dsa/legacy.c | 4 ++++
3 files changed, 13 insertions(+)
diff --git a/include/net/dsa.h b/include/net/dsa.h
index 4d1df2f086e8..6bda01fa5747 100644
--- a/include/net/dsa.h
+++ b/include/net/dsa.h
@@ -178,6 +178,11 @@ struct dsa_port {
/* CPU port tagging operations used by master or slave devices */
const struct dsa_device_ops *tag_ops;
+ /* Copies for faster access in master receive hot path */
+ struct dsa_switch_tree *dst;
+ struct sk_buff *(*rcv)(struct sk_buff *skb, struct net_device *dev,
+ struct packet_type *pt);
+
struct dsa_switch *ds;
unsigned int index;
const char *name;
diff --git a/net/dsa/dsa2.c b/net/dsa/dsa2.c
index 9eac4726dc0c..b71e3bb478e4 100644
--- a/net/dsa/dsa2.c
+++ b/net/dsa/dsa2.c
@@ -524,7 +524,11 @@ static int dsa_cpu_parse(struct dsa_port *port, u32 index,
dst->cpu_dp->tag_ops = tag_ops;
dst->tag_ops = tag_ops;
+
+ /* Make a few copies for faster access in master receive hot path */
+ dst->cpu_dp->rcv = dst->cpu_dp->tag_ops->rcv;
dst->rcv = dst->tag_ops->rcv;
+ dst->cpu_dp->dst = dst;
return 0;
}
diff --git a/net/dsa/legacy.c b/net/dsa/legacy.c
index 4d374541815a..96c7e3f8b8bb 100644
--- a/net/dsa/legacy.c
+++ b/net/dsa/legacy.c
@@ -154,7 +154,11 @@ static int dsa_switch_setup_one(struct dsa_switch *ds,
dst->cpu_dp->tag_ops = tag_ops;
dst->tag_ops = tag_ops;
+
+ /* Few copies for faster access in master receive hot path */
+ dst->cpu_dp->rcv = dst->cpu_dp->tag_ops->rcv;
dst->rcv = dst->tag_ops->rcv;
+ dst->cpu_dp->dst = dst;
}
memcpy(ds->rtable, cd->rtable, sizeof(ds->rtable));
--
2.14.1
^ permalink raw reply related
* [PATCH net-next v2 4/7] net: dsa: add tagging ops to port
From: Vivien Didelot @ 2017-09-29 21:19 UTC (permalink / raw)
To: netdev
Cc: linux-kernel, kernel, David S. Miller, Florian Fainelli,
Andrew Lunn, Vivien Didelot
In-Reply-To: <20170929211921.5571-1-vivien.didelot@savoirfairelinux.com>
The DSA tagging protocol operations are specific to each CPU port,
thus the dsa_device_ops pointer belongs to the dsa_port structure.
>From now on assign a slave's xmit copy from its CPU port tagging
operations. This will ease the future support for multiple CPU ports.
Also keep the tag_ops at the beginning of the dsa_port structure so that
we ensure copies for hot path are in cacheline 1.
Signed-off-by: Vivien Didelot <vivien.didelot@savoirfairelinux.com>
---
include/net/dsa.h | 3 +++
net/dsa/dsa2.c | 1 +
net/dsa/dsa_priv.h | 2 +-
net/dsa/legacy.c | 1 +
net/dsa/slave.c | 3 +--
5 files changed, 7 insertions(+), 3 deletions(-)
diff --git a/include/net/dsa.h b/include/net/dsa.h
index 8dee216a5a9b..4d1df2f086e8 100644
--- a/include/net/dsa.h
+++ b/include/net/dsa.h
@@ -175,6 +175,9 @@ struct dsa_mall_tc_entry {
struct dsa_port {
+ /* CPU port tagging operations used by master or slave devices */
+ const struct dsa_device_ops *tag_ops;
+
struct dsa_switch *ds;
unsigned int index;
const char *name;
diff --git a/net/dsa/dsa2.c b/net/dsa/dsa2.c
index 6a10c5c1639f..9eac4726dc0c 100644
--- a/net/dsa/dsa2.c
+++ b/net/dsa/dsa2.c
@@ -522,6 +522,7 @@ static int dsa_cpu_parse(struct dsa_port *port, u32 index,
return PTR_ERR(tag_ops);
}
+ dst->cpu_dp->tag_ops = tag_ops;
dst->tag_ops = tag_ops;
dst->rcv = dst->tag_ops->rcv;
diff --git a/net/dsa/dsa_priv.h b/net/dsa/dsa_priv.h
index d429505dc4e7..9397291bb3aa 100644
--- a/net/dsa/dsa_priv.h
+++ b/net/dsa/dsa_priv.h
@@ -66,7 +66,7 @@ struct dsa_notifier_vlan_info {
};
struct dsa_slave_priv {
- /* Copy of dp->ds->dst->tag_ops->xmit for faster access in hot path */
+ /* Copy of CPU port xmit for faster access in slave transmit hot path */
struct sk_buff * (*xmit)(struct sk_buff *skb,
struct net_device *dev);
diff --git a/net/dsa/legacy.c b/net/dsa/legacy.c
index 8e849013f69d..4d374541815a 100644
--- a/net/dsa/legacy.c
+++ b/net/dsa/legacy.c
@@ -152,6 +152,7 @@ static int dsa_switch_setup_one(struct dsa_switch *ds,
if (IS_ERR(tag_ops))
return PTR_ERR(tag_ops);
+ dst->cpu_dp->tag_ops = tag_ops;
dst->tag_ops = tag_ops;
dst->rcv = dst->tag_ops->rcv;
}
diff --git a/net/dsa/slave.c b/net/dsa/slave.c
index bf8800de13c1..4b634db05cee 100644
--- a/net/dsa/slave.c
+++ b/net/dsa/slave.c
@@ -1117,7 +1117,6 @@ int dsa_slave_resume(struct net_device *slave_dev)
int dsa_slave_create(struct dsa_port *port, const char *name)
{
struct dsa_switch *ds = port->ds;
- struct dsa_switch_tree *dst = ds->dst;
struct net_device *master;
struct net_device *slave_dev;
struct dsa_slave_priv *p;
@@ -1162,7 +1161,7 @@ int dsa_slave_create(struct dsa_port *port, const char *name)
}
p->dp = port;
INIT_LIST_HEAD(&p->mall_tc_list);
- p->xmit = dst->tag_ops->xmit;
+ p->xmit = cpu_dp->tag_ops->xmit;
p->old_pause = -1;
p->old_link = -1;
--
2.14.1
^ permalink raw reply related
* [PATCH net-next v2 2/7] net: dsa: use cpu_dp in master code
From: Vivien Didelot @ 2017-09-29 21:19 UTC (permalink / raw)
To: netdev
Cc: linux-kernel, kernel, David S. Miller, Florian Fainelli,
Andrew Lunn, Vivien Didelot
In-Reply-To: <20170929211921.5571-1-vivien.didelot@savoirfairelinux.com>
Make it clear that the master device is linked to a CPU port by using
"cpu_dp" for the dsa_port variable in master.c instead of "port", then
use a "port" variable to describe the port index, as usually seen in
other places of DSA core.
This will make the future patch touching dsa_ptr more readable. There is
no functional changes.
Signed-off-by: Vivien Didelot <vivien.didelot@savoirfairelinux.com>
Reviewed-by: Florian Fainelli <f.fainelli@gmail.com>
---
net/dsa/master.c | 42 ++++++++++++++++++++++--------------------
1 file changed, 22 insertions(+), 20 deletions(-)
diff --git a/net/dsa/master.c b/net/dsa/master.c
index 5e5147ec5a44..ef15d35f1574 100644
--- a/net/dsa/master.c
+++ b/net/dsa/master.c
@@ -17,9 +17,10 @@ static void dsa_master_get_ethtool_stats(struct net_device *dev,
uint64_t *data)
{
struct dsa_switch_tree *dst = dev->dsa_ptr;
- struct dsa_port *port = dst->cpu_dp;
- struct dsa_switch *ds = port->ds;
- const struct ethtool_ops *ops = port->orig_ethtool_ops;
+ struct dsa_port *cpu_dp = dst->cpu_dp;
+ const struct ethtool_ops *ops = cpu_dp->orig_ethtool_ops;
+ struct dsa_switch *ds = cpu_dp->ds;
+ int port = cpu_dp->index;
int count = 0;
if (ops && ops->get_sset_count && ops->get_ethtool_stats) {
@@ -28,15 +29,15 @@ static void dsa_master_get_ethtool_stats(struct net_device *dev,
}
if (ds->ops->get_ethtool_stats)
- ds->ops->get_ethtool_stats(ds, port->index, data + count);
+ ds->ops->get_ethtool_stats(ds, port, data + count);
}
static int dsa_master_get_sset_count(struct net_device *dev, int sset)
{
struct dsa_switch_tree *dst = dev->dsa_ptr;
- struct dsa_port *port = dst->cpu_dp;
- struct dsa_switch *ds = port->ds;
- const struct ethtool_ops *ops = port->orig_ethtool_ops;
+ struct dsa_port *cpu_dp = dst->cpu_dp;
+ const struct ethtool_ops *ops = cpu_dp->orig_ethtool_ops;
+ struct dsa_switch *ds = cpu_dp->ds;
int count = 0;
if (ops && ops->get_sset_count)
@@ -52,16 +53,17 @@ static void dsa_master_get_strings(struct net_device *dev, uint32_t stringset,
uint8_t *data)
{
struct dsa_switch_tree *dst = dev->dsa_ptr;
- struct dsa_port *port = dst->cpu_dp;
- struct dsa_switch *ds = port->ds;
- const struct ethtool_ops *ops = port->orig_ethtool_ops;
+ struct dsa_port *cpu_dp = dst->cpu_dp;
+ const struct ethtool_ops *ops = cpu_dp->orig_ethtool_ops;
+ struct dsa_switch *ds = cpu_dp->ds;
+ int port = cpu_dp->index;
int len = ETH_GSTRING_LEN;
int mcount = 0, count;
unsigned int i;
uint8_t pfx[4];
uint8_t *ndata;
- snprintf(pfx, sizeof(pfx), "p%.2d", port->index);
+ snprintf(pfx, sizeof(pfx), "p%.2d", port);
/* We do not want to be NULL-terminated, since this is a prefix */
pfx[sizeof(pfx) - 1] = '_';
@@ -76,7 +78,7 @@ static void dsa_master_get_strings(struct net_device *dev, uint32_t stringset,
* the output after to prepend our CPU port prefix we
* constructed earlier
*/
- ds->ops->get_strings(ds, port->index, ndata);
+ ds->ops->get_strings(ds, port, ndata);
count = ds->ops->get_sset_count(ds);
for (i = 0; i < count; i++) {
memmove(ndata + (i * len + sizeof(pfx)),
@@ -89,17 +91,17 @@ static void dsa_master_get_strings(struct net_device *dev, uint32_t stringset,
int dsa_master_ethtool_setup(struct net_device *dev)
{
struct dsa_switch_tree *dst = dev->dsa_ptr;
- struct dsa_port *port = dst->cpu_dp;
- struct dsa_switch *ds = port->ds;
+ struct dsa_port *cpu_dp = dst->cpu_dp;
+ struct dsa_switch *ds = cpu_dp->ds;
struct ethtool_ops *ops;
ops = devm_kzalloc(ds->dev, sizeof(*ops), GFP_KERNEL);
if (!ops)
return -ENOMEM;
- port->orig_ethtool_ops = dev->ethtool_ops;
- if (port->orig_ethtool_ops)
- memcpy(ops, port->orig_ethtool_ops, sizeof(*ops));
+ cpu_dp->orig_ethtool_ops = dev->ethtool_ops;
+ if (cpu_dp->orig_ethtool_ops)
+ memcpy(ops, cpu_dp->orig_ethtool_ops, sizeof(*ops));
ops->get_sset_count = dsa_master_get_sset_count;
ops->get_ethtool_stats = dsa_master_get_ethtool_stats;
@@ -113,8 +115,8 @@ int dsa_master_ethtool_setup(struct net_device *dev)
void dsa_master_ethtool_restore(struct net_device *dev)
{
struct dsa_switch_tree *dst = dev->dsa_ptr;
- struct dsa_port *port = dst->cpu_dp;
+ struct dsa_port *cpu_dp = dst->cpu_dp;
- dev->ethtool_ops = port->orig_ethtool_ops;
- port->orig_ethtool_ops = NULL;
+ dev->ethtool_ops = cpu_dp->orig_ethtool_ops;
+ cpu_dp->orig_ethtool_ops = NULL;
}
--
2.14.1
^ permalink raw reply related
* [PATCH net-next v2 1/7] net: dsa: add master helper to look up slaves
From: Vivien Didelot @ 2017-09-29 21:19 UTC (permalink / raw)
To: netdev
Cc: linux-kernel, kernel, David S. Miller, Florian Fainelli,
Andrew Lunn, Vivien Didelot
In-Reply-To: <20170929211921.5571-1-vivien.didelot@savoirfairelinux.com>
The DSA tagging code does not need to know about the DSA architecture,
it only needs to return the slave device corresponding to the source
port index (and eventually the source device index for cascade-capable
switches) parsed from the frame received on the master device.
For this purpose, provide an inline dsa_master_get_slave helper which
validates the device and port indexes and look up the slave device.
This makes the tagging rcv functions more concise and robust, and also
makes dsa_get_cpu_port obsolete.
Signed-off-by: Vivien Didelot <vivien.didelot@savoirfairelinux.com>
---
net/dsa/dsa_priv.h | 24 +++++++++++++++++++-----
net/dsa/tag_brcm.c | 9 ++-------
net/dsa/tag_dsa.c | 18 ++----------------
net/dsa/tag_edsa.c | 18 ++----------------
net/dsa/tag_ksz.c | 9 +++------
net/dsa/tag_lan9303.c | 20 ++------------------
net/dsa/tag_mtk.c | 16 +++-------------
net/dsa/tag_qca.c | 17 +++--------------
net/dsa/tag_trailer.c | 9 +++------
9 files changed, 39 insertions(+), 101 deletions(-)
diff --git a/net/dsa/dsa_priv.h b/net/dsa/dsa_priv.h
index eccc62776283..d429505dc4e7 100644
--- a/net/dsa/dsa_priv.h
+++ b/net/dsa/dsa_priv.h
@@ -113,6 +113,25 @@ int dsa_legacy_fdb_del(struct ndmsg *ndm, struct nlattr *tb[],
int dsa_master_ethtool_setup(struct net_device *dev);
void dsa_master_ethtool_restore(struct net_device *dev);
+static inline struct net_device *dsa_master_get_slave(struct net_device *dev,
+ int device, int port)
+{
+ struct dsa_switch_tree *dst = dev->dsa_ptr;
+ struct dsa_switch *ds;
+
+ if (device < 0 || device >= DSA_MAX_SWITCHES)
+ return NULL;
+
+ ds = dst->ds[device];
+ if (!ds)
+ return NULL;
+
+ if (port < 0 || port >= ds->num_ports)
+ return NULL;
+
+ return ds->ports[port].netdev;
+}
+
/* port.c */
int dsa_port_set_state(struct dsa_port *dp, u8 state,
struct switchdev_trans *trans);
@@ -182,9 +201,4 @@ static inline struct net_device *dsa_master_netdev(struct dsa_slave_priv *p)
return p->dp->cpu_dp->netdev;
}
-static inline struct dsa_port *dsa_get_cpu_port(struct dsa_switch_tree *dst)
-{
- return dst->cpu_dp;
-}
-
#endif
diff --git a/net/dsa/tag_brcm.c b/net/dsa/tag_brcm.c
index dbb016434ace..8e4bdb9d9ae3 100644
--- a/net/dsa/tag_brcm.c
+++ b/net/dsa/tag_brcm.c
@@ -92,9 +92,6 @@ static struct sk_buff *brcm_tag_xmit(struct sk_buff *skb, struct net_device *dev
static struct sk_buff *brcm_tag_rcv(struct sk_buff *skb, struct net_device *dev,
struct packet_type *pt)
{
- struct dsa_switch_tree *dst = dev->dsa_ptr;
- struct dsa_port *cpu_dp = dsa_get_cpu_port(dst);
- struct dsa_switch *ds = cpu_dp->ds;
int source_port;
u8 *brcm_tag;
@@ -117,8 +114,8 @@ static struct sk_buff *brcm_tag_rcv(struct sk_buff *skb, struct net_device *dev,
/* Locate which port this is coming from */
source_port = brcm_tag[3] & BRCM_EG_PID_MASK;
- /* Validate port against switch setup, either the port is totally */
- if (source_port >= ds->num_ports || !ds->ports[source_port].netdev)
+ skb->dev = dsa_master_get_slave(dev, 0, source_port);
+ if (!skb->dev)
return NULL;
/* Remove Broadcom tag and update checksum */
@@ -129,8 +126,6 @@ static struct sk_buff *brcm_tag_rcv(struct sk_buff *skb, struct net_device *dev,
skb->data - ETH_HLEN - BRCM_TAG_LEN,
2 * ETH_ALEN);
- skb->dev = ds->ports[source_port].netdev;
-
return skb;
}
diff --git a/net/dsa/tag_dsa.c b/net/dsa/tag_dsa.c
index fbf9ca954773..c77218f173d1 100644
--- a/net/dsa/tag_dsa.c
+++ b/net/dsa/tag_dsa.c
@@ -67,8 +67,6 @@ static struct sk_buff *dsa_xmit(struct sk_buff *skb, struct net_device *dev)
static struct sk_buff *dsa_rcv(struct sk_buff *skb, struct net_device *dev,
struct packet_type *pt)
{
- struct dsa_switch_tree *dst = dev->dsa_ptr;
- struct dsa_switch *ds;
u8 *dsa_header;
int source_device;
int source_port;
@@ -93,18 +91,8 @@ static struct sk_buff *dsa_rcv(struct sk_buff *skb, struct net_device *dev,
source_device = dsa_header[0] & 0x1f;
source_port = (dsa_header[1] >> 3) & 0x1f;
- /*
- * Check that the source device exists and that the source
- * port is a registered DSA port.
- */
- if (source_device >= DSA_MAX_SWITCHES)
- return NULL;
-
- ds = dst->ds[source_device];
- if (!ds)
- return NULL;
-
- if (source_port >= ds->num_ports || !ds->ports[source_port].netdev)
+ skb->dev = dsa_master_get_slave(dev, source_device, source_port);
+ if (!skb->dev)
return NULL;
/*
@@ -153,8 +141,6 @@ static struct sk_buff *dsa_rcv(struct sk_buff *skb, struct net_device *dev,
2 * ETH_ALEN);
}
- skb->dev = ds->ports[source_port].netdev;
-
return skb;
}
diff --git a/net/dsa/tag_edsa.c b/net/dsa/tag_edsa.c
index 76367ba1b2e2..0b83cbe0c9e8 100644
--- a/net/dsa/tag_edsa.c
+++ b/net/dsa/tag_edsa.c
@@ -80,8 +80,6 @@ static struct sk_buff *edsa_xmit(struct sk_buff *skb, struct net_device *dev)
static struct sk_buff *edsa_rcv(struct sk_buff *skb, struct net_device *dev,
struct packet_type *pt)
{
- struct dsa_switch_tree *dst = dev->dsa_ptr;
- struct dsa_switch *ds;
u8 *edsa_header;
int source_device;
int source_port;
@@ -106,18 +104,8 @@ static struct sk_buff *edsa_rcv(struct sk_buff *skb, struct net_device *dev,
source_device = edsa_header[0] & 0x1f;
source_port = (edsa_header[1] >> 3) & 0x1f;
- /*
- * Check that the source device exists and that the source
- * port is a registered DSA port.
- */
- if (source_device >= DSA_MAX_SWITCHES)
- return NULL;
-
- ds = dst->ds[source_device];
- if (!ds)
- return NULL;
-
- if (source_port >= ds->num_ports || !ds->ports[source_port].netdev)
+ skb->dev = dsa_master_get_slave(dev, source_device, source_port);
+ if (!skb->dev)
return NULL;
/*
@@ -172,8 +160,6 @@ static struct sk_buff *edsa_rcv(struct sk_buff *skb, struct net_device *dev,
2 * ETH_ALEN);
}
- skb->dev = ds->ports[source_port].netdev;
-
return skb;
}
diff --git a/net/dsa/tag_ksz.c b/net/dsa/tag_ksz.c
index 010ca0a336c4..b241c990cde0 100644
--- a/net/dsa/tag_ksz.c
+++ b/net/dsa/tag_ksz.c
@@ -80,22 +80,19 @@ static struct sk_buff *ksz_xmit(struct sk_buff *skb, struct net_device *dev)
static struct sk_buff *ksz_rcv(struct sk_buff *skb, struct net_device *dev,
struct packet_type *pt)
{
- struct dsa_switch_tree *dst = dev->dsa_ptr;
- struct dsa_port *cpu_dp = dsa_get_cpu_port(dst);
- struct dsa_switch *ds = cpu_dp->ds;
u8 *tag;
int source_port;
tag = skb_tail_pointer(skb) - KSZ_EGRESS_TAG_LEN;
source_port = tag[0] & 7;
- if (source_port >= ds->num_ports || !ds->ports[source_port].netdev)
+
+ skb->dev = dsa_master_get_slave(dev, 0, source_port);
+ if (!skb->dev)
return NULL;
pskb_trim_rcsum(skb, skb->len - KSZ_EGRESS_TAG_LEN);
- skb->dev = ds->ports[source_port].netdev;
-
return skb;
}
diff --git a/net/dsa/tag_lan9303.c b/net/dsa/tag_lan9303.c
index 0b9826105e42..4f211e56c81f 100644
--- a/net/dsa/tag_lan9303.c
+++ b/net/dsa/tag_lan9303.c
@@ -71,17 +71,8 @@ static struct sk_buff *lan9303_rcv(struct sk_buff *skb, struct net_device *dev,
struct packet_type *pt)
{
u16 *lan9303_tag;
- struct dsa_switch_tree *dst = dev->dsa_ptr;
- struct dsa_switch *ds;
unsigned int source_port;
- ds = dst->ds[0];
-
- if (unlikely(!ds)) {
- dev_warn_ratelimited(&dev->dev, "Dropping packet, due to missing DSA switch device\n");
- return NULL;
- }
-
if (unlikely(!pskb_may_pull(skb, LAN9303_TAG_LEN))) {
dev_warn_ratelimited(&dev->dev,
"Dropping packet, cannot pull\n");
@@ -103,16 +94,12 @@ static struct sk_buff *lan9303_rcv(struct sk_buff *skb, struct net_device *dev,
source_port = ntohs(lan9303_tag[1]) & 0x3;
- if (source_port >= ds->num_ports) {
+ skb->dev = dsa_master_get_slave(dev, 0, source_port);
+ if (!skb->dev) {
dev_warn_ratelimited(&dev->dev, "Dropping packet due to invalid source port\n");
return NULL;
}
- if (!ds->ports[source_port].netdev) {
- dev_warn_ratelimited(&dev->dev, "Dropping packet due to invalid netdev or device\n");
- return NULL;
- }
-
/* remove the special VLAN tag between the MAC addresses
* and the current ethertype field.
*/
@@ -120,9 +107,6 @@ static struct sk_buff *lan9303_rcv(struct sk_buff *skb, struct net_device *dev,
memmove(skb->data - ETH_HLEN, skb->data - (ETH_HLEN + LAN9303_TAG_LEN),
2 * ETH_ALEN);
- /* forward the packet to the dedicated interface */
- skb->dev = ds->ports[source_port].netdev;
-
return skb;
}
diff --git a/net/dsa/tag_mtk.c b/net/dsa/tag_mtk.c
index ec8ee5f43255..968586c5d40e 100644
--- a/net/dsa/tag_mtk.c
+++ b/net/dsa/tag_mtk.c
@@ -46,8 +46,6 @@ static struct sk_buff *mtk_tag_xmit(struct sk_buff *skb,
static struct sk_buff *mtk_tag_rcv(struct sk_buff *skb, struct net_device *dev,
struct packet_type *pt)
{
- struct dsa_switch_tree *dst = dev->dsa_ptr;
- struct dsa_switch *ds;
int port;
__be16 *phdr, hdr;
@@ -68,20 +66,12 @@ static struct sk_buff *mtk_tag_rcv(struct sk_buff *skb, struct net_device *dev,
skb->data - ETH_HLEN - MTK_HDR_LEN,
2 * ETH_ALEN);
- /* This protocol doesn't support cascading multiple
- * switches so it's safe to assume the switch is first
- * in the tree.
- */
- ds = dst->ds[0];
- if (!ds)
- return NULL;
-
/* Get source port information */
port = (hdr & MTK_HDR_RECV_SOURCE_PORT_MASK);
- if (!ds->ports[port].netdev)
- return NULL;
- skb->dev = ds->ports[port].netdev;
+ skb->dev = dsa_master_get_slave(dev, 0, port);
+ if (!skb->dev)
+ return NULL;
return skb;
}
diff --git a/net/dsa/tag_qca.c b/net/dsa/tag_qca.c
index 1d4c70711c0f..8d33d9ebf910 100644
--- a/net/dsa/tag_qca.c
+++ b/net/dsa/tag_qca.c
@@ -65,9 +65,6 @@ static struct sk_buff *qca_tag_xmit(struct sk_buff *skb, struct net_device *dev)
static struct sk_buff *qca_tag_rcv(struct sk_buff *skb, struct net_device *dev,
struct packet_type *pt)
{
- struct dsa_switch_tree *dst = dev->dsa_ptr;
- struct dsa_port *cpu_dp = dsa_get_cpu_port(dst);
- struct dsa_switch *ds;
u8 ver;
int port;
__be16 *phdr, hdr;
@@ -92,20 +89,12 @@ static struct sk_buff *qca_tag_rcv(struct sk_buff *skb, struct net_device *dev,
memmove(skb->data - ETH_HLEN, skb->data - ETH_HLEN - QCA_HDR_LEN,
ETH_HLEN - QCA_HDR_LEN);
- /* This protocol doesn't support cascading multiple switches so it's
- * safe to assume the switch is first in the tree
- */
- ds = cpu_dp->ds;
- if (!ds)
- return NULL;
-
/* Get source port information */
port = (hdr & QCA_HDR_RECV_SOURCE_PORT_MASK);
- if (!ds->ports[port].netdev)
- return NULL;
- /* Update skb & forward the frame accordingly */
- skb->dev = ds->ports[port].netdev;
+ skb->dev = dsa_master_get_slave(dev, 0, port);
+ if (!skb->dev)
+ return NULL;
return skb;
}
diff --git a/net/dsa/tag_trailer.c b/net/dsa/tag_trailer.c
index d2fd4923aa3e..61668be267f5 100644
--- a/net/dsa/tag_trailer.c
+++ b/net/dsa/tag_trailer.c
@@ -58,9 +58,6 @@ static struct sk_buff *trailer_xmit(struct sk_buff *skb, struct net_device *dev)
static struct sk_buff *trailer_rcv(struct sk_buff *skb, struct net_device *dev,
struct packet_type *pt)
{
- struct dsa_switch_tree *dst = dev->dsa_ptr;
- struct dsa_port *cpu_dp = dsa_get_cpu_port(dst);
- struct dsa_switch *ds = cpu_dp->ds;
u8 *trailer;
int source_port;
@@ -73,13 +70,13 @@ static struct sk_buff *trailer_rcv(struct sk_buff *skb, struct net_device *dev,
return NULL;
source_port = trailer[1] & 7;
- if (source_port >= ds->num_ports || !ds->ports[source_port].netdev)
+
+ skb->dev = dsa_master_get_slave(dev, 0, source_port);
+ if (!skb->dev)
return NULL;
pskb_trim_rcsum(skb, skb->len - 4);
- skb->dev = ds->ports[source_port].netdev;
-
return skb;
}
--
2.14.1
^ permalink raw reply related
* [PATCH net-next v2 0/7] net: dsa: change dsa_ptr for a dsa_port
From: Vivien Didelot @ 2017-09-29 21:19 UTC (permalink / raw)
To: netdev
Cc: linux-kernel, kernel, David S. Miller, Florian Fainelli,
Andrew Lunn, Vivien Didelot
With DSA, a master net_device is physically wired to a dedicated CPU
switch port. For interaction with the DSA layer, the struct net_device
contains a dsa_ptr, which currently points to a dsa_switch_tree object.
This is only valid for a switch fabric with a single CPU port. In order
to support switch fabrics with multiple CPU ports, we first need to
change the type of dsa_ptr to what it really is: a dsa_port object.
This is what this patchset does. The first patches adds a
dsa_master_get_slave helper and cleans up portions of DSA core to make
the next patches more readable. These next patches prepare the xmit and
receive hot paths and finally change dsa_ptr.
Changes in v2:
- introduce dsa_master_get_slave helper to simplify patch 6
- keep hot path data at beginning of dsa_port for cacheline 1
Vivien Didelot (7):
net: dsa: add master helper to look up slaves
net: dsa: use cpu_dp in master code
net: dsa: use temporary dsa_device_ops variable
net: dsa: add tagging ops to port
net: dsa: prepare master receive hot path
net: dsa: change dsa_ptr for a dsa_port
net: dsa: remove tag ops from the switch tree
include/linux/netdevice.h | 4 ++--
include/net/dsa.h | 19 ++++++++-----------
net/dsa/dsa.c | 6 +++---
net/dsa/dsa2.c | 15 ++++++++++-----
net/dsa/dsa_priv.h | 27 +++++++++++++++++++++------
net/dsa/legacy.c | 15 ++++++++++-----
net/dsa/master.c | 47 ++++++++++++++++++++++-------------------------
net/dsa/slave.c | 3 +--
net/dsa/tag_brcm.c | 9 ++-------
net/dsa/tag_dsa.c | 18 ++----------------
net/dsa/tag_edsa.c | 18 ++----------------
net/dsa/tag_ksz.c | 9 +++------
net/dsa/tag_lan9303.c | 20 ++------------------
net/dsa/tag_mtk.c | 16 +++-------------
net/dsa/tag_qca.c | 17 +++--------------
net/dsa/tag_trailer.c | 9 +++------
16 files changed, 97 insertions(+), 155 deletions(-)
--
2.14.1
^ permalink raw reply
* Re: [RFC net-next 0/5] TSN: Add qdisc-based config interfaces for traffic shapers
From: Rodney Cummings @ 2017-09-29 20:44 UTC (permalink / raw)
To: Linux Kernel Network Developers
Hi,
I am posting my reply to this thread after subscribing, so I apologize
if the archive happens to attach it to the wrong thread.
First, I'd like to say that I strongly support this RFC.
We need Linux interfaces for IEEE 802.1 TSN features.
Although I haven't looked in detail, the proposal for CBS looks good.
My questions/concerns are more related to future work, such for 802.1Qbv
(scheduled traffic).
1. Question: From an 802.1 perspective, is this RFC intended to support
end-station (e.g. NIC in host), bridges (i.e. DSA), or both?
This is very important to clarify, because the usage of this interface
will be very different for one or the other.
For a bridge, the user code typically represents a remote management
protocol (e.g. SNMP, NETCONF, RESTCONF), and this interface is
expected to align with the specifications of 802.1Q clause 12,
which serves as the information model for management. Historically,
a standard kernel interface for management hasn't been viewed as
essential, but I suppose it wouldn't hurt.
For an end station, the user code can be an implementation of SRP
(802.1Q clause 35), or it can be an application-specific
protocol (e.g. industrial fieldbus) that exchanges data according
to P802.1Qcc clause 46. Either way, the top-level user interface
is designed for individual streams, not queues and shapers. That
implies some translation code between that top-level interface
and this sort of kernel interface.
As a specific end-station example, for CBS, 802.1Q-2014 subclause
34.6.1 requires "per-stream queues" in the Talker end-station.
I don't see 34.6.1 represented in the proposed RFC, but that's
okay... maybe per-stream queues are implemented in user code.
Nevertheless, if that is the assumption, I think we need to
clarify, especially in examples.
2. Suggestion: Do not assume that a time-aware (i.e. scheduled)
end-station will always use 802.1Qbv.
For those who are subscribed to the 802.1 mailing list,
I'd suggest a read of draft P802.1Qcc/D1.6, subclause U.1
of Annex U. Subclause U.1 assumes that bridges in the network use
802.1Qbv, and then it poses the question of what an end-station
Talker should do. If the end-station also uses 802.1Qbv,
and that end-station transmits multiple streams, 802.1Qbv is
a bad implementation. The reason is that the scheduling
(i.e. order in time) of each stream cannot be controlled, which
in turn means that the CNC (network manager) cannot optimize
the 802.1Qbv schedules in bridges. The preferred technique
is to use "per-stream scheduling" in each Talker, so that
the CNC can create an optimal schedules (i.e. best determinism).
I'm aware of a small number of proprietary CNC implementations for
802.1Qbv in bridges, and they are generally assuming per-stream
scheduling in end-stations (Talkers).
The i210 NIC's LaunchTime can be used to implement per-stream
scheduling. I haven't looked at SO_TXTIME in detail, but it sounds
like per-stream scheduling. If so, then we already have the
fundamental building blocks for a complete implementation
of a time-aware end-station.
If we answer the preceding question #1 as "end-station only",
I would recommend avoiding 802.1Qbv in this interface. There
isn't really anything wrong with it per-se, but it would lead
developers down the wrong path.
Rodney Cummings (National Instruments)
Editor, IEEE P802.1Qcc
---
Hi,
This patchset is an RFC on a proposal of how the Traffic Control subsystem can
be used to offload the configuration of traffic shapers into network devices
that provide support for them in HW. Our goal here is to start upstreaming
support for features related to the Time-Sensitive Networking (TSN) set of
standards into the kernel.
As part of this work, we've assessed previous public discussions related to TSN
enabling: patches from Henrik Austad (Cisco), the presentation from Eric Mann
at Linux Plumbers 2012, patches from Gangfeng Huang (National Instruments) and
the current state of the OpenAVNU project (https://github.com/AVnu/OpenAvnu/).
Please note that the patches provided as part of this RFC are implementing what
is needed only for 802.1Qav (FQTSS) only, but we'd like to take advantage of
this discussion and share our WIP ideas for the 802.1Qbv and 802.1Qbu interfaces
as well. The current patches are only providing support for HW offload of the
configs.
Overview
========
Time-sensitive Networking (TSN) is a set of standards that aim to address
resources availability for providing bandwidth reservation and bounded latency
on Ethernet based LANs. The proposal described here aims to cover mainly what is
needed to enable the following standards: 802.1Qat, 802.1Qav, 802.1Qbv and
802.1Qbu.
The initial target of this work is the Intel i210 NIC, but other controllers'
datasheet were also taken into account, like the Renesas RZ/A1H RZ/A1M group and
the Synopsis DesignWare Ethernet QoS controller.
Proposal
========
Feature-wise, what is covered here are configuration interfaces for HW
implementations of the Credit-Based shaper (CBS, 802.1Qav), Time-Aware shaper
(802.1Qbv) and Frame Preemption (802.1Qbu). CBS is a per-queue shaper, while
Qbv and Qbu must be configured per port, with the configuration covering all
queues. Given that these features are related to traffic shaping, and that the
traffic control subsystem already provides a queueing discipline that offloads
config into the device driver (i.e. mqprio), designing new qdiscs for the
specific purpose of offloading the config for each shaper seemed like a good
fit.
For steering traffic into the correct queues, we use the socket option
SO_PRIORITY and then a mechanism to map priority to traffic classes / Tx queues.
The qdisc mqprio is currently used in our tests.
As for the shapers config interface:
* CBS (802.1Qav)
This patchset is proposing a new qdisc called 'cbs'. Its 'tc' cmd line is:
$ tc qdisc add dev IFACE parent ID cbs locredit N hicredit M sendslope S \
idleslope I
Note that the parameters for this qdisc are the ones defined by the
802.1Q-2014 spec, so no hardware specific functionality is exposed here.
* Time-aware shaper (802.1Qbv):
The idea we are currently exploring is to add a "time-aware", priority based
qdisc, that also exposes the Tx queues available and provides a mechanism for
mapping priority <-> traffic class <-> Tx queues in a similar fashion as
mqprio. We are calling this qdisc 'taprio', and its 'tc' cmd line would be:
$ $ tc qdisc add dev ens4 parent root handle 100 taprio num_tc 4 \
map 2 2 1 0 3 3 3 3 3 3 3 3 3 3 3 3 \
queues 0 1 2 3 \
sched-file gates.sched [base-time <interval>] \
[cycle-time <interval>] [extension-time <interval>]
<file> is multi-line, with each line being of the following format:
<cmd> <gate mask> <interval in nanoseconds>
Qbv only defines one <cmd>: "S" for 'SetGates'
For example:
S 0x01 300
S 0x03 500
This means that there are two intervals, the first will have the gate
for traffic class 0 open for 300 nanoseconds, the second will have
both traffic classes open for 500 nanoseconds.
Additionally, an option to set just one entry of the gate control list will
also be provided by 'taprio':
$ tc qdisc (...) \
sched-row <row number> <cmd> <gate mask> <interval> \
[base-time <interval>] [cycle-time <interval>] \
[extension-time <interval>]
* Frame Preemption (802.1Qbu):
To control even further the latency, it may prove useful to signal which
traffic classes are marked as preemptable. For that, 'taprio' provides the
preemption command so you set each traffic class as preemptable or not:
$ tc qdisc (...) \
preemption 0 1 1 1
* Time-aware shaper + Preemption:
As an example of how Qbv and Qbu can be used together, we may specify
both the schedule and the preempt-mask, and this way we may also
specify the Set-Gates-and-Hold and Set-Gates-and-Release commands as
specified in the Qbu spec:
$ tc qdisc add dev ens4 parent root handle 100 taprio num_tc 4 \
map 2 2 1 0 3 3 3 3 3 3 3 3 3 3 3 3 \
queues 0 1 2 3 \
preemption 0 1 1 1 \
sched-file preempt_gates.sched
<file> is multi-line, with each line being of the following format:
<cmd> <gate mask> <interval in nanoseconds>
For this case, two new commands are introduced:
"H" for 'set gates and hold'
"R" for 'set gates and release'
H 0x01 300
R 0x03 500
Testing this RFC
================
For testing the patches of this RFC only, you can refer to the samples and
helper script being added to samples/tsn/ and the use the 'mqprio' qdisc to
setup the priorities to Tx queues mapping, together with the 'cbs' qdisc to
configure the HW shaper of the i210 controller:
1) Setup priorities to traffic classes to hardware queues mapping
$ tc qdisc replace dev enp3s0 parent root mqprio num_tc 3 \
map 2 2 1 0 2 2 2 2 2 2 2 2 2 2 2 2 queues 1@0 1@1 2@2 hw 0
2) Check scheme. You want to get the inner qdiscs ID from the bottom up
$ tc -g class show dev enp3s0
Ex.:
+---(802a:3) mqprio
| +---(802a:6) mqprio
| +---(802a:7) mqprio
|
+---(802a:2) mqprio
| +---(802a:5) mqprio
|
+---(802a:1) mqprio
+---(802a:4) mqprio
* Here '802a:4' is Tx Queue #0 and '802a:5' is Tx Queue #1.
3) Calculate CBS parameters for classes A and B. i.e. BW for A is 20Mbps and
for B is 10Mbps:
$ ./samples/tsn/calculate_cbs_params.py -A 20000 -a 1500 -B 10000 -b 1500
4) Configure CBS for traffic class A (priority 3) as provided by the script:
$ tc qdisc replace dev enp3s0 parent 802a:4 cbs locredit -1470 \
hicredit 30 sendslope -980000 idleslope 20000
5) Configure CBS for traffic class B (priority 2):
$ tc qdisc replace dev enp3s0 parent 802a:5 cbs \
locredit -1485 hicredit 31 sendslope -990000 idleslope 10000
6) Run Listener, compiled from samples/tsn/listener.c
$ ./listener -i enp3s0
7) Run Talker for class A (prio 3 here), compiled from samples/tsn/talker.c
$ ./talker -i enp3s0 -p 3
* The bandwidth displayed on the listener output at this stage should be very
close to the one configured for class A.
8) You can also run a Talker for class B (prio 2 here)
$ ./talker -i enp3s0 -p 2
* The bandwidth displayed on the listener output now should increase to very
close to the one configured for class A + class B.
Authors
=======
- Andre Guedes <andre.guedes@xxxxxxxxx>
- Ivan Briano <ivan.briano@xxxxxxxxx>
- Jesus Sanchez-Palencia <jesus.sanchez-palencia@xxxxxxxxx>
- Vinicius Gomes <vinicius.gomes@xxxxxxxxx>
Andre Guedes (2):
igb: Add support for CBS offload
samples/tsn: Add script for calculating CBS config
Jesus Sanchez-Palencia (1):
sample: Add TSN Talker and Listener examples
Vinicius Costa Gomes (2):
net/sched: Introduce the user API for the CBS shaper
net/sched: Introduce Credit Based Shaper (CBS) qdisc
drivers/net/ethernet/intel/igb/e1000_defines.h | 23 ++
drivers/net/ethernet/intel/igb/e1000_regs.h | 8 +
drivers/net/ethernet/intel/igb/igb.h | 6 +
drivers/net/ethernet/intel/igb/igb_main.c | 349 +++++++++++++++++++++++++
include/linux/netdevice.h | 1 +
include/uapi/linux/pkt_sched.h | 29 ++
net/sched/Kconfig | 11 +
net/sched/Makefile | 1 +
net/sched/sch_cbs.c | 286 ++++++++++++++++++++
samples/tsn/calculate_cbs_params.py | 112 ++++++++
samples/tsn/listener.c | 254 ++++++++++++++++++
samples/tsn/talker.c | 136 ++++++++++
12 files changed, 1216 insertions(+)
create mode 100644 net/sched/sch_cbs.c
create mode 100755 samples/tsn/calculate_cbs_params.py
create mode 100644 samples/tsn/listener.c
create mode 100644 samples/tsn/talker.c
^ permalink raw reply
* Re: [PATCH net-next] net: bridge: add per-port group_fwd_mask with less restrictions
From: Nikolay Aleksandrov @ 2017-09-29 21:01 UTC (permalink / raw)
To: Stephen Hemminger; +Cc: netdev, roopa, bridge
In-Reply-To: <20170929081420.3b069170@xeon-e3>
On 29/09/17 18:14, Stephen Hemminger wrote:
> On Wed, 27 Sep 2017 16:12:44 +0300
> Nikolay Aleksandrov <nikolay@cumulusnetworks.com> wrote:
>
>> We need to be able to transparently forward most link-local frames via
>> tunnels (e.g. vxlan, qinq). Currently the bridge's group_fwd_mask has a
>> mask which restricts the forwarding of STP and LACP, but we need to be able
>> to forward these over tunnels and control that forwarding on a per-port
>> basis thus add a new per-port group_fwd_mask option which only disallows
>> mac pause frames to be forwarded (they're always dropped anyway).
>> The patch does not change the current default situation - all of the others
>> are still restricted unless configured for forwarding.
>> We have successfully tested this patch with LACP and STP forwarding over
>> VxLAN and qinq tunnels.
>>
>> Signed-off-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com>
>
> LACP is fine, but STP must not be forwarded if STP in user or kernel
> mode is enabled.
>
> Please update this patch or revert it.
>
The default has not changed, STP is still _not_ forwarded. It can be only if explicitly
requested by the user and that means the port might be participating in STP but not
the bridge's STP, that is explicitly forward all STP frames from that port.
I don't think we have to change anything.
^ permalink raw reply
* Re: [PATCH RFC 3/5] Add KSZ8795 switch driver
From: Andrew Lunn @ 2017-09-29 20:39 UTC (permalink / raw)
To: Tristram.Ha
Cc: David.Laight, muvarov, pavel, nathan.leigh.conrad, vivien.didelot,
f.fainelli, netdev, linux-kernel, Woojung.Huh
In-Reply-To: <93AF473E2DA327428DE3D46B72B1E9FD4112CC1D@CHN-SV-EXMX02.mchp-main.com>
On Fri, Sep 29, 2017 at 07:19:17PM +0000, Tristram.Ha@microchip.com wrote:
> > > My concern is if a task is already running with SPI access to a lot
> > > of registers like reading the 32 MIB counters in every port of the
> > > switch, another register access has to wait until they are finished.
> >
> > Why does it have to wait? Looking at the code in
> > ksz_get_ethtool_stats(), you don't take any mutex which will prevent
> > others from using the SPI bus. All there is is a mutex which prevents
> > two sets of ksz_get_ethtool_stats() at the same time.
> >
> > So a PTP read could happen in parallel, and will not be blocked by MIB
> > reads. They should get interleaved access to the SPI bus.
> >
>
> The MIB counters are read in the background. For multiple CPU cores 2
> tasks may run in the same time allowing SPI access one after another.
> For single core I am not sure an SPI access like coming from an interrupt
> routine can jump ahead from one in a background task.
The SPI subsystem has a mutex per controller. When starting a
transfer, it takes the mutex and release it once the transfer has
completed. There is also a reschedule point at the end of a
transfer. So even on your single core CPU, there can be multi tasking
going on.
Andrew
^ permalink raw reply
* Re: [PATCH next] bonding: speed/duplex update at NETDEV_UP event
From: Stephen Hemminger @ 2017-09-29 20:08 UTC (permalink / raw)
To: Mahesh Bandewar
Cc: Jay Vosburgh, Andy Gospodarek, Veaceslav Falico, David Miller,
Netdev, Mahesh Bandewar
In-Reply-To: <20170928010349.8988-1-mahesh@bandewar.net>
On Wed, 27 Sep 2017 18:03:49 -0700
Mahesh Bandewar <mahesh@bandewar.net> wrote:
> From: Mahesh Bandewar <maheshb@google.com>
>
> Some NIC drivers don't have correct speed/duplex settings at the
> time they send NETDEV_UP notification and that messes up the
> bonding state. Especially 802.3ad mode which is very sensitive
> to these settings. In the current implementation we invoke
> bond_update_speed_duplex() when we receive NETDEV_UP, however,
> ignore the return value. If the values we get are invalid
> (UNKNOWN), then slave gets removed from the aggregator with
> speed and duplex set to UNKNOWN while link is still marked as UP.
>
> This patch fixes this scenario. Also 802.3ad mode is sensitive to
> these conditions while other modes are not, so making sure that it
> doesn't change the behavior for other modes.
>
> Signed-off-by: Mahesh Bandewar <maheshb@google.com>
> ---
> drivers/net/bonding/bond_main.c | 11 ++++++++++-
> 1 file changed, 10 insertions(+), 1 deletion(-)
>
> diff --git a/drivers/net/bonding/bond_main.c b/drivers/net/bonding/bond_main.c
> index b7313c1d9dcd..177be373966b 100644
> --- a/drivers/net/bonding/bond_main.c
> +++ b/drivers/net/bonding/bond_main.c
> @@ -3076,7 +3076,16 @@ static int bond_slave_netdev_event(unsigned long event,
> break;
> case NETDEV_UP:
> case NETDEV_CHANGE:
> - bond_update_speed_duplex(slave);
> + /* For 802.3ad mode only:
> + * Getting invalid Speed/Duplex values here will put slave
> + * in weird state. So mark it as link-down for the time
> + * being and let link-monitoring (miimon) set it right when
> + * correct speeds/duplex are available.
> + */
> + if (bond_update_speed_duplex(slave) &&
> + BOND_MODE(bond) == BOND_MODE_8023AD)
> + slave->link = BOND_LINK_DOWN;
> +
> if (BOND_MODE(bond) == BOND_MODE_8023AD)
> bond_3ad_adapter_speed_duplex_changed(slave);
> /* Fallthrough */
Then fix the drivers. Trying to workaround it here isn't helping.
The problem is that miimon is not required. Bonding can be purely
event driven.
^ permalink raw reply
* Re: [net-next V2 PATCH 1/5] bpf: introduce new bpf cpu map type BPF_MAP_TYPE_CPUMAP
From: Jesper Dangaard Brouer @ 2017-09-29 19:58 UTC (permalink / raw)
To: Jakub Kicinski
Cc: netdev, Michael S. Tsirkin, Jason Wang, mchan, John Fastabend,
peter.waskiewicz.jr, Daniel Borkmann, Alexei Starovoitov,
Andy Gospodarek, brouer
In-Reply-To: <20170929114154.4b5d5918@cakuba>
On Fri, 29 Sep 2017 11:41:54 -0700
Jakub Kicinski <kubakici@wp.pl> wrote:
> On Fri, 29 Sep 2017 18:34:12 +0200, Jesper Dangaard Brouer wrote:
> > The 'cpumap' is primary used as a backend map for XDP BPF helper
> > call bpf_redirect_map() and XDP_REDIRECT action, like 'devmap'.
> >
> > This patch implement the main part of the map. It is not connected to
> > the XDP redirect system yet, and no SKB allocation are done yet.
> >
> > The main concern in this patch is to ensure the datapath can run
> > without any locking. This adds complexity to the setup and tear-down
> > procedure, which assumptions are extra carefully documented in the
> > code comments.
> >
> > V2: make sure array isn't larger than num possible CPUs
> >
> > Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
>
> Few trivial nitpicks, hope you don't mind :)
>
> > @@ -0,0 +1,555 @@
> > +/* bpf/cpumap.c
> > + *
> > + * Copyright (c) 2017 Jesper Dangaard Brouer, Red Hat Inc.
> > + * Released under terms in GPL version 2. See COPYING.
> > + */
> > +
> > +/* The 'cpumap' is primary used as a backend map for XDP BPF helper
> > + * call bpf_redirect_map() and XDP_REDIRECT action, like 'devmap'.
> > + *
> > + * Unlike devmap which redirect XDP frames out another NIC device,
> > + * this map type redirect raw XDP frames to another CPU. The remote
> > + * CPU will do SKB-allocation and call the normal network stack.
> > + *
> > + * This is a scalability and isolation mechanism, that allow
> > + * separating the early driver network XDP layer, from the rest of the
> > + * netstack, and assigning dedicated CPUs for this stage. This
> > + * basically allows for 10G wirespeed pre-filtering via bpf.
> > + */
> > +#include <linux/bpf.h>
> > +#include <linux/filter.h>
> > +#include <linux/ptr_ring.h>
> > +
> > +#include <linux/sched.h>
> > +#include <linux/workqueue.h>
> > +#include <linux/kthread.h>
> > +
> > +/*
> > + * General idea: XDP packets getting XDP redirected to another CPU,
> > + * will maximum be stored/queued for one driver ->poll() call. It is
> > + * guaranteed that setting flush bit and flush operation happen on
> > + * same CPU. Thus, cpu_map_flush operation can deduct via this_cpu_ptr()
> > + * which queue in bpf_cpu_map_entry contains packets.
> > + */
> > +
> > +#define CPU_MAP_BULK_SIZE 8 /* 8 == one cacheline on 64-bit archs */
> > +struct xdp_bulk_queue {
> > + void *q[CPU_MAP_BULK_SIZE];
> > + unsigned int count;
> > +};
>
> Out of curiosity - would it make sense to make sure the entire struct
> fits into a cache line? The comment seems to indicate that the array is
> sized to fit a cache line, but then there is also the count member...
Nope, it does not make sense to align the struct itself for a cacheline.
The idea is that when I bulk enqueue transfer (from this percpu mem)
into the shared queue (ptr_ring), then I dirty a cacheline, thus I
might as well fill up the cacheline.
> > +/*
> > + * After xchg pointer to bpf_cpu_map_entry, use the call_rcu() to
> ...
>
> There is a mix for networking and non-networking style comments in this
> file, is this intentional?
No
> > +const struct bpf_map_ops cpu_map_ops = {
> > + .map_alloc = cpu_map_alloc,
> > + .map_free = cpu_map_free,
> > + .map_delete_elem = cpu_map_delete_elem,
> > + .map_update_elem = cpu_map_update_elem,
> > + .map_lookup_elem = cpu_map_lookup_elem,
> > + .map_get_next_key = cpu_map_get_next_key,
> > +};
> > +
> > +
>
> Extra new line.
Sorry
> > +/* Runs under RCU-read-side, plus in softirq under NAPI protection.
> > + * Thus, safe percpu variable access.
> > + */
> > +static int bq_enqueue(struct bpf_cpu_map_entry *rcpu, struct xdp_pkt *xdp_pkt)
> > +{
> > + struct xdp_bulk_queue *bq = this_cpu_ptr(rcpu->bulkq);
> > +
> > + if (unlikely(bq->count == CPU_MAP_BULK_SIZE)) {
> > + bq_flush_to_queue(rcpu, bq);
> > + }
>
> Curly brackets not needed.
Sorry
--
Best regards,
Jesper Dangaard Brouer
MSc.CS, Principal Kernel Engineer at Red Hat
LinkedIn: http://www.linkedin.com/in/brouer
^ permalink raw reply
* [PATCH] net: hns3: fix null pointer dereference before null check
From: Colin King @ 2017-09-29 19:51 UTC (permalink / raw)
To: Yisen Zhuang, Salil Mehta, David S . Miller, netdev
Cc: kernel-janitors, linux-kernel
From: Colin Ian King <colin.king@canonical.com>
pointer ndev is being dereferenced with the call to netif_running
before it is being null checked. Re-order the code to only dereference
ndev after it has been null checked.
Detected by CoverityScan, CID#1457206 ("Dereference before null check")
Fixes: 9df8f79a4d29 ("net: hns3: Add DCB support when interacting with network stack")
Signed-off-by: Colin Ian King <colin.king@canonical.com>
---
drivers/net/ethernet/hisilicon/hns3/hns3pf/hns3_enet.c | 4 +++-
1 file changed, 3 insertions(+), 1 deletion(-)
diff --git a/drivers/net/ethernet/hisilicon/hns3/hns3pf/hns3_enet.c b/drivers/net/ethernet/hisilicon/hns3/hns3pf/hns3_enet.c
index 4a0890f98b70..c31506514e5d 100644
--- a/drivers/net/ethernet/hisilicon/hns3/hns3pf/hns3_enet.c
+++ b/drivers/net/ethernet/hisilicon/hns3/hns3pf/hns3_enet.c
@@ -2865,7 +2865,7 @@ static int hns3_client_setup_tc(struct hnae3_handle *handle, u8 tc)
{
struct hnae3_knic_private_info *kinfo = &handle->kinfo;
struct net_device *ndev = kinfo->netdev;
- bool if_running = netif_running(ndev);
+ bool if_running;
int ret;
u8 i;
@@ -2875,6 +2875,8 @@ static int hns3_client_setup_tc(struct hnae3_handle *handle, u8 tc)
if (!ndev)
return -ENODEV;
+ if_running = netif_running(ndev);
+
ret = netdev_set_num_tc(ndev, tc);
if (ret)
return ret;
--
2.14.1
^ permalink raw reply related
* Re: [PATCH net-next] vhost_net: do not stall on zerocopy depletion
From: Michael S. Tsirkin @ 2017-09-29 19:38 UTC (permalink / raw)
To: Willem de Bruijn
Cc: netdev, davem, jasowang, den, virtualization, Willem de Bruijn
In-Reply-To: <20170928002556.41240-1-willemdebruijn.kernel@gmail.com>
On Wed, Sep 27, 2017 at 08:25:56PM -0400, Willem de Bruijn wrote:
> From: Willem de Bruijn <willemb@google.com>
>
> Vhost-net has a hard limit on the number of zerocopy skbs in flight.
> When reached, transmission stalls. Stalls cause latency, as well as
> head-of-line blocking of other flows that do not use zerocopy.
>
> Instead of stalling, revert to copy-based transmission.
>
> Tested by sending two udp flows from guest to host, one with payload
> of VHOST_GOODCOPY_LEN, the other too small for zerocopy (1B). The
> large flow is redirected to a netem instance with 1MBps rate limit
> and deep 1000 entry queue.
>
> modprobe ifb
> ip link set dev ifb0 up
> tc qdisc add dev ifb0 root netem limit 1000 rate 1MBit
>
> tc qdisc add dev tap0 ingress
> tc filter add dev tap0 parent ffff: protocol ip \
> u32 match ip dport 8000 0xffff \
> action mirred egress redirect dev ifb0
>
> Before the delay, both flows process around 80K pps. With the delay,
> before this patch, both process around 400. After this patch, the
> large flow is still rate limited, while the small reverts to its
> original rate. See also discussion in the first link, below.
>
> The limit in vhost_exceeds_maxpend must be carefully chosen. When
> vq->num >> 1, the flows remain correlated. This value happens to
> correspond to VHOST_MAX_PENDING for vq->num == 256. Allow smaller
> fractions and ensure correctness also for much smaller values of
> vq->num, by testing the min() of both explicitly. See also the
> discussion in the second link below.
>
> Link:http://lkml.kernel.org/r/CAF=yD-+Wk9sc9dXMUq1+x_hh=3ThTXa6BnZkygP3tgVpjbp93g@mail.gmail.com
> Link:http://lkml.kernel.org/r/20170819064129.27272-1-den@klaipeden.com
> Signed-off-by: Willem de Bruijn <willemb@google.com>
I'd like to see the effect on the non rate limited case though.
If guest is quick won't we have lots of copies then?
> ---
> drivers/vhost/net.c | 14 ++++----------
> 1 file changed, 4 insertions(+), 10 deletions(-)
>
> diff --git a/drivers/vhost/net.c b/drivers/vhost/net.c
> index 58585ec8699e..50758602ae9d 100644
> --- a/drivers/vhost/net.c
> +++ b/drivers/vhost/net.c
> @@ -436,8 +436,8 @@ static bool vhost_exceeds_maxpend(struct vhost_net *net)
> struct vhost_net_virtqueue *nvq = &net->vqs[VHOST_NET_VQ_TX];
> struct vhost_virtqueue *vq = &nvq->vq;
>
> - return (nvq->upend_idx + vq->num - VHOST_MAX_PEND) % UIO_MAXIOV
> - == nvq->done_idx;
> + return (nvq->upend_idx + UIO_MAXIOV - nvq->done_idx) % UIO_MAXIOV >
> + min(VHOST_MAX_PEND, vq->num >> 2);
> }
>
> /* Expects to be always run from workqueue - which acts as
> @@ -480,12 +480,6 @@ static void handle_tx(struct vhost_net *net)
> if (zcopy)
> vhost_zerocopy_signal_used(net, vq);
>
> - /* If more outstanding DMAs, queue the work.
> - * Handle upend_idx wrap around
> - */
> - if (unlikely(vhost_exceeds_maxpend(net)))
> - break;
> -
> head = vhost_net_tx_get_vq_desc(net, vq, vq->iov,
> ARRAY_SIZE(vq->iov),
> &out, &in);
> @@ -509,6 +503,7 @@ static void handle_tx(struct vhost_net *net)
> len = iov_length(vq->iov, out);
> iov_iter_init(&msg.msg_iter, WRITE, vq->iov, out, len);
> iov_iter_advance(&msg.msg_iter, hdr_size);
> +
> /* Sanity check */
> if (!msg_data_left(&msg)) {
> vq_err(vq, "Unexpected header len for TX: "
> @@ -519,8 +514,7 @@ static void handle_tx(struct vhost_net *net)
> len = msg_data_left(&msg);
>
> zcopy_used = zcopy && len >= VHOST_GOODCOPY_LEN
> - && (nvq->upend_idx + 1) % UIO_MAXIOV !=
> - nvq->done_idx
> + && !vhost_exceeds_maxpend(net)
> && vhost_net_tx_select_zcopy(net);
>
> /* use msg_control to pass vhost zerocopy ubuf info to skb */
> --
> 2.14.2.822.g60be5d43e6-goog
^ permalink raw reply
* Re: [PATCH net-next 0/8] net: dsa: change dsa_ptr for a dsa_port
From: Vivien Didelot @ 2017-09-29 19:34 UTC (permalink / raw)
To: Florian Fainelli, netdev
Cc: linux-kernel, kernel, David S. Miller, Andrew Lunn
In-Reply-To: <03cef84e-546d-d8c2-85fb-afc697ced2cb@gmail.com>
Hi Florian,
Florian Fainelli <f.fainelli@gmail.com> writes:
> On 09/29/2017 11:36 AM, Vivien Didelot wrote:
>> With DSA, a master net_device is physically wired to a dedicated CPU
>> switch port. For interaction with the DSA layer, the struct net_device
>> contains a dsa_ptr, which currently points to a dsa_switch_tree object.
>>
>> This is only valid for a switch fabric with a single CPU port. In order
>> to support switch fabrics with multiple CPU ports, we first need to
>> change the type of dsa_ptr to what it really is: a dsa_port object.
>>
>> This is what this patchset does. The first 4 patches cleans up portions
>> of DSA core to make the next patches more readable. These next patches
>> prepare the xmit and receive hot paths and finally change dsa_ptr.
>
> This looks nice and clean, as mentioned in patch 5, there may be room
> for organizing the structure a bit more efficiently such that everything
> still fits within the first cacheline .
Thanks for this very constructive comment! I'll look into this.
Respinning in a few.
Vivien
^ permalink raw reply
* Re: [PATCH net-next 0/8] net: dsa: change dsa_ptr for a dsa_port
From: Florian Fainelli @ 2017-09-29 19:26 UTC (permalink / raw)
To: Vivien Didelot, netdev; +Cc: linux-kernel, kernel, David S. Miller, Andrew Lunn
In-Reply-To: <20170929183635.8122-1-vivien.didelot@savoirfairelinux.com>
On 09/29/2017 11:36 AM, Vivien Didelot wrote:
> With DSA, a master net_device is physically wired to a dedicated CPU
> switch port. For interaction with the DSA layer, the struct net_device
> contains a dsa_ptr, which currently points to a dsa_switch_tree object.
>
> This is only valid for a switch fabric with a single CPU port. In order
> to support switch fabrics with multiple CPU ports, we first need to
> change the type of dsa_ptr to what it really is: a dsa_port object.
>
> This is what this patchset does. The first 4 patches cleans up portions
> of DSA core to make the next patches more readable. These next patches
> prepare the xmit and receive hot paths and finally change dsa_ptr.
This looks nice and clean, as mentioned in patch 5, there may be room
for organizing the structure a bit more efficiently such that everything
still fits within the first cacheline .
>
> Vivien Didelot (8):
> net: dsa: directly fetch switch in mtk_tag_rcv
> net: dsa: directly fetch switch in lan9303_rcv
> net: dsa: use cpu_dp in master code
> net: dsa: use temporary dsa_device_ops variable
> net: dsa: add tagging ops to port
> net: dsa: prepare master receive hot path
> net: dsa: change dsa_ptr for a dsa_port
> net: dsa: remove tag ops from the switch tree
>
> include/linux/netdevice.h | 4 ++--
> include/net/dsa.h | 19 ++++++++-----------
> net/dsa/dsa.c | 6 +++---
> net/dsa/dsa2.c | 15 ++++++++++-----
> net/dsa/dsa_priv.h | 7 +------
> net/dsa/legacy.c | 15 ++++++++++-----
> net/dsa/master.c | 47 ++++++++++++++++++++++-------------------------
> net/dsa/slave.c | 3 +--
> net/dsa/tag_brcm.c | 3 +--
> net/dsa/tag_dsa.c | 3 ++-
> net/dsa/tag_edsa.c | 3 ++-
> net/dsa/tag_ksz.c | 3 +--
> net/dsa/tag_lan9303.c | 6 ++----
> net/dsa/tag_mtk.c | 12 ++----------
> net/dsa/tag_qca.c | 3 +--
> net/dsa/tag_trailer.c | 3 +--
> 16 files changed, 69 insertions(+), 83 deletions(-)
>
--
Florian
^ permalink raw reply
* Re: [PATCH net-next 5/8] net: dsa: add tagging ops to port
From: Florian Fainelli @ 2017-09-29 19:24 UTC (permalink / raw)
To: Vivien Didelot, netdev; +Cc: linux-kernel, kernel, David S. Miller, Andrew Lunn
In-Reply-To: <20170929183635.8122-6-vivien.didelot@savoirfairelinux.com>
On 09/29/2017 11:36 AM, Vivien Didelot wrote:
> The DSA tagging protocol operations are specific to each CPU port,
> thus the dsa_device_ops pointer belongs to the dsa_port structure.
>
> From now on assign a slave's xmit copy from its CPU port tagging
> operations. This will ease the future support for multiple CPU ports.
>
> Signed-off-by: Vivien Didelot <vivien.didelot@savoirfairelinux.com>
> ---
> include/net/dsa.h | 3 +++
> net/dsa/dsa2.c | 1 +
> net/dsa/dsa_priv.h | 2 +-
> net/dsa/legacy.c | 1 +
> net/dsa/slave.c | 3 +--
> 5 files changed, 7 insertions(+), 3 deletions(-)
>
> diff --git a/include/net/dsa.h b/include/net/dsa.h
> index 8dee216a5a9b..6cd36dcb65e1 100644
> --- a/include/net/dsa.h
> +++ b/include/net/dsa.h
> @@ -189,6 +189,9 @@ struct dsa_port {
> * Original copy of the master netdev ethtool_ops
> */
> const struct ethtool_ops *orig_ethtool_ops;
> +
> + /* CPU port tagging operations used by master or slave devices */
> + const struct dsa_device_ops *tag_ops;
You might actually want to move this up in the dsa_port structure in
order to keep being in the first cacheline (you can use pahole -C
dsa_port vmlinux).
dsa_switch_tree is currently a 56 bytes structure, thus fitting in a 64b
cache line, but dsa_port is 80bytes, and the hot-path are in the second
cacheline, so less efficient.
--
Florian
^ permalink raw reply
page: next (older) | prev (newer) | latest
- recent:[subjects (threaded)|topics (new)|topics (active)]
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox