* [PATCH nf-next] netfilter: xtables: lightweight process control group matching
From: Daniel Borkmann @ 2013-10-04 18:20 UTC (permalink / raw)
To: pablo; +Cc: netfilter-devel, netdev, Tejun Heo, cgroups
It would be useful e.g. in a server or desktop environment to have
a facility in the notion of fine-grained "per application" or "per
application group" firewall policies. Probably, users in the mobile/
embedded area (e.g. Android based) with different security policy
requirements for application groups could have great benefit from
that as well. For example, with a little bit of configuration effort,
an admin could whitelist well-known applications, and thus block
otherwise unwanted "hard-to-track" applications like [1] from a
user's machine.
Implementation of PID-based matching would not be appropriate
as they frequently change, and child tracking would make that
even more complex and ugly. Cgroups would be a perfect candidate
for accomplishing that as they associate a set of tasks with a
set of parameters for one or more subsystems, in our case the
netfilter subsystem, which, of course, can be combined with other
cgroup subsystems into something more complex.
As mentioned, to overcome this constraint, such processes could
be placed into one or multiple cgroups where different fine-grained
rules can be defined depending on the application scenario, while
e.g. everything else that is not part of that could be dropped (or
vice versa), thus making life harder for unwanted processes to
communicate to the outside world. So, we make use of cgroups here
to track jobs and limit their resources in terms of iptables
policies; in other words, limiting what they are allowed to
communicate.
We have similar cgroup facilities in networking for traffic
classifier, and netprio cgroups. This feature adds a lightweight
cgroup id matching in terms of network security resp. network
traffic isolation as part of netfilter's xtables subsystem.
Minimal, basic usage example (many other iptables options can be
applied obviously):
1) Configuring cgroups:
mkdir /sys/fs/cgroup/net_filter
mount -t cgroup -o net_filter net_filter /sys/fs/cgroup/net_filter
mkdir /sys/fs/cgroup/net_filter/0
echo 1 > /sys/fs/cgroup/net_filter/0/net_filter.fwid
2) Configuring netfilter:
iptables -A OUTPUT -m cgroup ! --cgroup 1 -j DROP
3) Running applications:
ping 208.67.222.222 <pid:1799>
echo 1799 > /sys/fs/cgroup/net_filter/0/tasks
64 bytes from 208.67.222.222: icmp_seq=44 ttl=49 time=11.9 ms
...
ping 208.67.220.220 <pid:1804>
ping: sendmsg: Operation not permitted
...
echo 1804 > /sys/fs/cgroup/net_filter/0/tasks
64 bytes from 208.67.220.220: icmp_seq=89 ttl=56 time=19.0 ms
...
Of course, real-world deployments would make use of cgroups user
space toolsuite, or custom daemons dynamically moving applications
from/to net_filter cgroups.
[1] http://www.blackhat.com/presentations/bh-europe-06/bh-eu-06-biondi/bh-eu-06-biondi-up.pdf
Signed-off-by: Daniel Borkmann <dborkman@redhat.com>
Cc: Tejun Heo <tj@kernel.org>
Cc: cgroups@vger.kernel.org
---
Documentation/cgroups/00-INDEX | 2 +
Documentation/cgroups/net_filter.txt | 27 +++++
include/linux/cgroup_subsys.h | 5 +
include/net/netfilter/xt_cgroup.h | 58 ++++++++++
include/net/sock.h | 3 +
include/uapi/linux/netfilter/Kbuild | 1 +
include/uapi/linux/netfilter/xt_cgroup.h | 11 ++
net/core/scm.c | 2 +
net/core/sock.c | 14 +++
net/netfilter/Kconfig | 8 ++
net/netfilter/Makefile | 1 +
net/netfilter/xt_cgroup.c | 182 +++++++++++++++++++++++++++++++
12 files changed, 314 insertions(+)
create mode 100644 Documentation/cgroups/net_filter.txt
create mode 100644 include/net/netfilter/xt_cgroup.h
create mode 100644 include/uapi/linux/netfilter/xt_cgroup.h
create mode 100644 net/netfilter/xt_cgroup.c
diff --git a/Documentation/cgroups/00-INDEX b/Documentation/cgroups/00-INDEX
index bc461b6..14424d2 100644
--- a/Documentation/cgroups/00-INDEX
+++ b/Documentation/cgroups/00-INDEX
@@ -20,6 +20,8 @@ memory.txt
- Memory Resource Controller; design, accounting, interface, testing.
net_cls.txt
- Network classifier cgroups details and usages.
+net_filter.txt
+ - Network firewalling (netfilter) cgroups details and usages.
net_prio.txt
- Network priority cgroups details and usages.
resource_counter.txt
diff --git a/Documentation/cgroups/net_filter.txt b/Documentation/cgroups/net_filter.txt
new file mode 100644
index 0000000..0e21822
--- /dev/null
+++ b/Documentation/cgroups/net_filter.txt
@@ -0,0 +1,27 @@
+Netfilter cgroup
+----------------
+
+The netfilter cgroup provides an interface to aggregate jobs
+to a particular netfilter tag, that can be used to apply
+various iptables/netfilter policies for those jobs in order
+to limit resources/abilities for network communication.
+
+Creating a net_filter cgroups instance creates a net_filter.fwid
+file. The value of net_filter.fwid is initialized to 0 on
+default (so only global iptables/netfilter policies apply).
+You can write a unique decimal fwid tag into net_filter.fwid
+file, and use that tag along with iptables' --cgroup option.
+
+Minimal/basic usage example:
+
+1) Configuring cgroup:
+
+ mkdir /sys/fs/cgroup/net_filter
+ mount -t cgroup -o net_filter net_filter /sys/fs/cgroup/net_filter
+ mkdir /sys/fs/cgroup/net_filter/0
+ echo 1 > /sys/fs/cgroup/net_filter/0/net_filter.fwid
+ echo [pid] > /sys/fs/cgroup/net_filter/0/tasks
+
+2) Configuring netfilter:
+
+ iptables -A OUTPUT -m cgroup ! --cgroup 1 -p tcp --dport 80 -j DROP
diff --git a/include/linux/cgroup_subsys.h b/include/linux/cgroup_subsys.h
index b613ffd..ef58217 100644
--- a/include/linux/cgroup_subsys.h
+++ b/include/linux/cgroup_subsys.h
@@ -50,6 +50,11 @@ SUBSYS(net_prio)
#if IS_SUBSYS_ENABLED(CONFIG_CGROUP_HUGETLB)
SUBSYS(hugetlb)
#endif
+
+#if IS_SUBSYS_ENABLED(CONFIG_NETFILTER_XT_MATCH_CGROUP)
+SUBSYS(net_filter)
+#endif
+
/*
* DO NOT ADD ANY SUBSYSTEM WITHOUT EXPLICIT ACKS FROM CGROUP MAINTAINERS.
*/
diff --git a/include/net/netfilter/xt_cgroup.h b/include/net/netfilter/xt_cgroup.h
new file mode 100644
index 0000000..b2c702f
--- /dev/null
+++ b/include/net/netfilter/xt_cgroup.h
@@ -0,0 +1,58 @@
+#ifndef _XT_CGROUP_H
+#define _XT_CGROUP_H
+
+#include <linux/types.h>
+#include <linux/cgroup.h>
+#include <linux/hardirq.h>
+#include <linux/rcupdate.h>
+
+#if IS_ENABLED(CONFIG_NETFILTER_XT_MATCH_CGROUP)
+struct cgroup_nf_state {
+ struct cgroup_subsys_state css;
+ u32 fwid;
+};
+
+void sock_update_fwid(struct sock *sk);
+
+#if IS_BUILTIN(CONFIG_NETFILTER_XT_MATCH_CGROUP)
+static inline u32 task_fwid(struct task_struct *p)
+{
+ u32 fwid;
+
+ if (in_interrupt())
+ return 0;
+
+ rcu_read_lock();
+ fwid = container_of(task_css(p, net_filter_subsys_id),
+ struct cgroup_nf_state, css)->fwid;
+ rcu_read_unlock();
+
+ return fwid;
+}
+#elif IS_MODULE(CONFIG_NETFILTER_XT_MATCH_CGROUP)
+static inline u32 task_fwid(struct task_struct *p)
+{
+ struct cgroup_subsys_state *css;
+ u32 fwid = 0;
+
+ if (in_interrupt())
+ return 0;
+
+ rcu_read_lock();
+ css = task_css(p, net_filter_subsys_id);
+ if (css)
+ fwid = container_of(css, struct cgroup_nf_state, css)->fwid;
+ rcu_read_unlock();
+
+ return fwid;
+}
+#endif
+#else /* !CONFIG_NETFILTER_XT_MATCH_CGROUP */
+static inline u32 task_fwid(struct task_struct *p)
+{
+ return 0;
+}
+
+#define sock_update_fwid(sk)
+#endif /* CONFIG_NETFILTER_XT_MATCH_CGROUP */
+#endif /* _XT_CGROUP_H */
diff --git a/include/net/sock.h b/include/net/sock.h
index e3bf213..f7da4b4 100644
--- a/include/net/sock.h
+++ b/include/net/sock.h
@@ -387,6 +387,9 @@ struct sock {
#if IS_ENABLED(CONFIG_NETPRIO_CGROUP)
__u32 sk_cgrp_prioidx;
#endif
+#if IS_ENABLED(CONFIG_NETFILTER_XT_MATCH_CGROUP)
+ __u32 sk_cgrp_fwid;
+#endif
struct pid *sk_peer_pid;
const struct cred *sk_peer_cred;
long sk_rcvtimeo;
diff --git a/include/uapi/linux/netfilter/Kbuild b/include/uapi/linux/netfilter/Kbuild
index 1749154..94a4890 100644
--- a/include/uapi/linux/netfilter/Kbuild
+++ b/include/uapi/linux/netfilter/Kbuild
@@ -37,6 +37,7 @@ header-y += xt_TEE.h
header-y += xt_TPROXY.h
header-y += xt_addrtype.h
header-y += xt_bpf.h
+header-y += xt_cgroup.h
header-y += xt_cluster.h
header-y += xt_comment.h
header-y += xt_connbytes.h
diff --git a/include/uapi/linux/netfilter/xt_cgroup.h b/include/uapi/linux/netfilter/xt_cgroup.h
new file mode 100644
index 0000000..43acb7e
--- /dev/null
+++ b/include/uapi/linux/netfilter/xt_cgroup.h
@@ -0,0 +1,11 @@
+#ifndef _UAPI_XT_CGROUP_H
+#define _UAPI_XT_CGROUP_H
+
+#include <linux/types.h>
+
+struct xt_cgroup_info {
+ __u32 id;
+ __u32 invert;
+};
+
+#endif /* _UAPI_XT_CGROUP_H */
diff --git a/net/core/scm.c b/net/core/scm.c
index b442e7e..f08672a 100644
--- a/net/core/scm.c
+++ b/net/core/scm.c
@@ -36,6 +36,7 @@
#include <net/sock.h>
#include <net/compat.h>
#include <net/scm.h>
+#include <net/netfilter/xt_cgroup.h>
#include <net/cls_cgroup.h>
@@ -290,6 +291,7 @@ void scm_detach_fds(struct msghdr *msg, struct scm_cookie *scm)
/* Bump the usage count and install the file. */
sock = sock_from_file(fp[i], &err);
if (sock) {
+ sock_update_fwid(sock->sk);
sock_update_netprioidx(sock->sk);
sock_update_classid(sock->sk);
}
diff --git a/net/core/sock.c b/net/core/sock.c
index 2bd9b3f..524a376 100644
--- a/net/core/sock.c
+++ b/net/core/sock.c
@@ -125,6 +125,7 @@
#include <linux/skbuff.h>
#include <net/net_namespace.h>
#include <net/request_sock.h>
+#include <net/netfilter/xt_cgroup.h>
#include <net/sock.h>
#include <linux/net_tstamp.h>
#include <net/xfrm.h>
@@ -1337,6 +1338,18 @@ void sock_update_netprioidx(struct sock *sk)
EXPORT_SYMBOL_GPL(sock_update_netprioidx);
#endif
+#if IS_ENABLED(CONFIG_NETFILTER_XT_MATCH_CGROUP)
+void sock_update_fwid(struct sock *sk)
+{
+ u32 fwid;
+
+ fwid = task_fwid(current);
+ if (fwid != sk->sk_cgrp_fwid)
+ sk->sk_cgrp_fwid = fwid;
+}
+EXPORT_SYMBOL(sock_update_fwid);
+#endif
+
/**
* sk_alloc - All socket objects are allocated here
* @net: the applicable net namespace
@@ -1363,6 +1376,7 @@ struct sock *sk_alloc(struct net *net, int family, gfp_t priority,
sock_update_classid(sk);
sock_update_netprioidx(sk);
+ sock_update_fwid(sk);
}
return sk;
diff --git a/net/netfilter/Kconfig b/net/netfilter/Kconfig
index 6e839b6..d276ff4 100644
--- a/net/netfilter/Kconfig
+++ b/net/netfilter/Kconfig
@@ -806,6 +806,14 @@ config NETFILTER_XT_MATCH_BPF
To compile it as a module, choose M here. If unsure, say N.
+config NETFILTER_XT_MATCH_CGROUP
+ tristate '"control group" match support'
+ depends on NETFILTER_ADVANCED
+ depends on CGROUPS
+ ---help---
+ Socket/process control group matching allows you to match locally
+ generated packets based on which control group processes belong to.
+
config NETFILTER_XT_MATCH_CLUSTER
tristate '"cluster" match support'
depends on NF_CONNTRACK
diff --git a/net/netfilter/Makefile b/net/netfilter/Makefile
index c3a0a12..12f014f 100644
--- a/net/netfilter/Makefile
+++ b/net/netfilter/Makefile
@@ -124,6 +124,7 @@ obj-$(CONFIG_NETFILTER_XT_MATCH_MULTIPORT) += xt_multiport.o
obj-$(CONFIG_NETFILTER_XT_MATCH_NFACCT) += xt_nfacct.o
obj-$(CONFIG_NETFILTER_XT_MATCH_OSF) += xt_osf.o
obj-$(CONFIG_NETFILTER_XT_MATCH_OWNER) += xt_owner.o
+obj-$(CONFIG_NETFILTER_XT_MATCH_CGROUP) += xt_cgroup.o
obj-$(CONFIG_NETFILTER_XT_MATCH_PHYSDEV) += xt_physdev.o
obj-$(CONFIG_NETFILTER_XT_MATCH_PKTTYPE) += xt_pkttype.o
obj-$(CONFIG_NETFILTER_XT_MATCH_POLICY) += xt_policy.o
diff --git a/net/netfilter/xt_cgroup.c b/net/netfilter/xt_cgroup.c
new file mode 100644
index 0000000..86be16d
--- /dev/null
+++ b/net/netfilter/xt_cgroup.c
@@ -0,0 +1,182 @@
+/*
+ * Xtables module to match the process control group.
+ *
+ * Might be used to implement individual "per-application" firewall
+ * policies (in contrast to global policies) based on control groups.
+ *
+ * (C) 2013 Daniel Borkmann <dborkman@redhat.com>
+ * (C) 2013 Thomas Graf <tgraf@redhat.com>
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License version 2 as
+ * published by the Free Software Foundation.
+ */
+
+#include <linux/skbuff.h>
+#include <linux/module.h>
+#include <linux/file.h>
+#include <linux/cgroup.h>
+#include <linux/fdtable.h>
+#include <linux/netfilter/x_tables.h>
+#include <linux/netfilter/xt_cgroup.h>
+#include <net/netfilter/xt_cgroup.h>
+#include <net/sock.h>
+
+MODULE_LICENSE("GPL");
+MODULE_AUTHOR("Daniel Borkmann <dborkman@redhat.com>");
+MODULE_DESCRIPTION("Xtables: process control group matching");
+MODULE_ALIAS("ipt_cgroup");
+MODULE_ALIAS("ip6t_cgroup");
+
+static int cgroup_mt_check(const struct xt_mtchk_param *par)
+{
+ struct xt_cgroup_info *info = par->matchinfo;
+
+ if (info->invert & ~1)
+ return -EINVAL;
+
+ return info->id ? 0 : -EINVAL;
+}
+
+static bool
+cgroup_mt(const struct sk_buff *skb, struct xt_action_param *par)
+{
+ const struct xt_cgroup_info *info = par->matchinfo;
+
+ if (skb->sk == NULL)
+ return false;
+
+ return (info->id == skb->sk->sk_cgrp_fwid) ^ info->invert;
+}
+
+static struct xt_match cgroup_mt_reg __read_mostly = {
+ .name = "cgroup",
+ .revision = 0,
+ .family = NFPROTO_UNSPEC,
+ .checkentry = cgroup_mt_check,
+ .match = cgroup_mt,
+ .matchsize = sizeof(struct xt_cgroup_info),
+ .me = THIS_MODULE,
+ .hooks = (1 << NF_INET_LOCAL_OUT) |
+ (1 << NF_INET_POST_ROUTING),
+};
+
+static inline struct cgroup_nf_state *
+css_nf_state(struct cgroup_subsys_state *css)
+{
+ return css ? container_of(css, struct cgroup_nf_state, css) : NULL;
+}
+
+static inline struct cgroup_nf_state *task_nf_state(struct task_struct *p)
+{
+ return css_nf_state(task_css(p, net_filter_subsys_id));
+}
+
+static struct cgroup_subsys_state *
+cgroup_css_alloc(struct cgroup_subsys_state *parent_css)
+{
+ struct cgroup_nf_state *cs;
+
+ cs = kzalloc(sizeof(*cs), GFP_KERNEL);
+ if (!cs)
+ return ERR_PTR(-ENOMEM);
+
+ return &cs->css;
+}
+
+static int cgroup_css_online(struct cgroup_subsys_state *css)
+{
+ struct cgroup_nf_state *cs = css_nf_state(css);
+ struct cgroup_nf_state *parent = css_nf_state(css_parent(css));
+
+ if (parent)
+ cs->fwid = parent->fwid;
+
+ return 0;
+}
+
+static void cgroup_css_free(struct cgroup_subsys_state *css)
+{
+ kfree(css_nf_state(css));
+}
+
+static int cgroup_fwid_update(const void *v, struct file *file, unsigned n)
+{
+ int err;
+ struct socket *sock = sock_from_file(file, &err);
+
+ if (sock)
+ sock->sk->sk_cgrp_fwid = (u32)(unsigned long) v;
+
+ return 0;
+}
+
+static u64 cgroup_fwid_read(struct cgroup_subsys_state *css,
+ struct cftype *cft)
+{
+ return css_nf_state(css)->fwid;
+}
+
+static int cgroup_fwid_write(struct cgroup_subsys_state *css,
+ struct cftype *cft, u64 id)
+{
+ css_nf_state(css)->fwid = (u32) id;
+
+ return 0;
+}
+
+static void cgroup_attach(struct cgroup_subsys_state *css,
+ struct cgroup_taskset *tset)
+{
+ struct task_struct *p;
+ void *v;
+
+ cgroup_taskset_for_each(p, css, tset) {
+ task_lock(p);
+ v = (void *)(unsigned long) task_fwid(p);
+ iterate_fd(p->files, 0, cgroup_fwid_update, v);
+ task_unlock(p);
+ }
+}
+
+static struct cftype net_filter_ss_files[] = {
+ {
+ .name = "fwid",
+ .read_u64 = cgroup_fwid_read,
+ .write_u64 = cgroup_fwid_write,
+ },
+ { }
+};
+
+struct cgroup_subsys net_filter_subsys = {
+ .name = "net_filter",
+ .css_alloc = cgroup_css_alloc,
+ .css_online = cgroup_css_online,
+ .css_free = cgroup_css_free,
+ .attach = cgroup_attach,
+ .subsys_id = net_filter_subsys_id,
+ .base_cftypes = net_filter_ss_files,
+ .module = THIS_MODULE,
+};
+
+static int __init cgroup_mt_init(void)
+{
+ int ret = cgroup_load_subsys(&net_filter_subsys);
+ if (ret)
+ goto out;
+
+ ret = xt_register_match(&cgroup_mt_reg);
+ if (ret)
+ cgroup_unload_subsys(&net_filter_subsys);
+out:
+ return ret;
+}
+
+static void __exit cgroup_mt_exit(void)
+{
+ xt_unregister_match(&cgroup_mt_reg);
+ cgroup_unload_subsys(&net_filter_subsys);
+}
+
+module_init(cgroup_mt_init);
+module_exit(cgroup_mt_exit);
--
1.8.3.1
^ permalink raw reply related
* Re: [PATCH] tcp: do not forget FIN in tcp_shifted_skb()
From: David Miller @ 2013-10-04 18:17 UTC (permalink / raw)
To: ilpo.jarvinen; +Cc: eric.dumazet, netdev, ncardwell, ycheng
In-Reply-To: <alpine.DEB.2.02.1310042101040.32153@melkinpaasi.cs.helsinki.fi>
From: "Ilpo Järvinen" <ilpo.jarvinen@helsinki.fi>
Date: Fri, 4 Oct 2013 21:03:53 +0300 (EEST)
> On Fri, 4 Oct 2013, Eric Dumazet wrote:
>
>> From: Eric Dumazet <edumazet@google.com>
>>
>> Yuchung found following problem :
>>
>> There are bugs in the SACK processing code, merging part in
>> tcp_shift_skb_data(), that incorrectly resets or ignores the sacked
>> skbs FIN flag. When a receiver first SACK the FIN sequence, and later
>> throw away ofo queue (e.g., sack-reneging), the sender will stop
>> retransmitting the FIN flag, and hangs forever.
>>
>> Following packetdrill test can be used to reproduce the bug.
...
> Nice that it was finally found. For some reason my memory tries to say
> that it wouldn't have not even tried to merge skbs with FIN but either
> I changed that at some point while deving or I just remember wrong.
>
> Acked-by: Ilpo Järvinen <ilpo.jarvinen@helsinki.fi>
Applied, thanks everyone.
^ permalink raw reply
* Re: [PATCH] tcp: do not forget FIN in tcp_shifted_skb()
From: Ilpo Järvinen @ 2013-10-04 18:03 UTC (permalink / raw)
To: Eric Dumazet; +Cc: David Miller, netdev, Neal Cardwell, Yuchung Cheng
In-Reply-To: <1380907901.3564.24.camel@edumazet-glaptop.roam.corp.google.com>
[-- Attachment #1: Type: TEXT/PLAIN, Size: 1941 bytes --]
On Fri, 4 Oct 2013, Eric Dumazet wrote:
> From: Eric Dumazet <edumazet@google.com>
>
> Yuchung found following problem :
>
> There are bugs in the SACK processing code, merging part in
> tcp_shift_skb_data(), that incorrectly resets or ignores the sacked
> skbs FIN flag. When a receiver first SACK the FIN sequence, and later
> throw away ofo queue (e.g., sack-reneging), the sender will stop
> retransmitting the FIN flag, and hangs forever.
>
> Following packetdrill test can be used to reproduce the bug.
>
> [...snip...]
>
> First, a typo inverted left/right of one OR operation, then
> code forgot to advance end_seq if the merged skb carried FIN.
>
> Bug was added in 2.6.29 by commit 832d11c5cd076ab
> ("tcp: Try to restore large SKBs while SACK processing")
>
> Signed-off-by: Eric Dumazet <edumazet@google.com>
> Signed-off-by: Yuchung Cheng <ycheng@google.com>
> Acked-by: Neal Cardwell <ncardwell@google.com>
> Cc: Ilpo Järvinen <ilpo.jarvinen@helsinki.fi>
> ---
> net/ipv4/tcp_input.c | 5 ++++-
> 1 file changed, 4 insertions(+), 1 deletion(-)
>
> diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c
> index 25a89ea..113dc5f 100644
> --- a/net/ipv4/tcp_input.c
> +++ b/net/ipv4/tcp_input.c
> @@ -1284,7 +1284,10 @@ static bool tcp_shifted_skb(struct sock *sk, struct sk_buff *skb,
> tp->lost_cnt_hint -= tcp_skb_pcount(prev);
> }
>
> - TCP_SKB_CB(skb)->tcp_flags |= TCP_SKB_CB(prev)->tcp_flags;
> + TCP_SKB_CB(prev)->tcp_flags |= TCP_SKB_CB(skb)->tcp_flags;
> + if (TCP_SKB_CB(skb)->tcp_flags & TCPHDR_FIN)
> + TCP_SKB_CB(prev)->end_seq++;
> +
> if (skb == tcp_highest_sack(sk))
> tcp_advance_highest_sack(sk, skb);
Nice that it was finally found. For some reason my memory tries to say
that it wouldn't have not even tried to merge skbs with FIN but either
I changed that at some point while deving or I just remember wrong.
Acked-by: Ilpo Järvinen <ilpo.jarvinen@helsinki.fi>
--
i.
^ permalink raw reply
* Re: [PATCH v3 net-next] fix unsafe set_memory_rw from softirq
From: Ingo Molnar @ 2013-10-04 18:00 UTC (permalink / raw)
To: Alexei Starovoitov
Cc: Benjamin Herrenschmidt, Heiko Carstens, Eric Dumazet,
Paul Mackerras, H. Peter Anvin, sparclinux, Nicolas Dichtel,
linux-s390, Russell King, x86, James Morris, Ingo Molnar,
Alexey Kuznetsov, Paul E. McKenney, Xi Wang, Matt Evans,
Thomas Gleixner, linux-arm-kernel, Stelian Nirlu,
Nicolas Schichan, Hideaki YOSHIFUJI, netdev, linux-kernel,
David S. Miller, Mircea
In-Reply-To: <CAMEtUuy0mpdUHJjztK31mHjNVahnoZ8CHo2sLLyJmt=p5+gq0w@mail.gmail.com>
* Alexei Starovoitov <ast@plumgrid.com> wrote:
> >> #else
> >> +#include <linux/slab.h>
> >
> > Inlines in the middle of header files are generally frowned upon.
> >
> > The standard pattern is to put them at the top, that way it's easier to
> > see the dependencies and there's also less .config dependent inclusion,
> > which makes header hell cleanup work easier.
>
> Agree. I only followed the style that is already in filter.h 20 lines above.
>
> #ifdef CONFIG_BPF_JIT
> #include <stdarg.h>
> #include <linux/linkage.h>
> #include <linux/printk.h>
>
> as part of the cleanup can move all of them to the top. In the separate commit?
Yeah, sure, that's fine.
> >> struct sk_filter *fp;
> >> unsigned int fsize = sizeof(struct sock_filter) * fprog->len;
> >> + unsigned int sk_fsize = max_t(u32, fsize, sizeof(struct work_struct))
> >> + + sizeof(*fp);
> >
> > Using the structure definition I suggested, this could be replaced with
> > the more obvious:
> >
> > unsigned int sk_fsize = max(fsize, sizeof(*fp));
>
> with helper function it's even cleaner. Fixed in V4
So my thought was that the helper function is perhaps too trivial and
somewhat obscures the allocation pattern, but yeah. Either way is fine
with me.
> > A couple of questions/suggestions:
> >
> > 1)
> >
> > I took a brief look at arch/x86/net/bpf_jit_comp.c while reviewing this
> > patch.
> >
> > You need to split up bpf_jit_compile(), it's an obscenely large, ~600
> > lines long function. We don't do that in modern, maintainable kernel code.
>
> I had the same thought, therefore in my proposed generalization of bpf:
> http://patchwork.ozlabs.org/patch/279280/
> It is split into two. do_jit() is still a bit large at 400 lines. Can
> split it further.
Yeah, I think as long as you split out the loop iterator into a separate
function it gets much better.
The actual instruction generation code within the iterator looks good in a
single chunk - splitting it up further than that might in fact make it
less readable.
> > 3)
> >
> > It's nice code altogether! Are there any plans to generalize its
> > interfaces, to allow arbitrary bytecode to be used by other kernel
> > subsystems as well? In particular tracing filters could make use of
> > it, but it would also allow safe probe points.
>
> That was exactly the reasons to generalize bpf as I proposed.
Ok, cool :-)
For the x86 bits:
Acked-by: Ingo Molnar <mingo@kernel.org>
Thanks,
Ingo
^ permalink raw reply
* Re: [PATCH 00/33] Netfilter updates for net-next
From: David Miller @ 2013-10-04 17:59 UTC (permalink / raw)
To: pablo; +Cc: netfilter-devel, netdev
In-Reply-To: <1380875598-5250-1-git-send-email-pablo@netfilter.org>
From: Pablo Neira Ayuso <pablo@netfilter.org>
Date: Fri, 4 Oct 2013 10:32:45 +0200
> The following patchset contains Netfilter updates for your net-next tree,
> mostly ipset improvements and enhancements features, they are:
...
Pulled, thanks a lot Pablo.
^ permalink raw reply
* Re: [PATCH] veth: Showing peer of veth type dev in ip link (kernel side)
From: Stephen Hemminger @ 2013-10-04 17:55 UTC (permalink / raw)
To: Masatake YAMATO; +Cc: netdev
In-Reply-To: <1380859529-32351-1-git-send-email-yamato@redhat.com>
On Fri, 4 Oct 2013 13:05:29 +0900
Masatake YAMATO <yamato@redhat.com> wrote:
> ip link has ability to show extra information of net work device if
> kernel provides sunh information. With this patch veth driver can
> provide its peer ifindex information to ip command via netlink
> interface.
>
> Signed-off-by: Masatake YAMATO <yamato@redhat.com>
> ---
netlink API's are supposed to be symmetrical.
When creating veth, the VETH_INFO_PEER attribute is struct(ifinfomsg).
The fill_info should tack on the same data.
^ permalink raw reply
* Re: [PATCH v3 net-next] fix unsafe set_memory_rw from softirq
From: Alexei Starovoitov @ 2013-10-04 17:49 UTC (permalink / raw)
To: Ingo Molnar
Cc: Benjamin Herrenschmidt, Heiko Carstens, Eric Dumazet,
Paul Mackerras, H. Peter Anvin, sparclinux, Nicolas Dichtel,
linux-s390, Russell King, x86, James Morris, Ingo Molnar,
Alexey Kuznetsov, Paul E. McKenney, Xi Wang, Matt Evans,
Thomas Gleixner, linux-arm-kernel, Stelian Nirlu,
Nicolas Schichan, Hideaki YOSHIFUJI, netdev, linux-kernel,
David S. Miller, Mircea
In-Reply-To: <20131004075133.GA12313@gmail.com>
On Fri, Oct 4, 2013 at 12:51 AM, Ingo Molnar <mingo@kernel.org> wrote:
>> +static void bpf_jit_free_deferred(struct work_struct *work)
>> +{
>> + struct sk_filter *fp = container_of((void *)work, struct sk_filter,
>> + insns);
>> + unsigned long addr = (unsigned long)fp->bpf_func & PAGE_MASK;
>> + struct bpf_binary_header *header = (void *)addr;
>> +
>> + set_memory_rw(addr, header->pages);
>> + module_free(NULL, header);
>> + kfree(fp);
>> +}
>
> Using the data type suggestions I make further below, this could be
> written in a simpler form, as:
>
> struct sk_filter *fp = container_of(work, struct sk_filter, work);
yes. I've made it already as part of V4
> Also, a question, why do you mask with PAGE_MASK here:
>
> unsigned long addr = (unsigned long)fp->bpf_func & PAGE_MASK;
>
> ?
>
> AFAICS bpf_func is the module_alloc() result - and module code is page
> aligned. So ->bpf_func is always page aligned here. (The sk_run_filter
> special case cannot happen here.)
randomization of bpf_func start is a prevention of jit spraying
attacks as Eric pointed out.
>> void bpf_jit_free(struct sk_filter *fp)
>> {
>> if (fp->bpf_func != sk_run_filter) {
>> - unsigned long addr = (unsigned long)fp->bpf_func & PAGE_MASK;
>> - struct bpf_binary_header *header = (void *)addr;
>> -
>> - set_memory_rw(addr, header->pages);
>> - module_free(NULL, header);
>> + struct work_struct *work = (struct work_struct *)fp->insns;
>> + INIT_WORK(work, bpf_jit_free_deferred);
>
> Missing newline between local variables and statements.
yes. noted and fixed in V4.
>> unsigned int (*bpf_func)(const struct sk_buff *skb,
>> const struct sock_filter *filter);
>> - struct rcu_head rcu;
>> + /* insns start right after bpf_func, so that sk_run_filter() fetches
>> + * first insn from the same cache line that was used to call into
>> + * sk_run_filter()
>> + */
>> struct sock_filter insns[0];
>
> Please use the customary (multi-line) comment style:
>
> /*
> * Comment .....
> * ...... goes here.
> */
>
> specified in Documentation/CodingStyle.
I believe filter.h belongs to networking comment style, that's why
checkpatch didn't complain.
But I removed the comment in V4
>> static inline unsigned int sk_filter_len(const struct sk_filter *fp)
>> {
>> - return fp->len * sizeof(struct sock_filter) + sizeof(*fp);
>> + return max(fp->len * sizeof(struct sock_filter),
>> + sizeof(struct work_struct)) + sizeof(*fp);
>
> So, "sizeof(struct work_struct)) + sizeof(*fp)" is a pattern that repeats
> a couple of times. Might make sense to stick that into a helper of its own
> - but in general this open coded overlay allocation method looks a bit
> fragile and not very obvious in isolation.
>
> So it could be done a bit cleaner, using an anonymous union:
>
> /*
> * These two overlay, the work struct is used during workqueue
> * driven teardown, when the instructions are not used anymore:
> */
> union {
> struct sock_filter insns[0];
> struct work_struct work;
> };
>
> And then all the sizeof() calculations become obvious and sk_filter_len()
> could be eliminated - I've marked the conversions in the code further
> below.
Eric made exactly the same comment. Agreed and fixed in V4
>> #else
>> +#include <linux/slab.h>
>
> Inlines in the middle of header files are generally frowned upon.
>
> The standard pattern is to put them at the top, that way it's easier to
> see the dependencies and there's also less .config dependent inclusion,
> which makes header hell cleanup work easier.
Agree. I only followed the style that is already in filter.h 20 lines above.
#ifdef CONFIG_BPF_JIT
#include <stdarg.h>
#include <linux/linkage.h>
#include <linux/printk.h>
as part of the cleanup can move all of them to the top. In the separate commit?
>> struct sk_filter *fp;
>> unsigned int fsize = sizeof(struct sock_filter) * fprog->len;
>> + unsigned int sk_fsize = max_t(u32, fsize, sizeof(struct work_struct))
>> + + sizeof(*fp);
>
> Using the structure definition I suggested, this could be replaced with
> the more obvious:
>
> unsigned int sk_fsize = max(fsize, sizeof(*fp));
with helper function it's even cleaner. Fixed in V4
> A couple of questions/suggestions:
>
> 1)
>
> I took a brief look at arch/x86/net/bpf_jit_comp.c while reviewing this
> patch.
>
> You need to split up bpf_jit_compile(), it's an obscenely large, ~600
> lines long function. We don't do that in modern, maintainable kernel code.
I had the same thought, therefore in my proposed generalization of bpf:
http://patchwork.ozlabs.org/patch/279280/
It is split into two. do_jit() is still a bit large at 400 lines. Can
split it further.
> 3)
>
> It's nice code altogether! Are there any plans to generalize its
> interfaces, to allow arbitrary bytecode to be used by other kernel
> subsystems as well? In particular tracing filters could make use of it,
> but it would also allow safe probe points.
That was exactly the reasons to generalize bpf as I proposed.
"extended BPF is a set of pseudo instructions that stitch kernel provided
data in the form of bpf_context with kernel provided set of functions in a safe
and deterministic way with minimal performance overhead vs native code"
Not sure what 'tracing filters' you have in mind, but I'm happy to try
them with extended bpf model.
imo existing bpf with two registers is too limiting for performance
and argument passing.
That's why going to 10 made it a lot more flexible and generically usable.
Sorry to hijack the thread.
Thanks
Alexei
^ permalink raw reply
* Let the moment last as much as you want.
From: JamesEPolk @ 2013-10-04 17:40 UTC (permalink / raw)
Dear Customer,
Sometimes things happen much quicker than you would love to.
Let the moment last as much as you want.
http://translate.googleusercontent.com/translate_c?depth=1&hl=auto&sl=de&url=www.google.cn&u=http://onlineshop63.yolasite.com/store&usg=ALkJrhjuuBC9SIdn3nj6vfQ7-8vR_Bt35g
Best Regards,
^ permalink raw reply
* Re: [PATCH v3 net-next] fix unsafe set_memory_rw from softirq
From: Ingo Molnar @ 2013-10-04 17:35 UTC (permalink / raw)
To: Eric Dumazet
Cc: Benjamin Herrenschmidt, Heiko Carstens, Paul Mackerras,
H. Peter Anvin, sparclinux, Nicolas Dichtel, Alexei Starovoitov,
linux-s390, Russell King, x86, James Morris, Ingo Molnar,
Alexey Kuznetsov, Paul E. McKenney, Xi Wang, Matt Evans,
Thomas Gleixner, linux-arm-kernel, Stelian Nirlu,
Nicolas Schichan, Hideaki YOSHIFUJI, netdev, linux-kernel,
David S. Miller, Mi
In-Reply-To: <CANn89iKkbR0_HaofvC_OVvqRv_Hqj3rATx-Z_4xXeusOasa56g@mail.gmail.com>
* Eric Dumazet <edumazet@google.com> wrote:
> 1)
> >
> > I took a brief look at arch/x86/net/bpf_jit_comp.c while reviewing this
> > patch.
> >
> > You need to split up bpf_jit_compile(), it's an obscenely large, ~600
> > lines long function. We don't do that in modern, maintainable kernel code.
> >
> > 2)
> >
> > This 128 bytes extra padding:
> >
> > /* Most of BPF filters are really small,
> > * but if some of them fill a page, allow at least
> > * 128 extra bytes to insert a random section of int3
> > */
> > sz = round_up(proglen + sizeof(*header) + 128, PAGE_SIZE);
> >
> > why is it done? It's not clear to me from the comment.
> >
>
> commit 314beb9bcabfd6b4542ccbced2402af2c6f6142a
> Author: Eric Dumazet <edumazet@google.com>
> Date: Fri May 17 16:37:03 2013 +0000
>
> x86: bpf_jit_comp: secure bpf jit against spraying attacks
>
> hpa bringed into my attention some security related issues
> with BPF JIT on x86.
>
> This patch makes sure the bpf generated code is marked read only,
> as other kernel text sections.
>
> It also splits the unused space (we vmalloc() and only use a fraction of
> the page) in two parts, so that the generated bpf code not starts at a
> known offset in the page, but a pseudo random one.
Thanks for the explanation - that makes sense.
Ingo
^ permalink raw reply
* Re: [PATCH 0/3] net: mv643xx_eth: various small fixes for v3.12
From: Sebastian Hesselbarth @ 2013-10-04 17:34 UTC (permalink / raw)
To: Ezequiel Garcia
Cc: Jason Cooper, netdev, linux-kernel, Lennert Buytenhek,
David Miller, linux-arm-kernel
In-Reply-To: <20131004165354.GA9065@localhost>
On 10/04/2013 06:53 PM, Ezequiel Garcia wrote:
> On Wed, Oct 02, 2013 at 12:57:19PM +0200, Sebastian Hesselbarth wrote:
>> This patch set comprises some one-liners to fix issues with repeated
>> loading and unloading of a modular mv643xx_eth driver.
>>
>> First two patches take care of the periodic port statistic timer, that
>> updates statistics by reading port registers using add_timer/mod_timer.
>>
>> Patch 1 moves timer re-schedule from mib_counters_update to the timer
>> callback. As mib_counters_update is also called from non-timer context,
>> this ensures the timer is reactivated from timer context only.
>>
>> Patch 2 moves initial timer schedule from _probe() time to right before
>> the port is actually started as the corresponding del_timer_sync is at
>> _stop() time. This fixes a regression, where unloading the driver from a
>> non-started eth device can cause the timer to access deallocated mem.
>>
>> Patch 3 adds an assignment of the ports device_node to the corresponding
>> self-created platform_device. This is required to allow fixups based on
>> the device_node's compatible string later. Actually, it is also a potential
>> regression because we already check compatible string for Kirkwood, but
>> does not (yet) rely on the fixup.
>>
>> All patches are based on v3.12-rc3 and have been tested on Kirkwood-based
>> Seagate Dockstar.
>>
>> Patches 1 and 2 can also possibly queued up for -stable.
>>
>> Sebastian Hesselbarth (3):
>> net: mv643xx_eth: update statistics timer from timer context only
>> net: mv643xx_eth: fix orphaned statistics timer crash
>> net: mv643xx_eth: fix missing device_node for port devices
>>
>> drivers/net/ethernet/marvell/mv643xx_eth.c | 7 +++----
>> 1 file changed, 3 insertions(+), 4 deletions(-)
>>
>> ---
>> Cc: David Miller <davem@davemloft.net>
>> Cc: Lennert Buytenhek <buytenh@wantstofly.org>
>> Cc: Jason Cooper <jason@lakedaemon.net>
>> Cc: netdev@vger.kernel.org
>> Cc: linux-arm-kernel@lists.infradead.org
>> Cc: linux-kernel@vger.kernel.org
>
> Sorry for missing this series, somehow my filters failed to catch it.
>
> Anyway, I would expect these patches to have my Reported-by, but I'm
> glad to see you've solved all the issues.
True, thanks to Ezequiel for initially reporting the issues. I will
have to be more careful about the Reported-by next time.
> I see David has already applied them all, but I'll re-run my tests to
> check there aren't any more issues if I can find some time next week.
>
> Thanks for taking care of it!
^ permalink raw reply
* [PATCH] tcp: do not forget FIN in tcp_shifted_skb()
From: Eric Dumazet @ 2013-10-04 17:31 UTC (permalink / raw)
To: David Miller; +Cc: netdev, Neal Cardwell, Ilpo Järvinen, Yuchung Cheng
From: Eric Dumazet <edumazet@google.com>
Yuchung found following problem :
There are bugs in the SACK processing code, merging part in
tcp_shift_skb_data(), that incorrectly resets or ignores the sacked
skbs FIN flag. When a receiver first SACK the FIN sequence, and later
throw away ofo queue (e.g., sack-reneging), the sender will stop
retransmitting the FIN flag, and hangs forever.
Following packetdrill test can be used to reproduce the bug.
$ cat sack-merge-bug.pkt
`sysctl -q net.ipv4.tcp_fack=0`
// Establish a connection and send 10 MSS.
0.000 socket(..., SOCK_STREAM, IPPROTO_TCP) = 3
+.000 setsockopt(3, SOL_SOCKET, SO_REUSEADDR, [1], 4) = 0
+.000 bind(3, ..., ...) = 0
+.000 listen(3, 1) = 0
+.050 < S 0:0(0) win 32792 <mss 1000,sackOK,nop,nop,nop,wscale 7>
+.000 > S. 0:0(0) ack 1 <mss 1460,nop,nop,sackOK,nop,wscale 6>
+.001 < . 1:1(0) ack 1 win 1024
+.000 accept(3, ..., ...) = 4
+.100 write(4, ..., 12000) = 12000
+.000 shutdown(4, SHUT_WR) = 0
+.000 > . 1:10001(10000) ack 1
+.050 < . 1:1(0) ack 2001 win 257
+.000 > FP. 10001:12001(2000) ack 1
+.050 < . 1:1(0) ack 2001 win 257 <sack 10001:11001,nop,nop>
+.050 < . 1:1(0) ack 2001 win 257 <sack 10001:12002,nop,nop>
// SACK reneg
+.050 < . 1:1(0) ack 12001 win 257
+0 %{ print "unacked: ",tcpi_unacked }%
+5 %{ print "" }%
First, a typo inverted left/right of one OR operation, then
code forgot to advance end_seq if the merged skb carried FIN.
Bug was added in 2.6.29 by commit 832d11c5cd076ab
("tcp: Try to restore large SKBs while SACK processing")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Yuchung Cheng <ycheng@google.com>
Acked-by: Neal Cardwell <ncardwell@google.com>
Cc: Ilpo Järvinen <ilpo.jarvinen@helsinki.fi>
---
net/ipv4/tcp_input.c | 5 ++++-
1 file changed, 4 insertions(+), 1 deletion(-)
diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c
index 25a89ea..113dc5f 100644
--- a/net/ipv4/tcp_input.c
+++ b/net/ipv4/tcp_input.c
@@ -1284,7 +1284,10 @@ static bool tcp_shifted_skb(struct sock *sk, struct sk_buff *skb,
tp->lost_cnt_hint -= tcp_skb_pcount(prev);
}
- TCP_SKB_CB(skb)->tcp_flags |= TCP_SKB_CB(prev)->tcp_flags;
+ TCP_SKB_CB(prev)->tcp_flags |= TCP_SKB_CB(skb)->tcp_flags;
+ if (TCP_SKB_CB(skb)->tcp_flags & TCPHDR_FIN)
+ TCP_SKB_CB(prev)->end_seq++;
+
if (skb == tcp_highest_sack(sk))
tcp_advance_highest_sack(sk, skb);
^ permalink raw reply related
* Re: [PATCH net-next] dev: add support of flag IFF_NOPROC
From: David Miller @ 2013-10-04 17:29 UTC (permalink / raw)
To: nicolas.dichtel; +Cc: stephen, netdev
In-Reply-To: <524EAF64.8000801@6wind.com>
From: Nicolas Dichtel <nicolas.dichtel@6wind.com>
Date: Fri, 04 Oct 2013 14:07:00 +0200
> Of course optimizing /proc and /sysfs is a good option, but any
> optimizations
> will never be as fast as disabling them for some well known
> netdevices.
>
> Note also that the memory consumption is significantly less with this
> flag:
It potentially breaks tools, it's a non-starter, sorry.
^ permalink raw reply
* Re: [PATCH 0/3] net: mv643xx_eth: various small fixes for v3.12
From: Ezequiel Garcia @ 2013-10-04 16:53 UTC (permalink / raw)
To: Sebastian Hesselbarth
Cc: Jason Cooper, netdev, linux-kernel, Lennert Buytenhek,
David Miller, linux-arm-kernel
In-Reply-To: <1380711442-24735-1-git-send-email-sebastian.hesselbarth@gmail.com>
On Wed, Oct 02, 2013 at 12:57:19PM +0200, Sebastian Hesselbarth wrote:
> This patch set comprises some one-liners to fix issues with repeated
> loading and unloading of a modular mv643xx_eth driver.
>
> First two patches take care of the periodic port statistic timer, that
> updates statistics by reading port registers using add_timer/mod_timer.
>
> Patch 1 moves timer re-schedule from mib_counters_update to the timer
> callback. As mib_counters_update is also called from non-timer context,
> this ensures the timer is reactivated from timer context only.
>
> Patch 2 moves initial timer schedule from _probe() time to right before
> the port is actually started as the corresponding del_timer_sync is at
> _stop() time. This fixes a regression, where unloading the driver from a
> non-started eth device can cause the timer to access deallocated mem.
>
> Patch 3 adds an assignment of the ports device_node to the corresponding
> self-created platform_device. This is required to allow fixups based on
> the device_node's compatible string later. Actually, it is also a potential
> regression because we already check compatible string for Kirkwood, but
> does not (yet) rely on the fixup.
>
> All patches are based on v3.12-rc3 and have been tested on Kirkwood-based
> Seagate Dockstar.
>
> Patches 1 and 2 can also possibly queued up for -stable.
>
> Sebastian Hesselbarth (3):
> net: mv643xx_eth: update statistics timer from timer context only
> net: mv643xx_eth: fix orphaned statistics timer crash
> net: mv643xx_eth: fix missing device_node for port devices
>
> drivers/net/ethernet/marvell/mv643xx_eth.c | 7 +++----
> 1 file changed, 3 insertions(+), 4 deletions(-)
>
> ---
> Cc: David Miller <davem@davemloft.net>
> Cc: Lennert Buytenhek <buytenh@wantstofly.org>
> Cc: Jason Cooper <jason@lakedaemon.net>
> Cc: netdev@vger.kernel.org
> Cc: linux-arm-kernel@lists.infradead.org
> Cc: linux-kernel@vger.kernel.org
Sorry for missing this series, somehow my filters failed to catch it.
Anyway, I would expect these patches to have my Reported-by, but I'm
glad to see you've solved all the issues.
I see David has already applied them all, but I'll re-run my tests to
check there aren't any more issues if I can find some time next week.
Thanks for taking care of it!
--
Ezequiel García, Free Electrons
Embedded Linux, Kernel and Android Engineering
http://free-electrons.com
^ permalink raw reply
* Re: [PATCH v2.42 3/5] lib: Support pushing of MPLS LSE before or after VLAN tag
From: Ben Pfaff @ 2013-10-04 16:32 UTC (permalink / raw)
To: Simon Horman
Cc: dev, netdev, Jesse Gross, Pravin B Shelar, Ravi K, Isaku Yamahata,
Joe Stringer
In-Reply-To: <1380874200-8981-4-git-send-email-horms@verge.net.au>
On Fri, Oct 04, 2013 at 05:09:58PM +0900, Simon Horman wrote:
> From: Joe Stringer <joe@wand.net.nz>
>
> This patch modifies the push_mpls behaviour to allow
> pushing of an MPLS LSE either before any VLAN tag that may be present.
>
> Pushing the MPLS LSE before any VLAN tag that is present is the
> behaviour specified in OpenFlow 1.3.
>
> Pushing the MPLS LSE after the any VLAN tag that is present is the
> behaviour specified in OpenFlow 1.1 and 1.2. This is the only behaviour
> that was supported prior to this patch.
>
> When an push_mpls action has been inserted using OpenFlow 1.2 or earlier
> the behaviour of pushing the MPLS LSE before any VLAN tag that may be
> present is implemented by by inserting VLAN actions around the MPLS push
> action during odp translation; Pop VLAN tags before committing MPLS
> actions, and push the expected VLAN tag afterwards.
>
> The trigger condition for the two different behaviours is the value of the
> mpls_before_vlan field of struct ofpact_push_mpls. This field is set when
> parsing OpenFlow actions.
>
> Signed-off-by: Joe Stringer <joe@wand.net.nz>
> Signed-off-by: Simon Horman <horms@verge.net.au>
I'm happy with this, I think. It will need a trivial update if you take
my suggestion on patch 2.
^ permalink raw reply
* Re: [PATCH v2.42 2/5] ofp-actions: Add separate OpenFlow 1.3 action parser
From: Ben Pfaff @ 2013-10-04 16:31 UTC (permalink / raw)
To: Simon Horman
Cc: dev, netdev, Jesse Gross, Pravin B Shelar, Ravi K, Isaku Yamahata,
Joe Stringer
In-Reply-To: <1380874200-8981-3-git-send-email-horms@verge.net.au>
On Fri, Oct 04, 2013 at 05:09:57PM +0900, Simon Horman wrote:
> From: Joe Stringer <joe@wand.net.nz>
>
> This patch adds new ofpact_from_openflow13() and
> ofpacts_from_openflow13() functions parallel to the existing ofpact
> handling code. In the OpenFlow 1.3 version, push_mpls is handled
> differently, but all other actions are handled by the existing code.
>
> In the case of push_mpls for OpenFlow 1.3 the new mpls_before_vlan field of
> struct ofpact_push_mpls is set to true. This will be used by a subsequent
> patch to allow allow the correct VLAN+MPLS datapath behaviour to be
> determined at odp translation time.
>
> Signed-off-by: Joe Stringer <joe@wand.net.nz>
> Signed-off-by: Simon Horman <horms@verge.net.au>
The need for a huge comment on mpls_before_vlan suggests to me that
breaking it out as a separate type would be helpful. How about this:
/* The position in the packet at which to insert an MPLS header.
*
* Used NXAST_PUSH_MPLS, OFPAT11_PUSH_MPLS. */
enum ofpact_mpls_position {
/* Add the MPLS LSE after the Ethernet header but before any VLAN tags.
* OpenFlow 1.3+ requires this behavior. */
OFPACT_MPLS_BEFORE_VLAN,
/* Add the MPLS LSE after the Ethernet header and any VLAN tags.
* OpenFlow 1.1 and 1.2 require this behavior. */
OFPACT_MPLS_AFTER_VLAN
};
/* OFPACT_PUSH_VLAN/MPLS/PBB
*
* Used for NXAST_PUSH_MPLS, OFPAT11_PUSH_MPLS. */
struct ofpact_push_mpls {
struct ofpact ofpact;
ovs_be16 ethertype;
enum ofpact_mpls_position position;
};
Thanks,
Ben.
^ permalink raw reply
* USB ethernet (smsc95xx) transmit timeout errors
From: David Laight @ 2013-10-04 16:28 UTC (permalink / raw)
To: netdev
We are seeing some 'transmit timeout' errors from the smsc95xx driver.
I've found references to similar problems on the RPI, but no explicit
resolution.
Our systems are Intel i7 motherboards with the USB chip on a 'front panel'
board (together with three USB2 sockets) plugged into one on the
motherboard's USB3 sockets.
(The silicon has some kind of clever USB hub in it.)
Initial failures were with the 3.2 kernel from Ubuntu 12.04, but I've
updated one of the systems that failed to a 3.8 kernel and it still
fails.
Does this problem 'ring a bell' with anyone? and is it likely to have
been fixed in a more recent kernel?
Even with heavy traffic the systems might run for 24 hours before
failing - so proving a system is 'fixed' is difficult.
David
^ permalink raw reply
* [PATCH net-next] xen-netback: fix xenvif_count_skb_slots()
From: Paul Durrant @ 2013-10-04 16:26 UTC (permalink / raw)
To: xen-devel, netdev
Cc: Paul Durrant, Xi Xiong, Matt Wilson, Annie Li, Wei Liu,
Ian Campbell
Commit 4f0581d25827d5e864bcf07b05d73d0d12a20a5c introduced an error into
xenvif_count_skb_slots() for skbs with a linear area spanning a page
boundary. The alignment of skb->data needs to be taken into account, not
just the head length. This patch fixes the issue by dry-running the code
from xenvif_gop_skb() (and adjusting the comment above the function to note
that).
Signed-off-by: Paul Durrant <paul.durrant@citrix.com>
Cc: Xi Xiong <xixiong@amazon.com>
Cc: Matt Wilson <msw@amazon.com>
Cc: Annie Li <annie.li@oracle.com>
Cc: Wei Liu <wei.liu2@citrix.com>
Cc: Ian Campbell <Ian.Campbell@citrix.com>
---
drivers/net/xen-netback/netback.c | 17 +++++++++++++++--
1 file changed, 15 insertions(+), 2 deletions(-)
diff --git a/drivers/net/xen-netback/netback.c b/drivers/net/xen-netback/netback.c
index d0b0feb..6f680f4 100644
--- a/drivers/net/xen-netback/netback.c
+++ b/drivers/net/xen-netback/netback.c
@@ -223,15 +223,28 @@ static bool start_new_rx_buffer(int offset, unsigned long size, int head)
/*
* Figure out how many ring slots we're going to need to send @skb to
* the guest. This function is essentially a dry run of
- * xenvif_gop_frag_copy.
+ * xenvif_gop_skb.
*/
unsigned int xenvif_count_skb_slots(struct xenvif *vif, struct sk_buff *skb)
{
+ unsigned char *data;
unsigned int count;
int i, copy_off;
struct skb_cb_overlay *sco;
- count = DIV_ROUND_UP(skb_headlen(skb), PAGE_SIZE);
+ count = 0;
+
+ data = skb->data;
+ while (data < skb_tail_pointer(skb)) {
+ unsigned int offset = offset_in_page(data);
+ unsigned int len = PAGE_SIZE - offset;
+
+ if (data + len > skb_tail_pointer(skb))
+ len = skb_tail_pointer(skb) - data;
+
+ count++;
+ data += len;
+ }
copy_off = skb_headlen(skb) % PAGE_SIZE;
--
1.7.10.4
^ permalink raw reply related
* Re: [PATCH v2.42 1/5] odp: Allow VLAN actions after MPLS actions
From: Ben Pfaff @ 2013-10-04 16:21 UTC (permalink / raw)
To: Simon Horman
Cc: dev-yBygre7rU0TnMu66kgdUjQ, Ravi K, netdev-u79uwXL29TY76Z2rM5mHXA,
Isaku Yamahata
In-Reply-To: <1380874200-8981-2-git-send-email-horms-/R6kz+dDXgpPR4JQBCEnsQ@public.gmane.org>
On Fri, Oct 04, 2013 at 05:09:56PM +0900, Simon Horman wrote:
> From: Joe Stringer <joe-Q1GJJQv1iO6lP80pJB477g@public.gmane.org>
>
> OpenFlow 1.1 and 1.2, and 1.3 differ in their handling of MPLS actions in the
> presence of VLAN tags. To allow correct behaviour to be committed in
> each situation, this patch adds a second round of VLAN tag action
> handling to commit_odp_actions(), which occurs after MPLS actions. This
> is implemented with a new field in 'struct xlate_in' called 'vlan_tci'.
>
> When an push_mpls action is composed, the flow's current VLAN state is
> stored into xin->vlan_tci, and flow->vlan_tci is set to 0 (pop_vlan). If
> a VLAN tag is present, it is stripped; if not, then there is no change.
> Any later modifications to the VLAN state is written to xin->vlan_tci.
> When committing the actions, flow->vlan_tci is used before MPLS actions,
> and xin->vlan_tci is used afterwards. This retains the current datapath
> behaviour, but allows VLAN actions to be applied in a more flexible
> manner.
>
> Both before and after this patch MPLS LSEs are pushed onto a packet after
> any VLAN tags that may be present. This is the behaviour described in
> OpenFlow 1.1 and 1.2. OpenFlow 1.3 specifies that MPLS LSEs should be
> pushed onto a packet before any VLAN tags that are present. Support
> for this will be added by a subsequent patch that makes use of
> the infrastructure added by this patch.
>
> Signed-off-by: Joe Stringer <joe-Q1GJJQv1iO6lP80pJB477g@public.gmane.org>
> Signed-off-by: Simon Horman <horms-/R6kz+dDXgpPR4JQBCEnsQ@public.gmane.org>
I noticed a couple more minor points.
First, it seems to me that the "vlan_tci" member that this adds to
xlate_in could go in xlate_ctx just as well. I would prefer that,
because (as far as I can tell) there is no reason for the client to use
any value other than flow->vlan_tci here, and putting it in xlate_ctx
hides it from the client.
Thanks for rearranging the code and updating the comment in
do_xlate_actions(). It makes the operation clearer. But now that it's
clear I have an additional question. Does it really make sense to have
'vlan_tci' as only a local variable in do_xlate_actions()? Presumably,
MPLS and VLANs should interact the same way regardless of whether they
are separated by resubmits or goto_tables. That is, I suspect that this
is xlate_ctx state, not local state.
Thanks,
Ben.
^ permalink raw reply
* Re: [stable 3.0] add some CVE fixes
From: Greg KH @ 2013-10-04 16:03 UTC (permalink / raw)
To: Ben Hutchings; +Cc: Jiri Slaby, stable, netdev
In-Reply-To: <1380860287.1441.101.camel@deadeye.wl.decadent.org.uk>
On Fri, Oct 04, 2013 at 05:18:07AM +0100, Ben Hutchings wrote:
> On Thu, 2013-10-03 at 21:11 +0200, Jiri Slaby wrote:
> > On 10/03/2013 08:44 PM, Greg KH wrote:
> > > On Thu, Oct 03, 2013 at 11:20:28AM +0200, Jiri Slaby wrote:
> > >> Plus the backports that are replied to this mail?
> > >
> > > I don't see any backports, did you forget to send them?
> >
> > I don't think so, they were sent and this is a log of one of them:
> > OK. Log says:
> > Sendmail: /usr/sbin/sendmail -f jslaby@suse.cz -i stable@vger.kernel.org
> > jslaby@suse.cz
> > From: Jiri Slaby <jslaby@suse.cz>
> > To: <stable@vger.kernel.org>
> > Cc: jslaby@suse.cz
> > Subject: [PATCH 4/4] Tools: hv: verify origin of netlink connector message
> > Date: Thu, 3 Oct 2013 11:23:50 +0200
> > Message-Id: <1380792230-27255-4-git-send-email-jslaby@suse.cz>
> > X-Mailer: git-send-email 1.8.4
> > In-Reply-To: <1380792230-27255-1-git-send-email-jslaby@suse.cz>
> > References: <524D36DC.5070506@suse.cz>
> > <1380792230-27255-1-git-send-email-jslaby@suse.cz>
> >
> > Result: OK
> >
> > Could you check your spam folder? Or I can bounce them directly to you?
>
> They didn't hit the list either. If they're applicable to 3.2 as well
> then could you send them this way?
I did not find these anywhere in any spam filter, and as they didn't hit
the list either, I'd blame an email server on your end somewhere.
thanks,
greg k-h
^ permalink raw reply
* Re: [PATCH] ipv4: fix ineffective source address selection
From: Eric Dumazet @ 2013-10-04 15:49 UTC (permalink / raw)
To: Jiri Benc; +Cc: netdev
In-Reply-To: <245895a0777442b56ecea1453be041aa1b31c5a2.1380898983.git.jbenc@redhat.com>
On Fri, 2013-10-04 at 17:04 +0200, Jiri Benc wrote:
> When sending out multicast messages, the source address in inet->mc_addr is
> ignored and rewritten by an autoselected one. This is caused by a typo in
> commit 813b3b5db831 ("ipv4: Use caller's on-stack flowi as-is in output
> route lookups").
>
> Signed-off-by: Jiri Benc <jbenc@redhat.com>
> ---
Nice catch !
Acked-by: Eric Dumazet <edumazet@google.com>
^ permalink raw reply
* Re: [PATCH 2/2] net/ethernet: cpsw: DT read bool dual_emac
From: Peter Korsgaard @ 2013-10-04 15:24 UTC (permalink / raw)
To: Markus Pargmann
Cc: David S. Miller, Mugunthan V N, netdev, kernel, Florian Fainelli,
linux-arm-kernel
In-Reply-To: <1380890680-30941-2-git-send-email-mpa@pengutronix.de>
>>>>> "Markus" == Markus Pargmann <mpa@pengutronix.de> writes:
Markus> Signed-off-by: Markus Pargmann <mpa@pengutronix.de>
Acked-by: Peter Korsgaard <jacmet@sunsite.dk>
--
Bye, Peter Korsgaard
^ permalink raw reply
* Re: [PATCH 1/2] net: ethernet: cpsw: Search childs for slave nodes
From: Peter Korsgaard @ 2013-10-04 15:24 UTC (permalink / raw)
To: Markus Pargmann
Cc: David S. Miller, Mugunthan V N, netdev, kernel, Florian Fainelli,
linux-arm-kernel
In-Reply-To: <1380890680-30941-1-git-send-email-mpa@pengutronix.de>
>>>>> "Markus" == Markus Pargmann <mpa@pengutronix.de> writes:
Markus> The current implementation searches the whole DT for nodes named
Markus> "slave".
Markus> This patch changes it to search only child nodes for slaves.
Markus> Signed-off-by: Markus Pargmann <mpa@pengutronix.de>
Acked-by: Peter Korsgaard <jacmet@sunsite.dk>
--
Bye, Peter Korsgaard
^ permalink raw reply
* Re: [PATCH] veth: Showing peer of veth type dev in ip link (kernel side)
From: Nicolas Dichtel @ 2013-10-04 15:21 UTC (permalink / raw)
To: Masatake YAMATO, netdev
In-Reply-To: <1380859529-32351-1-git-send-email-yamato@redhat.com>
Le 04/10/2013 06:05, Masatake YAMATO a écrit :
> ip link has ability to show extra information of net work device if
> kernel provides sunh information. With this patch veth driver can
> provide its peer ifindex information to ip command via netlink
> interface.
But the ifindex can be interpreted only when you know the related netns and
veth peer is probably in another netns.
How to found the netdevice in userland in this case?
^ permalink raw reply
* [PATCH] ipv4: fix ineffective source address selection
From: Jiri Benc @ 2013-10-04 15:04 UTC (permalink / raw)
To: netdev
When sending out multicast messages, the source address in inet->mc_addr is
ignored and rewritten by an autoselected one. This is caused by a typo in
commit 813b3b5db831 ("ipv4: Use caller's on-stack flowi as-is in output
route lookups").
Signed-off-by: Jiri Benc <jbenc@redhat.com>
---
net/ipv4/route.c | 2 +-
1 files changed, 1 insertions(+), 1 deletions(-)
diff --git a/net/ipv4/route.c b/net/ipv4/route.c
index 727f436..6011615 100644
--- a/net/ipv4/route.c
+++ b/net/ipv4/route.c
@@ -2072,7 +2072,7 @@ struct rtable *__ip_route_output_key(struct net *net, struct flowi4 *fl4)
RT_SCOPE_LINK);
goto make_route;
}
- if (fl4->saddr) {
+ if (!fl4->saddr) {
if (ipv4_is_multicast(fl4->daddr))
fl4->saddr = inet_select_addr(dev_out, 0,
fl4->flowi4_scope);
--
1.7.6.5
^ permalink raw reply related
* RE: [PATCH] xen-netback: Don't destroy the netdev until the vif is shut
From: Paul Durrant @ 2013-10-04 14:23 UTC (permalink / raw)
To: David Miller; +Cc: netdev@vger.kernel.org, Wei Liu
In-Reply-To: <20131004.002248.789391156050342547.davem@davemloft.net>
> -----Original Message-----
> From: David Miller [mailto:davem@davemloft.net]
> Sent: 04 October 2013 05:23
> To: Paul Durrant
> Cc: netdev@vger.kernel.org; Wei Liu
> Subject: Re: [PATCH] xen-netback: Don't destroy the netdev until the vif is
> shut
>
>
> Can one of you do -stable backports of this change for v3.11 and any
> other active -stable branch where this fix is relevant?
>
Ok. Will do.
Paul
> I gave it a shot, and it got outside my realm of expertise.
>
> Thanks!
^ permalink raw reply
page: next (older) | prev (newer) | latest
- recent:[subjects (threaded)|topics (new)|topics (active)]
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox