* RE: [PATCH net-next 1/3] net: add IPv4 routing FIB support for swdev
From: Shrijeet Mukherjee @ 2015-01-07 2:08 UTC
To: Hannes Frederic Sowa, Scott Feldman
Cc: Netdev, Jiří Pírko, john fastabend, Thomas Graf,
Jamal Hadi Salim, Andy Gospodarek, Roopa Prabhu
In-Reply-To: <1420574353.15181.19.camel@stressinduktion.org>
>For the first idea, I'll try to make an example:
>
>Initial setup:
># ip rule ls
>0: from all lookup local
>32766: from all lookup main
>32767: from all lookup default
>
># ip rule add pref 100 iif swdev0 table 5
># ip rule ls
>0: from all lookup local
>100: from all iif swdev0 [detached] lookup 5
>> maybe we can show which rules are being able to get offloaded here
>32766: from all lookup main
>32767: from all lookup default
>
>table 5 should be the table we can insert routes into which are offloaded
>to hardware.
>
>During table modifications we linearly scan the rules to see if we find
>selectors which cannot be represented by hardware.
>
>In case we have an iif selector, we simply can use this table and just
>synthesize it into the particular interface.
>
>An ip-rule-from would need all the hardware being capable of matching
>source addresses, otherwise we cannot offload all routing tables with
>higher preference, same for a to/tos rule. If we encounter a fwmark rule,
>we certainly cannot represent it in hardware, so skip it (here we can
>think about entangling those with ACLs, but it feels hard to do).
>
>If rules are inserted or changed we must again validate the complete list
>of rules and decide if we need to flush all the routes and install a slow
>path via kernel.
>
>What do you think? Does that make sense? I could try to come up with an
>API for that. ;)
>
This sounds really good, but I suspect the real problem is the case where
the rule evaluation is in the hardware path, right? If it is purely
interface (iif) based there is no issue, but any other policy, such as
"on a miss in table 1, use table 2", will not work with this model.
Or did I miss something?
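
To make the discussion concrete, here is a rough userspace sketch of the
linear rule scan from the quoted proposal. Everything in it is hypothetical
(the SEL_* flags, struct rule, table_offloadable() and the hw_caps bitmask
are made up for illustration and are not kernel API), and it assumes the
strictest reading: the first rule with a selector the hardware cannot
represent pushes its own table and every later table onto the kernel slow
path.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical selector flags, for illustration only (not kernel API). */
enum sel_flags {
	SEL_IIF    = 1 << 0,	/* match on incoming interface */
	SEL_FROM   = 1 << 1,	/* match on source address */
	SEL_TO     = 1 << 2,	/* match on destination address */
	SEL_TOS    = 1 << 3,	/* match on TOS */
	SEL_FWMARK = 1 << 4,	/* match on fwmark */
};

/* Simplified stand-in for one "ip rule" entry. */
struct rule {
	unsigned int pref;	/* lower value is evaluated first */
	unsigned int selectors;	/* bitmask of SEL_*; 0 means "from all" */
	unsigned int table;	/* routing table the rule points at */
};

/*
 * Scan the rules in evaluation order (the array is assumed to be sorted
 * by pref). Return true if "table" is reached before any rule whose
 * selectors are outside hw_caps, i.e. the table could be offloaded.
 */
static bool table_offloadable(const struct rule *rules, size_t n,
			      unsigned int hw_caps, unsigned int table)
{
	size_t i;

	assert(rules || n == 0);

	for (i = 0; i < n; i++) {
		/* Unsupported selector: this and all later rules stay in software. */
		if (rules[i].selectors & ~hw_caps)
			return false;
		if (rules[i].table == table)
			return true;
	}
	return false;	/* no rule points at this table */
}
```

With rules for tables 255, 5 (iif), 7 (fwmark) and 254 in that order, and
hardware that only understands iif, tables 255 and 5 come out as
offloadable while 7 and 254 do not. The "miss in table 1, then use table 2"
fall-through is deliberately not modelled here.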
* [PATCH 6/6] openvswitch: Support VXLAN Group Policy extension
From: Thomas Graf @ 2015-01-07 2:05 UTC
To: davem, jesse, stephen, pshelar; +Cc: netdev, dev
In-Reply-To: <cover.1420594925.git.tgraf@suug.ch>
Introduces support for the group policy extension to the VXLAN virtual
port. The extension is disabled by default and only enabled if the user
has provided the respective configuration.

  ovs-vsctl add-port br0 vxlan0 -- \
     set Interface vxlan0 type=vxlan options:exts=gbp

The configuration interface to enable the extension is based on a new
attribute OVS_VXLAN_EXT_GBP nested inside OVS_TUNNEL_ATTR_EXTENSION
which can carry additional extensions as needed in the future.

The group policy metadata is handled in the same way as Geneve options
and transported as a binary blob in a new Netlink attribute
OVS_TUNNEL_KEY_ATTR_VXLAN_OPTS which is mutually exclusive with the
existing OVS_TUNNEL_KEY_ATTR_GENEVE_OPTS.
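
As an illustration of the mutual exclusion, the following standalone
sketch mirrors the opts_type bookkeeping that the flow_netlink.c hunk
adds to ipv4_tun_from_nlattr(). It is not the kernel code: the EX_ATTR_*
constants and ex_scan_opts_type() are hypothetical, and the attribute
list is reduced to a plain array of type numbers.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Hypothetical attribute types standing in for OVS_TUNNEL_KEY_ATTR_*. */
enum {
	EX_ATTR_ID = 1,
	EX_ATTR_GENEVE_OPTS,
	EX_ATTR_VXLAN_OPTS,
};

/*
 * Walk the attribute types and return the single metadata type that was
 * seen, 0 if there was none, or -EINVAL if more than one metadata block
 * (of either kind) was provided.
 */
static int ex_scan_opts_type(const int *types, size_t n)
{
	int opts_type = 0;
	size_t i;

	assert(types || n == 0);

	for (i = 0; i < n; i++) {
		switch (types[i]) {
		case EX_ATTR_GENEVE_OPTS:
		case EX_ATTR_VXLAN_OPTS:
			if (opts_type)
				return -EINVAL;	/* second metadata block */
			opts_type = types[i];
			break;
		default:
			break;	/* not a metadata attribute */
		}
	}
	return opts_type;
}
```

Returning the type instead of 0 lets the caller choose the matching
validation step, which is how validate_and_copy_set_tun() uses the value
in the patch.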
Signed-off-by: Thomas Graf <tgraf@suug.ch>
---
include/uapi/linux/openvswitch.h | 19 ++++++++++
net/openvswitch/flow_netlink.c | 78 +++++++++++++++++++++++++--------------
net/openvswitch/vport-vxlan.c | 80 +++++++++++++++++++++++++++++++++++++++-
3 files changed, 148 insertions(+), 29 deletions(-)
diff --git a/include/uapi/linux/openvswitch.h b/include/uapi/linux/openvswitch.h
index 3a6dcaa..676a89e 100644
--- a/include/uapi/linux/openvswitch.h
+++ b/include/uapi/linux/openvswitch.h
@@ -248,11 +248,29 @@ enum ovs_vport_attr {
#define OVS_VPORT_ATTR_MAX (__OVS_VPORT_ATTR_MAX - 1)
+/**
+ * struct ovs_vxlan_opts - VXLAN tunnel options
+ * @gbp: Group policy bits
+ */
+struct ovs_vxlan_opts {
+ __u32 gbp;
+};
+
+enum {
+ OVS_VXLAN_EXT_UNSPEC,
+ OVS_VXLAN_EXT_GBP,
+ __OVS_VXLAN_EXT_MAX,
+};
+
+#define OVS_VXLAN_EXT_MAX (__OVS_VXLAN_EXT_MAX - 1)
+
+
/* OVS_VPORT_ATTR_OPTIONS attributes for tunnels.
*/
enum {
OVS_TUNNEL_ATTR_UNSPEC,
OVS_TUNNEL_ATTR_DST_PORT, /* 16-bit UDP port, used by L4 tunnels. */
+ OVS_TUNNEL_ATTR_EXTENSION,
__OVS_TUNNEL_ATTR_MAX
};
@@ -324,6 +342,7 @@ enum ovs_tunnel_key_attr {
OVS_TUNNEL_KEY_ATTR_GENEVE_OPTS, /* Array of Geneve options. */
OVS_TUNNEL_KEY_ATTR_TP_SRC, /* be16 src Transport Port. */
OVS_TUNNEL_KEY_ATTR_TP_DST, /* be16 dst Transport Port. */
+ OVS_TUNNEL_KEY_ATTR_VXLAN_OPTS, /* struct ovs_vxlan_opts. */
__OVS_TUNNEL_KEY_ATTR_MAX
};
diff --git a/net/openvswitch/flow_netlink.c b/net/openvswitch/flow_netlink.c
index c60ae3f..1528709 100644
--- a/net/openvswitch/flow_netlink.c
+++ b/net/openvswitch/flow_netlink.c
@@ -446,6 +446,7 @@ static int ipv4_tun_from_nlattr(const struct nlattr *attr,
int rem;
bool ttl = false;
__be16 tun_flags = 0;
+ int opts_type = 0;
nla_for_each_nested(a, attr, rem) {
int type = nla_type(a);
@@ -463,6 +464,7 @@ static int ipv4_tun_from_nlattr(const struct nlattr *attr,
[OVS_TUNNEL_KEY_ATTR_TP_DST] = sizeof(u16),
[OVS_TUNNEL_KEY_ATTR_OAM] = 0,
[OVS_TUNNEL_KEY_ATTR_GENEVE_OPTS] = -1,
+ [OVS_TUNNEL_KEY_ATTR_VXLAN_OPTS] = -1,
};
if (type > OVS_TUNNEL_KEY_ATTR_MAX) {
@@ -519,11 +521,18 @@ static int ipv4_tun_from_nlattr(const struct nlattr *attr,
tun_flags |= TUNNEL_OAM;
break;
case OVS_TUNNEL_KEY_ATTR_GENEVE_OPTS:
+ case OVS_TUNNEL_KEY_ATTR_VXLAN_OPTS:
+ if (opts_type) {
+ OVS_NLERR(log, "Multiple metadata blocks provided");
+ return -EINVAL;
+ }
+
err = tun_md_opt_from_nlattr(a, match, is_mask, log);
if (err)
return err;
tun_flags |= TUNNEL_OPTIONS_PRESENT;
+ opts_type = type;
break;
default:
OVS_NLERR(log, "Unknown IPv4 tunnel attribute %d",
@@ -552,7 +561,7 @@ static int ipv4_tun_from_nlattr(const struct nlattr *attr,
}
}
- return 0;
+ return opts_type;
}
static int __ipv4_tun_to_nlattr(struct sk_buff *skb,
@@ -1537,6 +1546,34 @@ void ovs_match_init(struct sw_flow_match *match,
}
}
+static int validate_and_copy_geneve_opts(struct sw_flow_key *key)
+{
+ struct geneve_opt *option;
+ int opts_len = key->tun_opts_len;
+ bool crit_opt = false;
+
+ option = (struct geneve_opt *) TUN_METADATA_OPTS(key, key->tun_opts_len);
+ while (opts_len > 0) {
+ int len;
+
+ if (opts_len < sizeof(*option))
+ return -EINVAL;
+
+ len = sizeof(*option) + option->length * 4;
+ if (len > opts_len)
+ return -EINVAL;
+
+ crit_opt |= !!(option->type & GENEVE_CRIT_OPT_TYPE);
+
+ option = (struct geneve_opt *)((u8 *)option + len);
+ opts_len -= len;
+ };
+
+ key->tun_key.tun_flags |= crit_opt ? TUNNEL_CRIT_OPT : 0;
+
+ return 0;
+}
+
static int validate_and_copy_set_tun(const struct nlattr *attr,
struct sw_flow_actions **sfa, bool log)
{
@@ -1544,36 +1581,23 @@ static int validate_and_copy_set_tun(const struct nlattr *attr,
struct sw_flow_key key;
struct ovs_tunnel_info *tun_info;
struct nlattr *a;
- int err, start;
+ int err, start, opts_type;
ovs_match_init(&match, &key, NULL);
- err = ipv4_tun_from_nlattr(nla_data(attr), &match, false, log);
- if (err)
- return err;
+ opts_type = ipv4_tun_from_nlattr(nla_data(attr), &match, false, log);
+ if (opts_type < 0)
+ return opts_type;
if (key.tun_opts_len) {
- struct geneve_opt *option;
- int opts_len = key.tun_opts_len;
- bool crit_opt = false;
-
- option = (struct geneve_opt *) TUN_METADATA_OPTS(&key, key.tun_opts_len);
- while (opts_len > 0) {
- int len;
-
- if (opts_len < sizeof(*option))
- return -EINVAL;
-
- len = sizeof(*option) + option->length * 4;
- if (len > opts_len)
- return -EINVAL;
-
- crit_opt |= !!(option->type & GENEVE_CRIT_OPT_TYPE);
-
- option = (struct geneve_opt *)((u8 *)option + len);
- opts_len -= len;
- };
-
- key.tun_key.tun_flags |= crit_opt ? TUNNEL_CRIT_OPT : 0;
+ switch (opts_type) {
+ case OVS_TUNNEL_KEY_ATTR_GENEVE_OPTS:
+ err = validate_and_copy_geneve_opts(&key);
+ if (err < 0)
+ return err;
+ break;
+ case OVS_TUNNEL_KEY_ATTR_VXLAN_OPTS:
+ break;
+ }
};
start = add_nested_action_start(sfa, OVS_ACTION_ATTR_SET, log);
diff --git a/net/openvswitch/vport-vxlan.c b/net/openvswitch/vport-vxlan.c
index 266c595..8ed7163 100644
--- a/net/openvswitch/vport-vxlan.c
+++ b/net/openvswitch/vport-vxlan.c
@@ -49,6 +49,7 @@
struct vxlan_port {
struct vxlan_sock *vs;
char name[IFNAMSIZ];
+ u32 exts; /* VXLAN_EXT_* in <net/vxlan.h> */
};
static struct vport_ops ovs_vxlan_vport_ops;
@@ -63,16 +64,26 @@ static void vxlan_rcv(struct vxlan_sock *vs, struct sk_buff *skb,
struct vxlan_metadata *md)
{
struct ovs_tunnel_info tun_info;
+ struct vxlan_port *vxlan_port;
struct vport *vport = vs->data;
struct iphdr *iph;
+ struct ovs_vxlan_opts opts = {
+ .gbp = md->gbp,
+ };
__be64 key;
+ __be16 flags;
+
+ flags = TUNNEL_KEY;
+ vxlan_port = vxlan_vport(vport);
+ if (vxlan_port->exts & VXLAN_EXT_GBP)
+ flags |= TUNNEL_OPTIONS_PRESENT;
/* Save outer tunnel values */
iph = ip_hdr(skb);
key = cpu_to_be64(ntohl(md->vni) >> 8);
ovs_flow_tun_info_init(&tun_info, iph,
udp_hdr(skb)->source, udp_hdr(skb)->dest,
- key, TUNNEL_KEY, NULL, 0);
+ key, flags, &opts, sizeof(opts));
ovs_vport_receive(vport, skb, &tun_info);
}
@@ -84,6 +95,21 @@ static int vxlan_get_options(const struct vport *vport, struct sk_buff *skb)
if (nla_put_u16(skb, OVS_TUNNEL_ATTR_DST_PORT, ntohs(dst_port)))
return -EMSGSIZE;
+
+ if (vxlan_port->exts) {
+ struct nlattr *exts;
+
+ exts = nla_nest_start(skb, OVS_TUNNEL_ATTR_EXTENSION);
+ if (!exts)
+ return -EMSGSIZE;
+
+ if (vxlan_port->exts & VXLAN_EXT_GBP &&
+ nla_put_flag(skb, OVS_VXLAN_EXT_GBP))
+ return -EMSGSIZE;
+
+ nla_nest_end(skb, exts);
+ }
+
return 0;
}
@@ -96,6 +122,31 @@ static void vxlan_tnl_destroy(struct vport *vport)
ovs_vport_deferred_free(vport);
}
+static const struct nla_policy exts_policy[OVS_VXLAN_EXT_MAX+1] = {
+ [OVS_VXLAN_EXT_GBP] = { .type = NLA_FLAG, },
+};
+
+static int vxlan_configure_exts(struct vport *vport, struct nlattr *attr)
+{
+ struct nlattr *exts[OVS_VXLAN_EXT_MAX+1];
+ struct vxlan_port *vxlan_port;
+ int err;
+
+ if (nla_len(attr) < sizeof(struct nlattr))
+ return -EINVAL;
+
+ err = nla_parse_nested(exts, OVS_VXLAN_EXT_MAX, attr, exts_policy);
+ if (err < 0)
+ return err;
+
+ vxlan_port = vxlan_vport(vport);
+
+ if (exts[OVS_VXLAN_EXT_GBP])
+ vxlan_port->exts |= VXLAN_EXT_GBP;
+
+ return 0;
+}
+
static struct vport *vxlan_tnl_create(const struct vport_parms *parms)
{
struct net *net = ovs_dp_get_net(parms->dp);
@@ -128,7 +179,17 @@ static struct vport *vxlan_tnl_create(const struct vport_parms *parms)
vxlan_port = vxlan_vport(vport);
strncpy(vxlan_port->name, parms->name, IFNAMSIZ);
- vs = vxlan_sock_add(net, htons(dst_port), vxlan_rcv, vport, true, 0, 0);
+ a = nla_find_nested(options, OVS_TUNNEL_ATTR_EXTENSION);
+ if (a) {
+ err = vxlan_configure_exts(vport, a);
+ if (err) {
+ ovs_vport_free(vport);
+ goto error;
+ }
+ }
+
+ vs = vxlan_sock_add(net, htons(dst_port), vxlan_rcv, vport, true, 0,
+ vxlan_port->exts);
if (IS_ERR(vs)) {
ovs_vport_free(vport);
return (void *)vs;
@@ -141,6 +202,20 @@ error:
return ERR_PTR(err);
}
+static int vxlan_ext_gbp(struct sk_buff *skb)
+{
+ const struct ovs_tunnel_info *tun_info;
+ const struct ovs_vxlan_opts *opts;
+
+ tun_info = OVS_CB(skb)->egress_tun_info;
+ opts = tun_info->options;
+
+ if (tun_info->options_len >= sizeof(*opts))
+ return opts->gbp;
+ else
+ return 0;
+}
+
static int vxlan_tnl_send(struct vport *vport, struct sk_buff *skb)
{
struct net *net = ovs_dp_get_net(vport->dp);
@@ -181,6 +256,7 @@ static int vxlan_tnl_send(struct vport *vport, struct sk_buff *skb)
src_port = udp_flow_src_port(net, skb, 0, 0, true);
md.vni = htonl(be64_to_cpu(tun_key->tun_id) << 8);
+ md.gbp = vxlan_ext_gbp(skb);
err = vxlan_xmit_skb(vxlan_port->vs, rt, skb,
fl.saddr, tun_key->ipv4_dst,
--
1.9.3
* [PATCH 5/6] openvswitch: Rename GENEVE_TUN_OPTS() to TUN_METADATA_OPTS()
From: Thomas Graf @ 2015-01-07 2:05 UTC
To: davem, jesse, stephen, pshelar; +Cc: netdev, dev
In-Reply-To: <cover.1420594925.git.tgraf@suug.ch>
A subsequent patch will introduce VXLAN options. Rename the existing
GENEVE_TUN_OPTS() to reflect its extended purpose of carrying generic
tunnel metadata options.
Signed-off-by: Thomas Graf <tgraf@suug.ch>
---
net/openvswitch/flow.c | 2 +-
net/openvswitch/flow.h | 14 +++++++-------
net/openvswitch/flow_netlink.c | 37 +++++++++++++++++--------------------
3 files changed, 25 insertions(+), 28 deletions(-)
diff --git a/net/openvswitch/flow.c b/net/openvswitch/flow.c
index 70bef2a..bfc74ac 100644
--- a/net/openvswitch/flow.c
+++ b/net/openvswitch/flow.c
@@ -690,7 +690,7 @@ int ovs_flow_key_extract(const struct ovs_tunnel_info *tun_info,
BUILD_BUG_ON((1 << (sizeof(tun_info->options_len) *
8)) - 1
> sizeof(key->tun_opts));
- memcpy(GENEVE_OPTS(key, tun_info->options_len),
+ memcpy(TUN_METADATA_OPTS(key, tun_info->options_len),
tun_info->options, tun_info->options_len);
key->tun_opts_len = tun_info->options_len;
} else {
diff --git a/net/openvswitch/flow.h b/net/openvswitch/flow.h
index a8b30f3..d3d0a40 100644
--- a/net/openvswitch/flow.h
+++ b/net/openvswitch/flow.h
@@ -53,7 +53,7 @@ struct ovs_key_ipv4_tunnel {
struct ovs_tunnel_info {
struct ovs_key_ipv4_tunnel tunnel;
- const struct geneve_opt *options;
+ const void *options;
u8 options_len;
};
@@ -61,10 +61,10 @@ struct ovs_tunnel_info {
* maximum size. This allows us to get the benefits of variable length
* matching for small options.
*/
-#define GENEVE_OPTS(flow_key, opt_len) \
- ((struct geneve_opt *)((flow_key)->tun_opts + \
- FIELD_SIZEOF(struct sw_flow_key, tun_opts) - \
- opt_len))
+#define TUN_METADATA_OFFSET(opt_len) \
+ (FIELD_SIZEOF(struct sw_flow_key, tun_opts) - opt_len)
+#define TUN_METADATA_OPTS(flow_key, opt_len) \
+ ((void *)((flow_key)->tun_opts + TUN_METADATA_OFFSET(opt_len)))
static inline void __ovs_flow_tun_info_init(struct ovs_tunnel_info *tun_info,
__be32 saddr, __be32 daddr,
@@ -73,7 +73,7 @@ static inline void __ovs_flow_tun_info_init(struct ovs_tunnel_info *tun_info,
__be16 tp_dst,
__be64 tun_id,
__be16 tun_flags,
- const struct geneve_opt *opts,
+ const void *opts,
u8 opts_len)
{
tun_info->tunnel.tun_id = tun_id;
@@ -105,7 +105,7 @@ static inline void ovs_flow_tun_info_init(struct ovs_tunnel_info *tun_info,
__be16 tp_dst,
__be64 tun_id,
__be16 tun_flags,
- const struct geneve_opt *opts,
+ const void *opts,
u8 opts_len)
{
__ovs_flow_tun_info_init(tun_info, iph->saddr, iph->daddr,
diff --git a/net/openvswitch/flow_netlink.c b/net/openvswitch/flow_netlink.c
index d1eecf7..c60ae3f 100644
--- a/net/openvswitch/flow_netlink.c
+++ b/net/openvswitch/flow_netlink.c
@@ -387,20 +387,20 @@ static int parse_flow_nlattrs(const struct nlattr *attr,
return __parse_flow_nlattrs(attr, a, attrsp, log, false);
}
-static int genev_tun_opt_from_nlattr(const struct nlattr *a,
- struct sw_flow_match *match, bool is_mask,
- bool log)
+static int tun_md_opt_from_nlattr(const struct nlattr *a,
+ struct sw_flow_match *match, bool is_mask,
+ bool log)
{
unsigned long opt_key_offset;
if (nla_len(a) > sizeof(match->key->tun_opts)) {
- OVS_NLERR(log, "Geneve option length err (len %d, max %zu).",
+ OVS_NLERR(log, "Tunnel metadata option length err (len %d, max %zu).",
nla_len(a), sizeof(match->key->tun_opts));
return -EINVAL;
}
if (nla_len(a) % 4 != 0) {
- OVS_NLERR(log, "Geneve opt len %d is not a multiple of 4.",
+ OVS_NLERR(log, "Tunnel metadata opt len %d is not a multiple of 4.",
nla_len(a));
return -EINVAL;
}
@@ -424,7 +424,7 @@ static int genev_tun_opt_from_nlattr(const struct nlattr *a,
* information later.
*/
if (match->key->tun_opts_len != nla_len(a)) {
- OVS_NLERR(log, "Geneve option len %d != mask len %d",
+ OVS_NLERR(log, "Tunnel metadata option len %d != mask len %d",
match->key->tun_opts_len, nla_len(a));
return -EINVAL;
}
@@ -432,8 +432,7 @@ static int genev_tun_opt_from_nlattr(const struct nlattr *a,
SW_FLOW_KEY_PUT(match, tun_opts_len, 0xff, true);
}
- opt_key_offset = (unsigned long)GENEVE_OPTS((struct sw_flow_key *)0,
- nla_len(a));
+ opt_key_offset = TUN_METADATA_OFFSET(nla_len(a));
SW_FLOW_KEY_MEMCPY_OFFSET(match, opt_key_offset, nla_data(a),
nla_len(a), is_mask);
return 0;
@@ -520,7 +519,7 @@ static int ipv4_tun_from_nlattr(const struct nlattr *attr,
tun_flags |= TUNNEL_OAM;
break;
case OVS_TUNNEL_KEY_ATTR_GENEVE_OPTS:
- err = genev_tun_opt_from_nlattr(a, match, is_mask, log);
+ err = tun_md_opt_from_nlattr(a, match, is_mask, log);
if (err)
return err;
@@ -558,8 +557,7 @@ static int ipv4_tun_from_nlattr(const struct nlattr *attr,
static int __ipv4_tun_to_nlattr(struct sk_buff *skb,
const struct ovs_key_ipv4_tunnel *output,
- const struct geneve_opt *tun_opts,
- int swkey_tun_opts_len)
+ const void *tun_opts, int swkey_tun_opts_len)
{
if (output->tun_flags & TUNNEL_KEY &&
nla_put_be64(skb, OVS_TUNNEL_KEY_ATTR_ID, output->tun_id))
@@ -600,8 +598,7 @@ static int __ipv4_tun_to_nlattr(struct sk_buff *skb,
static int ipv4_tun_to_nlattr(struct sk_buff *skb,
const struct ovs_key_ipv4_tunnel *output,
- const struct geneve_opt *tun_opts,
- int swkey_tun_opts_len)
+ const void *tun_opts, int swkey_tun_opts_len)
{
struct nlattr *nla;
int err;
@@ -1148,10 +1145,10 @@ int ovs_nla_put_flow(const struct sw_flow_key *swkey,
goto nla_put_failure;
if ((swkey->tun_key.ipv4_dst || is_mask)) {
- const struct geneve_opt *opts = NULL;
+ const void *opts = NULL;
if (output->tun_key.tun_flags & TUNNEL_OPTIONS_PRESENT)
- opts = GENEVE_OPTS(output, swkey->tun_opts_len);
+ opts = TUN_METADATA_OPTS(output, swkey->tun_opts_len);
if (ipv4_tun_to_nlattr(skb, &output->tun_key, opts,
swkey->tun_opts_len))
@@ -1555,11 +1552,11 @@ static int validate_and_copy_set_tun(const struct nlattr *attr,
return err;
if (key.tun_opts_len) {
- struct geneve_opt *option = GENEVE_OPTS(&key,
- key.tun_opts_len);
+ struct geneve_opt *option;
int opts_len = key.tun_opts_len;
bool crit_opt = false;
+ option = (struct geneve_opt *) TUN_METADATA_OPTS(&key, key.tun_opts_len);
while (opts_len > 0) {
int len;
@@ -1597,9 +1594,9 @@ static int validate_and_copy_set_tun(const struct nlattr *attr,
* everything else will go away after flow setup. We can append
* it to tun_info and then point there.
*/
- memcpy((tun_info + 1), GENEVE_OPTS(&key, key.tun_opts_len),
- key.tun_opts_len);
- tun_info->options = (struct geneve_opt *)(tun_info + 1);
+ memcpy((tun_info + 1), TUN_METADATA_OPTS(&key,
+ key.tun_opts_len), key.tun_opts_len);
+ tun_info->options = (tun_info + 1);
} else {
tun_info->options = NULL;
}
--
1.9.3
* [PATCH 4/6] vxlan: Fail build if VXLAN header is misdefined
From: Thomas Graf @ 2015-01-07 2:05 UTC
To: davem, jesse, stephen, pshelar; +Cc: netdev, dev
In-Reply-To: <cover.1420594925.git.tgraf@suug.ch>
Due to the complexity of struct vxlanhdr, protect against unintended
changes by failing the build if the size of the struct changes.
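
For readers outside the kernel tree, a userspace analogue of the
BUILD_BUG_ON() check can be written with C11 static_assert. The struct
below is a simplified, hypothetical stand-in (two 32-bit words) and not
the real struct vxlanhdr, which uses bitfields and a union.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical simplified header: flags word followed by the VNI word. */
struct ex_vxlanhdr {
	uint32_t vx_flags;
	uint32_t vx_vni;
};

/* Compilation fails here if the layout ever stops being 8 bytes. */
static_assert(sizeof(struct ex_vxlanhdr) == 8,
	      "VXLAN header must be exactly 8 bytes");
```

The check costs nothing at run time, so it can sit next to the struct
definition or, as in the patch, in the module init function.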
Signed-off-by: Thomas Graf <tgraf@suug.ch>
---
drivers/net/vxlan.c | 2 ++
1 file changed, 2 insertions(+)
diff --git a/drivers/net/vxlan.c b/drivers/net/vxlan.c
index 2b75c62..293d524 100644
--- a/drivers/net/vxlan.c
+++ b/drivers/net/vxlan.c
@@ -2842,6 +2842,8 @@ static int __init vxlan_init_module(void)
{
int rc;
+ BUILD_BUG_ON(sizeof(struct vxlanhdr) != 8);
+
vxlan_wq = alloc_workqueue("vxlan", 0, 0);
if (!vxlan_wq)
return -ENOMEM;
--
1.9.3
* [PATCH 3/6] vxlan: Only bind to sockets with correct extensions enabled
From: Thomas Graf @ 2015-01-07 2:05 UTC
To: davem, jesse, stephen, pshelar; +Cc: netdev, dev
In-Reply-To: <cover.1420594925.git.tgraf@suug.ch>
A VXLAN net_device looking for an appropriate socket may only
consider a socket which has the exact set of extensions enabled.
If none can be found, a new socket must be created.
The OVS VXLAN port is kept unaware of extensions at this point.
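
The matching rule can be sketched in plain C as follows. This is only an
illustration of the "exact set of extensions" comparison: struct ex_sock
and ex_find_sock() are hypothetical, and a flat array replaces the
per-namespace hash list and RCU iteration used by vxlan_find_sock().

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical reduced socket record. */
struct ex_sock {
	int family;	/* address family, e.g. 2 for AF_INET on Linux */
	uint16_t port;	/* UDP port */
	uint32_t exts;	/* bitmask of enabled extensions */
};

/*
 * Return the first socket whose family, port and extension set all match
 * exactly, or NULL if the caller would have to create a new socket.
 */
static const struct ex_sock *ex_find_sock(const struct ex_sock *socks,
					  size_t n, int family,
					  uint16_t port, uint32_t exts)
{
	size_t i;

	assert(socks || n == 0);

	for (i = 0; i < n; i++) {
		if (socks[i].port == port &&
		    socks[i].family == family &&
		    socks[i].exts == exts)
			return &socks[i];
	}
	return NULL;
}
```

Note that the comparison is equality, not a subset test: a socket with
more extensions enabled than requested is not reused either.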
Signed-off-by: Thomas Graf <tgraf@suug.ch>
---
drivers/net/vxlan.c | 35 +++++++++++++++++++++--------------
include/net/vxlan.h | 2 +-
net/openvswitch/vport-vxlan.c | 2 +-
3 files changed, 23 insertions(+), 16 deletions(-)
diff --git a/drivers/net/vxlan.c b/drivers/net/vxlan.c
index 30b7b59..2b75c62 100644
--- a/drivers/net/vxlan.c
+++ b/drivers/net/vxlan.c
@@ -271,14 +271,15 @@ static inline struct vxlan_rdst *first_remote_rtnl(struct vxlan_fdb *fdb)
}
/* Find VXLAN socket based on network namespace, address family and UDP port */
-static struct vxlan_sock *vxlan_find_sock(struct net *net,
- sa_family_t family, __be16 port)
+static struct vxlan_sock *vxlan_find_sock(struct net *net, sa_family_t family,
+ __be16 port, u32 exts)
{
struct vxlan_sock *vs;
hlist_for_each_entry_rcu(vs, vs_head(net, port), hlist) {
if (inet_sk(vs->sock->sk)->inet_sport == port &&
- inet_sk(vs->sock->sk)->sk.sk_family == family)
+ inet_sk(vs->sock->sk)->sk.sk_family == family &&
+ vs->exts == exts)
return vs;
}
return NULL;
@@ -298,11 +299,12 @@ static struct vxlan_dev *vxlan_vs_find_vni(struct vxlan_sock *vs, u32 id)
/* Look up VNI in a per net namespace table */
static struct vxlan_dev *vxlan_find_vni(struct net *net, u32 id,
- sa_family_t family, __be16 port)
+ sa_family_t family, __be16 port,
+ u32 exts)
{
struct vxlan_sock *vs;
- vs = vxlan_find_sock(net, family, port);
+ vs = vxlan_find_sock(net, family, port, exts);
if (!vs)
return NULL;
@@ -1770,7 +1772,8 @@ static void vxlan_xmit_one(struct sk_buff *skb, struct net_device *dev,
ip_rt_put(rt);
dst_vxlan = vxlan_find_vni(vxlan->net, vni,
- dst->sa.sa_family, dst_port);
+ dst->sa.sa_family, dst_port,
+ vxlan->exts);
if (!dst_vxlan)
goto tx_error;
vxlan_encap_bypass(skb, vxlan, dst_vxlan);
@@ -1829,7 +1832,8 @@ static void vxlan_xmit_one(struct sk_buff *skb, struct net_device *dev,
dst_release(ndst);
dst_vxlan = vxlan_find_vni(vxlan->net, vni,
- dst->sa.sa_family, dst_port);
+ dst->sa.sa_family, dst_port,
+ vxlan->exts);
if (!dst_vxlan)
goto tx_error;
vxlan_encap_bypass(skb, vxlan, dst_vxlan);
@@ -1999,7 +2003,7 @@ static int vxlan_init(struct net_device *dev)
spin_lock(&vn->sock_lock);
vs = vxlan_find_sock(vxlan->net, ipv6 ? AF_INET6 : AF_INET,
- vxlan->dst_port);
+ vxlan->dst_port, vxlan->exts);
if (vs && atomic_add_unless(&vs->refcnt, 1, 0)) {
/* If we have a socket with same port already, reuse it */
vxlan_vs_add_dev(vs, vxlan);
@@ -2353,7 +2357,7 @@ static struct socket *vxlan_create_sock(struct net *net, bool ipv6,
/* Create new listen socket if needed */
static struct vxlan_sock *vxlan_socket_create(struct net *net, __be16 port,
vxlan_rcv_t *rcv, void *data,
- u32 flags)
+ u32 flags, u32 exts)
{
struct vxlan_net *vn = net_generic(net, vxlan_net_id);
struct vxlan_sock *vs;
@@ -2381,6 +2385,7 @@ static struct vxlan_sock *vxlan_socket_create(struct net *net, __be16 port,
atomic_set(&vs->refcnt, 1);
vs->rcv = rcv;
vs->data = data;
+ vs->exts = exts;
/* Initialize the vxlan udp offloads structure */
vs->udp_offloads.port = port;
@@ -2405,13 +2410,14 @@ static struct vxlan_sock *vxlan_socket_create(struct net *net, __be16 port,
struct vxlan_sock *vxlan_sock_add(struct net *net, __be16 port,
vxlan_rcv_t *rcv, void *data,
- bool no_share, u32 flags)
+ bool no_share, u32 flags,
+ u32 exts)
{
struct vxlan_net *vn = net_generic(net, vxlan_net_id);
struct vxlan_sock *vs;
bool ipv6 = flags & VXLAN_F_IPV6;
- vs = vxlan_socket_create(net, port, rcv, data, flags);
+ vs = vxlan_socket_create(net, port, rcv, data, flags, exts);
if (!IS_ERR(vs))
return vs;
@@ -2419,7 +2425,7 @@ struct vxlan_sock *vxlan_sock_add(struct net *net, __be16 port,
return vs;
spin_lock(&vn->sock_lock);
- vs = vxlan_find_sock(net, ipv6 ? AF_INET6 : AF_INET, port);
+ vs = vxlan_find_sock(net, ipv6 ? AF_INET6 : AF_INET, port, exts);
if (vs && ((vs->rcv != rcv) ||
!atomic_add_unless(&vs->refcnt, 1, 0)))
vs = ERR_PTR(-EBUSY);
@@ -2441,7 +2447,8 @@ static void vxlan_sock_work(struct work_struct *work)
__be16 port = vxlan->dst_port;
struct vxlan_sock *nvs;
- nvs = vxlan_sock_add(net, port, vxlan_rcv, NULL, false, vxlan->flags);
+ nvs = vxlan_sock_add(net, port, vxlan_rcv, NULL, false, vxlan->flags,
+ vxlan->exts);
spin_lock(&vn->sock_lock);
if (!IS_ERR(nvs))
vxlan_vs_add_dev(nvs, vxlan);
@@ -2591,7 +2598,7 @@ static int vxlan_newlink(struct net *net, struct net_device *dev,
configure_vxlan_exts(vxlan, data[IFLA_VXLAN_EXTENSION]);
if (vxlan_find_vni(net, vni, use_ipv6 ? AF_INET6 : AF_INET,
- vxlan->dst_port)) {
+ vxlan->dst_port, vxlan->exts)) {
pr_info("duplicate VNI %u\n", vni);
return -EEXIST;
}
diff --git a/include/net/vxlan.h b/include/net/vxlan.h
index 66000d0..da257a7 100644
--- a/include/net/vxlan.h
+++ b/include/net/vxlan.h
@@ -136,7 +136,7 @@ struct vxlan_sock {
struct vxlan_sock *vxlan_sock_add(struct net *net, __be16 port,
vxlan_rcv_t *rcv, void *data,
- bool no_share, u32 flags);
+ bool no_share, u32 flags, u32 exts);
void vxlan_sock_release(struct vxlan_sock *vs);
diff --git a/net/openvswitch/vport-vxlan.c b/net/openvswitch/vport-vxlan.c
index dd68c97..266c595 100644
--- a/net/openvswitch/vport-vxlan.c
+++ b/net/openvswitch/vport-vxlan.c
@@ -128,7 +128,7 @@ static struct vport *vxlan_tnl_create(const struct vport_parms *parms)
vxlan_port = vxlan_vport(vport);
strncpy(vxlan_port->name, parms->name, IFNAMSIZ);
- vs = vxlan_sock_add(net, htons(dst_port), vxlan_rcv, vport, true, 0);
+ vs = vxlan_sock_add(net, htons(dst_port), vxlan_rcv, vport, true, 0, 0);
if (IS_ERR(vs)) {
ovs_vport_free(vport);
return (void *)vs;
--
1.9.3
* [PATCH 2/6] vxlan: Group Policy extension
From: Thomas Graf @ 2015-01-07 2:05 UTC
To: davem, jesse, stephen, pshelar; +Cc: netdev, dev
In-Reply-To: <cover.1420594925.git.tgraf@suug.ch>
Implements support for the Group Policy VXLAN extension [0] to provide
a lightweight and simple security label mechanism across network peers
based on VXLAN. The security context and associated metadata are mapped
to/from skb->mark. This allows further mapping to a SELinux context
using SECMARK, or implementing ACLs directly with nftables, iptables,
OVS, tc, etc.
The group membership is defined by the lower 16 bits of skb->mark, the
upper 16 bits are used for flags.
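
The 16/16 split of the mark described above could be handled along these
lines. The EX_GBP_* constants and both helpers are hypothetical: the bit
positions chosen for the two flags are assumptions made for this sketch,
and the authoritative values are the VXLAN_GBP_* definitions in
include/net/vxlan.h from this series.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Assumed layout: group ID in the low 16 bits, flags in the high 16. */
#define EX_GBP_ID_MASK		0x0000FFFFu
#define EX_GBP_DONT_LEARN	(1u << 22)	/* assumed bit position */
#define EX_GBP_POLICY_APPLIED	(1u << 19)	/* assumed bit position */

static_assert(((EX_GBP_DONT_LEARN | EX_GBP_POLICY_APPLIED) &
	       EX_GBP_ID_MASK) == 0,
	      "flag bits must not overlap the group ID");

/* Combine a group ID and the two flags into a 32-bit mark value. */
static uint32_t ex_gbp_to_mark(uint16_t group_id, bool dont_learn,
			       bool policy_applied)
{
	uint32_t mark = group_id;

	if (dont_learn)
		mark |= EX_GBP_DONT_LEARN;
	if (policy_applied)
		mark |= EX_GBP_POLICY_APPLIED;
	return mark;
}

/* Extract the group ID from a mark, discarding the flag bits. */
static uint16_t ex_mark_to_group(uint32_t mark)
{
	return (uint16_t)(mark & EX_GBP_ID_MASK);
}
```

A mark of 0x200 with no flags set therefore carries group 0x200, which is
the value used in the iptables examples further down.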
SELinux allows managing labels to secure local resources. However,
distributed applications require ACLs to be implemented across hosts.
This is typically achieved by matching on L2-L4 fields to identify the
original sending host and process on the receiver. On top of that,
netlabel and specifically CIPSO [1] allow mapping security contexts to
universal labels. However, netlabel and CIPSO are relatively complex.

This patch provides a lightweight alternative for overlay network
environments with a trusted underlay. No additional control protocol
is required.
  Host 1:                      Host 2:
  Group A       Group B         Group B    Group A
  +-----+   +-------------+    +-------+   +-----+
  | lxc |   | SELinux CTX |    | httpd |   | VM  |
  +--+--+   +--+----------+    +---+---+   +--+--+
      \---+---/                     \----+---/
          |                              |
      +---+---+                      +---+---+
      | vxlan |                      | vxlan |
      +---+---+                      +---+---+
          +------------------------------+
Backwards compatibility:
A VXLAN-GBP socket can receive standard VXLAN frames and will assign
the default group 0x0000 to such frames. A standard (non-GBP) Linux
VXLAN socket will drop VXLAN-GBP frames. The extension is therefore
disabled by default and needs to be specifically enabled:

   ip link add [...] type vxlan [...] gbp

In a mixed environment with VXLAN and VXLAN-GBP sockets, the GBP socket
must run on a separate port number.
Examples:
 iptables:
  $ iptables -I OUTPUT -p icmp -j MARK --set-mark 0x200
  $ iptables -I INPUT -i br0 -m mark --mark 0x200 -j ACCEPT

 OVS (patches provided separately):
  in_port=1, actions=load:0x200->NXM_NX_TUN_GBP_ID[],NORMAL
[0] https://tools.ietf.org/html/draft-smith-vxlan-group-policy
[1] http://lwn.net/Articles/204905/
Signed-off-by: Thomas Graf <tgraf@suug.ch>
---
drivers/net/vxlan.c | 155 ++++++++++++++++++++++++++++++------------
include/net/vxlan.h | 80 ++++++++++++++++++++--
include/uapi/linux/if_link.h | 8 +++
net/openvswitch/vport-vxlan.c | 9 ++-
4 files changed, 197 insertions(+), 55 deletions(-)
diff --git a/drivers/net/vxlan.c b/drivers/net/vxlan.c
index 4d52aa9..30b7b59 100644
--- a/drivers/net/vxlan.c
+++ b/drivers/net/vxlan.c
@@ -132,6 +132,7 @@ struct vxlan_dev {
__u8 tos; /* TOS override */
__u8 ttl;
u32 flags; /* VXLAN_F_* in vxlan.h */
+ u32 exts; /* Enabled extensions */
struct work_struct sock_work;
struct work_struct igmp_join;
@@ -568,7 +569,8 @@ static struct sk_buff **vxlan_gro_receive(struct sk_buff **head, struct sk_buff
continue;
vh2 = (struct vxlanhdr *)(p->data + off_vx);
- if (vh->vx_vni != vh2->vx_vni) {
+ if (vh->vx_flags != vh2->vx_flags ||
+ vh->vx_vni != vh2->vx_vni) {
NAPI_GRO_CB(p)->same_flow = 0;
continue;
}
@@ -1095,6 +1097,7 @@ static int vxlan_udp_encap_recv(struct sock *sk, struct sk_buff *skb)
{
struct vxlan_sock *vs;
struct vxlanhdr *vxh;
+ struct vxlan_metadata md = {0};
/* Need Vxlan and inner Ethernet header to be present */
if (!pskb_may_pull(skb, VXLAN_HLEN))
@@ -1113,6 +1116,19 @@ static int vxlan_udp_encap_recv(struct sock *sk, struct sk_buff *skb)
if (vs->exts) {
if (!vxh->vni_present)
goto error_invalid_header;
+
+ if (vxh->gbp_present) {
+ if (!(vs->exts & VXLAN_EXT_GBP))
+ goto error_invalid_header;
+
+ md.gbp = ntohs(vxh->gbp.policy_id);
+
+ if (vxh->gbp.dont_learn)
+ md.gbp |= VXLAN_GBP_DONT_LEARN;
+
+ if (vxh->gbp.policy_applied)
+ md.gbp |= VXLAN_GBP_POLICY_APPLIED;
+ }
} else {
if (vxh->vx_flags != htonl(VXLAN_FLAGS) ||
(vxh->vx_vni & htonl(0xff)))
@@ -1122,7 +1138,8 @@ static int vxlan_udp_encap_recv(struct sock *sk, struct sk_buff *skb)
if (iptunnel_pull_header(skb, VXLAN_HLEN, htons(ETH_P_TEB)))
goto drop;
- vs->rcv(vs, skb, vxh->vx_vni);
+ md.vni = vxh->vx_vni;
+ vs->rcv(vs, skb, &md);
return 0;
drop:
@@ -1138,8 +1155,8 @@ error:
return 1;
}
-static void vxlan_rcv(struct vxlan_sock *vs,
- struct sk_buff *skb, __be32 vx_vni)
+static void vxlan_rcv(struct vxlan_sock *vs, struct sk_buff *skb,
+ struct vxlan_metadata *md)
{
struct iphdr *oip = NULL;
struct ipv6hdr *oip6 = NULL;
@@ -1150,7 +1167,7 @@ static void vxlan_rcv(struct vxlan_sock *vs,
int err = 0;
union vxlan_addr *remote_ip;
- vni = ntohl(vx_vni) >> 8;
+ vni = ntohl(md->vni) >> 8;
/* Is this VNI defined? */
vxlan = vxlan_vs_find_vni(vs, vni);
if (!vxlan)
@@ -1184,6 +1201,7 @@ static void vxlan_rcv(struct vxlan_sock *vs,
goto drop;
skb_reset_network_header(skb);
+ skb->mark = md->gbp;
if (oip6)
err = IP6_ECN_decapsulate(oip6, skb);
@@ -1533,15 +1551,54 @@ static bool route_shortcircuit(struct net_device *dev, struct sk_buff *skb)
return false;
}
+static int vxlan_build_hdr(struct sk_buff *skb, struct vxlan_sock *vs,
+ int min_headroom, struct vxlan_metadata *md)
+{
+ struct vxlanhdr *vxh;
+ int err;
+
+ /* Need space for new headers (invalidates iph ptr) */
+ err = skb_cow_head(skb, min_headroom);
+ if (unlikely(err)) {
+ kfree_skb(skb);
+ return err;
+ }
+
+ skb = vlan_hwaccel_push_inside(skb);
+ if (WARN_ON(!skb))
+ return -ENOMEM;
+
+ vxh = (struct vxlanhdr *) __skb_push(skb, sizeof(*vxh));
+ vxh->vx_flags = htonl(VXLAN_FLAGS);
+ vxh->vx_vni = md->vni;
+
+ if (vs->exts) {
+ if (vs->exts & VXLAN_EXT_GBP) {
+ vxh->gbp_present = 1;
+
+ if (md->gbp & VXLAN_GBP_DONT_LEARN)
+ vxh->gbp.dont_learn = 1;
+
+ if (md->gbp & VXLAN_GBP_POLICY_APPLIED)
+ vxh->gbp.policy_applied = 1;
+
+ vxh->gbp.policy_id = htons(md->gbp & VXLAN_GBP_ID_MASK);
+ }
+ }
+
+ skb_set_inner_protocol(skb, htons(ETH_P_TEB));
+
+ return 0;
+}
+
#if IS_ENABLED(CONFIG_IPV6)
static int vxlan6_xmit_skb(struct vxlan_sock *vs,
struct dst_entry *dst, struct sk_buff *skb,
struct net_device *dev, struct in6_addr *saddr,
struct in6_addr *daddr, __u8 prio, __u8 ttl,
- __be16 src_port, __be16 dst_port, __be32 vni,
- bool xnet)
+ __be16 src_port, __be16 dst_port,
+ struct vxlan_metadata *md, bool xnet)
{
- struct vxlanhdr *vxh;
int min_headroom;
int err;
bool udp_sum = !udp_get_no_check6_tx(vs->sock->sk);
@@ -1558,24 +1615,9 @@ static int vxlan6_xmit_skb(struct vxlan_sock *vs,
+ VXLAN_HLEN + sizeof(struct ipv6hdr)
+ (vlan_tx_tag_present(skb) ? VLAN_HLEN : 0);
- /* Need space for new headers (invalidates iph ptr) */
- err = skb_cow_head(skb, min_headroom);
- if (unlikely(err)) {
- kfree_skb(skb);
- goto err;
- }
-
- skb = vlan_hwaccel_push_inside(skb);
- if (WARN_ON(!skb)) {
- err = -ENOMEM;
+ err = vxlan_build_hdr(skb, vs, min_headroom, md);
+ if (err)
goto err;
- }
-
- vxh = (struct vxlanhdr *) __skb_push(skb, sizeof(*vxh));
- vxh->vx_flags = htonl(VXLAN_FLAGS);
- vxh->vx_vni = vni;
-
- skb_set_inner_protocol(skb, htons(ETH_P_TEB));
udp_tunnel6_xmit_skb(vs->sock, dst, skb, dev, saddr, daddr, prio,
ttl, src_port, dst_port);
@@ -1589,9 +1631,9 @@ err:
int vxlan_xmit_skb(struct vxlan_sock *vs,
struct rtable *rt, struct sk_buff *skb,
__be32 src, __be32 dst, __u8 tos, __u8 ttl, __be16 df,
- __be16 src_port, __be16 dst_port, __be32 vni, bool xnet)
+ __be16 src_port, __be16 dst_port,
+ struct vxlan_metadata *md, bool xnet)
{
- struct vxlanhdr *vxh;
int min_headroom;
int err;
bool udp_sum = !vs->sock->sk->sk_no_check_tx;
@@ -1604,22 +1646,9 @@ int vxlan_xmit_skb(struct vxlan_sock *vs,
+ VXLAN_HLEN + sizeof(struct iphdr)
+ (vlan_tx_tag_present(skb) ? VLAN_HLEN : 0);
- /* Need space for new headers (invalidates iph ptr) */
- err = skb_cow_head(skb, min_headroom);
- if (unlikely(err)) {
- kfree_skb(skb);
+ err = vxlan_build_hdr(skb, vs, min_headroom, md);
+ if (err)
return err;
- }
-
- skb = vlan_hwaccel_push_inside(skb);
- if (WARN_ON(!skb))
- return -ENOMEM;
-
- vxh = (struct vxlanhdr *) __skb_push(skb, sizeof(*vxh));
- vxh->vx_flags = htonl(VXLAN_FLAGS);
- vxh->vx_vni = vni;
-
- skb_set_inner_protocol(skb, htons(ETH_P_TEB));
return udp_tunnel_xmit_skb(vs->sock, rt, skb, src, dst, tos,
ttl, df, src_port, dst_port, xnet);
@@ -1679,6 +1708,7 @@ static void vxlan_xmit_one(struct sk_buff *skb, struct net_device *dev,
const struct iphdr *old_iph;
struct flowi4 fl4;
union vxlan_addr *dst;
+ struct vxlan_metadata md;
__be16 src_port = 0, dst_port;
u32 vni;
__be16 df = 0;
@@ -1749,11 +1779,12 @@ static void vxlan_xmit_one(struct sk_buff *skb, struct net_device *dev,
tos = ip_tunnel_ecn_encap(tos, old_iph, skb);
ttl = ttl ? : ip4_dst_hoplimit(&rt->dst);
+ md.vni = htonl(vni << 8);
+ md.gbp = skb->mark;
err = vxlan_xmit_skb(vxlan->vn_sock, rt, skb,
fl4.saddr, dst->sin.sin_addr.s_addr,
- tos, ttl, df, src_port, dst_port,
- htonl(vni << 8),
+ tos, ttl, df, src_port, dst_port, &md,
!net_eq(vxlan->net, dev_net(vxlan->dev)));
if (err < 0) {
/* skb is already freed. */
@@ -1806,10 +1837,12 @@ static void vxlan_xmit_one(struct sk_buff *skb, struct net_device *dev,
}
ttl = ttl ? : ip6_dst_hoplimit(ndst);
+ md.vni = htonl(vni << 8);
+ md.gbp = skb->mark;
err = vxlan6_xmit_skb(vxlan->vn_sock, ndst, skb,
dev, &fl6.saddr, &fl6.daddr, 0, ttl,
- src_port, dst_port, htonl(vni << 8),
+ src_port, dst_port, &md,
!net_eq(vxlan->net, dev_net(vxlan->dev)));
#endif
}
@@ -2210,6 +2243,11 @@ static const struct nla_policy vxlan_policy[IFLA_VXLAN_MAX + 1] = {
[IFLA_VXLAN_UDP_CSUM] = { .type = NLA_U8 },
[IFLA_VXLAN_UDP_ZERO_CSUM6_TX] = { .type = NLA_U8 },
[IFLA_VXLAN_UDP_ZERO_CSUM6_RX] = { .type = NLA_U8 },
+ [IFLA_VXLAN_EXTENSION] = { .type = NLA_NESTED },
+};
+
+static const struct nla_policy vxlan_ext_policy[IFLA_VXLAN_EXT_MAX + 1] = {
+ [IFLA_VXLAN_EXT_GBP] = { .type = NLA_FLAG, },
};
static int vxlan_validate(struct nlattr *tb[], struct nlattr *data[])
@@ -2246,6 +2284,18 @@ static int vxlan_validate(struct nlattr *tb[], struct nlattr *data[])
}
}
+ if (data[IFLA_VXLAN_EXTENSION]) {
+ int err;
+
+ err = nla_validate_nested(data[IFLA_VXLAN_EXTENSION],
+ IFLA_VXLAN_EXT_MAX, vxlan_ext_policy);
+ if (err < 0) {
+ pr_debug("invalid VXLAN extension configuration: %d\n",
+ err);
+ return -EINVAL;
+ }
+ }
+
return 0;
}
@@ -2400,6 +2450,18 @@ static void vxlan_sock_work(struct work_struct *work)
dev_put(vxlan->dev);
}
+static void configure_vxlan_exts(struct vxlan_dev *vxlan, struct nlattr *attr)
+{
+ struct nlattr *exts[IFLA_VXLAN_EXT_MAX+1];
+
+ /* Validated in vxlan_validate() */
+ if (nla_parse_nested(exts, IFLA_VXLAN_EXT_MAX, attr, NULL) < 0)
+ BUG();
+
+ if (exts[IFLA_VXLAN_EXT_GBP])
+ vxlan->exts |= VXLAN_EXT_GBP;
+}
+
static int vxlan_newlink(struct net *net, struct net_device *dev,
struct nlattr *tb[], struct nlattr *data[])
{
@@ -2525,6 +2587,9 @@ static int vxlan_newlink(struct net *net, struct net_device *dev,
nla_get_u8(data[IFLA_VXLAN_UDP_ZERO_CSUM6_RX]))
vxlan->flags |= VXLAN_F_UDP_ZERO_CSUM6_RX;
+ if (data[IFLA_VXLAN_EXTENSION])
+ configure_vxlan_exts(vxlan, data[IFLA_VXLAN_EXTENSION]);
+
if (vxlan_find_vni(net, vni, use_ipv6 ? AF_INET6 : AF_INET,
vxlan->dst_port)) {
pr_info("duplicate VNI %u\n", vni);
diff --git a/include/net/vxlan.h b/include/net/vxlan.h
index 3e98d31..66000d0 100644
--- a/include/net/vxlan.h
+++ b/include/net/vxlan.h
@@ -11,13 +11,60 @@
#define VNI_HASH_BITS 10
#define VNI_HASH_SIZE (1<<VNI_HASH_BITS)
+/*
+ * VXLAN Group Based Policy Extension:
+ * +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
+ * |1|-|-|-|1|-|-|-|R|D|R|R|A|R|R|R| Group Policy ID |
+ * +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
+ * | VXLAN Network Identifier (VNI) | Reserved |
+ * +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
+ *
+ * D = Don't Learn bit. When set, this bit indicates that the egress
+ * VTEP MUST NOT learn the source address of the encapsulated frame.
+ *
+ * A = Indicates that the group policy has already been applied to
+ * this packet. Policies MUST NOT be applied by devices when the
+ * A bit is set.
+ *
+ * [0] https://tools.ietf.org/html/draft-smith-vxlan-group-policy
+ */
+struct vxlan_gbp {
+#ifdef __LITTLE_ENDIAN_BITFIELD
+ __u8 reserved_flags1:3,
+ policy_applied:1,
+ reserved_flags2:2,
+ dont_learn:1,
+ reserved_flags3:1;
+#elif defined(__BIG_ENDIAN_BITFIELD)
+ __u8 reserved_flags1:1,
+ dont_learn:1,
+ reserved_flags2:2,
+ policy_applied:1,
+ reserved_flags3:3;
+#else
+#error "Please fix <asm/byteorder.h>"
+#endif
+ __be16 policy_id;
+} __packed;
+
+/* skb->mark mapping
+ *
+ * +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
+ * |R|R|R|R|R|R|R|R|R|D|R|R|A|R|R|R| Group Policy ID |
+ * +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
+ */
+#define VXLAN_GBP_DONT_LEARN (BIT(6) << 16)
+#define VXLAN_GBP_POLICY_APPLIED (BIT(3) << 16)
+#define VXLAN_GBP_ID_MASK (0xFFFF)
+
/* VXLAN protocol header:
* +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
- * |R|R|R|R|I|R|R|R| Reserved |
+ * |G|R|R|R|I|R|R|R| Reserved |
* +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
* | VXLAN Network Identifier (VNI) | Reserved |
* +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
*
+ * G = 1 Group Policy (VXLAN-GBP)
* I = 1 VXLAN Network Identifier (VNI) present
*/
struct vxlanhdr {
@@ -26,24 +73,42 @@ struct vxlanhdr {
#ifdef __LITTLE_ENDIAN_BITFIELD
__u8 reserved_flags1:3,
vni_present:1,
- reserved_flags2:4;
+ reserved_flags2:3,
+ gbp_present:1;
#elif defined(__BIG_ENDIAN_BITFIELD)
- __u8 reserved_flags2:4,
+ __u8 gbp_present:1,
+ reserved_flags2:3,
vni_present:1,
reserved_flags1:3;
#else
#error "Please fix <asm/byteorder.h>"
#endif
- __u8 vx_reserved1;
- __be16 vx_reserved2;
+ union {
+ /* NOTE: Offset 0 will be 1 byte aligned, so
+ * all member structs must be marked packed.
+ */
+ struct vxlan_gbp gbp;
+ struct {
+ __u8 vx_reserved1;
+ __be16 vx_reserved2;
+ } __packed;
+ };
};
__be32 vx_flags;
};
__be32 vx_vni;
};
+struct vxlan_metadata {
+ __be32 vni;
+ u32 gbp;
+};
+
struct vxlan_sock;
-typedef void (vxlan_rcv_t)(struct vxlan_sock *vh, struct sk_buff *skb, __be32 key);
+typedef void (vxlan_rcv_t)(struct vxlan_sock *vh, struct sk_buff *skb,
+ struct vxlan_metadata *md);
+
+#define VXLAN_EXT_GBP BIT(0)
/* per UDP socket information */
struct vxlan_sock {
@@ -78,7 +143,8 @@ void vxlan_sock_release(struct vxlan_sock *vs);
int vxlan_xmit_skb(struct vxlan_sock *vs,
struct rtable *rt, struct sk_buff *skb,
__be32 src, __be32 dst, __u8 tos, __u8 ttl, __be16 df,
- __be16 src_port, __be16 dst_port, __be32 vni, bool xnet);
+ __be16 src_port, __be16 dst_port, struct vxlan_metadata *md,
+ bool xnet);
static inline netdev_features_t vxlan_features_check(struct sk_buff *skb,
netdev_features_t features)
diff --git a/include/uapi/linux/if_link.h b/include/uapi/linux/if_link.h
index f7d0d2d..9f07bf5 100644
--- a/include/uapi/linux/if_link.h
+++ b/include/uapi/linux/if_link.h
@@ -370,10 +370,18 @@ enum {
IFLA_VXLAN_UDP_CSUM,
IFLA_VXLAN_UDP_ZERO_CSUM6_TX,
IFLA_VXLAN_UDP_ZERO_CSUM6_RX,
+ IFLA_VXLAN_EXTENSION,
__IFLA_VXLAN_MAX
};
#define IFLA_VXLAN_MAX (__IFLA_VXLAN_MAX - 1)
+enum {
+ IFLA_VXLAN_EXT_UNSPEC,
+ IFLA_VXLAN_EXT_GBP,
+ __IFLA_VXLAN_EXT_MAX,
+};
+#define IFLA_VXLAN_EXT_MAX (__IFLA_VXLAN_EXT_MAX - 1)
+
struct ifla_vxlan_port_range {
__be16 low;
__be16 high;
diff --git a/net/openvswitch/vport-vxlan.c b/net/openvswitch/vport-vxlan.c
index d7c46b3..dd68c97 100644
--- a/net/openvswitch/vport-vxlan.c
+++ b/net/openvswitch/vport-vxlan.c
@@ -59,7 +59,8 @@ static inline struct vxlan_port *vxlan_vport(const struct vport *vport)
}
/* Called with rcu_read_lock and BH disabled. */
-static void vxlan_rcv(struct vxlan_sock *vs, struct sk_buff *skb, __be32 vx_vni)
+static void vxlan_rcv(struct vxlan_sock *vs, struct sk_buff *skb,
+ struct vxlan_metadata *md)
{
struct ovs_tunnel_info tun_info;
struct vport *vport = vs->data;
@@ -68,7 +69,7 @@ static void vxlan_rcv(struct vxlan_sock *vs, struct sk_buff *skb, __be32 vx_vni)
/* Save outer tunnel values */
iph = ip_hdr(skb);
- key = cpu_to_be64(ntohl(vx_vni) >> 8);
+ key = cpu_to_be64(ntohl(md->vni) >> 8);
ovs_flow_tun_info_init(&tun_info, iph,
udp_hdr(skb)->source, udp_hdr(skb)->dest,
key, TUNNEL_KEY, NULL, 0);
@@ -146,6 +147,7 @@ static int vxlan_tnl_send(struct vport *vport, struct sk_buff *skb)
struct vxlan_port *vxlan_port = vxlan_vport(vport);
__be16 dst_port = inet_sk(vxlan_port->vs->sock->sk)->inet_sport;
struct ovs_key_ipv4_tunnel *tun_key;
+ struct vxlan_metadata md;
struct rtable *rt;
struct flowi4 fl;
__be16 src_port;
@@ -178,12 +180,13 @@ static int vxlan_tnl_send(struct vport *vport, struct sk_buff *skb)
skb->ignore_df = 1;
src_port = udp_flow_src_port(net, skb, 0, 0, true);
+ md.vni = htonl(be64_to_cpu(tun_key->tun_id) << 8);
err = vxlan_xmit_skb(vxlan_port->vs, rt, skb,
fl.saddr, tun_key->ipv4_dst,
tun_key->ipv4_tos, tun_key->ipv4_ttl, df,
src_port, dst_port,
- htonl(be64_to_cpu(tun_key->tun_id) << 8),
+ &md,
false);
if (err < 0)
ip_rt_put(rt);
--
1.9.3
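For readers following the callers above: the VNI sits in the upper 24 bits
of the second header word, which is why the transmit paths pass
htonl(vni << 8) and the OVS receive path shifts right by 8 after ntohl().
A minimal userspace sketch of that round trip follows. The helper names are
made up for illustration and are not part of the patch.

```c
#include <assert.h>
#include <stdint.h>
#include <arpa/inet.h>

/* Hypothetical helper: place a 24-bit VNI in the upper bits of the
 * big-endian word, leaving the low 8 reserved bits zero.
 */
static uint32_t sketch_vni_to_wire(uint32_t vni)
{
	assert(vni < (1u << 24));	/* VNIs are 24 bits wide */
	return htonl(vni << 8);
}

/* Hypothetical helper: recover the VNI from the on-wire word. */
static uint32_t sketch_vni_from_wire(uint32_t wire)
{
	return ntohl(wire) >> 8;
}
```

The reserved low byte stays zero on transmit, which is what the conservative
receive check (vx_vni & htonl(0xff)) relies on.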
* [PATCH 1/6] vxlan: Allow for VXLAN extensions to be implemented
From: Thomas Graf @ 2015-01-07 2:05 UTC (permalink / raw)
To: davem, jesse, stephen, pshelar; +Cc: netdev, dev
In-Reply-To: <cover.1420594925.git.tgraf@suug.ch>
The VXLAN receive code is currently conservative in what it accepts and
will reject any frame that uses any of the reserved VXLAN protocol fields.
The VXLAN draft specifies that "reserved fields MUST be set to zero on
transmit and ignored on receive."
Retain the current conservative parsing behaviour by default, but allow
these fields to be used by VXLAN extensions which are explicitly enabled
on the VXLAN socket or, respectively, the VXLAN net_device.
Signed-off-by: Thomas Graf <tgraf@suug.ch>
---
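Illustration only, not part of the patch: a userspace sketch of the receive
check described above, assuming a simplified two-word header. The struct and
function names are hypothetical. With no extensions enabled the flags word
must match exactly and the reserved VNI byte must be zero; with any
extension enabled only the I bit is required.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <arpa/inet.h>

#define SKETCH_VXLAN_FLAGS 0x08000000u	/* I bit set, all else zero */

/* Hypothetical stand-in for the on-wire header: two big-endian words. */
struct sketch_vxlanhdr {
	uint32_t vx_flags;
	uint32_t vx_vni;
};

/* Returns true if the header would be accepted. */
static bool sketch_vxlan_hdr_ok(const struct sketch_vxlanhdr *h, uint32_t exts)
{
	assert(h);

	/* Extensions enabled: reserved bits may be used, VNI must be present. */
	if (exts)
		return (ntohl(h->vx_flags) & SKETCH_VXLAN_FLAGS) != 0;

	/* Default: conservative, reject any use of reserved fields. */
	return h->vx_flags == htonl(SKETCH_VXLAN_FLAGS) &&
	       !(h->vx_vni & htonl(0xff));
}
```

A frame with extra flag bits set is therefore rejected by a default socket
and accepted by one with an extension enabled, provided the I bit is set.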
drivers/net/vxlan.c | 29 +++++++++++++++++++----------
include/net/vxlan.h | 32 +++++++++++++++++++++++++++++---
2 files changed, 48 insertions(+), 13 deletions(-)
diff --git a/drivers/net/vxlan.c b/drivers/net/vxlan.c
index 2ab0922..4d52aa9 100644
--- a/drivers/net/vxlan.c
+++ b/drivers/net/vxlan.c
@@ -65,7 +65,7 @@
#define VXLAN_VID_MASK (VXLAN_N_VID - 1)
#define VXLAN_HLEN (sizeof(struct udphdr) + sizeof(struct vxlanhdr))
-#define VXLAN_FLAGS 0x08000000 /* struct vxlanhdr.vx_flags required value. */
+#define VXLAN_FLAGS 0x08000000 /* struct vxlanhdr.vx_flags default value. */
/* UDP port for VXLAN traffic.
* The IANA assigned port is 4789, but the Linux default is 8472
@@ -1100,22 +1100,28 @@ static int vxlan_udp_encap_recv(struct sock *sk, struct sk_buff *skb)
if (!pskb_may_pull(skb, VXLAN_HLEN))
goto error;
+ vs = rcu_dereference_sk_user_data(sk);
+ if (!vs)
+ goto drop;
+
/* Return packets with reserved bits set */
vxh = (struct vxlanhdr *)(udp_hdr(skb) + 1);
- if (vxh->vx_flags != htonl(VXLAN_FLAGS) ||
- (vxh->vx_vni & htonl(0xff))) {
- netdev_dbg(skb->dev, "invalid vxlan flags=%#x vni=%#x\n",
- ntohl(vxh->vx_flags), ntohl(vxh->vx_vni));
- goto error;
+
+ /* For backwards compatibility, only allow reserved fields to be
+ * used by VXLAN extensions if explicitly requested.
+ */
+ if (vs->exts) {
+ if (!vxh->vni_present)
+ goto error_invalid_header;
+ } else {
+ if (vxh->vx_flags != htonl(VXLAN_FLAGS) ||
+ (vxh->vx_vni & htonl(0xff)))
+ goto error_invalid_header;
}
if (iptunnel_pull_header(skb, VXLAN_HLEN, htons(ETH_P_TEB)))
goto drop;
- vs = rcu_dereference_sk_user_data(sk);
- if (!vs)
- goto drop;
-
vs->rcv(vs, skb, vxh->vx_vni);
return 0;
@@ -1124,6 +1130,9 @@ drop:
kfree_skb(skb);
return 0;
+error_invalid_header:
+ netdev_dbg(skb->dev, "invalid vxlan flags=%#x vni=%#x\n",
+ ntohl(vxh->vx_flags), ntohl(vxh->vx_vni));
error:
/* Return non vxlan pkt */
return 1;
diff --git a/include/net/vxlan.h b/include/net/vxlan.h
index 903461a..3e98d31 100644
--- a/include/net/vxlan.h
+++ b/include/net/vxlan.h
@@ -11,10 +11,35 @@
#define VNI_HASH_BITS 10
#define VNI_HASH_SIZE (1<<VNI_HASH_BITS)
-/* VXLAN protocol header */
+/* VXLAN protocol header:
+ * +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
+ * |R|R|R|R|I|R|R|R| Reserved |
+ * +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
+ * | VXLAN Network Identifier (VNI) | Reserved |
+ * +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
+ *
+ * I = 1 VXLAN Network Identifier (VNI) present
+ */
struct vxlanhdr {
- __be32 vx_flags;
- __be32 vx_vni;
+ union {
+ struct {
+#ifdef __LITTLE_ENDIAN_BITFIELD
+ __u8 reserved_flags1:3,
+ vni_present:1,
+ reserved_flags2:4;
+#elif defined(__BIG_ENDIAN_BITFIELD)
+ __u8 reserved_flags2:4,
+ vni_present:1,
+ reserved_flags1:3;
+#else
+#error "Please fix <asm/byteorder.h>"
+#endif
+ __u8 vx_reserved1;
+ __be16 vx_reserved2;
+ };
+ __be32 vx_flags;
+ };
+ __be32 vx_vni;
};
struct vxlan_sock;
@@ -25,6 +50,7 @@ struct vxlan_sock {
struct hlist_node hlist;
vxlan_rcv_t *rcv;
void *data;
+ u32 exts;
struct work_struct del_work;
struct socket *sock;
struct rcu_head rcu;
--
1.9.3
* [PATCH 0/6 net-next] VXLAN Group Policy Extension
From: Thomas Graf @ 2015-01-07 2:05 UTC (permalink / raw)
To: davem, jesse, stephen, pshelar; +Cc: netdev, dev
Implements support for the Group Policy VXLAN extension [0] to provide
a lightweight and simple security label mechanism across network peers
based on VXLAN. The security context and associated metadata are mapped
to/from skb->mark. This allows further mapping to a SELinux context
using SECMARK, and implementing ACLs directly with nftables, iptables,
OVS, tc, etc.
The extension is disabled by default and should be run on a distinct
port in mixed Linux VXLAN VTEP environments. Liberal VXLAN VTEPs
which ignore unknown reserved bits will be able to receive VXLAN-GBP
frames.
Simple usage example:
10.1.1.1:
# ip link add vxlan0 type vxlan id 10 remote 10.1.1.2 gbp
# iptables -I OUTPUT -m owner --uid-owner 101 -j MARK --set-mark 0x200
10.1.1.2:
# ip link add vxlan0 type vxlan id 10 remote 10.1.1.1 gbp
# iptables -I INPUT -m mark --mark 0x200 -j DROP
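To make the mark mapping in the example concrete, here is a small userspace
sketch of how a 32-bit mark could be split into the GBP fields. The bit
positions mirror the VXLAN_GBP_* masks defined in patch 2/6 of this series;
the struct and helper names are hypothetical and exist only for
illustration.

```c
#include <stdbool.h>
#include <stdint.h>

/* Mirrors of the masks from patch 2/6, renamed to avoid clashes. */
#define SKETCH_GBP_DONT_LEARN		((1u << 6) << 16)
#define SKETCH_GBP_POLICY_APPLIED	((1u << 3) << 16)
#define SKETCH_GBP_ID_MASK		0xFFFFu

/* Hypothetical decoded view of the group policy metadata. */
struct sketch_gbp {
	uint16_t policy_id;
	bool dont_learn;
	bool policy_applied;
};

/* Split a mark into policy ID and the D/A flags. */
static struct sketch_gbp sketch_mark_to_gbp(uint32_t mark)
{
	struct sketch_gbp g;

	g.policy_id = (uint16_t)(mark & SKETCH_GBP_ID_MASK);
	g.dont_learn = (mark & SKETCH_GBP_DONT_LEARN) != 0;
	g.policy_applied = (mark & SKETCH_GBP_POLICY_APPLIED) != 0;
	return g;
}

/* Rebuild the mark from the decoded fields. */
static uint32_t sketch_gbp_to_mark(struct sketch_gbp g)
{
	uint32_t mark = g.policy_id;

	if (g.dont_learn)
		mark |= SKETCH_GBP_DONT_LEARN;
	if (g.policy_applied)
		mark |= SKETCH_GBP_POLICY_APPLIED;
	return mark;
}
```

With the mark 0x200 used above, this yields policy ID 0x200 with both the
D and A flags clear.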
iproute2 [1] and OVS [2] support will be provided in separate patches.
[0] https://tools.ietf.org/html/draft-smith-vxlan-group-policy
[1] https://github.com/tgraf/iproute2/tree/vxlan-gbp
[2] https://github.com/tgraf/ovs/tree/vxlan-gbp
Thomas Graf (6):
vxlan: Allow for VXLAN extensions to be implemented
vxlan: Group Policy extension
vxlan: Only bind to sockets with correct extensions enabled
vxlan: Fail build if VXLAN header is misdefined
openvswitch: Rename GENEVE_TUN_OPTS() to TUN_METADATA_OPTS()
openvswitch: Support VXLAN Group Policy extension
drivers/net/vxlan.c | 221 +++++++++++++++++++++++++++------------
include/net/vxlan.h | 104 ++++++++++++++++--
include/uapi/linux/if_link.h | 8 ++
include/uapi/linux/openvswitch.h | 19 ++++
net/openvswitch/flow.c | 2 +-
net/openvswitch/flow.h | 14 +--
net/openvswitch/flow_netlink.c | 111 ++++++++++++--------
net/openvswitch/vport-vxlan.c | 89 +++++++++++++++-
8 files changed, 435 insertions(+), 133 deletions(-)
--
1.9.3
* Re: route/max_size sysctl in ipv4
From: Ani Sinha @ 2015-01-07 1:56 UTC (permalink / raw)
To: David Miller; +Cc: netdev@vger.kernel.org
In-Reply-To: <20150105.195128.794605376092864881.davem@davemloft.net>
On Mon, Jan 5, 2015 at 4:51 PM, David Miller <davem@davemloft.net> wrote:
> The sysctl is kept so that scripts reading it don't suddenly stop
> working. We can't just remove sysctl values.
Interestingly, one of our scripts did break. It broke because now this
sysctl is only available in the global net namespace and not in the
child namespaces. If not breaking scripts is the fundamental logic in
keeping the sysctl intact, would you guys be open to accepting a patch
where we make this sysctl available for all net namespaces?
* RE: [PATCH net-next v1 2/3] ARM: imx: add FEC sleep mode callback function
From: fugang.duan @ 2015-01-07 1:41 UTC (permalink / raw)
To: Shawn Guo
Cc: davem@davemloft.net, netdev@vger.kernel.org,
bhutchings@solarflare.com, stephen@networkplumber.org
In-Reply-To: <20150106114807.GL24511@dragon>
From: Shawn Guo <shawn.guo@linaro.org> Sent: Tuesday, January 06, 2015 7:48 PM
> To: Duan Fugang-B38611
> Cc: davem@davemloft.net; netdev@vger.kernel.org;
> bhutchings@solarflare.com; stephen@networkplumber.org
> Subject: Re: [PATCH net-next v1 2/3] ARM: imx: add FEC sleep mode
> callback function
>
> On Wed, Dec 24, 2014 at 05:30:40PM +0800, Fugang Duan wrote:
> > i.MX6q/dl, i.MX6SX SOCs enet support sleep mode that magic packet can
> > wake up system in suspend status. For different SOCs, there have some
> > SOC specifical GPR register to set sleep on/off mode. So add these to
> > callback function for driver.
> >
> > Signed-off-by: Fugang Duan <B38611@freescale.com>
>
> I do not like this patch. In the end, this is just a GRP register bit
> setup per FEC driver need. Rather than messing up platform code for each
> SoC with the same pattern, I do not see why this can not be done by FEC
> driver itself.
>
> You can take a look at LDB driver (drivers/gpu/drm/imx/imx-ldb.c) to see
> how this can be done.
>
> Shawn
>
Hi, Shawn,
It is an SoC-related setting, not a setting of the FEC IP itself, and the GPR setting differs from SoC to SoC.
So I think it is better to put it in platform code.
Regards,
Andy
* [GIT] Networking
From: David Miller @ 2015-01-07 1:35 UTC (permalink / raw)
To: torvalds; +Cc: akpm, netdev, linux-kernel
Just a pile of random fixes, including:
1) Do not apply TSO limits to non-TSO packets, fix from Herbert
Xu.
2) MDI{,X} eeprom check in e100 driver is reversed, from
John W. Linville.
3) Missing error return assignments in several ethernet drivers,
from Julia Lawall.
4) Altera TSE device doesn't come back up after ifconfig down/up
sequence, fix from Kostya Belezko.
5) Add more cases to the check for whether the qmi_wwan device has
a bogus MAC address and needs to be assigned a random one. From
Kristian Evensen.
6) Fix interrupt hangs in CPSW, from Felipe Balbi.
7) Implement ndo_features_check in r8152 so that the stack doesn't
feed GSO packets which are outside of the chip's capabilities.
From Hayes Wang.
Please pull, thanks a lot!
The following changes since commit 2c90331cf5ed1d648a711b9483e173aaaf2c4a9b:
Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net (2014-12-30 10:45:47 -0800)
are available in the git repository at:
git://git.kernel.org/pub/scm/linux/kernel/git/davem/net.git
for you to fetch changes up to 2abad79afa700e837cb4feed170141292e0720c0:
qla3xxx: don't allow never end busy loop (2015-01-06 17:41:36 -0500)
----------------------------------------------------------------
Andy Shevchenko (1):
qla3xxx: don't allow never end busy loop
Ben Pfaff (1):
openvswitch: Consistently include VLAN header in flow and port stats.
David S. Miller (4):
Merge branch 'master' of git://git.kernel.org/.../jkirsher/net
Merge branch 'mlx4-net'
Merge tag 'mac80211-for-davem-2015-01-06' of git://git.kernel.org/.../jberg/mac80211
Merge tag 'batman-adv-fix-for-davem' of git://git.open-mesh.org/linux-merge
Felipe Balbi (1):
net: ethernet: cpsw: fix hangs with interrupts
Govindarajulu Varadarajan (1):
enic: free all rq buffs when allocation fails
Herbert Xu (1):
tcp: Do not apply TSO segment limit to non-TSO packets
Jack Morgenstein (1):
net/mlx4_core: Fix error flow in mlx4_init_hca()
Joe Perches (1):
i40e: Fix possible memory leak in i40e_dbg_dump_desc
Johannes Berg (1):
Revert "mac80211: Fix accounting of the tailroom-needed counter"
John W. Linville (1):
e100: fix typo in MDI/MDI-X eeprom check in e100_phy_init
Julia Lawall (4):
net: Xilinx: fix error return code
myri10ge: fix error return code
net: sun4i-emac: fix error return code
net: axienet: fix error return code
Kostya Belezko (1):
Altera TSE: Add missing phydev
Kristian Evensen (1):
qmi_wwan: Set random MAC on devices with buggy fw
Linus Lüssing (4):
batman-adv: fix delayed foreign originator recognition
batman-adv: fix counter for multicast supporting nodes
batman-adv: fix multicast counter when purging originators
batman-adv: fix potential TT client + orig-node memory leak
Maor Gottlieb (1):
net/mlx4_core: Correcly update the mtt's offset in the MR re-reg flow
Martin Hundebøll (1):
batman-adv: fix lock class for decoding hash in network-coding.c
Palik, Imre (1):
xen-netback: fixing the propagation of the transmit shaper timeout
Simon Wunderlich (1):
batman-adv: fix and simplify condition when bonding should be used
Todd Fujinaka (1):
igb: Remove unneeded FIXME
Yongjian Xu (1):
qlcnic: Fix return value in qlcnic_probe()
hayeswang (1):
r8152: support ndo_features_check
drivers/net/ethernet/allwinner/sun4i-emac.c | 4 +++-
drivers/net/ethernet/altera/altera_tse_main.c | 15 ++++++---------
drivers/net/ethernet/cisco/enic/enic_main.c | 6 ++++--
drivers/net/ethernet/intel/e100.c | 2 +-
drivers/net/ethernet/intel/i40e/i40e_debugfs.c | 4 +++-
drivers/net/ethernet/intel/igb/e1000_82575.c | 2 +-
drivers/net/ethernet/mellanox/mlx4/main.c | 13 ++++---------
drivers/net/ethernet/mellanox/mlx4/mr.c | 9 +++++----
drivers/net/ethernet/myricom/myri10ge/myri10ge.c | 4 +++-
drivers/net/ethernet/qlogic/qla3xxx.c | 8 +++-----
drivers/net/ethernet/qlogic/qlcnic/qlcnic_main.c | 1 +
drivers/net/ethernet/ti/cpsw.c | 19 ++++++++-----------
drivers/net/ethernet/xilinx/ll_temac_main.c | 2 ++
drivers/net/ethernet/xilinx/xilinx_axienet_main.c | 2 ++
drivers/net/ethernet/xilinx/xilinx_emaclite.c | 1 +
drivers/net/usb/qmi_wwan.c | 10 +++++++---
drivers/net/usb/r8152.c | 17 +++++++++++++++++
drivers/net/xen-netback/xenbus.c | 1 +
include/net/mac80211.h | 7 ++-----
net/batman-adv/multicast.c | 11 +++++++----
net/batman-adv/network-coding.c | 2 +-
net/batman-adv/originator.c | 7 ++++---
net/batman-adv/routing.c | 6 ++++--
net/ipv4/tcp_output.c | 4 ++--
net/mac80211/key.c | 12 +++++++++---
net/openvswitch/flow.c | 5 +++--
net/openvswitch/vport.c | 2 +-
27 files changed, 105 insertions(+), 71 deletions(-)
* [net v2 3/3] i40e: Fix bug with TCP over IPv6 over VXLAN
From: Jeff Kirsher @ 2015-01-07 1:31 UTC (permalink / raw)
To: davem
Cc: Anjali Singhai, netdev, nhorman, sassmann, jogreene, Greg Rose,
Jeff Kirsher
In-Reply-To: <1420594317-6191-1-git-send-email-jeffrey.t.kirsher@intel.com>
From: Anjali Singhai <anjali.singhai@intel.com>
The driver was examining the outer protocol layer to set the inner protocol
layer checksum offload. In the case of TCP over IPv6 over an IPv4 based
VXLAN the inner checksum offloads would be set to look for IPv4/UDP instead
of IPv6/TCP. This code fixes that so that the driver will look at the
proper layer for encapsulation offload settings.
Signed-off-by: Anjali Singhai <anjali.singhai@intel.com>
Signed-off-by: Greg Rose <gregory.v.rose@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
---
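Illustration only, not part of the patch: a userspace sketch of the idea
described above, namely deciding the L3 type from the version nibble of the
header that will actually be offloaded (inner when encapsulated, outer
otherwise) instead of from the outer protocol. All names here are
hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

enum sketch_l3 { SKETCH_L3_UNKNOWN, SKETCH_L3_IPV4, SKETCH_L3_IPV6 };

/* Both IPv4 and IPv6 carry the version in the top nibble of byte 0. */
static enum sketch_l3 sketch_l3_version(const uint8_t *hdr, size_t len)
{
	assert(hdr != NULL || len == 0);

	if (len < 1)
		return SKETCH_L3_UNKNOWN;

	switch (hdr[0] >> 4) {
	case 4:
		return SKETCH_L3_IPV4;
	case 6:
		return SKETCH_L3_IPV6;
	default:
		return SKETCH_L3_UNKNOWN;
	}
}

/* Pick the header to inspect: inner if encapsulated, outer otherwise. */
static enum sketch_l3 sketch_offload_l3(const uint8_t *outer, size_t outer_len,
					const uint8_t *inner, size_t inner_len,
					int encapsulated)
{
	return encapsulated ? sketch_l3_version(inner, inner_len)
			    : sketch_l3_version(outer, outer_len);
}
```

For TCP over IPv6 inside an IPv4 VXLAN tunnel this reports IPv6, which is
the case the old outer-protocol check got wrong.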
drivers/net/ethernet/intel/i40e/i40e_txrx.c | 24 +++++++++++-------------
1 file changed, 11 insertions(+), 13 deletions(-)
diff --git a/drivers/net/ethernet/intel/i40e/i40e_txrx.c b/drivers/net/ethernet/intel/i40e/i40e_txrx.c
index 38c7638..cecb340 100644
--- a/drivers/net/ethernet/intel/i40e/i40e_txrx.c
+++ b/drivers/net/ethernet/intel/i40e/i40e_txrx.c
@@ -1883,17 +1883,16 @@ static int i40e_tso(struct i40e_ring *tx_ring, struct sk_buff *skb,
if (err < 0)
return err;
- if (protocol == htons(ETH_P_IP)) {
- iph = skb->encapsulation ? inner_ip_hdr(skb) : ip_hdr(skb);
+ iph = skb->encapsulation ? inner_ip_hdr(skb) : ip_hdr(skb);
+ ipv6h = skb->encapsulation ? inner_ipv6_hdr(skb) : ipv6_hdr(skb);
+
+ if (iph->version == 4) {
tcph = skb->encapsulation ? inner_tcp_hdr(skb) : tcp_hdr(skb);
iph->tot_len = 0;
iph->check = 0;
tcph->check = ~csum_tcpudp_magic(iph->saddr, iph->daddr,
0, IPPROTO_TCP, 0);
- } else if (skb_is_gso_v6(skb)) {
-
- ipv6h = skb->encapsulation ? inner_ipv6_hdr(skb)
- : ipv6_hdr(skb);
+ } else if (ipv6h->version == 6) {
tcph = skb->encapsulation ? inner_tcp_hdr(skb) : tcp_hdr(skb);
ipv6h->payload_len = 0;
tcph->check = ~csum_ipv6_magic(&ipv6h->saddr, &ipv6h->daddr,
@@ -1989,13 +1988,9 @@ static void i40e_tx_enable_csum(struct sk_buff *skb, u32 tx_flags,
I40E_TX_CTX_EXT_IP_IPV4_NO_CSUM;
}
} else if (tx_flags & I40E_TX_FLAGS_IPV6) {
- if (tx_flags & I40E_TX_FLAGS_TSO) {
- *cd_tunneling |= I40E_TX_CTX_EXT_IP_IPV6;
+ *cd_tunneling |= I40E_TX_CTX_EXT_IP_IPV6;
+ if (tx_flags & I40E_TX_FLAGS_TSO)
ip_hdr(skb)->check = 0;
- } else {
- *cd_tunneling |=
- I40E_TX_CTX_EXT_IP_IPV4_NO_CSUM;
- }
}
/* Now set the ctx descriptor fields */
@@ -2005,7 +2000,10 @@ static void i40e_tx_enable_csum(struct sk_buff *skb, u32 tx_flags,
((skb_inner_network_offset(skb) -
skb_transport_offset(skb)) >> 1) <<
I40E_TXD_CTX_QW0_NATLEN_SHIFT;
-
+ if (this_ip_hdr->version == 6) {
+ tx_flags &= ~I40E_TX_FLAGS_IPV4;
+ tx_flags |= I40E_TX_FLAGS_IPV6;
+ }
} else {
network_hdr_len = skb_network_header_len(skb);
this_ip_hdr = ip_hdr(skb);
--
1.9.3
* [net v2 2/3] i40e: Fix Rx checksum error counter
From: Jeff Kirsher @ 2015-01-07 1:31 UTC (permalink / raw)
To: davem
Cc: Anjali Singhai, netdev, nhorman, sassmann, jogreene, Greg Rose,
Jeff Kirsher
In-Reply-To: <1420594317-6191-1-git-send-email-jeffrey.t.kirsher@intel.com>
From: Anjali Singhai <anjali.singhai@intel.com>
The Rx port checksum error counter was incrementing incorrectly with
UDP encapsulated tunneled traffic. This patch fixes the problem so that
the port_rx_csum counter will show accurate statistics.
Signed-off-by: Anjali Singhai <anjali.singhai@intel.com>
Signed-off-by: Greg Rose <gregory.v.rose@intel.com>
Tested-by: Jim Young <james.m.young@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
---
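Illustration only, not part of the patch: a userspace sketch of the decision
the fix introduces for tunneled IPv4 traffic. The outer checksum is only
compared when the outer protocol is UDP and the sender actually filled in a
checksum; otherwise (for example GRE, or a zero UDP checksum) nothing is
counted as an error. The function name and parameters are hypothetical.

```c
#include <stdbool.h>
#include <stdint.h>
#include <netinet/in.h>	/* IPPROTO_UDP */

/* Returns true only if an outer UDP checksum is present and mismatches.
 * wire_check is the checksum from the UDP header, computed is the value
 * the receiver derived from the packet contents.
 */
static bool sketch_outer_csum_failed(uint8_t outer_proto, uint16_t wire_check,
				     uint16_t computed)
{
	/* Not UDP (e.g. GRE): there is no outer UDP header to verify. */
	if (outer_proto != IPPROTO_UDP)
		return false;

	/* A zero UDP checksum over IPv4 means "not computed". */
	if (wire_check == 0)
		return false;

	return wire_check != computed;
}
```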
drivers/net/ethernet/intel/i40e/i40e_txrx.c | 24 +++++++++++++-----------
1 file changed, 13 insertions(+), 11 deletions(-)
diff --git a/drivers/net/ethernet/intel/i40e/i40e_txrx.c b/drivers/net/ethernet/intel/i40e/i40e_txrx.c
index f145aaf..38c7638 100644
--- a/drivers/net/ethernet/intel/i40e/i40e_txrx.c
+++ b/drivers/net/ethernet/intel/i40e/i40e_txrx.c
@@ -1325,9 +1325,7 @@ static inline void i40e_rx_checksum(struct i40e_vsi *vsi,
* so the total length of IPv4 header is IHL*4 bytes
* The UDP_0 bit *may* bet set if the *inner* header is UDP
*/
- if (ipv4_tunnel &&
- (decoded.inner_prot != I40E_RX_PTYPE_INNER_PROT_UDP) &&
- !(rx_status & (1 << I40E_RX_DESC_STATUS_UDP_0_SHIFT))) {
+ if (ipv4_tunnel) {
skb->transport_header = skb->mac_header +
sizeof(struct ethhdr) +
(ip_hdr(skb)->ihl * 4);
@@ -1337,15 +1335,19 @@ static inline void i40e_rx_checksum(struct i40e_vsi *vsi,
skb->protocol == htons(ETH_P_8021AD))
? VLAN_HLEN : 0;
- rx_udp_csum = udp_csum(skb);
- iph = ip_hdr(skb);
- csum = csum_tcpudp_magic(
- iph->saddr, iph->daddr,
- (skb->len - skb_transport_offset(skb)),
- IPPROTO_UDP, rx_udp_csum);
+ if ((ip_hdr(skb)->protocol == IPPROTO_UDP) &&
+ (udp_hdr(skb)->check != 0)) {
+ rx_udp_csum = udp_csum(skb);
+ iph = ip_hdr(skb);
+ csum = csum_tcpudp_magic(
+ iph->saddr, iph->daddr,
+ (skb->len - skb_transport_offset(skb)),
+ IPPROTO_UDP, rx_udp_csum);
- if (udp_hdr(skb)->check != csum)
- goto checksum_fail;
+ if (udp_hdr(skb)->check != csum)
+ goto checksum_fail;
+
+ } /* else its GRE and so no outer UDP header */
}
skb->ip_summed = CHECKSUM_UNNECESSARY;
--
1.9.3
* [net v2 1/3] i40e: fix un-necessary Tx hangs
From: Jeff Kirsher @ 2015-01-07 1:31 UTC (permalink / raw)
To: davem; +Cc: Jesse Brandeburg, netdev, nhorman, sassmann, jogreene,
Jeff Kirsher
In-Reply-To: <1420594317-6191-1-git-send-email-jeffrey.t.kirsher@intel.com>
From: Jesse Brandeburg <jesse.brandeburg@intel.com>
When the driver is polling with interrupts disabled, the hardware
will occasionally not write back descriptors. This patch causes
the driver to detect this situation and force an interrupt to
fire, which flushes the stuck descriptor. This does not conflict
with NAPI because, if we are already polling, the napi_schedule is
ignored. Additionally, the extra interrupts are rate limited, so
they do not burden the CPU.
Change-ID: Iba4616d2a71288672a5f08e4512e2704b97335e8
Signed-off-by: Jesse Brandeburg <jesse.brandeburg@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
---
v2: fixed a bug where the interrupt rate impacted 4 port workloads by
reducing throughput.
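Illustration only, not part of the patch: a userspace sketch of the
condition used to decide whether a write-back should be forced. It mirrors
the test added to i40e_clean_tx_irq() under the assumption that the four
inputs are passed in directly; the function and parameter names are
hypothetical.

```c
#include <stdbool.h>
#include <stdint.h>

#define WB_STRIDE 0x3	/* descriptors are written back in groups of four */

/* Arm a forced write-back when we are polling (budget != 0), the clean
 * index is not on the last slot of a four-descriptor group, the VSI is
 * not going down, and the unused count differs from the ring size.
 */
static bool sketch_should_arm_wb(unsigned int budget, uint16_t clean_idx,
				 bool vsi_down, unsigned int unused,
				 unsigned int ring_count)
{
	return budget != 0 &&
	       (clean_idx & WB_STRIDE) != WB_STRIDE &&
	       !vsi_down &&
	       unused != ring_count;
}
```

In the patch the per-ring result is OR-ed across all Tx rings of the vector,
and the software interrupt is only triggered when polling did not complete.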
drivers/net/ethernet/intel/i40e/i40e_txrx.c | 56 ++++++++++++++++++++++++-----
drivers/net/ethernet/intel/i40e/i40e_txrx.h | 1 +
2 files changed, 49 insertions(+), 8 deletions(-)
diff --git a/drivers/net/ethernet/intel/i40e/i40e_txrx.c b/drivers/net/ethernet/intel/i40e/i40e_txrx.c
index 04b4414..f145aaf 100644
--- a/drivers/net/ethernet/intel/i40e/i40e_txrx.c
+++ b/drivers/net/ethernet/intel/i40e/i40e_txrx.c
@@ -658,6 +658,8 @@ static inline u32 i40e_get_head(struct i40e_ring *tx_ring)
return le32_to_cpu(*(volatile __le32 *)head);
}
+#define WB_STRIDE 0x3
+
/**
* i40e_clean_tx_irq - Reclaim resources after transmit completes
* @tx_ring: tx ring to clean
@@ -759,6 +761,18 @@ static bool i40e_clean_tx_irq(struct i40e_ring *tx_ring, int budget)
tx_ring->q_vector->tx.total_bytes += total_bytes;
tx_ring->q_vector->tx.total_packets += total_packets;
+ /* check to see if there are any non-cache aligned descriptors
+ * waiting to be written back, and kick the hardware to force
+ * them to be written back in case of napi polling
+ */
+ if (budget &&
+ !((i & WB_STRIDE) == WB_STRIDE) &&
+ !test_bit(__I40E_DOWN, &tx_ring->vsi->state) &&
+ (I40E_DESC_UNUSED(tx_ring) != tx_ring->count))
+ tx_ring->arm_wb = true;
+ else
+ tx_ring->arm_wb = false;
+
if (check_for_tx_hang(tx_ring) && i40e_check_tx_hang(tx_ring)) {
/* schedule immediate reset if we believe we hung */
dev_info(tx_ring->dev, "Detected Tx Unit Hang\n"
@@ -777,13 +791,16 @@ static bool i40e_clean_tx_irq(struct i40e_ring *tx_ring, int budget)
netif_stop_subqueue(tx_ring->netdev, tx_ring->queue_index);
dev_info(tx_ring->dev,
- "tx hang detected on queue %d, resetting adapter\n",
+ "tx hang detected on queue %d, reset requested\n",
tx_ring->queue_index);
- tx_ring->netdev->netdev_ops->ndo_tx_timeout(tx_ring->netdev);
+ /* do not fire the reset immediately, wait for the stack to
+ * decide we are truly stuck, also prevents every queue from
+ * simultaneously requesting a reset
+ */
- /* the adapter is about to reset, no point in enabling stuff */
- return true;
+ /* the adapter is about to reset, no point in enabling polling */
+ budget = 1;
}
netdev_tx_completed_queue(netdev_get_tx_queue(tx_ring->netdev,
@@ -806,7 +823,25 @@ static bool i40e_clean_tx_irq(struct i40e_ring *tx_ring, int budget)
}
}
- return budget > 0;
+ return !!budget;
+}
+
+/**
+ * i40e_force_wb - Arm hardware to do a wb on noncache aligned descriptors
+ * @vsi: the VSI we care about
+ * @q_vector: the vector on which to force writeback
+ *
+ **/
+static void i40e_force_wb(struct i40e_vsi *vsi, struct i40e_q_vector *q_vector)
+{
+ u32 val = I40E_PFINT_DYN_CTLN_INTENA_MASK |
+ I40E_PFINT_DYN_CTLN_SWINT_TRIG_MASK |
+ I40E_PFINT_DYN_CTLN_SW_ITR_INDX_ENA_MASK
+ /* allow 00 to be written to the index */;
+
+ wr32(&vsi->back->hw,
+ I40E_PFINT_DYN_CTLN(q_vector->v_idx + vsi->base_vector - 1),
+ val);
}
/**
@@ -1581,6 +1616,7 @@ int i40e_napi_poll(struct napi_struct *napi, int budget)
struct i40e_vsi *vsi = q_vector->vsi;
struct i40e_ring *ring;
bool clean_complete = true;
+ bool arm_wb = false;
int budget_per_ring;
if (test_bit(__I40E_DOWN, &vsi->state)) {
@@ -1591,8 +1627,10 @@ int i40e_napi_poll(struct napi_struct *napi, int budget)
/* Since the actual Tx work is minimal, we can give the Tx a larger
* budget and be more aggressive about cleaning up the Tx descriptors.
*/
- i40e_for_each_ring(ring, q_vector->tx)
+ i40e_for_each_ring(ring, q_vector->tx) {
clean_complete &= i40e_clean_tx_irq(ring, vsi->work_limit);
+ arm_wb |= ring->arm_wb;
+ }
/* We attempt to distribute budget to each Rx queue fairly, but don't
* allow the budget to go below 1 because that would exit polling early.
@@ -1603,8 +1641,11 @@ int i40e_napi_poll(struct napi_struct *napi, int budget)
clean_complete &= i40e_clean_rx_irq(ring, budget_per_ring);
/* If work not completed, return budget and polling will return */
- if (!clean_complete)
+ if (!clean_complete) {
+ if (arm_wb)
+ i40e_force_wb(vsi, q_vector);
return budget;
+ }
/* Work is done so exit the polling mode and re-enable the interrupt */
napi_complete(napi);
@@ -2198,7 +2239,6 @@ static void i40e_tx_map(struct i40e_ring *tx_ring, struct sk_buff *skb,
/* Place RS bit on last descriptor of any packet that spans across the
* 4th descriptor (WB_STRIDE aka 0x3) in a 64B cacheline.
*/
-#define WB_STRIDE 0x3
if (((i & WB_STRIDE) != WB_STRIDE) &&
(first <= &tx_ring->tx_bi[i]) &&
(first >= &tx_ring->tx_bi[i & ~WB_STRIDE])) {
diff --git a/drivers/net/ethernet/intel/i40e/i40e_txrx.h b/drivers/net/ethernet/intel/i40e/i40e_txrx.h
index e60d3ac..18b0023 100644
--- a/drivers/net/ethernet/intel/i40e/i40e_txrx.h
+++ b/drivers/net/ethernet/intel/i40e/i40e_txrx.h
@@ -241,6 +241,7 @@ struct i40e_ring {
unsigned long last_rx_timestamp;
bool ring_active; /* is ring online or not */
+ bool arm_wb; /* do something to arm write back */
/* stats structs */
struct i40e_queue_stats stats;
--
1.9.3
^ permalink raw reply related
* [net v2 0/3][pull request] Intel Wired LAN Driver Updates 2015-01-06
From: Jeff Kirsher @ 2015-01-07 1:31 UTC (permalink / raw)
To: davem; +Cc: Jeff Kirsher, netdev, nhorman, sassmann, jogreene
This series contains fixes to i40e only.
Jesse provides a fix for a case where, while the driver was polling with
interrupts disabled, the hardware would occasionally not write back
descriptors. His fix causes the driver to detect this situation and force
an interrupt to fire, which flushes the stuck descriptor.
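As a rough illustration of the detection step, the check can be modelled as a
small predicate. This is only a simplified sketch under stated assumptions:
should_arm_wb() and its parameters are hypothetical names, not the driver's
API, and the real condition in i40e_clean_tx_irq() works on the ring
structure directly.

```c
#include <stdbool.h>
#include <stdint.h>

/* Four descriptors share one 64-byte cache line, so the low two bits of
 * the ring index give the slot within the line (0x3 is the last slot).
 */
#define WB_STRIDE 0x3u

/* Hypothetical helper (not the driver's function): decide whether the
 * poll loop should ask the hardware to flush a partially filled line.
 */
static bool should_arm_wb(int budget_left, uint32_t next_to_clean,
			  bool ring_down, uint32_t descs_pending)
{
	/* True when cleaning stopped on the last slot of a cache line */
	bool line_end = (next_to_clean & WB_STRIDE) == WB_STRIDE;

	return budget_left != 0 && !line_end && !ring_down &&
	       descs_pending != 0;
}
```

A caller would OR the result across all Tx rings of a vector and trigger a
software interrupt only when polling is about to continue.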
Anjali provides a couple of fixes, the first corrects an issue where
the receive port checksum error counter was incrementing incorrectly with
UDP encapsulated tunneled traffic. The second fix resolves an issue where
the driver was examining the outer protocol layer to set the inner protocol
layer checksum offload. In the case of TCP over IPv6 over an IPv4-based
VXLAN, the inner checksum offloads would be set to look for IPv4/UDP
instead of IPv6/TCP. The fix makes the driver look at the proper layer
for encapsulation offload settings.
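The layer selection described above can be sketched as follows. All types and
names here (pkt_info, hdr_info, csum_offload_target) are hypothetical and only
model the idea; they are not taken from the i40e driver.

```c
#include <stdbool.h>

enum l3_proto { L3_IPV4, L3_IPV6 };
enum l4_proto { L4_TCP, L4_UDP };

/* Protocols found at one header layer */
struct hdr_info {
	enum l3_proto l3;
	enum l4_proto l4;
};

/* Hypothetical summary of a parsed packet */
struct pkt_info {
	bool encapsulated;
	struct hdr_info outer;
	struct hdr_info inner;	/* meaningful only when encapsulated */
};

/* Pick the layer whose protocols should drive the checksum offload:
 * the inner headers for tunneled traffic, the outer ones otherwise.
 */
static struct hdr_info csum_offload_target(const struct pkt_info *p)
{
	return p->encapsulated ? p->inner : p->outer;
}
```

For TCP over IPv6 inside an IPv4/UDP VXLAN tunnel, this returns IPv6/TCP
rather than the outer IPv4/UDP.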
v2: fixed a bug in patch 01 of the series, where the interrupt rate impacted
4 port workloads by reducing throughput.
The following are changes since commit 2abad79afa700e837cb4feed170141292e0720c0:
qla3xxx: don't allow never end busy loop
and are available in the git repository at:
git://git.kernel.org/pub/scm/linux/kernel/git/jkirsher/net master
Anjali Singhai (2):
i40e: Fix Rx checksum error counter
i40e: Fix bug with TCP over IPv6 over VXLAN
Jesse Brandeburg (1):
i40e: fix un-necessary Tx hangs
drivers/net/ethernet/intel/i40e/i40e_txrx.c | 104 +++++++++++++++++++---------
drivers/net/ethernet/intel/i40e/i40e_txrx.h | 1 +
2 files changed, 73 insertions(+), 32 deletions(-)
--
1.9.3
^ permalink raw reply
* RE: TCP connection issues against Amazon S3
From: Lukas Tribus @ 2015-01-07 1:23 UTC (permalink / raw)
To: Erik Grinaker, Eric Dumazet
Cc: Yuchung Cheng, linux-kernel@vger.kernel.org, netdev
In-Reply-To: <3F608393-E5F1-4647-81BF-C6C740100934@bengler.no>
> This still doesn’t explain why it works with older kernels, but not newer ones.
Can you try the different 3.12-rc kernels? Knowing whether this was
introduced in 3.12-rc1, as opposed to a specific later -rc release, may help
the guys here to pinpoint what exactly caused the behavior change on the
receiver side.
v3.12-rc1 to -rc7 are available as prebuilt packages in the Ubuntu mainline
kernel archive [1] as well.
-Lukas
[1] http://kernel.ubuntu.com/~kernel-ppa/mainline/
^ permalink raw reply
* Re: [net-next PATCH v1 01/11] net: flow_table: create interface for hw match/action tables
From: Alexei Starovoitov @ 2015-01-07 1:14 UTC (permalink / raw)
To: John Fastabend
Cc: Thomas Graf, Scott Feldman, Jiří Pírko,
Jamal Hadi Salim, simon.horman, netdev@vger.kernel.org,
David S. Miller, Andy Gospodarek
On Wed, Dec 31, 2014 at 11:45 AM, John Fastabend
<john.fastabend@gmail.com> wrote:
> + * [NET_FLOW_TABLE_IDENTIFIER_TYPE]
> + * [NET_FLOW_TABLE_IDENTIFIER]
> + * [NET_FLOW_TABLE_TABLES]
> + * [NET_FLOW_TABLE]
> + * [NET_FLOW_TABLE_ATTR_NAME]
> + * [NET_FLOW_TABLE_ATTR_UID]
> + * [NET_FLOW_TABLE_ATTR_SOURCE]
> + * [NET_FLOW_TABLE_ATTR_SIZE]
...
> + * Header definitions used to define headers with user friendly
> + * names.
> + *
> + * [NET_FLOW_TABLE_HEADERS]
> + * [NET_FLOW_HEADER]
> + * [NET_FLOW_HEADER_ATTR_NAME]
> + * [NET_FLOW_HEADER_ATTR_UID]
> + * [NET_FLOW_HEADER_ATTR_FIELDS]
> + * [NET_FLOW_HEADER_ATTR_FIELD]
> + * [NET_FLOW_FIELD_ATTR_NAME]
> + * [NET_FLOW_FIELD_ATTR_UID]
> + * [NET_FLOW_FIELD_ATTR_BITWIDTH]
> + * [NET_FLOW_HEADER_ATTR_FIELD]
> + * [...]
> + * [...]
> + * Action definitions supported by tables
> + *
> + * [NET_FLOW_TABLE_ACTIONS]
> + * [NET_FLOW_TABLE_ATTR_ACTIONS]
> + * [NET_FLOW_ACTION]
> + * [NET_FLOW_ACTION_ATTR_NAME]
> + * [NET_FLOW_ACTION_ATTR_UID]
> + * [NET_FLOW_ACTION_ATTR_SIGNATURE]
> + * [NET_FLOW_ACTION_ARG]
..
> + * Get Table Graph <Reply> description
> + *
> + * [NET_FLOW_TABLE_TABLE_GRAPH]
> + * [TABLE_GRAPH_NODE]
> + * [TABLE_GRAPH_NODE_UID]
> + * [TABLE_GRAPH_NODE_JUMP]
I think NET_FLOW prefix everywhere is too verbose.
Especially since you've missed it in the above 3.
and in patch 2 it is:
NET_FLOW_FLOW
which is kinda awkward.
Can you abbreviate it to NFL_ or something else ?
I couldn't find get_headers() and get_header_graph()
implementation on rocker side ?
Could you describe what put_header_graph() will look like?
When it comes to parsing I'm assuming that hw will fall
into N categories:
- that has get_headers() and get_header_graph() only
which would mean fixed parser
- above plus put_header_graph() which will allow to
rearrange some fixed sized headers ?
- above plus put_header() ?
I'm having a hard time envisioning what that would
look like.
- ... ?
also can we change a name from add_flow
to add_entry or add_rule ?
I think 'rule' fits better, since rule = field_ref+action
and one real TCP flow may need multiple rules
inserted into table, right?
The whole thing can still be called 'flow API'...
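To make the "rule = field_ref + action" reading concrete, here is a minimal
sketch. The struct layouts and rule_matches() are assumptions made up for
illustration; they are not the structures from the patch series.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical match on one header field */
struct field_ref {
	uint32_t header_uid;
	uint32_t field_uid;
	uint64_t value;
	uint64_t mask;
};

/* Hypothetical action applied when the match hits */
struct action {
	uint32_t action_uid;
	uint64_t arg;
};

/* One table entry: a match paired with an action */
struct rule {
	struct field_ref match;
	struct action act;
};

/* Masked comparison of one extracted field against a rule */
static bool rule_matches(const struct rule *r, uint32_t header_uid,
			 uint32_t field_uid, uint64_t field_value)
{
	return r->match.header_uid == header_uid &&
	       r->match.field_uid == field_uid &&
	       (field_value & r->match.mask) ==
	       (r->match.value & r->match.mask);
}
```

Under this model a single TCP flow would typically need several such entries,
for example one per matched field or per table it traverses.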
will there be a put_table_graph() ?
probably not, right? since as soon as HW supports
'goto' action, the meaning of table_graph is lost and
it's actually just a set of disconnected tables and the
way to jump from one into another is through 'goto'.
I think OVS guys are quiet, since they're skeptical
that headers+header_graph approach can work?
Would be great if they can share the experience...
^ permalink raw reply
* Re: [PATCH iproute2 -next] ip: route: add congestion control metric
From: Stephen Hemminger @ 2015-01-07 1:09 UTC (permalink / raw)
To: Daniel Borkmann; +Cc: fw, netdev
In-Reply-To: <1420588357-17665-1-git-send-email-dborkman@redhat.com>
On Wed, 7 Jan 2015 00:52:37 +0100
Daniel Borkmann <dborkman@redhat.com> wrote:
> + } else if (matches(*argv, "congctl") == 0) {
> + char cc[16];
> + NEXT_ARG();
> + memset(cc, 0, sizeof(cc));
> + if (strcmp(*argv, "lock") == 0) {
> + mxlock |= (1<<RTAX_CC_ALGO);
Unneeded paren
> + NEXT_ARG();
> + }
> + strncpy(cc, *argv, sizeof(cc) - 1);
> + if (strlen(cc) == 0)
> + invarg("\"congctl\" value must be an algorithm name\n", *argv
Silently truncating the string is odd. Can't we just let the kernel impose
length restrictions?
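For comparison, one userspace alternative to silent truncation is to reject
over-long names outright. This is only a sketch of that option, not what the
patch does and not the kernel-side check suggested above; copy_cc_name() is a
hypothetical helper.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: copy an algorithm name into dst only if it fits,
 * including the terminating NUL. Returns 0 on success, -1 if the name
 * is empty or too long (dst is left untouched in that case).
 */
static int copy_cc_name(char *dst, size_t dst_size, const char *src)
{
	size_t len = strlen(src);

	if (len == 0 || len >= dst_size)
		return -1;

	memcpy(dst, src, len + 1);
	return 0;
}
```

With a 16-byte buffer, "dctcp" is accepted while a 16-character name is
refused instead of being cut to 15 characters.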
^ permalink raw reply
* Re: Does the ordering of the fib_table_dump or /proc/net/fib_trie matter?
From: Alexander Duyck @ 2015-01-07 0:58 UTC (permalink / raw)
To: David Miller; +Cc: stephen, netdev
In-Reply-To: <20150106.165822.1294578064447416624.davem@davemloft.net>
On 01/06/2015 01:58 PM, David Miller wrote:
> From: Alexander Duyck <alexander.duyck@gmail.com>
> Date: Tue, 06 Jan 2015 12:30:06 -0800
>
>> The question I have is if that would screw up any user-space apps. I
>> know ip route can dump the list via "ip route show". I'm just wondering
>> if there would be any problem with default being the last entry instead
>> of the first entry?
> The ordering already changed once when we went from fib_hash to
> fib_trie, nobody should depend upon the ordering.
Okay good to hear. I kind of thought that was the case, but I wanted to
make sure before I went too far down this rabbit hole.
Thanks.
- Alex
^ permalink raw reply
* Re: [PATCH v8 34/50] vhost/net: virtio 1.0 byte swap
From: Alex Williamson @ 2015-01-06 23:55 UTC (permalink / raw)
To: Michael S. Tsirkin
Cc: linux-kernel, David Miller, cornelia.huck, rusty, nab, pbonzini,
thuth, dahi, kvm, virtualization, netdev
In-Reply-To: <1417449619-24896-35-git-send-email-mst@redhat.com>
On Mon, 2014-12-01 at 18:05 +0200, Michael S. Tsirkin wrote:
> I had to add an explicit tag to suppress compiler warning:
> gcc isn't smart enough to notice that
> len is always initialized since function is called with size > 0.
I'm getting a panic inside a guest when this change is applied on the
host. I identified this patch via bisect and confirmed by reverting it
from v3.19-rc2. Guest is centos6. Thanks,
Alex
commit 8b38694a2dc8b18374310df50174f1e4376d6824
Author: Michael S. Tsirkin <mst@redhat.com>
Date: Fri Oct 24 14:19:48 2014 +0300
vhost/net: virtio 1.0 byte swap
I had to add an explicit tag to suppress compiler warning:
gcc isn't smart enough to notice that
len is always initialized since function is called with size > 0.
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
Reviewed-by: Cornelia Huck <cornelia.huck@de.ibm.com>
XML chunk:
<interface type='direct'>
<mac address='52:54:00:64:f3:34'/>
<source dev='iscsinet0' mode='bridge'/>
<model type='virtio'/>
<address type='pci' domain='0x0000' bus='0x00' slot='0x03' function='0x0'/>
</interface>
Panic log:
<1>BUG: unable to handle kernel NULL pointer dereference at 0000000000000010
<1>IP: [<ffffffffa0079469>] virtnet_poll+0x4f9/0x910 [virtio_net]
<4>PGD 1aa2f4067 PUD 1aa2f5067 PMD 0
<4>Oops: 0000 [#1] SMP
<4>last sysfs file: /sys/devices/pci0000:00/0000:00:03.0/virtio0/net/eth9/ifindex
<4>CPU 0
<4>Modules linked in: 8021q garp stp llc ipt_REJECT nf_conntrack_ipv4 nf_defrag_ipv4 iptable_filter ip_tables ip6t_REJECT nf_conntrack_ipv6 nf_defrag_ipv6 xt_state nf_conntrack ip6table_filter ip6_tables ipv6 uinput microcode snd_hda_codec_realtek snd_hda_intel snd_hda_codec snd_hwdep snd_seq snd_seq_device snd_pcm snd_timer snd soundcore snd_page_alloc igbvf nvidia(P)(U) i2c_core tg3 ptp pps_core virtio_balloon virtio_net virtio_console ext4 jbd2 mbcache virtio_blk virtio_pci virtio_ring virtio pata_acpi ata_generic ata_piix dm_mirror dm_region_hash dm_log dm_mod [last unloaded: speedstep_lib]
<4>
<4>Pid: 1374, comm: NetworkManager Tainted: P --------------- 2.6.32-431.23.3.el6.centos.plus.x86_64 #1 QEMU Standard PC (i440FX + PIIX, 1996)
<4>RIP: 0010:[<ffffffffa0079469>] [<ffffffffa0079469>] virtnet_poll+0x4f9/0x910 [virtio_net]
<4>RSP: 0018:ffff880028203e48 EFLAGS: 00010246
<4>RAX: ffff8801a3383d00 RBX: ffff8801a6aaf480 RCX: ffff8801aa20b6e0
<4>RDX: 00000000000000c0 RSI: ffff8801a3383c00 RDI: ffff8801a3383cc0
<4>RBP: ffff880028203ed8 R08: 000000000000009e R09: ffff8801aa1d800c
<4>R10: 0000000000000218 R11: 0000000000000000 R12: ffff8801aa20b6e0
<4>R13: 0000000000000000 R14: 0000000000000000 R15: 0000000000000000
<4>FS: 00007febf114d800(0000) GS:ffff880028200000(0000) knlGS:0000000000000000
<4>CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
<4>CR2: 0000000000000010 CR3: 00000001aa793000 CR4: 00000000000006f0
<4>DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
<4>DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
<4>Process NetworkManager (pid: 1374, threadinfo ffff8801a74ba000, task ffff8801a8d56040)
<4>Stack:
<4> ffff8801aa1d8000 000000000000009e ffff8801aa20b6e0 ffff8801aa20b718
<4><d> ffff8801aa20b780 ffff8801aa1d800c ffff8801a6aaf4b8 ffff8801aa20b020
<4><d> 0000000000000080 ffff8801aa20b708 0000000000000001 00001f5981a830c8
<4>Call Trace:
<4> <IRQ>
<4> [<ffffffff8146ae33>] net_rx_action+0x103/0x2f0
<4> [<ffffffff8107a5f1>] __do_softirq+0xc1/0x1e0
<4> [<ffffffff8100c30c>] ? call_softirq+0x1c/0x30
<4> [<ffffffff8100c30c>] call_softirq+0x1c/0x30
<4> <EOI>
<4> [<ffffffff8100fa75>] ? do_softirq+0x65/0xa0
<4> [<ffffffff8107b2ea>] local_bh_enable+0x9a/0xb0
<4> [<ffffffffa007813a>] virtnet_napi_enable+0x4a/0x60 [virtio_net]
<4> [<ffffffffa0078ebf>] virtnet_open+0x4f/0x60 [virtio_net]
<4> [<ffffffff81467691>] dev_open+0xa1/0x100
<4> [<ffffffff81466751>] dev_change_flags+0xa1/0x1d0
<4> [<ffffffff81474a59>] do_setlink+0x169/0x8b0
<4> [<ffffffff814770b6>] ? rtnl_fill_ifinfo+0x946/0xcb0
<4> [<ffffffff812a3d24>] ? nla_parse+0x34/0x110
<4> [<ffffffff8147659e>] rtnl_setlink+0xee/0x130
<4> [<ffffffff81475b67>] rtnetlink_rcv_msg+0x2d7/0x340
<4> [<ffffffff81231e14>] ? socket_has_perm+0x74/0x90
<4> [<ffffffff81475890>] ? rtnetlink_rcv_msg+0x0/0x340
<4> [<ffffffff814910a9>] netlink_rcv_skb+0xa9/0xd0
<4> [<ffffffff81475875>] rtnetlink_rcv+0x25/0x40
<4> [<ffffffff81490cdb>] netlink_unicast+0x2db/0x320
<4> [<ffffffff81491750>] netlink_sendmsg+0x2c0/0x3d0
<4> [<ffffffff814520c3>] sock_sendmsg+0x123/0x150
<4> [<ffffffff81453d73>] ? sock_recvmsg+0x133/0x160
<4> [<ffffffff8109afa0>] ? autoremove_wake_function+0x0/0x40
<4> [<ffffffff81136941>] ? lru_cache_add_lru+0x21/0x40
<4> [<ffffffff8115522d>] ? page_add_new_anon_rmap+0x9d/0xf0
<4> [<ffffffff8114aeef>] ? handle_pte_fault+0x4af/0xb00
<4> [<ffffffff81451f14>] ? move_addr_to_kernel+0x64/0x70
<4> [<ffffffff814538b6>] __sys_sendmsg+0x406/0x420
<4> [<ffffffff8104a98c>] ? __do_page_fault+0x1ec/0x480
<4> [<ffffffff814523d9>] ? sys_sendto+0x139/0x190
<4> [<ffffffff8103ea6c>] ? kvm_clock_read+0x1c/0x20
<4> [<ffffffff81453ad9>] sys_sendmsg+0x49/0x90
<4> [<ffffffff8100b072>] system_call_fastpath+0x16/0x1b
<4>Code: 83 e0 00 00 00 00 10 00 00 48 03 93 d0 00 00 00 66 83 42 04 01 8b 93 cc 00 00 00 48 8b b3 d0 00 00 00 80 4c 16 10 20 44 2b 68 0c <4d> 8b 76 10 75 89 e9 d1 fd ff ff 0f 1f 40 00 a8 02 74 0d 0f b6
<1>RIP [<ffffffffa0079469>] virtnet_poll+0x4f9/0x910 [virtio_net]
<4> RSP <ffff880028203e48>
<4>CR2: 0000000000000010
^ permalink raw reply
* [PATCH iproute2 -next] ip: route: add congestion control metric
From: Daniel Borkmann @ 2015-01-06 23:52 UTC (permalink / raw)
To: stephen; +Cc: fw, netdev
This patch adds configuration and dumping of congestion control metric
for ip route, for example:
ip route add <dst> dev foo congctl [lock] dctcp
Reference: http://thread.gmane.org/gmane.linux.network/344733
Signed-off-by: Daniel Borkmann <dborkman@redhat.com>
---
include/linux/rtnetlink.h | 2 ++
ip/iproute.c | 24 +++++++++++++++++++++---
man/man8/ip-route.8.in | 19 ++++++++++++++++++-
3 files changed, 41 insertions(+), 4 deletions(-)
diff --git a/include/linux/rtnetlink.h b/include/linux/rtnetlink.h
index 9aa5c2f..ac4af97 100644
--- a/include/linux/rtnetlink.h
+++ b/include/linux/rtnetlink.h
@@ -389,6 +389,8 @@ enum {
#define RTAX_INITRWND RTAX_INITRWND
RTAX_QUICKACK,
#define RTAX_QUICKACK RTAX_QUICKACK
+ RTAX_CC_ALGO,
+#define RTAX_CC_ALGO RTAX_CC_ALGO
__RTAX_MAX
};
diff --git a/ip/iproute.c b/ip/iproute.c
index 5a496a9..705d4b5 100644
--- a/ip/iproute.c
+++ b/ip/iproute.c
@@ -53,6 +53,7 @@ static const char *mx_names[RTAX_MAX+1] = {
[RTAX_RTO_MIN] = "rto_min",
[RTAX_INITRWND] = "initrwnd",
[RTAX_QUICKACK] = "quickack",
+ [RTAX_CC_ALGO] = "congctl",
};
static void usage(void) __attribute__((noreturn));
@@ -80,8 +81,7 @@ static void usage(void)
fprintf(stderr, " [ window NUMBER] [ cwnd NUMBER ] [ initcwnd NUMBER ]\n");
fprintf(stderr, " [ ssthresh NUMBER ] [ realms REALM ] [ src ADDRESS ]\n");
fprintf(stderr, " [ rto_min TIME ] [ hoplimit NUMBER ] [ initrwnd NUMBER ]\n");
- fprintf(stderr, " [ features FEATURES ]\n");
- fprintf(stderr, " [ quickack BOOL ]\n");
+ fprintf(stderr, " [ features FEATURES ] [ quickack BOOL ] [ congctl NAME ]\n");
fprintf(stderr, "TYPE := [ unicast | local | broadcast | multicast | throw |\n");
fprintf(stderr, " unreachable | prohibit | blackhole | nat ]\n");
fprintf(stderr, "TABLE_ID := [ local | main | default | all | NUMBER ]\n");
@@ -545,10 +545,12 @@ int print_route(const struct sockaddr_nl *who, struct nlmsghdr *n, void *arg)
fprintf(fp, " %s", mx_names[i]);
else
fprintf(fp, " metric %d", i);
+
if (mxlock & (1<<i))
fprintf(fp, " lock");
+ if (i != RTAX_CC_ALGO)
+ val = *(unsigned*)RTA_DATA(mxrta[i]);
- val = *(unsigned*)RTA_DATA(mxrta[i]);
switch (i) {
case RTAX_FEATURES:
print_rtax_features(fp, val);
@@ -573,6 +575,10 @@ int print_route(const struct sockaddr_nl *who, struct nlmsghdr *n, void *arg)
fprintf(fp, " %gs", val/1e3);
else
fprintf(fp, " %ums", val);
+ break;
+ case RTAX_CC_ALGO:
+ fprintf(fp, " %s", (char *)RTA_DATA(mxrta[i]));
+ break;
}
}
}
@@ -925,6 +931,18 @@ static int iproute_modify(int cmd, unsigned flags, int argc, char **argv)
if (quickack != 1 && quickack != 0)
invarg("\"quickack\" value should be 0 or 1\n", *argv);
rta_addattr32(mxrta, sizeof(mxbuf), RTAX_QUICKACK, quickack);
+ } else if (matches(*argv, "congctl") == 0) {
+ char cc[16];
+ NEXT_ARG();
+ memset(cc, 0, sizeof(cc));
+ if (strcmp(*argv, "lock") == 0) {
+ mxlock |= (1<<RTAX_CC_ALGO);
+ NEXT_ARG();
+ }
+ strncpy(cc, *argv, sizeof(cc) - 1);
+ if (strlen(cc) == 0)
+ invarg("\"congctl\" value must be an algorithm name\n", *argv);
+ rta_addattr_l(mxrta, sizeof(mxbuf), RTAX_CC_ALGO, cc, strlen(cc));
} else if (matches(*argv, "rttvar") == 0) {
unsigned win;
NEXT_ARG();
diff --git a/man/man8/ip-route.8.in b/man/man8/ip-route.8.in
index 89960c1..9d32e2d 100644
--- a/man/man8/ip-route.8.in
+++ b/man/man8/ip-route.8.in
@@ -116,7 +116,9 @@ replace " } "
.B features
.IR FEATURES " ] [ "
.B quickack
-.IR BOOL " ]"
+.IR BOOL " ] [ "
+.B congctl
+.IR NAME " ]"
.ti -8
.IR TYPE " := [ "
@@ -433,6 +435,21 @@ sysctl is set to 0.
Enable or disable quick ack for connections to this destination.
.TP
+.BI congctl " NAME " "(3.20+ only)"
+.TP
+.BI "congctl lock" " NAME " "(3.20+ only)"
+Sets a specific TCP congestion control algorithm only for a given destination.
+If not specified, Linux keeps the current global default TCP congestion control
+algorithm, or the one set from the application. If the modifier
+.B lock
+is not used, an application may nevertheless overwrite the suggested congestion
+control algorithm for that destination. If the modifier
+.B lock
+is used, then an application is not allowed to overwrite the specified congestion
+control algorithm for that destination, thus it will be enforced/guaranteed to
+use the proposed algorithm.
+
+.TP
.BI advmss " NUMBER " "(2.3.15+ only)"
the MSS ('Maximal Segment Size') to advertise to these
destinations when establishing TCP connections. If it is not given,
--
1.9.0
^ permalink raw reply related
* Re: [PATCH 1/1] update ip-sysctl.txt documentation
From: Ani Sinha @ 2015-01-06 23:50 UTC (permalink / raw)
To: Stephen Hemminger; +Cc: netdev@vger.kernel.org
In-Reply-To: <20150106154824.6f2c9a75@urahara>
On Tue, Jan 6, 2015 at 3:48 PM, Stephen Hemminger
<stephen@networkplumber.org> wrote:
> On Tue, 6 Jan 2015 14:02:29 -0800
> Ani Sinha <ani@arista.com> wrote:
>
>> route/max_size - INTEGER
>> - Maximum number of routes allowed in the kernel. Increase
>> - this when using large numbers of interfaces and/or routes.
>> + Post linux kernel 3.6, this is depricated for ipv4 as route cache is no
>> + longer used. For ipv6, this is used to limit the maximum number of ipv6
>> + routes allowed in the kernel. Increase this when using large
>> numbers of
>> + interfaces and/or routes.
>
> 1. You used mailer with line wrap which broke the patch.
I used git send-email. Not sure what else to use.
>
> 2. The spelling is not correct 'depricated'
Fixed in the latest patch.
^ permalink raw reply
* Re: [PATCH 1/1] update ip-sysctl.txt documentation
From: Stephen Hemminger @ 2015-01-06 23:48 UTC (permalink / raw)
To: Ani Sinha; +Cc: netdev@vger.kernel.org
In-Reply-To: <CAOxq_8OGx9VgSaEimAbNZSWjihNqNBXoVg0m8EPRaNX5jLXZiw@mail.gmail.com>
On Tue, 6 Jan 2015 14:02:29 -0800
Ani Sinha <ani@arista.com> wrote:
> route/max_size - INTEGER
> - Maximum number of routes allowed in the kernel. Increase
> - this when using large numbers of interfaces and/or routes.
> + Post linux kernel 3.6, this is depricated for ipv4 as route cache is no
> + longer used. For ipv6, this is used to limit the maximum number of ipv6
> + routes allowed in the kernel. Increase this when using large
> numbers of
> + interfaces and/or routes.
1. You used mailer with line wrap which broke the patch.
2. The spelling is not correct 'depricated'
^ permalink raw reply
* [Patch net-next] doc: fix the compile error of txtimestamp.c
From: Cong Wang @ 2015-01-06 23:45 UTC (permalink / raw)
To: netdev; +Cc: carlos, vlee, davem, Cong Wang
In-Reply-To: <1420587932-8733-1-git-send-email-xiyou.wangcong@gmail.com>
Vinson reported:
HOSTCC Documentation/networking/timestamping/txtimestamp
Documentation/networking/timestamping/txtimestamp.c:64:8: error:
redefinition of ‘struct in6_pktinfo’
struct in6_pktinfo {
^
In file included from /usr/include/arpa/inet.h:23:0,
from Documentation/networking/timestamping/txtimestamp.c:33:
/usr/include/netinet/in.h:456:8: note: originally defined here
struct in6_pktinfo
^
After we sync with libc header, we don't need this ugly hack any more.
Reported-by: Vinson Lee <vlee@twopensource.com>
Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
---
Documentation/networking/timestamping/txtimestamp.c | 8 --------
1 file changed, 8 deletions(-)
diff --git a/Documentation/networking/timestamping/txtimestamp.c b/Documentation/networking/timestamping/txtimestamp.c
index 876f71c..8778e68 100644
--- a/Documentation/networking/timestamping/txtimestamp.c
+++ b/Documentation/networking/timestamping/txtimestamp.c
@@ -59,14 +59,6 @@
#include <time.h>
#include <unistd.h>
-/* ugly hack to work around netinet/in.h and linux/ipv6.h conflicts */
-#ifndef in6_pktinfo
-struct in6_pktinfo {
- struct in6_addr ipi6_addr;
- int ipi6_ifindex;
-};
-#endif
-
/* command line parameters */
static int cfg_proto = SOCK_STREAM;
static int cfg_ipproto = IPPROTO_TCP;
--
1.8.3.1
^ permalink raw reply related
* [Patch net-next] ipv6: fix redefinition of in6_pktinfo and ip6_mtuinfo
From: Cong Wang @ 2015-01-06 23:45 UTC (permalink / raw)
To: netdev; +Cc: carlos, vlee, davem, Cong Wang
Both netinet/in.h and linux/ipv6.h define these two structs;
if we include both of them, we get:
/usr/include/linux/ipv6.h:19:8: error: redefinition of ‘struct in6_pktinfo’
struct in6_pktinfo {
^
In file included from /usr/include/arpa/inet.h:22:0,
from txtimestamp.c:33:
/usr/include/netinet/in.h:524:8: note: originally defined here
struct in6_pktinfo
^
In file included from txtimestamp.c:40:0:
/usr/include/linux/ipv6.h:24:8: error: redefinition of ‘struct ip6_mtuinfo’
struct ip6_mtuinfo {
^
In file included from /usr/include/arpa/inet.h:22:0,
from txtimestamp.c:33:
/usr/include/netinet/in.h:531:8: note: originally defined here
struct ip6_mtuinfo
^
So similarly to what we did for in6_addr, we need to sync with
libc header on their definitions.
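The guard pattern can be shown in a self-contained form. The DEMO_* macros and
struct demo_pktinfo below are stand-ins invented for this sketch; the real
headers use _NETINET_IN_H and the __UAPI_DEF_* macros shown in the diff.

```c
/* Stand-in for the libc header: sets its include guard and defines
 * the struct first.
 */
#define DEMO_LIBC_IN_H 1
struct demo_pktinfo {
	int ifindex;
};

/* Stand-in for libc-compat.h: tell the kernel header whether it still
 * needs to provide the definition.
 */
#if defined(DEMO_LIBC_IN_H)
#define DEMO_UAPI_DEF_PKTINFO 0
#else
#define DEMO_UAPI_DEF_PKTINFO 1
#endif

/* Stand-in for the kernel UAPI header: define the struct only when
 * libc has not already done so, avoiding the redefinition error.
 */
#if DEMO_UAPI_DEF_PKTINFO
struct demo_pktinfo {
	int ifindex;
};
#endif

/* Report which side ended up providing the definition */
static int demo_uapi_defines_pktinfo(void)
{
	return DEMO_UAPI_DEF_PKTINFO;
}
```

Because the libc stand-in comes first, the second definition is compiled out
and the file builds without a redefinition error.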
Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
---
include/uapi/linux/ipv6.h | 5 ++++-
include/uapi/linux/libc-compat.h | 6 ++++++
2 files changed, 10 insertions(+), 1 deletion(-)
diff --git a/include/uapi/linux/ipv6.h b/include/uapi/linux/ipv6.h
index e863d08..b9b1b7d 100644
--- a/include/uapi/linux/ipv6.h
+++ b/include/uapi/linux/ipv6.h
@@ -15,16 +15,19 @@
* *under construction*
*/
-
+#if __UAPI_DEF_IN6_PKTINFO
struct in6_pktinfo {
struct in6_addr ipi6_addr;
int ipi6_ifindex;
};
+#endif
+#if __UAPI_DEF_IP6_MTUINFO
struct ip6_mtuinfo {
struct sockaddr_in6 ip6m_addr;
__u32 ip6m_mtu;
};
+#endif
struct in6_ifreq {
struct in6_addr ifr6_addr;
diff --git a/include/uapi/linux/libc-compat.h b/include/uapi/linux/libc-compat.h
index e28807a..fa673e9 100644
--- a/include/uapi/linux/libc-compat.h
+++ b/include/uapi/linux/libc-compat.h
@@ -70,6 +70,8 @@
#define __UAPI_DEF_IPV6_MREQ 0
#define __UAPI_DEF_IPPROTO_V6 0
#define __UAPI_DEF_IPV6_OPTIONS 0
+#define __UAPI_DEF_IN6_PKTINFO 0
+#define __UAPI_DEF_IP6_MTUINFO 0
#else
@@ -84,6 +86,8 @@
#define __UAPI_DEF_IPV6_MREQ 1
#define __UAPI_DEF_IPPROTO_V6 1
#define __UAPI_DEF_IPV6_OPTIONS 1
+#define __UAPI_DEF_IN6_PKTINFO 1
+#define __UAPI_DEF_IP6_MTUINFO 1
#endif /* _NETINET_IN_H */
@@ -106,6 +110,8 @@
#define __UAPI_DEF_IPV6_MREQ 1
#define __UAPI_DEF_IPPROTO_V6 1
#define __UAPI_DEF_IPV6_OPTIONS 1
+#define __UAPI_DEF_IN6_PKTINFO 1
+#define __UAPI_DEF_IP6_MTUINFO 1
/* Definitions for xattr.h */
#define __UAPI_DEF_XATTR 1
--
1.8.3.1
^ permalink raw reply related