Netdev List


* [PATCH/RFC iproute2/net-next 0/3] tc: flower: allow control of tree traversal on packet parse errors
From: Simon Horman @ 2017-04-28 12:02 UTC
  To: Stephen Hemminger
  Cc: Jiri Pirko, Jamal Hadi Salim, Cong Wang, Dinan Gunawardena,
	netdev, oss-drivers, Simon Horman

Hi,

this series is intended to allow control of how the tree of qdiscs, classes
and filters is further traversed if an error is encountered when parsing the
packet in order to match the cls_flower filters at a particular prio.

Please see the changelog of the last patch of this series for a more
detailed description.


Simon Horman (3):
  tc: flower: update headers for TCA_FLOWER_KEY_MPLS*
  tc: flower: update headers for TCA_FLOWER_HEADER_PARSE_ERR_ACT
  tc: flower: allow control of tree traversal on packet parse errors

 include/linux/pkt_cls.h |  7 +++++++
 man/man8/tc-flower.8    | 29 +++++++++++++++++++++++++++--
 tc/f_flower.c           | 33 +++++++++++++++++++++++++++++++++
 3 files changed, 67 insertions(+), 2 deletions(-)

-- 
2.12.2.816.g2cccc81164


* [PATCH/RFC net-next 3/4] net/sched: cls_flower: do not match if dissection fails
From: Simon Horman @ 2017-04-28 12:00 UTC
  To: Jiri Pirko, Jamal Hadi Salim, Cong Wang
  Cc: Dinan Gunawardena, netdev, oss-drivers, Simon Horman
In-Reply-To: <20170428120035.15984-1-simon.horman@netronome.com>

If skb_flow_dissect() returns an error it indicates that dissection was
incomplete for some reason. Matching using the result of an incomplete
dissection may cause unexpected results. For example:

* A match on zero layer 4 ports will also match packets truncated at
  the end of the IP header; that is packets where ports are missing are
  treated the same way as packets with zero ports.
* Likewise, a match on zero ICMP code or type will also match packets
  truncated at the end of the IP header; that is packets where the ICMP
  type and code are missing will be treated the same way as packets with
  zero ICMP code and type.

Separate patches to the flow dissector are required in order for it to
return errors in the above cases.
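
The false match described above can be reproduced with a minimal userspace
sketch. Everything named toy_* here is hypothetical and only models the idea:
a key that is zeroed before dissection, a dissector that reports failure on a
short header, and a matcher that either ignores or honours that result. It is
not the flower or flow dissector code.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical, simplified stand-in for the flower key: ports only. */
struct toy_key {
	uint16_t src_port;
	uint16_t dst_port;
};

/* Toy dissector: fills *key from an L4 header of len bytes.
 * Returns false when the header is too short to hold both ports,
 * leaving *key zeroed, as a freshly initialised key would be.
 */
static bool toy_dissect(const uint8_t *l4, size_t len, struct toy_key *key)
{
	memset(key, 0, sizeof(*key));
	if (len < 4)
		return false;
	key->src_port = (uint16_t)(l4[0] << 8 | l4[1]);
	key->dst_port = (uint16_t)(l4[2] << 8 | l4[3]);
	return true;
}

/* Old behaviour: ignore the dissector's return value and match anyway. */
static bool toy_match_unchecked(const uint8_t *l4, size_t len,
				uint16_t want_dst_port)
{
	struct toy_key key;

	toy_dissect(l4, len, &key);
	return key.dst_port == want_dst_port;
}

/* Behaviour with the check added: a failed dissection never matches. */
static bool toy_match_checked(const uint8_t *l4, size_t len,
			      uint16_t want_dst_port)
{
	struct toy_key key;

	if (!toy_dissect(l4, len, &key))
		return false;
	return key.dst_port == want_dst_port;
}
```

With a zero-length L4 header, toy_match_unchecked() reports a match on
destination port 0 while toy_match_checked() does not; a complete header
carrying port 0 matches in both.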

Fixes: 77b9900ef53a ("tc: introduce Flower classifier")
Signed-off-by: Simon Horman <simon.horman@netronome.com>
---
 net/sched/cls_flower.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/net/sched/cls_flower.c b/net/sched/cls_flower.c
index 3ecf07666df3..cc6b3e7cf03b 100644
--- a/net/sched/cls_flower.c
+++ b/net/sched/cls_flower.c
@@ -187,7 +187,8 @@ static int fl_classify(struct sk_buff *skb, const struct tcf_proto *tp,
 	 * so do it rather here.
 	 */
 	skb_key.basic.n_proto = skb->protocol;
-	skb_flow_dissect(skb, &head->dissector, &skb_key, 0);
+	if (!skb_flow_dissect(skb, &head->dissector, &skb_key, 0))
+		return -1;

 	fl_set_masked_key(&skb_mkey, &skb_key, &head->mask);

-- 
2.12.2.816.g2cccc81164


* [PATCH/RFC net-next 4/4] net/sched: cls_flower: allow control of tree traversal on packet parse errors
From: Simon Horman @ 2017-04-28 12:00 UTC
  To: Jiri Pirko, Jamal Hadi Salim, Cong Wang
  Cc: Dinan Gunawardena, netdev, oss-drivers, Simon Horman
In-Reply-To: <20170428120035.15984-1-simon.horman@netronome.com>

Allow control of how the tree of qdiscs, classes and filters is further
traversed if an error is encountered when parsing the packet in order to
match the cls_flower filters at a particular prio.

By default, continue to the next filter; this is the behaviour without this
patch.

A use-case for this is to allow configuration of dropping of packets with
truncated headers.

For example, the following drops IPv4 packets that cannot be parsed by the
flow dissector up to the end of the UDP ports - e.g. because they are
truncated, and instantiates a continue action based on the port for packets
that can be parsed.

 # tc qdisc del dev eth0 ingress; tc qdisc add dev eth0 ingress
 # tc filter add dev eth0 protocol ip parent ffff: flower \
       indev eth0 ip_proto udp dst_port 80 header_parse_err_action drop \
       action continue
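
As a rough illustration of what "continue to the next filter" versus "drop"
means for traversal, the sketch below models a filter chain as an array of
per-filter verdicts. The toy_* names are hypothetical; the three constants
mirror the values of TC_ACT_UNSPEC, TC_ACT_OK and TC_ACT_SHOT in
include/uapi/linux/pkt_cls.h. It is a simplified model under the assumption
that a negative verdict means "no decision, try the next filter", not the
kernel's classification loop.

```c
#include <stddef.h>

/* Mirror the uapi values of TC_ACT_UNSPEC, TC_ACT_OK and TC_ACT_SHOT. */
#define TOY_ACT_UNSPEC	(-1)
#define TOY_ACT_OK	0
#define TOY_ACT_SHOT	2

/* verdicts[i] is what the i-th filter, in prio order, returned for one
 * packet. The first non-negative verdict ends the walk; a negative one
 * lets it continue with the next filter.
 */
static int toy_classify_chain(const int *verdicts, size_t n)
{
	for (size_t i = 0; i < n; i++)
		if (verdicts[i] >= 0)
			return verdicts[i];

	return TOY_ACT_UNSPEC;
}
```

In this model a flower instance configured with header_parse_err_action drop
would contribute TOY_ACT_SHOT for an unparseable packet, ending the walk,
whereas the default contributes TOY_ACT_UNSPEC and later filters still run.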

Signed-off-by: Simon Horman <simon.horman@netronome.com>
---
 include/uapi/linux/pkt_cls.h |  2 ++
 net/sched/cls_flower.c       | 46 ++++++++++++++++++++++++++++++++++----------
 2 files changed, 38 insertions(+), 10 deletions(-)

diff --git a/include/uapi/linux/pkt_cls.h b/include/uapi/linux/pkt_cls.h
index f1129e383b2a..b722a85bcfa1 100644
--- a/include/uapi/linux/pkt_cls.h
+++ b/include/uapi/linux/pkt_cls.h
@@ -437,6 +437,8 @@ enum {
 	TCA_FLOWER_KEY_MPLS_TC,		/* u8 - 3 bits */
 	TCA_FLOWER_KEY_MPLS_LABEL,	/* be32 - 20 bits */
 
+	TCA_FLOWER_HEADER_PARSE_ERR_ACT,
+
 	__TCA_FLOWER_MAX,
 };
 
diff --git a/net/sched/cls_flower.c b/net/sched/cls_flower.c
index cc6b3e7cf03b..4d2d91b0d532 100644
--- a/net/sched/cls_flower.c
+++ b/net/sched/cls_flower.c
@@ -67,13 +67,14 @@ struct cls_fl_head {
 	struct fl_flow_mask mask;
 	struct flow_dissector dissector;
 	u32 hgen;
-	bool mask_assigned;
+	bool assigned;
 	struct list_head filters;
 	struct rhashtable_params ht_params;
 	union {
 		struct work_struct work;
 		struct rcu_head	rcu;
 	};
+	int err_action;
 };
 
 struct cls_fl_filter {
@@ -188,7 +189,7 @@ static int fl_classify(struct sk_buff *skb, const struct tcf_proto *tp,
 	 */
 	skb_key.basic.n_proto = skb->protocol;
 	if (!skb_flow_dissect(skb, &head->dissector, &skb_key, 0))
-		return -1;
+		return head->err_action;
 
 	fl_set_masked_key(&skb_mkey, &skb_key, &head->mask);
 
@@ -317,7 +318,7 @@ static void fl_destroy_sleepable(struct work_struct *work)
 {
 	struct cls_fl_head *head = container_of(work, struct cls_fl_head,
 						work);
-	if (head->mask_assigned)
+	if (head->assigned)
 		rhashtable_destroy(&head->ht);
 	kfree(head);
 	module_put(THIS_MODULE);
@@ -425,6 +426,7 @@ static const struct nla_policy fl_policy[TCA_FLOWER_MAX + 1] = {
 	[TCA_FLOWER_KEY_MPLS_BOS]	= { .type = NLA_U8 },
 	[TCA_FLOWER_KEY_MPLS_TC]	= { .type = NLA_U8 },
 	[TCA_FLOWER_KEY_MPLS_LABEL]	= { .type = NLA_U32 },
+	[TCA_FLOWER_HEADER_PARSE_ERR_ACT] = { .type = NLA_U32 },
 };
 
 static void fl_set_key_val(struct nlattr **tb,
@@ -779,13 +781,15 @@ static void fl_init_dissector(struct cls_fl_head *head,
 	skb_flow_dissector_init(&head->dissector, keys, cnt);
 }
 
-static int fl_check_assign_mask(struct cls_fl_head *head,
-				struct fl_flow_mask *mask)
+static int fl_check_assign_mask_and_err_action(struct cls_fl_head *head,
+					       struct fl_flow_mask *mask,
+					       int err_action)
 {
 	int err;
 
-	if (head->mask_assigned) {
-		if (!fl_mask_eq(&head->mask, mask))
+	if (head->assigned) {
+		if (!fl_mask_eq(&head->mask, mask) ||
+		    head->err_action != err_action)
 			return -EINVAL;
 		else
 			return 0;
@@ -798,7 +802,8 @@ static int fl_check_assign_mask(struct cls_fl_head *head,
 	if (err)
 		return err;
 	memcpy(&head->mask, mask, sizeof(head->mask));
-	head->mask_assigned = true;
+	head->assigned = true;
+	head->err_action = err_action;
 
 	fl_init_dissector(head, mask);
 
@@ -871,7 +876,7 @@ static int fl_change(struct net *net, struct sk_buff *in_skb,
 	struct cls_fl_filter *fnew;
 	struct nlattr **tb;
 	struct fl_flow_mask mask = {};
-	int err;
+	int err, err_action;
 
 	if (!tca[TCA_OPTIONS])
 		return -EINVAL;
@@ -918,11 +923,28 @@ static int fl_change(struct net *net, struct sk_buff *in_skb,
 		}
 	}
 
+	if (tb[TCA_FLOWER_HEADER_PARSE_ERR_ACT]) {
+		err_action = nla_get_u32(tb[TCA_FLOWER_HEADER_PARSE_ERR_ACT]);
+
+		switch (err_action) {
+		case TC_ACT_UNSPEC:
+		case TC_ACT_OK:
+		case TC_ACT_SHOT:
+			break;
+		default:
+			err = -EINVAL;
+			goto errout;
+		}
+
+	} else {
+		err_action = TC_ACT_UNSPEC;
+	}
+
 	err = fl_set_parms(net, tp, fnew, &mask, base, tb, tca[TCA_RATE], ovr);
 	if (err)
 		goto errout;
 
-	err = fl_check_assign_mask(head, &mask);
+	err = fl_check_assign_mask_and_err_action(head, &mask, err_action);
 	if (err)
 		goto errout;
 
@@ -1309,6 +1331,10 @@ static int fl_dump(struct net *net, struct tcf_proto *tp, unsigned long fh,
 	if (f->flags && nla_put_u32(skb, TCA_FLOWER_FLAGS, f->flags))
 		goto nla_put_failure;
 
+	if (head->err_action != TC_ACT_UNSPEC &&
+	    nla_put_u32(skb, TCA_FLOWER_HEADER_PARSE_ERR_ACT, head->err_action))
+		goto nla_put_failure;
+
 	if (tcf_exts_dump(skb, &f->exts))
 		goto nla_put_failure;
 
-- 
2.12.2.816.g2cccc81164


* [PATCH/RFC net-next 2/4] flow dissector: return error on icmp dissection under-run
From: Simon Horman @ 2017-04-28 12:00 UTC
  To: Jiri Pirko, Jamal Hadi Salim, Cong Wang
  Cc: Dinan Gunawardena, netdev, oss-drivers, Simon Horman
In-Reply-To: <20170428120035.15984-1-simon.horman@netronome.com>

Return an error from __skb_flow_dissect() if insufficient packet data is
present when dissecting icmp type and code.

Without this change the absence of the ICMP type and code in truncated
ICMPv4 or ICMPv6 packets is treated the same as the presence of a code and
type with a value of zero.  As a result the flower classifier is unable to
differentiate between these two cases which may lead to unexpected matching
of truncated packets.

The approach taken here is to return an error if the IP protocol indicates
ICMP, and the type and code data is not present in the packet - an error
return value from __skb_header_pointer().
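
A minimal userspace sketch of this approach follows. The toy_* names are
hypothetical, and toy_header_pointer() is a deliberately reduced stand-in for
__skb_header_pointer(): it only bounds-checks a linear buffer and does not
model copying from non-linear skb data.

```c
#include <stddef.h>
#include <stdint.h>

enum toy_dissect_ret {
	TOY_RET_OUT_GOOD,
	TOY_RET_OUT_BAD,
};

struct toy_key_icmp {
	uint8_t type;
	uint8_t code;
};

/* Return a pointer to len bytes at offset within data, or NULL if that
 * range is not fully contained in the hlen bytes available.
 */
static const uint8_t *toy_header_pointer(const uint8_t *data, size_t hlen,
					 size_t offset, size_t len)
{
	if (offset > hlen || len > hlen - offset)
		return NULL;

	return data + offset;
}

/* Fill *key from the ICMP header at nhoff, or report an under-run
 * instead of silently leaving type and code at zero.
 */
static enum toy_dissect_ret toy_dissect_icmp(const uint8_t *data, size_t hlen,
					     size_t nhoff,
					     struct toy_key_icmp *key)
{
	const uint8_t *p = toy_header_pointer(data, hlen, nhoff, 2);

	if (!p)
		return TOY_RET_OUT_BAD;

	key->type = p[0];
	key->code = p[1];

	return TOY_RET_OUT_GOOD;
}
```

A packet that ends right after the IP header now yields TOY_RET_OUT_BAD,
while a complete echo reply (type 0, code 0) yields TOY_RET_OUT_GOOD with a
zero key, so the two cases are distinguishable by the caller.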

This should only affect the flower classifier as it is the only user of
FLOW_DISSECTOR_KEY_ICMP.  The behavioural update for flower only takes effect
with a separate patch to have it refuse to match if dissection fails.

Signed-off-by: Simon Horman <simon.horman@netronome.com>
---
 net/core/flow_dissector.c | 65 +++++++++++++++++++++++++----------------------
 1 file changed, 34 insertions(+), 31 deletions(-)

diff --git a/net/core/flow_dissector.c b/net/core/flow_dissector.c
index b3bf4886f71f..496afd7b3051 100644
--- a/net/core/flow_dissector.c
+++ b/net/core/flow_dissector.c
@@ -58,28 +58,6 @@ void skb_flow_dissector_init(struct flow_dissector *flow_dissector,
 EXPORT_SYMBOL(skb_flow_dissector_init);
 
 /**
- * skb_flow_get_be16 - extract be16 entity
- * @skb: sk_buff to extract from
- * @poff: offset to extract at
- * @data: raw buffer pointer to the packet
- * @hlen: packet header length
- *
- * The function will try to retrieve a be32 entity at
- * offset poff
- */
-static __be16 skb_flow_get_be16(const struct sk_buff *skb, int poff,
-				void *data, int hlen)
-{
-	__be16 *u, _u;
-
-	u = __skb_header_pointer(skb, poff, sizeof(_u), data, hlen, &_u);
-	if (u)
-		return *u;
-
-	return 0;
-}
-
-/**
  * __skb_flow_get_ports - extract the upper layer ports and return them
  * @skb: sk_buff to extract the ports from
  * @thoff: transport header offset
@@ -353,6 +331,29 @@ __skb_flow_dissect_gre(const struct sk_buff *skb,
 	return FLOW_DISSECT_RET_OUT_PROTO_AGAIN;
 }
 
+static enum flow_dissect_ret
+__skb_flow_dissect_icmp(const struct sk_buff *skb,
+			struct flow_dissector *flow_dissector,
+			void *target_container, void *data, int nhoff, int hlen)
+{
+	struct flow_dissector_key_icmp *key_icmp;
+	__be16 *u, _u;
+
+	if (!dissector_uses_key(flow_dissector, FLOW_DISSECTOR_KEY_ICMP))
+		return FLOW_DISSECT_RET_OUT_GOOD;
+
+	u = __skb_header_pointer(skb, nhoff, sizeof(_u), data, hlen, &_u);
+	if (!u)
+		return FLOW_DISSECT_RET_OUT_BAD;
+
+	key_icmp = skb_flow_dissector_target(flow_dissector,
+					     FLOW_DISSECTOR_KEY_ICMP,
+					     target_container);
+	key_icmp->icmp = *u;
+
+	return FLOW_DISSECT_RET_OUT_GOOD;
+}
+
 /**
  * __skb_flow_dissect - extract the flow_keys struct and return it
  * @skb: sk_buff to extract the flow from, can be NULL if the rest are specified
@@ -379,7 +380,6 @@ bool __skb_flow_dissect(const struct sk_buff *skb,
 	struct flow_dissector_key_basic *key_basic;
 	struct flow_dissector_key_addrs *key_addrs;
 	struct flow_dissector_key_ports *key_ports;
-	struct flow_dissector_key_icmp *key_icmp;
 	struct flow_dissector_key_tags *key_tags;
 	struct flow_dissector_key_vlan *key_vlan;
 	bool skip_vlan = false;
@@ -694,6 +694,17 @@ bool __skb_flow_dissect(const struct sk_buff *skb,
 	case IPPROTO_MPLS:
 		proto = htons(ETH_P_MPLS_UC);
 		goto mpls;
+	case IPPROTO_ICMP:
+	case NEXTHDR_ICMP:
+		switch (__skb_flow_dissect_icmp(skb, flow_dissector,
+						target_container, data,
+						nhoff, hlen)) {
+		case FLOW_DISSECT_RET_OUT_GOOD:
+			goto out_good;
+		case FLOW_DISSECT_RET_OUT_BAD:
+		default:
+			goto out_bad;
+		}
 	default:
 		break;
 	}
@@ -708,14 +719,6 @@ bool __skb_flow_dissect(const struct sk_buff *skb,
 			goto out_bad;
 	}
 
-	if (dissector_uses_key(flow_dissector,
-			       FLOW_DISSECTOR_KEY_ICMP)) {
-		key_icmp = skb_flow_dissector_target(flow_dissector,
-						     FLOW_DISSECTOR_KEY_ICMP,
-						     target_container);
-		key_icmp->icmp = skb_flow_get_be16(skb, nhoff, data, hlen);
-	}
-
 out_good:
 	ret = true;
 
-- 
2.12.2.816.g2cccc81164


* [PATCH/RFC net-next 1/4] flow dissector: return error on port dissection under-run
From: Simon Horman @ 2017-04-28 12:00 UTC
  To: Jiri Pirko, Jamal Hadi Salim, Cong Wang
  Cc: Dinan Gunawardena, netdev, oss-drivers, Simon Horman
In-Reply-To: <20170428120035.15984-1-simon.horman@netronome.com>

Return an error from __skb_flow_dissect() if insufficient packet data is
present when dissecting layer 4 ports.

Without this change the absence of ports in truncated - e.g. UDP - packets
is treated the same way as the presence of ports with a value of zero.  As
a result the flower classifier is unable to differentiate between these two
cases which may lead to unexpected matching of truncated packets.

The approach taken here is to only return an error if the offset of ports
for the previously dissected IP protocol is known - a non error return from
proto_ports_offset() - and the port data is not present in the packet - an
error return value from __skb_header_pointer().
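
The three outcomes described above (protocol has no ports, ports present,
ports missing) can be sketched in userspace as follows. All toy_* names are
hypothetical; toy_ports_offset() is a reduced stand-in for
proto_ports_offset() that only knows TCP and UDP, and the buffer is assumed
to be linear.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define TOY_PROTO_ICMP	1
#define TOY_PROTO_TCP	6
#define TOY_PROTO_UDP	17

/* Offset of the port pair within the L4 header, or -1 if the protocol
 * is not known to carry ports.
 */
static int toy_ports_offset(uint8_t ip_proto)
{
	switch (ip_proto) {
	case TOY_PROTO_TCP:
	case TOY_PROTO_UDP:
		return 0;
	default:
		return -1;
	}
}

/* Returns false only when the protocol should have ports but the packet
 * is too short to contain them. A protocol without ports is not an
 * error: *ports is set to the default of 0.
 */
static bool toy_get_ports(const uint8_t *data, size_t hlen, size_t thoff,
			  uint8_t ip_proto, uint32_t *ports)
{
	int poff = toy_ports_offset(ip_proto);
	size_t off;

	if (poff < 0) {
		*ports = 0;
		return true;
	}

	off = thoff + (size_t)poff;
	if (off > hlen || sizeof(*ports) > hlen - off)
		return false;

	memcpy(ports, data + off, sizeof(*ports));
	return true;
}

/* Wrapper keeping the old contract: 0 on any failure. */
static uint32_t toy_get_ports_compat(const uint8_t *data, size_t hlen,
				     size_t thoff, uint8_t ip_proto)
{
	uint32_t ports;

	return toy_get_ports(data, hlen, thoff, ip_proto, &ports) ? ports : 0;
}
```

The wrapper plays the role that skb_flow_get_ports() plays in this patch:
callers that only want a value keep seeing 0 for truncated packets, while
the dissector can act on the boolean result.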

The behaviour for callers of __skb_flow_get_ports() is changed but the only
callers are skb_flow_get_ports() and the flow dissector.  The former has
been updated so that its behaviour is unchanged.  Behavioural change of the
latter is the intended purpose of this patch but will only take effect with
a separate patch to have it refuse to match if dissection fails.

This change will lead to behavioural changes for the users of the dissector
with FLOW_DISSECTOR_KEY_PORTS - flower, and users of
flow_keys_dissector_keys[] and flow_keys_dissector_symmetric_keys[].  The
behavioural change for users of *_keys[] seems reasonable as it should only
affect truncated packets.

Signed-off-by: Simon Horman <simon.horman@netronome.com>
---
 include/linux/skbuff.h    | 11 ++++++++---
 net/core/flow_dissector.c | 40 ++++++++++++++++++++++++++--------------
 2 files changed, 34 insertions(+), 17 deletions(-)

diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h
index 81ef53f06534..0ad9b3955829 100644
--- a/include/linux/skbuff.h
+++ b/include/linux/skbuff.h
@@ -1108,13 +1108,18 @@ u32 __skb_get_hash_symmetric(const struct sk_buff *skb);
 u32 skb_get_poff(const struct sk_buff *skb);
 u32 __skb_get_poff(const struct sk_buff *skb, void *data,
 		   const struct flow_keys *keys, int hlen);
-__be32 __skb_flow_get_ports(const struct sk_buff *skb, int thoff, u8 ip_proto,
-			    void *data, int hlen_proto);
+bool __skb_flow_get_ports(const struct sk_buff *skb, int thoff, u8 ip_proto,
+			  void *data, int hlen_proto, __be32 *ports);
 
 static inline __be32 skb_flow_get_ports(const struct sk_buff *skb,
 					int thoff, u8 ip_proto)
 {
-	return __skb_flow_get_ports(skb, thoff, ip_proto, NULL, 0);
+	__be32 ports;
+
+	if (__skb_flow_get_ports(skb, thoff, ip_proto, NULL, 0, &ports))
+		return ports;
+	else
+		return 0;
 }
 
 void skb_flow_dissector_init(struct flow_dissector *flow_dissector,
diff --git a/net/core/flow_dissector.c b/net/core/flow_dissector.c
index 28d94bce4df8..b3bf4886f71f 100644
--- a/net/core/flow_dissector.c
+++ b/net/core/flow_dissector.c
@@ -86,30 +86,41 @@ static __be16 skb_flow_get_be16(const struct sk_buff *skb, int poff,
  * @ip_proto: protocol for which to get port offset
  * @data: raw buffer pointer to the packet, if NULL use skb->data
  * @hlen: packet header length, if @data is NULL use skb_headlen(skb)
+ * @ports: pointer to return ports in
  *
  * The function will try to retrieve the ports at offset thoff + poff where poff
- * is the protocol port offset returned from proto_ports_offset
+ * is the protocol port offset returned from proto_ports_offset.
+ *
+ * Returns false on error, true otherwise.
  */
-__be32 __skb_flow_get_ports(const struct sk_buff *skb, int thoff, u8 ip_proto,
-			    void *data, int hlen)
+bool __skb_flow_get_ports(const struct sk_buff *skb, int thoff, u8 ip_proto,
+			  void *data, int hlen, __be32 *ports)
 {
 	int poff = proto_ports_offset(ip_proto);
+	__be32 *p, _p;
+
+	/* proto_ports_offset returning an error indicates that ip_proto is
+	 * not known to have ports. This is not considered an error here.
+	 * Rather it is considered that the flow key of the caller may use
+	 * the default value of port fields: 0.
+	 */
+	if (poff < 0) {
+		*ports = 0;
+		return true;
+	}
 
 	if (!data) {
 		data = skb->data;
 		hlen = skb_headlen(skb);
 	}
 
-	if (poff >= 0) {
-		__be32 *ports, _ports;
+	p = __skb_header_pointer(skb, thoff + poff, sizeof(_p),
+				 data, hlen, &_p);
+	if (!p)
+		return false;
+	*ports = *p;
 
-		ports = __skb_header_pointer(skb, thoff + poff,
-					     sizeof(_ports), data, hlen, &_ports);
-		if (ports)
-			return *ports;
-	}
-
-	return 0;
+	return true;
 }
 EXPORT_SYMBOL(__skb_flow_get_ports);
 
@@ -692,8 +703,9 @@ bool __skb_flow_dissect(const struct sk_buff *skb,
 		key_ports = skb_flow_dissector_target(flow_dissector,
 						      FLOW_DISSECTOR_KEY_PORTS,
 						      target_container);
-		key_ports->ports = __skb_flow_get_ports(skb, nhoff, ip_proto,
-							data, hlen);
+		if (!__skb_flow_get_ports(skb, nhoff, ip_proto, data, hlen,
+					  &key_ports->ports))
+			goto out_bad;
 	}
 
 	if (dissector_uses_key(flow_dissector,
-- 
2.12.2.816.g2cccc81164


* [PATCH/RFC net-next 0/4] net/sched: cls_flower: avoid false matching of truncated packets
From: Simon Horman @ 2017-04-28 12:00 UTC
  To: Jiri Pirko, Jamal Hadi Salim, Cong Wang
  Cc: Dinan Gunawardena, netdev, oss-drivers

Hi,

this series is intended to avoid false-positives which match
truncated packets against flower classifiers which match on:
* zero L4 ports or;
* zero ICMP code or type

This requires updating the flow dissector to return an error in such cases
and updating flower to not match on the result of a failed dissection.

In the case of UDP this results in a behavioural change to users of
flow_keys_dissector_keys[] and flow_keys_dissector_symmetric_keys[] -
dissection will fail on truncated packets where the IP protocol of the
packets indicates ports should be present (according to skb_flow_get_ports()).

The last patch of the series builds on the above to allow users to specify
a policy for how to handle packets whose dissection fails.

I will separately provide RFC patches to iproute2 to allow exercising the
last patch.

Simon Horman (4):
  flow dissector: return error on port dissection under-run
  flow dissector: return error on icmp dissection under-run
  net/sched: cls_flower: do not match if dissection fails
  net/sched: cls_flower: allow control of tree traversal on packet parse
    errors

 include/linux/skbuff.h       |  11 +++--
 include/uapi/linux/pkt_cls.h |   2 +
 net/core/flow_dissector.c    | 105 ++++++++++++++++++++++++-------------------
 net/sched/cls_flower.c       |  47 ++++++++++++++-----
 4 files changed, 107 insertions(+), 58 deletions(-)

-- 
2.12.2.816.g2cccc81164


* Re: [REGRESSION next-20170426] Commit 09515ef5ddad ("of/acpi: Configure dma operations at probe time for platform/amba/pci bus devices") causes oops in mvneta
From: Sricharan R @ 2017-04-28 11:56 UTC
  To: Ralph Sennhauser
  Cc: Rafael J. Wysocki, Joerg Roedel, Bjorn Helgaas, linux-acpi,
	linux-kernel, linux-pci, Thomas Petazzoni, netdev
In-Reply-To: <20170428081919.21bb569d@gmail.com>

[-- Attachment #1: Type: text/plain, Size: 2594 bytes --]

Hi Ralph,

<snip..>

>>>>>
>>>>> Commit 09515ef5ddad ("of/acpi: Configure dma operations at probe
>>>>> time for platform/amba/pci bus devices") causes a kernel panic as
>>>>> in the log below on an armada-385. Reverting the commit fixes the
>>>>> issue.
>>>>>
>>>>> Regards
>>>>> Ralph    
>>>>
>>>> Somehow not getting a obvious clue on whats going wrong with the
>>>> logs below. From the log and looking in to dts, the drivers seems
>>>> to the one for "marvell,armada-370-neta".  
>>>
>>> Correct.
>>>   
>>>> Issue looks the data from the dma
>>>> has gone bad and subsequently referring the wrong data has resulted
>>>> in the crash. Looks like the dma_masks is the one going wrong.
>>>> Can i get some logs from mvneta_probe, about dev->dma_mask,
>>>> dev->coherent_dma_mask and dev->dma_ops with and without the patch
>>>> to see whats the difference ?  
>>>
>>> Not sure I understood what exactly you are after. Might be faster to
>>> just send me a patch with all debug print statements you like to
>>> see. 
>>
>> Attached the patch with debug prints.
>>
>> Regards,
>>  Sricharan
>>
> 
> Hi Sricharan
> 
> With commit 09515ef5ddad
> 
> [    1.288962] mvneta f1070000.ethernet: dev->dma_mask 0xffffffff
> [    1.294827] mvneta f1070000.ethernet: dev->coherent_dma_mask 0xffffffff
> [    1.301472] mvneta f1070000.ethernet: dev->dma_ops 0x40b00c0601460
> 
> [    1.322047] mvneta f1034000.ethernet: dev->dma_mask 0xffffffff
> [    1.327904] mvneta f1034000.ethernet: dev->coherent_dma_mask 0xffffffff
> [    1.334549] mvneta f1034000.ethernet: dev->dma_ops 0x40b00c0601460
> 
> 
> With the patch reverted, the build that works
> 
> [    1.289001] mvneta f1070000.ethernet: dev->dma_mask 0xffffffff
> [    1.294866] mvneta f1070000.ethernet: dev->coherent_dma_mask 0xffffffff
> [    1.301511] mvneta f1070000.ethernet: dev->dma_ops 0x40b00c06014a8
> 
> [    1.317005] mvneta f1034000.ethernet: dev->dma_mask 0xffffffff
> [    1.322867] mvneta f1034000.ethernet: dev->coherent_dma_mask 0xffffffff
> [    1.329508] mvneta f1034000.ethernet: dev->dma_ops 0x40b00c06014a8
> 

My bad, I think the patch at [1] is missing; I have attached it as well.
In fact, it was in the series initially and got acked to be merged
separately, well before the series. I should have sent it to Russell.
I will do this now. If this fixes the issue,
I will take this patch separately, while this series gets tested
on -next.

[1] https://patchwork.kernel.org/patch/9362113/

-- 
QUALCOMM INDIA, on behalf of Qualcomm Innovation Center, Inc. is a member of Code Aurora Forum, hosted by The Linux Foundation

[-- Attachment #2: 0001-arm-dma-mapping-Don-t-override-dma_ops-in-arch_setup.patch --]
[-- Type: text/plain, Size: 2001 bytes --]

From be36ea5f2c7d1c28dc8f829b5d2c817826481086 Mon Sep 17 00:00:00 2001
From: Laurent Pinchart <laurent.pinchart+renesas@ideasonboard.com>
Date: Fri, 15 May 2015 02:00:02 +0300
Subject: [PATCH] arm: dma-mapping: Don't override dma_ops in
 arch_setup_dma_ops()

The arch_setup_dma_ops() function is in charge of setting dma_ops with a
call to set_dma_ops(). set_dma_ops() is also called from

- highbank and mvebu bus notifiers
- dmabounce (to be replaced with swiotlb)
- arm_iommu_attach_device

(arm_iommu_attach_device is itself called from IOMMU and bus master
device drivers)

To allow the arch_setup_dma_ops() call to be moved from device add time
to device probe time we must ensure that dma_ops already set up by any of
the above callers will not be overridden.

After replacing dmabounce with swiotlb, converting IOMMU drivers to
of_xlate and taking care of highbank and mvebu, the workaround should be
removed.

[Rebased on top of 4.11-rc8]
Signed-off-by: Sricharan R <sricharan@codeaurora.org>
Signed-off-by: Laurent Pinchart <laurent.pinchart+renesas@ideasonboard.com>
---
 arch/arm/mm/dma-mapping.c | 9 +++++++++
 1 file changed, 9 insertions(+)

diff --git a/arch/arm/mm/dma-mapping.c b/arch/arm/mm/dma-mapping.c
index 0268584..c742dfd 100644
--- a/arch/arm/mm/dma-mapping.c
+++ b/arch/arm/mm/dma-mapping.c
@@ -2408,6 +2408,15 @@ void arch_setup_dma_ops(struct device *dev, u64 dma_base, u64 size,
 	const struct dma_map_ops *dma_ops;
 
 	dev->archdata.dma_coherent = coherent;
+
+	/*
+	 * Don't override the dma_ops if they have already been set. Ideally
+	 * this should be the only location where dma_ops are set, remove this
+	 * check when all other callers of set_dma_ops will have disappeared.
+	 */
+	if (dev->dma_ops)
+		return;
+
 	if (arm_setup_iommu_dma_ops(dev, dma_base, size, iommu))
 		dma_ops = arm_get_iommu_dma_map_ops(coherent);
 	else
-- 
QUALCOMM INDIA, on behalf of Qualcomm Innovation Center, Inc. is a member of Code Aurora Forum, hosted by The Linux Foundation



* Re: Network cooling device and how to control NIC speed on thermal condition
From: Andrew Lunn @ 2017-04-28 11:56 UTC
  To: Waldemar Rymarkiewicz; +Cc: Alan Cox, Florian Fainelli, netdev, linux-kernel
In-Reply-To: <CAHKzcEO-75r5Gmew9x1eprh9ACf_NQKCHQg9svOLM1OmVX0JgQ@mail.gmail.com>

> I collect SoC temp every a few secs. Meantime, I use ethtool -s ethX
> speed <speed> to manipulate link speed and to see how it impacts SoC
> temp. My 4 PHYs and switch are integrated into SoC and I always
> change link speed for all PHYs , no traffic on the link for this test.
> Starting with 1Gb/s and then scaling down to 100 Mb/s and then to
> 10Mb/s, I see significant  ~10 *C drop in temp while link is set to
> 10Mb/s.

Is that a realistic test? No traffic over the network? If you are
hitting your thermal limit, to me that means one of two things:

1) The device is under very heavy load, consuming a lot of power to do
   what it needs to do.

2) Your device is idle, no packets are flowing, but your thermal
   design is wrong, so that it cannot dissipate enough heat.

It seems to me, you are more interested in 1). But your quick test is
more about 2).

I would be more interested in quick tests of switching 8Gbps,
4Gbps, 2Gbps, 1Gbps, 512Mbps, 256Mbps, ... What effect does this have
on temperature?

> So, throttling link speed can really help to dissipate heat
> significantly when the platform is under threat.
> 
> Renegotiating link speed costs something I agree, it also impacts user
> experience, but such a thermal condition will not occur often I
> believe.

It is a heavy handed approach, and you have to be careful. There are
some devices which don't work properly, e.g. if you try to negotiate
1000 half duplex, you might find the link just breaks.

Doing this via packet filtering, dropping packets, gives you much
finer grained control and is a lot less disruptive. But it assumes
handling packets is what is causing your heat problems, not the links
themselves.

	Andrew


* Re: [ISSUE: sky2 - rx error] Link stops working under heavy traffic load connected to a mv88e6176
From: Rafa Corvillo @ 2017-04-28 11:54 UTC
  To: Andrew Lunn; +Cc: Stephen Hemminger, netdev
In-Reply-To: <20170427130450.GL17172@lunn.ch>

On 27/04/17 15:04, Andrew Lunn wrote:
> On Thu, Apr 27, 2017 at 02:05:51PM +0200, Rafa Corvillo wrote:
>> On 25/04/17 17:27, Stephen Hemminger wrote:
>>> On Fri, 21 Apr 2017 14:39:00 +0200
>>> Rafa Corvillo <rafael.corvillo@aoifes.com> wrote:
>>>
>>>> We are working in an ARMv7 embedded system running kernel 4.9 (LEDE build).
>>>> It is an imx6 board with 2 ethernet interfaces. One of them is connected to
>>>> a Marvell switch.
>>>>
>>>> The schema of the system is the following:
>>>>
>
> Hi Rafa
>
> Your ASCII art got messed up somewhere. Is this the correct
> reconstruction?

Yes, this is the schema.

>
>     +-------------------+ eth0
>     |                   +--+
>     |                   |  |
>     | Embedded system   +--+
>     |                   |
>     |      ARMv7        |
>     |                   | Marvell 88E8057(sky2)     +-------------+
>     |                   +--+                     +--+             +--+ eth1
>     |                   |  +---------------------+  |             |  +------+
>     |                   +--+      CPU port       +--+ mv88e6176   +--+
>     +------+--+---------+                           |             |
> emulated  |  |                                     |             |
> GPIO      +--+                                  +--+             +--+ eth2
> MDIO       +-----------------------------------+   |             |  +------+
>                                  MDIO            +--+             +--+
>                                                     +-------------+
>
> I assume you are using DSA? Since this is LEDE, it could be swconfig,
> but the bridge configuration you mentioned would not make sense for
> swconfig.

Yes, we use the DSA driver. We don't use swconfig to configure the Marvell
switch. Our board has two ethernet interfaces (eth0 and marvell) using the
sky2 driver. The marvell interface is connected to an external Marvell
switch (mv88e6176) with four ethernet ports (but we only use two of
them, eth1 and eth2). The Marvell switch is configured over MDIO,
which we emulate through GPIOs (mdio-gpio kernel module), and
the DSA driver is used to work with the Marvell switch.

We have the ethernet interfaces in the same bridge:

config interface 'lan'
         option type 'bridge'
         option ifname 'eth0 eth1 eth2'
         option proto 'static'
         option ipaddr '192.168.1.100'
         option netmask '255.255.255.0'
         option ip6assign '60'

root@LEDE:/# brctl show
bridge name     bridge id               STP enabled     interfaces
br-lan          7fff.00d01274f069       no              eth0
                                                        eth1
                                                        eth2
root@LEDE:/# ip a
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN group default qlen 1
     link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
     inet 127.0.0.1/8 scope host lo
        valid_lft forever preferred_lft forever
     inet6 ::1/128 scope host
        valid_lft forever preferred_lft forever
2: eth0: <NO-CARRIER,BROADCAST,MULTICAST,UP> mtu 1500 qdisc fq_codel master br-lan state DOWN group default qlen 1000
     link/ether 00:d0:12:74:f0:69 brd ff:ff:ff:ff:ff:ff
3: ifb0: <BROADCAST,NOARP> mtu 1500 qdisc noop state DOWN group default qlen 32
     link/ether be:80:bc:5e:63:c3 brd ff:ff:ff:ff:ff:ff
4: ifb1: <BROADCAST,NOARP> mtu 1500 qdisc noop state DOWN group default qlen 32
     link/ether 0a:1d:8d:06:e3:5d brd ff:ff:ff:ff:ff:ff
5: gre0@NONE: <NOARP> mtu 1476 qdisc noop state DOWN group default qlen 1
     link/gre 0.0.0.0 brd 0.0.0.0
6: gretap0@NONE: <BROADCAST,MULTICAST> mtu 1462 qdisc noop state DOWN group default qlen 1000
     link/ether 00:00:00:00:00:00 brd ff:ff:ff:ff:ff:ff
7: bond0: <BROADCAST,MULTICAST,MASTER> mtu 1500 qdisc noop state DOWN group default qlen 1000
     link/ether e2:0b:10:b8:b7:b0 brd ff:ff:ff:ff:ff:ff
8: teql0: <NOARP> mtu 1500 qdisc noop state DOWN group default qlen 100
     link/void
9: can0: <NOARP,ECHO> mtu 16 qdisc noop state DOWN group default qlen 10
     link/can
10: marvell: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc fq_codel state UP group default qlen 1000
     link/ether aa:64:73:91:09:a9 brd ff:ff:ff:ff:ff:ff
     inet6 fe80::a864:73ff:fe91:9a9/64 scope link
        valid_lft forever preferred_lft forever
11: eth1@marvell: <NO-CARRIER,BROADCAST,MULTICAST,UP> mtu 1500 qdisc noqueue master br-lan switchid 00000000 state LOWERLAYERDOWN group default qlen 1000
     link/ether aa:64:73:91:09:a9 brd ff:ff:ff:ff:ff:ff
12: eth2@marvell: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue master br-lan switchid 00000000 state UP group default qlen 1000
     link/ether aa:64:73:91:09:a9 brd ff:ff:ff:ff:ff:ff
13: eth3@marvell: <BROADCAST,MULTICAST> mtu 1500 qdisc noop switchid 00000000 state DOWN group default qlen 1000
     link/ether aa:64:73:91:09:a9 brd ff:ff:ff:ff:ff:ff
14: eth4@marvell: <BROADCAST,MULTICAST> mtu 1500 qdisc noop switchid 00000000 state DOWN group default qlen 1000
     link/ether aa:64:73:91:09:a9 brd ff:ff:ff:ff:ff:ff
15: br-lan: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UP group default qlen 1000
     link/ether 00:d0:12:74:f0:69 brd ff:ff:ff:ff:ff:ff
     inet 192.168.1.100/24 brd 192.168.1.255 scope global br-lan
        valid_lft forever preferred_lft forever
     inet6 fd7b:a43b:e93e::1/60 scope global noprefixroute
        valid_lft forever preferred_lft forever
     inet6 fe80::2d0:12ff:fe74:f069/64 scope link
        valid_lft forever preferred_lft forever


We have this configuration working on kernel 4.1, including patches that
upgrade dsa/mv88e6xxx to the kernel 4.3 version (5acf4d0, Wed, 27 May
2015 15:32:15 -0700, "[PATCH] blk: rq_data_dir() should not return a
boolean.").

>
>>>> If I connect eth1/eth2, the link is up and I can ping through it.
>>>> But once I start sending a heavy traffic load, the link fails and the
>>>> kernel prints the following messages:
>>>>
>>>> [   48.557140] sky2 0000:04:00.0 marvell: rx error, status 0x5f20010 length 1518
>>>> [   48.564964] sky2 0000:04:00.0 marvell: rx error, status 0x5f20010 length 1518
>>>> [   48.572110] sky2 0000:04:00.0 marvell: rx error, status 0x5f20010 length 1518
>>>> [   48.579263] sky2 0000:04:00.0 marvell: rx error, status 0x5f20010 length 1518
>>>> [   48.586417] sky2 0000:04:00.0 marvell: rx error, status 0x5f20010 length 1518
>>>> [   48.593573] sky2 0000:04:00.0 marvell: rx error, status 0x5f20010 length 1518
>>>> [   48.600718] sky2 0000:04:00.0 marvell: rx error, status 0x5f20010 length 1518
>>>> [   54.877567] net_ratelimit: 6 callbacks suppressed
>>>> [   54.882293] sky2 0000:04:00.0 marvell: rx error, status 0x5f20010 length 1518
>>>> [   61.413552] sky2 0000:04:00.0 marvell: rx error, status 0x5f20010 length 1518
>>>
>>> The status error bits are in sky2.h
>>> 0x5f20010 is
>>>       05f2 frame length => 1522
>>>       0010 Too long err
>>>
>>> That means the packet was longer than the configured MTU.
>>> You are probably getting packets with VLAN tag but have not configured
>>> a VLAN.
>
> Since you are using DSA, you will have DSA tags enabled on frames
> to/from the switch. This adds an extra 8 byte header in the frame.  My
> guess is, it is this header, not the VLAN tag which is causing you MTU
> issues.

But it is strange because, as I said above, we have the same
configuration working properly on kernel 4.1 (with OpenWrt), and we
have the MTU set to 1500.
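
For readers following the status decoding quoted above, here is a minimal
sketch of how 0x5f20010 splits into a frame length and an error bit. The
macro names and bit positions are my recollection of the GMR_FS_* layout in
sky2.h and should be treated as assumptions; the helper functions are
hypothetical and exist only for illustration.

```c
#include <stdbool.h>
#include <stdint.h>

/* Assumed layout of the GMAC Rx status word (check sky2.h before relying on it). */
#define TOY_GMR_FS_LEN_SHIFT 16
#define TOY_GMR_FS_LEN_MASK  0x7fffu    /* bits 30..16: Rx frame length */
#define TOY_GMR_FS_LONG_ERR  (1u << 4)  /* "Too long" error bit */

/* Extract the frame length field from a status word. */
static unsigned int rx_status_len(uint32_t status)
{
	return (status >> TOY_GMR_FS_LEN_SHIFT) & TOY_GMR_FS_LEN_MASK;
}

/* Report whether the "too long" error bit is set. */
static bool rx_status_too_long(uint32_t status)
{
	return (status & TOY_GMR_FS_LONG_ERR) != 0;
}
```

With status 0x5f20010 the upper half is 0x05f2 (1522 decimal) and the lower
half has only bit 4 set, which matches the reading given in the quote.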

>
> I think this is the first time i've seen sky2 used in a DSA
> setup. mv643xx or mvneta is generally what is used, when using Marvell
> chipsets. These drivers are more lenient about MTU, and are happy to
> pass frames with additional headers.
>

We use the mv88e6xxx driver (our switch is an mv88e6176), and it depends
on the DSA driver in the kernel (doesn't it?).

>> Thanks for the information. I have increased the MTU value to 1550
>> (workaround) and it works if I send traffic (with iperf) from my
>> computer to the unit. But if I send traffic outside the unit, I get
>> a new error message and the link goes down:
>
> Changing the MTU like this is not a good fix. It will allow you to
> receive frames which are bigger, but it also means the local network
> stack will generate bigger frames to be transmitted. You probably need
> to modify the sky2 driver to allow it to receive frames bigger than
> the interface MTU, by about 8 bytes.

Should the DSA driver remove the DSA tags before passing the frames to
the sky2 interface?
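
To make the size arithmetic in this exchange concrete, a small sketch,
assuming the 8 byte tag mentioned above and a 14 byte Ethernet header. The
function names are hypothetical; this is not how sky2 checks frame lengths,
only an illustration of why a receive limit tied to the plain MTU rejects
tagged frames.

```c
#include <stdbool.h>

#define TOY_ETH_HLEN     14  /* destination MAC + source MAC + EtherType */
#define TOY_DSA_TAG_LEN   8  /* the "extra 8 byte header" from the quote */

/* Largest frame (FCS not counted) that a port carrying tagged traffic must accept. */
static unsigned int max_rx_frame_len(unsigned int mtu, unsigned int tag_len)
{
	return mtu + TOY_ETH_HLEN + tag_len;
}

/* Hypothetical receive check: allow MTU plus headroom for the switch tag. */
static bool rx_len_ok(unsigned int frame_len, unsigned int mtu,
		      unsigned int tag_len)
{
	return frame_len <= max_rx_frame_len(mtu, tag_len);
}
```

With an MTU of 1500 the untagged limit is 1514 and the tagged limit is 1522.
The latter equals the length field decoded from the status word earlier in
the thread, although the thread does not establish whether the hardware
count includes the FCS.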

>
>> [ 4901.032989] sky2 0000:04:00.0 marvell: tx timeout
>> [ 4904.722670] sky2 0000:04:00.0 marvell: Link is up at 1000 Mbps, full duplex, flow control both
>
> Between the sky2 and the switch, do you have two back-to-back PHYs or
> are you connecting the RGMII interfaces together?

I think that we have two back-to-back PHYs, but I am going to double 
check this with the hardware team.

Thanks,
Rafa

>
>      Andrew
>

^ permalink raw reply

* Re: prog ID and next steps. Was: [RFC net-next 0/2] Introduce bpf_prog ID and iteration
From: Hannes Frederic Sowa @ 2017-04-28 11:50 UTC (permalink / raw)
  To: Alexei Starovoitov, Martin KaFai Lau, netdev
  Cc: Daniel Borkmann, kernel-team, David S. Miller,
	Jesper Dangaard Brouer, John Fastabend, Thomas Graf
In-Reply-To: <40cf6893-4702-4773-1aaa-7dfdc51c6212@fb.com>

Hello Alexei,

On 28.04.2017 03:11, Alexei Starovoitov wrote:
> On 4/27/17 6:36 AM, Hannes Frederic Sowa wrote:
>> On 27.04.2017 08:24, Martin KaFai Lau wrote:
>>> This patchset introduces the bpf_prog ID and a new bpf cmd to
>>> iterate all bpf_prog in the system.
>>>
>>> It is still incomplete.  The idea can be extended to bpf_map.
>>>
>>> Martin KaFai Lau (2):
>>>   bpf: Introduce bpf_prog ID
>>>   bpf: Test for bpf_prog ID and BPF_PROG_GET_NEXT_ID
>>
>> Thanks Martin, I like the approach.
>>
>> I think the progid is also much more suitable to be used in kallsyms
>> because it handles collisions correctly and lets us correctly walk the
>> chain (for example, imagine loading two identical programs but
>> installing them at different hooks; kallsyms doesn't allow finding out
>> which program is installed where).
> 
> i disagree re: kallsyms. The goal of prog_tag is to let program writers
> understand which program is running in a stable way.

But that is exactly what it doesn't let program writers do; it just confuses them:

---

jit on:

perf record -e bpf_redirect -agR

The unwinder walks the stack, extracts address of upper function and
sends it to user space (perf) or handles it inside the kernel/kallsyms
(ftrace).

The user takes the tag of the bpf program and wants to inspect the maps
related to the program. Unfortunately the tag is not unique, and thus we
need to expand the tag back to all possible programs with the same tag
and expand that again to the union of all maps that those programs
reference.

That is what we present to the application developer. I would seriously
be very confused.

If the application developer doesn't trust perf and uses the instruction
pointer value from the stack directly, he can't find out which program it
is, because fdinfo, for example, doesn't show the actual address where
the program is allocated. I would use /dev/kmem now.

---

jit off:

perf probe -a '__bpf_prog_run ctx insn'
perf probe -a 'bpf_redirect flags ifindex'
perf record -e bpf_redirect -agR

Situation doesn't change. We do get the insn pointer thus have a unique
id for the program. That's it, no further introspection. I can read
/dev/kmem now.

---

Personally I wouldn't rely on such infrastructure.

My proposal would be to maybe hash a map id into the program, so instead
of replacing the user space file descriptor with zero, take a map id
(like discussed below) or an inode number of the map into the register
and hash with that, so that those programs have unique identifiers.
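
A toy sketch of the difference, under loud assumptions: the instruction
struct, the map_ids array and the FNV-1a hash are all stand-ins chosen for
brevity (they are not the kernel's tag computation). It only shows that
zeroing the map reference makes two programs with different maps collide,
while mixing in a stable map id keeps them apart.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical, simplified instruction: not the real struct bpf_insn. */
struct toy_insn {
	uint32_t opcode;
	uint32_t imm;        /* holds the map fd when is_map_ref is set */
	int is_map_ref;
};

/* Fold one 32-bit value into a 64-bit FNV-1a hash, byte by byte. */
static uint64_t fnv1a_u32(uint64_t h, uint32_t v)
{
	for (int i = 0; i < 4; i++) {
		h ^= (v >> (8 * i)) & 0xff;
		h *= 1099511628211ULL;
	}
	return h;
}

/*
 * map_ids[i] is the (assumed) stable id of the map referenced by
 * instruction i; it is ignored for instructions that reference no map.
 * mix_map_id == 0 zeroes the map reference before hashing,
 * mix_map_id != 0 hashes the map id instead.
 */
static uint64_t toy_tag(const struct toy_insn *insns, const uint32_t *map_ids,
			size_t n, int mix_map_id)
{
	uint64_t h = 14695981039346656037ULL;

	for (size_t i = 0; i < n; i++) {
		uint32_t imm = insns[i].imm;

		if (insns[i].is_map_ref)
			imm = mix_map_id ? map_ids[i] : 0;
		h = fnv1a_u32(h, insns[i].opcode);
		h = fnv1a_u32(h, imm);
	}
	return h;
}
```

Two programs with the same opcodes but different maps get the same value
from toy_tag(..., 0) and different values from toy_tag(..., 1).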

Otherwise construct kallsym entries with prog id instead of tag.

I think that the hash should try to resemble some kind of identity
function, and mapping two programs that do something completely
different to the same tag is not good (which happens because we don't
include the map).

Also I do think in future the difference between non-jit and jit
operation in regards to tracing should also be lifted. We could add a
manual tracing point into the interpreter for reporting the same event
as if the program was jitted.

Debugging should not be that different based on the sysctl flags.

> id is assigned dynamically and not suitable for that purpose.
> 
>> It would help a lot if you could pass the prog_id back during program
>> creation, otherwise it will be kind of difficult to get a hold on which
>> program is where. ;)
> 
> yes, but not at creation time. bpf_prog_load command will keep returning
> an FD and all operations on programs will be allowed with FD only.

Yes, yes, yes, fd should be primary return value, no questions asked.

If it is not possible to return back additional data via the bpf
syscall, add an operation to get id from fd. This is also fine.

> Think of this 'ID' as program handle or program pointer.

I do. My first idea was to use the inode of the bpffs as prog id and
just allocate it regardless, but it seems to be okay how you do it.
Maybe it even is better because you control the (re)cycling of numbers.

> In other words it's obfuscated kernel 'struct bpf_prog *' given to
> user space, so that user space can later convert this ID into FD.
> The other patch (not shown) will take ID from user space and will
> convert it to FD if prog->aux->user is the same or root.

Sure, what about tag -> id? Tag is being reported from tracing and thus
should be one of the starting points to explore which programs are running.

> We tried really hard to keep everything FD based. Unfortunately
> netlink is not suitable to pass FDs, so to query TC and XDP
> we either have to invent a way to install FD from netlink in recvmsg()
> or pass something that can be converted to FD later.
> That's what program ID is solving.

Hmm, I am not sure why netlink is not suitable to pass up file
descriptors. We certainly pass down file descriptors with netlink. But I
am fine with either bpf syscall or netlink. But also the introduction of
bpf prog id depends a bit on this reasoning.

> This set of patches look trivial with simple use of idr,
> but it took us long time to get there.
> We tried to use 64-bit ID to avoid wrap around issue, but association
> between ID and bpf_prog needs to be kept somewhere. The obvious
> answer is rhashtable, but it cannot be iterated easily.
> Like we'd need to dump the whole thing through bpf syscall which
> is not practical.
> Then we tried to use 32-bit idr's id + 32-bit timestamp/random.
> It works better, but then we hit the issue that bpf_prog_get_next_id
> cannot be iterated in a stable way when programs are being deleted
> while user space iterates over the whole list.
> So at the end we scrapped all the fancy things and went with
> simple 32-bit ID allocated in _cyclic_ way via idr.
> The reason for cyclic is to avoid prog delete/create races,
> so ID seen by user space stays stable for 2B ids.
> We were concerned that somebody might try to load/delete
> a program 2B times to cause the counter to wrap around, but
> it turned out not to be an issue. In that sense prog ID is similar
> to PID.

Or to ifindex. I don't have concerns and think this should be okay.
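
To illustrate the cyclic allocation and the "next id" walk described in the
quote, here is a user space sketch with a deliberately tiny ID space. It is
not the kernel idr; every name below is hypothetical, and the real ID space
is of course much larger than 8.

```c
#include <stdbool.h>
#include <stdint.h>

#define TOY_MAX_ID 8u   /* tiny ID space so wrap-around is easy to observe */

struct toy_idr {
	bool used[TOY_MAX_ID + 1]; /* index 0 is never handed out */
	uint32_t next;             /* where the next search starts (0 = unset) */
};

/* Hand out the first free ID at or after ->next, wrapping once; 0 means full. */
static uint32_t toy_alloc_cyclic(struct toy_idr *idr)
{
	uint32_t start = idr->next ? idr->next : 1;

	for (uint32_t i = 0; i < TOY_MAX_ID; i++) {
		uint32_t id = (start - 1 + i) % TOY_MAX_ID + 1;

		if (!idr->used[id]) {
			idr->used[id] = true;
			idr->next = id % TOY_MAX_ID + 1;
			return id;
		}
	}
	return 0;
}

/* Release an ID; it is not reused until the allocator wraps around to it. */
static void toy_free(struct toy_idr *idr, uint32_t id)
{
	if (id >= 1 && id <= TOY_MAX_ID)
		idr->used[id] = false;
}

/* Smallest live ID strictly greater than 'id'; 0 when the walk is finished. */
static uint32_t toy_get_next_id(const struct toy_idr *idr, uint32_t id)
{
	for (uint32_t next = id + 1; next <= TOY_MAX_ID; next++)
		if (idr->used[next])
			return next;
	return 0;
}
```

Because a freed ID is not handed out again immediately, an iterator that
remembers only the last ID it saw keeps making forward progress even when
programs are deleted and created underneath it.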

> So more complete picture of what we're trying to do:
> - new bpf_get_fd_from_id syscall cmd will be used to convert
>   prog ID into prog FD
> - tc/xdp/sockets/tracing attachment points will return prog ID
> - existing bpf_map_lookup() cmd from prog_array will be returning
>   prog ID
> - bpf_prog_next_id syscall cmd (this patch) is used to iterate
>   over all prog IDs
> - new bpf_prog_get_info syscall cmd (based on prog FD) will be used
>   to get all or partial info about the program that kernel knows about

Sounds all good to me.

> Example usage:
> - if user space wants to see instructions of all loaded programs
>   it can use a loop like:
> while (!bpf_prog_get_next_id(next_id, &next_id)) {
>    int fd = bpf_prog_get_fd_from_id(next_id);
>    struct bpf_prog_info info;
>    bpf_prog_get_info(fd, &info, flags);
>    // look into info.insns[]
>    close(fd);
> }
> 
> - if user space wants to see prog_tag of xdp program attached to eth0
>   // netlink sendmsg() into ifindex of eth0 that returns prog ID
>    int fd = bpf_prog_get_fd_from_id(id_from_netlink);
>    struct bpf_prog_info info;
>    bpf_prog_get_info(fd, &info, flags);
>    // look into info.prog_tag
>    close(fd);

No problems with the above examples. We would also like to be able to
close the loop from the tracing side as well as outlined above.

> the 'flags' argument of bpf_prog_get_info() will be used
> to tell kernel which info about the program needs to be dumped.
> Otherwise if kernel always dumps everything about the program,
> it will make the syscall too slow and too cumbersome.
> Possible combinations:
> - prog_type, prog_tag, license, prog ID
> - array of prog instructions
> - array of map IDs
> Here we'll introduce similar IDs for maps and
> bpf_map_get_info() syscall cmd that will return map_type, map_id, sizes.
> If user wants to iterate over all elements of the map, they can
> use map_fd = bpf_map_get_fd_from_id(map_id); command
> and later use existing bpf_map_get_next_key+bpf_map_lookup_elem.
> 
> We believe this way the user space will be able to see _everything_
> about bpf programs and maps and can pick and choose whether
> it wants to see only programs or only maps or partial info
> about progs (without instructions) and so on.
> 
> Once we have CTF (debug info) available for maps and progs,
> we will extend bpf_prog_get_info() and bpf_map_get_info()
> commands to optionally return that as well.
> 

I think this is all non-controversial!

Thanks,
Hannes

^ permalink raw reply

* Re: rhashtable - Cap total number of entries to 2^31
From: Christian Borntraeger @ 2017-04-28 11:43 UTC (permalink / raw)
  To: Herbert Xu
  Cc: Florian Fainelli, David Miller, fw, netdev, Thomas Graf,
	Stephen Rothwell, Linux-Next Mailing List,
	Linux Kernel Mailing List
In-Reply-To: <20170428113145.GA7843@gondor.apana.org.au>

On 04/28/2017 01:31 PM, Herbert Xu wrote:
> On Fri, Apr 28, 2017 at 12:23:15PM +0200, Christian Borntraeger wrote:
>>
>> I can reproduce this boot failure on s390 bisected to 
>> commit 6d684e54690caef45cf14051ddeb7c71beeb681b
>>    rhashtable: Cap total number of entries to 2^31
>> in linux-next from Apr 28
> 
> It should go away with
> 
> https://patchwork.ozlabs.org/patch/756233/
> 
> Thanks,
> 
Yes it does.

Tested-by: Christian Borntraeger <borntraeger@de.ibm.com>

would be nice to have it in the next linux-next version.

^ permalink raw reply

* Re: [PATCH v6 3/5] rxrpc: check return value of skb_to_sgvec always
From: Sabrina Dubroca @ 2017-04-28 11:41 UTC (permalink / raw)
  To: Jason A. Donenfeld
  Cc: netdev, linux-kernel, David.Laight, kernel-hardening, davem
In-Reply-To: <20170425184734.26563-3-Jason@zx2c4.com>

2017-04-25, 20:47:32 +0200, Jason A. Donenfeld wrote:
> Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
> ---
>  net/rxrpc/rxkad.c | 10 +++++++---
>  1 file changed, 7 insertions(+), 3 deletions(-)
> 
> diff --git a/net/rxrpc/rxkad.c b/net/rxrpc/rxkad.c
> index 4374e7b9c7bf..dcf46c9c3ece 100644
> --- a/net/rxrpc/rxkad.c
> +++ b/net/rxrpc/rxkad.c
[...]
> @@ -429,7 +432,8 @@ static int rxkad_verify_packet_2(struct rxrpc_call *call, struct sk_buff *skb,
>  	}
>  

Adding a few more lines of context:

	sg = _sg;
	if (unlikely(nsg > 4)) {
		sg = kmalloc(sizeof(*sg) * nsg, GFP_NOIO);
		if (!sg)
			goto nomem;
	}

>  	sg_init_table(sg, nsg);
> -	skb_to_sgvec(skb, sg, offset, len);
> +	if (unlikely(skb_to_sgvec(skb, sg, offset, len) < 0))
> +		goto nomem;

You're leaking sg when nsg > 4, you'll need to add this:

	if (sg != _sg)
		kfree(sg);
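
The general shape of the fix, as a user space sketch: a single exit label
that frees the heap array whenever it differs from the on-stack one, so
that the new error branch cannot skip the free. The names toy_verify,
fill_ok and fill_fail are hypothetical stand-ins (the fill functions play
the role of skb_to_sgvec), and the live_allocs counter exists only to make
the leak observable.

```c
#include <stdlib.h>

#define STACK_SG 4              /* size of the on-stack array, as in the quoted code */

struct toy_sg { int unused; };  /* stand-in for struct scatterlist */

static int live_allocs;         /* heap arrays currently outstanding */

/* Stand-ins for skb_to_sgvec(): a negative return value means failure. */
static int fill_ok(struct toy_sg *sg, unsigned int nsg)
{
	(void)sg; (void)nsg;
	return 0;
}

static int fill_fail(struct toy_sg *sg, unsigned int nsg)
{
	(void)sg; (void)nsg;
	return -1;
}

static int toy_verify(unsigned int nsg,
		      int (*fill)(struct toy_sg *, unsigned int))
{
	struct toy_sg _sg[STACK_SG], *sg = _sg;
	int ret = -1;

	if (nsg > STACK_SG) {
		sg = malloc(sizeof(*sg) * nsg);
		if (!sg)
			return -1;
		live_allocs++;
	}

	if (fill(sg, nsg) < 0)
		goto out;       /* failure must not skip the free below */

	ret = 0;
out:
	if (sg != _sg) {
		free(sg);
		live_allocs--;
	}
	return ret;
}
```

Calling toy_verify(8, fill_fail) returns an error and leaves live_allocs at
zero, which is the property the review comment asks for.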

BTW, when you resubmit, please Cc: the maintainers of the files you're
changing for each patch, so that they can review this stuff. And send
patch 1 to all of them, otherwise they might be surprised that we even
need <0 checking after calls to skb_to_sgvec.

You might also want to add a cover letter.

-- 
Sabrina

^ permalink raw reply

* Re: rhashtable - Cap total number of entries to 2^31
From: Herbert Xu @ 2017-04-28 11:31 UTC (permalink / raw)
  To: Christian Borntraeger
  Cc: Florian Fainelli, David Miller, fw, netdev, Thomas Graf,
	Stephen Rothwell, Linux-Next Mailing List,
	Linux Kernel Mailing List
In-Reply-To: <acb22f3f-8286-01f2-e536-0ab44eb06b34@de.ibm.com>

On Fri, Apr 28, 2017 at 12:23:15PM +0200, Christian Borntraeger wrote:
>
> I can reproduce this boot failure on s390 bisected to 
> commit 6d684e54690caef45cf14051ddeb7c71beeb681b
>    rhashtable: Cap total number of entries to 2^31
> in linux-next from Apr 28

It should go away with

https://patchwork.ozlabs.org/patch/756233/

Thanks,
-- 
Email: Herbert Xu <herbert@gondor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt

^ permalink raw reply

* Re: xdp_redirect ifindex vs port. Was: best API for returning/setting egress port?
From: Jesper Dangaard Brouer @ 2017-04-28 10:58 UTC (permalink / raw)
  To: Alexei Starovoitov
  Cc: Andy Gospodarek, John Fastabend, Alexei Starovoitov,
	Daniel Borkmann, Daniel Borkmann, netdev@vger.kernel.org,
	xdp-newbies@vger.kernel.org, brouer
In-Reply-To: <53e9dd2f-f40a-b43b-99c9-62f5ce3a665c@fb.com>

On Thu, 27 Apr 2017 16:31:14 -0700
Alexei Starovoitov <ast@fb.com> wrote:

> On 4/27/17 1:41 AM, Jesper Dangaard Brouer wrote:
> > When registering/attaching a XDP/bpf program, we would just send the
> > file-descriptor for this port-map along (like we do with the bpf_prog
> > FD), plus the ingress-port number this program has in the port-map.
> >
> > It is not clear to me, in-which-data-structure on the kernel-side we
> > store this reference to the port-map and ingress-port. As today we only
> > have the "raw" struct bpf_prog pointer. I see several options:
> >
> > 1. Create a new xdp_prog struct that contains existing bpf_prog,
> > a port-map pointer and ingress-port. (IMHO easiest solution)
> >
> > 2. Just create a new pointer to port-map and store it in driver rx-ring
> > struct (like existing bpf_prog), but this create a race-challenge
> > replacing (cmpxchg) the program (or perhaps it's not a problem as it
> > runs under rcu and RTNL-lock).
> >
> > 3. Extend bpf_prog to store this port-map and ingress-port, and have a
> > fast-way to access it.  I assume it will be accessible via
> > bpf_prog->bpf_prog_aux->used_maps[X] but it will be too slow for XDP.  
> 
> I'm not sure I completely follow the 3 proposals.
> Are you suggesting to have only one netdev_array per program?

Yes, but I can see you have a more clever idea below.

> Why not to allow any number like we do for tailcall+prog_array, etc?

> We can teach verifier to allow new helper
>  bpf_tx_port(netdev_array, port_num);
> to only be used with netdev_array map type.
> It will fetch netdevice pointer from netdev_array[port_num]
> and will tx the packet into it.

I love it. 

I just don't like the "netdev" part of the name "netdev_array", as one
basic idea of a port table is that a port can be anything that can
consume an xdp_buff packet.  This generalization allows us to move code
out of the drivers.  We might be on the same page, as I do imagine that
netdev_array or port_array is just a struct bpf_map pointer, and the
bpf_map->map_type will tell us that this bpf_map contains net_device
pointers.  Thus, when later introducing a new type of redirect (like to
a socket or remote-CPU) then we just add a new bpf_map_type for this,
without needing to change anything in the drivers, right?

Do you imagine that bpf-side bpf_tx_port() returns XDP_REDIRECT?
Or does it return whether the call was successful (e.g. validating that
port_num exists in the map)?

On the kernel side, we need to receive this info, "port_array" and
"port_num". Given you don't pass the call an xdp_buff/ctx, I assume you
want the per-CPU temp-store solution.  Then during the XDP_REDIRECT
action we call a core redirect function that, based on the
bpf_map_type, does a lookup and finds the net_device ptr.

> We can make it similar to bpf_tail_call(), so that program will
> finish on successful bpf_tx_port() or
> make it into 'delayed' tx which will be executed when program finishes.
> Not sure which approach is better.

I know you are talking about something slightly different, about
delaying TX.

But I want to mention (as I've done before) that it is important (for
me) that we get bulking working/integrated.   I imagine the driver will
call a function that will delay the TX/redirect action and at the end
of the NAPI cycle have a function that flushes packets, bulked per
destination port.

I was wondering where to store these delayed TX packets, but now that
we have an associated bpf_map data-structure (netdev_array), I'm thinking
about storing packets (ordered by port) inside that.  And then have a
bpf_tx_flush(netdev_array) call in the driver (for every port-table-map
seen, which will likely be small).
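
A rough user space sketch of the bulking idea, to make the data flow
concrete. Everything here is hypothetical (struct names, the bulk size of
8, the counters); it only shows packets being queued per destination port
and handed over in one batch at the end of a simulated NAPI cycle.

```c
#include <stddef.h>

#define TOY_PORTS 4
#define TOY_BULK  8

struct toy_port_queue {
	const void *pkts[TOY_BULK];
	size_t count;     /* packets currently waiting */
	size_t flushed;   /* packets handed to the port so far */
	size_t flushes;   /* number of bulk transmissions */
};

struct toy_port_map {
	struct toy_port_queue q[TOY_PORTS];
};

/* Hand all waiting packets of one port over in a single batch. */
static void toy_flush_port(struct toy_port_queue *q)
{
	if (!q->count)
		return;
	/* a real driver would pass q->pkts[0..count) to the device here */
	q->flushed += q->count;
	q->flushes++;
	q->count = 0;
}

/* Queue instead of transmitting; flush early only when the bulk is full. */
static int toy_enqueue(struct toy_port_map *map, unsigned int port,
		       const void *pkt)
{
	struct toy_port_queue *q;

	if (port >= TOY_PORTS)
		return -1;
	q = &map->q[port];
	if (q->count == TOY_BULK)
		toy_flush_port(q);
	q->pkts[q->count++] = pkt;
	return 0;
}

/* Called once at the end of the (simulated) NAPI poll cycle. */
static void toy_flush_all(struct toy_port_map *map)
{
	for (unsigned int i = 0; i < TOY_PORTS; i++)
		toy_flush_port(&map->q[i]);
}
```

Storing the queues inside the map object means the driver only needs one
flush call per port-table-map it has seen during the cycle.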

> We can also extend this netdev_array into broadcast/multicast. Like
> bpf_tx_allports(&netdev_array);
> call from the program will xmit the packet to all netdevices
> in that 'netdev_array' map type.

When broadcasting you often don't want to broadcast the packet out of
the incoming interface.  How can you support this?

Normally you would know your ingress port, and then exclude that port
from the broadcast.  But with many netdev_arrays, how does the program
know its own ingress port?
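
For illustration only, a sketch of a broadcast that skips the ingress port.
It assumes the caller somehow knows its ingress port number within the
array, which is exactly the open question here; the struct and function
names are hypothetical and do not correspond to any existing helper.

```c
#include <stddef.h>

#define TOY_PORTS 4

struct toy_port {
	int present;            /* slot is populated with a device */
	unsigned int tx_count;  /* packets sent out of this port */
};

struct toy_port_array {
	struct toy_port ports[TOY_PORTS];
};

/* Send to a single port; fails if the slot is out of range or empty. */
static int toy_tx_port(struct toy_port_array *arr, unsigned int port_num)
{
	if (port_num >= TOY_PORTS || !arr->ports[port_num].present)
		return -1;
	arr->ports[port_num].tx_count++;
	return 0;
}

/* Send to every populated port except 'ingress'; returns how many were used. */
static unsigned int toy_tx_allports_except(struct toy_port_array *arr,
					   unsigned int ingress)
{
	unsigned int sent = 0;

	for (unsigned int i = 0; i < TOY_PORTS; i++) {
		if (i == ingress)
			continue;
		if (toy_tx_port(arr, i) == 0)
			sent++;
	}
	return sent;
}
```

With several arrays per program, the ingress index would have to be supplied
per array, since the same device may sit at different positions in each.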

> The map-in-map support can be trivially extended to allow netdev_array,
> then the program can create N multicast groups of netdevices.
> Each multicast group == one netdev_array map.
> The user space will populate a hashmap with these netdev_arrays and
> bpf kernel side can select dynamically which multicast group to use
> to send the packets to.
> bpf kernel side may look like:
> struct bpf_netdev_array *netdev_array = bpf_map_lookup_elem(&hash, key);
> if (!netdev_array)
>    ...
> if (my_condition)
>     bpf_tx_allports(netdev_array);  /* broadcast to all netdevices */
> else
>     bpf_tx_port(netdev_array, port_num); /* tx into one netdevice */
> 
> that's an artificial example. Just trying to point out
> that we shouldn't restrict the feature too soon.

I like how you solve the multicast problem.  (But I do need to learn
some more of the inner-workings of bpf map-in-map to follow this
completely).

Thanks a lot for all this input, I got a much clearer picture of how
I can/should implement this :-)
-- 
Best regards,
  Jesper Dangaard Brouer
  MSc.CS, Principal Kernel Engineer at Red Hat
  LinkedIn: http://www.linkedin.com/in/brouer

^ permalink raw reply

* [PATCH iproute2 net-next] bpf: add support for generic xdp
From: Daniel Borkmann @ 2017-04-28 10:42 UTC (permalink / raw)
  To: stephen; +Cc: alexei.starovoitov, davem, netdev, Daniel Borkmann

Follow-up to commit c7272ca72009 ("bpf: add initial support for
attaching xdp progs") to also support generic XDP. This adds an
indicator for loaded generic XDP programs when programs are loaded
as shown in c7272ca72009, but the driver still lacks native XDP
support.

  # ip link
  [...]
  3: eno1: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 xdpgeneric qdisc [...]
      link/ether 0c:c4:7a:03:f9:25 brd ff:ff:ff:ff:ff:ff
  [...]

In case the driver does support native XDP, but the user wants
to load the program as generic XDP (e.g. for testing purposes),
then this can be done with the same semantics as in c7272ca72009,
but with 'xdpgeneric' instead of 'xdp' command for loading:

  # ip -force link set dev eno1 xdpgeneric obj xdp.o

Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
---
 ( Requires a header update to pull in XDP_FLAGS_SKB_MODE. )
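
 A standalone sketch of the flag handling described above, for readers who
 want the logic without the diff context. The two flag values are defined
 locally and are believed to match the kernel UAPI header; the helper
 names xdp_flags_for() and xdp_dump_name() are hypothetical and do not
 exist in iproute2.

```c
#include <stdbool.h>
#include <stdint.h>

/* Defined locally so the sketch builds standalone (assumed UAPI values). */
#define XDP_FLAGS_UPDATE_IF_NOEXIST (1U << 0)
#define XDP_FLAGS_SKB_MODE          (1U << 1)

/* Compose the flags sent along with the program fd. */
static uint32_t xdp_flags_for(bool force, bool generic)
{
	uint32_t flags = 0;

	if (!force)
		flags |= XDP_FLAGS_UPDATE_IF_NOEXIST;
	if (generic)
		flags |= XDP_FLAGS_SKB_MODE;
	return flags;
}

/* Name shown in 'ip link' output for an attached program. */
static const char *xdp_dump_name(uint32_t flags)
{
	return (flags & XDP_FLAGS_SKB_MODE) ? "xdpgeneric" : "xdp";
}
```

 With -force and xdpgeneric only the SKB mode bit is set; without -force
 and with plain xdp only the update-if-noexist bit is set.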

 ip/iplink.c           |  7 +++++--
 ip/iplink_xdp.c       | 46 +++++++++++++++++++++++++++++++++-------------
 ip/xdp.h              |  2 +-
 man/man8/ip-link.8.in | 19 +++++++++++++++++--
 4 files changed, 56 insertions(+), 18 deletions(-)

diff --git a/ip/iplink.c b/ip/iplink.c
index 866ad72..96b0da3 100644
--- a/ip/iplink.c
+++ b/ip/iplink.c
@@ -606,9 +606,12 @@ int iplink_parse(int argc, char **argv, struct iplink_req *req,
 			if (get_integer(&mtu, *argv, 0))
 				invarg("Invalid \"mtu\" value\n", *argv);
 			addattr_l(&req->n, sizeof(*req), IFLA_MTU, &mtu, 4);
-		} else if (strcmp(*argv, "xdp") == 0) {
+		} else if (strcmp(*argv, "xdpgeneric") == 0 ||
+			   strcmp(*argv, "xdp") == 0) {
+			bool generic = strcmp(*argv, "xdpgeneric") == 0;
+
 			NEXT_ARG();
-			if (xdp_parse(&argc, &argv, req))
+			if (xdp_parse(&argc, &argv, req, generic))
 				exit(-1);
 		} else if (strcmp(*argv, "netns") == 0) {
 			NEXT_ARG();
diff --git a/ip/iplink_xdp.c b/ip/iplink_xdp.c
index a81ed97..4a3343f 100644
--- a/ip/iplink_xdp.c
+++ b/ip/iplink_xdp.c
@@ -19,41 +19,56 @@
 
 extern int force;
 
+struct xdp_req {
+	struct iplink_req *req;
+	__u32 flags;
+};
+
 static void xdp_ebpf_cb(void *raw, int fd, const char *annotation)
 {
-	__u32 flags = !force ? XDP_FLAGS_UPDATE_IF_NOEXIST : 0;
-	struct iplink_req *req = raw;
-	struct rtattr *xdp;
+	struct xdp_req *xdp = raw;
+	struct iplink_req *req = xdp->req;
+	struct rtattr *xdp_attr;
 
-	xdp = addattr_nest(&req->n, sizeof(*req), IFLA_XDP);
+	xdp_attr = addattr_nest(&req->n, sizeof(*req), IFLA_XDP);
 	addattr32(&req->n, sizeof(*req), IFLA_XDP_FD, fd);
-	addattr32(&req->n, sizeof(*req), IFLA_XDP_FLAGS, flags);
-	addattr_nest_end(&req->n, xdp);
+	if (xdp->flags)
+		addattr32(&req->n, sizeof(*req), IFLA_XDP_FLAGS, xdp->flags);
+	addattr_nest_end(&req->n, xdp_attr);
 }
 
 static const struct bpf_cfg_ops bpf_cb_ops = {
 	.ebpf_cb = xdp_ebpf_cb,
 };
 
-static int xdp_delete(struct iplink_req *req)
+static int xdp_delete(struct xdp_req *xdp)
 {
-	xdp_ebpf_cb(req, -1, NULL);
+	xdp_ebpf_cb(xdp, -1, NULL);
 	return 0;
 }
 
-int xdp_parse(int *argc, char ***argv, struct iplink_req *req)
+int xdp_parse(int *argc, char ***argv, struct iplink_req *req, bool generic)
 {
 	struct bpf_cfg_in cfg = {
 		.argc = *argc,
 		.argv = *argv,
 	};
+	struct xdp_req xdp = {
+		.req = req,
+	};
 
 	if (*argc == 1) {
 		if (strcmp(**argv, "none") == 0 ||
 		    strcmp(**argv, "off") == 0)
-			return xdp_delete(req);
+			return xdp_delete(&xdp);
 	}
-	if (bpf_parse_common(BPF_PROG_TYPE_XDP, &cfg, &bpf_cb_ops, req))
+
+	if (!force)
+		xdp.flags |= XDP_FLAGS_UPDATE_IF_NOEXIST;
+	if (generic)
+		xdp.flags |= XDP_FLAGS_SKB_MODE;
+
+	if (bpf_parse_common(BPF_PROG_TYPE_XDP, &cfg, &bpf_cb_ops, &xdp))
 		return -1;
 
 	*argc = cfg.argc;
@@ -64,12 +79,17 @@ int xdp_parse(int *argc, char ***argv, struct iplink_req *req)
 void xdp_dump(FILE *fp, struct rtattr *xdp)
 {
 	struct rtattr *tb[IFLA_XDP_MAX + 1];
+	__u32 flags = 0;
 
 	parse_rtattr_nested(tb, IFLA_XDP_MAX, xdp);
+
 	if (!tb[IFLA_XDP_ATTACHED] ||
 	    !rta_getattr_u8(tb[IFLA_XDP_ATTACHED]))
 		return;
 
-	fprintf(fp, "xdp ");
-	/* More to come here in future for 'ip -d link' (digest, etc) ... */
+	if (tb[IFLA_XDP_FLAGS])
+		flags = rta_getattr_u32(tb[IFLA_XDP_FLAGS]);
+
+	fprintf(fp, "xdp%s ",
+		flags & XDP_FLAGS_SKB_MODE ? "generic" : "");
 }
diff --git a/ip/xdp.h b/ip/xdp.h
index bc69645..1b95e0f 100644
--- a/ip/xdp.h
+++ b/ip/xdp.h
@@ -3,7 +3,7 @@
 
 #include "utils.h"
 
-int xdp_parse(int *argc, char ***argv, struct iplink_req *req);
+int xdp_parse(int *argc, char ***argv, struct iplink_req *req, bool generic);
 void xdp_dump(FILE *fp, struct rtattr *tb);
 
 #endif /* __XDP__ */
diff --git a/man/man8/ip-link.8.in b/man/man8/ip-link.8.in
index a5ddfe7..52571b7 100644
--- a/man/man8/ip-link.8.in
+++ b/man/man8/ip-link.8.in
@@ -126,7 +126,7 @@ ip-link \- network device configuration
 .RB "[ " port_guid " eui64 ] ]"
 .br
 .in -9
-.RB "[ " xdp  " { " off " | "
+.RB "[ { " xdp " | " xdpgeneric  " } { " off " | "
 .br
 .in +8
 .BR object
@@ -1572,8 +1572,23 @@ which may impact security and/or performance. (e.g. VF multicast promiscuous mod
 
 .TP
 .B xdp object "|" pinned "|" off
-set (or unset) a XDP ("express data path") BPF program to run on every
+set (or unset) a XDP ("eXpress Data Path") BPF program to run on every
 packet at driver level.
+.B ip link
+output will indicate a
+.B xdp
+flag for the networking device. If the driver does not have native XDP
+support, the kernel will fall back to a slower, driver-independent "generic"
+XDP variant. The
+.B ip link
+output will in that case indicate
+.B xdpgeneric
+instead of
+.B xdp
+only. If the driver does have native XDP support, but the program is
+loaded under
+.B xdpgeneric object "|" pinned
+then the kernel will use the generic XDP variant instead of the native one.
 
 .B off
 (or
-- 
1.9.3

^ permalink raw reply related

* Re: rhashtable - Cap total number of entries to 2^31
From: Christian Borntraeger @ 2017-04-28 10:23 UTC (permalink / raw)
  To: Florian Fainelli, Herbert Xu, David Miller
  Cc: fw, netdev, Thomas Graf, Stephen Rothwell,
	Linux-Next Mailing List, Linux Kernel Mailing List
In-Reply-To: <56843a86-9a09-16e8-acec-05a80396f282@gmail.com>

On 04/28/2017 12:21 AM, Florian Fainelli wrote:
> On 04/27/2017 02:16 PM, Florian Fainelli wrote:
>> Hi Herbert,
>>
>> On 04/26/2017 10:44 PM, Herbert Xu wrote:
>>> On Tue, Apr 25, 2017 at 10:48:22AM -0400, David Miller wrote:
>>>> From: Florian Westphal <fw@strlen.de>
>>>> Date: Tue, 25 Apr 2017 16:17:49 +0200
>>>>
>>>>> I'd have less of an issue with this if we'd be talking about
>>>>> something computationally expensive, but this is about storing
>>>>> an extra value inside a struct just to avoid one "shr" in insert path...
>>>>
>>>> Agreed, this shift is probably filling an available cpu cycle :-)
>>>
>>> OK, but we need to have an extra field for another reason anyway.
>>> The problem is that we're not capping the total number of elements
>>> in the hashtable when max_size is not set, this means that nelems
>>> can overflow which will cause havoc with the automatic shrinking
>>> when it tries to fit 2^32 entries into a minimum-sized table.
>>>
>>> So I'm taking that hole back for now :)
>>>
>>> ---8<---
>>> When max_size is not set or if it is set to a sufficiently large
>>> value, the nelems counter can overflow.  This would cause havoc
>>> with the automatic shrinking as it would then attempt to fit a
>>> huge number of entries into a tiny hash table.
>>>
>>> This patch fixes this by adding max_elems to struct rhashtable
>>> to cap the number of elements.  This is set to 2^31 as nelems is
>>> not a precise count.  This is sufficiently smaller than UINT_MAX
>>> that it should be safe.
>>>
>>> When max_size is set max_elems will be lowered to at most twice
>>> max_size as is the status quo.
>>>
>>> Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
>>
>> This commit:
>>
>> https://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next.git/commit/?id=6d684e54690caef45cf14051ddeb7c71beeb681b
>>
>> makes my ARMv7 (32-bit) system panic on boot with the log below. I can
>> test net-next (or net) and report back if you want me to test anything.
>> Thanks!
> 
> And another one with a QEMU guest:
> 
> [    0.389212] NET: Registered protocol family 16
> [    0.388807] Kernel panic - not syncing: rtnetlink_init: cannot initialize rtnetlink
> [    0.388807]
> [    0.389445] CPU: 0 PID: 1 Comm: swapper/0 Not tainted 4.11.0-rc8-02077-ge221c1f0fe25 #1
> [    0.389745] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Ubuntu-1.8.2-1ubuntu2 04/01/2014
> [    0.390219] Call Trace:
> [    0.391406]  dump_stack+0x51/0x78
> [    0.391585]  panic+0xc7/0x20e
> [    0.391740]  ? register_pernet_operations+0xa1/0xd0
> [    0.392031]  rtnetlink_init+0x22/0x1a0
> [    0.392190]  netlink_proto_init+0x168/0x184
> [    0.392359]  ? ptp_classifier_init+0x26/0x30
> [    0.392528]  ? netlink_net_init+0x2e/0x2e
> [    0.392692]  do_one_initcall+0x54/0x190
> [    0.392852]  ? parse_args+0x248/0x400
> [    0.393033]  kernel_init_freeable+0x127/0x1b6
> [    0.393208]  ? kernel_init_freeable+0x1b6/0x1b6
> [    0.393389]  ? rest_init+0x70/0x70
> [    0.393533]  kernel_init+0x9/0x100
> [    0.393676]  ret_from_fork+0x29/0x40
> [    0.394555] ---[ end Kernel panic - not syncing: rtnetlink_init: cannot initialize rtnetlink
> [    0.394555]
> 
> I traced this down to:
> 
> rtnetlink_net_init()
>   netlink_kernel_create()
>      netlink_insert()
> 	__netlink_insert()
> 	   rhashtable_lookup_insert_key()
> 	      __rhashtable_insert_fast()
>                 rht_grow_above_max()
> 
> And indeed we have:
> 
> ht->nelems = 0
> ht->max_elems = 0
> 
> such that rht_grow_above_max() returns true.
> 
> With your commit we actually take this branch:
> 
> if (ht->p.max_size < ht->max_elems / 2)
> 	ht->max_elems = ht->p.max_size * 2;
> 
> since max_size = 0 we have max_elems = 0 as well.
> 
> Candidate fix #1:
> 
> diff --git a/include/linux/rhashtable.h b/include/linux/rhashtable.h
> index 45f89369c4c8..ad9020e1609c 100644
> --- a/include/linux/rhashtable.h
> +++ b/include/linux/rhashtable.h
> @@ -329,7 +329,7 @@ static inline bool rht_grow_above_100(const struct
> rhashtable *ht,
>  static inline bool rht_grow_above_max(const struct rhashtable *ht,
>                                       const struct bucket_table *tbl)
>  {
> -       return atomic_read(&ht->nelems) >= ht->max_elems;
> +       return ht->p.max_size && atomic_read(&ht->nelems) >= ht->max_elems;
>  }
> 
> Candidate fix #2:
> 
> diff --git a/lib/rhashtable.c b/lib/rhashtable.c
> index 751630bbe409..6b4f07760fec 100644
> --- a/lib/rhashtable.c
> +++ b/lib/rhashtable.c
> @@ -963,7 +963,7 @@ int rhashtable_init(struct rhashtable *ht,
> 
>         /* Cap total entries at 2^31 to avoid nelems overflow. */
>         ht->max_elems = 1u << 31;
> -       if (ht->p.max_size < ht->max_elems / 2)
> +       if (ht->p.max_size && (ht->p.max_size < ht->max_elems / 2))
>                 ht->max_elems = ht->p.max_size * 2;
> 
>         ht->p.min_size = max(ht->p.min_size, HASH_MIN_SIZE);
> 
> Number #2 does not introduce an additional conditional on the fastpath,
> so I suppose that would be what we would prefer?
> 
>>
>> [    0.158619] futex hash table entries: 1024 (order: 4, 65536 bytes)
>> [    0.166386] NET: Registered protocol family 16
>> [    0.179596] Kernel panic - not syncing: rtnetlink_init: cannot
>> initialize rtnetlink
>> [    0.179596]
>> [    0.189350] CPU: 0 PID: 1 Comm: swapper/0 Not tainted
>> 4.11.0-rc8-02028-g6d684e54690c #37
>> [    0.197908] Hardware name: Broadcom STB (Flattened Device Tree)
>> [    0.204254] [<c020fa18>] (unwind_backtrace) from [<c020b294>]
>> (show_stack+0x10/0x14)
>> [    0.212447] [<c020b294>] (show_stack) from [<c04bc454>]
>> (dump_stack+0x90/0xa4)
>> [    0.220144] [<c04bc454>] (dump_stack) from [<c02ab684>]
>> (panic+0xf0/0x270)
>> [    0.227460] [<c02ab684>] (panic) from [<c0c2705c>]
>> (rtnetlink_init+0x24/0x1d4)
>> [    0.235145] [<c0c2705c>] (rtnetlink_init) from [<c0c27630>]
>> (netlink_proto_init+0x124/0x148)
>> [    0.244124] [<c0c27630>] (netlink_proto_init) from [<c02017f8>]
>> (do_one_initcall+0x40/0x168)
>> [    0.253072] [<c02017f8>] (do_one_initcall) from [<c0c00dfc>]
>> (kernel_init_freeable+0x164/0x200)
>> [    0.262304] [<c0c00dfc>] (kernel_init_freeable) from [<c087bfd8>]
>> (kernel_init+0x8/0x110)
>> [    0.270970] [<c087bfd8>] (kernel_init) from [<c0207fa8>]
>> (ret_from_fork+0x14/0x2c)
>> [    0.279014] CPU1: stopping
>> [    0.281916] CPU: 1 PID: 0 Comm: swapper/1 Not tainted
>> 4.11.0-rc8-02028-g6d684e54690c #37
>> [    0.290499] Hardware name: Broadcom STB (Flattened Device Tree)
>> [    0.296796] [<c020fa18>] (unwind_backtrace) from [<c020b294>]
>> (show_stack+0x10/0x14)
>> [    0.305018] [<c020b294>] (show_stack) from [<c04bc454>]
>> (dump_stack+0x90/0xa4)
>> [    0.312684] [<c04bc454>] (dump_stack) from [<c020e984>]
>> (handle_IPI+0x170/0x190)
>> [    0.320531] [<c020e984>] (handle_IPI) from [<c020144c>]
>> (gic_handle_irq+0x88/0x8c)
>> [    0.328586] [<c020144c>] (gic_handle_irq) from [<c020bd78>]
>> (__irq_svc+0x58/0x74)
>> [    0.336543] Exception stack(0xee055f68 to 0xee055fb0)
>> [    0.341938] 5f60:                   00000001 00000000 ee055fc0
>> c0219b60 ee054000 c1603cc8
>> [    0.350661] 5f80: c1603c6c 00000000 00000000 c1486188 ee055fc0
>> c1603cd4 c1483408 ee055fb8
>> [    0.359323] 5fa0: c0208a40 c0208a44 60000013 ffffffff
>> [    0.364745] [<c020bd78>] (__irq_svc) from [<c0208a44>]
>> (arch_cpu_idle+0x38/0x3c)
>> [    0.372613] [<c0208a44>] (arch_cpu_idle) from [<c0255e98>]
>> (do_idle+0x168/0x204)
>> [    0.380479] [<c0255e98>] (do_idle) from [<c02561ac>]
>> (cpu_startup_entry+0x18/0x1c)
>> [    0.388493] [<c02561ac>] (cpu_startup_entry) from [<002014ec>] (0x2014ec)
>> [    0.395687] CPU3: stopping
>> [    0.398606] CPU: 3 PID: 0 Comm: swapper/3 Not tainted
>> 4.11.0-rc8-02028-g6d684e54690c #37
>> [    0.407242] Hardware name: Broadcom STB (Flattened Device Tree)
>> [    0.413564] [<c020fa18>] (unwind_backtrace) from [<c020b294>]
>> (show_stack+0x10/0x14)
>> [    0.421795] [<c020b294>] (show_stack) from [<c04bc454>]
>> (dump_stack+0x90/0xa4)
>> [    0.429495] [<c04bc454>] (dump_stack) from [<c020e984>]
>> (handle_IPI+0x170/0x190)
>> [    0.437394] [<c020e984>] (handle_IPI) from [<c020144c>]
>> (gic_handle_irq+0x88/0x8c)
>> [    0.445475] [<c020144c>] (gic_handle_irq) from [<c020bd78>]
>> (__irq_svc+0x58/0x74)
>> [    0.453406] Exception stack(0xee059f68 to 0xee059fb0)
>> [    0.458792] 9f60:                   00000001 00000000 ee059fc0
>> c0219b60 ee058000 c1603cc8
>> [    0.467489] 9f80: c1603c6c 00000000 00000000 c1486188 ee059fc0
>> c1603cd4 c1483408 ee059fb8
>> [    0.476177] 9fa0: c0208a40 c0208a44 60000013 ffffffff
>> [    0.481581] [<c020bd78>] (__irq_svc) from [<c0208a44>]
>> (arch_cpu_idle+0x38/0x3c)
>> [    0.489474] [<c0208a44>] (arch_cpu_idle) from [<c0255e98>]
>> (do_idle+0x168/0x204)
>> [    0.497331] [<c0255e98>] (do_idle) from [<c02561ac>]
>> (cpu_startup_entry+0x18/0x1c)
>> [    0.505369] [<c02561ac>] (cpu_startup_entry) from [<002014ec>] (0x2014ec)
>> [    0.512562] CPU2: stopping
>> [    0.515463] CPU: 2 PID: 0 Comm: swapper/2 Not tainted
>> 4.11.0-rc8-02028-g6d684e54690c #37
>> [    0.524047] Hardware name: Broadcom STB (Flattened Device Tree)
>> [    0.530368] [<c020fa18>] (unwind_backtrace) from [<c020b294>]
>> (show_stack+0x10/0x14)
>> [    0.538573] [<c020b294>] (show_stack) from [<c04bc454>]
>> (dump_stack+0x90/0xa4)
>> [    0.546195] [<c04bc454>] (dump_stack) from [<c020e984>]
>> (handle_IPI+0x170/0x190)
>> [    0.554050] [<c020e984>] (handle_IPI) from [<c020144c>]
>> (gic_handle_irq+0x88/0x8c)
>> [    0.562096] [<c020144c>] (gic_handle_irq) from [<c020bd78>]
>> (__irq_svc+0x58/0x74)
>> [    0.570044] Exception stack(0xee057f68 to 0xee057fb0)
>> [    0.575465] 7f60:                   00000001 00000000 ee057fc0
>> c0219b60 ee056000 c1603cc8
>> [    0.584145] 7f80: c1603c6c 00000000 00000000 c1486188 ee057fc0
>> c1603cd4 c1483408 ee057fb8
>> [    0.592806] 7fa0: c0208a40 c0208a44 60000013 ffffffff
>> [    0.598220] [<c020bd78>] (__irq_svc) from [<c0208a44>]
>> (arch_cpu_idle+0x38/0x3c)
>> [    0.606103] [<c0208a44>] (arch_cpu_idle) from [<c0255e98>]
>> (do_idle+0x168/0x204)
>> [    0.613960] [<c0255e98>] (do_idle) from [<c02561ac>]
>> (cpu_startup_entry+0x18/0x1c)
>> [    0.621990] [<c02561ac>] (cpu_startup_entry) from [<002014ec>] (0x2014ec)
>> [    0.629201] ---[ end Kernel panic - not syncing: rtnetlink_init:
>> cannot initialize rtnetlink
>> [    0.629201]
>>
> 
> 

I can reproduce this boot failure on s390. It bisects to
commit 6d684e54690caef45cf14051ddeb7c71beeb681b
   rhashtable: Cap total number of entries to 2^31
in linux-next from Apr 28.

[    0.452478] NET: Registered protocol family 16
[    0.477867] Kernel panic - not syncing: rtnetlink_init: cannot initialize rtnetlink
[    0.477867] 
[    0.477869] CPU: 0 PID: 1 Comm: swapper/0 Not tainted 4.11.0-rc8-02028-g6d684e5 #490
[    0.477870] Hardware name: IBM              2964 NC9              704              (KVM)
[    0.477871] Stack:
[    0.477871]        00000002743efb30 00000002743efbc0 0000000000000003 0000000000000000
[    0.477873]        00000002743efc60 00000002743efbd8 00000002743efbd8 0000000000000020
[    0.477875]        0000000000f4444e 0000000000000020 000000000000000a 000000000000000a
[    0.477877]        000000000000000c 00000002743efc28 0000000000000000 0000000000000000
[    0.477878]        0000000000958d60 00000000001125c4 00000002743efbc0 00000002743efc18
[    0.477880] Call Trace:
[    0.477882] ([<000000000011247a>] show_trace+0x62/0x78)
[    0.477883]  [<0000000000112568>] show_stack+0x68/0xe0 
[    0.477886]  [<0000000000687d46>] dump_stack+0x7e/0xb0 
[    0.477887]  [<000000000028353c>] panic+0x104/0x240 
[    0.477890]  [<0000000000ea9934>] rtnetlink_init+0x3c/0x1b8 
[    0.477951]  [<0000000000eab500>] netlink_proto_init+0x170/0x198 
[    0.477953]  [<000000000010024c>] do_one_initcall+0x4c/0x148 
[    0.477954]  [<0000000000e59d3a>] kernel_init_freeable+0x1ea/0x2a0 
[    0.477957]  [<000000000094404a>] kernel_init+0x2a/0x148 
[    0.477959]  [<000000000094e35e>] kernel_thread_starter+0x6/0xc 
[    0.477960]  [<000000000094e358>] kernel_thread_starter+0x0/0xc 
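
For reference, the arithmetic behind the failure can be modelled in
userspace. This is only a sketch under stated assumptions, not the kernel
code: struct ht_model and the helper names are hypothetical, and the
functions mirror nothing more than the two lines of rhashtable_init() and
the rht_grow_above_max() check quoted earlier in this thread.

```c
#include <stdbool.h>

/* Hypothetical userspace stand-in for the relevant rhashtable fields. */
struct ht_model {
	unsigned int max_size;   /* models ht->p.max_size; 0 means "not set" */
	unsigned int max_elems;  /* models ht->max_elems */
	unsigned int nelems;     /* models atomic_read(&ht->nelems) */
};

/* Cap computation as quoted from the commit under discussion. */
static void init_cap_orig(struct ht_model *ht)
{
	ht->max_elems = 1u << 31;
	if (ht->max_size < ht->max_elems / 2)
		ht->max_elems = ht->max_size * 2;
}

/* Cap computation with candidate fix #2 applied. */
static void init_cap_fix2(struct ht_model *ht)
{
	ht->max_elems = 1u << 31;
	if (ht->max_size && ht->max_size < ht->max_elems / 2)
		ht->max_elems = ht->max_size * 2;
}

/* Models the unmodified rht_grow_above_max() check. */
static bool grow_above_max(const struct ht_model *ht)
{
	return ht->nelems >= ht->max_elems;
}
```

With max_size left at 0, init_cap_orig() produces max_elems == 0, so
grow_above_max() is already true for an empty table. With init_cap_fix2()
the cap stays at 2^31 and the first insert is allowed.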

^ permalink raw reply

* Re: [PATCH net-next v5 1/2] net: hns: support deferred probe when can not obtain irq
From: Matthias Brugger @ 2017-04-28 10:17 UTC (permalink / raw)
  To: Yankejian, davem, salil.mehta, yisen.zhuang, lipeng321, zhouhuiru,
	huangdaode
  Cc: netdev, linuxarm
In-Reply-To: <1493362187-51671-2-git-send-email-yankejian@huawei.com>



On 28/04/17 08:49, Yankejian wrote:
> From: lipeng <lipeng321@huawei.com>
>
> In the hip06 and hip07 SoCs, the interrupt lines from the
> DSAF controllers are connected to mbigen hw module.
> The mbigen module is probed with module_init, and, as such,
> is not guaranteed to probe before the HNS driver. So we need
> to support deferred probe.
>
> Signed-off-by: lipeng <lipeng321@huawei.com>
> Reviewed-by: Yisen Zhuang <yisen.zhuang@huawei.com>
> Reviewed-by: Matthias Brugger <mbrugger@suse.com>

Looks good now, so you can keep my Reviewed-by.

> ---
> change log:
> V4 -> V5:
> 1. Float on net-next;
>
> V3 -> V4:
> 1. Delete redundant commit message;
> 2. add Reviewed-by: Matthias Brugger <mbrugger@suse.com>;
>
> V2 -> V3:
> 1. Check the return value of platform_get_irq in hns_rcb_get_cfg;
> ---
>  drivers/net/ethernet/hisilicon/hns/hns_dsaf_ppe.c | 4 +++-
>  drivers/net/ethernet/hisilicon/hns/hns_dsaf_rcb.c | 8 +++++++-
>  drivers/net/ethernet/hisilicon/hns/hns_dsaf_rcb.h | 2 +-
>  3 files changed, 11 insertions(+), 3 deletions(-)
>
> diff --git a/drivers/net/ethernet/hisilicon/hns/hns_dsaf_ppe.c b/drivers/net/ethernet/hisilicon/hns/hns_dsaf_ppe.c
> index eba406b..93e71e2 100644
> --- a/drivers/net/ethernet/hisilicon/hns/hns_dsaf_ppe.c
> +++ b/drivers/net/ethernet/hisilicon/hns/hns_dsaf_ppe.c
> @@ -510,7 +510,9 @@ int hns_ppe_init(struct dsaf_device *dsaf_dev)
>
>  		hns_ppe_get_cfg(dsaf_dev->ppe_common[i]);
>
> -		hns_rcb_get_cfg(dsaf_dev->rcb_common[i]);
> +		ret = hns_rcb_get_cfg(dsaf_dev->rcb_common[i]);
> +		if (ret)
> +			goto get_cfg_fail;
>  	}
>
>  	for (i = 0; i < HNS_PPE_COM_NUM; i++)
> diff --git a/drivers/net/ethernet/hisilicon/hns/hns_dsaf_rcb.c b/drivers/net/ethernet/hisilicon/hns/hns_dsaf_rcb.c
> index c20a0f4..e2e2853 100644
> --- a/drivers/net/ethernet/hisilicon/hns/hns_dsaf_rcb.c
> +++ b/drivers/net/ethernet/hisilicon/hns/hns_dsaf_rcb.c
> @@ -492,7 +492,7 @@ static int hns_rcb_get_base_irq_idx(struct rcb_common_cb *rcb_common)
>   *hns_rcb_get_cfg - get rcb config
>   *@rcb_common: rcb common device
>   */
> -void hns_rcb_get_cfg(struct rcb_common_cb *rcb_common)
> +int hns_rcb_get_cfg(struct rcb_common_cb *rcb_common)
>  {
>  	struct ring_pair_cb *ring_pair_cb;
>  	u32 i;
> @@ -517,10 +517,16 @@ void hns_rcb_get_cfg(struct rcb_common_cb *rcb_common)
>  		ring_pair_cb->virq[HNS_RCB_IRQ_IDX_RX] =
>  		is_ver1 ? platform_get_irq(pdev, base_irq_idx + i * 2 + 1) :
>  			  platform_get_irq(pdev, base_irq_idx + i * 3);
> +		if ((ring_pair_cb->virq[HNS_RCB_IRQ_IDX_TX] == -EPROBE_DEFER) ||
> +		    (ring_pair_cb->virq[HNS_RCB_IRQ_IDX_RX] == -EPROBE_DEFER))
> +			return -EPROBE_DEFER;
> +
>  		ring_pair_cb->q.phy_base =
>  			RCB_COMM_BASE_TO_RING_BASE(rcb_common->phy_base, i);
>  		hns_rcb_ring_pair_get_cfg(ring_pair_cb);
>  	}
> +
> +	return 0;
>  }
>
>  /**
> diff --git a/drivers/net/ethernet/hisilicon/hns/hns_dsaf_rcb.h b/drivers/net/ethernet/hisilicon/hns/hns_dsaf_rcb.h
> index a664ee8..6028164 100644
> --- a/drivers/net/ethernet/hisilicon/hns/hns_dsaf_rcb.h
> +++ b/drivers/net/ethernet/hisilicon/hns/hns_dsaf_rcb.h
> @@ -121,7 +121,7 @@ struct rcb_common_cb {
>  void hns_rcb_common_free_cfg(struct dsaf_device *dsaf_dev, u32 comm_index);
>  int hns_rcb_common_init_hw(struct rcb_common_cb *rcb_common);
>  void hns_rcb_start(struct hnae_queue *q, u32 val);
> -void hns_rcb_get_cfg(struct rcb_common_cb *rcb_common);
> +int hns_rcb_get_cfg(struct rcb_common_cb *rcb_common);
>  void hns_rcb_get_queue_mode(enum dsaf_mode dsaf_mode,
>  			    u16 *max_vfn, u16 *max_q_per_vf);
>
>
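
The error propagation this patch adds can be illustrated with a small
userspace model. This is a sketch only: mock_get_irq(), mock_get_cfg(),
irqchip_ready and NUM_RINGS are hypothetical names standing in for
platform_get_irq(), hns_rcb_get_cfg() and the mbigen probe state, and
EPROBE_DEFER is defined locally because it is a kernel-internal errno.

```c
#define EPROBE_DEFER 517  /* kernel-internal value; not in userspace <errno.h> */
#define NUM_RINGS 4       /* hypothetical ring count for this sketch */

/* Hypothetical flag: nonzero once the interrupt controller has probed. */
static int irqchip_ready;

/* Stand-in for platform_get_irq(): defers until the irqchip is ready. */
static int mock_get_irq(int index)
{
	return irqchip_ready ? 100 + index : -EPROBE_DEFER;
}

/* Stand-in for hns_rcb_get_cfg(): returns int so that the caller can
 * see the deferral instead of continuing with invalid IRQ numbers. */
static int mock_get_cfg(int virq[NUM_RINGS][2])
{
	for (int i = 0; i < NUM_RINGS; i++) {
		virq[i][0] = mock_get_irq(i * 2);      /* TX */
		virq[i][1] = mock_get_irq(i * 2 + 1);  /* RX */
		if (virq[i][0] == -EPROBE_DEFER ||
		    virq[i][1] == -EPROBE_DEFER)
			return -EPROBE_DEFER;
	}
	return 0;
}
```

A caller modelled on hns_ppe_init() would check the return value and bail
out, so that the driver core can retry the probe later.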

^ permalink raw reply

* [PATCH net] ipv4: Don't pass IP fragments to upper layer GRO handlers.
From: Steffen Klassert @ 2017-04-28  8:54 UTC (permalink / raw)
  To: David Miller; +Cc: netdev

Upper layer GRO handlers can not handle IP fragments, so
exit GRO processing in this case.

This fixes ESP GRO because the packet must be reassembled
before we can decapsulate, otherwise we get authentication
failures.

It also aligns IPv4 to IPv6 where packets with fragmentation
headers are not passed to upper layer GRO handlers.

Fixes: 7785bba299a8 ("esp: Add a software GRO codepath")
Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
---
 net/ipv4/af_inet.c | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/net/ipv4/af_inet.c b/net/ipv4/af_inet.c
index 6b1fc6e..13a9a32 100644
--- a/net/ipv4/af_inet.c
+++ b/net/ipv4/af_inet.c
@@ -1343,6 +1343,9 @@ struct sk_buff **inet_gro_receive(struct sk_buff **head, struct sk_buff *skb)
 	if (*(u8 *)iph != 0x45)
 		goto out_unlock;

+	if (ip_is_fragment(iph))
+		goto out_unlock;
+
 	if (unlikely(ip_fast_csum((u8 *)iph, 5)))
 		goto out_unlock;

-- 
2.7.4
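
For readers unfamiliar with the helper, the test performed by
ip_is_fragment() can be sketched in userspace as below. The names
is_fragment(), FRAG_MF and FRAG_OFFSET_MASK are hypothetical; the two
constants are assumed to carry the same values as the kernel's IP_MF and
IP_OFFSET. A packet counts as a fragment if the "more fragments" flag is
set or the fragment offset is nonzero.

```c
#include <arpa/inet.h>
#include <stdbool.h>
#include <stdint.h>

#define FRAG_MF          0x2000  /* "more fragments" flag (cf. IP_MF) */
#define FRAG_OFFSET_MASK 0x1FFF  /* fragment offset bits (cf. IP_OFFSET) */

/* frag_off_be is the IPv4 frag_off field in network byte order. */
static bool is_fragment(uint16_t frag_off_be)
{
	return (frag_off_be & htons(FRAG_MF | FRAG_OFFSET_MASK)) != 0;
}
```

A packet with only the "don't fragment" bit (0x4000) set is not a
fragment under this test, so it would still proceed through GRO.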

^ permalink raw reply related

* [PATCH 2/2] ipvs: change comparison on sync_refresh_period
From: Simon Horman @ 2017-04-28 10:11 UTC (permalink / raw)
  To: Pablo Neira Ayuso
  Cc: lvs-devel, netdev, netfilter-devel, Wensong Zhang,
	Julian Anastasov, Aaron Conole, Simon Horman
In-Reply-To: <20170428101159.9810-1-horms@verge.net.au>

From: Aaron Conole <aconole@bytheb.org>

The sync_refresh_period variable is unsigned, so it can never be < 0.

Signed-off-by: Aaron Conole <aconole@bytheb.org>
Signed-off-by: Simon Horman <horms@verge.net.au>
---
 net/netfilter/ipvs/ip_vs_sync.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/net/netfilter/ipvs/ip_vs_sync.c b/net/netfilter/ipvs/ip_vs_sync.c
index 30d6b2cc00a0..0e5b64a75da0 100644
--- a/net/netfilter/ipvs/ip_vs_sync.c
+++ b/net/netfilter/ipvs/ip_vs_sync.c
@@ -520,7 +520,7 @@ static int ip_vs_sync_conn_needed(struct netns_ipvs *ipvs,
 		if (!(cp->flags & IP_VS_CONN_F_TEMPLATE) &&
 		    pkts % sync_period != sysctl_sync_threshold(ipvs))
 			return 0;
-	} else if (sync_refresh_period <= 0 &&
+	} else if (!sync_refresh_period &&
 		   pkts != sysctl_sync_threshold(ipvs))
 		return 0;
 
-- 
2.12.2.816.g2cccc81164
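
The equivalence this change relies on can be checked with a short sketch.
The function names are hypothetical and the parameter merely models
sync_refresh_period as an unsigned int: for an unsigned value, "<= 0" can
only be true when the value is exactly 0.

```c
#include <stdbool.h>

/* Comparison as written before the patch. */
static bool old_test(unsigned int period)
{
	return period <= 0;
}

/* Comparison as written after the patch. */
static bool new_test(unsigned int period)
{
	return !period;
}
```

Both functions return true only for 0, so the patch changes how the
condition reads and not what it does.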


^ permalink raw reply related

* [PATCH 1/2] ipvs: remove unused function ip_vs_set_state_timeout
From: Simon Horman @ 2017-04-28 10:11 UTC (permalink / raw)
  To: Pablo Neira Ayuso
  Cc: lvs-devel, netdev, netfilter-devel, Wensong Zhang,
	Julian Anastasov, Aaron Conole, Simon Horman
In-Reply-To: <20170428101159.9810-1-horms@verge.net.au>

From: Aaron Conole <aconole@bytheb.org>

There are no in-tree callers of this function and it isn't exported.

Signed-off-by: Aaron Conole <aconole@bytheb.org>
Signed-off-by: Simon Horman <horms@verge.net.au>
---
 include/net/ip_vs.h              |  2 --
 net/netfilter/ipvs/ip_vs_proto.c | 22 ----------------------
 2 files changed, 24 deletions(-)

diff --git a/include/net/ip_vs.h b/include/net/ip_vs.h
index 632082300e77..4f4f786255ef 100644
--- a/include/net/ip_vs.h
+++ b/include/net/ip_vs.h
@@ -1349,8 +1349,6 @@ int ip_vs_protocol_init(void);
 void ip_vs_protocol_cleanup(void);
 void ip_vs_protocol_timeout_change(struct netns_ipvs *ipvs, int flags);
 int *ip_vs_create_timeout_table(int *table, int size);
-int ip_vs_set_state_timeout(int *table, int num, const char *const *names,
-			    const char *name, int to);
 void ip_vs_tcpudp_debug_packet(int af, struct ip_vs_protocol *pp,
 			       const struct sk_buff *skb, int offset,
 			       const char *msg);
diff --git a/net/netfilter/ipvs/ip_vs_proto.c b/net/netfilter/ipvs/ip_vs_proto.c
index 8ae480715cea..ca880a3ad033 100644
--- a/net/netfilter/ipvs/ip_vs_proto.c
+++ b/net/netfilter/ipvs/ip_vs_proto.c
@@ -193,28 +193,6 @@ ip_vs_create_timeout_table(int *table, int size)
 }
 
 
-/*
- *	Set timeout value for state specified by name
- */
-int
-ip_vs_set_state_timeout(int *table, int num, const char *const *names,
-			const char *name, int to)
-{
-	int i;
-
-	if (!table || !name || !to)
-		return -EINVAL;
-
-	for (i = 0; i < num; i++) {
-		if (strcmp(names[i], name))
-			continue;
-		table[i] = to * HZ;
-		return 0;
-	}
-	return -ENOENT;
-}
-

^ permalink raw reply related

* [GIT PULL 0/2] Third Round of IPVS Updates for v4.12
From: Simon Horman @ 2017-04-28 10:11 UTC (permalink / raw)
  To: Pablo Neira Ayuso
  Cc: lvs-devel, netdev, netfilter-devel, Wensong Zhang,
	Julian Anastasov, Simon Horman

Hi Pablo,

please consider these enhancements to IPVS for v4.12.
If it is too late for v4.12 then please consider them for v4.13.

* Remove unused function
* Correct comparison of unsigned value

The following changes since commit 9a08ecfe74d7796ddc92ec312d3b7eaeba5a7c22:

  netfilter: don't attach a nat extension by default (2017-04-26 09:30:22 +0200)

are available in the git repository at:

  http://git.kernel.org/pub/scm/linux/kernel/git/horms/ipvs-next.git tags/ipvs3-for-v4.12

for you to fetch changes up to fb90e8dedb465bd06512f718b139ed8680d26dbe:

  ipvs: change comparison on sync_refresh_period (2017-04-28 12:00:10 +0200)

----------------------------------------------------------------
Aaron Conole (2):
      ipvs: remove unused function ip_vs_set_state_timeout
      ipvs: change comparison on sync_refresh_period

 include/net/ip_vs.h              |  2 --
 net/netfilter/ipvs/ip_vs_proto.c | 22 ----------------------
 net/netfilter/ipvs/ip_vs_sync.c  |  2 +-
 3 files changed, 1 insertion(+), 25 deletions(-)

^ permalink raw reply

* [PATCH 1/1] ipvs: explicitly forbid ipv6 service/dest creation if ipv6 mod is disabled
From: Simon Horman @ 2017-04-28 10:11 UTC (permalink / raw)
  To: Pablo Neira Ayuso
  Cc: lvs-devel, netdev, netfilter-devel, Wensong Zhang,
	Julian Anastasov, Paolo Abeni, Simon Horman
In-Reply-To: <20170428101154.9750-1-horms@verge.net.au>

From: Paolo Abeni <pabeni@redhat.com>

When creating a new ipvs service, ipv6 addresses are always accepted
if CONFIG_IP_VS_IPV6 is enabled. On dest creation the address family
is not explicitly checked.

This allows user space to configure ipvs services even if the
system is booted with ipv6.disable=1. On specific configurations, ipvs
can try to call ipv6 routing code at setup time, causing the kernel to
oops due to fib6_rules_ops being NULL.

This change addresses the issue by adding a check for the ipv6
module being enabled while validating ipv6 service operations and
adding the same validation for dest operations.

According to git history, this issue has apparently been present since
the introduction of ipv6 support, and the oops can be triggered
since commit 09571c7ae30865ad ("IPVS: Add function to determine
if IPv6 address is local").

Fixes: 09571c7ae30865ad ("IPVS: Add function to determine if IPv6 address is local")
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Acked-by: Julian Anastasov <ja@ssi.bg>
Signed-off-by: Simon Horman <horms@verge.net.au>
---
 net/netfilter/ipvs/ip_vs_ctl.c | 22 +++++++++++++++++-----
 1 file changed, 17 insertions(+), 5 deletions(-)

diff --git a/net/netfilter/ipvs/ip_vs_ctl.c b/net/netfilter/ipvs/ip_vs_ctl.c
index 5aeb0dde6ccc..4d753beaac32 100644
--- a/net/netfilter/ipvs/ip_vs_ctl.c
+++ b/net/netfilter/ipvs/ip_vs_ctl.c
@@ -3078,6 +3078,17 @@ static int ip_vs_genl_dump_services(struct sk_buff *skb,
 	return skb->len;
 }
 
+static bool ip_vs_is_af_valid(int af)
+{
+	if (af == AF_INET)
+		return true;
+#ifdef CONFIG_IP_VS_IPV6
+	if (af == AF_INET6 && ipv6_mod_enabled())
+		return true;
+#endif
+	return false;
+}
+
 static int ip_vs_genl_parse_service(struct netns_ipvs *ipvs,
 				    struct ip_vs_service_user_kern *usvc,
 				    struct nlattr *nla, int full_entry,
@@ -3104,11 +3115,7 @@ static int ip_vs_genl_parse_service(struct netns_ipvs *ipvs,
 	memset(usvc, 0, sizeof(*usvc));
 
 	usvc->af = nla_get_u16(nla_af);
-#ifdef CONFIG_IP_VS_IPV6
-	if (usvc->af != AF_INET && usvc->af != AF_INET6)
-#else
-	if (usvc->af != AF_INET)
-#endif
+	if (!ip_vs_is_af_valid(usvc->af))
 		return -EAFNOSUPPORT;
 
 	if (nla_fwmark) {
@@ -3610,6 +3617,11 @@ static int ip_vs_genl_set_cmd(struct sk_buff *skb, struct genl_info *info)
 		if (udest.af == 0)
 			udest.af = svc->af;
 
+		if (!ip_vs_is_af_valid(udest.af)) {
+			ret = -EAFNOSUPPORT;
+			goto out;
+		}
+
 		if (udest.af != svc->af && cmd != IPVS_CMD_DEL_DEST) {
 			/* The synchronization protocol is incompatible
 			 * with mixed family services
-- 
2.12.2.816.g2cccc81164


^ permalink raw reply related

* [GIT PULL v2 0/1] IPVS Fixes for v4.11
From: Simon Horman @ 2017-04-28 10:11 UTC (permalink / raw)
  To: Pablo Neira Ayuso
  Cc: lvs-devel, netdev, netfilter-devel, Wensong Zhang,
	Julian Anastasov, Simon Horman

Hi Pablo,

please consider this fix to IPVS for v4.11.
Or if it is too late for v4.11 please consider it for v4.12.
I would also like it considered for stable.

* Explicitly forbid ipv6 service/dest creation if ipv6 mod is disabled
  to avoid oops caused by IPVS accessing IPv6 routing code in such
  circumstances.

Change since v1 of pull request:
* Rebase on nf
* Correct URL; it should be ipvs not ipvs-next


The following changes since commit 9dd2ab609eef736d5639e0de1bcc2e71e714b28e:

  netfilter: Wrong icmp6 checksum for ICMPV6_TIME_EXCEED in reverse SNATv6 path (2017-04-25 11:10:38 +0200)

are available in the git repository at:

  http://git.kernel.org/pub/scm/linux/kernel/git/horms/ipvs.git ipvs-fixes-for-v4.11

for you to fetch changes up to 1442f6f7c1b77de1c508318164a527e240c24a4d:

  ipvs: explicitly forbid ipv6 service/dest creation if ipv6 mod is disabled (2017-04-28 12:04:35 +0200)

----------------------------------------------------------------
Paolo Abeni (1):
      ipvs: explicitly forbid ipv6 service/dest creation if ipv6 mod is disabled

 net/netfilter/ipvs/ip_vs_ctl.c | 22 +++++++++++++++++-----
 1 file changed, 17 insertions(+), 5 deletions(-)

^ permalink raw reply

* pull request (net-next): ipsec-next 2017-04-28
From: Steffen Klassert @ 2017-04-28  8:42 UTC (permalink / raw)
  To: David Miller; +Cc: Herbert Xu, Steffen Klassert, netdev

Just one patch to fix a misplaced spin_unlock_bh in an error path.

Please pull or let me know if there are problems.

Thanks!

The following changes since commit e2989ee9746b3f2e78d1a39bbc402d884e8b8bf1:

  bpf, doc: update list of architectures that do eBPF JIT (2017-04-23 15:56:48 -0400)

are available in the git repository at:

  git://git.kernel.org/pub/scm/linux/kernel/git/klassert/ipsec-next.git master

for you to fetch changes up to e892d2d40445a14a19530a2be8c489b87bcd7c19:

  esp: Fix misplaced spin_unlock_bh. (2017-04-24 07:56:31 +0200)

----------------------------------------------------------------
Steffen Klassert (1):
      esp: Fix misplaced spin_unlock_bh.

 net/ipv4/esp4.c | 6 +-----
 net/ipv6/esp6.c | 6 +-----
 2 files changed, 2 insertions(+), 10 deletions(-)

^ permalink raw reply
