Netdev List
 help / color / mirror / Atom feed
* Re: [PATCH] net/udp: do not touch skb->peeked unless really needed
From: Paolo Abeni @ 2016-12-06 10:34 UTC (permalink / raw)
  To: Eric Dumazet; +Cc: David Miller, netdev, Willem de Bruijn
In-Reply-To: <1480960639.18162.556.camel@edumazet-glaptop3.roam.corp.google.com>

Hi Eric,

On Mon, 2016-12-05 at 09:57 -0800, Eric Dumazet wrote:
> From: Eric Dumazet <edumazet@google.com>
> 
> In UDP recvmsg() path we currently access 3 cache lines from an skb
> while holding receive queue lock, plus another one if packet is
> dequeued, since we need to change skb->next->prev
> 
> 1st cache line (contains ->next/prev pointers, offsets 0x00 and 0x08)
> 2nd cache line (skb->len & skb->peeked, offsets 0x80 and 0x8e)
> 3rd cache line (skb->truesize/users, offsets 0xe0 and 0xe4)
> 
> skb->peeked is only needed to make sure 0-length packets are properly
> handled while MSG_PEEK is operated.
> 
> I had first the intent to remove skb->peeked but the "MSG_PEEK at
> non-zero offset" support added by Sam Kumar makes this not possible.
> 
> This patch avoids one cache line miss during the locked section, when
> skb->len and skb->peeked do not have to be read.
> 
> It also avoids the skb_set_peeked() cost for non empty UDP datagrams.
> 
> Signed-off-by: Eric Dumazet <edumazet@google.com>

Thank you for all the good work.

After all your improvement, I see the cacheline miss in inet_recvmsg()
as a major perf offender for the user space process in the udp flood
scenario due to skc_rxhash sharing the same sk_drops cacheline.

Using an udp-specific drop counter (and an sk_drops accessor to wrap
sk_drops access where needed), we could avoid such cache miss. With that
- patch for udp.h only below - I get 3% improvement on top of all the
pending udp patches, and the gain should be more relevant after the 2
queues rework. What do you think ?

Cheers,

Paolo.
---
diff --git a/include/linux/udp.h b/include/linux/udp.h
index d1fd8cd..2bbf5db 100644
--- a/include/linux/udp.h
+++ b/include/linux/udp.h
@@ -49,7 +49,11 @@ struct udp_sock {
        unsigned int     corkflag;      /* Cork is required */
        __u8             encap_type;    /* Is this an Encapsulation socket? */
        unsigned char    no_check6_tx:1,/* Send zero UDP6 checksums on TX? */
-                        no_check6_rx:1;/* Allow zero UDP6 checksums on RX? */
+                        no_check6_rx:1,/* Allow zero UDP6 checksums on RX? */
+                        pcflag:6;      /* UDP-Lite specific, moved here to */
+                                       /* fill an hole, marks socket as */
+                                       /* UDP-Lite if > 0    */
+
        /*
         * Following member retains the information to create a UDP header
         * when the socket is uncorked.
@@ -64,8 +68,7 @@ struct udp_sock {
 #define UDPLITE_BIT      0x1           /* set by udplite proto init function */
 #define UDPLITE_SEND_CC  0x2           /* set via udplite setsockopt         */
 #define UDPLITE_RECV_CC  0x4           /* set via udplite setsocktopt        */
-       __u8             pcflag;        /* marks socket as UDP-Lite if > 0    */
-       __u8             unused[3];
+
        /*
         * For encapsulation sockets.
         */
@@ -79,6 +82,9 @@ struct udp_sock {
        int                     (*gro_complete)(struct sock *sk,
                                                struct sk_buff *skb,
                                                int nhoff);
+
+       /* since we are prone to drops, avoid dirtying any sk cacheline */
+       atomic_t                drops ____cacheline_aligned_in_smp;
 };
 
 static inline struct udp_sock *udp_sk(const struct sock *sk)

^ permalink raw reply related

* Re: [PATCHv2 net] team: team_port_add should check link_up before enable port
From: Xin Long @ 2016-12-06 10:38 UTC (permalink / raw)
  To: Jiri Pirko; +Cc: network dev, davem, Marcelo Ricardo Leitner
In-Reply-To: <20161206082233.GD1984@nanopsycho>

[-- Attachment #1: Type: text/plain, Size: 2719 bytes --]

On Tue, Dec 6, 2016 at 4:22 PM, Jiri Pirko <jiri@resnulli.us> wrote:
> Tue, Dec 06, 2016 at 08:29:08AM CET, lucien.xin@gmail.com wrote:
>>Now when users add a nic to team dev, the option 'enable' of the port
>>is true by default, as team_port_enable enables it after dev_open in
>>team_port_add.
>>
>>But even if the port_dev has no carrier, like it's cable was unpluged,
>>the port is still enabled. It leads to that team dev couldn't work well
>>if this port was chosen to connect, and has no chance to change to use
>>other ports if link_watch is ethtool.
>>
>>This patch is to enable the port only when the port_dev has carrier in
>>team_port_add.
>>
>>v1 -> v2:
>>  use netif_carrier_ok() instead of !!netif_carrier_ok(), as it returns
>>  bool now.
>>
>>Signed-off-by: Xin Long <lucien.xin@gmail.com>
>>---
>> drivers/net/team/team.c | 8 ++++++--
>> 1 file changed, 6 insertions(+), 2 deletions(-)
>>
>>diff --git a/drivers/net/team/team.c b/drivers/net/team/team.c
>>index a380649..4bc0103 100644
>>--- a/drivers/net/team/team.c
>>+++ b/drivers/net/team/team.c
>>@@ -1140,6 +1140,7 @@ static int team_port_add(struct team *team, struct net_device *port_dev)
>>       struct net_device *dev = team->dev;
>>       struct team_port *port;
>>       char *portname = port_dev->name;
>>+      bool linkup;
>>       int err;
>>
>>       if (port_dev->flags & IFF_LOOPBACK) {
>>@@ -1249,9 +1250,12 @@ static int team_port_add(struct team *team, struct net_device *port_dev)
>>
>>       port->index = -1;
>>       list_add_tail_rcu(&port->list, &team->port_list);
>>-      team_port_enable(team, port);
>>+      linkup = netif_carrier_ok(port_dev);
>>+      if (linkup)
>>+              team_port_enable(team, port);
>
> This is obviously wrong change. It you use a simple setup without
> userspace part (e.g. round robin), The port gets never enabled.
> team_port_enabl is called from here and team_port_en_option_set.
yes, without userspace part, that would be a problem.

>
> By default, all ports should be enabled. Only in case the userspace
> daemon decides to disable, it does so.
>
> Could you send me the exact configuration where you see the issue?
attachment is the scripts,

# ./setup.sh
# ip netns exec ns1 ./team1.sh
# ./team0.sh
# ping 192.168.11.3

the issue is team0 cannot switch to veth2, even if the veth0 has no carrier.

> This should be definitelly fixed in user part.
now it can disable/enable it in lw_ethtool_event_watch_port_changed()
when receive event from kernel, but at the beginning, the first event from
adding port will not going to team_port_en_option_set, as new_link_up
== common_ppriv->link_up in lw_ethtool_event_watch_port_changed().

maybe it should be fixed there ?

>
> Thanks.

[-- Attachment #2: team-scripts.tar.gz --]
[-- Type: application/x-gzip, Size: 561 bytes --]

^ permalink raw reply

* Re: [PATCH v2 net-next 1/2] flow dissector: ICMP support
From: Simon Horman @ 2016-12-06 10:51 UTC (permalink / raw)
  To: Tom Herbert
  Cc: Jiri Pirko, David Miller, Linux Kernel Network Developers,
	Jay Vosburgh, Veaceslav Falico, Andy Gospodarek, Jamal Hadi Salim,
	Jiri Pirko
In-Reply-To: <CALx6S37f-TFAhAidri12iD0KnhFE2kYuVnsXsj7yjivk-ovRiw@mail.gmail.com>

On Mon, Dec 05, 2016 at 09:29:45AM -0800, Tom Herbert wrote:
> On Mon, Dec 5, 2016 at 6:23 AM, Simon Horman <simon.horman@netronome.com> wrote:
> > On Sat, Dec 03, 2016 at 10:52:32AM -0800, Tom Herbert wrote:
> >> On Sat, Dec 3, 2016 at 2:49 AM, Jiri Pirko <jiri@resnulli.us> wrote:
> >> > Fri, Dec 02, 2016 at 09:31:41PM CET, simon.horman@netronome.com wrote:
> >> >>Allow dissection of ICMP(V6) type and code. This re-uses transport layer
> >> >>port dissection code as although ICMP is not a transport protocol and their
> >> >>type and code are not ports this allows sharing of both code and storage.
> >> >>
> >> >>Signed-off-by: Simon Horman <simon.horman@netronome.com>
> >
> > ...
> >
> >> > Digging into this a bit more. I think it would be much nice not to mix
> >> > up l4 ports and icmp stuff.
> >> >
> >> > How about to have FLOW_DISSECTOR_KEY_ICMP
> >> > and
> >> > struct flow_dissector_key_icmp {
> >> >         u8 type;
> >> >         u8 code;
> >> > };
> >> >
> >> > The you can make this structure and struct flow_dissector_key_ports into
> >> > an union in struct flow_keys.
> >> >
> >> > Looks much cleaner to me.
> >
> > Hi Jiri,
> >
> > I wondered about that too. The main reason that I didn't do this the first
> > time around is that I wasn't sure what to do to differentiate between ports
> > and icmp in fl_init_dissector() given that using a union would could to
> > mask bits being set in both if they are set in either.
> >
> > I've taken a stab at addressing that below along with implementing your
> > suggestion. If the treatment in fl_init_dissector() is acceptable
> > perhaps something similar should be done for FLOW_DISSECTOR_KEY_IPV[46]_ADDRS
> > too?
> >
> >> I agree, this patch adds to many conditionals into the fast path for
> >> ICMP handling. Neither is there much point in using type and code as
> >> input to the packet hash.
> >
> > Hi Tom,
> >
> > I'm not sure that I follow this. A packet may be ICMP or not regardless of
> > if FLOW_DISSECTOR_KEY_PORT (and now FLOW_DISSECTOR_KEY_ICMP) port is set or
> > not. I'd appreciate some guidance here.
> >
> > First-pass at implementing Jiri's suggestion follows:
> >

...

> > diff --git a/net/core/flow_dissector.c b/net/core/flow_dissector.c
> > index 0584b4bb4390..33e5fa2b3e87 100644
> > --- a/net/core/flow_dissector.c
> > +++ b/net/core/flow_dissector.c

...

> > @@ -975,6 +983,10 @@ static const struct flow_dissector_key flow_keys_dissector_keys[] = {
> >                 .offset = offsetof(struct flow_keys, ports),
> >         },
> >         {
> > +               .key_id = FLOW_DISSECTOR_KEY_ICMP,
> > +               .offset = offsetof(struct flow_keys, icmp),
> > +       },
> 
> This is the problem I was referring to. This adds ICMP to the keys for
> computing the skb hash and the ICMP type and code are in a union with
> port numbers so they are in the range over which the hash is computed.
> This means that we are treating type and code as some sort of flow and
> and so different type/code between the same src/dst could follow
> different paths in ECMP. For the purposes of computing a packet hash I
> don't think we want ICMP information included. This might be a matter
> of not putting FLOW_DISSECTOR_KEY_ICMP in flow_keys_dissector_keys,
> where we need this information we could define another structure that
> has all the same fields as in flow_keys_dissector_keys and include
> FLOW_DISSECTOR_KEY_ICMP. Alternatively, the flow_dissector_key_icmp
> could also be outside of the area that is used in the hash: that is no
> in flow_keys_hash_start(keys) to flow_keys_hash_length(keys), keyval);

...

> > diff --git a/net/sched/cls_flower.c b/net/sched/cls_flower.c
> > index c96b2a349779..f4ba69bd362f 100644
> > --- a/net/sched/cls_flower.c
> > +++ b/net/sched/cls_flower.c
> > @@ -38,7 +38,10 @@ struct fl_flow_key {
> >                 struct flow_dissector_key_ipv4_addrs ipv4;
> >                 struct flow_dissector_key_ipv6_addrs ipv6;
> >         };
> > -       struct flow_dissector_key_ports tp;
> > +       union {
> > +               struct flow_dissector_key_ports tp;
> > +               struct flow_dissector_key_icmp icmp;
> > +       };
> 
> When an ICMP packet is encapsulated within UDP then icmp overwrites
> valid port information with is a stronger signal of the flow (unless
> FLOW_DISSECTOR_F_STOP_AT_ENCAP is set). This is another reason not to
> use a union with ports.

...

Hi Tom,

thanks for your explanation. I think I have a clearer picture now.

I have reworked things to try to address your concerns.
In particular:

* FLOW_DISSECTOR_KEY_ICMP is no longer added to flow_keys_dissector_keys.
  I don't think it is needed at all for the use-case I currently have in
  mind, which is classifying packets using tc_flower. And adding it there
  was an error on my part.

* I stopped using a union for ports and icmp. At the very least this seems
  to make it easier to reason about things as we no longer need to consider
  that a port value may in fact be an ICMP type or code.

  This seems to allow us avoid adding a number of is_icmp checks (as I believe
  you pointed out earlier). And should also address the problem you state
  immediately above.

* I have also placed icmp outside of the range flow_keys_hash_start(keys)
  to flow_keys_hash_length(keys), keyval). This is because I now see no
  value of having it inside that range and it both avoids any chance of
  contaminating the hash with ICMP values and hashing over unwanted
  (hopefully zero) values.

  This required an update to flow_keys_hash_length() as the bound
  of the end of fields the hash is no longer the end of struct flow_keys.
  It seems clean but I wonder if there are similar assumptions lurking
  elsewhere.

I have lightly tested this for the tc_flower case (only).

Incremental patch on top of this series:

diff --git a/drivers/net/bonding/bond_main.c b/drivers/net/bonding/bond_main.c
index a6f75cfb2bf7..8029dd4912b6 100644
--- a/drivers/net/bonding/bond_main.c
+++ b/drivers/net/bonding/bond_main.c
@@ -3181,8 +3181,7 @@ static bool bond_flow_dissect(struct bonding *bond, struct sk_buff *skb,
 	} else {
 		return false;
 	}
-	if (bond->params.xmit_policy == BOND_XMIT_POLICY_LAYER34 &&
-	    proto >= 0 && !skb_flow_is_icmp_any(skb, proto))
+	if (bond->params.xmit_policy == BOND_XMIT_POLICY_LAYER34 && proto >= 0)
 		fk->ports.ports = skb_flow_get_ports(skb, noff, proto);
 
 	return true;
@@ -3210,8 +3209,7 @@ u32 bond_xmit_hash(struct bonding *bond, struct sk_buff *skb)
 		return bond_eth_hash(skb);
 
 	if (bond->params.xmit_policy == BOND_XMIT_POLICY_LAYER23 ||
-	    bond->params.xmit_policy == BOND_XMIT_POLICY_ENCAP23 ||
-	    flow_keys_are_icmp_any(&flow))
+	    bond->params.xmit_policy == BOND_XMIT_POLICY_ENCAP23)
 		hash = bond_eth_hash(skb);
 	else
 		hash = (__force u32)flow.ports.ports;
diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h
index 44a8f69a9198..9c535fbccf2c 100644
--- a/include/linux/skbuff.h
+++ b/include/linux/skbuff.h
@@ -1094,11 +1094,6 @@ u32 __skb_get_poff(const struct sk_buff *skb, void *data,
 __be32 __skb_flow_get_ports(const struct sk_buff *skb, int thoff, u8 ip_proto,
 			    void *data, int hlen_proto);
 
-static inline bool skb_flow_is_icmp_any(const struct sk_buff *skb, u8 ip_proto)
-{
-	return flow_protos_are_icmp_any(skb->protocol, ip_proto);
-}
-
 static inline __be32 skb_flow_get_ports(const struct sk_buff *skb,
 					int thoff, u8 ip_proto)
 {
diff --git a/include/net/flow_dissector.h b/include/net/flow_dissector.h
index 5540dfa18872..058c9df8a4d8 100644
--- a/include/net/flow_dissector.h
+++ b/include/net/flow_dissector.h
@@ -91,14 +91,10 @@ struct flow_dissector_key_addrs {
 
 /**
  * flow_dissector_key_ports:
- *	@ports: port numbers of Transport header or
- *		type and code of ICMP header
+ *	@ports: port numbers of Transport header
  *		ports: source (high) and destination (low) port numbers
  *		src: source port number
  *		dst: destination port number
- *		icmp: ICMP type (high) and code (low)
- *		type: ICMP type
- *		type: ICMP code
  */
 struct flow_dissector_key_ports {
 	union {
@@ -107,6 +103,18 @@ struct flow_dissector_key_ports {
 			__be16 src;
 			__be16 dst;
 		};
+	};
+};
+
+/**
+ * flow_dissector_key_icmp:
+ *	@ports: type and code of ICMP header
+ *		icmp: ICMP type (high) and code (low)
+ *		type: ICMP type
+ *		code: ICMP code
+ */
+struct flow_dissector_key_icmp {
+	union {
 		__be16 icmp;
 		struct {
 			u8 type;
@@ -115,7 +123,6 @@ struct flow_dissector_key_ports {
 	};
 };
 
-
 /**
  * struct flow_dissector_key_eth_addrs:
  * @src: source Ethernet address
@@ -133,6 +140,7 @@ enum flow_dissector_key_id {
 	FLOW_DISSECTOR_KEY_IPV4_ADDRS, /* struct flow_dissector_key_ipv4_addrs */
 	FLOW_DISSECTOR_KEY_IPV6_ADDRS, /* struct flow_dissector_key_ipv6_addrs */
 	FLOW_DISSECTOR_KEY_PORTS, /* struct flow_dissector_key_ports */
+	FLOW_DISSECTOR_KEY_ICMP, /* struct flow_dissector_key_icmp */
 	FLOW_DISSECTOR_KEY_ETH_ADDRS, /* struct flow_dissector_key_eth_addrs */
 	FLOW_DISSECTOR_KEY_TIPC_ADDRS, /* struct flow_dissector_key_tipc_addrs */
 	FLOW_DISSECTOR_KEY_VLAN, /* struct flow_dissector_key_flow_vlan */
@@ -173,11 +181,16 @@ struct flow_keys {
 	struct flow_dissector_key_keyid keyid;
 	struct flow_dissector_key_ports ports;
 	struct flow_dissector_key_addrs addrs;
+#define FLOW_KEYS_HASH_END_FIELD addrs
+	struct flow_dissector_key_icmp icmp;
 };
 
 #define FLOW_KEYS_HASH_OFFSET		\
 	offsetof(struct flow_keys, FLOW_KEYS_HASH_START_FIELD)
 
+#define FLOW_KEYS_HASH_END		\
+	offsetofend(struct flow_keys, FLOW_KEYS_HASH_END_FIELD)
+
 __be32 flow_get_u32_src(const struct flow_keys *flow);
 __be32 flow_get_u32_dst(const struct flow_keys *flow);
 
@@ -233,8 +246,7 @@ static inline bool flow_keys_are_icmp_any(const struct flow_keys *keys)
 
 static inline bool flow_keys_have_l4(const struct flow_keys *keys)
 {
-	return (!flow_keys_are_icmp_any(keys) && keys->ports.ports) ||
-		keys->tags.flow_label;
+	return (keys->ports.ports || keys->tags.flow_label);
 }
 
 u32 flow_hash_from_keys(struct flow_keys *keys);
diff --git a/net/core/flow_dissector.c b/net/core/flow_dissector.c
index 0584b4bb4390..ed6d46475343 100644
--- a/net/core/flow_dissector.c
+++ b/net/core/flow_dissector.c
@@ -139,6 +139,7 @@ bool __skb_flow_dissect(const struct sk_buff *skb,
 	struct flow_dissector_key_basic *key_basic;
 	struct flow_dissector_key_addrs *key_addrs;
 	struct flow_dissector_key_ports *key_ports;
+	struct flow_dissector_key_icmp *key_icmp;
 	struct flow_dissector_key_tags *key_tags;
 	struct flow_dissector_key_vlan *key_vlan;
 	struct flow_dissector_key_keyid *key_keyid;
@@ -559,18 +560,25 @@ bool __skb_flow_dissect(const struct sk_buff *skb,
 		break;
 	}
 
-	if (dissector_uses_key(flow_dissector,
-			       FLOW_DISSECTOR_KEY_PORTS)) {
-		key_ports = skb_flow_dissector_target(flow_dissector,
-						      FLOW_DISSECTOR_KEY_PORTS,
-						      target_container);
-		if (flow_protos_are_icmp_any(proto, ip_proto))
-			key_ports->icmp = skb_flow_get_be16(skb, nhoff, data,
-							    hlen);
-		else
+	if (flow_protos_are_icmp_any(proto, ip_proto)) {
+		if (dissector_uses_key(flow_dissector,
+				       FLOW_DISSECTOR_KEY_ICMP)) {
+			key_icmp = skb_flow_dissector_target(flow_dissector,
+							     FLOW_DISSECTOR_KEY_ICMP,
+							     target_container);
+			key_icmp->icmp = skb_flow_get_be16(skb, nhoff, data,
+							   hlen);
+		}
+	} else {
+		if (dissector_uses_key(flow_dissector,
+				       FLOW_DISSECTOR_KEY_PORTS)) {
+			key_ports = skb_flow_dissector_target(flow_dissector,
+							      FLOW_DISSECTOR_KEY_PORTS,
+							      target_container);
 			key_ports->ports = __skb_flow_get_ports(skb, nhoff,
 								ip_proto, data,
 								hlen);
+		}
 	}
 
 out_good:
@@ -613,9 +621,10 @@ static inline const u32 *flow_keys_hash_start(const struct flow_keys *flow)
 static inline size_t flow_keys_hash_length(const struct flow_keys *flow)
 {
 	size_t diff = FLOW_KEYS_HASH_OFFSET + sizeof(flow->addrs);
-	BUILD_BUG_ON((sizeof(*flow) - FLOW_KEYS_HASH_OFFSET) % sizeof(u32));
+	BUILD_BUG_ON((FLOW_KEYS_HASH_END - FLOW_KEYS_HASH_OFFSET) %
+		     sizeof(u32));
 	BUILD_BUG_ON(offsetof(typeof(*flow), addrs) !=
-		     sizeof(*flow) - sizeof(flow->addrs));
+		     FLOW_KEYS_HASH_END - sizeof(flow->addrs));
 
 	switch (flow->control.addr_type) {
 	case FLOW_DISSECTOR_KEY_IPV4_ADDRS:
@@ -628,7 +637,7 @@ static inline size_t flow_keys_hash_length(const struct flow_keys *flow)
 		diff -= sizeof(flow->addrs.tipcaddrs);
 		break;
 	}
-	return (sizeof(*flow) - diff) / sizeof(u32);
+	return (FLOW_KEYS_HASH_END - diff) / sizeof(u32);
 }
 
 __be32 flow_get_u32_src(const struct flow_keys *flow)
@@ -745,8 +754,7 @@ void make_flow_keys_digest(struct flow_keys_digest *digest,
 
 	data->n_proto = flow->basic.n_proto;
 	data->ip_proto = flow->basic.ip_proto;
-	if (flow_keys_have_l4(flow))
-		data->ports = flow->ports.ports;
+	data->ports = flow->ports.ports;
 	data->src = flow->addrs.v4addrs.src;
 	data->dst = flow->addrs.v4addrs.dst;
 }
diff --git a/net/sched/cls_flow.c b/net/sched/cls_flow.c
index 408313e33bbe..6575aba87630 100644
--- a/net/sched/cls_flow.c
+++ b/net/sched/cls_flow.c
@@ -96,7 +96,7 @@ static u32 flow_get_proto(const struct sk_buff *skb,
 static u32 flow_get_proto_src(const struct sk_buff *skb,
 			      const struct flow_keys *flow)
 {
-	if (!flow_keys_are_icmp_any(flow) && flow->ports.ports)
+	if (flow->ports.ports)
 		return ntohs(flow->ports.src);
 
 	return addr_fold(skb->sk);
@@ -105,7 +105,7 @@ static u32 flow_get_proto_src(const struct sk_buff *skb,
 static u32 flow_get_proto_dst(const struct sk_buff *skb,
 			      const struct flow_keys *flow)
 {
-	if (!flow_keys_are_icmp_any(flow) && flow->ports.ports)
+	if (flow->ports.ports)
 		return ntohs(flow->ports.dst);
 
 	return addr_fold(skb_dst(skb)) ^ (__force u16) tc_skb_protocol(skb);
diff --git a/net/sched/cls_flower.c b/net/sched/cls_flower.c
index c96b2a349779..56df0368125a 100644
--- a/net/sched/cls_flower.c
+++ b/net/sched/cls_flower.c
@@ -39,6 +39,7 @@ struct fl_flow_key {
 		struct flow_dissector_key_ipv6_addrs ipv6;
 	};
 	struct flow_dissector_key_ports tp;
+	struct flow_dissector_key_icmp icmp;
 	struct flow_dissector_key_keyid enc_key_id;
 	union {
 		struct flow_dissector_key_ipv4_addrs enc_ipv4;
@@ -511,19 +512,23 @@ static int fl_set_key(struct net *net, struct nlattr **tb,
 			       &mask->tp.dst, TCA_FLOWER_KEY_SCTP_DST_MASK,
 			       sizeof(key->tp.dst));
 	} else if (flow_basic_key_is_icmpv4(&key->basic)) {
-		fl_set_key_val(tb, &key->tp.type, TCA_FLOWER_KEY_ICMPV4_TYPE,
-			       &mask->tp.type, TCA_FLOWER_KEY_ICMPV4_TYPE_MASK,
-			       sizeof(key->tp.type));
-		fl_set_key_val(tb, &key->tp.code, TCA_FLOWER_KEY_ICMPV4_CODE,
-			       &mask->tp.code, TCA_FLOWER_KEY_ICMPV4_CODE_MASK,
-			       sizeof(key->tp.code));
+		fl_set_key_val(tb, &key->icmp.type, TCA_FLOWER_KEY_ICMPV4_TYPE,
+			       &mask->icmp.type,
+			       TCA_FLOWER_KEY_ICMPV4_TYPE_MASK,
+			       sizeof(key->icmp.type));
+		fl_set_key_val(tb, &key->icmp.code, TCA_FLOWER_KEY_ICMPV4_CODE,
+			       &mask->icmp.code,
+			       TCA_FLOWER_KEY_ICMPV4_CODE_MASK,
+			       sizeof(key->icmp.code));
 	} else if (flow_basic_key_is_icmpv6(&key->basic)) {
-		fl_set_key_val(tb, &key->tp.type, TCA_FLOWER_KEY_ICMPV6_TYPE,
-			       &mask->tp.type, TCA_FLOWER_KEY_ICMPV6_TYPE_MASK,
-			       sizeof(key->tp.type));
-		fl_set_key_val(tb, &key->tp.code, TCA_FLOWER_KEY_ICMPV4_CODE,
-			       &mask->tp.code, TCA_FLOWER_KEY_ICMPV4_CODE_MASK,
-			       sizeof(key->tp.code));
+		fl_set_key_val(tb, &key->icmp.type, TCA_FLOWER_KEY_ICMPV6_TYPE,
+			       &mask->icmp.type,
+			       TCA_FLOWER_KEY_ICMPV6_TYPE_MASK,
+			       sizeof(key->icmp.type));
+		fl_set_key_val(tb, &key->icmp.code, TCA_FLOWER_KEY_ICMPV4_CODE,
+			       &mask->icmp.code,
+			       TCA_FLOWER_KEY_ICMPV4_CODE_MASK,
+			       sizeof(key->icmp.code));
 	}
 
 	if (tb[TCA_FLOWER_KEY_ENC_IPV4_SRC] ||
@@ -634,6 +639,8 @@ static void fl_init_dissector(struct cls_fl_head *head,
 	FL_KEY_SET_IF_MASKED(&mask->key, keys, cnt,
 			     FLOW_DISSECTOR_KEY_PORTS, tp);
 	FL_KEY_SET_IF_MASKED(&mask->key, keys, cnt,
+			     FLOW_DISSECTOR_KEY_ICMP, icmp);
+	FL_KEY_SET_IF_MASKED(&mask->key, keys, cnt,
 			     FLOW_DISSECTOR_KEY_VLAN, vlan);
 	FL_KEY_SET_IF_MASKED(&mask->key, keys, cnt,
 			     FLOW_DISSECTOR_KEY_ENC_KEYID, enc_key_id);
@@ -1000,24 +1007,24 @@ static int fl_dump(struct net *net, struct tcf_proto *tp, unsigned long fh,
 				  sizeof(key->tp.dst))))
 		goto nla_put_failure;
 	else if (flow_basic_key_is_icmpv4(&key->basic) &&
-		 (fl_dump_key_val(skb, &key->tp.type,
-				  TCA_FLOWER_KEY_ICMPV4_TYPE, &mask->tp.type,
+		 (fl_dump_key_val(skb, &key->icmp.type,
+				  TCA_FLOWER_KEY_ICMPV4_TYPE, &mask->icmp.type,
 				  TCA_FLOWER_KEY_ICMPV4_TYPE_MASK,
-				  sizeof(key->tp.type)) ||
-		  fl_dump_key_val(skb, &key->tp.code,
-				  TCA_FLOWER_KEY_ICMPV4_CODE, &mask->tp.code,
+				  sizeof(key->icmp.type)) ||
+		  fl_dump_key_val(skb, &key->icmp.code,
+				  TCA_FLOWER_KEY_ICMPV4_CODE, &mask->icmp.code,
 				  TCA_FLOWER_KEY_ICMPV4_CODE_MASK,
-				  sizeof(key->tp.code))))
+				  sizeof(key->icmp.code))))
 		goto nla_put_failure;
 	else if (flow_basic_key_is_icmpv6(&key->basic) &&
-		 (fl_dump_key_val(skb, &key->tp.type,
-				  TCA_FLOWER_KEY_ICMPV6_TYPE, &mask->tp.type,
+		 (fl_dump_key_val(skb, &key->icmp.type,
+				  TCA_FLOWER_KEY_ICMPV6_TYPE, &mask->icmp.type,
 				  TCA_FLOWER_KEY_ICMPV6_TYPE_MASK,
-				  sizeof(key->tp.type)) ||
-		  fl_dump_key_val(skb, &key->tp.code,
-				  TCA_FLOWER_KEY_ICMPV6_CODE, &mask->tp.code,
+				  sizeof(key->icmp.type)) ||
+		  fl_dump_key_val(skb, &key->icmp.code,
+				  TCA_FLOWER_KEY_ICMPV6_CODE, &mask->icmp.code,
 				  TCA_FLOWER_KEY_ICMPV6_CODE_MASK,
-				  sizeof(key->tp.code))))
+				  sizeof(key->icmp.code))))
 		goto nla_put_failure;
 
 	if (key->enc_control.addr_type == FLOW_DISSECTOR_KEY_IPV4_ADDRS &&

^ permalink raw reply related

* [PATCH V4 net-next] net: hns: Fix to conditionally convey RX checksum flag to stack
From: Salil Mehta @ 2016-12-06 11:09 UTC (permalink / raw)
  To: davem
  Cc: salil.mehta, yisen.zhuang, mehta.salil.lnk, netdev, linux-kernel,
	linuxarm

This patch introduces the RX checksum function to check the
status of the hardware calculated checksum and its error and
appropriately convey status to the upper stack in skb->ip_summed
field.

In hardware, we only support checksum for the following
protocols:
1) IPv4,
2) TCP(over IPv4 or IPv6),
3) UDP(over IPv4 or IPv6),
4) SCTP(over IPv4 or IPv6)
but we support many L3(IPv4, IPv6, MPLS, PPPoE etc) and
L4(TCP, UDP, GRE, SCTP, IGMP, ICMP etc.) protocols.

Hardware limitation:
Our present hardware RX Descriptor lacks L3/L4 checksum
"Status & Error" bit (which usually can be used to indicate whether
checksum was calculated by the hardware and if there was any error
encountered during checksum calculation).

Software workaround:
We do get info within the RX descriptor about the kind of
L3/L4 protocol coming in the packet and the error status. These
errors might not just be checksum errors but could be related to
version, length of IPv4, UDP, TCP etc.
Because there is no-way of knowing if it is a L3/L4 error due
to bad checksum or any other L3/L4 error, we will not (cannot)
convey hardware checksum status(CHECKSUM_UNNECESSARY) for such
cases to upper stack and will not maintain the RX L3/L4 checksum
counters as well.

Signed-off-by: Salil Mehta <salil.mehta@huawei.com>
---
Change Log:
Patch V4: Accidentally floated a wrong V3 patch. corrected this:
          Link: https://lkml.org/lkml/2016/12/5/604
Patch V3: Re-structured the code.
Patch V2: Addressed the comment by David Miller
          Link: https://www.spinics.net/lists/netdev/msg406697.html
Patch V1: This patch is a result of the comments given by
          David Miller <davem@davemloft.net>
          Link: https://lkml.org/lkml/2016/6/15/27
---
 drivers/net/ethernet/hisilicon/hns/hnae.h     |    2 +
 drivers/net/ethernet/hisilicon/hns/hns_enet.c |   76 ++++++++++++++++++++++---
 2 files changed, 71 insertions(+), 7 deletions(-)

diff --git a/drivers/net/ethernet/hisilicon/hns/hnae.h b/drivers/net/ethernet/hisilicon/hns/hnae.h
index 09602f1..8016854 100644
--- a/drivers/net/ethernet/hisilicon/hns/hnae.h
+++ b/drivers/net/ethernet/hisilicon/hns/hnae.h
@@ -99,6 +99,8 @@ enum hnae_led_state {
 #define HNS_RX_FLAG_L3ID_IPV6 0x1
 #define HNS_RX_FLAG_L4ID_UDP 0x0
 #define HNS_RX_FLAG_L4ID_TCP 0x1
+#define HNS_RX_FLAG_L4ID_SCTP 0x3
+
 
 #define HNS_TXD_ASID_S 0
 #define HNS_TXD_ASID_M (0xff << HNS_TXD_ASID_S)
diff --git a/drivers/net/ethernet/hisilicon/hns/hns_enet.c b/drivers/net/ethernet/hisilicon/hns/hns_enet.c
index 255fede..c02449b 100644
--- a/drivers/net/ethernet/hisilicon/hns/hns_enet.c
+++ b/drivers/net/ethernet/hisilicon/hns/hns_enet.c
@@ -565,6 +565,71 @@ static void get_rx_desc_bnum(u32 bnum_flag, int *out_bnum)
 				   HNS_RXD_BUFNUM_M, HNS_RXD_BUFNUM_S);
 }
 
+static void hns_nic_rx_checksum(struct hns_nic_ring_data *ring_data,
+				struct sk_buff *skb, u32 flag)
+{
+	struct net_device *netdev = ring_data->napi.dev;
+	u32 l3id;
+	u32 l4id;
+
+	/* check if RX checksum offload is enabled */
+	if (unlikely(!(netdev->features & NETIF_F_RXCSUM)))
+		return;
+
+	/* In hardware, we only support checksum for the following protocols:
+	 * 1) IPv4,
+	 * 2) TCP(over IPv4 or IPv6),
+	 * 3) UDP(over IPv4 or IPv6),
+	 * 4) SCTP(over IPv4 or IPv6)
+	 * but we support many L3(IPv4, IPv6, MPLS, PPPoE etc) and L4(TCP,
+	 * UDP, GRE, SCTP, IGMP, ICMP etc.) protocols.
+	 *
+	 * Hardware limitation:
+	 * Our present hardware RX Descriptor lacks L3/L4 checksum "Status &
+	 * Error" bit (which usually can be used to indicate whether checksum
+	 * was calculated by the hardware and if there was any error encountered
+	 * during checksum calculation).
+	 *
+	 * Software workaround:
+	 * We do get info within the RX descriptor about the kind of L3/L4
+	 * protocol coming in the packet and the error status. These errors
+	 * might not just be checksum errors but could be related to version,
+	 * length of IPv4, UDP, TCP etc.
+	 * Because there is no-way of knowing if it is a L3/L4 error due to bad
+	 * checksum or any other L3/L4 error, we will not (cannot) convey
+	 * checksum status for such cases to upper stack and will not maintain
+	 * the RX L3/L4 checksum counters as well.
+	 */
+
+	l3id = hnae_get_field(flag, HNS_RXD_L3ID_M, HNS_RXD_L3ID_S);
+	l4id = hnae_get_field(flag, HNS_RXD_L4ID_M, HNS_RXD_L4ID_S);
+
+	/*  check L3 protocol for which checksum is supported */
+	if ((l3id != HNS_RX_FLAG_L3ID_IPV4) && (l3id != HNS_RX_FLAG_L3ID_IPV6))
+		return;
+
+	/* check for any(not just checksum)flagged L3 protocol errors */
+	if (unlikely(hnae_get_bit(flag, HNS_RXD_L3E_B)))
+		return;
+
+	/* we do not support checksum of fragmented packets */
+	if (unlikely(hnae_get_bit(flag, HNS_RXD_FRAG_B)))
+		return;
+
+	/*  check L4 protocol for which checksum is supported */
+	if ((l4id != HNS_RX_FLAG_L4ID_TCP) &&
+	    (l4id != HNS_RX_FLAG_L4ID_UDP) &&
+	    (l4id != HNS_RX_FLAG_L4ID_SCTP))
+		return;
+
+	/* check for any(not just checksum)flagged L4 protocol errors */
+	if (unlikely(hnae_get_bit(flag, HNS_RXD_L4E_B)))
+		return;
+
+	/* now, this has to be a packet with valid RX checksum */
+	skb->ip_summed = CHECKSUM_UNNECESSARY;
+}
+
 static int hns_nic_poll_rx_skb(struct hns_nic_ring_data *ring_data,
 			       struct sk_buff **out_skb, int *out_bnum)
 {
@@ -683,13 +748,10 @@ static int hns_nic_poll_rx_skb(struct hns_nic_ring_data *ring_data,
 	ring->stats.rx_pkts++;
 	ring->stats.rx_bytes += skb->len;
 
-	if (unlikely(hnae_get_bit(bnum_flag, HNS_RXD_L3E_B) ||
-		     hnae_get_bit(bnum_flag, HNS_RXD_L4E_B))) {
-		ring->stats.l3l4_csum_err++;
-		return 0;
-	}
-
-	skb->ip_summed = CHECKSUM_UNNECESSARY;
+	/* indicate to upper stack if our hardware has already calculated
+	 * the RX checksum
+	 */
+	hns_nic_rx_checksum(ring_data, skb, bnum_flag);
 
 	return 0;
 }
-- 
1.7.9.5

^ permalink raw reply related

* Re: bpf bounded loops. Was: [flamebait] xdp
From: Hannes Frederic Sowa @ 2016-12-06 11:35 UTC (permalink / raw)
  To: Edward Cree, Alexei Starovoitov
  Cc: Tom Herbert, Thomas Graf, Linux Kernel Network Developers,
	Daniel Borkmann, David S. Miller
In-Reply-To: <aafa452d-26dd-d58b-2649-21ccce9370a4@solarflare.com>

On 05.12.2016 17:54, Edward Cree wrote:
> On 05/12/16 16:50, Hannes Frederic Sowa wrote:
>> On 05.12.2016 17:40, Edward Cree wrote:
>>> I may be completely mistaken here, but can't the verifier unroll the loop 'for
>>> verification' without it actually being unrolled in the program?
>>> I.e., any "proof that the loop terminates" should translate into "rewrite of
>>> the directed graph to make it a DAG, possibly duplicating a lot of insns", and
>>> you feed the rewritten graph to the verifier, while using the original loopy
>>> version as the actual program to store and later execute.
>>> Then the verifier happily checks things like array indices being valid, without
>>> having to know about the bounded loops.
>> That is what is already happening. E.g. __builtin_memset is expanded up
>> to 128 rounds (which is a lot) but at some point llvm doesn't do enoug
>> unrolling of that.
>>
>> The BPF target configures that in
>> http://llvm.org/docs/doxygen/html/BPFISelLowering_8cpp_source.html on
>> line 166-169.
> I think you're talking about the _compiler_ unrolling loops before it
> submits the program to the kernel.  I'm talking about having the _verifier_
> unroll them, so that we can execute the original (non-unrolled) version.
> Or am I misunderstanding?

Ah, in the verifier this would be part of flow control analysis what we
are talking about in the other part of this thread.

Bye,
Hannes

^ permalink raw reply

* Re: arp_filter and IPv6 ND
From: Hannes Frederic Sowa @ 2016-12-06 12:07 UTC (permalink / raw)
  To: Saku Ytti; +Cc: netdev
In-Reply-To: <CAAeewD9CDQZp0u6L6JtZ96cboKDCRB=yY92F+-ufsA8=OEgyUw@mail.gmail.com>

On 03.12.2016 15:21, Saku Ytti wrote:
> On 2 December 2016 at 20:39, Hannes Frederic Sowa
> <hannes@stressinduktion.org> wrote:
> 
> Hey,
> 
>> E.g. you can use IP addresses bound to other interfaces to send replys
>> on another interface. This can be useful if you have a limited amount of
>> IP addresses on the system but much more interfaces. Especially if they
>> are limited in scope, like in IPv6.
>>
>> Basically Cisco's feature of "unnumbered interface" is always provided
>> in Linux. And there are certainly cases where you would want to use it,
>> e.g. emulate private-vlan feature for network separation.
> 
> Got it, thanks, the explanation makes sense. And indeed it's valid
> case, but also it is the exception, not the rule. I think it would be
> entirely change the default and people who want 'unnumbered' style
> behaviour (like some BRAS scenarios), will know how to and why to
> configure it.

The limited ip address scenario is actually more common for normal
routers. ;)

In retrospect I don't know what what would win if the decision would be
made again. Mostly all operating systems switched to strong end host
model over time, Linux remaining in the weak host end camp alone. It
probably is also easier to go from strong end to weak end system by
policy than vice versa, so I would probably also picked strong end
system semantics by default today.

>> Also in the BGP setup, you might have it easier to establish loopback
>> neighbor contact by just using static on-link routes, without caring
>> about more complex numbering there (otherwise you pretty soon introduce
>> OSPF or some other routing protocol to do the recursive forward resolution).
> 
> The BGP is running on-link, it's just that the BGP is advertising loop
> of Linux. Why the loop ends up in ND cache, I don't know.

Did you check neighbor advertisements and solicitations with tcpdump?

Did you have force_tllao, ndisc_notify enabled? Which source address
does the BGP/TCP connection use?

>>> Grand, not that I feel comfortable writing it. I'd rather see the
>>> whole suppression functionality moved to neighbour.c from being AFI
>>> specific.
>>
>> Yes sure, please provide a patch. A separate sysctl is necessary anyway
>> because the current one is within the ipv4 procfs directory hierarchy.
> 
> Sorry, not a comfortable C programmer, I'm pretty confident I could
> get it working, but I'm more confident that patch would be entirely
> rejected and rewritten by someone who knows what they are doing.
> I see no reason not to have AFI specific toggle, just logic and code
> should be AFI agnostic, like GC (ARP/ND cache time) stuff in
> neighbour.c is nicely done. Frankly whole ARP/ND code could do with
> refactoring to make arp.c and ndisc.c more wire-format stuff and
> behavioural code more in neighbour.c.

Let's first see what the real problem is.

^ permalink raw reply

* Re: [PATCH] net/udp: do not touch skb->peeked unless really needed
From: Paolo Abeni @ 2016-12-06 12:10 UTC (permalink / raw)
  To: Eric Dumazet; +Cc: David Miller, netdev, Willem de Bruijn
In-Reply-To: <1481017989.6225.21.camel@redhat.com>

On Tue, 2016-12-06 at 10:53 +0100, Paolo Abeni wrote:
> On Mon, 2016-12-05 at 09:57 -0800, Eric Dumazet wrote:
> > From: Eric Dumazet <edumazet@google.com>
> > 
> > In UDP recvmsg() path we currently access 3 cache lines from an skb
> > while holding receive queue lock, plus another one if packet is
> > dequeued, since we need to change skb->next->prev
> > 
> > 1st cache line (contains ->next/prev pointers, offsets 0x00 and 0x08)
> > 2nd cache line (skb->len & skb->peeked, offsets 0x80 and 0x8e)
> > 3rd cache line (skb->truesize/users, offsets 0xe0 and 0xe4)
> > 
> > skb->peeked is only needed to make sure 0-length packets are properly
> > handled while MSG_PEEK is operated.
> > 
> > I had first the intent to remove skb->peeked but the "MSG_PEEK at
> > non-zero offset" support added by Sam Kumar makes this not possible.
> 
> I'm wondering if peeking with offset is going to complicate the 2 queues
> patch, too.
> 
> > This patch avoids one cache line miss during the locked section, when
> > skb->len and skb->peeked do not have to be read.
> > 
> > It also avoids the skb_set_peeked() cost for non empty UDP datagrams.
> > 
> > Signed-off-by: Eric Dumazet <edumazet@google.com>
> > ---
> >  net/core/datagram.c |   19 ++++++++++---------
> >  1 file changed, 10 insertions(+), 9 deletions(-)
> > 
> > diff --git a/net/core/datagram.c b/net/core/datagram.c
> > index 49816af8586bb832e806972b486588041a99524c..9482037a5c8c64aec79e42c65bd2691bdd9450a3 100644
> > --- a/net/core/datagram.c
> > +++ b/net/core/datagram.c
> > @@ -214,6 +214,7 @@ struct sk_buff *__skb_try_recv_datagram(struct sock *sk, unsigned int flags,
> >  	if (error)
> >  		goto no_packet;
> >  
> > +	*peeked = 0;
> >  	do {
> >  		/* Again only user level code calls this function, so nothing
> >  		 * interrupt level will suddenly eat the receive_queue.
> > @@ -227,22 +228,22 @@ struct sk_buff *__skb_try_recv_datagram(struct sock *sk, unsigned int flags,
> >  		spin_lock_irqsave(&queue->lock, cpu_flags);
> >  		skb_queue_walk(queue, skb) {
> >  			*last = skb;
> > -			*peeked = skb->peeked;
> >  			if (flags & MSG_PEEK) {
> >  				if (_off >= skb->len && (skb->len || _off ||
> >  							 skb->peeked)) {
> >  					_off -= skb->len;
> >  					continue;
> >  				}
> > -
> > -				skb = skb_set_peeked(skb);
> > -				error = PTR_ERR(skb);
> > -				if (IS_ERR(skb)) {
> > -					spin_unlock_irqrestore(&queue->lock,
> > -							       cpu_flags);
> > -					goto no_packet;
> > +				if (!skb->len) {
> > +					skb = skb_set_peeked(skb);
> > +					if (IS_ERR(skb)) {
> > +						error = PTR_ERR(skb);
> > +						spin_unlock_irqrestore(&queue->lock,
> > +								       cpu_flags);
> > +						goto no_packet;
> > +					}
> >  				}
> 
> I don't understand why we can avoid setting skb->peek if len > 0. I
> think that will change the kernel behavior if:
> - peek with offset is set
> - 3 skbs with len > 0 are enqueued
> - the u/s peek (with offset) the second one
> - the u/s disable peeking with offset and peeks 2 more skbs.
> 
> With the current code in the last step the u/s is going to peek the 1#
> and the 3# skbs, after this patch will peek the 1# and the 2#. Am I
> missing something ? Probably the new behavior is more correct, but still
> is a change. 

Please ignore the above dumb comment. I misread the 'skip condition'.

I'm fine with the patch in its current form.

Acked-by: Paolo Abeni <pabeni@redhat.com>

^ permalink raw reply

* Re: [PATCH v2] netfilter: avoid warn and OOM killer on vmalloc call
From: Pablo Neira Ayuso @ 2016-12-06 12:21 UTC (permalink / raw)
  To: Marcelo Ricardo Leitner
  Cc: andreyknvl, fw, nhorman, netfilter-devel, netdev, linux-kernel
In-Reply-To: <e10fd8b5206bfde5526e3bbf2b5ff9af584b0a6f.1480671893.git.marcelo.leitner@gmail.com>

On Fri, Dec 02, 2016 at 07:46:38AM -0200, Marcelo Ricardo Leitner wrote:
> Andrey Konovalov reported that this vmalloc call is based on an
> userspace request and that it's spewing traces, which may flood the logs
> and cause DoS if abused.
> 
> Florian Westphal also mentioned that this call should not trigger OOM
> killer.
> 
> This patch brings the vmalloc call in sync to kmalloc and disables the
> warn trace on allocation failure and also disable OOM killer invocation.
> 
> Note, however, that under such stress situation, other places may
> trigger OOM killer invocation.

Unless anyone has an objection, I'm going to place this in nf-next.

Thanks.

> Reported-by: Andrey Konovalov <andreyknvl@google.com>
> Cc: Florian Westphal <fw@strlen.de>
> Signed-off-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com>
> ---
>  net/netfilter/x_tables.c | 4 +++-
>  1 file changed, 3 insertions(+), 1 deletion(-)
> 
> diff --git a/net/netfilter/x_tables.c b/net/netfilter/x_tables.c
> index fc4977456c30e098197b4f987b758072c9cf60d9..dece525bf83a0098dad607fce665cd0bde228362 100644
> --- a/net/netfilter/x_tables.c
> +++ b/net/netfilter/x_tables.c
> @@ -958,7 +958,9 @@ struct xt_table_info *xt_alloc_table_info(unsigned int size)
>  	if (sz <= (PAGE_SIZE << PAGE_ALLOC_COSTLY_ORDER))
>  		info = kmalloc(sz, GFP_KERNEL | __GFP_NOWARN | __GFP_NORETRY);
>  	if (!info) {
> -		info = vmalloc(sz);
> +		info = __vmalloc(sz, GFP_KERNEL | __GFP_NOWARN |
> +				     __GFP_NORETRY | __GFP_HIGHMEM,
> +				 PAGE_KERNEL);
>  		if (!info)
>  			return NULL;
>  	}
> -- 
> 2.9.3
> 

^ permalink raw reply

* Re: [PATCH v2 net-next v2 4/4] net: dsa: mv88e6xxx: add PPU operations
From: Stefan Eichenberger @ 2016-12-06 12:27 UTC (permalink / raw)
  To: Vivien Didelot
  Cc: netdev, linux-kernel, kernel, David S. Miller, Florian Fainelli,
	Andrew Lunn, Richard Cochran
In-Reply-To: <874m2ic95g.fsf@ketchup.i-did-not-set--mail-host-address--so-tickle-me>

On 5 December 2016 at 23:18, Vivien Didelot
<vivien.didelot@savoirfairelinux.com> wrote:
> Stefan Eichenberger <eichest@gmail.com> writes:
>
>> Hi Vivien,
>>
>> On Mon, Dec 05, 2016 at 11:27:03AM -0500, Vivien Didelot wrote:
>>> @@ -3266,6 +3220,8 @@ static const struct mv88e6xxx_ops mv88e6097_ops = {
>>>      .g1_set_cpu_port = mv88e6095_g1_set_cpu_port,
>>>      .g1_set_egress_port = mv88e6095_g1_set_egress_port,
>>>      .mgmt_rsvd2cpu = mv88e6095_g2_mgmt_rsvd2cpu,
>>> +    .ppu_enable = mv88e6185_g1_ppu_enable,
>>> +    .ppu_disable = mv88e6185_g1_ppu_disable,
>>>      .reset = mv88e6185_g1_reset,
>>>  };
>>
>> The mv88e6097 should use the indirect access to the phys, bit 14 in g1
>> control is marked as reserved. They write in the datasheet that
>> disabling the PPU is still supported but indirect access via g2 should
>> be used because disabling the PPU  is no longer recommended.
>
> Ho ok thanks, I respin a v3 right away with this removed and with
> mv88e6352_g1_reset instead.

Perfect thank you, now it's fine.

Regards,
Stefan

^ permalink raw reply

* Re: [PATCHv2 net] team: team_port_add should check link_up before enable port
From: Jiri Pirko @ 2016-12-06 12:27 UTC (permalink / raw)
  To: Xin Long; +Cc: network dev, davem, Marcelo Ricardo Leitner
In-Reply-To: <CADvbK_e01bz-PGmDrdxztyzLqmR5od3rahO960Oe2OhYQ6fp9w@mail.gmail.com>

Tue, Dec 06, 2016 at 11:38:53AM CET, lucien.xin@gmail.com wrote:
>On Tue, Dec 6, 2016 at 4:22 PM, Jiri Pirko <jiri@resnulli.us> wrote:
>> Tue, Dec 06, 2016 at 08:29:08AM CET, lucien.xin@gmail.com wrote:
>>>Now when users add a nic to team dev, the option 'enable' of the port
>>>is true by default, as team_port_enable enables it after dev_open in
>>>team_port_add.
>>>
>>>But even if the port_dev has no carrier, like it's cable was unpluged,
>>>the port is still enabled. It leads to that team dev couldn't work well
>>>if this port was chosen to connect, and has no chance to change to use
>>>other ports if link_watch is ethtool.
>>>
>>>This patch is to enable the port only when the port_dev has carrier in
>>>team_port_add.
>>>
>>>v1 -> v2:
>>>  use netif_carrier_ok() instead of !!netif_carrier_ok(), as it returns
>>>  bool now.
>>>
>>>Signed-off-by: Xin Long <lucien.xin@gmail.com>
>>>---
>>> drivers/net/team/team.c | 8 ++++++--
>>> 1 file changed, 6 insertions(+), 2 deletions(-)
>>>
>>>diff --git a/drivers/net/team/team.c b/drivers/net/team/team.c
>>>index a380649..4bc0103 100644
>>>--- a/drivers/net/team/team.c
>>>+++ b/drivers/net/team/team.c
>>>@@ -1140,6 +1140,7 @@ static int team_port_add(struct team *team, struct net_device *port_dev)
>>>       struct net_device *dev = team->dev;
>>>       struct team_port *port;
>>>       char *portname = port_dev->name;
>>>+      bool linkup;
>>>       int err;
>>>
>>>       if (port_dev->flags & IFF_LOOPBACK) {
>>>@@ -1249,9 +1250,12 @@ static int team_port_add(struct team *team, struct net_device *port_dev)
>>>
>>>       port->index = -1;
>>>       list_add_tail_rcu(&port->list, &team->port_list);
>>>-      team_port_enable(team, port);
>>>+      linkup = netif_carrier_ok(port_dev);
>>>+      if (linkup)
>>>+              team_port_enable(team, port);
>>
>> This is obviously wrong change. It you use a simple setup without
>> userspace part (e.g. round robin), The port gets never enabled.
>> team_port_enabl is called from here and team_port_en_option_set.
>yes, without userspace part, that would be a problem.
>
>>
>> By default, all ports should be enabled. Only in case the userspace
>> daemon decides to disable, it does so.
>>
>> Could you send me the exact configuration where you see the issue?
>attachment is the scripts,
>
># ./setup.sh
># ip netns exec ns1 ./team1.sh
># ./team0.sh
># ping 192.168.11.3
>
>the issue is team0 cannot switch to veth2, even if the veth0 has no carrier.
>
>> This should be definitelly fixed in user part.
>now it can disable/enable it in lw_ethtool_event_watch_port_changed()
>when receive event from kernel, but at the beginning, the first event from
>adding port will not going to team_port_en_option_set, as new_link_up
>== common_ppriv->link_up in lw_ethtool_event_watch_port_changed().
>
>maybe it should be fixed there ?

Yes please.

>
>>
>> Thanks.

^ permalink raw reply

* Re: [PATCH v2 net-next 1/2] flow dissector: ICMP support
From: Jiri Pirko @ 2016-12-06 12:26 UTC (permalink / raw)
  To: Simon Horman
  Cc: Tom Herbert, David Miller, Linux Kernel Network Developers,
	Jay Vosburgh, Veaceslav Falico, Andy Gospodarek, Jamal Hadi Salim,
	Jiri Pirko
In-Reply-To: <20161206105145.GA27020@penelope.horms.nl>

Tue, Dec 06, 2016 at 11:51:46AM CET, simon.horman@netronome.com wrote:
>On Mon, Dec 05, 2016 at 09:29:45AM -0800, Tom Herbert wrote:
>> On Mon, Dec 5, 2016 at 6:23 AM, Simon Horman <simon.horman@netronome.com> wrote:
>> > On Sat, Dec 03, 2016 at 10:52:32AM -0800, Tom Herbert wrote:
>> >> On Sat, Dec 3, 2016 at 2:49 AM, Jiri Pirko <jiri@resnulli.us> wrote:
>> >> > Fri, Dec 02, 2016 at 09:31:41PM CET, simon.horman@netronome.com wrote:
>> >> >>Allow dissection of ICMP(V6) type and code. This re-uses transport layer
>> >> >>port dissection code as although ICMP is not a transport protocol and their
>> >> >>type and code are not ports this allows sharing of both code and storage.
>> >> >>
>> >> >>Signed-off-by: Simon Horman <simon.horman@netronome.com>
>> >
>> > ...
>> >
>> >> > Digging into this a bit more. I think it would be much nice not to mix
>> >> > up l4 ports and icmp stuff.
>> >> >
>> >> > How about to have FLOW_DISSECTOR_KEY_ICMP
>> >> > and
>> >> > struct flow_dissector_key_icmp {
>> >> >         u8 type;
>> >> >         u8 code;
>> >> > };
>> >> >
>> >> > The you can make this structure and struct flow_dissector_key_ports into
>> >> > an union in struct flow_keys.
>> >> >
>> >> > Looks much cleaner to me.
>> >
>> > Hi Jiri,
>> >
>> > I wondered about that too. The main reason that I didn't do this the first
>> > time around is that I wasn't sure what to do to differentiate between ports
>> > and icmp in fl_init_dissector() given that using a union would could to
>> > mask bits being set in both if they are set in either.
>> >
>> > I've taken a stab at addressing that below along with implementing your
>> > suggestion. If the treatment in fl_init_dissector() is acceptable
>> > perhaps something similar should be done for FLOW_DISSECTOR_KEY_IPV[46]_ADDRS
>> > too?
>> >
>> >> I agree, this patch adds to many conditionals into the fast path for
>> >> ICMP handling. Neither is there much point in using type and code as
>> >> input to the packet hash.
>> >
>> > Hi Tom,
>> >
>> > I'm not sure that I follow this. A packet may be ICMP or not regardless of
>> > if FLOW_DISSECTOR_KEY_PORT (and now FLOW_DISSECTOR_KEY_ICMP) port is set or
>> > not. I'd appreciate some guidance here.
>> >
>> > First-pass at implementing Jiri's suggestion follows:
>> >
>
>...
>
>> > diff --git a/net/core/flow_dissector.c b/net/core/flow_dissector.c
>> > index 0584b4bb4390..33e5fa2b3e87 100644
>> > --- a/net/core/flow_dissector.c
>> > +++ b/net/core/flow_dissector.c
>
>...
>
>> > @@ -975,6 +983,10 @@ static const struct flow_dissector_key flow_keys_dissector_keys[] = {
>> >                 .offset = offsetof(struct flow_keys, ports),
>> >         },
>> >         {
>> > +               .key_id = FLOW_DISSECTOR_KEY_ICMP,
>> > +               .offset = offsetof(struct flow_keys, icmp),
>> > +       },
>> 
>> This is the problem I was referring to. This adds ICMP to the keys for
>> computing the skb hash and the ICMP type and code are in a union with
>> port numbers so they are in the range over which the hash is computed.
>> This means that we are treating type and code as some sort of flow and
>> and so different type/code between the same src/dst could follow
>> different paths in ECMP. For the purposes of computing a packet hash I
>> don't think we want ICMP information included. This might be a matter
>> of not putting FLOW_DISSECTOR_KEY_ICMP in flow_keys_dissector_keys,
>> where we need this information we could define another structure that
>> has all the same fields as in flow_keys_dissector_keys and include
>> FLOW_DISSECTOR_KEY_ICMP. Alternatively, the flow_dissector_key_icmp
>> could also be outside of the area that is used in the hash: that is no
>> in flow_keys_hash_start(keys) to flow_keys_hash_length(keys), keyval);
>
>...
>
>> > diff --git a/net/sched/cls_flower.c b/net/sched/cls_flower.c
>> > index c96b2a349779..f4ba69bd362f 100644
>> > --- a/net/sched/cls_flower.c
>> > +++ b/net/sched/cls_flower.c
>> > @@ -38,7 +38,10 @@ struct fl_flow_key {
>> >                 struct flow_dissector_key_ipv4_addrs ipv4;
>> >                 struct flow_dissector_key_ipv6_addrs ipv6;
>> >         };
>> > -       struct flow_dissector_key_ports tp;
>> > +       union {
>> > +               struct flow_dissector_key_ports tp;
>> > +               struct flow_dissector_key_icmp icmp;
>> > +       };
>> 
>> When an ICMP packet is encapsulated within UDP then icmp overwrites
>> valid port information with is a stronger signal of the flow (unless
>> FLOW_DISSECTOR_F_STOP_AT_ENCAP is set). This is another reason not to
>> use a union with ports.
>
>...
>
>Hi Tom,
>
>thanks for your explanation. I think I have a clearer picture now.
>
>I have reworked things to try to address your concerns.
>In particular:
>
>* FLOW_DISSECTOR_KEY_ICMP is no longer added to flow_keys_dissector_keys.
>  I don't think it is needed at all for the use-case I currently have in
>  mind, which is classifying packets using tc_flower. And adding it there
>  was an error on my part.

I was just about to ask why you are adding it there. Good.


>
>* I stopped using a union for ports and icmp. At the very least this seems
>  to make it easier to reason about things as we no longer need to consider
>  that a port value may in fact be an ICMP type or code.

This should be within csl_flower code. You can easily have it as a union
in struct fl_flow_key. 


>
>  This seems to allow us avoid adding a number of is_icmp checks (as I believe
>  you pointed out earlier). And should also address the problem you state
>  immediately above.

I don't understand the check you are talking about. The union just allow
to share the mem. I don't see any checks needed.


>
>* I have also placed icmp outside of the range flow_keys_hash_start(keys)
>  to flow_keys_hash_length(keys), keyval). This is because I now see no
>  value of having it inside that range and it both avoids any chance of
>  contaminating the hash with ICMP values and hashing over unwanted
>  (hopefully zero) values.

Okay, now I'm confused again. You don't want to add this to
flow_keys_dissector_keys. Why would you?


>
>  This required an update to flow_keys_hash_length() as the bound
>  of the end of fields the hash is no longer the end of struct flow_keys.
>  It seems clean but I wonder if there are similar assumptions lurking
>  elsewhere.
>
>I have lightly tested this for the tc_flower case (only).
>
>Incremental patch on top of this series:
>
>diff --git a/drivers/net/bonding/bond_main.c b/drivers/net/bonding/bond_main.c
>index a6f75cfb2bf7..8029dd4912b6 100644
>--- a/drivers/net/bonding/bond_main.c
>+++ b/drivers/net/bonding/bond_main.c
>@@ -3181,8 +3181,7 @@ static bool bond_flow_dissect(struct bonding *bond, struct sk_buff *skb,
> 	} else {
> 		return false;
> 	}
>-	if (bond->params.xmit_policy == BOND_XMIT_POLICY_LAYER34 &&
>-	    proto >= 0 && !skb_flow_is_icmp_any(skb, proto))
>+	if (bond->params.xmit_policy == BOND_XMIT_POLICY_LAYER34 && proto >= 0)
> 		fk->ports.ports = skb_flow_get_ports(skb, noff, proto);
> 
> 	return true;
>@@ -3210,8 +3209,7 @@ u32 bond_xmit_hash(struct bonding *bond, struct sk_buff *skb)
> 		return bond_eth_hash(skb);
> 
> 	if (bond->params.xmit_policy == BOND_XMIT_POLICY_LAYER23 ||
>-	    bond->params.xmit_policy == BOND_XMIT_POLICY_ENCAP23 ||
>-	    flow_keys_are_icmp_any(&flow))
>+	    bond->params.xmit_policy == BOND_XMIT_POLICY_ENCAP23)
> 		hash = bond_eth_hash(skb);
> 	else
> 		hash = (__force u32)flow.ports.ports;
>diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h
>index 44a8f69a9198..9c535fbccf2c 100644
>--- a/include/linux/skbuff.h
>+++ b/include/linux/skbuff.h
>@@ -1094,11 +1094,6 @@ u32 __skb_get_poff(const struct sk_buff *skb, void *data,
> __be32 __skb_flow_get_ports(const struct sk_buff *skb, int thoff, u8 ip_proto,
> 			    void *data, int hlen_proto);
> 
>-static inline bool skb_flow_is_icmp_any(const struct sk_buff *skb, u8 ip_proto)
>-{
>-	return flow_protos_are_icmp_any(skb->protocol, ip_proto);
>-}
>-
> static inline __be32 skb_flow_get_ports(const struct sk_buff *skb,
> 					int thoff, u8 ip_proto)
> {
>diff --git a/include/net/flow_dissector.h b/include/net/flow_dissector.h
>index 5540dfa18872..058c9df8a4d8 100644
>--- a/include/net/flow_dissector.h
>+++ b/include/net/flow_dissector.h
>@@ -91,14 +91,10 @@ struct flow_dissector_key_addrs {
> 
> /**
>  * flow_dissector_key_ports:
>- *	@ports: port numbers of Transport header or
>- *		type and code of ICMP header
>+ *	@ports: port numbers of Transport header
>  *		ports: source (high) and destination (low) port numbers
>  *		src: source port number
>  *		dst: destination port number
>- *		icmp: ICMP type (high) and code (low)
>- *		type: ICMP type
>- *		type: ICMP code
>  */
> struct flow_dissector_key_ports {
> 	union {
>@@ -107,6 +103,18 @@ struct flow_dissector_key_ports {
> 			__be16 src;
> 			__be16 dst;
> 		};
>+	};
>+};
>+
>+/**
>+ * flow_dissector_key_icmp:
>+ *	@ports: type and code of ICMP header
>+ *		icmp: ICMP type (high) and code (low)
>+ *		type: ICMP type
>+ *		code: ICMP code
>+ */
>+struct flow_dissector_key_icmp {
>+	union {
> 		__be16 icmp;
> 		struct {
> 			u8 type;
>@@ -115,7 +123,6 @@ struct flow_dissector_key_ports {
> 	};
> };
> 
>-
> /**
>  * struct flow_dissector_key_eth_addrs:
>  * @src: source Ethernet address
>@@ -133,6 +140,7 @@ enum flow_dissector_key_id {
> 	FLOW_DISSECTOR_KEY_IPV4_ADDRS, /* struct flow_dissector_key_ipv4_addrs */
> 	FLOW_DISSECTOR_KEY_IPV6_ADDRS, /* struct flow_dissector_key_ipv6_addrs */
> 	FLOW_DISSECTOR_KEY_PORTS, /* struct flow_dissector_key_ports */
>+	FLOW_DISSECTOR_KEY_ICMP, /* struct flow_dissector_key_icmp */
> 	FLOW_DISSECTOR_KEY_ETH_ADDRS, /* struct flow_dissector_key_eth_addrs */
> 	FLOW_DISSECTOR_KEY_TIPC_ADDRS, /* struct flow_dissector_key_tipc_addrs */
> 	FLOW_DISSECTOR_KEY_VLAN, /* struct flow_dissector_key_flow_vlan */
>@@ -173,11 +181,16 @@ struct flow_keys {
> 	struct flow_dissector_key_keyid keyid;
> 	struct flow_dissector_key_ports ports;
> 	struct flow_dissector_key_addrs addrs;
>+#define FLOW_KEYS_HASH_END_FIELD addrs
>+	struct flow_dissector_key_icmp icmp;
> };
> 
> #define FLOW_KEYS_HASH_OFFSET		\
> 	offsetof(struct flow_keys, FLOW_KEYS_HASH_START_FIELD)
> 
>+#define FLOW_KEYS_HASH_END		\
>+	offsetofend(struct flow_keys, FLOW_KEYS_HASH_END_FIELD)
>+
> __be32 flow_get_u32_src(const struct flow_keys *flow);
> __be32 flow_get_u32_dst(const struct flow_keys *flow);
> 
>@@ -233,8 +246,7 @@ static inline bool flow_keys_are_icmp_any(const struct flow_keys *keys)
> 
> static inline bool flow_keys_have_l4(const struct flow_keys *keys)
> {
>-	return (!flow_keys_are_icmp_any(keys) && keys->ports.ports) ||
>-		keys->tags.flow_label;
>+	return (keys->ports.ports || keys->tags.flow_label);
> }
> 
> u32 flow_hash_from_keys(struct flow_keys *keys);
>diff --git a/net/core/flow_dissector.c b/net/core/flow_dissector.c
>index 0584b4bb4390..ed6d46475343 100644
>--- a/net/core/flow_dissector.c
>+++ b/net/core/flow_dissector.c
>@@ -139,6 +139,7 @@ bool __skb_flow_dissect(const struct sk_buff *skb,
> 	struct flow_dissector_key_basic *key_basic;
> 	struct flow_dissector_key_addrs *key_addrs;
> 	struct flow_dissector_key_ports *key_ports;
>+	struct flow_dissector_key_icmp *key_icmp;
> 	struct flow_dissector_key_tags *key_tags;
> 	struct flow_dissector_key_vlan *key_vlan;
> 	struct flow_dissector_key_keyid *key_keyid;
>@@ -559,18 +560,25 @@ bool __skb_flow_dissect(const struct sk_buff *skb,
> 		break;
> 	}
> 
>-	if (dissector_uses_key(flow_dissector,
>-			       FLOW_DISSECTOR_KEY_PORTS)) {
>-		key_ports = skb_flow_dissector_target(flow_dissector,
>-						      FLOW_DISSECTOR_KEY_PORTS,
>-						      target_container);
>-		if (flow_protos_are_icmp_any(proto, ip_proto))
>-			key_ports->icmp = skb_flow_get_be16(skb, nhoff, data,
>-							    hlen);
>-		else
>+	if (flow_protos_are_icmp_any(proto, ip_proto)) {
>+		if (dissector_uses_key(flow_dissector,
>+				       FLOW_DISSECTOR_KEY_ICMP)) {
>+			key_icmp = skb_flow_dissector_target(flow_dissector,
>+							     FLOW_DISSECTOR_KEY_ICMP,
>+							     target_container);
>+			key_icmp->icmp = skb_flow_get_be16(skb, nhoff, data,
>+							   hlen);
>+		}
>+	} else {
>+		if (dissector_uses_key(flow_dissector,
>+				       FLOW_DISSECTOR_KEY_PORTS)) {
>+			key_ports = skb_flow_dissector_target(flow_dissector,
>+							      FLOW_DISSECTOR_KEY_PORTS,
>+							      target_container);
> 			key_ports->ports = __skb_flow_get_ports(skb, nhoff,
> 								ip_proto, data,
> 								hlen);
>+		}
> 	}
> 
> out_good:
>@@ -613,9 +621,10 @@ static inline const u32 *flow_keys_hash_start(const struct flow_keys *flow)
> static inline size_t flow_keys_hash_length(const struct flow_keys *flow)
> {
> 	size_t diff = FLOW_KEYS_HASH_OFFSET + sizeof(flow->addrs);
>-	BUILD_BUG_ON((sizeof(*flow) - FLOW_KEYS_HASH_OFFSET) % sizeof(u32));
>+	BUILD_BUG_ON((FLOW_KEYS_HASH_END - FLOW_KEYS_HASH_OFFSET) %
>+		     sizeof(u32));
> 	BUILD_BUG_ON(offsetof(typeof(*flow), addrs) !=
>-		     sizeof(*flow) - sizeof(flow->addrs));
>+		     FLOW_KEYS_HASH_END - sizeof(flow->addrs));
> 
> 	switch (flow->control.addr_type) {
> 	case FLOW_DISSECTOR_KEY_IPV4_ADDRS:
>@@ -628,7 +637,7 @@ static inline size_t flow_keys_hash_length(const struct flow_keys *flow)
> 		diff -= sizeof(flow->addrs.tipcaddrs);
> 		break;
> 	}
>-	return (sizeof(*flow) - diff) / sizeof(u32);
>+	return (FLOW_KEYS_HASH_END - diff) / sizeof(u32);
> }
> 
> __be32 flow_get_u32_src(const struct flow_keys *flow)
>@@ -745,8 +754,7 @@ void make_flow_keys_digest(struct flow_keys_digest *digest,
> 
> 	data->n_proto = flow->basic.n_proto;
> 	data->ip_proto = flow->basic.ip_proto;
>-	if (flow_keys_have_l4(flow))
>-		data->ports = flow->ports.ports;
>+	data->ports = flow->ports.ports;
> 	data->src = flow->addrs.v4addrs.src;
> 	data->dst = flow->addrs.v4addrs.dst;
> }
>diff --git a/net/sched/cls_flow.c b/net/sched/cls_flow.c
>index 408313e33bbe..6575aba87630 100644
>--- a/net/sched/cls_flow.c
>+++ b/net/sched/cls_flow.c
>@@ -96,7 +96,7 @@ static u32 flow_get_proto(const struct sk_buff *skb,
> static u32 flow_get_proto_src(const struct sk_buff *skb,
> 			      const struct flow_keys *flow)
> {
>-	if (!flow_keys_are_icmp_any(flow) && flow->ports.ports)
>+	if (flow->ports.ports)
> 		return ntohs(flow->ports.src);
> 
> 	return addr_fold(skb->sk);
>@@ -105,7 +105,7 @@ static u32 flow_get_proto_src(const struct sk_buff *skb,
> static u32 flow_get_proto_dst(const struct sk_buff *skb,
> 			      const struct flow_keys *flow)
> {
>-	if (!flow_keys_are_icmp_any(flow) && flow->ports.ports)
>+	if (flow->ports.ports)
> 		return ntohs(flow->ports.dst);
> 
> 	return addr_fold(skb_dst(skb)) ^ (__force u16) tc_skb_protocol(skb);
>diff --git a/net/sched/cls_flower.c b/net/sched/cls_flower.c
>index c96b2a349779..56df0368125a 100644
>--- a/net/sched/cls_flower.c
>+++ b/net/sched/cls_flower.c
>@@ -39,6 +39,7 @@ struct fl_flow_key {
> 		struct flow_dissector_key_ipv6_addrs ipv6;
> 	};
> 	struct flow_dissector_key_ports tp;
>+	struct flow_dissector_key_icmp icmp;
> 	struct flow_dissector_key_keyid enc_key_id;
> 	union {
> 		struct flow_dissector_key_ipv4_addrs enc_ipv4;
>@@ -511,19 +512,23 @@ static int fl_set_key(struct net *net, struct nlattr **tb,
> 			       &mask->tp.dst, TCA_FLOWER_KEY_SCTP_DST_MASK,
> 			       sizeof(key->tp.dst));
> 	} else if (flow_basic_key_is_icmpv4(&key->basic)) {
>-		fl_set_key_val(tb, &key->tp.type, TCA_FLOWER_KEY_ICMPV4_TYPE,
>-			       &mask->tp.type, TCA_FLOWER_KEY_ICMPV4_TYPE_MASK,
>-			       sizeof(key->tp.type));
>-		fl_set_key_val(tb, &key->tp.code, TCA_FLOWER_KEY_ICMPV4_CODE,
>-			       &mask->tp.code, TCA_FLOWER_KEY_ICMPV4_CODE_MASK,
>-			       sizeof(key->tp.code));
>+		fl_set_key_val(tb, &key->icmp.type, TCA_FLOWER_KEY_ICMPV4_TYPE,
>+			       &mask->icmp.type,
>+			       TCA_FLOWER_KEY_ICMPV4_TYPE_MASK,
>+			       sizeof(key->icmp.type));
>+		fl_set_key_val(tb, &key->icmp.code, TCA_FLOWER_KEY_ICMPV4_CODE,
>+			       &mask->icmp.code,
>+			       TCA_FLOWER_KEY_ICMPV4_CODE_MASK,
>+			       sizeof(key->icmp.code));
> 	} else if (flow_basic_key_is_icmpv6(&key->basic)) {
>-		fl_set_key_val(tb, &key->tp.type, TCA_FLOWER_KEY_ICMPV6_TYPE,
>-			       &mask->tp.type, TCA_FLOWER_KEY_ICMPV6_TYPE_MASK,
>-			       sizeof(key->tp.type));
>-		fl_set_key_val(tb, &key->tp.code, TCA_FLOWER_KEY_ICMPV4_CODE,
>-			       &mask->tp.code, TCA_FLOWER_KEY_ICMPV4_CODE_MASK,
>-			       sizeof(key->tp.code));
>+		fl_set_key_val(tb, &key->icmp.type, TCA_FLOWER_KEY_ICMPV6_TYPE,
>+			       &mask->icmp.type,
>+			       TCA_FLOWER_KEY_ICMPV6_TYPE_MASK,
>+			       sizeof(key->icmp.type));
>+		fl_set_key_val(tb, &key->icmp.code, TCA_FLOWER_KEY_ICMPV4_CODE,
>+			       &mask->icmp.code,
>+			       TCA_FLOWER_KEY_ICMPV4_CODE_MASK,
>+			       sizeof(key->icmp.code));
> 	}
> 
> 	if (tb[TCA_FLOWER_KEY_ENC_IPV4_SRC] ||
>@@ -634,6 +639,8 @@ static void fl_init_dissector(struct cls_fl_head *head,
> 	FL_KEY_SET_IF_MASKED(&mask->key, keys, cnt,
> 			     FLOW_DISSECTOR_KEY_PORTS, tp);
> 	FL_KEY_SET_IF_MASKED(&mask->key, keys, cnt,
>+			     FLOW_DISSECTOR_KEY_ICMP, icmp);
>+	FL_KEY_SET_IF_MASKED(&mask->key, keys, cnt,
> 			     FLOW_DISSECTOR_KEY_VLAN, vlan);
> 	FL_KEY_SET_IF_MASKED(&mask->key, keys, cnt,
> 			     FLOW_DISSECTOR_KEY_ENC_KEYID, enc_key_id);
>@@ -1000,24 +1007,24 @@ static int fl_dump(struct net *net, struct tcf_proto *tp, unsigned long fh,
> 				  sizeof(key->tp.dst))))
> 		goto nla_put_failure;
> 	else if (flow_basic_key_is_icmpv4(&key->basic) &&
>-		 (fl_dump_key_val(skb, &key->tp.type,
>-				  TCA_FLOWER_KEY_ICMPV4_TYPE, &mask->tp.type,
>+		 (fl_dump_key_val(skb, &key->icmp.type,
>+				  TCA_FLOWER_KEY_ICMPV4_TYPE, &mask->icmp.type,
> 				  TCA_FLOWER_KEY_ICMPV4_TYPE_MASK,
>-				  sizeof(key->tp.type)) ||
>-		  fl_dump_key_val(skb, &key->tp.code,
>-				  TCA_FLOWER_KEY_ICMPV4_CODE, &mask->tp.code,
>+				  sizeof(key->icmp.type)) ||
>+		  fl_dump_key_val(skb, &key->icmp.code,
>+				  TCA_FLOWER_KEY_ICMPV4_CODE, &mask->icmp.code,
> 				  TCA_FLOWER_KEY_ICMPV4_CODE_MASK,
>-				  sizeof(key->tp.code))))
>+				  sizeof(key->icmp.code))))
> 		goto nla_put_failure;
> 	else if (flow_basic_key_is_icmpv6(&key->basic) &&
>-		 (fl_dump_key_val(skb, &key->tp.type,
>-				  TCA_FLOWER_KEY_ICMPV6_TYPE, &mask->tp.type,
>+		 (fl_dump_key_val(skb, &key->icmp.type,
>+				  TCA_FLOWER_KEY_ICMPV6_TYPE, &mask->icmp.type,
> 				  TCA_FLOWER_KEY_ICMPV6_TYPE_MASK,
>-				  sizeof(key->tp.type)) ||
>-		  fl_dump_key_val(skb, &key->tp.code,
>-				  TCA_FLOWER_KEY_ICMPV6_CODE, &mask->tp.code,
>+				  sizeof(key->icmp.type)) ||
>+		  fl_dump_key_val(skb, &key->icmp.code,
>+				  TCA_FLOWER_KEY_ICMPV6_CODE, &mask->icmp.code,
> 				  TCA_FLOWER_KEY_ICMPV6_CODE_MASK,
>-				  sizeof(key->tp.code))))
>+				  sizeof(key->icmp.code))))
> 		goto nla_put_failure;
> 
> 	if (key->enc_control.addr_type == FLOW_DISSECTOR_KEY_IPV4_ADDRS &&

^ permalink raw reply

* Re: [PATCH] net:switchdev: fix to release the lock before function return to avoid deadlock
From: Jiri Pirko @ 2016-12-06 12:29 UTC (permalink / raw)
  To: Feng Deng; +Cc: davem, netdev, linux-kernel, cxdx2006
In-Reply-To: <1481018964-47554-1-git-send-email-feng.deng@cortina-access.com>

Tue, Dec 06, 2016 at 11:09:24AM CET, feng.deng@cortina-access.com wrote:
>before switchdev_deferred_dequeue() normal return,show release the lock,
>if not,maybe there will be deadlock sometimes
>
>Signed-off-by: Feng Deng <feng.deng@cortina-access.com>
>---
> net/switchdev/switchdev.c | 1 +
> 1 file changed, 1 insertion(+)
>
>diff --git a/net/switchdev/switchdev.c b/net/switchdev/switchdev.c
>index 3b95fe9..c0a1ad4 100644
>--- a/net/switchdev/switchdev.c
>+++ b/net/switchdev/switchdev.c
>@@ -120,6 +120,7 @@ static struct switchdev_deferred_item *switchdev_deferred_dequeue(void)
> 	dfitem = list_first_entry(&deferred,
> 				  struct switchdev_deferred_item, list);
> 	list_del(&dfitem->list);
>+	spin_unlock_bh(&deferred_lock);
> unlock:
> 	spin_unlock_bh(&deferred_lock);

You are joking right?

^ permalink raw reply

* Re: [PATCH]net:switchdev:add judgement and proper return vlaue to make switchdev_port_vlan_fill() more robust
From: Jiri Pirko @ 2016-12-06 12:32 UTC (permalink / raw)
  To: Feng Deng; +Cc: David S. Miller, netdev, linux-kernel, feng.deng
In-Reply-To: <CAEX1niU24TLaRff+TMPu4ek4vT=R2mv9yaE6j1W1ZsFYh99QGg@mail.gmail.com>

Tue, Dec 06, 2016 at 05:46:36AM CET, cxdx2006@gmail.com wrote:
>From: Feng Deng<cxdx2006@gmail.com>
>
>1.add judgement to make sure switchdev_port_vlan_fill() could
>  return right,  even when switchdev_port_vlan_dump_put() failed  ,
>  which will make switchdev_port_vlan_fill()  more robust
>2.add proper return vlaue at the end of the switchdev_port_vlan_fill()
>
>Signed-off-by: Feng Deng <feng.deng@cortina-access.com>

Please don't send patches like this. You are wasting everyone's time.

Thanks!

^ permalink raw reply

* [PATCH net-next 2/3] net/sched: cls_flower: Add support for matching on flags
From: Or Gerlitz @ 2016-12-06 12:32 UTC (permalink / raw)
  To: David S. Miller; +Cc: netdev, Jiri Pirko, Roi Dayan, Hadar Har-Zion, Or Gerlitz
In-Reply-To: <1481027579-23195-1-git-send-email-ogerlitz@mellanox.com>

Add UAPI to provide set of flags for matching, where the flags
provided from user-space are mapped to flow-dissector flags.

The 1st flag allows to match on whether the packet is an
IP fragment and corresponds to the FLOW_DIS_IS_FRAGMENT flag.

Signed-off-by: Or Gerlitz <ogerlitz@mellanox.com>
Reviewed-by: Paul Blakey <paulb@mellanox.com>
---
 include/uapi/linux/pkt_cls.h |  7 ++++
 net/sched/cls_flower.c       | 83 ++++++++++++++++++++++++++++++++++++++++++++
 2 files changed, 90 insertions(+)

diff --git a/include/uapi/linux/pkt_cls.h b/include/uapi/linux/pkt_cls.h
index 86786d4..f114277 100644
--- a/include/uapi/linux/pkt_cls.h
+++ b/include/uapi/linux/pkt_cls.h
@@ -457,11 +457,18 @@ enum {
 	TCA_FLOWER_KEY_ENC_UDP_SRC_PORT_MASK,	/* be16 */
 	TCA_FLOWER_KEY_ENC_UDP_DST_PORT,	/* be16 */
 	TCA_FLOWER_KEY_ENC_UDP_DST_PORT_MASK,	/* be16 */
+
+	TCA_FLOWER_KEY_FLAGS,		/* be32 */
+	TCA_FLOWER_KEY_FLAGS_MASK,	/* be32 */
 	__TCA_FLOWER_MAX,
 };
 
 #define TCA_FLOWER_MAX (__TCA_FLOWER_MAX - 1)
 
+enum {
+	TCA_FLOWER_KEY_FLAGS_IS_FRAGMENT = BIT(0),
+};
+
 /* Match-all classifier */
 
 enum {
diff --git a/net/sched/cls_flower.c b/net/sched/cls_flower.c
index c5cea78..d9f4124 100644
--- a/net/sched/cls_flower.c
+++ b/net/sched/cls_flower.c
@@ -383,6 +383,8 @@ static const struct nla_policy fl_policy[TCA_FLOWER_MAX + 1] = {
 	[TCA_FLOWER_KEY_ENC_UDP_SRC_PORT_MASK]	= { .type = NLA_U16 },
 	[TCA_FLOWER_KEY_ENC_UDP_DST_PORT]	= { .type = NLA_U16 },
 	[TCA_FLOWER_KEY_ENC_UDP_DST_PORT_MASK]	= { .type = NLA_U16 },
+	[TCA_FLOWER_KEY_FLAGS]		= { .type = NLA_U32 },
+	[TCA_FLOWER_KEY_FLAGS_MASK]	= { .type = NLA_U32 },
 };
 
 static void fl_set_key_val(struct nlattr **tb,
@@ -417,6 +419,40 @@ static void fl_set_key_vlan(struct nlattr **tb,
 	}
 }
 
+static void set_flags(u32 flower_key, u32 flower_mask,
+		      u32 *dissector_key, u32 *dissector_mask,
+		      u32 flower_flag_bit, u32 dissector_flag_bit)
+{
+	if (flower_mask & flower_flag_bit) {
+		*dissector_mask |= dissector_flag_bit;
+		if (flower_key & flower_flag_bit)
+			*dissector_key |= dissector_flag_bit;
+	}
+}
+
+static void fl_set_key_flags(struct nlattr **tb,
+			     u32 *flags_key,
+			     u32 *flags_mask)
+{
+	u32 key, mask;
+
+	if (!tb[TCA_FLOWER_KEY_FLAGS])
+		return;
+
+	key = be32_to_cpu(nla_get_u32(tb[TCA_FLOWER_KEY_FLAGS]));
+
+	if (!tb[TCA_FLOWER_KEY_FLAGS_MASK])
+		mask = ~0;
+	else
+		mask = be32_to_cpu(nla_get_u32(tb[TCA_FLOWER_KEY_FLAGS_MASK]));
+
+	*flags_key  = 0;
+	*flags_mask = 0;
+
+	set_flags(key, mask, flags_key, flags_mask,
+		  TCA_FLOWER_KEY_FLAGS_IS_FRAGMENT, FLOW_DIS_IS_FRAGMENT);
+}
+
 static int fl_set_key(struct net *net, struct nlattr **tb,
 		      struct fl_flow_key *key, struct fl_flow_key *mask)
 {
@@ -543,6 +579,8 @@ static int fl_set_key(struct net *net, struct nlattr **tb,
 		       &mask->enc_tp.dst, TCA_FLOWER_KEY_ENC_UDP_DST_PORT_MASK,
 		       sizeof(key->enc_tp.dst));
 
+	fl_set_key_flags(tb, &key->control.flags, &mask->control.flags);
+
 	return 0;
 }
 
@@ -877,6 +915,48 @@ static int fl_dump_key_vlan(struct sk_buff *skb,
 	return 0;
 }
 
+static void get_flags(u32 dissector_key, u32 dissector_mask,
+		      u32 *flower_key, u32 *flower_mask,
+		      u32 flower_flag_bit, u32 dissector_flag_bit)
+{
+	if (dissector_mask & dissector_flag_bit) {
+		*flower_mask |= flower_flag_bit;
+		if (dissector_key & dissector_flag_bit)
+			*flower_key |= flower_flag_bit;
+	}
+}
+
+static int fl_dump_key_flags(struct sk_buff *skb,
+			     u32 flags_key,
+			     u32 flags_mask)
+{
+	u32 key, mask;
+	__be32 _key, _mask;
+	int err;
+
+	if (!memchr_inv(&flags_mask, 0, sizeof(flags_mask)))
+		return 0;
+
+	key  = 0;
+	mask = 0;
+
+	get_flags(flags_key, flags_mask, &key, &mask,
+		  TCA_FLOWER_KEY_FLAGS_IS_FRAGMENT, FLOW_DIS_IS_FRAGMENT);
+
+	_key  = cpu_to_be32(key);
+	_mask = cpu_to_be32(mask);
+
+	err = nla_put(skb, TCA_FLOWER_KEY_FLAGS, 4, &_key);
+	if (err)
+		return err;
+
+	err = nla_put(skb, TCA_FLOWER_KEY_FLAGS_MASK, 4, &_mask);
+	if (err)
+		return err;
+
+	return 0;
+}
+
 static int fl_dump(struct net *net, struct tcf_proto *tp, unsigned long fh,
 		   struct sk_buff *skb, struct tcmsg *t)
 {
@@ -1012,6 +1092,9 @@ static int fl_dump(struct net *net, struct tcf_proto *tp, unsigned long fh,
 			    sizeof(key->enc_tp.dst)))
 		goto nla_put_failure;
 
+	if (fl_dump_key_flags(skb, key->control.flags, mask->control.flags))
+		goto nla_put_failure;
+
 	nla_put_u32(skb, TCA_FLOWER_FLAGS, f->flags);
 
 	if (tcf_exts_dump(skb, &f->exts))
-- 
2.3.7

^ permalink raw reply related

* [PATCH net-next 1/3] net/flow_dissector: Enable matching on flags for TC filter consumers
From: Or Gerlitz @ 2016-12-06 12:32 UTC (permalink / raw)
  To: David S. Miller
  Cc: netdev, Jiri Pirko, Roi Dayan, Hadar Har-Zion, Or Gerlitz,
	Tom Herbert
In-Reply-To: <1481027579-23195-1-git-send-email-ogerlitz@mellanox.com>

This is a pre-step to allow TC classifiers that makes use of the
flow-dissector (e.g Flower) to match on flow-dissector flags which
are located in the control struct.

Signed-off-by: Or Gerlitz <ogerlitz@mellanox.com>
---
 include/net/flow_dissector.h | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/include/net/flow_dissector.h b/include/net/flow_dissector.h
index c4f3166..a679e6c 100644
--- a/include/net/flow_dissector.h
+++ b/include/net/flow_dissector.h
@@ -154,8 +154,8 @@ struct flow_dissector {
 };
 
 struct flow_keys {
+#define FLOW_KEYS_HASH_START_FIELD control
 	struct flow_dissector_key_control control;
-#define FLOW_KEYS_HASH_START_FIELD basic
 	struct flow_dissector_key_basic basic;
 	struct flow_dissector_key_tags tags;
 	struct flow_dissector_key_vlan vlan;
-- 
2.3.7
Cc: Tom Herbert <tom@herbertland.com>

^ permalink raw reply related

* [PATCH net-next 3/3] net/mlx5e: Offload TC matching on packets being IP fragments
From: Or Gerlitz @ 2016-12-06 12:32 UTC (permalink / raw)
  To: David S. Miller; +Cc: netdev, Jiri Pirko, Roi Dayan, Hadar Har-Zion, Or Gerlitz
In-Reply-To: <1481027579-23195-1-git-send-email-ogerlitz@mellanox.com>

Enable offloading of matching on packets being fragments.

Signed-off-by: Or Gerlitz <ogerlitz@mellanox.com>
Reviewed-by: Paul Blakey <paulb@mellanox.com>
---
 drivers/net/ethernet/mellanox/mlx5/core/en_tc.c | 12 ++++++++++++
 1 file changed, 12 insertions(+)

diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en_tc.c b/drivers/net/ethernet/mellanox/mlx5/core/en_tc.c
index f07ef8c..f8829b5 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/en_tc.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en_tc.c
@@ -31,6 +31,7 @@
  */
 
 #include <net/flow_dissector.h>
+#include <net/sch_generic.h>
 #include <net/pkt_cls.h>
 #include <net/tc_act/tc_gact.h>
 #include <net/tc_act/tc_skbedit.h>
@@ -363,7 +364,18 @@ static int __parse_cls_flower(struct mlx5e_priv *priv,
 			skb_flow_dissector_target(f->dissector,
 						  FLOW_DISSECTOR_KEY_CONTROL,
 						  f->key);
+
+		struct flow_dissector_key_control *mask =
+			skb_flow_dissector_target(f->dissector,
+						  FLOW_DISSECTOR_KEY_CONTROL,
+						  f->mask);
 		addr_type = key->addr_type;
+
+		if (mask->flags & FLOW_DIS_IS_FRAGMENT) {
+			MLX5_SET(fte_match_set_lyr_2_4, headers_c, frag, 1);
+			MLX5_SET(fte_match_set_lyr_2_4, headers_v, frag,
+				 key->flags & FLOW_DIS_IS_FRAGMENT);
+		}
 	}
 
 	if (dissector_uses_key(f->dissector, FLOW_DISSECTOR_KEY_BASIC)) {
-- 
2.3.7

^ permalink raw reply related

* [PATCH net-next 0/3] net/sched: cls_flower: Add support for matching on dissection flags
From: Or Gerlitz @ 2016-12-06 12:32 UTC (permalink / raw)
  To: David S. Miller; +Cc: netdev, Jiri Pirko, Roi Dayan, Hadar Har-Zion, Or Gerlitz

This series add the UAPI to provide set of flags for matching, where the 
flags provided from user-space are mapped to flow-dissector flags.

The 1st flag allows to match on whether the packet is an
IP fragment and corresponds to the FLOW_DIS_IS_FRAGMENT flag.

Or

Or Gerlitz (3):
  net/flow_dissector: Enable matching on flags for TC filter consumers
  net/sched: cls_flower: Add support for matching on flags
  net/mlx5e: Offload TC matching on packets being IP fragments

 drivers/net/ethernet/mellanox/mlx5/core/en_tc.c | 12 ++++
 include/net/flow_dissector.h                    |  2 +-
 include/uapi/linux/pkt_cls.h                    |  7 +++
 net/sched/cls_flower.c                          | 83 +++++++++++++++++++++++++
 4 files changed, 103 insertions(+), 1 deletion(-)

-- 
2.3.7

^ permalink raw reply

* Re: [PATCH net-next 2/3] net/sched: cls_flower: Add support for matching on flags
From: Jiri Pirko @ 2016-12-06 12:40 UTC (permalink / raw)
  To: Or Gerlitz; +Cc: David S. Miller, netdev, Jiri Pirko, Roi Dayan, Hadar Har-Zion
In-Reply-To: <1481027579-23195-3-git-send-email-ogerlitz@mellanox.com>

Tue, Dec 06, 2016 at 01:32:58PM CET, ogerlitz@mellanox.com wrote:
>Add UAPI to provide set of flags for matching, where the flags
>provided from user-space are mapped to flow-dissector flags.
>
>The 1st flag allows to match on whether the packet is an
>IP fragment and corresponds to the FLOW_DIS_IS_FRAGMENT flag.
>
>Signed-off-by: Or Gerlitz <ogerlitz@mellanox.com>
>Reviewed-by: Paul Blakey <paulb@mellanox.com>
>---
> include/uapi/linux/pkt_cls.h |  7 ++++
> net/sched/cls_flower.c       | 83 ++++++++++++++++++++++++++++++++++++++++++++
> 2 files changed, 90 insertions(+)
>
>diff --git a/include/uapi/linux/pkt_cls.h b/include/uapi/linux/pkt_cls.h
>index 86786d4..f114277 100644
>--- a/include/uapi/linux/pkt_cls.h
>+++ b/include/uapi/linux/pkt_cls.h
>@@ -457,11 +457,18 @@ enum {
> 	TCA_FLOWER_KEY_ENC_UDP_SRC_PORT_MASK,	/* be16 */
> 	TCA_FLOWER_KEY_ENC_UDP_DST_PORT,	/* be16 */
> 	TCA_FLOWER_KEY_ENC_UDP_DST_PORT_MASK,	/* be16 */
>+
>+	TCA_FLOWER_KEY_FLAGS,		/* be32 */
>+	TCA_FLOWER_KEY_FLAGS_MASK,	/* be32 */
> 	__TCA_FLOWER_MAX,
> };
> 
> #define TCA_FLOWER_MAX (__TCA_FLOWER_MAX - 1)
> 
>+enum {
>+	TCA_FLOWER_KEY_FLAGS_IS_FRAGMENT = BIT(0),
>+};
>+
> /* Match-all classifier */
> 
> enum {
>diff --git a/net/sched/cls_flower.c b/net/sched/cls_flower.c
>index c5cea78..d9f4124 100644
>--- a/net/sched/cls_flower.c
>+++ b/net/sched/cls_flower.c
>@@ -383,6 +383,8 @@ static const struct nla_policy fl_policy[TCA_FLOWER_MAX + 1] = {
> 	[TCA_FLOWER_KEY_ENC_UDP_SRC_PORT_MASK]	= { .type = NLA_U16 },
> 	[TCA_FLOWER_KEY_ENC_UDP_DST_PORT]	= { .type = NLA_U16 },
> 	[TCA_FLOWER_KEY_ENC_UDP_DST_PORT_MASK]	= { .type = NLA_U16 },
>+	[TCA_FLOWER_KEY_FLAGS]		= { .type = NLA_U32 },
>+	[TCA_FLOWER_KEY_FLAGS_MASK]	= { .type = NLA_U32 },
> };
> 
> static void fl_set_key_val(struct nlattr **tb,
>@@ -417,6 +419,40 @@ static void fl_set_key_vlan(struct nlattr **tb,
> 	}
> }
> 
>+static void set_flags(u32 flower_key, u32 flower_mask,

Use some prefix for helpers like this, or at least __



>+		      u32 *dissector_key, u32 *dissector_mask,
>+		      u32 flower_flag_bit, u32 dissector_flag_bit)
>+{
>+	if (flower_mask & flower_flag_bit) {
>+		*dissector_mask |= dissector_flag_bit;
>+		if (flower_key & flower_flag_bit)
>+			*dissector_key |= dissector_flag_bit;
>+	}
>+}
>+
>+static void fl_set_key_flags(struct nlattr **tb,
>+			     u32 *flags_key,
>+			     u32 *flags_mask)
>+{
>+	u32 key, mask;
>+
>+	if (!tb[TCA_FLOWER_KEY_FLAGS])
>+		return;
>+
>+	key = be32_to_cpu(nla_get_u32(tb[TCA_FLOWER_KEY_FLAGS]));
>+
>+	if (!tb[TCA_FLOWER_KEY_FLAGS_MASK])
>+		mask = ~0;
>+	else
>+		mask = be32_to_cpu(nla_get_u32(tb[TCA_FLOWER_KEY_FLAGS_MASK]));
>+
>+	*flags_key  = 0;
>+	*flags_mask = 0;
>+
>+	set_flags(key, mask, flags_key, flags_mask,
>+		  TCA_FLOWER_KEY_FLAGS_IS_FRAGMENT, FLOW_DIS_IS_FRAGMENT);
>+}
>+
> static int fl_set_key(struct net *net, struct nlattr **tb,
> 		      struct fl_flow_key *key, struct fl_flow_key *mask)
> {
>@@ -543,6 +579,8 @@ static int fl_set_key(struct net *net, struct nlattr **tb,
> 		       &mask->enc_tp.dst, TCA_FLOWER_KEY_ENC_UDP_DST_PORT_MASK,
> 		       sizeof(key->enc_tp.dst));
> 
>+	fl_set_key_flags(tb, &key->control.flags, &mask->control.flags);
>+
> 	return 0;
> }
> 
>@@ -877,6 +915,48 @@ static int fl_dump_key_vlan(struct sk_buff *skb,
> 	return 0;
> }
> 
>+static void get_flags(u32 dissector_key, u32 dissector_mask,

Same here.


>+		      u32 *flower_key, u32 *flower_mask,
>+		      u32 flower_flag_bit, u32 dissector_flag_bit)
>+{
>+	if (dissector_mask & dissector_flag_bit) {
>+		*flower_mask |= flower_flag_bit;
>+		if (dissector_key & dissector_flag_bit)
>+			*flower_key |= flower_flag_bit;
>+	}
>+}
>+
>+static int fl_dump_key_flags(struct sk_buff *skb,
>+			     u32 flags_key,
>+			     u32 flags_mask)

over-wrapped :)


>+{
>+	u32 key, mask;
>+	__be32 _key, _mask;
>+	int err;
>+
>+	if (!memchr_inv(&flags_mask, 0, sizeof(flags_mask)))
>+		return 0;
>+
>+	key  = 0;
            ^ remove the extra space


>+	mask = 0;
>+
>+	get_flags(flags_key, flags_mask, &key, &mask,
>+		  TCA_FLOWER_KEY_FLAGS_IS_FRAGMENT, FLOW_DIS_IS_FRAGMENT);
>+
>+	_key  = cpu_to_be32(key);
             ^ remove the extra space


>+	_mask = cpu_to_be32(mask);
>+
>+	err = nla_put(skb, TCA_FLOWER_KEY_FLAGS, 4, &_key);
>+	if (err)
>+		return err;
>+
>+	err = nla_put(skb, TCA_FLOWER_KEY_FLAGS_MASK, 4, &_mask);

Just return nla_put...


>+	if (err)
>+		return err;
>+
>+	return 0;
>+}
>+
> static int fl_dump(struct net *net, struct tcf_proto *tp, unsigned long fh,
> 		   struct sk_buff *skb, struct tcmsg *t)
> {
>@@ -1012,6 +1092,9 @@ static int fl_dump(struct net *net, struct tcf_proto *tp, unsigned long fh,
> 			    sizeof(key->enc_tp.dst)))
> 		goto nla_put_failure;
> 
>+	if (fl_dump_key_flags(skb, key->control.flags, mask->control.flags))
>+		goto nla_put_failure;
>+
> 	nla_put_u32(skb, TCA_FLOWER_FLAGS, f->flags);
> 
> 	if (tcf_exts_dump(skb, &f->exts))
>-- 
>2.3.7
>

^ permalink raw reply

* [PATCH RFC v1 0/2] ethtool: add support for VF in ring_cookie
From: Jacob Keller @ 2016-12-06 12:40 UTC (permalink / raw)
  To: John W . Linville; +Cc: netdev, Jacob Keller

The ring_cookie value for an ntuple filter is used by the driver to
determine what queue to direct the traffic towards. Since the main
function has cnotrol over all queues it is possible to use this to
direct traffic to a queue on a virtual function rather than a queue on
the physical function. However, this was cumbersome to do if you did not
know exactly how the device laid out virtual function queues.

To alleviate this, the kernel side of the ethtool API gained a couple of
functions used to partition the ring_cookie value into a queue index,
and a virtual function index. It was assumed that no device was likely
to have more than 2^32 queues, and currently no device has more than 2^8
virtual functions, so the split was chosen such that the lower 32 bits
of the ring_cookie represent the queue, and the next 8 bits would
represent a virtual function id.

This is useful, but lacked support on the ethtool command line side. It
is possible to simply use hex notation, and to "know" that the driver
works in this manner. However, this makes use of and knowledge of the
feature not widespread.

This series attempts to fix that with two patches. First, we make use of
the access functions when displaying the queue. Since no device
currently supports queues beyond 2^32 it is unlikely that any driver
which does not honor this partitioning of the ring_cookie would have its
index incorrectly interpreted. This makes it so that the user will see
the queue index and virtual function index split apart instead of being
displayed as a combined base-10 number.

Additionally, the second patch adds support for a new notation for the
action. Instead of just using a number, we add support for the special
syntax "X/Y" where X would be a number from 0-256 which is the virtual
function index, while Y would be a number from 0 to 2^32 which is the
queue index. This makes it much easier to write the encoded ring_cookie
value.

This series is sent as an RFC so that others may comment on the design
of the new syntax, and offer alternative suggestions. Some of the ones I
have considered:

a) just leave the second part alone, and display the VF+queue split, but
do not provide an easier syntax for writing the split value. This is
easy to do, but leaves the feature as a sort of hidden gem.
Additionally, the user would most easily do this by writing it as hex
notation such as

  action 0x100000005

However, this is cumbersome. It can be partially alleviated by using
shell scripting on top of ethtool, but again is somewhat of a pain.

b) actually adding a new key work so that you would do something like
"queue x" and "vf y" or similar. However, this is much more difficult to
correctly implement. The current implementation of handling the values
assumes that each type argument is followed by a single argument which
fits into a well defined section of the structure. Because the structure
is not specified as le64 or be64, we can't use a union or similar to
make it correct (since placement in the union ordering would depend on
byte ordering of the field!) so we would have to re-architect the
handling of values to allow multiple keywords to combine into a single
value. This seemed incredibly cumbersome.

The disadvantage of the new syntax is that it *is* new syntax that we
have to parse ourselves, and that may not be obvious to users. However,
we can document it, and the current syntax will work for all drivers,
and the new syntax will generally only work on drivers which use the
ring_cookie partitioning in this manner (as no current driver has queues
beyond 2^32, as was decided when the partitioning scheme was added to
the ethtool header file).

I like the use of / but am open to alternative suggestions for a simple
sequence. Note that we cannot use spaces, as this would move into the
argv solution of (b) which I do not have time to implement, and would be
a much more massive project.

Jacob Keller (2):
  rxclass: use ethtool_get_flow_spec_ring to display queue and VF
  ethtool: add support to specify ring_cookie as VF/queue

 ethtool.8.in |  3 ++-
 rxclass.c    | 74 +++++++++++++++++++++++++++++++++++++++++++++++++++---------
 2 files changed, 65 insertions(+), 12 deletions(-)

-- 
2.11.0.rc2.152.g4d04e67

^ permalink raw reply

* [PATCH RFC v1 1/2] rxclass: use ethtool_get_flow_spec_ring to display queue and VF
From: Jacob Keller @ 2016-12-06 12:40 UTC (permalink / raw)
  To: John W . Linville; +Cc: netdev, Jacob Keller
In-Reply-To: <20161206124031.32346-1-jacob.e.keller@intel.com>

Recent kernels have made the ring_cookie store both the Queue index as
well as a VF identifier. This allows for drivers to direct traffic to
specific VF queues, without having the user have to understand the
physical queue layout. Add support to display this notation so that
users do not have to manually parse the value.

Signed-off-by: Jacob Keller <jacob.e.keller@intel.com>
---
 rxclass.c | 16 ++++++++++++----
 1 file changed, 12 insertions(+), 4 deletions(-)

diff --git a/rxclass.c b/rxclass.c
index c7bfebaf6e22..7f0e765d3b47 100644
--- a/rxclass.c
+++ b/rxclass.c
@@ -247,11 +247,19 @@ static void rxclass_print_nfc_rule(struct ethtool_rx_flow_spec *fsp)
 
 	rxclass_print_nfc_spec_ext(fsp);
 
-	if (fsp->ring_cookie != RX_CLS_FLOW_DISC)
-		fprintf(stdout, "\tAction: Direct to queue %llu\n",
-			fsp->ring_cookie);
-	else
+	if (fsp->ring_cookie != RX_CLS_FLOW_DISC) {
+		u64 vf = ethtool_get_flow_spec_ring_vf(fsp->ring_cookie);
+		u64 queue = ethtool_get_flow_spec_ring(fsp->ring_cookie);
+
+		if (vf)
+			fprintf(stdout, "\tAction: Direct to queue %llu\n",
+				queue);
+		else
+			fprintf(stdout, "\tAction: Direct to VF %llu queue %llu\n",
+				vf, queue);
+	} else {
 		fprintf(stdout, "\tAction: Drop\n");
+	}
 
 	fprintf(stdout, "\n");
 }
-- 
2.11.0.rc2.152.g4d04e67

^ permalink raw reply related

* [PATCH RFC v1 2/2] ethtool: add support to specify ring_cookie as VF/queue
From: Jacob Keller @ 2016-12-06 12:40 UTC (permalink / raw)
  To: John W . Linville; +Cc: netdev, Jacob Keller
In-Reply-To: <20161206124031.32346-1-jacob.e.keller@intel.com>

Add support to allow users to specify the ring_cookie as a VF/queue
value, ie: '0/25' means direct to the physical function queue 25, while
'3/10' means to direct to VF id 3, queue 10. It is driver dependent to
determine what the VF id is, excepting that 0 indicates the physical
function. This allows the user to more easily specify the matching
notation as it is understood by the kernel and some drivers. Since no
driver exists today with over 2,147,483,647 queues, this notation will
simply fail on drivers which don't recognize the new partitioning.

Signed-off-by: Jacob Keller <jacob.e.keller@intel.com>
---
 ethtool.8.in |  3 ++-
 rxclass.c    | 58 +++++++++++++++++++++++++++++++++++++++++++++++++++-------
 2 files changed, 53 insertions(+), 8 deletions(-)

diff --git a/ethtool.8.in b/ethtool.8.in
index 5c36c06385f6..cb165fa4c77a 100644
--- a/ethtool.8.in
+++ b/ethtool.8.in
@@ -826,7 +826,8 @@ Specifies the Rx queue to send packets to, or some other action.
 nokeep;
 lB	l.
 -1	Drop the matched flow
-0 or higher	Rx queue to route the flow
+0 or higher	Route flow to specific Rx queue on the physical function
+N/M	Route flow to Rx queue M of specific virtual function N (0 indicates the physical function)
 .TE
 .TP
 .BI loc \ N
diff --git a/rxclass.c b/rxclass.c
index 7f0e765d3b47..de25550529b4 100644
--- a/rxclass.c
+++ b/rxclass.c
@@ -614,6 +614,7 @@ typedef enum {
 	OPT_IP4,
 	OPT_IP6,
 	OPT_MAC,
+	OPT_ACTION,
 } rule_opt_type_t;
 
 #define NFC_FLAG_RING		0x001
@@ -654,7 +655,7 @@ static const struct rule_opts rule_nfc_tcp_ip4[] = {
 	{ "dst-port", OPT_BE16, NFC_FLAG_DPORT,
 	  offsetof(struct ethtool_rx_flow_spec, h_u.tcp_ip4_spec.pdst),
 	  offsetof(struct ethtool_rx_flow_spec, m_u.tcp_ip4_spec.pdst) },
-	{ "action", OPT_U64, NFC_FLAG_RING,
+	{ "action", OPT_ACTION, NFC_FLAG_RING,
 	  offsetof(struct ethtool_rx_flow_spec, ring_cookie), -1 },
 	{ "loc", OPT_U32, NFC_FLAG_LOC,
 	  offsetof(struct ethtool_rx_flow_spec, location), -1 },
@@ -685,7 +686,7 @@ static const struct rule_opts rule_nfc_esp_ip4[] = {
 	{ "spi", OPT_BE32, NFC_FLAG_SPI,
 	  offsetof(struct ethtool_rx_flow_spec, h_u.esp_ip4_spec.spi),
 	  offsetof(struct ethtool_rx_flow_spec, m_u.esp_ip4_spec.spi) },
-	{ "action", OPT_U64, NFC_FLAG_RING,
+	{ "action", OPT_ACTION, NFC_FLAG_RING,
 	  offsetof(struct ethtool_rx_flow_spec, ring_cookie), -1 },
 	{ "loc", OPT_U32, NFC_FLAG_LOC,
 	  offsetof(struct ethtool_rx_flow_spec, location), -1 },
@@ -728,7 +729,7 @@ static const struct rule_opts rule_nfc_usr_ip4[] = {
 	{ "dst-port", OPT_BE16, NFC_FLAG_DPORT,
 	  offsetof(struct ethtool_rx_flow_spec, h_u.usr_ip4_spec.l4_4_bytes) + 2,
 	  offsetof(struct ethtool_rx_flow_spec, m_u.usr_ip4_spec.l4_4_bytes) + 2 },
-	{ "action", OPT_U64, NFC_FLAG_RING,
+	{ "action", OPT_ACTION, NFC_FLAG_RING,
 	  offsetof(struct ethtool_rx_flow_spec, ring_cookie), -1 },
 	{ "loc", OPT_U32, NFC_FLAG_LOC,
 	  offsetof(struct ethtool_rx_flow_spec, location), -1 },
@@ -762,7 +763,7 @@ static const struct rule_opts rule_nfc_tcp_ip6[] = {
 	{ "dst-port", OPT_BE16, NFC_FLAG_DPORT,
 	  offsetof(struct ethtool_rx_flow_spec, h_u.tcp_ip6_spec.pdst),
 	  offsetof(struct ethtool_rx_flow_spec, m_u.tcp_ip6_spec.pdst) },
-	{ "action", OPT_U64, NFC_FLAG_RING,
+	{ "action", OPT_ACTION, NFC_FLAG_RING,
 	  offsetof(struct ethtool_rx_flow_spec, ring_cookie), -1 },
 	{ "loc", OPT_U32, NFC_FLAG_LOC,
 	  offsetof(struct ethtool_rx_flow_spec, location), -1 },
@@ -793,7 +794,7 @@ static const struct rule_opts rule_nfc_esp_ip6[] = {
 	{ "spi", OPT_BE32, NFC_FLAG_SPI,
 	  offsetof(struct ethtool_rx_flow_spec, h_u.esp_ip6_spec.spi),
 	  offsetof(struct ethtool_rx_flow_spec, m_u.esp_ip6_spec.spi) },
-	{ "action", OPT_U64, NFC_FLAG_RING,
+	{ "action", OPT_ACTION, NFC_FLAG_RING,
 	  offsetof(struct ethtool_rx_flow_spec, ring_cookie), -1 },
 	{ "loc", OPT_U32, NFC_FLAG_LOC,
 	  offsetof(struct ethtool_rx_flow_spec, location), -1 },
@@ -836,7 +837,7 @@ static const struct rule_opts rule_nfc_usr_ip6[] = {
 	{ "dst-port", OPT_BE16, NFC_FLAG_DPORT,
 	  offsetof(struct ethtool_rx_flow_spec, h_u.usr_ip6_spec.l4_4_bytes) + 2,
 	  offsetof(struct ethtool_rx_flow_spec, m_u.usr_ip6_spec.l4_4_bytes) + 2 },
-	{ "action", OPT_U64, NFC_FLAG_RING,
+	{ "action", OPT_ACTION, NFC_FLAG_RING,
 	  offsetof(struct ethtool_rx_flow_spec, ring_cookie), -1 },
 	{ "loc", OPT_U32, NFC_FLAG_LOC,
 	  offsetof(struct ethtool_rx_flow_spec, location), -1 },
@@ -864,7 +865,7 @@ static const struct rule_opts rule_nfc_ether[] = {
 	{ "proto", OPT_BE16, NFC_FLAG_PROTO,
 	  offsetof(struct ethtool_rx_flow_spec, h_u.ether_spec.h_proto),
 	  offsetof(struct ethtool_rx_flow_spec, m_u.ether_spec.h_proto) },
-	{ "action", OPT_U64, NFC_FLAG_RING,
+	{ "action", OPT_ACTION, NFC_FLAG_RING,
 	  offsetof(struct ethtool_rx_flow_spec, ring_cookie), -1 },
 	{ "loc", OPT_U32, NFC_FLAG_LOC,
 	  offsetof(struct ethtool_rx_flow_spec, location), -1 },
@@ -948,6 +949,40 @@ static int rxclass_get_ether(char *str, unsigned char *val)
 	return 0;
 }
 
+static int rxclass_get_action(char *str, unsigned long long *val)
+{
+	const long long max_queue = ~0ULL >> 32;
+	const long long max_vf = ~0ULL >> 8;
+	char *end_vf, *end_queue;
+	unsigned long long queue, vf;
+
+	/* If there is no '/' just read the full value like a u64 */
+	if (!strchr(str, '/'))
+		return rxclass_get_ulong(str, val, 64);
+
+	errno = 0;
+
+	vf = strtoull(str, &end_vf, 0);
+
+	/*
+	 * Ensure that we consume everything up to the slash, and that there
+	 * is more than the empty string following the slash.
+	 */
+	if ((*end_vf != '/') || !(*(end_vf + 1)) || errno || (vf > max_vf))
+		return -1;
+
+	errno = 0;
+
+	queue = strtoull(end_vf + 1, &end_queue, 0);
+
+	if (*end_queue || errno || (queue > max_queue))
+		return -1;
+
+	*val = (vf << ETHTOOL_RX_FLOW_SPEC_RING_VF_OFF) | queue;
+
+	return 0;
+}
+
 static int rxclass_get_val(char *str, unsigned char *p, u32 *flags,
 			   const struct rule_opts *opt)
 {
@@ -1070,6 +1105,14 @@ static int rxclass_get_val(char *str, unsigned char *p, u32 *flags,
 			memcpy(&p[opt->moffset], &mask, ETH_ALEN);
 		break;
 	}
+	case OPT_ACTION: {
+		unsigned long long val;
+		err = rxclass_get_action(str, &val);
+		if (err)
+			return -1;
+		*(u64 *)&p[opt->offset] = (u64)val;
+		break;
+	}
 	case OPT_NONE:
 	default:
 		return -1;
@@ -1183,6 +1226,7 @@ static int rxclass_get_mask(char *str, unsigned char *p,
 		memcpy(&p[opt->moffset], val, ETH_ALEN);
 		break;
 	}
+	case OPT_ACTION:
 	case OPT_NONE:
 	default:
 		return -1;
-- 
2.11.0.rc2.152.g4d04e67

^ permalink raw reply related

* Re: [PATCH net-next 1/3] net/flow_dissector: Enable matching on flags for TC filter consumers
From: Jiri Pirko @ 2016-12-06 12:43 UTC (permalink / raw)
  To: Or Gerlitz
  Cc: David S. Miller, netdev, Jiri Pirko, Roi Dayan, Hadar Har-Zion,
	Tom Herbert
In-Reply-To: <1481027579-23195-2-git-send-email-ogerlitz@mellanox.com>

Tue, Dec 06, 2016 at 01:32:57PM CET, ogerlitz@mellanox.com wrote:
>This is a pre-step to allow TC classifiers that makes use of the
>flow-dissector (e.g Flower) to match on flow-dissector flags which
>are located in the control struct.
>
>Signed-off-by: Or Gerlitz <ogerlitz@mellanox.com>
>---
> include/net/flow_dissector.h | 2 +-
> 1 file changed, 1 insertion(+), 1 deletion(-)
>
>diff --git a/include/net/flow_dissector.h b/include/net/flow_dissector.h
>index c4f3166..a679e6c 100644
>--- a/include/net/flow_dissector.h
>+++ b/include/net/flow_dissector.h
>@@ -154,8 +154,8 @@ struct flow_dissector {
> };
> 
> struct flow_keys {
>+#define FLOW_KEYS_HASH_START_FIELD control
> 	struct flow_dissector_key_control control;
>-#define FLOW_KEYS_HASH_START_FIELD basic

How is this hashing related to cls_flower dissections? What am I
missing?

^ permalink raw reply

* Re: [PATCH 0/5] cpsw: add per channel shaper configuration
From: Ivan Khoronzhuk @ 2016-12-06 13:35 UTC (permalink / raw)
  To: Grygorii Strashko; +Cc: netdev, linux-kernel, mugunthanvnm, linux-omap
In-Reply-To: <4640ce88-75b0-8508-05f0-849e8fd4927f@ti.com>

On Mon, Dec 05, 2016 at 02:33:40PM -0600, Grygorii Strashko wrote:
> Hi Ivan,

Hi, Grygorii

I've sent patch that fixes issue in question.
https://lkml.org/lkml/2016/12/5/811
Could you please review it. Also I'm preparing several patches
to sophisticate cases with unexpected speed value.

> 
> On 11/29/2016 09:00 AM, Ivan Khoronzhuk wrote:
> > This series is intended to allow user to set rate for per channel
> > shapers at cpdma level. This patchset doesn't have impact on performance.
> > The rate can be set with:
> > 
> > echo 100 > /sys/class/net/ethX/queues/tx-0/tx_maxrate
> > 
> > Tested on am572xx
> > Based on net-next/master
> > 
> > Ivan Khoronzhuk (5):
> >   net: ethernet: ti: davinci_cpdma: add weight function for channels
> >   net: ethernet: ti: davinci_cpdma: add set rate for a channel
> >   net: ethernet: ti: cpsw: add .ndo to set per-queue rate
> >   net: ethernet: ti: cpsw: optimize end of poll cycle
> >   net: ethernet: ti: cpsw: split tx budget according between channels
> > 
> >  drivers/net/ethernet/ti/cpsw.c          | 264 +++++++++++++++----
> >  drivers/net/ethernet/ti/davinci_cpdma.c | 453 ++++++++++++++++++++++++++++----
> >  drivers/net/ethernet/ti/davinci_cpdma.h |   6 +
> >  3 files changed, 624 insertions(+), 99 deletions(-)
> > 
> 
> 
> I've just tried net-next on BBB and got below back-trace:
> INIT: Entering runlevel: 5
> Configuring network interfaces... [   15.018356] net eth0: initializing cpsw version 1.12 (0)
> [   15.120153] SMSC LAN8710/LAN8720 4a101000.mdio:00: attached PHY driver [SMSC LAN8710/LAN8720] (mii_bus:phy_addr=4a101000.mdio:00, irq=-1)
> [   15.138578] Division by zero in kernel.
> [   15.142667] CPU: 0 PID: 755 Comm: ifconfig Not tainted 4.9.0-rc7-01617-g6ea3f00 #5
> [   15.150277] Hardware name: Generic AM33XX (Flattened Device Tree)
> [   15.156399] Backtrace: 
> [   15.158898] [<c010bbf8>] (dump_backtrace) from [<c010beb4>] (show_stack+0x18/0x1c)
> [   15.166508]  r7:00000000 r6:600f0013 r5:00000000 r4:c0d395d0
> [   15.172200] [<c010be9c>] (show_stack) from [<c03ce6cc>] (dump_stack+0x8c/0xa0)
> [   15.179460] [<c03ce640>] (dump_stack) from [<c010bd60>] (__div0+0x1c/0x20)
> [   15.186368]  r7:00000000 r6:ddf1c010 r5:00000001 r4:00000001
> [   15.192068] [<c010bd44>] (__div0) from [<c03cd17c>] (Ldiv0+0x8/0x14)
> [   15.198503] [<bf07ba38>] (cpsw_split_budget [ti_cpsw]) from [<bf07ee8c>] (cpsw_ndo_open+0x4b8/0x5e4 [ti_cpsw])
> [   15.208554]  r10:ddf1c010 r9:00000000 r8:00000000 r7:ddf1c010 r6:dcc88000 r5:dcc88500
> [   15.216418]  r4:ddf1c0b0
> [   15.218985] [<bf07e9d4>] (cpsw_ndo_open [ti_cpsw]) from [<c0731c68>] (__dev_open+0xb0/0x114)
> [   15.227466]  r10:00000000 r9:00000000 r8:00000000 r7:dcc88030 r6:bf080364 r5:00000000
> [   15.235330]  r4:dcc88000
> [   15.237880] [<c0731bb8>] (__dev_open) from [<c0731f24>] (__dev_change_flags+0x9c/0x14c)
> [   15.245923]  r7:00001002 r6:00000001 r5:00001043 r4:dcc88000
> [   15.251613] [<c0731e88>] (__dev_change_flags) from [<c0731ff4>] (dev_change_flags+0x20/0x50)
> [   15.260093]  r9:00000000 r8:00000000 r7:dcbabf0c r6:00001002 r5:dcc8813c r4:dcc88000
> [   15.267886] [<c0731fd4>] (dev_change_flags) from [<c07a3168>] (devinet_ioctl+0x6d4/0x794)
> [   15.276105]  r9:00000000 r8:bee8cc64 r7:dcbabf0c r6:00000000 r5:dc921e80 r4:00000000
> [   15.283891] [<c07a2a94>] (devinet_ioctl) from [<c07a54bc>] (inet_ioctl+0x19c/0x1c8)
> [   15.291587]  r10:00000000 r9:dc920000 r8:bee8cc64 r7:c0d64bc0 r6:bee8cc64 r5:dd2728e0
> [   15.299450]  r4:00008914
> [   15.302002] [<c07a5320>] (inet_ioctl) from [<c07112dc>] (sock_ioctl+0x14c/0x300)
> [   15.309443] [<c0711190>] (sock_ioctl) from [<c0241868>] (do_vfs_ioctl+0xa8/0x98c)
> [   15.316962]  r7:00000003 r6:ddf0a780 r5:dd2728e0 r4:bee8cc64
> [   15.322653] [<c02417c0>] (do_vfs_ioctl) from [<c0242188>] (SyS_ioctl+0x3c/0x64)
> [   15.330000]  r10:00000000 r9:dc920000 r8:bee8cc64 r7:00008914 r6:ddf0a780 r5:00000003
> [   15.337864]  r4:ddf0a780
> [   15.340416] [<c024214c>] (SyS_ioctl) from [<c0107d60>] (ret_fast_syscall+0x0/0x3c)
> [   15.348024]  r9:dc920000 r8:c0107f24 r7:00000036 r6:bee8cf4d r5:bee8ce4c r4:000949f0
> [   15.361174] IPv6: ADDRCONF(NETDEV_UP): eth0: link is not ready
> udhcpc (v1.23.1) started
> 
> 
> -- 
> regards,
> -grygorii

^ permalink raw reply

* Re: [PATCH v4 09/13] net: ethernet: ti: cpts: rework initialization/deinitialization
From: Richard Cochran @ 2016-12-06 13:40 UTC (permalink / raw)
  To: Grygorii Strashko
  Cc: David S. Miller, netdev-u79uwXL29TY76Z2rM5mHXA, Mugunthan V N,
	Sekhar Nori, linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	linux-omap-u79uwXL29TY76Z2rM5mHXA,
	devicetree-u79uwXL29TY76Z2rM5mHXA, Murali Karicheri, Wingman Kwok,
	Thomas Gleixner
In-Reply-To: <20161205200525.16664-10-grygorii.strashko-l0cyMroinI0@public.gmane.org>

On Mon, Dec 05, 2016 at 02:05:21PM -0600, Grygorii Strashko wrote:
> @@ -372,34 +354,27 @@ void cpts_tx_timestamp(struct cpts *cpts, struct sk_buff *skb)
>  }
>  EXPORT_SYMBOL_GPL(cpts_tx_timestamp);
>  
> -int cpts_register(struct device *dev, struct cpts *cpts,
> -		  u32 mult, u32 shift)
> +int cpts_register(struct cpts *cpts)
>  {
>  	int err, i;
>  
> -	cpts->info = cpts_info;
> -	spin_lock_init(&cpts->lock);
> -
> -	cpts->cc.read = cpts_systim_read;
> -	cpts->cc.mask = CLOCKSOURCE_MASK(32);
> -	cpts->cc_mult = mult;
> -	cpts->cc.mult = mult;
> -	cpts->cc.shift = shift;
> -
>  	INIT_LIST_HEAD(&cpts->events);
>  	INIT_LIST_HEAD(&cpts->pool);
>  	for (i = 0; i < CPTS_MAX_EVENTS; i++)
>  		list_add(&cpts->pool_data[i].list, &cpts->pool);
>  
> -	cpts_clk_init(dev, cpts);
> +	clk_enable(cpts->refclk);
> +
>  	cpts_write32(cpts, CPTS_EN, control);
>  	cpts_write32(cpts, TS_PEND_EN, int_enable);
>  
> +	/* reinitialize cc.mult to original value as it can be modified
> +	 * by cpts_ptp_adjfreq().
> +	 */
> +	cpts->cc.mult = cpts->cc_mult;

This still isn't quite right.  First of all, you shouldn't clobber the
learned cc.mult value in cpts_register().  Presumably, if PTP had been
run on this port before, then the learned frequency is approximately
correct, and it should be left alone.

[ BTW, resetting the timecounter here makes no sense either.  Why
  reset the clock just because the interface goes down?  ]

Secondly, you have made the initialization order of these fields hard
to follow.  With the whole series applied:

probe()
	cpts_create()
		cpts_of_parse()
		{
			/* Set cc_mult but not cc.mult! */
			set cc_mult
			set cc.shift
		}
		cpts_calc_mult_shift()
		{
			/* Set them both. */
			cpts->cc_mult = mult;
			cpts->cc.mult = mult;
			cpts->cc.shift = shift;
		}
/* later on */
cpts_register()
	cpts->cc.mult = cpts->cc_mult;

There is no need for such complexity.  Simply set cc.mult in
cpts_create() _once_, immediately after the call to
cpts_calc_mult_shift().

You can remove the assignment from cpts_calc_mult_shift() and
cpts_register().

Thanks,
Richard
--
To unsubscribe from this list: send the line "unsubscribe devicetree" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply

* [PATCH] net: stmmac: do not call phy_ethtool_ksettings_set from atomic context
From: Niklas Cassel @ 2016-12-06 13:47 UTC (permalink / raw)
  To: Giuseppe Cavallaro, Alexandre Torgue; +Cc: Niklas Cassel, netdev, linux-kernel

From: Niklas Cassel <niklas.cassel@axis.com>

>From what I can tell, spin_lock(&priv->lock) is not needed, since the
phy_ethtool_ksettings_set call is not given the priv struct.

phy_start_aneg takes the phydev->lock. Calls to phy_adjust_link
from phy_state_machine also takes the phydev->lock.

[   13.718319] BUG: sleeping function called from invalid context at kernel/locking/mutex.c:97
[   13.726717] in_atomic(): 1, irqs_disabled(): 0, pid: 1307, name: ethtool
[   13.742115] Hardware name: Axis ARTPEC-6 Platform
[   13.746829] [<80110568>] (unwind_backtrace) from [<8010c2bc>] (show_stack+0x18/0x1c)
[   13.754575] [<8010c2bc>] (show_stack) from [<80433484>] (dump_stack+0x80/0xa0)
[   13.761801] [<80433484>] (dump_stack) from [<80145428>] (___might_sleep+0x108/0x170)
[   13.769554] [<80145428>] (___might_sleep) from [<806c9b50>] (mutex_lock+0x24/0x44)
[   13.777128] [<806c9b50>] (mutex_lock) from [<8050cbc0>] (phy_start_aneg+0x1c/0x13c)
[   13.784783] [<8050cbc0>] (phy_start_aneg) from [<8050d338>] (phy_ethtool_ksettings_set+0x98/0xd0)
[   13.793656] [<8050d338>] (phy_ethtool_ksettings_set) from [<80517adc>] (stmmac_ethtool_set_link_ksettings+0xa0/0xb4)
[   13.804184] [<80517adc>] (stmmac_ethtool_set_link_ksettings) from [<805c5138>] (ethtool_set_settings+0xd4/0x13c)
[   13.814358] [<805c5138>] (ethtool_set_settings) from [<805c9718>] (dev_ethtool+0x13c4/0x211c)
[   13.822882] [<805c9718>] (dev_ethtool) from [<805dc7c0>] (dev_ioctl+0x480/0x8e0)
[   13.830291] [<805dc7c0>] (dev_ioctl) from [<80260e34>] (do_vfs_ioctl+0x94/0xa00)
[   13.837699] [<80260e34>] (do_vfs_ioctl) from [<802617dc>] (SyS_ioctl+0x3c/0x60)
[   13.845011] [<802617dc>] (SyS_ioctl) from [<801088bc>] (__sys_trace_return+0x0/0x10)

Signed-off-by: Niklas Cassel <niklas.cassel@axis.com>
---
 drivers/net/ethernet/stmicro/stmmac/stmmac_ethtool.c | 2 --
 1 file changed, 2 deletions(-)

diff --git a/drivers/net/ethernet/stmicro/stmmac/stmmac_ethtool.c b/drivers/net/ethernet/stmicro/stmmac/stmmac_ethtool.c
index d5a8122b6033..6ab7e2bdcadd 100644
--- a/drivers/net/ethernet/stmicro/stmmac/stmmac_ethtool.c
+++ b/drivers/net/ethernet/stmicro/stmmac/stmmac_ethtool.c
@@ -406,9 +406,7 @@ stmmac_ethtool_set_link_ksettings(struct net_device *dev,
 		return 0;
 	}
 
-	spin_lock(&priv->lock);
 	rc = phy_ethtool_ksettings_set(phy, cmd);
-	spin_unlock(&priv->lock);
 
 	return rc;
 }
-- 
2.1.4

^ permalink raw reply related


This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox